Thanks for posting this here :)
One thing I forgot to mention in the article is that the RISC-V Vector Extension doesn't have to be used for long vectors. In theory you could set a vector length of 4 and, depending on the architecture, still get good performance. But in that case you'd also lose some of the power efficiency advantages I talk about...
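To make that concrete, here's a rough toy model in Python (not real RVV code). It assumes the simplest behaviour the spec allows for setting the vector length, i.e. the hardware grants min(requested, VLMAX). The names `vsetvl`, `vector_add`, and `cap` are made up for illustration, and VLMAX = 16 is an arbitrary choice. Each loop iteration stands in for one pass of vector instructions, so the iteration count is a crude proxy for how much fetch/decode work the core does.

```python
def vsetvl(avl, vlmax):
    """Toy model: grant min(requested length, hardware maximum).
    Real hardware may choose differently when avl is between vlmax and 2*vlmax."""
    return min(avl, vlmax)


def vector_add(a, b, vlmax, cap=None):
    """Strip-mined a + b. Returns (result, number of loop iterations).
    cap is a hypothetical knob: the largest length the program ever requests."""
    n = len(a)
    out = [0] * n
    i = 0
    iterations = 0
    while i < n:
        remaining = n - i
        request = remaining if cap is None else min(remaining, cap)
        vl = vsetvl(request, vlmax)
        # This inner loop stands in for a single vector add over vl elements.
        for j in range(i, i + vl):
            out[j] = a[j] + b[j]
        i += vl
        iterations += 1
    return out, iterations


a = list(range(64))
b = [10] * 64

# Let the hardware use its full (assumed) VLMAX of 16 elements.
long_result, long_iters = vector_add(a, b, vlmax=16)

# Same hardware, but the program only ever asks for 4 elements at a time.
short_result, short_iters = vector_add(a, b, vlmax=16, cap=4)

print(long_iters, short_iters)
```

Both runs give the same result, but the version capped at 4 goes around the loop four times as often for the same data. That extra instruction traffic is roughly where I'd expect the power efficiency advantage to shrink, although how much it matters depends on the actual implementation.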