So non-coherent refers to bypassing cache coherency, i.e. not caring about what other units might have written to that address? And the L1/L2 modifiers are there to avoid L1 thrashing, keeping the value in L2 only?
My understanding of the L2 part is that it asks for a 256B prefetch (only available on some platforms, it seems), but they use vectors of at most four 32-bit signed ints, so I'm not sure why only the 256 variant would work, or whether the fact that it also fetched the next 128 is what helps.
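For reference, the kind of load in question can be sketched roughly like this. It is a minimal sketch under assumptions, not the original code: `ld_nc_v4` and `copy_kernel` are hypothetical names, and as far as I can tell from the PTX docs the `.L2::256B` qualifier needs PTX ISA 7.4+ and an sm_80 or newer target (e.g. `nvcc -arch=sm_80`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (name is an assumption): one 16-byte load of four
// signed 32-bit ints through the non-coherent path (.nc), asking not to
// allocate the line in L1 and hinting a 256-byte prefetch into L2.
// The pointer must be 16-byte aligned and the data must not be written
// by anyone while the kernel is reading it.
__device__ __forceinline__ int4 ld_nc_v4(const int4* ptr) {
    int4 v;
    asm volatile(
        "ld.global.nc.L1::no_allocate.L2::256B.v4.s32 {%0, %1, %2, %3}, [%4];"
        : "=r"(v.x), "=r"(v.y), "=r"(v.z), "=r"(v.w)
        : "l"(__cvta_generic_to_global(ptr)));
    return v;
}

// Each thread copies one int4 (16 bytes), so 16 consecutive threads
// together cover one 256-byte span.
__global__ void copy_kernel(const int4* in, int4* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = ld_nc_v4(in + i);
}

int main() {
    const int n = 1024;
    int4 *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(int4));
    cudaMallocManaged(&out, n * sizeof(int4));
    for (int i = 0; i < n; ++i) in[i] = make_int4(i, i + 1, i + 2, i + 3);

    copy_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Verify that every element was copied unchanged.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (out[i].x != in[i].x || out[i].y != in[i].y ||
            out[i].z != in[i].z || out[i].w != in[i].w) {
            ++mismatches;
        }
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(in);
    cudaFree(out);
    return mismatches == 0 ? 0 : 1;
}
```

Note that each `.v4.s32` load is only 16 bytes, so the 256 in the qualifier is a prefetch hint for L2 rather than the size of what the instruction itself returns.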