Most of the bandwidth comes from cache hits, but for those rare workloads larger than the caches, Apples products may be 2-8x faster?
However, Strix Halo, which has a much bigger GPU, is designed for a maximum power consumption for CPU+GPU of 55 W or more (up to 120 W), while Lunar Lake is designed for 17 W, which explains the choices for the memory interfaces.
On the other hand, I do think it's fair to compare to the Max too, and it loses by a lot to that 512 bit bus.
Not 8000 or 8500 MT/s and thus the frequency is halved?
However that should have been obvious, because there will be no LPDDR5x of 16000 MT/s. That throughput might be reached in the future, but a different memory standard will be needed, perhaps a derivative of the MRDIMMs that are beginning to be used in servers (with multiplexed ranks).
MHz and MT/s is really the same unit. What differs is the quantity that is measured, e.g. the frequency of oscillations and the frequency of transfers. I do not agree with the method of giving multiple names to a unit of measurement in order to suggest what kind of quantity has been measured. The number of different quantities that can be measured with the same unit is very large. If the method of giving multiple unit names had been applied consistently, there would have been a huge number of unit names. I believe that the right way is to use a unique unit name, but to always specify separately what kind of quantity had been measured, because having only a numeric value and the unit is never sufficient information, without having also which quantity has been measured.
And there's very little benefit to widening the memory bus past 128-bit unless you have a powerful GPU to make good use of that bandwidth. There are comparatively few consumer workloads for CPUs that are sufficiently bandwidth-hungry.
200+ Gb/sec (IIRC Anandtech measured 240 Gb/sec) per a cluster.
Pro is one cluster, Max is two clusters and Ultra is four clusters, so the accumulative bandwidth is 200, 400 and 800 Gb/sec respectively[*].
The bandwidth is also shared with GPU and NPU cores within the same cluster, so on combined loads it is plausible that the memory bus may become fully saturated.
[*] Starting with M3, Apple has pivoted Pro models into more Pro and less Pro versions that have differing memory bus widths.
I suspect consumer workloads to rise
Of course there are enthusiasts, but I suspect that they prefer and will continue to prefer dedicated inference hardware.
At those price levels, PC laptops have discrete GPUs with their own RAM with 256 GB/s and up just for the GPU.
95GB/s is 24GB/s per core, at 4.8Ghz that's 40 bits per core per cycle. You would have to be doing basically nothing useful with the data to be able to get through that much bandwidth.
The resemblance to the behavior of GPUs is not a coincidence, but it is because the GPUs are also mostly doing array operations.
So the general rule is that the programs dominated by array operations are sensitive mostly to the memory throughput.
This can be seen in the different effect of the memory bandwidth on the SPECint and SPECfp benchmark results, where the SPECfp results are usually greatly improved when memory with a higher throughput is used, unlike the SPECint results.
This processor has two lanes over 4 P-cores. Something like an EPYC-9754 has 12 lanes over 128 cores.
Modern chips do face the memory wall. See eg here (though about Zen 5) where they in the same vein conclude "A loop that streams data from memory must do at least 340 AVX512 instructions for every 512-bit load from memory to not bottleneck on memory bandwidth."
Any number of cores with any count and any width of SIMD functional units can reach the maximum throughput, as long as it can be ensured that the data can be found in the L1 cache memories at the right time.
So the limitations on the number of cores and/or SIMD width and count are completely determined by whether in the applications of interest it is possible to bring the data from the main memory to the L1 cache memories at the right times, or not.
This is what must be analyzed in discussions about such limits.
The right way to express a matrix multiplication is not that wrongly taught in schools, with scalar products of vectors, but as a sum of tensor products between the column vectors of the first matrix with those row vectors of the second matrix that share with them the same position of the element on the main diagonal of the matrix.
Computing a tensor product of two vectors, with the result accumulated in registers, requires a number of memory loads equal to the sum of the lengths of the vectors, but a number of FMA operations equal to the product of the lengths (i.e. for square matrices of size NxN, there are 2N loads and N^2 FMA for one tensor product, which multiplied with N tensor products give 2N^2 loads and N^3 FMA operations for the matrix multiplication).
Whenever the lengths of both vectors are are no less than 2 and at least one length is no less than 3, the product is greater than the sum. With greater vector lengths, the ratio between product and sum grows very quickly, so when the CPU has enough registers to hold the partial sum, the ratio between the counts of FMA operations and of memory loads can be very great.
I bought a wifi7 card & tried it in a bunch of non-Intel systems, straight up didn't work. Bought a wifi6 card and it sort of works, ish, but I have to reload the wifi module and sometimes it just dies. (And no these are not cnvio parts).
I think Intel has a great amazing legacy & does super things. Usually their driver support is amazing. But these wifi cards have been utterly enraging & far below what's acceptable in the PC world; they are not fit to be called PCIe devices.
Something about wifi really brings out the worst in companies. :/
https://github.com/intel/linux-npu-driver/blob/main/docs/ove...
The Linux driver is specific to Linux, but the software on top of that like oneAPI and OpenVINO are cross platform I think.
Can someone put this in context? The values seem order of magnitude higher than here: https://www.anandtech.com/show/16143/insights-into-ddr5-subt...
The anandtech article is latencies for parts of a memory operation, between the memory controller and the ram. End to end latency is going to be a lot more than just CAS latency, because CAS latency only applies once you've got the proper row open, etc.
You could read the article on the latest AMD top of the line desktop chip to compare: https://chipsandcheese.com/2024/08/14/amds-ryzen-9950x-zen-5... (although that's a desktop chip, the original article compares the Intel performance to 128 ns of DRAM latency for AMD's mobile platform Strix Point)
The way CAS has been widely understood as "memory latency" is just wrong.
And once more than like three Zen 5 laptops come out.
I’d expect lunar lake’s position to improve a bit in coming months as they tweak scheduling, but AMD should be good at this point.
Edit: around 16 mark, https://youtu.be/5OGogMfH5pU?si=ILhVwWFEJlcA3HLO. The laptops came with 24H2
The correct solution is that from the parent article, to continue to call the L1 cache memory as the L1 cache memory, because there is no important difference between it and the L1 cache memories of the previous CPUs, and to call the new cache memory that has been inserted between the L1 and L2 cache memories as the L1.5 cache memory.
Perhaps Intel did this to give the very wrong impression that the new CPUs have a bigger L1 cache memory than the old CPUs. To believe this would be incorrect, because the so called new L1 cache has a much lower throughput and a worse latency than a true L1 cache memory of any other CPU.
The new L1.5 is not a replacement for an L1 cache, but it functions as a part of the L2 cache memory, with identical throughput as the L2 cache, but with a lower latency. As explained in the article, this has been necessary to allow Intel to expand the L2 cache to 2.5 MB in Lunar Lake and to 3 MB in Arrow Lake S (desktop CPU), in comparison with AMD, which has an only 1 MB L2 cache (but a bigger L3 cache).
According to rumors, while the top AMD desktop CPUs without stacked cache memory have an 80 MB L2+L3 cache (16 MB L2 + 64 MB L3), the top Intel model 285K might have 78 MB of cache, i.e. about the same amount, but with a different distribution on levels: 2 MB L1.5 + 40 MB L2 + 36 MB L3. Nevertheless, until now there is no official information from Intel about Arrow Lake S, whose launch is expected in a month from now, so the amount of L3 cache is not certain, only the amounts of L2 and L1.5 are known from earlier Intel presentations.
Lunar Lake is an excellent design for all applications where adequate cooling is impossible, i.e. thin and light notebooks and tablets or fanless small computers.
Nevertheless, Intel could not abstain from not using unfair marketing tactics. Almost all the benchmarks presented by Intel at the launch of Lunar Lake have been based on the top model 288V. Both top models 288V and 268V are likely to be unobtainium for most computer models, while at the few manufacturers that will offer this option they will be extremely overpriced.
Most available and affordable computers with Lunar Lake will not offer any better CPU than 258V, which is the one tested in the parent article. 258V has only 4.8 GHz/2.2 GHz turbo/base clock frequencies, vs. 5.1 GHz/3.3 GHz of the 288V used in the Intel benchmarks and in many other online benchmarks. So the actual experience of most Lunar Lake users will not match most published benchmarks, even if it will be good enough in comparison with any competitors in the same low-power market segment.
Q: What do you call Windows with its UI translated to Hebrew? A: The L10N of Judah