If you care about latency, a modern x86 with eight or more cores, with its per-core L1/L2 caches and a shared L3 that costs more to reach, is almost as complex. It gets more complex still if you use the CPU topology to make inferences about hyperthreading and shared caches, or need to deal with the shared FPU on older AMD processors (the Bulldozer family).
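To make the topology point concrete, here is a rough sketch of how you might work out which logical CPUs are hyperthread siblings on Linux. It assumes the kernel exposes `thread_siblings_list` under `/sys/devices/system/cpu/cpuN/topology/` and that the file uses plain `a-b,c` ranges. The function names are mine, purely for illustration.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8" into a sorted list of ints.

    Only handles plain values and a-b ranges (an assumption of this sketch).
    """
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_siblings(sibling_lists):
    """Turn {cpu: cpulist string} into a sorted list of distinct sibling groups."""
    groups = {tuple(parse_cpulist(s)) for s in sibling_lists.values()}
    return sorted(groups)


def read_thread_siblings(root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every CPU; empty dict if sysfs is absent."""
    result = {}
    for path in Path(root).glob("cpu[0-9]*/topology/thread_siblings_list"):
        cpu = int(path.parent.parent.name[3:])  # directory name is "cpuN"
        result[cpu] = path.read_text()
    return result


if __name__ == "__main__":
    # Each tuple is one physical core's set of logical CPUs.
    print(group_siblings(read_thread_siblings()))
```

Any group with more than one entry is a pair (or more) of logical CPUs competing for the same core's L1/L2, which is exactly the kind of thing you end up pinning threads around.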
My understanding is that the largest difference is that some of the Cell's cores used different opcodes, so certain threads could only be scheduled on certain cores, rather than any thread on any core.