It's SIMD-based at the lowest level, but GPUs also lean on very heavy hardware multithreading (the threads are called, AIUI, "wavefronts" on AMD and "warps" on NVIDIA) on each compute unit/streaming multiprocessor to hide memory access latency. Recent SPARC CPUs have 8-way hardware multithreading per core; GPUs can easily go even higher than that.
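To put rough numbers on the latency-hiding idea, here is a minimal back-of-the-envelope sketch. It assumes a simple round-robin model where each hardware thread does `work_cycles` of compute and then stalls for `latency_cycles` on memory; the function names and the 20/380 cycle figures are made up for illustration, not measurements from any real chip.

```python
import math

def utilization(n_threads, work_cycles, latency_cycles):
    """Fraction of cycles the core is busy in a simple round-robin model.

    Each thread alternates work_cycles of compute with latency_cycles of
    stalling; while one thread stalls, the others can issue their work.
    """
    return min(1.0, n_threads * work_cycles / (work_cycles + latency_cycles))

def threads_to_hide_latency(work_cycles, latency_cycles):
    """Smallest thread count that keeps the core fully busy in this model."""
    return math.ceil((work_cycles + latency_cycles) / work_cycles)

# Hypothetical numbers: 20 cycles of work per 380-cycle memory stall.
for n in (1, 8, 20):
    print(n, "threads ->", utilization(n, 20, 380))
print("threads needed:", threads_to_hide_latency(20, 380))
```

Under these assumed numbers, 8-way multithreading still leaves the core idle more than half the time, which is roughly why GPUs keep far more threads in flight per unit.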
Yep. This also reflects the GPU design target: GPUs are built for much larger working sets, so they have higher main-memory bandwidth and rely less on caches. CPUs would rather have a few fast threads of execution working on hot cached data than many slow ones talking to main memory (because N-way thread-level parallelism often splits your cache N ways, across N working sets).
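The cache-splitting point can be sketched the same way. This assumes the worst case where the N threads have disjoint working sets and the cache is divided evenly between them; real caches are shared dynamically, so treat it as a rough illustration only. The helper names and the sizes are hypothetical.

```python
def cache_per_thread_kib(cache_kib, n_threads):
    """Effective cache share per thread, assuming an even N-way split."""
    return cache_kib / n_threads

def fits_in_cache(working_set_kib, cache_kib, n_threads):
    """True if one thread's working set fits in its share of the cache."""
    return working_set_kib <= cache_per_thread_kib(cache_kib, n_threads)

# Hypothetical 32 KiB cache and a 16 KiB working set per thread.
for n in (1, 2, 8):
    print(n, "threads:", cache_per_thread_kib(32, n), "KiB each,",
          "fits" if fits_in_cache(16, 32, n) else "spills to memory")
```

With one or two threads the assumed working set stays hot in cache; at eight threads each one gets a quarter of what it needs and ends up going to main memory anyway.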