That is awesome.
I've been using g++-4.6 lately and with the new lambdas I'm migrating more and more to this async task application model.
One point that seems frequently understated amid the lamentations about the absence of a true double-pointer-sized DCAS is that on machines with a 64-bit CAS (e.g. x86 and x64) you could still implement a useful CAS on pairs of 32-bit handles.
2^32 concurrent work items "ought to be enough for anybody", right? :-) Well, it might be useful to those of us with less than 100 GB of RAM.
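For what it's worth, here is a rough sketch of the pair-of-handles idea using C++11's std::atomic. The names HandlePair, pack, unpack and cas_pair are made up for illustration, and I'm assuming the handles are plain 32-bit indices into some table of work items rather than real pointers. It also assumes std::atomic<std::uint64_t> is lock-free on the target, which is worth checking with is_lock_free().

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical type: two 32-bit handles (e.g. indices into a work-item table).
struct HandlePair {
    std::uint32_t first;
    std::uint32_t second;
};

// Pack both handles into one 64-bit word: first in the high half, second in the low half.
inline std::uint64_t pack(HandlePair p) {
    return (static_cast<std::uint64_t>(p.first) << 32) | p.second;
}

// Split a 64-bit word back into its two handles.
inline HandlePair unpack(std::uint64_t v) {
    HandlePair p;
    p.first = static_cast<std::uint32_t>(v >> 32);
    p.second = static_cast<std::uint32_t>(v);
    return p;
}

// Atomically swap in `desired` only if the cell currently holds `expected`.
// Both handles are compared and replaced as a unit by a single 64-bit CAS.
inline bool cas_pair(std::atomic<std::uint64_t>& cell,
                     HandlePair expected, HandlePair desired) {
    std::uint64_t e = pack(expected);
    return cell.compare_exchange_strong(e, pack(desired));
}
```

The two handles have to live next to each other in the same 64-bit word, so this is a double-width CAS on one location, not a DCAS on two independent addresses.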