I've (badly) written code like you describe -- usually by abusing fork to do my initial data structure setup, then using C pipes to pass jobs around. I suspect there are much better ways of doing it, but that parallelised well enough for the stuff I was doing. I'd be interested to know if there are good libraries (or best practices) for doing this kind of parallelism.