Just compare this with pushing a computation through an operation-annotated DAG: a good serial program computes a topological order and annotates each node with its result. This is efficient, because each node is considered as little as possible (once for its own computation, once for each successor), but it requires pre-computing the order and allows only a single thread to do the work.
The program with the least possible parallel runtime just assigns a compute unit to every node, and during each parallel step each unit computes its value as soon as all its predecessors are done. This requires synchronization and performs redundant readiness checks, but overall it scales pretty well with more units.
I think that is a neat coincidence :)
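The two strategies can be sketched side by side. This is a minimal toy model, not anyone's actual implementation: the example DAG, node names, and `op` signatures are all made up, and the "parallel" version only simulates the round-based schedule serially to show that its step count equals the depth of the DAG rather than its size.

```python
from graphlib import TopologicalSorter

# Toy DAG: node -> (operation, list of predecessor nodes)
dag = {
    "a": (lambda: 1, []),
    "b": (lambda: 2, []),
    "c": (lambda x, y: x + y, ["a", "b"]),
    "d": (lambda x: x * 10, ["c"]),
}

def serial_eval(dag):
    """Serial strategy: compute a topological order up front, then
    visit each node exactly once, annotating it with its result."""
    order = TopologicalSorter(
        {n: deps for n, (_, deps) in dag.items()}
    ).static_order()
    results = {}
    for n in order:
        op, deps = dag[n]
        results[n] = op(*(results[d] for d in deps))
    return results

def parallel_eval(dag):
    """Round-based strategy: in each 'parallel step', every node whose
    predecessors are all finished computes its value.  Each round scans
    all nodes (the redundant work), but the number of rounds is only
    the depth of the DAG."""
    results, steps = {}, 0
    while len(results) < len(dag):
        ready = [n for n, (_, deps) in dag.items()
                 if n not in results and all(d in results for d in deps)]
        for n in ready:
            op, deps = dag[n]
            results[n] = op(*(results[d] for d in deps))
        steps += 1
    return results, steps

values = serial_eval(dag)
par_values, steps = parallel_eval(dag)
# Both strategies agree on the results; the round-based version
# finishes this 3-level DAG in 3 parallel steps.
```

With real threads you would replace the round loop with per-node workers blocking on their predecessors, which is where the synchronization cost comes in.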
In some situations (servers) we just don't care how long this part of the boot takes: my SMTP/HTTP/SSH server in the basement has been up 155 days. I think it only goes down when I lose (residential) power.
I'd rather have a correct startup than a fast, subtly incorrect startup. I expect that this would introduce a lot of concurrency-related bugs, and make dinking with startup scripts into something of a Black Art.