Indeed, we ran a webmail service with ~2M users as a CGI written in C++ ~20 years ago. We addressed latency aggressively by statically linking and never explicitly freeing memory unless we really had to: the processes were short-lived, so it was better to let the OS dispose of everything at once.
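To make the "never free" part concrete, here is roughly the shape of the idea. This is a minimal sketch and not our actual code: `Arena`, `kChunkSize`, and `handle_request` are names made up for illustration, and the 64 KiB chunk size is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical chunk size; any value comfortably above a typical request works.
constexpr std::size_t kChunkSize = 64 * 1024;

// Hypothetical bump allocator: hands out memory and never gives it back.
// There is deliberately no destructor and no free(); the OS reclaims
// everything at once when the short-lived CGI process exits.
class Arena {
public:
    void* alloc(std::size_t n) {
        // Round the size up so every returned pointer stays suitably aligned.
        const std::size_t a = alignof(std::max_align_t);
        n = (n + a - 1) / a * a;
        if (n > left_) {
            // Start a new chunk; the tail of the old one is simply abandoned.
            const std::size_t chunk = n > kChunkSize ? n : kChunkSize;
            cur_ = static_cast<char*>(std::malloc(chunk));
            if (cur_ == nullptr) std::abort();
            left_ = chunk;
        }
        void* p = cur_;
        cur_ += n;
        left_ -= n;
        return p;
    }

    // Copy a NUL-terminated string into the arena.
    char* dup(const char* s) {
        assert(s != nullptr);
        const std::size_t len = std::strlen(s);
        char* p = static_cast<char*>(alloc(len + 1));
        std::memcpy(p, s, len + 1);
        return p;
    }

private:
    char* cur_ = nullptr;
    std::size_t left_ = 0;
};

// Hypothetical request handler: builds a CGI response (headers, a blank
// line, then the body) from the query string. A missing query is treated
// as empty.
std::string handle_request(Arena& arena, const char* query) {
    const char* q = arena.dup(query ? query : "");
    std::string out = "Content-Type: text/plain\r\n\r\n";
    out += "query=";
    out += q;
    out += "\n";
    return out;
}
```

A CGI `main` would call `handle_request` with `std::getenv("QUERY_STRING")`, write the result to stdout, and return without any cleanup. Linking with `-static` then takes the dynamic loader out of the startup path as well.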
The process overhead was not a big deal even on the hardware of 20 years ago, and it saved us from dealing with all kinds of awful isolation issues. We discussed FastCGI or the like and dismissed it because the latency savings were much smaller than one might expect, exactly for the reasons you mention: the problem was much less the process creation overhead than the overhead of dynamic runtimes.
People also seem to have forgotten what was expected back then. The time it takes to load Gmail, for example, would have been totally unacceptable. Our biggest latency bottleneck was not the web server or CGI but the mail storage backends, so that is where we spent our optimization effort.