Even worse; it's more like that plus extracting the raw microarchitectural state of a CPU, serializing it in a somewhat arbitrary way, trying to shove that blob into a different CPU and still expecting everything to continue running.
I'm not necessarily complaining, just pointing out this significant difference WRT software programs running on CPUs.
> There are upstream formats targeting FPGAs that can be shared, although yes redoing place and route is slow.
Can you show me an example? I'd like to see this. You do not mean FPGA overlays, correct?
> Should manufacturers provide new formats closer to final form yet would allow binaries that can be adjusted, kind of like .a .so or even llvm?
Like you say, at the very least you will need to re-do place and route. But actually the problem is much worse than this. Different FPGAs have different physical resources. Not just differing amounts of logic area, but different amounts of block RAM, different DSP blocks and in varying numbers, high-speed transceivers, etc. This necessitates making different design trade-offs. Simply shoehorning the same design into different FPGAs, even if it were kind of possible, will not work well.
> Alternatively, would building whole images for many families of FPGA make sense?
Currently I think that's the only real option. But the extreme overhead, duplication of effort and maintenance burden make it very unattractive.
My napkin sketch is some sort of generalized array of partial reconfiguration regions with standardized resources in each region. Accelerator applications can distribute versions targeting different numbers of regions (e.g. one version for FPGAs supporting up to 8 regions, one for FPGAs supporting up to 16 regions, etc.). The FPGA gets loaded with a bitstream supporting a PCIe endpoint and management engine, and some sort of crossbar between regions. At accelerator load time, previously mapped, placed, and routed logical regions used in the application are placed onto actual partial reconfiguration regions and connections between regions are routed appropriately. The idea is to pre-compute as much of the work as possible, leaving a lower dimension problem to solve for final implementation. Timing closure and clock management are left as exercises for the reader :P.