One of my prior projects involved working with a lot of ex-FPGA developers. This is obviously a rather biased group, but I heard a lot of very negative feedback about FPGAs.
One comment that's telling: since the 90s, FPGAs were seen as the obvious "next big technology" for the HPC market... and then Nvidia came out and pushed CUDA hard, and now GPGPUs have cornered the market. FPGAs are still trying to make inroads (the article here mentions it), but my general sense is that success has not been forthcoming.
The issue with FPGAs is that you start with a clock rate in the 100s of MHz (the exact rate depends on how long the paths need to be), compared with a few GHz for GPUs and CPUs. Thus you need a 5× performance win from switching to an FPGA just to break even, and probably another 2× on top of that to motivate people to go through the pain of FPGA programming. Nvidia made GPGPU work by demonstrating performance gains large enough to make the cost of rewriting code worth it; FPGAs have yet to do that.
Edit: It's worth noting that the programming model has consistently been cited as the thing holding FPGAs back for the past 20 years. The success of GPGPU (which also required moving to a different programming model to achieve its gains), combined with the inability of the FPGA community to furnish the necessary magic programming model, suggests to me (and my FPGA-skeptic coworkers) that the programming model isn't the actual issue preventing FPGAs from succeeding; rather, FPGAs have structural issues (e.g., low clock speeds) that limit their utility in wider market classes.
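The break-even arithmetic above can be sketched in a few lines (the clock figures are illustrative placeholders, not measurements):

```python
# Back-of-the-envelope: how much work per cycle an FPGA design must do,
# relative to a CPU/GPU, just to match throughput at its lower clock.
# Clock figures are illustrative, not measured.
cpu_clock_hz = 2.0e9    # "a few GHz"
fpga_clock_hz = 400e6   # "100s of MHz" after place & route

break_even = cpu_clock_hz / fpga_clock_hz  # parallel speedup needed just to tie
worth_the_pain = 2 * break_even            # ~2x more to justify the rewrite

print(break_even)      # 5.0
print(worth_the_pain)  # 10.0
```

So an FPGA port has to extract roughly an order of magnitude more per-cycle parallelism than the CPU/GPU baseline before anyone bothers.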
However, some applications do not map well to GPUs. In particular, applications with a great deal of bit-level parallelism can achieve enormous speedups with bespoke hardware. For those applications where it doesn't make sense to tape out an ASIC, FPGAs are beautiful--even if they only operate at a few hundred MHz.
I think the "programming model" is actually the biggest barrier to wider adoption. Your comment is suffused with what I believe is the source of this disagreement: the idea that one programs an FPGA. One designs hardware that is implemented on an FPGA. The difference may sound pedantic, but it really is not. There is a massive difference between software programming and hardware design, and hardware design is downright unnatural for software developers. They are completely different skill sets.
On top of that, add all the headaches that come with implementing a physical device with physical constraints (the article complains about P&R times, but that is far from the only burden), and it becomes clear that FPGAs are, quite frankly, a massive pain in the ass compared to software running on CPUs or GPUs.
(Also, in general, FPGA tools are just some of the lowest quality garbage out there... and that is saying something. They're that bad. This is a completely unnecessary speedbump.)
The rebuttal to your objection is always tools like HLS (High-Level Synthesis), or in plain English, "C to HDL". (FPGAs are 'programmed' in one of the two hardware description languages: VHDL (bad) or Verilog (worse, but manageable if you learn VHDL first).) These are not programming languages; they are hardware description languages. That means things like "everything in a block always executes in parallel". (Take that, Erlang?) In fact, everything on the chip always executes in parallel, all the time, no exceptions; you "just" select which output is valid. That's because this is how hardware works.
This model maps very, very poorly onto traditional programming languages, which makes FPGAs hard to learn for engineers and hard to target for HLS tools. The tools can produce output decent enough to meet low- to mid-performance needs, but if you need high performance -- and if not, why are you going through this masochism? -- you're going to have to write some HDL yourself, which is hard and forces you onto the industry's worst tools.
Thus, FPGAs languish.
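A toy Python model of that "everything always executes in parallel" rule may make it concrete (the three-gate netlist here is hypothetical, and this is a sketch of the semantics, not a real simulator):

```python
# Toy model of combinational hardware: every gate computes its output on
# every evaluation pass, whether or not anything "needs" it; a mux at the
# end merely selects which always-computed result is visible.
def evaluate(netlist, signals):
    # Iterate to a fixed point, like delta cycles in an HDL simulator.
    changed = True
    while changed:
        changed = False
        for out, fn, ins in netlist:
            val = fn(*(signals[i] for i in ins))
            if signals.get(out) != val:
                signals[out] = val
                changed = True
    return signals

# out_add and out_and are both computed every pass; "sel" only picks one.
netlist = [
    ("out_add", lambda a, b: (a + b) & 0xFF, ("a", "b")),
    ("out_and", lambda a, b: a & b,          ("a", "b")),
    ("result",  lambda s, x, y: x if s else y, ("sel", "out_add", "out_and")),
]
sigs = evaluate(netlist, {"a": 12, "b": 10, "sel": 1})
print(sigs["result"])  # 22 (the AND result, 8, was still computed anyway)
```

Note there is no "skipping" the unselected branch: the hardware for every path exists and toggles every cycle, which is exactly the mental shift software developers stumble over.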
The tools, yes, because it seems like hardware engineers have a fetish for all-encompassing, painful, vendor-specific IDEs with half the features that we software developers have, and with a crapload of vendor lock-in... but I digress.
I find working in Verilog to be pretty pleasant. Yes, I can see that with sufficient complexity it wouldn't scale well, but SystemVerilog does give you some pretty good tools for managing complexity through modularity.
On the other hand, I've never particularly enjoyed working with GPUs, CUDA, etc.
So I would agree with your statement that structural issues prevent their utility in wider market classes -- and those really are, as you say, lower clock speeds and cost, but also vendor tooling.
FPGAs could really do with GCC/LLVM-style open, universal, modular tooling. I use fusesoc, which is about as close to that as I will get (a declarative build that generates the Vivado project behind the scenes), but it's still not perfect.
> it seems like hardware engineers have a fetish for all-encompassing painful vendor specific IDEs
Hardware engineers feel pain just like you do. The reason they put up with those awful software suites is that they have features they need that aren't available elsewhere. In particular, they interface with IP blocks and hard blocks, including at a debug + simulation level. Those tend to evolve quickly, and last time I looked -- which admittedly was a while ago -- the open source FPGA tooling pretty much completely ignored them, even though they're critical to commercial development.
If you are content to live without gigabit transceivers, PCIe controllers, DRAM controllers, embedded ARM cores, and so on, I suspect it would be relatively easy to use the open source tooling, but you would only be able to address a small fraction of FPGA applications.
The main challenge I had was compilation time. Compiling even a simple application can sometimes take overnight if there's a lot of nested looping, only to have it run out of gates. This can be a royal pain.
I'd expect most HPC scenarios to have lots of nested looping, and probably memory accesses, and thus to require a lot of time writing state machines to get around gate-count limitations and to wait for memory responses, at which point you're basically designing a 200 MHz CPU.
So I don't see it as being very useful for general purpose acceleration, but could be a good CPU offload for some very specific use cases that are more bit-banging than computing. Azure accelerates all its networking via FPGA, which seems like the ideal use case.
Verilog and VHDL have basically nothing in common with any language you've ever used.
Compilation can take multiple days. This means that debugging happens in simulation, at maybe 1/10000th of the desired speed of the circuit.
If you try to make something too big, it just plain won't fit. There is no graceful degradation in performance; an inefficient design will just not function, come Hell or high water.
The existing compilers will happily build you the wrong thing if you write something ill-defined. There are a ton of things expressible in a hardware description language that don't actually map onto a real circuit (at least not one that can be automatically derived). In a normal programming language, anything you can express is well-defined and can be compiled and executed. Not so in hardware.
Timing problems are a nightmare. Every single logic element acts like its own processor, writing directly into the registers of its neighbours, with no primitives for coordination. Imagine if you had to worry about race conditions inside of a single instruction!
Maybe if all these problems are solved FPGAs still wouldn't catch on, but let's not pretend the programming model isn't a problem. Hardware is fundamentally hard to design and the tooling is all 50 years out of date.
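That race-condition point is exactly why synchronous design uses a two-phase discipline: every register's next value is computed from the old state, then all registers commit at once. A toy illustration (everything here is a sketch, not real tooling):

```python
# Toy model of a clock edge. Updating registers in place, one at a time,
# is the "race condition inside a single instruction" a software mindset
# produces; real hardware samples all inputs from the OLD state first.
def clock_edge(state, next_fns):
    # Phase 1: evaluate every next-state function against the old state.
    # Phase 2: commit all updates simultaneously.
    return {reg: fn(state) for reg, fn in next_fns.items()}

# A 2-stage shift register: b samples a, c samples b, on every edge.
next_fns = {
    "a": lambda s: s["a"],   # held constant
    "b": lambda s: s["a"],
    "c": lambda s: s["b"],
}
state = {"a": 1, "b": 0, "c": 0}
state = clock_edge(state, next_fns)  # {'a': 1, 'b': 1, 'c': 0}
state = clock_edge(state, next_fns)  # {'a': 1, 'b': 1, 'c': 1}
print(state)
```

Had the updates been applied in place (b first, then c reading the new b), the 1 would have raced through both stages in a single edge, which is the class of bug timing closure exists to prevent.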
I'd argue FPGAs aren't programmed and don't have a programming model. Complaints that the programming model of FPGAs holds their adoption back are thus conceptually ill-founded. (The tooling still sucks).
The iCE40 series is almost there, but not quite. It's a bit pricey (sometimes that's okay, sometimes it's a dealbreaker), but its care and feeding is too annoying. Who wants to source a separate configuration memory? Sometimes I don't have the space for that crap.
If any company can bring a small, cheap, low power FPGA to the market, preferably with onboard non-volatile configuration memory, a microcontroller-like peripheral mix (UART, I2C, SPI, etc.), easy configuration (re)loading, and with good tool and dev board support, they'll sell a lot of units. They don't even have to be fast!
Their development environment is Eclipse-based, with numerous libraries for audio processing, interface management, DFU, etc. They use a variant of C (xc) that lets you send data between channels/tiles and easily parallelize processing.
An example use is voice assistants, where multiple microphones need to be analyzed simultaneously, echo and background noise have to be eliminated, and the speaker isolated into a single audio stream. I've used it for an audio processing product that needed to match hardware timers exactly, provide USB access, offer matched input and output, etc.
So, for FPGAs to be the next big thing in HPC, you'd need to find a class of workloads that benefit from the FPGA architecture, for long enough and at high enough volume to be worth the work of moving over, yet that are also unstable or low-volume enough that it's not worth making them their own chip.
For example, timing protocols on backbone equipment handling 100-400 Gbps. Depending on how it's configured, you may need to do different things. Additionally, you probably don't want to replace six-figure hardware every generation.
Another example is test equipment where you can't run the tests in parallel. A single piece of hardware can be far more portable and cost-effective.
There's one more big one: the ability to update the logic in the field.
It's so easy that it's quite common to see people pass work off to the FPGA when it involves some slightly heavier data processing, which is exactly how it should be.
https://github.com/xupgit/FPGA-Design-Flow-using-Vivado/tree...
https://www.xilinx.com/support/university.html
https://www.xilinx.com/video/hardware/getting-started-with-t...
There are others that cover the SDK side of things, but the HW side/Vivado is well documented.
If some FPGA company comes along, throws out conventional market wisdom (the old Henry Ford quote seems pertinent: "If I'd asked customers what they wanted, they would have said 'a faster horse'"), and makes an FPGA with software tools that are fast, non-buggy, and have good UI/UX, I think they would be able to steal significant market share. Early FPGA patents should be expiring by now...
I guess the one place where GPGPU-based solutions wouldn't work is when the code you want to accelerate is necessarily acting as some kind of Turing machine (i.e., emulation of some other architecture). However, I can't think of a situation where an FPGA programmed with the netlist for arch A, running alongside a CPU running arch B, would make more sense than just getting the arch-B CPU to emulate arch A -- unless, perhaps, the instructions in arch A are very, very CISC, perhaps with analogue components (e.g., RF logic, like a cellular baseband modem).
You saw correctly, work is indeed being done to build "shells" that can accept workloads without the user having to go through the FPGA tooling/build process.
So it's unlikely ever to gain broad acceptance, because software vendors would have to support such a high number of permutations and the return can be questionable. This is why you see far more accelerators based on ASICs, which have higher clock speeds and baked-in circuitry for specific tasks, with standardized APIs.
But sure, there's nothing preventing you from buying an FPGA board, hooking it up to your PC, creating a few images that do the accelerations you want, and writing software that uses them, swapping the image in when your program loads. You could even write a smart driver that swaps the image only if it's not in use by another app, or whatever. It's just unlikely you'll ever find a bunch of third-party software that supports it.
I could imagine Apple including something like this in the Apple Silicon SoC for ARM Macs.
The Afterburner card is not user-programmable, but maybe it will be in the future, and this was just a first try at getting the hardware into the field.
They are good at a lot of things at smaller scales, like general prototyping/testing/simulation, telecom, special-purpose real-time computing, etc.
The underlying logic is that FPGAs can never make things as flexible as software, and flexible software always offsets the inefficiency of a non-configurable chip. Simply comparing FPGAs against CPUs/GPUs will never teach FPGA vendors this reality; or perhaps they choose to ignore it after all...
- The first one is FPGA programming. Using OpenCL and HLS to design your own accelerators is now much easier than using VHDL/Verilog.
- The second one is FPGA deployment and integration. Until now it was very difficult to integrate your design with applications, to scale out efficiently, and to share it among multiple threads/users. The main reason was the lack of an OS layer (or abstraction layer) that would allow FPGAs to be treated like any other computing resource (CPU, GPU).
This is why at inaccel we developed a vendor-agnostic orchestrator for FPGAs. The orchestrator allows much easier integration, scaling, and resource sharing of FPGAs.
That way we have managed to decouple the FPGA designer from the software developer. The FPGA designer creates the bitstream, and the software developer just calls the function they want to accelerate. No need to specify the bitstream file, the interface, or the memory buffer allocation.
And the best part: it is vendor- and platform-agnostic. The FPGA designer creates multiple bitstreams for different platforms, and the software developer couldn't care less. The developer just calls the function, and the inaccel FPGA orchestrator magically configures the right FPGA for the right function.
Really? I'm assuming if this is true it can only be for tiny parts of the design, or they have some gigantic wafer-scale FPGA that they're not telling anyone about :-) Anyway I thought they mainly used software emulation to verify their designs.
1. It's not just a single FPGA but a large box full of them. For example: https://www.synopsys.com/verification/emulation/zebu-server....
2. Software models are employed for parts of the system (For example, the southbridge and all the peripherals connected to it are generally a software model which communicates with the hardware emulated portion in the FPGA via a PCIe model which is partly in hardware and partly in software.) This saves a lot of gates in the FPGA - those parts have already been well tested anyway so no need to put them into the hardware emulation.
- modern FPGAs are huge.
- when an asic design won't fit in a single FPGA, it's usually possible to partition the design into multiple FPGAs
- software emulation/simulation is not guaranteed to be "more accurate". FPGAs can interact with a real-world environment in ways that simulation simply cannot
- simulations run 1000s of times slower than FPGAs. Months of simulation time can be covered in minutes on the FPGA
Edit: to be clear, they all use simulation too, but FPGAs are used to accelerate the verification process
We had 10 such boards, good for millions of dollars in hardware, and a small team to keep it running.
These platforms were mostly used by the firmware team to develop everything before real silicon came back. They could run the full design at ~1-10 MHz, vs. 500+ MHz on silicon or ~10 kHz in simulation.
After running for a while, that FPGA platform crashed on a case where a FIFO in a memory controller overflowed.
Our VP of engineering said that finding this one bug was sufficient to justify the whole FPGA emulation investment.
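The clock ratios quoted above translate into dramatic wall-clock differences; a quick illustrative calculation (figures taken from the comment, low end of the FPGA range):

```python
# One second of wall-clock time on 500 MHz silicon, replayed at the
# emulation and simulation speeds mentioned above (all figures illustrative).
silicon_cycles = 500e6 * 1.0  # cycles covered in 1 s on real silicon

sim_hz = 10e3    # ~10 kHz software simulation
fpga_hz = 1e6    # ~1 MHz FPGA platform (low end of the 1-10 MHz range)

sim_hours = silicon_cycles / sim_hz / 3600
fpga_minutes = silicon_cycles / fpga_hz / 60

print(round(sim_hours, 1))     # 13.9 hours in simulation
print(round(fpga_minutes, 1))  # 8.3 minutes on the FPGA platform
```

That gap is why rare events like a FIFO overflow, which only show up after enormous cycle counts, are realistically reachable only on the FPGA platform.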
One of the nicer stories about the first ARM chip is that they built a software simulator to verify the design and as a result they found plenty of bugs in the hardware before committing to silicon. The first delivered chips worked right away.
Also, there are prototyping boards specifically built for emulation that integrate multiple FPGAs, although this does introduce a partitioning problem that has to be solved either manually or via dedicated emulator software.
First off, mapping an entire CPU to an FPGA cluster is a design challenge in itself. Assuming you can build an FPGA cluster large enough to hold your CPU, and reliable enough to get work done on it, you have the problem of partitioning your design across the FPGAs. Second problem: observability. In a simulator, you can probe anywhere trivially; with an FPGA cluster, you must route the probed signal to something you can observe. (I am not even going to talk about getting stimulus in and results out, since with an FPGA or a simulator you have that problem either way; it is just different mechanics.)
The big problem is that an FPGA models each signal with two states: 1 and 0. A logic simulator can use more states, in particular U, or "unknown". All latches should come up U, and getting out of reset (a non-trivial problem) is, to grossly oversimplify, a matter of "chasing the U's away". An FPGA model could, in theory, model signals with more than two states, but the model size grows quickly.
Source: Once upon a time I was pre-silicon validation manager for a CPU you have heard of, and maybe used. Once upon a time I was architect of a hardware-implemented logic simulator that used 192 states (not 2) to model the various vagaries of wired-net resolution. Once upon a time I watched several cube-neighbors wrestle with the FPGA model of another CPU you have heard of, and maybe used.
Note: What would 3 state truth tables look like, with states 0,1,U? 0 and 1 is 0. 0 and U is 0. 1 and U is U -- etc. You can work out the rest with that hint, I think.
Edit to add: Why are U's important? They uncover a large class of reset bugs and bus-clash bugs. I once worked on a mainframe CPU where we simulated the design using a two-state simulator. Most of the bugs in bring-up were getting out of reset. Once we could do load-add-store-jump, the rest just mostly worked. Reset bugs suck.
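Working out that hint, here is a minimal three-valued logic in Python (states 0, 1, and "U"); the rule is that a controlling input still decides the output, and otherwise U propagates:

```python
# Three-valued logic: 0, 1, and U (unknown). A controlling value (0 for
# AND, 1 for OR) decides the output regardless of the unknown; any other
# combination involving U stays unknown.
U = "U"

def and3(a, b):
    if a == 0 or b == 0:
        return 0          # 0 is controlling for AND, even against U
    if a == 1 and b == 1:
        return 1
    return U

def or3(a, b):
    if a == 1 or b == 1:
        return 1          # 1 is controlling for OR, even against U
    if a == 0 and b == 0:
        return 0
    return U

def not3(a):
    return U if a == U else 1 - a

print(and3(0, U))  # 0 -- the unknown is masked
print(and3(1, U))  # U -- the unknown propagates
print(or3(U, 1))   # 1
print(not3(U))     # U
```

This is why U's catch reset bugs: an uninitialized latch comes up U, and unless reset logic drives every path with a controlling value, the U propagates all the way to the outputs, where a two-state simulator would have silently picked 0 or 1 and hidden the bug.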
Indeed they do. And even if you have working chips, you get the next stage: board-level reset bugs. An MC68K board I helped develop didn't want to boot; a nasty side effect of a reset line that didn't stay at the right level long enough kept the CPU from resetting reliably when everything else did just fine. That took a while to debug.
That's a really narrow market. Telecom equipment and lab equipment, basically.
If I need volume, I need at least an ASIC. If I need to manage power, I need a full custom design.
Or you might imagine a chip with an FPGA on the side (I expected Intel to ship this after acquiring Altera, but it never happened). But the FPGA would somehow have to have access to the paths that caused the vulnerability, which is highly unlikely, and it would also be really slow compared to what they actually do, which is hacking around it with microcode changes.
They did: https://www.anandtech.com/show/12773/intel-shows-xeon-scalab...
But I get the sense this part was aimed at a few very specific customers. It required some PCB-level power delivery changes, so you couldn't even drop it into a standard server motherboard.
I don't think they are very popular, though. Maybe they are sometimes used for machine learning?
- Spark/k8s integration
- Abstraction of popular cores
- Python APIs
- Serverless deployments
- Etc.