Except for Intel, which publishes lots of technical documentation on their GPUs: https://kiwitree.net/~lina/intel-gfx-docs/prm/
You can also find the i810/815 manuals elsewhere online, and apart from an odd gap between those and the 965 (i.e. the 855/910/915/945 are missing for some reason), they've been pretty consistent with the documentation.
It includes full ISA documentation for their current and past offerings, though the docs look like they tend to be aimed at implementors rather than at a "high level" description for interested enthusiasts.
There is almost no explanation about how they are intended to be used and about the detailed microarchitecture of their GPUs. For that, the best remains to read the source code of their Linux drivers, though even that is not always as informative as it could be, as some of the C code may have been automatically generated from some other form used internally by AMD.
The Intel documentation is much more complete.
Nevertheless, AMD has recently promised on several occasions to publish additional GPU documentation and to open-source more parts of their GPU drivers in the near future, so hopefully the quality of their documentation will one day again match Intel's, as it did until around 2000.
[The Thirty Million Line Problem - Casey Muratori](https://www.youtube.com/watch?v=kZRE7HIO3vk)
I know the terminology has gotten quite loose in recent years with Nvidia & Co. selling server-only variants of their graphics architectures as GPUs, but the "graphics" part of GPU designs makes up a significant part of the complexity, to this day.
That's not a good definition, since by it a CPU or a DSP would count as a GPU; both have been used for that purpose in the past.
> There's still use for GPUs even if they're not outputting anything.
The issue is not their existence, it's about calling them GPUs when they have no graphics functionality.
Here's another: https://github.com/jbush001/NyuziProcessor
What size run would be needed for TSMC or some other fab to produce such a processor economically?
If you abstract away the complexities and focus on building a system using some pre-built IP, FPGA design is pretty easy. I always point people to something like MATLAB, so they can create some initial applications using HDL Coder on a DevKit with a Reference design. Otherwise, there's the massive overhead of learning digital computing architecture, Verilog, timing, transceivers/IO, pin planning, Quartus/Vivado, simulation/verification, embedded systems, etc.
In short, start with some system-level design. Take some plug-and-play IP, learn how to hook it together at the top level, and insert that module into a prebuilt reference design. Eventually, peel back the layers to reveal the complexity underneath.
1. Read Harris, Harris → Digital Design and Computer Architecture. (2022). Elsevier. https://doi.org/10.1016/c2019-0-00213-0
2. Follow the author's RVFpga course to build an actual RISC-V CPU on an FPGA → https://www.youtube.com/watch?v=ePv3xD3ZmnY
I might add these:
- Computer Architecture, Fifth Edition: A Quantitative Approach - https://dl.acm.org/doi/book/10.5555/1999263
- Computer Organization and Design RISC-V Edition: The Hardware Software Interface - https://dl.acm.org/doi/10.5555/3153875
both by Patterson and Hennessy
Edit: And if you want to get into CPU design and can get a grip on "Advanced Computer Architecture: Parallelism, Scalability, Programmability" by Kai Hwang, then I'd recommend that too. It's super old and some things are probably done differently in newer CPUs, but it's exceptionally good for learning the fundamentals. Very well written. But I think it's hard to find a good (physical) copy.
1. Clone this educational repo https://github.com/yuri-panchul/basics-graphics-music - a set of simple labs for those learning Verilog from scratch. It's written by Yuri Panchul, who worked at Imagination developing GPUs, by the way. :)
2. Obtain one of the dozens of supported FPGA boards and some accessories (keys, LEDs, etc.).
3. Install Yosys and friends.
4. Perform as many labs from the repo as you can, starting from lab01 - DeMorgan.
You can exercise labs while reading Harris&Harris. Once done with the labs and with the book, it's time to start your own project. :)
PS: They have a weekly meetup at HackerMojo, you can participate by Zoom if you are not in the Valley.
1. https://learn.saylor.org/course/CS301
If it is an MCU based on a simple ARM Cortex-M0, M0+, M3 or a RISC-V RV32I core, then you could use an iCE40 or similar FPGA to provide a big acceleration just by using the DSPs and the big SPRAM.
Basically, add the custom compute operations and storage that don't exist in the MCU, operations that would otherwise take many instructions in SW. Also, offloading to the FPGA AI 'co-processor' frees up the MCU to do other things.
The kernel operations in the Tiny GPU project are actually really good examples of things you could efficiently implement in an iCE40UP FPGA device, resulting in substantial acceleration. And using EBRs (block RAM) and/or the SPRAM for block queues would make a nice interface to the MCU.
One could also implement a RISC-V core in the FPGA, thus getting a single chip with a low-latency interface to the AI accelerator. You could even implement the AI accelerator as a set of custom instructions. There are so many possible solutions!
An ice40UP-5K FPGA will set you back 9 EUR in single quantity.
This concept of course scales up to performance and cost levels you talk about. With many possible steps in between.
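To make the "custom compute operations" concrete, here's a behavioral sketch in Python (function name and example values are mine, not from the Tiny GPU project) of the fixed-point multiply-accumulate pattern that maps almost one-to-one onto the iCE40UP's 16x16 DSP blocks:

```python
def mac_dot(xs, ws, acc_bits=32):
    """Fixed-point dot product: the multiply-accumulate (MAC) loop that
    an iCE40UP DSP block can perform at one multiply per cycle."""
    mask = (1 << acc_bits) - 1
    acc = 0
    for x, w in zip(xs, ws):
        acc = (acc + x * w) & mask  # accumulator wraps like a 32-bit register
    return acc

# A 3x3 convolution at one pixel is just a 9-element MAC; here is a
# 4-element example: 1*5 + 2*6 + 3*7 + 4*8 = 70.
print(mac_dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

In hardware this loop would be a DSP block plus a small state machine, with EBR/SPRAM feeding the operands, which would be one such MCU-facing "custom operation."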
It could be as simple as using a CNN instead of a V matrix. Yes, this makes the architecture less efficient, but it also makes it easier for an accelerator to speed it up, since CNNs tend to be compute bound.
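A rough back-of-the-envelope for "compute bound" (illustrative numbers, not from any particular model): a conv layer reuses every weight at each spatial position, while a plain matrix-vector product reads each weight exactly once:

```python
def conv_macs_per_weight_byte(h, w, cin, cout, k=3, bytes_per_weight=4):
    """MACs per byte of weight traffic for a k x k conv over an h x w map:
    every weight is reused at all h*w positions."""
    macs = h * w * cout * cin * k * k
    weight_bytes = cout * cin * k * k * bytes_per_weight
    return macs / weight_bytes  # simplifies to h*w / bytes_per_weight

def matvec_macs_per_weight_byte(m, n, bytes_per_weight=4):
    """Matrix-vector product: each of the m*n weights is used once."""
    return (m * n) / (m * n * bytes_per_weight)  # always 1 / bytes_per_weight

# A 32x32 feature map gives 256 MACs per weight byte vs. 0.25 for matvec,
# which is why the conv ends up compute bound and the matvec memory bound.
print(conv_macs_per_weight_byte(32, 32, 16, 16))
print(matvec_macs_per_weight_byte(100, 100))
```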
Was? https://opencores.org/projects?language=VHDL. Or is that not the same but similar?
It's so easy to write "DIV: begin alu_out_reg <= rs / rt; end" in your Verilog, but that one line takes a lotta silicon. And the person simulating this might never see that, if all they do is simulate the Verilog.
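For a sense of scale, here's (as a Python behavioral model, not HDL) roughly what the synthesizer has to unroll for that one `/`: a restoring divider doing one compare-and-subtract stage per result bit, so 32 chained stages for a 32-bit combinational divide:

```python
def restoring_div(dividend, divisor, width=32):
    """Unsigned restoring division: one shift/compare/subtract per bit.
    A combinational '/' in Verilog becomes all `width` stages in silicon."""
    assert divisor != 0
    rem, quot = 0, 0
    for i in range(width - 1, -1, -1):
        rem = (rem << 1) | ((dividend >> i) & 1)  # shift in next dividend bit
        if rem >= divisor:                        # trial subtraction succeeds
            rem -= divisor
            quot |= 1 << i
    return quot, rem

print(restoring_div(100, 7))  # (14, 2), i.e. 100 = 7*14 + 2
```

Real designs usually make this a multi-cycle or pipelined unit instead, which is exactly the work the simulation-only view hides.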
The project stops at simulation; making real hardware out of this requires much more work.
More specifically [1] we could also call CPUs long / deep data dependency processors (LDDPs) and GPUs wide / flat data dependency processors (WDDPs).
[0]: https://en.wikipedia.org/wiki/Amdahl%27s_law
[1]: https://en.wikipedia.org/wiki/Data_dependency
But as you observe, we are stuck in a local optimum where GPUs are optimized for throughput and CPUs for latency sensitive work.
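For reference, Amdahl's law [0] in two lines, with p the parallelizable fraction of the work and s the speedup applied to that fraction:

```python
def amdahl_speedup(p, s):
    """Overall speedup when fraction p of the runtime is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Even with an effectively infinite 's' (a very wide GPU), a 5% serial
# remainder caps the whole program at about 20x:
print(round(amdahl_speedup(0.95, 1e9)))  # 20
```

Which is one way to see why the long/deep (latency) and wide/flat (throughput) machines both stay necessary.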
Tensors are just n-dimensional arrays
Then you can run software (firmware) on top of the TPU to make it behave like a GPU.
I'll be honest, I've never heard the AIA acronym used in this way. It seems all acronyms for all processors need to end in PU, for better or for worse.
What is it exactly that sets these units apart from CPUs? Something to do with the parallel nature of the hardware?
...I for my part want to say thanks for the findings! :-)
[Setting: weekend mode]
> In real GPUs, individual threads can branch to different PCs, causing branch divergence where a group of threads initially being processed together has to split out into separate execution.
Whoops. Maybe this person should try programming for a GPU before attempting to build one out of silicon.
Not to mention the whole SIMD that... isn't.
(This is the same person who stapled together other people's circuits to blink an LED and claimed to have built a CPU)
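For anyone wondering what real GPUs actually do on divergence: the warp keeps a single PC plus an execution mask, and runs both sides of the branch serially with the inactive lanes masked off. A toy model in Python (lane values and functions are made up for illustration):

```python
def simt_branch(lanes, cond, then_fn, else_fn):
    """Toy SIMT warp: all lanes share one PC, so a divergent if/else
    executes BOTH paths, each under a per-lane execution mask."""
    mask = [cond(x) for x in lanes]
    out = list(lanes)
    for i, x in enumerate(lanes):      # pass 1: 'then' side, others idle
        if mask[i]:
            out[i] = then_fn(x)
    for i, x in enumerate(lanes):      # pass 2: 'else' side, first group idles
        if not mask[i]:
            out[i] = else_fn(x)
    return out                         # lanes reconverge after both passes

# Four lanes diverge two/two, so the warp pays for both branch bodies:
print(simt_branch([1, 2, 3, 4], lambda x: x % 2 == 0,
                  lambda x: x * 10, lambda x: -x))
```

The cost is that the warp's runtime is the sum of both paths, which is exactly why divergence-free SIMD-style code runs so much better on GPUs.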
https://engineering.purdue.edu/~smidkiff/ece563/slides/GPU.p...