Well, let me rephrase. A GPU these days has two distinct roles: graphics processing and GPGPU. I'm less interested in the graphics part (that pipeline could be studied in software, and in hardware it's very specialized/ASICy).
So I'm really interested in the massively-parallel GPGPU aspect of a GPU.
These kinds of projects always take you up to where CPUs were in the early 1950s on Large Systems or the 1970s in the home computing world: Single-issue processors with no memory protection or privilege levels. They work, in that you can write useful software for those systems, but taken as a way to explain how CPUs work in a holistic fashion they fall well short. They simply aren't complex enough to explain why Meltdown happens, for example: Since there is no concept of privilege to begin with, you can't use them to explain privilege level violations. More prosaically, you can't explain a "cold cache" when the processor doesn't have a cache which can be cold.
This is demoralizing for the poor sods who think they're going to learn how CPUs work and end up with a CPU design which is decades out of date and no way to extend it to even a thirty-year-old design. "You can't get there from here" is the bane of tutorials which explain the basics and then stop.
Meltdown happens because of memory and cache shenanigans. It's more about how things can go wrong if you optimize too hard. While important, it feels like a different line of thinking.
Moving to an FPGA-based CPU might be the next step. There are soft CPUs with cache, MMU, etc. (https://github.com/SpinalHDL/VexRiscv for example)
1. Output framebuffer.
2. Polygon rasterization (often limited to points and triangles).
3. Texture sampling. This is accessible in CUDA (I have ~zero experience with other GPGPU systems).
4. Blending too, afaik, though this might have stopped by now.
5. Video codecs (MPEG-2, H.263, H.264, H.265, VP8, VP9, soon AV1) decoding, and also encoding for some of them.
Nvidia RTX cards also include ray tracing hardware that handles that task more efficiently (I presume by using fixed logic for dispatching memory/cache-aware computations, e.g. content-addressable memory and such).
Most things are handled by the shader cores. They are 1024-bit SIMD with lane masking until Volta, and a more flexible/arbitrary fork/join since Turing (not all Turing has the ray tracing hardware), which also brought a scalar execution port with it (like amd64 getting the traditional RAX/RDX/etc. with their opcodes after only having AVX instructions). AMD GCN afaik has a quite explicit SIMD architecture, with a scalar execution port since inception. Also 1024-bit iirc.
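For anyone who hasn't seen lane masking before, here is a minimal Python sketch of the idea, not a description of how any particular GPU implements it. It assumes 32 lanes of 32-bit values (which gives the 1024 bits mentioned above); the name `simd_if_else` and the example branch bodies are made up for illustration. Both sides of a branch are issued across the whole vector, and a per-lane mask decides which lanes commit a result on each pass.

```python
WARP_LANES = 32  # assumption: 32 lanes x 32 bits = 1024 bits


def simd_if_else(values, predicate, then_op, else_op):
    """Run a divergent if/else over every lane using an execution mask."""
    mask = [bool(predicate(v)) for v in values]  # one mask bit per lane
    result = list(values)

    # Pass 1: the "then" side is issued to all lanes, but only lanes
    # whose mask bit is set commit a result.
    for lane, value in enumerate(values):
        if mask[lane]:
            result[lane] = then_op(value)

    # Pass 2: the mask is inverted and the "else" side is issued.
    for lane, value in enumerate(values):
        if not mask[lane]:
            result[lane] = else_op(value)

    return result, mask


warp = list(range(WARP_LANES))
out, mask = simd_if_else(
    warp,
    predicate=lambda v: v % 2 == 0,  # even lanes take the "then" side
    then_op=lambda v: v * 10,
    else_op=lambda v: v + 1,
)
```

The point of the sketch is the cost model: a branch that diverges inside one vector pays for both sides, because every lane sits through both passes.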
Thanks
It really does do all that with just the 74172s. The 74172 is a register file containing 8x 2-bit words, with multiple ports for reading and writing, which are split up into a couple of sections.
Section 1 has independent read and write ports. The write port consists of data input DA[1..0], address AA[2..0] and write enable ~WEA. If ~WEA is low, data is written from DA to the register selected by AA on the positive edge of the clock. The read port consists of data output QB[1..0], address AB[2..0], and read enable ~REB. When ~REB is low, the contents of the register selected by AB are output on QB.
Section 2 has another set of read and write ports, but this time with a common address. The write port's data input is DC[1..0], the read port's data output is QC[1..0], the shared address is AC[2..0], and the read and write enables are ~REC and ~WEC.
In the PISC, there are eight of these chips with all their control lines tied together, so you get a single 8x16 register file with all of the features described above.
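If it helps, the bit-slicing can be sketched as a small behavioural model in Python. This is only an illustration under assumptions: the class and parameter names are made up, a disabled output is modelled as `None`, and section 2's write is assumed to commit on the same clock edge as section 1's (only section 1's timing is stated above).

```python
class Slice74172:
    """One 8-word x 2-bit register file slice (behavioural sketch)."""

    def __init__(self):
        self.regs = [0] * 8

    def read_sec1(self, addr, re_n):
        # Section 1 read port: drives its output only while the
        # active-low read enable is 0.
        return self.regs[addr & 7] if re_n == 0 else None

    def read_sec2(self, addr, re_n):
        # Section 2 read port, using section 2's shared address.
        return self.regs[addr & 7] if re_n == 0 else None

    def clock(self, s1_data=0, s1_addr=0, s1_we_n=1,
              s2_data=0, s2_addr=0, s2_we_n=1):
        # Positive clock edge: commit whichever writes are enabled.
        if s1_we_n == 0:
            self.regs[s1_addr & 7] = s1_data & 0b11
        if s2_we_n == 0:
            self.regs[s2_addr & 7] = s2_data & 0b11


class RegFile8x16:
    """Eight slices with shared control lines: 8 registers x 16 bits."""

    def __init__(self):
        self.slices = [Slice74172() for _ in range(8)]

    def write(self, addr, value):
        # Slice i stores bits 2i and 2i+1 of the 16-bit word.
        for i, chip in enumerate(self.slices):
            chip.clock(s1_data=(value >> (2 * i)) & 0b11,
                       s1_addr=addr, s1_we_n=0)

    def read(self, addr):
        word = 0
        for i, chip in enumerate(self.slices):
            word |= chip.read_sec1(addr, re_n=0) << (2 * i)
        return word


rf = RegFile8x16()
rf.write(3, 0xBEEF)
value = rf.read(3)
```

Because every chip sees the same address and enable lines, the eight 2-bit slices behave as one 16-bit register at each address.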
> In a single clock cycle, the following occurs:
> a) one register is output to the Address bus and the ALU's A input;
...using section 1's read port.
> b1) another register may be output to the Data bus and the ALU's B input; or
...using section 2's read port.
> b2) data from memory may be input to another register;
...using section 2's write port.
> c) an ALU function is applied to A (and perhaps B) and the result is stored in the first (address) register.
...using section 1's write port.
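Putting a), b1)/b2) and c) together, one cycle can be sketched in Python roughly like this. It is a simplified model, not the PISC's actual logic: the function name `pisc_cycle`, the dict used for memory, and the choice to feed the data bus into the ALU's B input in the b2 case are all assumptions made for illustration.

```python
MASK16 = 0xFFFF


def pisc_cycle(regs, mem, a_sel, alu, b_sel=None, load_sel=None):
    """Model one clock cycle; regs is a list of eight 16-bit values."""
    if b_sel is not None and load_sel is not None:
        raise ValueError("b1 and b2 are alternatives; pick one")

    # (a) section 1 read port: register -> address bus and ALU A input
    address_bus = regs[a_sel]

    if b_sel is not None:
        # (b1) section 2 read port: register -> data bus and ALU B input
        data_bus = regs[b_sel]
    elif load_sel is not None:
        # (b2) memory drives the data bus (assumption: read at address_bus)
        data_bus = mem.get(address_bus, 0)
    else:
        data_bus = 0

    # (c) ALU function of A (and perhaps B)
    result = alu(address_bus, data_bus) & MASK16

    # Clock edge: all reads above saw the old register contents.
    if load_sel is not None:
        regs[load_sel] = data_bus & MASK16  # section 2 write port
    # Section 1 write port; in this sketch it wins if load_sel == a_sel.
    regs[a_sel] = result
    return address_bus, data_bus


regs = [0] * 8
mem = {0x100: 42}

# Load through a pointer and post-increment it, in one cycle (a, b2, c).
regs[1] = 0x100
pisc_cycle(regs, mem, a_sel=1, alu=lambda a, b: a + 1, load_sel=2)

# Register-to-register add in one cycle (a, b1, c).
regs[3], regs[4] = 5, 7
pisc_cycle(regs, mem, a_sel=3, alu=lambda a, b: a + b, b_sel=4)
```

The first call shows why the address register is also the ALU destination: fetching through a pointer and bumping that pointer happen in the same cycle.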
Or the Magic-1. Similar total count of 74x chips (~200), but he ported Minix 2 to it. http://www.homebrewcpu.com/