Interestingly, in a modern environment (unlike in the 80s-00s), both of these seem like choices that could be made either way.
Re #1 — all the major OS manufacturers (Microsoft, Apple, Google) now have their own flagship devices running their OS, and could standardize their app-store certification processes to involve perf testing on their own hardware if they wanted. Especially for the mobile hardware: there's no reason the QA process for e.g. releasing a game on iOS can't involve certifying zero stutter on a given iPhone, and then restricting the game to only be playable on that iPhone or newer (the way consoles effectively work right now.)
Re #2 — there's no reason, other than inertia, that personal computer (including mobile) OSes still put every single task on the system into one big (usually oversubscribed, from a realtime perspective) scheduling bag. We could—using the hardware hypervisor built into all modern architectures—split PC OSes into two virtual machines: one for the foreground "app", and one for everything else; and give the foreground-app VM fixed minimum allocations of all the computer's resources. Then you could make real guarantees† about an app or game's performance, as long as it was said foreground app. It'd be very similar to the way some game consoles (e.g. the Wii) have separate "application processors" and "OS service processors", to ensure nothing steals time from the application.
† And those guarantees would also be requirements: if the foreground app says it needs its VM to have 5GB of RAM, you literally won't be able to run it if you don't have 5GB of free physical memory to hand it (though the OS would probably first try to OOM-kill some sleeping apps to give it that memory, like iOS does.) Much clearer than the current "this game will be really slow if your computer is more than four years old, but it's not exactly clear which part of the computer is below-spec" we have today.
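The "guarantees are also requirements" idea in the footnote can be sketched as an admission check. This is only an illustration under stated assumptions: `SleepingApp`, `admit_foreground`, and the largest-first eviction order are all made up here, and no real OS API is being described.

```python
from dataclasses import dataclass


@dataclass
class SleepingApp:
    name: str
    resident_mb: int


def admit_foreground(required_mb, free_mb, sleeping):
    """Plan a launch; return (admitted, names_of_apps_to_evict).

    Assumed policy: evict sleeping apps largest-first until the fixed
    reservation fits. If it cannot fit even after evicting all of them,
    refuse the launch and evict nothing.
    """
    evicted = []
    for app in sorted(sleeping, key=lambda a: a.resident_mb, reverse=True):
        if free_mb >= required_mb:
            break
        free_mb += app.resident_mb
        evicted.append(app.name)
    if free_mb >= required_mb:
        return True, evicted
    return False, []


apps = [SleepingApp("mail", 1500), SleepingApp("music", 800), SleepingApp("notes", 200)]
print(admit_foreground(5120, 3000, apps))  # fits after evicting two apps
print(admit_foreground(5120, 1000, apps))  # cannot fit: hard refusal
```

The point of the hard refusal is that the failure is explicit and names the missing resource, instead of the app starting and then running slowly.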
To guarantee that, testing has to go through _all_ possible game states. For almost any non-trivial game, that's infeasible.
"and then restricting the game to only be playable on that iPhone or newer"
There is no guarantee that newer hardware would be faster for every possible program execution, and even if it were, timing differences could affect game play.
There also is no guarantee that newer hardware produces the exact same results. For example, better anti-aliasing or fonts drawn at double resolution could affect hit detection.
This isn't even guaranteed on the 'same' hardware. For example, there might be C64s that don't have the bugs that this demo exploits.
The C64 demoscene is at this point a pure case of magicians developing tricks for other magicians. Compare to 20 years ago, when the tricks were made to wow users familiar with the platform and its limitations.
Now someone without intimate knowledge of the C64 would not understand what the hard part of this demo part is. Letters scroll up and sometimes expand or shrink a bit. We've seen lots of different scroll text parts. Is this hard?
Consider the brute force approach. The text scroll area is 192 pixels by 200 pixels. If this was just a bitmap, that is 4800 bytes to update. Pure unrolled code to move bytes would do
LDA src,x
STA dest,x
for a minimum of 8 cycles per byte (extra if 256-byte page boundaries are crossed, and to update the x index, and do the logic to pick out which letter to draw), or 38,400 cycles to update that bitmap. But there are just 19,656 cycles free per frame! The copy alone takes nearly two full frames, so once that overhead is added the best a brute force approach would get is one update per 3 frames, or 17fps. So all the cleverness is in getting the machine to do something at 50fps that naively it could do at best at 17fps. This is done by racing the display raster and playing tricks with the hardware bugs in the CPU/video chip interface.
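The arithmetic can be checked with a few lines of Python. This is a rough sketch, assuming PAL timing (312 raster lines of 63 cycles each, which is where 19,656 comes from) and ignoring cycles stolen by the video chip. The function name is made up, and the 9-cycle case is an added assumption based on standard 6502 timings for absolute,X addressing (4-cycle load, 5-cycle store).

```python
import math

CYCLES_PER_FRAME = 312 * 63   # PAL C64: 19,656 cycles per frame
FRAME_RATE_HZ = 50


def copy_budget(width_px, height_px, cycles_per_byte):
    """Return (bytes, cycles, whole frames per update, updates per second)."""
    n_bytes = width_px * height_px // 8      # 1 bit per pixel
    cycles = n_bytes * cycles_per_byte
    frames = math.ceil(cycles / CYCLES_PER_FRAME)
    return n_bytes, cycles, frames, FRAME_RATE_HZ / frames


print(copy_budget(192, 200, 8))   # the 8-cycle figure quoted above, zero overhead
print(copy_budget(192, 200, 9))   # assumed: LDA abs,X (4) + STA abs,X (5)
```

At exactly 8 cycles per byte and no overhead at all, the copy just barely fits in two frames. Any extra cost per byte pushes it to three, which matches the 17fps estimate.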
"Make the trick more trouble than it's worth."
I have read that article many times; it's a wonderfully illuminating piece of writing.
Maybe it's that one form is being deceptive to entertain, the other to steal, but they have a lot of commonality.
So much effort was lost because of the communications barriers.
The Atari 2600's TIA had lots of sharp corners caused by the reliance on polynomial counters, which saved silicon but made for lots of seemingly-random edge cases (after all, polynomial counters are also used to generate pseudo-random noise!)
You just described most of the new effects in C64 demos over the last 30+ years...
But, yes, this is one of the nuttier.
When I was a kid and trying to do demos, the "simple" stuff like tricking the VIC to keep the borders open was still amongst the more exciting things you could do (opening the top and bottom border is trivial once you know how; opening the left/right border was harder - especially if moving sprites in the Y-direction as it affects the number of bus cycles available to the CPU).
That was quite literally child's play compared to this one.
An example: Rendering graphics (i.e. sprites) at the far horizontal edges of the screen would require the CPU to perform some shenanigans to trick the video circuits; this would need to be done on every scanline and required the timing of the CPU to be in close sync with the video hardware. I understood that. I had also experienced how sprites and every eighth scanline (aka "badline") would mess up the carefully planned timings. Eventually, I kinda understood that concept. I had also seen, from code that I copied, how triggering a badline could be used to force the CPU into sync with the raster beam, but it was akin to black magic for me. It wasn't until years later, programming on the Amiga, that the penny dropped for me.
And of course, grasping the concept and implications of DMA was pretty basic stuff compared to what's going on in this article. I don't think that I'll ever devote the time to understand it in detail but I find it fascinating how people keep discovering new unintended features in the old C64 architecture.
1. On each raster scanline, at precisely cycle 15, you need to clear the Y-expand register (vertical pixel size doubler thingy that sprites can do). This throws the hardware into confusion: the internal registers keeping track of where in memory a sprite is being drawn from are scrambulated, and you wind up with the interesting graph presented on the right as to how the indexes progress from scanline to scanline. Y-expand is a single-byte register where each bit belongs to one of the eight sprites. Simply clearing a sprite's Y-expand bit on cycle 15 of every scanline is sufficient to introduce glitch pandemonium.
2. On some rows you want a sprite's Y-expand bit to be cleared to trigger the glitch, and on others you don't, so that the next row is read in sequence. So before the scanline ends, we need to set Y-expand back to 1 on a per-sprite basis. How did the author do this efficiently? ...
3. ... by using sprite-to-playfield character collision detection! He put the sprites in the background, behind the character graphics, and placed a single-pixel-wide vertical bar, using redefined characters or bitmap graphics, to cover the right-most pixel of the sprite. In the sprite definition's right-most pixel for the current row, he would encode either a 0 or 1 to decide if the next row's Y-expand should be 0 or 1. The natural collision detection of the sprite hardware would transcribe a 0 or 1 into the sprite-to-playfield collision register for all 8 sprites, setting a bit when both the sprite and the playfield had a filled pixel. You'd wind up with a byte that is ready-made for the Y-expand setting for the NEXT scanline for all 8 sprites. (I assume the idea is to read the collision register before the end of the scanline and write its value to Y-expand.) What a clever way to save a memory fetch! Also, reading the collision register before the end of the scanline resets the VIC chip's readiness to test collisions, so collisions will be tested again on the next scanline and the whole process can repeat.
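The data flow in steps 1 to 3 can be modeled in a few lines of Python. This is a toy sketch of the bit bookkeeping only, not a VIC-II emulation: the function names and the sample rows are invented, and it takes on trust the assumption above that the collision byte is copied straight into Y-expand. (On real hardware these would be the sprite-background collision register at $D01F and the Y-expand register at $D017; that mapping is my addition.)

```python
NUM_SPRITES = 8


def collision_byte(control_bits):
    """Pack one flag per sprite into a byte, with bit n belonging to sprite n.

    control_bits[n] is the right-most pixel of sprite n's current row.
    The playfield bar behind it is always filled, so a collision is
    latched exactly when that sprite pixel is 1.
    """
    value = 0
    for sprite, bit in enumerate(control_bits[:NUM_SPRITES]):
        if bit:
            value |= 1 << sprite
    return value


def y_expand_schedule(rows):
    """For each scanline, the byte to write to Y-expand before the line ends."""
    return [collision_byte(row) for row in rows]


# Hypothetical control pixels for three scanlines, sprites 0..7.
rows = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
schedule = y_expand_schedule(rows)
print([f"{b:08b}" for b in schedule])
```

The hardware does the packing loop for free, which is the whole appeal: the per-scanline control data rides along in the sprite bitmap, so the CPU never has to fetch it from a separate table.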
I find this hack to be really beautiful. That use of collision to dynamically build the next scanline's Y-expand values kind of reminds me of how modern 3D games may encode all kinds of different scene information into various layer buffers and color channels as a game frame is rendered over many passes.
As a kid I reused the 64 sprite hardware over and over to fill the screen with sprites. As I recall, I lost a lot of CPU time because the VIC chip was hogging access to the memory more than it normally would. This trick would have let me fill the screen with more sprites than ever. One of my dream goals as a kid was to reuse the sprite hardware faster, to change their colors and be able to get more colors on the screen.
I recall trying to find a way to get the sprites to be only one scanline high and spend all my time simply changing colors and memory locations on the sprites. It never worked; it was like the VIC chip was locked in on the settings for a sprite until it was done drawing. And that in fact is the trick with this Y-expand thing: you can trick the sprite hardware into finishing a sprite in fewer scanlines than should be possible (apparently as few as 4, according to the author!). Once the hardware thinks the sprite is finished, the hardware is relinquished and can be commanded to reuse that sprite on subsequent scanlines to paint more pictures.
It seems the demoscene may achieve my dream someday - perhaps my fancy super-color-1-pixel-high-sprite bitmap display mode hack may one day become a reality!