For this kind of testing, you frequently want a 'watchdog': hardware that can automatically reboot a computer if the software locks up or malfunctions. The way it works is that software 'pets' the watchdog at regular intervals, say every few minutes. If the software malfunctions, it stops petting the watchdog, the watchdog's timer runs out, and the system gets hard reset and boots a known good kernel.
Unfortunately, while nearly all computer hardware has a watchdog, the Linux kernel frequently doesn't implement support for it. That in turn means that automated testing of buggy kernels on real hardware frequently ends with the hardware getting stuck until a human resets it manually.
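
Where the kernel does have a driver, the petting loop described above can be sketched roughly like this. This is a minimal sketch, not a definitive implementation: it assumes the driver exposes the usual /dev/watchdog device node, where any write counts as a pet and writing the character 'V' just before closing asks the driver to disarm (whether that is honoured depends on the driver and on settings such as nowayout). The names run_with_watchdog, health_check, interval_s and max_pets are made up for illustration.

```python
import os
import time

# Conventional Linux watchdog device node; assumed to exist on the test machine.
WATCHDOG_DEVICE = "/dev/watchdog"


def run_with_watchdog(health_check, device=WATCHDOG_DEVICE,
                      interval_s=10.0, max_pets=None):
    """Pet the watchdog while health_check() keeps returning True.

    health_check, interval_s and max_pets are hypothetical parameters for
    this sketch. Returns the number of pets written to the device.
    """
    # Opening the device normally arms the hardware timer.
    fd = os.open(device, os.O_WRONLY)
    pets = 0
    try:
        while max_pets is None or pets < max_pets:
            if not health_check():
                # Stop petting and do NOT disarm: the timer is left running
                # so the hardware resets the machine when it expires.
                return pets
            os.write(fd, b"\0")  # any write counts as a keepalive
            pets += 1
            time.sleep(interval_s)
        # Clean shutdown: 'V' is the "magic close" character that asks the
        # driver to stop the timer (drivers may ignore it).
        os.write(fd, b"V")
        return pets
    finally:
        os.close(fd)
```

On a test machine you would call something like run_with_watchdog(my_check, interval_s=10) with an interval comfortably shorter than the hardware timeout, and leave max_pets unset so the loop runs until the check fails or the machine hangs.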