The solution I've got (in alpha) is a basic webcam that detects when you're looking at it.
The cam is positioned higher than most things in the room to cut down on unnecessary triggers.
When it triggers (currently using just simple CV facial landmark detection), it emits a beep and then listens for a verbal command.
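
To make the trigger, beep, listen flow concrete, here is a rough sketch of how the loop could be wired up. It is not the actual alpha code. Everything named here (`GazeTrigger`, `run`, `is_looking`, `beep`, `listen`, `hold_frames`) is made up for illustration, and the "must hold for a few frames" rule is my own assumption about one way to cut false triggers, not something the current build necessarily does. The landmark detector, the beep, and the speech recognizer are passed in as plain callables, so any library could sit behind them.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Frame = TypeVar("Frame")


class GazeTrigger:
    """Fires once when the 'looking' signal has held for hold_frames frames.

    Hypothetical helper: the hold/re-arm rule is an assumption, not the
    author's confirmed method.
    """

    def __init__(self, hold_frames: int = 5) -> None:
        if hold_frames < 1:
            raise ValueError("hold_frames must be at least 1")
        self.hold_frames = hold_frames
        self._streak = 0      # consecutive frames with a detected gaze
        self._armed = True    # re-armed only after the viewer looks away

    def update(self, looking: bool) -> bool:
        """Feed one per-frame detection result; return True on a trigger."""
        if not looking:
            self._streak = 0
            self._armed = True
            return False
        self._streak += 1
        if self._armed and self._streak >= self.hold_frames:
            self._armed = False  # do not fire again during the same glance
            return True
        return False


def run(
    frames: Iterable[Frame],
    is_looking: Callable[[Frame], bool],    # e.g. a facial landmark check
    beep: Callable[[], None],               # audible cue that it is listening
    listen: Callable[[], Optional[str]],    # returns a command, or None
    hold_frames: int = 5,
) -> List[str]:
    """Process frames in order; beep and listen each time the trigger fires."""
    trigger = GazeTrigger(hold_frames)
    commands: List[str] = []
    for frame in frames:
        if trigger.update(is_looking(frame)):
            beep()
            command = listen()
            if command is not None:
                commands.append(command)
    return commands
```

In a real setup `frames` would come from the webcam and `is_looking` would wrap whatever landmark detector is in use. Keeping those as parameters also means the trigger logic can be tested with a list of booleans and no camera attached.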