It is worth looking at the classic image processing algorithms, which can be effective and, more importantly, behave predictably.
In your example, record footage of the factory floor while it is empty. Then, once work begins, slide an approximately human-sized (and human-shaped) rectangular window over each frame and look for areas whose difference from the image of the empty floor exceeds a threshold.
You can then use each window that exceeds the threshold as input to a classifier, whose job will be easier thanks to the considerable dimensionality reduction. Alternatively, you may get sufficient performance using further deterministic techniques.
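
As a rough illustration of the idea, here is a minimal sketch in Python with NumPy. It assumes grayscale frames stored as 2-D arrays. The function name `find_changed_windows`, the window size, the stride, and both thresholds are placeholders I made up for the example, so you would need to tune them for your camera and scene.

```python
import numpy as np

def find_changed_windows(background, frame, win_h=8, win_w=4, stride=2,
                         pixel_thresh=30, frac_thresh=0.5):
    """Return top-left (row, col) corners of windows that differ from the background.

    All parameter values are illustrative assumptions, not recommendations.
    """
    # Per-pixel absolute difference; cast first so uint8 values do not wrap around.
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    changed = diff > pixel_thresh  # boolean mask of pixels that changed noticeably

    hits = []
    rows, cols = changed.shape
    for r in range(0, rows - win_h + 1, stride):
        for c in range(0, cols - win_w + 1, stride):
            window = changed[r:r + win_h, c:c + win_w]
            # Flag the window if a large enough fraction of its pixels changed.
            if window.mean() >= frac_thresh:
                hits.append((r, c))
    return hits

# Synthetic stand-in for real footage: a flat "empty floor" and one bright blob.
background = np.full((20, 20), 100, dtype=np.uint8)
frame = background.copy()
frame[4:12, 6:10] = 200  # an 8x4 region that differs from the empty floor

hits = find_changed_windows(background, frame)
```

Overlapping windows around the same object will all be flagged, so in practice you would probably merge neighbouring hits before cropping each one out of `frame` and passing it to the classifier.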