I've also seen some benefit fingerpinting the network traffic the websites make using a headless browser, to identify which ad networks they load. Very few spam sites have no ads, since there wouldn't be any economy in that.
e.g. https://marginalia-search.com/site/www.salon.com?view=traffi...
The full data set of DOM samples + recorded network traffic are in an enormous sqlite file (400GB+), and I haven't yet worked out any way of distributing the data yet. Though it's in the back of my mind as something I'd like to solve.