>Great point, thank you. However, I think this leaves out many companies - those that don't have job postings
Unless you're planning to cold call people and get them to pinky swear to tell you honestly what they're using, or you have some other plan, you're somewhat stuck anyway for companies that don't post jobs.
Also worth noting: plenty of companies won't really tell you anyway. E.g. plenty of postings use language like "systems programming language" or "object oriented language" when they could be using anything from a C-family language to Haskell (leaving aside how secretive many Haskell jobs are, or how often they're hidden behind custom dialects). You are going to run into all kinds of human BS. It'll be fun, but a can of worms nonetheless.
>I think a job board like startup.jobs would solve this by creating a job archive - then it would be prime scraping material. But it's only a job board with (mainly) current jobs
Not sure how much experience you have modelling data, but capturing postings by date can also be trickier than expected, even leaving aside the fun of unstructured data, the differences in models between platforms, and the judgement calls you need to make about where you're crawling.
Having cut my teeth scraping property listings from competitor websites, I came to realise that most boards incentivise people to delete and repost ads so they boost their recency score and appear higher in the search. So now you have duplicates messing up your data, which you need to deal with if you're trying to create value from that data.
The classified site also doesn't like this, so it will try to stop this gaming of the system, and that game of cat and mouse will normally mess up your scraping and dedupe logic too.
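To make the repost problem concrete, here's a rough sketch of the simplest dedupe I'd start with. Everything in it is made up for illustration: the `Posting` fields, the `fingerprint` and `dedupe` helpers, and the assumption that a repost keeps roughly the same company, title and body while getting a new id and date. It's not how any particular board or scraper does it.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Posting:
    # Hypothetical record shape; real boards will each differ.
    posting_id: str  # board-assigned id, changes on every repost
    title: str
    company: str
    body: str
    seen_on: date    # date our crawler saw the ad


def _norm(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def fingerprint(p: Posting) -> str:
    """Hash only the content that should survive a delete-and-repost.

    The id and date are deliberately left out, since those are exactly
    what changes when an ad is reposted.
    """
    key = "|".join(_norm(part) for part in (p.company, p.title, p.body))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def dedupe(postings):
    """Collapse reposts into one record with first/last sighting dates."""
    merged = {}
    for p in postings:
        fp = fingerprint(p)
        rec = merged.get(fp)
        if rec is None:
            merged[fp] = {
                "title": p.title,
                "company": p.company,
                "first_seen": p.seen_on,
                "last_seen": p.seen_on,
                "sightings": 1,
            }
        else:
            rec["first_seen"] = min(rec["first_seen"], p.seen_on)
            rec["last_seen"] = max(rec["last_seen"], p.seen_on)
            rec["sightings"] += 1
    return list(merged.values())
```

This only catches reposts that are identical after normalisation. As soon as posters (or the site's anti-gaming measures) start tweaking the wording, an exact hash stops matching and you're into near-duplicate detection, which is where the cat and mouse really starts.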
As said, it's a potentially fun can of worms to open. I was just making a joke about HN commenters' tendency to massively underestimate the oceans of complexity that separate their hello world project from an enterprise grade "just a CRUD app" system that people pay for. E.g. all the people who could totally build Twitter with a SQLite DB, some bash scripts, sellotape, etc.