Why won't it match the real schedule?
Caltrain also runs extra trains for special events such as baseball games. Those are planned, but not on the regular schedule.
Totally doable I should say.
By the end of the day, you're going to be more than a few minutes off from how you started.
Nonstop long-distance rail service can exactly match its schedule because there are relatively few places for entropy to creep in - you just hold a constant speed across miles and miles of track that you pretty much have to yourself. A commuter rail system is much more complex and there is much more room for entropy.
Of course, many American systems are operating at a pretty severe disadvantage, being hamstrung by poor equipment and infrastructure, a lack of funding, understaffing, a hostile political environment, and even pervasive cultural attitudes that dismiss railroads as not worth investing in. I suppose given all that, it's a wonder they do as well as they do...
But still, I think it's important to never forget: it's absolutely possible to do much better.
Another crowdsourced Caltrain Twitter account is https://twitter.com/caltrain. You can see some of the more granular delays there. All these crowdsourced status accounts should be proof that Caltrain SHOULD publish the raw data for us to use.
I did something similar using a GoPro and computer vision when I lived next to one of the 101 off-ramps in SF. I got it to work for most daylight hours (headlights screwed it up hardcore) before our landlord raised our rent by $1000/mo and we moved.
I figured it could have been a way to calculate ad impressions for billboards, but I also figured Clear Channel probably already knows those numbers.
Their commerce system is not mobile friendly and is a pain for mobile users. It could be much more efficient.
My solution, since I didn't want to do any screen scraping or turn identifying individual buses/trains into a project in and of itself, was to use Portland's TriMet API. That API actually returns specific route numbers, plus estimated and scheduled times for each stop (interpolated in the case of non-time points). I'm originally from the Portland area, so I'm pretty familiar with the geography and roads.
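To make "interpolated in the case of non-time points" concrete, here is a minimal sketch of one way such times could be filled in. It assumes simple linear interpolation by distance along the route. I don't know how TriMet actually computes its values, and the function name and the (distance, time) input shape are made up for illustration.

```python
def interpolate_times(stops):
    """Fill in missing times for stops between known time points.

    stops: list of (distance_m, seconds_or_None), ordered along the route,
    with strictly increasing distances. The first and last stops must be
    time points (i.e. have a known time). This input shape is hypothetical.
    """
    known = [i for i, (_, t) in enumerate(stops) if t is not None]
    if not known or known[0] != 0 or known[-1] != len(stops) - 1:
        raise ValueError("first and last stops must be time points")

    times = [t for _, t in stops]
    # Walk each pair of neighbouring time points and interpolate between them.
    for a, b in zip(known, known[1:]):
        d0, t0 = stops[a]
        d1, t1 = stops[b]
        for i in range(a + 1, b):
            frac = (stops[i][0] - d0) / (d1 - d0)
            times[i] = t0 + frac * (t1 - t0)
    return times


# A stop halfway between two time points gets the halfway time.
print(interpolate_times([(0, 0), (500, None), (1000, 600)]))
```

Interpolating by distance ignores dwell time and traffic, so treat the result as a rough scheduled time rather than a prediction.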
From what I remember in the 511.org Google developer group, people have raised this exact issue, i.e. Caltrain train numbers. The guy responding from the MTA said they'd try and integrate it in the future, but these posts were like back in 2012 (IIRC).
NextBus is the source of bus position and predicted arrival times for MUNI, and appears to be the same for many other transit agencies. I can verify that (as of three minutes ago) it's still returning reasonable data.
However, if you're looking for the SFMTA schedule [0], I don't think you can get it through the NextBus API. I do know that you can get it through a GTFS "feed" found here: http://sfmta.com/about-sfmta/reports/gtfs-transit-data
Also, you might be interested in this, if you haven't seen it already: http://bdon.org/transit/ (SF MUNI transit delays. [This isn't my work.])
[0] Why would you want MUNI's schedule? It's not like any of the drivers care about it! ;)
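If you do go the GTFS route, here is a rough sketch of pulling scheduled arrivals for one stop out of a feed zip. It relies only on the standard `stop_times.txt` file and its `trip_id`, `arrival_time` and `stop_id` columns from the GTFS spec. The function names are my own, and I haven't checked it against the SFMTA feed specifically.

```python
import csv
import io
import zipfile


def gtfs_time_to_seconds(value):
    """Convert a GTFS HH:MM:SS time to seconds after the service day starts.

    GTFS allows hours of 24 or more for trips that run past midnight.
    """
    h, m, s = value.split(":")
    return int(h) * 3600 + int(m) * 60 + int(s)


def scheduled_arrivals(feed, stop_id):
    """Return sorted (seconds, trip_id) pairs for one stop.

    feed: path to a GTFS zip, or a file-like object containing one.
    Rows with an empty arrival_time (non-time points) are skipped.
    """
    with zipfile.ZipFile(feed) as zf:
        with zf.open("stop_times.txt") as raw:
            # utf-8-sig strips a byte-order mark if the file has one.
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8-sig"))
            return sorted(
                (gtfs_time_to_seconds(row["arrival_time"]), row["trip_id"])
                for row in reader
                if row["stop_id"] == stop_id and row["arrival_time"]
            )
```

This ignores `calendar.txt` and `trips.txt`, so it mixes weekday and weekend service together. You'd need to join on `service_id` to get a schedule for a specific day.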
RE: scraping - instead of putting logic in your scraper, just download the entire section you need and store it on disk as a file. Then parse it and shove it into the database whenever you feel like it. You can rerun the parsing any time, since you'll have all the historically scraped website data on disk.
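Something like this is what I mean, as a minimal sketch. The function names, the timestamp-based file naming and the `.html` extension are just assumptions for illustration. Fetching is left out, so you pass in whatever bytes you downloaded.

```python
import pathlib
import time


def save_snapshot(raw_bytes, directory, fetched_at=None):
    """Write one raw scrape to disk, named by its Unix timestamp."""
    ts = int(time.time() if fetched_at is None else fetched_at)
    directory = pathlib.Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{ts}.html"
    path.write_bytes(raw_bytes)
    return path


def reparse_all(directory, parse):
    """Run a parser over every stored snapshot, oldest first.

    parse: any function taking the raw bytes and returning parsed data.
    Yields (timestamp, parsed) pairs, ready to insert into a database.
    """
    snapshots = pathlib.Path(directory).glob("*.html")
    for path in sorted(snapshots, key=lambda p: int(p.stem)):
        yield int(path.stem), parse(path.read_bytes())
```

When the site changes its markup or you find a parser bug, you fix `parse` and run `reparse_all` again over the same files instead of losing that history.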