I can't stress just how important (and how hard) it is to get a great source of data for airports -- I've now built 3 travel-related projects (the latest, Wanderlog [https://wanderlog.com], keeps people's flight reservations, so uses it for an autocomplete), and it's been a key building block for all of them.
The main datasets we use are:
- OpenFlights [1]: mentioned in this post, but this dataset was great since it had timezone too.
- OurAirports [2]: no timezone here, but the "type" and "scheduled_service" columns in this dataset are essential. "Type" lets you distinguish between small/medium/large airports, and "scheduled_service" lets you easily filter out airports without real flights (which you often might not care about).
- Random other GitHub Gist [3]: I have no idea where this data comes from, but it was surprisingly complete and has a few golden nuggets like "num_flights" and "runway_length" in addition to "timezone". The presence of a "woeid" suggests Yahoo-related origins, but it's hard to be sure.
- We now supplement this with airports from autocomplete APIs like Skyscanner's, because they're still the most up-to-date.
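As a rough sketch of the OurAirports filtering described above: this assumes a local copy of the OurAirports airports CSV with "type", "scheduled_service", and "iata_code" columns, and that "scheduled_service" holds "yes"/"no". The function name and the sample rows are made up for illustration, not real export data.

```python
import csv
import io

# Airport sizes worth keeping for a flight autocomplete
# (assumed OurAirports "type" values).
WANTED_TYPES = {"large_airport", "medium_airport"}


def airports_with_flights(csv_text, wanted_types=WANTED_TYPES):
    """Return rows with a wanted type, scheduled service, and an IATA code."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if row["type"] in wanted_types
        and row["scheduled_service"] == "yes"
        and row["iata_code"]
    ]


# Fictional rows in the assumed column layout.
SAMPLE = """ident,type,name,scheduled_service,iata_code
XAAA,large_airport,Example International,yes,XAA
XBBB,medium_airport,Example Regional,no,XBB
XCCC,heliport,Example Heliport,no,
"""

if __name__ == "__main__":
    for airport in airports_with_flights(SAMPLE):
        print(airport["iata_code"], airport["name"])
```

Dropping the rows without an IATA code also removes most heliports and private strips, which is usually what you want for an autocomplete.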
Long story short, it'd be AWESOME to have one complete, updated database with all this data in one place. This kind of data really should be public and a public service, but until then it's unfortunately up to the community.
[0] https://www.iata.org/en/publications/store/airline-coding-di...
[1] https://github.com/jpatokal/openflights/
Every source was definitely useful, but I think ultimately crawling Wikipedia gave me the most useful and highest-quality set of data (after some significant data cleaning). The List of Airports By IATA Code [0] is almost as comprehensive as the official list from IATA, and you can follow the links to crawl info about the airport and city served. Getting info about what city the airport is considered to "serve" is so useful, as most airports are technically not in the city people associate them with, and some "serve" multiple cities.
Of course the difficult part there is that Wikipedia data isn't really clean or standardized. The page HTML isn't standard; even things that look very standardized, like the sidebar, will have 30 variations when you crawl all the airport pages. There is WikiData, but I found it still wasn't simple to get the data from there, and it also didn't include most of the page content, which I wanted. [1]
Nowadays we have direct relationships with the airlines/GDS/so on, and also a department of people to add and manage the data ourselves, because even the direct source gives you pretty poor quality data. The project was way more fun when I was wrangling data from a dozen places around the web :) Now it's more of an enterprise CRUD webapp with some fancy localization and GIS tooling.
[0] https://en.wikipedia.org/wiki/List_of_airports_by_IATA_airpo...
[1] This was a while ago, so maybe WikiData has changed
FlightAware have a similar API[1].
These aren’t free or open, mind you, but are at least readily accessible for those that need/want them.
Not sure what you get with the commercial services, but even the free services are pretty good. It's what we used in 1st CAV to track the redeployment of the last units to leave Iraq in 2011.
Since finding and booking flights is already trivially easy for the consumer, it's in the interests of airlines (as well as the middlemen) to be cautious about who gets access to which API function, especially when it comes to selling tickets.
At the extreme end of the scale, some low cost carriers can only be booked on their website, because they make more in upsell from the booking flow than they could make with extra bookings from other channels.
But other than that, I assume there's a lot of money in partnerships with sites like Kayak and Priceline. But I'm not even sure which direction that money flows.
There's also https://www.adsbexchange.com/ which doesn't filter their data (probably much to the chagrin of various businesses and governments). If you see/hear a weird plane above and you can't find it on the commercial services above, check ADSB Exchange.
"Top tier" may be overstating it, but setting up an RPi and a $20 USB ADS-B receiver will get you the $90/month Enterprise feed [1]. Still a great deal if this is a topic that interests you.
I wish I was able to track more frequently than every 15 minutes (the free version's API limit), because some aircraft pass overhead before they're picked up. So it's not the most accurate, but it gives a rough figure for traffic to/from O'Hare, Midway, and Gary.
... for my project, I actually got some historical paper schedules of the official aviation guide; they’re basically phone books. I hope to find a decent/affordable database for more recent data. (MIT students/alumni actually get access to a database going back to 1979, but alas no access for outsiders...)
The actual seats bit is surprisingly complex if you want accurate figures, as the same aircraft type can have wildly different numbers of seats depending on layout and class configuration. OAG/Innovata's standard schedule product shows the aircraft variant normally assigned to a route, and they survey the airlines on the seating configurations of their aircraft to calculate capacity and ASKs. I believe Cirium now cross-reference this with flight tracking data to get data based on the actual aircraft used (which solves edge cases like substitutions or an airline operating differently configured A330-200s on different routes). Doing that was part of the masterplan when I worked for them before they acquired FlightStats.
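For the capacity and ASK calculation mentioned above, a minimal sketch: ASKs (available seat kilometres) are seats multiplied by distance, summed over flights. The configuration labels and seat counts below are invented for illustration, and the haversine great-circle distance is only an approximation, not necessarily what OAG or Cirium use.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def available_seat_km(flights, seats_by_config):
    """Sum seats x distance; each flight is a (config_id, distance_km) pair."""
    return sum(seats_by_config[config] * dist for config, dist in flights)


# Hypothetical seat counts for two layouts of the same aircraft type.
SEATS_BY_CONFIG = {"A332-dense": 300, "A332-premium": 240}

if __name__ == "__main__":
    flights = [("A332-dense", 1000.0), ("A332-premium", 1000.0)]
    print(available_seat_km(flights, SEATS_BY_CONFIG))
```

Keying capacity on the configuration rather than the aircraft type is the point: the same 1000 km sector yields different ASKs depending on which layout actually flew.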
You may already be aware of this, but if you want real-time ADS-B, check out PiAware (https://flightaware.com/adsb/piaware/) as a low cost option to run your own ADS-B ground station via a Raspberry Pi.
I wish I had access to the GDS data to get realtime seat/award availability, but I couldn't find any pricing information for getting it through Sabre's API.
Does anyone know how much that costs, or if there are any services which provide it as an API? I use ExpertFlyer for personal use, but ideally I'd want to get that information at the source…
More info on API here https://github.com/amadeus4dev/hackathon-starter/blob/master...
Disclaimer: I work for Amadeus but have never actually used this API service. I'd be interested in your feedback.
"Open train data" is a bit vague without mentioning where these trains might be, but I did find the London Tube schedule[1] in GTFS[2] format, as well as the bus schedule[3] also in GTFS format. Look for your city or country name followed by "open data" and you might find interesting datasets. In the UK the National Public Transport Data Repository (NPTDR) publishes a database of every public transport journey in Great Britain for a selected week in October each year[4] (only goes until 2011 though).
[1] Tube, scheduled trips: https://hash.ai/@tfl/tfl-gtfs
[2] GTFS is a CSV-based transit data format: https://developers.google.com/transit/gtfs/reference
[3] Buses, scheduled trips: https://data.bus-data.dft.gov.uk/timetable/download/
[4] NPTDR database: https://data.gov.uk/dataset/d1f9e79f-d9db-44d0-b7b1-41c216fe...
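Since GTFS is just a zip of CSV files, here is a minimal sketch of reading one with the Python standard library. It assumes the feed contains the standard stops.txt and stop_times.txt tables with "stop_id" and "stop_name" columns; the function names are my own.

```python
import csv
import io
import zipfile


def read_gtfs_table(feed, table_name):
    """Read one table (e.g. 'stops') from a GTFS zip as a list of dicts.

    `feed` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(feed) as archive:
        with archive.open(f"{table_name}.txt") as raw:
            # utf-8-sig drops the byte-order mark that some feeds include.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


def calls_per_stop(feed):
    """Count scheduled stop events per stop name."""
    names = {s["stop_id"]: s["stop_name"] for s in read_gtfs_table(feed, "stops")}
    counts = {}
    for stop_time in read_gtfs_table(feed, "stop_times"):
        # Fall back to the raw id if a stop is missing from stops.txt.
        name = names.get(stop_time["stop_id"], stop_time["stop_id"])
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Large feeds can have millions of stop_times rows, so for real use you would probably stream the rows instead of building a list.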
They say that they get their data from Cirium.
http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=23...
But it's cumbersome to work with.
I am working (on and off) on a DBMS benchmark based on this data. As part of that endeavor, I have a script which:
* Automates downloading the CSVs.
* Creates an appropriate SQL database schema.
* Performs a bit of rudimentary cleaning (e.g. invalid character codes; optional)
* Loads the CSV files into the database.
So that, from the command-line, you could get the flight on-time performance data by merely typing in something like:
/path/to/usdt-ontime-tools/scripts/setup-usdt-ontime-db -r -db-name ontime --first-year 2019 --last-year 2020
It's available within this repository: https://github.com/eyalroz/usdt-ontime-tools
The caveat is that, for now, the only DBMS supported directly is MonetDB: https://www.monetdb.org/ , a FOSS analytics-oriented columnar DBMS.
An adaptation of the script for other systems (MySQL/Maria, PostgreSQL) should be straightforward, since the commands are SQL'ish after all. If you're interested in that, open an issue or write me.
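To give an idea of what the schema, cleaning, and load steps could look like on another system, here is a small sketch using SQLite from the Python standard library. This is not the script from the repository: the table name, the five-column subset, and the cleaning rule are all assumptions for illustration, and the download step is left out.

```python
import csv
import io
import sqlite3

# Assumed subset of columns; the real on-time CSVs have many more.
COLUMNS = ["FlightDate", "Origin", "Dest", "DepDelay", "Cancelled"]


def create_schema(conn):
    """Create a (hypothetical) table matching the column subset above."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ontime ("
        "FlightDate TEXT, Origin TEXT, Dest TEXT, "
        "DepDelay REAL, Cancelled INTEGER)"
    )


def clean(row):
    """Rudimentary cleaning: strip whitespace, turn empty strings into NULL."""
    values = []
    for col in COLUMNS:
        value = (row.get(col) or "").strip()
        values.append(value if value else None)
    return values


def load_csv(conn, csv_file):
    """Load rows from an open CSV file object into the ontime table."""
    rows = (clean(row) for row in csv.DictReader(csv_file))
    conn.executemany("INSERT INTO ontime VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    # Fictional sample rows, not real flight data.
    sample = io.StringIO(
        "FlightDate,Origin,Dest,DepDelay,Cancelled\n"
        "2019-01-01,AAA,BBB,5.00,0.00\n"
    )
    load_csv(conn, sample)
    print(conn.execute("SELECT * FROM ontime").fetchall())
```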
Live data: https://globe.adsbexchange.com
I'm still working on bit.io and would love feedback so hit me.
For some reason they make you register in order to download the data, and the site is a bit confusing, but the data seems good.
Since the pandemic I've found plenty of airlines selling tickets and then systematically cancelling the flight a few days before. I was looking to scrape some data to avoid these kinds of unreliable flights.
The USDT on-time performance data goes back as far as October 1987 (and you can specify the period to the download script with the --first-year, --first-month, --last-year, and --last-month command-line switches).
Once the data is loaded you can use spiffy SQL to print out routes the way you like them. Unfortunately the data is also a bit dirty (which is something I'm working on).
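As an example of the kind of SQL this enables, here is a sketch that ranks routes by cancellation rate. It runs against SQLite rather than MonetDB, and the "ontime" table with "Origin", "Dest", and "Cancelled" columns is an assumed, simplified layout with fictional rows.

```python
import sqlite3

# Cancelled is assumed to be 0 or 1, so its average is the cancellation rate.
ROUTE_SQL = """
SELECT Origin || '-' || Dest AS route,
       COUNT(*)              AS flights,
       AVG(Cancelled)        AS cancel_rate
FROM ontime
GROUP BY Origin, Dest
ORDER BY cancel_rate DESC, route
"""


def route_cancellation_rates(conn):
    """Return (route, flights, cancel_rate) tuples, worst routes first."""
    return conn.execute(ROUTE_SQL).fetchall()


def demo_connection():
    """Build an in-memory table with a few fictional rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ontime (Origin TEXT, Dest TEXT, Cancelled INTEGER)")
    conn.executemany(
        "INSERT INTO ontime VALUES (?, ?, ?)",
        [("AAA", "BBB", 1), ("AAA", "BBB", 0), ("AAA", "CCC", 0)],
    )
    return conn


if __name__ == "__main__":
    for route, flights, rate in route_cancellation_rates(demo_connection()):
        print(f"{route}: {flights} flights, {rate:.0%} cancelled")
```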
source: started with OpenFlights but had to switch to OurAirports for my project https://flightnotebook.com
Has anyone built it yet, or does anyone need something like this?