Open-Meteo has offered free weather APIs for a while now. Archiving data was not an option at first, because forecast data alone required 300 GB of storage.
In the past couple of weeks, I started looking for fast and efficient compression algorithms like zstd, brotli or lz4. All of them performed rather poorly with time-series weather data.
After a lot of trial and error, I found a couple of pre-processing steps that improve the compression ratio a lot:
1) Scaling data to reasonable values. Temperature has an accuracy of 0.1° at best, so I simply round everything to 0.05° instead of keeping the highest possible floating-point precision.
2) A temperature time-series increases and decreases in small steps: 0.4° warmer, then 0.2° colder. Storing only the deltas improves compression performance.
3) Data are highly spatially correlated. If the temperature is rising in one "grid-cell", it is rising in the neighbouring grid-cells as well, so I simply subtract one grid-cell's time-series from the next grid-cell's. This step in particular yielded a large boost.
4) Although zstd performs quite well on this encoded data, dedicated integer compression algorithms offer far better compression and decompression speeds. Namely, I am using FastPFor.
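To make the steps above concrete, here is a minimal NumPy sketch of steps 1–3, assuming one row per grid cell ordered so that adjacent rows are spatial neighbours; the function names are mine, and FastPFor (or zstd) would then be applied to the resulting small integers:

```python
import numpy as np

def encode(temps, scale=0.05):
    """Quantize floats to integers, then delta-encode in time and space.

    temps: 2D array of shape (n_cells, n_timesteps), one time-series
    per grid cell. `scale` is the rounding step (0.05 degrees here).
    """
    # Step 1: round to multiples of `scale` and store as integers.
    q = np.round(np.asarray(temps) / scale).astype(np.int32)
    # Step 2: delta-encode each time-series (prepending 0 keeps the
    # first real value as the first delta, so decoding is a cumsum).
    dt = np.diff(q, axis=1, prepend=0)
    # Step 3: subtract each cell's deltas from the previous cell's,
    # exploiting spatial correlation between neighbouring cells.
    ds = np.diff(dt, axis=0, prepend=0)
    return ds

def decode(ds, scale=0.05):
    # Invert both delta passes with cumulative sums, then rescale.
    dt = np.cumsum(ds, axis=0)
    q = np.cumsum(dt, axis=1)
    return q * scale
```

After encoding, everything except the first row and column consists of tiny integers near zero, which is exactly what integer codecs compress best.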
With that compression approach, an archive became possible. One week of weather forecast data should be around 10 GB compressed. With that, I can easily maintain a very long archive.
Radar data: I find the most obvious gaps in predicting short-term weather are related to radar data. Obviously radar datasets would require massive storage space, but I'm curious whether you have run across any free sources for archival radar data, APIs for real-time streams, or open-source code for scraping existing services' radar feeds.
Many weather variables like precipitation or pressure are very easy to compress. Variables like solar radiation are more dynamic and therefore compress less well.
Getting radar data is horrible... In some countries like the US or Germany it is easy, but many other countries do not offer open-data radar access. For the time being, I will integrate more open datasets first.
Edit: just read the 0.04-0.02 range. If I understand right, only putting 1 real temp and then deltas could fit 12 in the first int and 16 temps in the following ints? Quick napkin math, could be wrong :)
It only works well for integer compression. For text-based data, the results are not useful.
SIMD support and decompression speed are important aspects. All forecast APIs use the compressed files as well. Previously I was using mmap'ed float16 grids, which were faster, but took significantly more space.
If you encoded nearby grid cells as audio channels, FLAC would even handle the correlation, like it does for stereo audio.
Can you expand on this?
Inside each 5x5 chunk, I subtract the center grid-cell's time-series from all other grid-cells. The surrounding cells then contain only the difference to the center grid-cell.
Because the values of neighbouring grid-cells are similar, the resulting deltas are very small and compress much better.
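A minimal sketch of that chunked scheme, assuming a chunk is an array of shape (5, 5, n_timesteps) with the reference cell at index (2, 2); the function names are mine:

```python
import numpy as np

def center_delta(chunk):
    """Encode a 5x5 chunk of time-series as deltas to the center cell.

    chunk: array of shape (5, 5, n_timesteps). The center cell (2, 2)
    is kept as-is; every other cell stores its difference to it.
    """
    chunk = np.asarray(chunk)
    center = chunk[2, 2].copy()
    out = chunk - center          # broadcast over the 5x5 grid
    out[2, 2] = center            # keep the reference series intact
    return out

def center_undelta(out):
    """Invert center_delta: add the reference series back."""
    center = out[2, 2].copy()
    chunk = out + center
    chunk[2, 2] = center
    return chunk
```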
Please investigate if you would like to work for this company: https://www.energymeteo.de/ueber_uns/jobs.php
(1) Maybe it’s just me, but the “current jobs” are only available in German; if you switch to English, Spanish or French, the page gets translated but the three “current jobs” drop-down lists are removed. Super confusing, since it resets to German if you click “current jobs” from any of the other pages;
(2) HN is an English site, so it would be nice if you linked to the English page, not the German one;
(3) if you’re affiliated with the company, which I believe you are, you should say so; noting it in your profile with contact information would be nice too.
(4) Reminder that HN has free job postings every month if you are affiliated with the company:
I must also admit that I like my simple approach of just keeping data in compressed local files. With fast SSDs it is super easy to scale and fault-tolerant. Nodes can just `rsync` data to keep up to date.
In the past I used InfluxDB, TimescaleDB and ClickHouse. They also offer good solutions for time-series data, but add a lot of maintenance overhead.
Storing each weather forecast run individually, to enable performance evaluations like "how good is a forecast 5 days out?", would require a lot of storage. Some local weather models update every 6 hours.
But even with a continuous time-series, you can already tell how good or bad a forecast is compared to measurements. Assuming your measurements are correct ;-)
I have two questions:
1) How does the spatial resolution come into this? Is the data constant all across the 2 km x 2 km (?) parcel with an abrupt change at the edge, or is it interpolated in some way? Can I query the coordinates of the mesh?
2) How 'historical' does it get? How far back can I go with this?
Thank you!
1) Data come from multiple weather models. The primary data source is the German weather service DWD with the ICON weather model. In my past experience, the DWD ICON model performs best for many regions. DWD ICON has a global (~13 km), a European (7 km) and a Central Europe (1-2 km) "domain". A higher resolution can improve forecast accuracy, but this is not guaranteed.
For the Open-Meteo APIs, multiple models are mixed together. Typically, high-resolution domains only provide 3-5 days of forecast; afterwards they are combined with a global model.
For North American locations, I am going to add high-resolution domains from NOAA as well.
2) For now, only a couple of months of archive are available. There will be no limit on how much data can be stored; the data are fairly well compressed while still maintaining good read performance.
I am working on a long-term archive as well. ECMWF provides a reanalysis dataset called ERA5 [1] with data from 1959. It will still take me a couple of weeks to process. With 23 weather variables, it requires around 20 TB of disk space (gridded float32 with deflate compression).
[1] https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...
I don’t think I understand the resolution then, could you explain a bit more? Say, I request data along a 100km line, every 10m. Do I get the same numbers if it’s in the same mesh cell with a sudden change when it’s crossed, or do I get some (bilinear?) interpolation?
https://open-meteo.com/en/docs#latitude=52.52&longitude=13.4...
https://api.open-meteo.com/v1/forecast?latitude=40.71&longit...
Temperature, clouds, etc, seem fine
EDIT: Issue identified and will be fixed in the next few days! Thanks!
This is why the 14th Weather Squadron creates Wind Stratified Conditional Climatology tables. Past performance is indicative of future results, especially when you're not under the influence of a frontal system.
You explain that your API offers historic data using the "past_days" parameter. Could you also offer a "date" parameter for a given day, or are you only keeping a rolling window of data?
If end_date is not specified, it would return start_date with a 7-day forecast.
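As a sketch of how such a request could look, mirroring the existing URL style (the `start_date`/`end_date` parameter names and the `forecast_url` helper are hypothetical here, not part of the documented API):

```python
from urllib.parse import urlencode

BASE = "https://api.open-meteo.com/v1/forecast"

def forecast_url(latitude, longitude, start_date, end_date=None):
    """Build a forecast URL; without end_date the API would default
    to a 7-day window starting at start_date."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "hourly": "temperature_2m",
        "start_date": start_date,
    }
    if end_date is not None:
        params["end_date"] = end_date
    return f"{BASE}?{urlencode(params)}"
```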
excited to play w some of this data
What forecast models do you use for Australia?
So far I do not offer commercial options, just to keep myself out of any potential legal issues. In the next few weeks I will review everything, make sure attributions and licenses are correct, and remove the non-commercial limitation.
Australia is currently only covered by a global weather model from the German weather service DWD. I will check whether the BOM offers any open-data models.
If you’re curious to read the license text: https://gourdian.net/g/eric/noaa_gsod.global_summary_of_day#...