- Forward-filling short periods of missing values. Why keep this in when you explicitly mention this is not normal? Either remove it all or don't impute anything
- Claiming superiority over classic models and then not mentioning any in the results table
- Or let's not forget, the cardinal sin of using MAPE as an evaluation metric
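To make the MAPE complaint concrete, here is a minimal Python sketch. The `mape` helper and the toy numbers are made up for illustration, not taken from the post. It shows the two usual problems: the metric is undefined as soon as an actual value is zero, and the same absolute miss is scored differently depending on whether it is an over- or an under-forecast.

```python
import math


def mape(actuals, forecasts):
    """Mean absolute percentage error in percent (hypothetical helper)."""
    ratios = []
    for a, f in zip(actuals, forecasts):
        if a == 0:
            # Division by zero: MAPE is undefined here, reported as infinity.
            return math.inf
        ratios.append(abs(a - f) / abs(a))
    return 100.0 * sum(ratios) / len(ratios)


# Same absolute miss of 50 units, very different percentages.
over = mape([100], [150])           # forecast too high
under = mape([150], [100])          # forecast too low
with_zero = mape([0, 10], [1, 10])  # a single zero actual breaks the metric
```

Here `over` comes out at 50% and `under` at roughly 33%, so a model can lower its MAPE simply by forecasting low.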
Long answer: Is the metric for people with subject-matter knowledge? Then (Weighted) RMSSE, or the MASE alternative for a median forecast. WRMSSE is very nice: it can deal with zeroes, is scale-invariant, and is symmetrical in penalizing under- and over-forecasting.
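For anyone who hasn't met these, a rough sketch of the scaled errors, assuming the usual definition where the forecast error is divided by the in-sample error of a one-step naive forecast. The function names, the toy series and the weights are all placeholders; in a real setup the weights would come from something like revenue share per series.

```python
import math


def _naive_scale(train, power):
    """Mean |one-step naive error|**power on the training data."""
    diffs = [abs(train[t] - train[t - 1]) ** power for t in range(1, len(train))]
    scale = sum(diffs) / len(diffs)
    if scale == 0:
        raise ValueError("constant training series: scaled error is undefined")
    return scale


def rmsse(train, actuals, forecasts):
    """Root mean squared scaled error for one series."""
    mse = sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / len(actuals)
    return math.sqrt(mse / _naive_scale(train, 2))


def mase(train, actuals, forecasts):
    """Mean absolute scaled error for one series."""
    mae = sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)
    return mae / _naive_scale(train, 1)


def wrmsse(series, weights):
    """series: list of (train, actuals, forecasts); weights: one per series."""
    total = sum(weights)
    return sum(w * rmsse(*s) for s, w in zip(series, weights)) / total


small = ([0, 2, 0, 2], [0, 2], [1, 1])
large = ([0, 20, 0, 20], [0, 20], [10, 10])  # same series at 10x the scale
```

Both toy series score the same RMSSE even though one is ten times larger, which is the scale invariance mentioned above, and zero actuals in the evaluation window cause no trouble. The one failure case is a constant training series, where the scale itself is zero.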
The above metrics are completely uninterpretable to people outside the forecasting sphere, though. For those cases I tend to just stick with raw errors. If a percentage metric is really necessary, then a weighted MAPE/RMSE: the weighting is still graspable for most people, and it doesn't explode with zeroes.
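One common reading of "weighted MAPE" is the volume-weighted version (often called WAPE), where total absolute error is divided by total actual volume. That reading is an assumption on top of the comment above, and `weighted_mape` is a made-up helper name. A minimal sketch:

```python
def weighted_mape(actuals, forecasts):
    """Volume-weighted MAPE in percent: sum of |errors| / sum of |actuals|."""
    total_actual = sum(abs(a) for a in actuals)
    if total_actual == 0:
        # Only fails when every actual is zero, not when one of them is.
        raise ValueError("all actuals are zero: weighted MAPE is undefined")
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    return 100.0 * total_error / total_actual


# The zero actual in the first period does not blow anything up.
demo = weighted_mape([0, 10, 10], [1, 9, 12])
```

The toy example has 4 units of absolute error against 20 units of actual volume, so `demo` is 20%.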
I've also been exploring FVA (Forecast Value Added), compared against a second decent forecast. FVA is very intuitive, at least if your base measures are reliable. Aside from that, I always look at forecast plots. It's tedious, but they often tell you a lot that gets lost in the numbers.
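A rough sketch of the FVA idea, assuming it is computed as the baseline's error minus the candidate's error so that positive means the candidate added value. The sign convention, the choice of MAE as the base measure and the toy numbers are all assumptions for illustration.

```python
def mae(actuals, forecasts):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)


def fva(actuals, baseline, candidate, metric=mae):
    """Forecast value added: error of the baseline minus error of the candidate.

    Positive means the candidate beats the baseline on the chosen metric.
    """
    return metric(actuals, baseline) - metric(actuals, candidate)


actuals = [10, 12, 14]
baseline = [9, 10, 12]    # e.g. a naive "last observed value" forecast
candidate = [10, 11, 14]  # the forecast under evaluation

added = fva(actuals, baseline, candidate)
relative = added / mae(actuals, baseline)  # share of baseline error removed
```

In this toy case the candidate removes 80% of the baseline's MAE. The number is only as trustworthy as the base measure and the baseline behind it.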
RMSLE I haven't used much. From what I've read it looks interesting, though more for very specific scenarios (many outliers, high variance, nonlinear data?).
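For reference, RMSLE is just RMSE computed on `log(1 + x)` of the values, so it scores relative rather than absolute differences and penalizes an under-forecast more than an over-forecast of the same size. A minimal sketch with made-up numbers, assuming non-negative data:

```python
import math


def rmsle(actuals, forecasts):
    """Root mean squared logarithmic error; inputs must be greater than -1."""
    squared = [
        (math.log1p(f) - math.log1p(a)) ** 2
        for a, f in zip(actuals, forecasts)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Same absolute miss of 50 units on an actual of 100.
under = rmsle([100], [50])
over = rmsle([100], [150])
```

Here `under` is larger than `over`, which may or may not be what you want depending on the cost of stock-outs versus overstock.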
But making out-of-distribution predictions does not make a model a foundation model for time series; that's just the basic task that all time series forecasting models perform. A more interesting question is whether an LLM architecture improves univariate or multivariate time-series prediction. I don't think the answer is yes. Depending on your domain, being able to use language inputs to your model may have a positive impact, and the best way to incorporate language inputs is certainly to use a transformer architecture, but that isn't what is addressed in this post.
https://github.com/Mcompetitions/M4-methods
https://en.wikipedia.org/wiki/Makridakis_Competitions
Makridakis' conclusion remained true for many years: "statistically sophisticated and complex methods do not necessarily provide more accurate forecasts than simpler ones."
Maybe things have changed?
(side: Nixtla showed a simple ensemble outperforming Chronos, and the Chronos team responded, but there's some back and forth in the comments: https://www.linkedin.com/pulse/extended-comparison-chronos-a...)
That sums it up, and it’s no surprise that Datadog’s Toto model performed exceptionally well.
The results would have been much more useful had they opted for a heterogeneous mix of data sets. I am thinking of census data and statistics, financial forecasting (GDP, interest rates), clinical trial drop-out rates, etc. So many interesting problems out there.
But for general purpose time-series forecasting, benchmarks mentioned in other comments like GIFT or M4 might come in handy. We might include them in the follow-up experiment.
Have you published, or do you plan to publish, any of your code or data sets from this?