Stan is amazing in that you can fit pretty much any model you can describe in an equation (given enough time and compute, of course)!
More on Stan here: http://mc-stan.org/
Cloned and tried to build it but I'm getting an error regarding uncomplicate:commons:0.3.0-SNAPSHOT being unavailable on clojars. Is that something you currently have installed to your local maven repo? I don't see it here: https://clojars.org/repo/uncomplicate/commons/
I can get it to build with 0.2.2 but it is missing the "releaseable?" function.
In any case this looks awesome and I'll be keeping an eye on it / playing with it for some new projects.
EDIT: I was able to get it building by cloning your commons library and running "lein install". :)
[1] http://camdavidsonpilon.github.io/Probabilistic-Programming-...
Edit: I wouldn't recommend Probabilistic Programming and Bayesian Methods for Hackers. When I tried using it, I felt that too much was glossed over. The book that I recommend excels at conveying a strong intuition for how these various techniques work.
That's the foundation. The way you set up your model is by nodes and edges that specify the flow of influence (directed or undirected). Then it seems that there are general methods for inference and learning on any kind of graph one might pose.
For simple graphs (and simple is something one might want when modelling) the methods should be fairly effective.
Unfortunately, the biggest book on the subject that I know (Koller & Friedman) isn't accessible. Koller's course is also not that accessible.
I am puzzled how they managed to release Prophet under BSD with such a dependency.
The wikipediatrend R package relies on http://stats.grok.se/, which in turn relies on https://dumps.wikimedia.org/other/pagecounts-raw/ which has been deprecated.
The new dump is located at https://dumps.wikimedia.org/other/pageviews/
Data is available in hourly intervals.
* pageviews-20170227-050000
en Peyton_Manning 58 0
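For anyone who wants to read these dump lines directly, here is a minimal parsing sketch. It assumes each line has four space-separated fields (project code, page title, view count, response size), which matches the sample line above. The names `PageView`, `parse_line` and `parse_dump_hour` are made up for illustration.

```python
from datetime import datetime
from typing import NamedTuple


class PageView(NamedTuple):
    project: str      # e.g. "en" for English Wikipedia
    title: str        # page title with underscores instead of spaces
    views: int        # number of views in that hour
    bytes_sent: int   # assumed meaning of the last field (0 in the sample)


def parse_line(line: str) -> PageView:
    """Parse one line of an hourly pageviews dump (assumed 4-field layout)."""
    project, title, views, size = line.rstrip("\n").split(" ")
    return PageView(project, title, int(views), int(size))


def parse_dump_hour(filename: str) -> datetime:
    """Extract the hour from a name like 'pageviews-20170227-050000'."""
    return datetime.strptime(filename, "pageviews-%Y%m%d-%H%M%S")


record = parse_line("en Peyton_Manning 58 0")
hour = parse_dump_hour("pageviews-20170227-050000")
print(hour.isoformat(), record.title, record.views)
```

Collecting `(hour, views)` pairs for a single title across many hourly files would give the kind of time series used in the Prophet example.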
[edit] There is a Wikipedia-hosted OSS viewer for these logs, e.g. Swedish crime stats: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.or...
An intro by Felipe Hoffa (Google): https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_...
https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.w...
What's up with Java? (Set "logarithmic scale" to improve the visualization)
https://tools.wmflabs.org/pageviews/?project=en.wikipedia.or...
Very cool though --- I would be interested to dive into the methods they've implemented sometime in the near future!
Probably quite poorly (due to stocks appearing "random" at scale), especially for indexes, which are a sum of their parts.
On the other hand, this would probably be quite useful for things that have non-random trends (like the Global Energy Forecasting Competition: http://www.drhongtao.com/gefcom)
-Mandelbrot, in the foreword to Multifractals and 1/f Noise.
It's worth saying that Mandelbrot was apparently a large influence on E. Fama, who proposed the efficient market hypothesis in the first place.
That doesn't sound right. Let me clear that up for you. Since 1950:
S&P 500 Annual Price Change: 7.2%
S&P 500 Annual Div Dist: 3.6%
S&P 500 Annual Total Return: 11.0%
Annual Inflation: 3.8%
Annual Real Price Change: 3.3%
Annual Real Total Return: 7.0%
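As a rough check on the inflation-adjusted figures in this list: real rates are usually derived as (1 + nominal) / (1 + inflation) - 1 rather than by plain subtraction. A small sketch, where `real_rate` is a made-up helper name and the inputs are the rounded figures quoted above:

```python
def real_rate(nominal: float, inflation: float) -> float:
    """Inflation-adjusted rate via the Fisher relation."""
    return (1 + nominal) / (1 + inflation) - 1


# Inputs are the rounded annual figures from the list above.
price_real = real_rate(0.072, 0.038)   # nominal price change vs. inflation
total_real = real_rate(0.110, 0.038)   # nominal total return vs. inflation

print(f"real price change: {price_real:.1%}")
print(f"real total return: {total_real:.1%}")
```

With these rounded inputs the results come out at about 3.3% and 6.9%, close to the quoted real figures. The small gap is presumably rounding in the inputs.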
Buying the straight S&P 500 beats inflation by seven percent, on average, every year. You're welcome!
That assumes that the efficient-market hypothesis holds true, but it has yet to be thoroughly proven or disproven... (and funds like Medallion would strongly suggest otherwise for the medium term: https://www.bloomberg.com/news/articles/2016-11-21/how-renai...)
> df['y'] = np.log(df['y'])
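The quoted line log-transforms the target before fitting, so any forecast made on that scale has to be passed through `exp` to get back to the original units. A minimal sketch of the round trip, using plain lists and the standard library in place of the DataFrame column (the sample values are made up):

```python
import math


def log_transform(values):
    """Natural log of each value; every value must be > 0."""
    return [math.log(v) for v in values]


def inverse_transform(log_values):
    """Map values on the log scale back to the original scale."""
    return [math.exp(v) for v in log_values]


y = [120.0, 150.0, 90.0, 300.0]          # made-up positive observations
y_log = log_transform(y)                 # what the model would be fit on
restored = inverse_transform(y_log)      # apply the same step to forecasts

print([round(v, 6) for v in restored])
```

Series that contain zeros need some other treatment first, since `log(0)` is undefined.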
https://gist.github.com/llimllib/385230f38c3f9b70c3e46158e60...
It's a completely managed solution. No need to set up anything yourself. Just upload the data and predict next week's data today. There is a free trial, and if anyone here is looking for an extended trial, they can reach out to me.
https://papers.ssrn.com/sol3/papers2.cfm?abstract_id=2841267
All my attempts thus far have pointed me to something called Gaussian Processes, which I am still working on grokking.
Its major benefit is that it figures out the relationship to the target time series by itself, so you can just throw in all time series and see what comes out.
Language is Clojure, 20 kLOC, using Incanter and Encog. If anyone is interested in working for/with it, let me know. I am currently developing a REST API for it and plan to release it as open source once the major code smells are dealt with.
If you define an anomaly as something unexpected, then yes. In this case, if reality differs significantly from the forecast (= expectation), then it is an anomaly (according to our definition). In the numeric univariate case, there can be positive anomalies, where you get more than expected, and negative anomalies, where you get less than expected.
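A minimal sketch of that definition, assuming the forecaster supplies a lower and an upper bound for each point (for example an uncertainty interval). The function name and sample numbers are made up for illustration:

```python
def label_anomalies(actuals, lowers, uppers):
    """Label each point 'positive', 'negative', or None (as expected)."""
    labels = []
    for actual, lower, upper in zip(actuals, lowers, uppers):
        if actual > upper:
            labels.append("positive")    # more than expected
        elif actual < lower:
            labels.append("negative")    # less than expected
        else:
            labels.append(None)          # within the forecast interval
    return labels


# Made-up observations and forecast bounds for four time steps.
actuals = [100, 140, 60, 105]
lowers = [90, 92, 94, 96]
uppers = [110, 112, 114, 116]

print(label_anomalies(actuals, lowers, uppers))
```

How wide the interval is decides what counts as "significantly" different, so that choice is where most of the tuning happens.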
That being said this claim in point #1 baffles me:
> Prophet makes it much more straightforward to create a reasonable, accurate forecast. The forecast package includes many different forecasting techniques (ARIMA, exponential smoothing, etc), each with their own strengths, weaknesses, and tuning parameters. We have found that choosing the wrong model or parameters can often yield poor results, and it is unlikely that even experienced analysts can choose the correct model and parameters efficiently given this array of choices.
The forecast package contains an auto.arima function which does full parameter optimization using AIC, which is just as hands-free as is claimed of Prophet. I have been using it commercially and successfully for years now. Maybe Prophet produces better models (I'll definitely take a look myself), but to claim that it's not possible to get good results without experience seems a bit disingenuous.
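To illustrate the general idea of picking model order by information criterion, here is a heavily simplified sketch. It is not auto.arima's algorithm: it only considers pure AR(p) models (no differencing, no MA terms), fits them by Yule-Walker with the Levinson-Durbin recursion, and uses an approximate Gaussian AIC of n*log(sigma^2) + 2*(p + 1). All function names are made up.

```python
import math
import random


def autocovariances(x, max_lag):
    """Biased sample autocovariances r[0..max_lag] of a series."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    return [sum(d[t] * d[t - k] for t in range(k, n)) / n
            for k in range(max_lag + 1)]


def select_ar_order(x, max_order=5):
    """Return (best_order, aic_list) over AR(0)..AR(max_order)."""
    n = len(x)
    r = autocovariances(x, max_order)
    sigma2 = r[0]                      # innovation variance of AR(0)
    phi = []                           # AR coefficients of the current order
    aic = [n * math.log(sigma2) + 2]   # AR(0): one parameter (the variance)
    for p in range(1, max_order + 1):
        # Levinson-Durbin step: reflection coefficient for order p.
        k = (r[p] - sum(phi[j] * r[p - 1 - j] for j in range(p - 1))) / sigma2
        phi = [phi[j] - k * phi[p - 2 - j] for j in range(p - 1)] + [k]
        sigma2 *= 1 - k * k
        aic.append(n * math.log(sigma2) + 2 * (p + 1))
    best = min(range(len(aic)), key=aic.__getitem__)
    return best, aic


# Simulated AR(1) series with coefficient 0.7 (fixed seed for repeatability).
random.seed(0)
series = [0.0]
for _ in range(499):
    series.append(0.7 * series[-1] + random.gauss(0, 1))

best_order, aic_values = select_ar_order(series)
print("selected AR order:", best_order)
```

The point is only that the order search can be automated end to end. A real tool searches a much larger model space.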
As an aside, anybody interested in a great introductory book on time series forecasting should check out Rob Hyndman's book which is freely available online. https://www.otexts.org/fpp
True. FWIW, I worked on the same project at Twitter 4 years back. The Facebook folks call it capacity planning at scale; we called it capacity utilization modeling. The goal was the same: there are all these "jobs", tens of thousands of programs running on distributed clusters, hogging CPU, memory and disk. Can we look at a snapshot in time of the jobs' usage, and then forecast what the next quarter's usage would be? If you get these forecasts right (within reasonable error bounds), the folks making purchasing decisions (how many machines to lease for the datacenters next quarter) can save a bundle.
From an engineering POV, every job would need to log its p95 and p99 CPU usage, memory stats, disk stats... Since Twitter was running some 50k programs back then (2013ish) on these Mesos clusters, the underlying C++ API had hooks to obtain CPU and memory stats, even though the actual programs running were coded in Scala (mostly), Python/Ruby (bigger minority), or C/Java/R/Perl (smaller minority). There's an interesting Quora discussion on why Mesos was in C++ while the rest of Twitter is Scalaland... mostly because you can't do this sort of CPU/memory/disk profiling in JVM-land as well as you can in C++.
OK, so you now have all these CPU stats. What do you do with them? Before you get to that, you have the usual engineering hassles: how often should you obtain the CPU stats? Where would you store them?
So at Twitter we got these stats every minute (serious overkill :) and stored them in a monstrous JSON (horrible idea given 50,000 programs * number of minutes in a day * all the different stats you were storing :))
So every day I'd get a gigantic 20 GB JSON from infra, then I'd have to do the modeling.
In those days, you couldn't find a single Scala JSON parser that would load that gigantic JSON without choking. We tried them all. Finally we settled on Gson, Google's JSON parser written in Java, which handled these gigantic JSONs with no hiccups.
Before you get to the math, you would have to parse the JSON and build a data structure that would store these (x,t) tuples in memory. You had 50k programs, so each program would get a model, and each model originated from a shitton of (x,t) tuples. With t being minutely and some of these programs having run for years, you had very large datasets.
The math was relatively straightforward... I used so-called "LAD" (least absolute deviations) regression, as opposed to simple OLS, because least squares wasn't quite predictive for that use case. Building the LAD modeling thing in Scala was somewhat interesting... Most of the work was done by the Apache Commons Math library; I mostly had to ensure the edge cases wouldn't throw you off, because LAD admits multiple solutions for the same dataset. It's not like OLS, where you give it a dataset and it finds a unique best-fit line. Here you'd have many lines sitting in an array, depending on how long you let the Simplex solver run. Then came the problem of visualizing these 50,000 piecewise line models using JavaScript, heh heh. The front-end guys had a ball with the models I spit out.
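To show how LAD and OLS differ on data with an outlier, here is a small sketch. It is not the Simplex-based approach described above: it brute-forces the LAD line by relying on the fact that, for a straight-line fit, some optimal LAD line passes through two of the data points. That is fine for tiny examples but far too slow for real datasets. The names and data are made up.

```python
from itertools import combinations


def lad_line(points):
    """Fit y = a + b*x minimizing sum |residual|, by trying all point pairs."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue                     # vertical line, not a function of x
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        loss = sum(abs(y - a - b * x) for x, y in points)
        if best is None or loss < best[0]:
            best = (loss, a, b)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]


def ols_line(points):
    """Ordinary least squares fit of y = a + b*x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b


# Points on y = 1 + 2x, except one large outlier at x = 9.
data = [(x, 1 + 2 * x) for x in range(9)] + [(9, 100)]

print("LAD intercept, slope:", lad_line(data))
print("OLS intercept, slope:", ols_line(data))
```

The LAD fit stays on the line through the nine well-behaved points, while the OLS slope is pulled sharply toward the outlier. That is the usual reason to prefer LAD for spiky usage data.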
If someone's doing this from scratch these days, NNs would be your best bet. Regime changes are a big part of that.
I'm guessing you already know about this based on the way you described the situation, but the Hyndman forecasting book [1] discusses at length various models for multivariate forecasting. It's loaded with code and samples in R.
We're planning to add forecasting to our SaaS analytics product (https://chartmogul.com) later this year. I'm going to look and see if we can use this in our product now.
I may do a test implementation into Airbnb Superset actually to see how it flies.
I see this being applicable to analysts when deciding on a company's creditworthiness.
I would think if you're already assigning credit ratings, you can set that as your dependent variable and use things like company revenue, number of employees, age of company, etc. as your independent variables. You can use a number of different models to assess credit worthiness based on this data. Evaluate several to determine the most accurate.
A few days ago I was asked to do some forecasting with a daily revenue series for a client. Due to the nature of her business, the series was really tricky, with weekdays and months/semesters having specific effects on the data. Like many, I use Hyndman's forecast package, but I threw this data at Prophet and it delivered a nice plot with the (correct) overall trend and seasonalities. Very cool, and easy to get something working.
I've been using CausalImpact by Google [0] for months. This seems pretty straightforward.
Between this and Stan I think my free time for the next week is gone.
You talk about having to choose the best algorithm, but it seems like Prophet is just another algorithm to choose from. Is there some kind of built-in grid search, or are you just stating that results from your AM have been more accurate than ARIMA?
Some feedback: it'd be nice to see you actually quantify how accurate Prophet's forecasts are on the landing page for the project. In the Wikipedia page view example, you go as far as showing a Prophet forecast, but it'd be nice to have you take it one step further and quantify its performance. Maybe withhold some of the data you use to fit the model and see how it performs on that out of sample data. It's nice that you show qualitatively that it captures seasonality, but you make bold claims about its accuracy and the data to back those claims up is conspicuously absent. Related, it might be worth benchmarking its performance against existing automated forecasting tools.
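The holdout check suggested here can be sketched in a few lines. Everything below is illustrative: `seasonal_naive` is a deliberately simple baseline standing in for whatever forecaster is being evaluated, MAPE is just one possible error metric, and the data is made up.

```python
def seasonal_naive(train, horizon, period=7):
    """Baseline forecast: repeat the last full season of the training data."""
    start = len(train) - period
    return [train[start + (h % period)] for h in range(horizon)]


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actuals must be non-zero)."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)


def holdout_score(series, horizon, forecaster):
    """Withhold the last `horizon` points, forecast them, and score the result."""
    train, test = series[:-horizon], series[-horizon:]
    return mape(test, forecaster(train, horizon))


# Made-up daily series with a weekly pattern and a slight upward drift.
week = [100, 120, 130, 125, 140, 90, 80]
series = [v + 2 * w for w in range(6) for v in week]

print(f"out-of-sample MAPE: {holdout_score(series, 14, seasonal_naive):.2f}%")
```

Running the same `holdout_score` with several forecasters on the same split gives a like-for-like comparison, which also covers the benchmarking suggestion.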
I'll definitely be checking it out!
I got really excited for a second. Actually, I'm still pretty excited about this even if it was something else entirely.
See also the R vignette, which shows that the data is returned per-column which gives it a lot of flexibility if you only want certain values: https://cran.r-project.org/web/packages/prophet/vignettes/qu...
Has anyone managed to get this working on Windows with Jupyter (Anaconda build)? I'm struggling with PyStan errors. Any guidance welcome.