Nik Cubrilovic recently demonstrated how information leakage can be used to trade stocks, e.g. by estimating the growth rate of the Adobe Creative Cloud customer base from assigned customer ID numbers before Adobe announced the figures itself:
http://www.itnews.com.au/news/how-an-aussie-hacker-used-info...
It was more subtle than inside information. He implied that he could actually influence the outcome and hurried me on from that point.
Mind blown.
Sure, there are a few that do illegal things (and inevitably get caught since there is so much monitoring going on).
They have a free, community-curated data set of ~3200 stocks.
I work with both Quandl and Zacks pretty frequently. Let me know if you're interested in buying large amounts of data from Zacks, and I can perhaps get you a discount from the listed prices on the Quandl website.
For example, "I split the data set into 5 random segments and then trained a model on 4 of the 5 segments and then tested it on 5th." Such data is serially correlated (it's not good old iid) so already it looks like you have poisoned the test set with information from the training set.
The hard part is not "feature engineering" or "ensemble methods"; the hard part is controlling the entropy that you feed these things, because they are voracious monsters and will absolutely eat all of it.
Kind of. If it were that simple, making money off of an autoregressive model would be trivial -> everyone would do it -> serial correlation would disappear.
I agree with your observation that figuring out what to feed the beast is one of the bigger challenges though. Case in point: train a mean reversion model on the last seven years of S&P data to buy dips and train a momentum model to buy higher highs. That equity curve would look very encouraging. Do it on a fifteen-year basis, and not so much. Now the question becomes: how long a lookback do you use when training your models? Chopping up data at random will mux out useful correlations. Subsetting into periods leads to poorly generalized models. Not fun.
http://52.11.211.67/recommend/app/hidden_connections?query=h...
http://52.11.211.67/recommend/historical-trends/index-contra...
The biggest problem with things like this, which almost nobody talks about in the context of investing, is publication bias.
100 people try to develop a profitable trading algorithm. 1 comes up with one that looks great on back-tests at the 1% significance level (in other words, exactly what you'd expect from random chance alone over 100 trials).
That person writes an article/pitch/business plan based on their algorithm. You never see results from the 99 who failed.
Going forward, the successful algorithm is no more likely to work than the failed 99, but from the perspective of the general public it sure looks like a winner!
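To put a rough number on the scenario above, here is a minimal sketch. It assumes the 100 back-tests are independent and that none of the strategies has any real edge, so each one passes a 1% test purely by chance; the function names are made up for illustration.

```python
import random


def chance_of_false_winner(n_trials: int, alpha: float) -> float:
    """Probability that at least one of n_trials no-edge strategies
    passes a significance test at level alpha by luck alone."""
    return 1.0 - (1.0 - alpha) ** n_trials


def simulate_false_winners(n_trials: int, alpha: float,
                           n_runs: int, seed: int = 0) -> float:
    """Monte Carlo check of the formula above.

    Assumption: with no real edge, each strategy's p-value is
    uniformly distributed on [0, 1), independently of the others.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        if any(rng.random() < alpha for _ in range(n_trials)):
            hits += 1
    return hits / n_runs


if __name__ == "__main__":
    # Roughly 0.634: more often than not, somebody gets a "winner".
    print(chance_of_false_winner(100, 0.01))
    print(simulate_false_winners(100, 0.01, n_runs=2000))
```

Under these assumptions, a "1% significant" back-test somewhere among 100 attempts is the expected outcome rather than evidence of skill.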
It's much worse than this with machine learning approaches. Imagine a million people trying to find a profitable algo, all on your laptop, and you are choosing the best one out of all of those.
If you are used to pen-and-paper trading strategies, or even Excel spreadsheets, machine learning is on a completely different level. And how it works will probably be unintelligible to anyone. I don't even see how someone can write a business plan based on this.
I'd welcome any thoughts, in part because legally beating the market is possible; I just don't get the SEC & OPSEC aspect.
If they do audit you, how will they discover how you are generating your trading decisions? Their remit is to make sure you aren't doing something illegal. There's no reason they would understand what you were doing in anything other than a superficial way.
Also, something can be profitable, and obviously so, without being easily reproducible. For instance there are firms that do simple footrace arbitrage on the same security between different exchanges. Not hard to understand, but you still can't do it. There's a whole spectrum of strategies that are on a frontier on the map of easy-to-understand vs easy-to-implement.
Besides all that, I think even if you were to learn about a way to beat the market, the way you found out might lead you to be very skeptical of whatever was proposed. If a guy is selling it on a website, you will probably not believe him, right? And if he showed you backtests that worked, you would suspect they were generated from a random generator of some sort. And if he then shows you the math, you would almost certainly find fault with it. Why did he do this or that transformation on the data? Must be random...
The difficulty is gaining confidence in your algo and determining when to move from paper trading to actual trading.
You run into counter-intuitive things while training a neural net, for example. You'd think more training data would be good, but when training neural nets, you actually want to use as little data as possible while still creating an ideal ROC curve.
An algorithm that works would also include the ability to limit losses. An algorithm might be correct 9 out of 10 times, but may lose more in a single transaction than what it earned in those 9 winning transactions.
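The arithmetic behind that point, as a small sketch. The win and loss sizes are invented for illustration; only the 9-out-of-10 win rate comes from the comment.

```python
def expected_profit_per_trade(win_rate: float, avg_win: float,
                              avg_loss: float) -> float:
    """Expected profit of one trade; avg_loss is a positive amount."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


def break_even_loss(win_rate: float, avg_win: float) -> float:
    """Largest average loss the strategy can absorb before its
    expected profit per trade turns negative."""
    return win_rate * avg_win / (1.0 - win_rate)


if __name__ == "__main__":
    # Hypothetical numbers: win $1 nine times out of ten, lose $10 once.
    print(expected_profit_per_trade(0.9, 1.0, 10.0))  # about -0.1
    # With a 90% win rate the single loss must stay below about 9x the win.
    print(break_even_loss(0.9, 1.0))
```

So a 90% hit rate only pays if the occasional loss is capped below roughly nine times the typical win, which is why the loss limit belongs in the algorithm itself.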
Remember the maxim: past performance is not a guarantee of future results. You can develop strategies based on past data that will beat the market, but the nature of markets is to adapt to kill your edge. Markets adapt constantly and your edge stops working at an unknown point in time. It's unknowable when that WILL happen because past data can't show that.
The other reason is transaction costs, which in gambling are called the vig. Let's say I'm betting NFL games. NFL home teams win 51% of games. Even a flipped coin, I've read, comes up heads 50.1% of the time. These are profitable systems. But you're paying the bookie 10% on each loss. You could find someone to bet you on coin tosses and bet heads each time. You have a positive expected return, although you need a huge number of flips to make money!
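A quick sketch of that arithmetic. It assumes "10% on each loss" means the common convention of risking 1.10 units to win 1.00; that reading, and the function names, are assumptions rather than something stated above.

```python
def break_even_win_rate(vig: float) -> float:
    """Win rate needed to break even when a win pays 1 unit and a
    loss costs (1 + vig) units."""
    return (1.0 + vig) / (2.0 + vig)


def expected_value_per_bet(win_prob: float, vig: float) -> float:
    """Expected profit per bet, in units of the amount won."""
    return win_prob * 1.0 - (1.0 - win_prob) * (1.0 + vig)


if __name__ == "__main__":
    print(break_even_win_rate(0.10))            # about 0.5238
    print(expected_value_per_bet(0.51, 0.10))   # about -0.029: vig eats the edge
    print(expected_value_per_bet(0.501, 0.0))   # about +0.002: no vig, tiny edge
```

Under that assumption a 51% edge is a losing bet against a bookie, while the 50.1% coin with no vig is positive but so thin that it takes a very large number of flips to show up.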
In trading, of course, the costs are commissions. Why do you think there was a rise in HFT? The strategies are consistently profitable (besides the flashing/manipulation tactics). HFT is ONLY profitable because of extremely low commission costs that are not available to the retail (or even semi-professional) trader.
Systems that can pull $0.0001 out of every share traded overall on high volume can be (pretty easily) created, but you can't trade them profitably. In fact, you will find commissions (semi-pros who pay about $3 per 1000 shares) priced right at the point of an edge you could be expected to develop.
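Plugging the two figures from the comment into a one-line calculation makes the gap concrete. This sketch assumes the $3 per 1000 shares is the total cost attributable to each share traded; how the commission is actually charged (per side or per round trip) is not stated, and the function name is hypothetical.

```python
def net_edge_per_share(gross_edge_per_share: float,
                       commission_per_1000_shares: float) -> float:
    """Edge left per share after commissions, in dollars."""
    return gross_edge_per_share - commission_per_1000_shares / 1000.0


if __name__ == "__main__":
    # $0.0001 gross edge against $3 per 1000 shares ($0.003 per share).
    net = net_edge_per_share(0.0001, 3.0)
    print(f"{net:+.4f}")  # → -0.0029
```

At those rates the commission is about thirty times the gross edge, so the system loses money on every share even though the signal itself is "right".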
If you are a low volume, small time trader, the market isn't going to move as quickly to adapt to you. If you have $100,000, for example, and return 30% a year, you aren't on anyone's radar.
Data mining is useful because it gives you things that are predictive that you might not have considered at first, but make sense after. This is mainly due to combinatorial explosion in the potential number of formulas.
You generally have a vague idea of what might be predictive, e.g. cheapness vs earnings and cash flow, but there's a huge number of ways that might show up in the data, and there's a huge number of ways it might hide in the data.
So for instance an old school analyst might do a ranking of price/earnings as well as cash flow, or whatever bespoke formula desired.
A data mining approach could take all the fundamentals and generate formulas mixing the variables, yielding a number of formulas that seem to be effective. Out of those, you'd look at them and decide whether they capture some thesis (low P/E, upward trend in earnings). Then you'd look at whether the formula is sensitive to small tweaks. For instance, if you regressed the last 6 earnings and it had phenomenal performance, but with 5 or 7 it didn't, you'd probably conclude it's some sort of random result.
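One way to mechanise the "small tweaks" check is sketched below. It assumes you already have one back-test score per parameter value (for example a Sharpe ratio per regression window); the scores, the 50% threshold and the function name are all made up for illustration.

```python
from typing import Dict


def is_robust(scores: Dict[int, float], best: int,
              min_ratio: float = 0.5) -> bool:
    """Return True if the immediate neighbours of `best` keep at least
    `min_ratio` of its back-test score.

    `scores` maps a parameter value (e.g. number of earnings periods
    regressed) to a back-test score where higher is better.
    """
    neighbours = [scores[k] for k in (best - 1, best + 1) if k in scores]
    if not neighbours or scores[best] <= 0:
        return False
    return all(s >= min_ratio * scores[best] for s in neighbours)


if __name__ == "__main__":
    # Hypothetical scores: a lone spike at 6 periods looks like luck...
    spiky = {5: 0.1, 6: 1.8, 7: 0.2}
    # ...while a plateau around 6 periods is more believable.
    plateau = {5: 1.1, 6: 1.3, 7: 1.2}
    print(is_robust(spiky, best=6))    # False
    print(is_robust(plateau, best=6))  # True
```

The threshold is a judgment call; the point is only that a result surviving on a single parameter value deserves suspicion.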
There are funds that take the mass approach to an extreme. They have huge databases, with a genetic algorithm that generates expression trees, and a battery of stats (including backtests) to decide what works. They end up with many thousands of strategies that are a great deal more effective than your standard one-trick pony fund.
You can use ML to make money on encrypted stock data for free. Think Kaggle but the winning models are used to trade.
I'm Kris, the guy who wrote the article that started this thread. Thanks to all who have read my article and taken the time to comment. In the context of my motivation for starting my blog, it means a lot. I'm an engineer who became interested in quantitative finance and machine learning a few years ago. I learned how to code and apply my maths and stats knowledge to finance independently - no formal training whatsoever. This meant that for a long time I was conducting research and developing trading systems in a vacuum; I had no one to bounce ideas off or learn from. So I started writing about what I was doing in the hopes of getting some feedback. So thank you all for providing some. The insights were immensely valuable and I learned a lot.
I thought it would be useful to respond to some of the comments.
mathgenius brought up the extremely valid point that regular k-fold cross validation in a time series context doesn't make sense since the data is autocorrelated, not iid. I no longer use this approach for time series data, instead favoring Rob Hyndman's time series cross validation approach, also known as forward chaining. I believe this approach is the best representation of a real trading environment. The issue becomes deciding how large the rolling window of training data should be - older data may be obsolete, but excluding too much history can lead to not enough training instances.
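For anyone unfamiliar with forward chaining, here is a minimal sketch of how the splits can be generated. It is not taken from any particular library; the function and parameter names are invented for illustration, and `max_train` stands in for the rolling-window size discussed above.

```python
from typing import Iterator, List, Optional, Tuple


def forward_chaining_splits(
    n_samples: int,
    n_test: int,
    min_train: int,
    max_train: Optional[int] = None,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs in time order.

    Every test block lies strictly after its training block, so no
    future observation leaks into training. With max_train=None the
    training window expands; otherwise it rolls forward and keeps at
    most max_train of the most recent observations.
    """
    test_start = min_train
    while test_start + n_test <= n_samples:
        train_start = 0 if max_train is None else max(0, test_start - max_train)
        train = list(range(train_start, test_start))
        test = list(range(test_start, test_start + n_test))
        yield train, test
        test_start += n_test


if __name__ == "__main__":
    for train, test in forward_chaining_splits(10, n_test=2, min_train=4):
        print(train, test)
    # [0, 1, 2, 3] [4, 5]
    # [0, 1, 2, 3, 4, 5] [6, 7]
    # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

Passing `max_train=4` in the example would instead train on `[0..3]`, `[2..5]` and `[4..7]`, which is the "how much history to keep" trade-off in concrete form.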
dpweb raises a good point too, namely that just because your model performed well on past data, even if that data was out of sample, there is no guarantee that the future will be sufficiently like the past, meaning that your model may well become useless at some point in time (possibly very quickly). This is a valid point, but no reason to abandon the markets. It does however require that any algorithm's live performance be objectively monitored such that the level of deviation from expected performance can be statistically quantified. Once a pre-determined confidence level in the model's obsolescence is reached, it should be removed from the portfolio.
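As a rough illustration of what "statistically quantified" could look like, the sketch below runs a one-sided z-test of live returns against the mean and standard deviation expected from the back-test. It assumes returns are independent and approximately normal, which is a strong simplification for market data, and the names and thresholds are hypothetical.

```python
import math
from statistics import NormalDist, mean
from typing import Sequence


def underperformance_p_value(live_returns: Sequence[float],
                             expected_mean: float,
                             expected_std: float) -> float:
    """Probability of a live mean return this low or lower if the
    strategy still behaved as it did in the back-test.

    Assumption: per-period returns are i.i.d. normal with the
    back-test's mean and standard deviation.
    """
    n = len(live_returns)
    z = (mean(live_returns) - expected_mean) / (expected_std / math.sqrt(n))
    return NormalDist().cdf(z)


def should_retire(live_returns: Sequence[float], expected_mean: float,
                  expected_std: float, threshold: float = 0.01) -> bool:
    """Flag the model once the p-value drops below a pre-set threshold."""
    return underperformance_p_value(live_returns, expected_mean,
                                    expected_std) < threshold


if __name__ == "__main__":
    # Hypothetical back-test: +0.1% mean daily return, 1% daily std.
    bad_run = [-0.01] * 25
    print(should_retire(bad_run, expected_mean=0.001, expected_std=0.01))  # True
```

The important design choice is fixing the threshold before going live, so the decision to pull a model is not made in the heat of a drawdown.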
mcbrown's comment about publication bias is a good one too. Even worse, I've personally developed hundreds of trading systems that I haven't published. Other bloggers and publishers have most likely also done the same. This form of selection bias is very likely rampant, and is especially applicable to models 'discovered' using machine learning techniques that may not be rooted in traditional economic or financial principles. The moral: absent some form of robust accounting for selection bias, view all of these types of systems with a healthy dose of skepticism, and the published performance as a theoretical upper limit to what could be achieved in practice.
hendzen's point about partnering with a fund or proprietary trading company rather than running your reliable, alpha generating strategy yourself is also a valid one. I have happily found this out for myself recently.
Also, lordnacho is spot on regarding his take on the utility of data mining in finance.
Thanks again for all the comments!