[1] https://class.coursera.org/ml-003/lecture [2] https://class.coursera.org/pgm-003/lecture
I've also found reading papers to be illuminating: often the first article about a given classifier is fairly well written and accessible if you have a strong background in math.
This is also a useful thing to keep in mind: http://scikit-learn.org/stable/tutorial/machine_learning_map...
Then on your prediction servers, ensure that you have a copy of every file listed in `all_model_filenames` in the same folder. You can then load the model with `model = joblib.load(all_model_filenames[0], mmap_mode='r')`. This memory-maps the model parameters of a large random forest, so all the Gunicorn, Celery or Storm worker processes running on the same server share the same memory pages. That makes it a very efficient way to deploy large models on RAM-constrained servers.
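For what it's worth, here is a minimal sketch of that dump-and-memmap round trip, assuming joblib and NumPy are installed. A dict holding a NumPy array stands in for a fitted random forest so the snippet runs without training anything; the file name `model.pkl` and the temp directory are just illustrative.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for a large fitted model: joblib memory-maps the NumPy arrays
# inside whatever object it loads, which is where a forest's parameters live.
model = {"name": "stand-in-model", "weights": np.arange(1_000_000, dtype=np.float64)}

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "model.pkl")

# dump() returns the list of files it wrote; ship all of them together.
# Memory mapping only works when the dump is uncompressed (the default).
all_model_filenames = joblib.dump(model, path)

# Each worker process would run this line. mmap_mode="r" maps the array data
# read-only from disk, so processes on the same host can share the same pages.
loaded = joblib.load(all_model_filenames[0], mmap_mode="r")

is_memmapped = isinstance(loaded["weights"], np.memmap)
```

The loaded arrays are read-only, which is fine for prediction since nothing needs to write to the model parameters at serving time.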
You can even use Docker to ship the model as part of a container and treat the model as binary software configuration.
If you are more familiar with Python than Java, like me, then that would be a more attractive option.
Yes, this variable is usually set after the fact. For instance, a given transaction may have led to a chargeback, or may have been made by a known fraudster. These models are usually trained on historical data, so we can know with some certainty which transactions are fraud.
It could be that a transaction was fraudulent but has not yet led to a chargeback (maybe the real cardholder hasn't yet seen their statement?), so there's still some uncertainty, but hopefully that approaches a minimum after some time passes.
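A rough sketch of how that uncertainty might be handled when building a training set: trust positive evidence right away, but only trust "no chargeback" once enough time has passed. The 90-day window and all field names here are my own assumptions, not anything from the article.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple

# Hypothetical maturity window: how long to wait before trusting "no chargeback".
LABEL_MATURITY = timedelta(days=90)

@dataclass
class Transaction:
    tx_id: str
    tx_date: date
    chargeback: bool        # a chargeback has been received so far
    known_fraudster: bool   # the account was later confirmed as fraudulent

def fraud_label(tx: Transaction, today: date) -> Optional[bool]:
    """Return True/False when the label is trustworthy, None when it is too early."""
    if tx.chargeback or tx.known_fraudster:
        return True   # positive evidence is usable immediately
    if today - tx.tx_date >= LABEL_MATURITY:
        return False  # enough time has passed with no chargeback
    return None       # still uncertain: leave it out of training

def training_set(txs: List[Transaction], today: date) -> List[Tuple[str, bool]]:
    """Keep only the transactions whose label has settled."""
    labelled = []
    for tx in txs:
        label = fraud_label(tx, today)
        if label is not None:
            labelled.append((tx.tx_id, label))
    return labelled

today = date(2014, 6, 1)
history = [
    Transaction("a", date(2014, 1, 5), chargeback=True, known_fraudster=False),
    Transaction("b", date(2014, 1, 20), chargeback=False, known_fraudster=False),
    Transaction("c", date(2014, 5, 25), chargeback=False, known_fraudster=False),
    Transaction("d", date(2014, 5, 28), chargeback=False, known_fraudster=True),
]
labels = training_set(history, today)
```

Transaction "c" is dropped because it is only a week old and clean so far, which is exactly the case where the real cardholder may not have seen their statement yet.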
Besides that, let me rephrase here.
At <west coast startup, essentially a copy of earlier successful European businesses such as HouseTrip, but with access to stupid amounts of US capital and therefore more profitable>, we <make superfluous, keyword-laden, unverifiable claim about ourselves in the future>. We <continue to integrate feel-good community pronouns>. We <here discuss something only tangential to our core business and assert that we have allocated at least two people to this area>. We <have nothing better to do than write it up, because quite frankly, there's nothing more pressing for us to work on in an already automated business of relative simplicity>.
OK, so that's a bit harsh, but there are some points toward reality in there. Sorry, as someone who used to run a complex travel industry business (3200+ hotel contracts... all of them in Chinese, all business by digital fax (no convenience here!), constant rate changes, in 6 human languages and multiple currencies with a real time call center) and who co-pitched for VC with HouseTrip's management in London in 2009, I just have very little respect for AirBNB.
Sure, but only high level. I would hazard a guess that fraud prevention is a lot more complex for us at https://www.kraken.com/ ... dealing with many cryptographic currencies and conventional currencies spread across probably over a hundred legal jurisdictions is not easy. We likely have to consider far more factors than these guys. We have recently added two more quants from programs highly regarded in the conventional finance industry to our team, plus we have over seven figures of investment in legal and training programs in the area. We also use R.
Basically, it's inputs (behavior), processing (metric extraction, risk model), output (boolean choices, statistical cluster membership, etc.)... where a series of such outputs may feed into a hierarchy of scores for different elements within a system. Some applications may be real time, others after-the-fact.
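To make the inputs, processing, output shape concrete, here is a toy sketch. Everything in it (the metric names, the weights, the 0.7 threshold, rolling session scores up to an account score with `max`) is made up for illustration and is not how any particular company does it.

```python
from statistics import mean
from typing import Dict, Iterable, List

def extract_metrics(events: List[dict]) -> Dict[str, float]:
    """Processing, step 1: turn raw behavior into numeric metrics."""
    amounts = [e["amount"] for e in events]
    return {
        "n_events": float(len(events)),
        "max_amount": max(amounts),
        "share_new_device": mean(1.0 if e["new_device"] else 0.0 for e in events),
    }

def risk_score(metrics: Dict[str, float]) -> float:
    """Processing, step 2: a toy risk model with made-up weights, clipped to at most 1."""
    raw = 0.002 * metrics["max_amount"] + 0.5 * metrics["share_new_device"]
    return min(1.0, raw)

def decide(score: float, threshold: float = 0.7) -> bool:
    """Output: a boolean choice (True means hold for review)."""
    return score >= threshold

def account_score(session_scores: Iterable[float]) -> float:
    """Next level of the hierarchy: roll session scores up to the account."""
    return max(session_scores)

# Inputs: behavior grouped by session for one account.
sessions = {
    "s1": [{"amount": 40.0, "new_device": False}, {"amount": 60.0, "new_device": False}],
    "s2": [{"amount": 250.0, "new_device": True}],
}
session_scores = {sid: risk_score(extract_metrics(ev)) for sid, ev in sessions.items()}
flags = {sid: decide(score) for sid, score in session_scores.items()}
acct = account_score(session_scores.values())
```

The same functions could run inline on a live request or in a batch job over yesterday's data, which is the real time versus after-the-fact split.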
At a high level, which is mostly where my involvement is in hiring people, fraud prevention is not dissimilar to spam or intrusion detection: you can basically use a combined, constantly tweaked set of inputs to a Bayesian-style scoring algorithm. Inputs include both static rules and statistical anomaly detection.
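As a rough illustration of the Bayesian-style scoring, here is a naive Bayes combination in log-odds form, the same trick classic spam filters use. The signal names, likelihoods and prior are invented for the example; a real system would re-estimate them continually from labelled history.

```python
import math
from typing import Dict

# Hypothetical per-signal likelihoods: (P(signal | fraud), P(signal | legit)).
LIKELIHOODS = {
    "country_mismatch": (0.40, 0.05),  # static rule
    "disposable_email": (0.30, 0.02),  # static rule
    "amount_anomaly": (0.50, 0.10),    # statistical anomaly flag
}
PRIOR_FRAUD = 0.01  # assumed base rate of fraud

def amount_is_anomalous(amount: float, mean: float, std: float, z_cut: float = 3.0) -> bool:
    """Simple anomaly detector: flag amounts more than z_cut deviations from the mean."""
    return std > 0 and abs(amount - mean) / std > z_cut

def fraud_probability(signals: Dict[str, bool]) -> float:
    """Combine signals under a naive independence assumption."""
    log_odds = math.log(PRIOR_FRAUD / (1 - PRIOR_FRAUD))
    for name, fired in signals.items():
        p_fraud, p_legit = LIKELIHOODS[name]
        if fired:
            log_odds += math.log(p_fraud / p_legit)
        else:
            log_odds += math.log((1 - p_fraud) / (1 - p_legit))
    return 1 / (1 + math.exp(-log_odds))

noisy = fraud_probability({
    "country_mismatch": True,
    "disposable_email": True,
    "amount_anomaly": amount_is_anomalous(900.0, mean=100.0, std=50.0),
})
quiet = fraud_probability({
    "country_mismatch": False,
    "disposable_email": False,
    "amount_anomaly": amount_is_anomalous(120.0, mean=100.0, std=50.0),
})
```

The independence assumption is rarely true in practice, so the output is better read as a ranking score than as a calibrated probability.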
Yes, PMML isn't perfect (being kind), but it continues to be extended and is the one shared lingua franca we have across model creation systems, short of (sigh) SAS code and "recode the model in generic C", both of which I see too often.
I suspect in the future we'll see a "standard" architecture: pipelines with multiple parallel feeds and runtime engines feeding into ensembles, each of which allows various model types in "native" format (sklearn and other pythonics, R, Java, etc.). That would be interesting, instead of having to cram everything into PMML. Just a thought.
So as it turns out I spend my days building the very product you're describing (yhathq.com; a REST API-ifier for R and Python). The scikit-learn community alone are a wonderful group who do a hell of a job. It's kinda crazy that most products won't let you use that awesomeness and instead choose to build out their own machine learning libraries to work within their system.
This article got passed around the office this morning and it seems to encompass the general theme of most ML tools. They empower you to do cool things with machine learning/general data analysis, but at the expense of being able to use the libraries that most people use to do machine learning/general data analysis. Don't know if I'd consider that poor design, but yeah, it's definitely a tradeoff.
Hmm, maybe I should be reaching out to Airbnb's data science team?
So, what's left? Collusion between buyer and space provider -- in all likelihood, they are one and the same, or identities have been stolen. For example, I list my condo on AirBnB for $100/night. Someone books it for the weekend, and then doesn't show up. AirBnB owes me $200 -- after all, I gave up other options to profit from its use. An honest buyer pays up. But, maybe the buyer is dishonest -- he used a stolen credit card, etc. In this case, AirBnB eats the loss and pays me as the space provider. Now, wouldn't it be convenient if I were also the buyer? Cash from stolen credit cards, funneled through AirBnB (much akin to the way online poker sites were used to transfer stolen money via bad heads-up play). This would work until AirBnB noticed that my listing seems to have a suspicious propensity to attract fraudulent buyers. Then, they'll shut me down. So, I'll pop up elsewhere. After all, no need to actually have a space because no one I accept will ever show up!
I bet the usage patterns of the party/parties involved in this fraud are drastically different from those of legitimate market participants. Someone with a fraudulent listing could out himself by rejecting a bunch of legitimate AirBnB buyers, and this behavior would stand out as it's the opposite of the behavior expected of an honest seller. So, he must protect against this risk by making his listing unappealing (high price, bad photos/description, unpopular location, etc.). The behavior of users browsing AirBnB when viewing this property could identify its relative undesirability (few clicks, etc.), and price outliers could be identified by comparing similar offerings by date/location/type. The click stream of the "buyer" is likely the most revealing. Someone selecting an unappealing property without doing much comparison shopping likely isn't a legit buyer.
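Here is a sketch of the price-outlier idea, purely as conjecture about how one might do it. It groups listings by city and room type (date is left out for brevity) and flags prices far above their peers using a modified z-score based on the median absolute deviation. The field names and sample data are invented; the 0.6745 constant and 3.5 cutoff are the conventional choices for this score.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List

def price_outliers(listings: List[Dict], cutoff: float = 3.5) -> List[str]:
    """Flag listings priced far above comparable ones (same city and room type).

    Uses the median and the median absolute deviation (MAD), which are less
    distorted by the outliers themselves than a mean and standard deviation.
    """
    groups = defaultdict(list)
    for listing in listings:
        groups[(listing["city"], listing["room_type"])].append(listing)

    flagged = []
    for peers in groups.values():
        prices = [p["price"] for p in peers]
        med = median(prices)
        mad = median(abs(x - med) for x in prices)
        if mad == 0:
            continue  # no spread among peers, so nothing to compare against
        for p in peers:
            score = 0.6745 * (p["price"] - med) / mad
            if score > cutoff:  # only overpriced listings are of interest here
                flagged.append(p["id"])
    return flagged

listings = [
    {"id": "a", "city": "SF", "room_type": "entire_home", "price": 100},
    {"id": "b", "city": "SF", "room_type": "entire_home", "price": 110},
    {"id": "c", "city": "SF", "room_type": "entire_home", "price": 95},
    {"id": "d", "city": "SF", "room_type": "entire_home", "price": 120},
    {"id": "e", "city": "SF", "room_type": "entire_home", "price": 105},
    {"id": "f", "city": "SF", "room_type": "entire_home", "price": 900},
    {"id": "g", "city": "Austin", "room_type": "private_room", "price": 50},
    {"id": "h", "city": "Austin", "room_type": "private_room", "price": 55},
    {"id": "i", "city": "Austin", "room_type": "private_room", "price": 60},
]
flagged = price_outliers(listings)
```

A high price alone proves nothing, of course; it would just be one input alongside the click-stream signals.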
What other stuff might predict fraud? Vague descriptions might indicate a fraudulent listing. Most space providers love to tell buyers what's special about their offering. Could some scoring of a listing's prose prove a strong predictor? I've never listed with AirBnB. What do they do to verify listings? When I signed up as a buyer, they verified my identity. Could this serve multiple purposes? Certainly, I'd feel better listing my guest room if I know that AirBnB will know the identity of the guy who rented the room and then stabbed me at 3AM. But, in addition, does identifying market participants in strong ways help keep fraudsters from repeating their crimes by setting up multiple accounts? Obviously, newer market participants are riskier than established ones, especially compared with those who have interacted with known legit, long-time users. The social graph comes to the rescue here. Even astroturfing ought to show up as a small, disconnected graph unless legit users' identities are stolen.
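The "small, disconnected graph" idea can be sketched with plain connected components. This is only a guess at the approach: the edge list (who has transacted with or reviewed whom), the set of established users, and the size cutoff of 3 are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

def connected_components(edges: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Split an undirected interaction graph into its connected components."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen: Set[str] = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

def suspicious_clusters(edges, established: Set[str], max_size: int = 3) -> List[Set[str]]:
    """Small components that contain no established, known-legit user."""
    return [
        c for c in connected_components(edges)
        if len(c) <= max_size and not (c & established)
    ]

edges = [
    ("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "dave"),
    ("x1", "x2"), ("x2", "x3"),  # accounts that only ever interact with each other
]
clusters = suspicious_clusters(edges, established={"alice", "carol"})
```

Accounts with no interactions at all never appear in the edge list, so brand-new users would need separate handling.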
Of course, this comment is all just conjecture. Obviously, AirBnB can't tell the public about specific fraud methods or how they identify suspicious activity. However, I like the concreteness of considering actual fraud scenarios, so I decided to put forth some ideas for discussion.