
What do Data Scientists use to train models fast?

6 points · moridin007 · 10y ago · 7 comments

I'm training a machine learning model using an SVM in Python, and it took ages to run on my local machine (with only 10% of the data that I have). I'm getting an 80-90% correct prediction score on the same subject's data, so now I want to add in the rest of the data (11 more subjects).

I thought of offloading it to my EC2 instance, but I'm on a budget so I can't just take a 30-CPU instance. On top of that, the code always uses just 1 CPU at 100%, so I'm not sure how effective that would be.

What do you guys use to train these models?


facorreia · 10y ago

One approach is to convert the code to use parallelism. For an example of how to do it in Python using joblib see this article: http://blog.dominodatalab.com/simple-parallelization/

Even if you can't afford a 32-core instance, you might get to use 4 cores in your laptop.
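A minimal sketch of the idea, using the standard library's `concurrent.futures` in place of the joblib approach from the linked article. `evaluate_setting` is a hypothetical stand-in for one independent job (say, fitting one model per hyperparameter value), and the worker count of 4 is only an assumption about the laptop. This helps when there are several independent jobs to run; it does not make a single fit use more than one core.

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate_setting(c):
    """Hypothetical stand-in for one independent, CPU-bound job,
    e.g. fitting and scoring one model with hyperparameter c."""
    total = 0.0
    for i in range(1, 200_001):
        total += (c * i) % 7
    return c, total


def run_all(settings, workers=4):
    """Run evaluate_setting on every setting, in parallel when workers > 1."""
    if workers <= 1:
        # Serial fallback, handy for debugging.
        return [evaluate_setting(c) for c in settings]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(evaluate_setting, settings))


if __name__ == "__main__":
    for c, score in run_all([0.1, 1.0, 10.0, 100.0], workers=4):
        print(c, score)
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the script.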

syllogism · 10y ago

The problem's pretty obviously the Python implementation... Throwing more cores at it isn't really going to help.

Is it even easy to parallelise SVM training?

moridin007 (OP) · 10y ago

So the solution would be to move from Python to Go? Like, Python isn't that good for ML algorithms?

syllogism · 10y ago

Speed comes from two things: implementation and algorithm. Algorithmically, the way to learn quickly is to use some sort of stochastic gradient method, i.e. learn from examples one-by-one, as opposed to as a batch.
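To make "one-by-one" concrete, here is a rough sketch of stochastic gradient descent on the L2-regularized hinge loss, i.e. a linear SVM rather than the kernel SVM that libsvm fits. The function names, learning rate, regularization strength and epoch count are all illustrative assumptions, and the pure-Python lists are used only to keep the sketch dependency-free.

```python
import random


def train_linear_svm_sgd(examples, epochs=20, lr=0.1, reg=0.01, seed=0):
    """Fit a linear classifier by SGD on the L2-regularized hinge loss.

    examples: list of (features, label) pairs with label in {-1, +1}.
    Returns (weights, bias).
    """
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    b = 0.0
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:  # one example per update, not the whole batch
            x, y = examples[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # The regularization term shrinks the weights on every step.
            w = [wj * (1.0 - lr * reg) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: move toward classifying x correctly.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Each update touches a single example, so the cost of one pass grows linearly with the number of examples.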

As far as implementation goes, you need dense arrays. A native Python implementation will usually be lists of Python objects, which is very slow.
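A small illustration of the difference, assuming CPython (the exact byte counts are interpreter-specific). It uses the standard library's `array` module to stay dependency-free; in practice NumPy arrays usually play this role.

```python
import sys
from array import array

values = [float(i) for i in range(1000)]

# A list stores pointers to separately allocated float objects.
boxed = list(values)
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(v) for v in boxed)

# array("d") stores the raw 8-byte doubles back to back in one buffer.
dense = array("d", values)
dense_bytes = sys.getsizeof(dense)

print(boxed_bytes, dense_bytes)
```

Besides the memory overhead, every arithmetic operation on the list version has to unbox a Python object first, which is where much of the slowness comes from.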

If you just need an SVM implementation, libsvm is pretty good. I'm assuming you need a non-linear kernel. If you're using a linear kernel then there's not really a difference between SVM and MaxEnt (well, there is but not much).
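One way to see the linear-kernel remark, as a sketch: with labels $y_i \in \{-1, +1\}$ and the same L2 penalty on both (the penalty on MaxEnt is an assumption here), the two models minimize the same kind of objective and differ only in the per-example loss. The symbols $w$, $x_i$, $y_i$, $\lambda$ and $n$ are notation introduced for this illustration.

```latex
\begin{align}
\text{linear SVM:}\quad & \min_{w}\; \frac{\lambda}{2}\lVert w\rVert^2 + \frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\, w^\top x_i\bigr) \\
\text{MaxEnt (binary):}\quad & \min_{w}\; \frac{\lambda}{2}\lVert w\rVert^2 + \frac{1}{n}\sum_{i=1}^{n} \log\bigl(1 + e^{-y_i\, w^\top x_i}\bigr)
\end{align}
```

The hinge loss is exactly zero once the margin $y_i\, w^\top x_i$ reaches 1, while the logistic loss only approaches zero, which is the "not much" difference.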

If your data is very sparse then there aren't many general-purpose implementations that are any good. The scipy.sparse module has some key stuff implemented in pure Python, and doesn't interoperate properly with the rest of the PyData ecosystem. I had to implement my own sparse data structures, in Cython.
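For readers unfamiliar with sparse layouts, here is a toy sketch of the compressed sparse row (CSR) idea in plain Python. It is not the Cython structures mentioned above and not scipy's API; `to_csr` and `csr_matvec` are hypothetical helper names.

```python
from array import array


def to_csr(rows):
    """Convert a list of dense rows into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = array("d"), array("l"), array("l", [0])
    for row in rows:
        for j, v in enumerate(row):
            if v != 0.0:
                data.append(v)     # the nonzero value
                indices.append(j)  # its column
        indptr.append(len(data))   # where the next row starts in data
    return data, indices, indptr


def csr_matvec(data, indices, indptr, x):
    """Multiply the CSR matrix by a dense vector x, touching only nonzeros."""
    out = []
    for r in range(len(indptr) - 1):
        start, end = indptr[r], indptr[r + 1]
        out.append(sum(data[k] * x[indices[k]] for k in range(start, end)))
    return out
```

Only the nonzero entries are stored and visited, so the work scales with the number of nonzeros rather than with rows times columns.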

rajacombinator · 10y ago

How much data and how long are you talking about? If it fits in memory, then the slowness is likely due to other coding errors causing a bottleneck, not the SVM training. (Unless you wrote that as well.)
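One way to check where the time actually goes is to profile the whole script with the standard library's `cProfile`. In this sketch `load_features`, `train` and `pipeline` are hypothetical placeholders for the real loading and fitting code.

```python
import cProfile
import io
import pstats


def load_features(n):
    """Hypothetical preprocessing step."""
    return [[(i * j) % 13 for j in range(50)] for i in range(n)]


def train(rows):
    """Hypothetical stand-in for model fitting."""
    return sum(sum(r) for r in rows)


def pipeline():
    return train(load_features(2000))


profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# Collect the ten entries with the largest cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```

If most of the cumulative time sits in loading or feature extraction rather than in the fit call, a faster SVM implementation will not help much.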
