It's 2018 and there is still no open-source course for everything. Instead there are probably 10-30k universities with similar courses and professors who give the same lecture every year.
They are often paid by their countries to create and run those courses. In Germany, most of our universities are funded by all of us Germans anyway.
And what do you find online? Always the starter version, like Computer Science 101, or videos with bad audio or video, no proper exercises, no solution help, etc. Nothing. You have to go to different sites, which sometimes charge and sometimes don't.
And there are no local venues to meet up with people.
There should be a global initiative for free and open-access learning, sponsored and supported by companies and countries, built upon a core knowledge graph based on topics or 'snippets of knowledge'. For example: math -> add -> sub
Something like 'The Map of Mathematics' (https://www.youtube.com/watch?v=OmJ-4B-mS-Y)
And when you want to reach the globally accepted Math 101 level, you have to take specific topics/snippets.
And those snippets can then be filled by different people who each make a lecture for that topic, and you can choose whomever you like more or whoever is better at explaining it to you.
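As a sketch of that snippet/prerequisite idea in code (the topic names, graph shape, and lecture lists are all invented for illustration):

```python
# Hypothetical sketch of the "snippets of knowledge" graph: each topic
# lists its prerequisite snippets, and a level like "math-101" is just
# a node whose prerequisites are the required snippets.
prerequisites = {
    "math/add": [],
    "math/sub": ["math/add"],
    "math/mul": ["math/add"],
    "math-101": ["math/add", "math/sub", "math/mul"],
}

# Each snippet can carry several competing lectures; the learner picks
# whichever explainer suits them best.
lectures = {
    "math/add": ["prof-a-video", "prof-b-notes"],
    "math/sub": ["prof-c-video"],
}

def study_order(topic, prereqs=prerequisites, seen=None):
    """Depth-first walk returning the snippets to take, prerequisites first."""
    if seen is None:
        seen = []
    for dep in prereqs[topic]:
        study_order(dep, prereqs, seen)
    if topic not in seen:
        seen.append(topic)
    return seen
```

Calling `study_order("math-101")` would yield the snippets in a valid learning order, prerequisites before the topics that depend on them.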
What do I do instead? I ask around for the lecture scripts, because they are always behind a simple password-protected area, or I juggle links to pages of different universities that offer different courses for free as videos for their students, in sometimes/often bad quality and/or with bad video players, etc.
It sucks and this is stupid.
I like this idea and the framing of it a lot
The problem is that there are a lot of people all over the world getting paid a lot of money to work in post-secondary education who control the keys to accreditation and who have proven very resilient at resisting any optimization efforts.
“Just add a bunch of green up arrows and red down arrows, your manager will love it” was advice from a co-worker of mine. Sadly, she was right.
After you have given a university hundreds of thousands of dollars and a decade-plus of your life, you will then be ready to teach the next crop of students.
They mostly have degrees in economics or (I'm not sure what it's called in English, but it's a degree in societal administration), but they really ought to be data scientists because everything they do is based on huge SQL data sets.
We pay private contractors a lot of money to turn our data into cubes and manageable models because none of our analysts know how.
In 10 years I suspect anyone with that job title will need data science on their resume. Not just to manage the data, but also to start doing machine learning on it.
By comparison we have one network guy to run the network for 10,000 employees and 5,000 students, with a backup guy who knows everything the first guy does but works on something else, you know, in case the first guy quits.
Intro to coding is a fascinating one. From a marketing/business standpoint it makes sense, but 3-4 years ago it was extremely frustrating to see dozens of intro coding courses but practically nothing for intermediate programmers. Thankfully we're past that point for the most part.
Companies need a storefront, so front-end work will continue for a while, until it is super easy for any Joe to make a professional website.
Data Science looks promising too, because it is automating and solving problems that previously could not be done.
However, outside of tech the world still needs skilled blue collar workers. For example, I don't see carpenters being automated away any time soon. I hear some of those jobs pay better than tech work too.
It is also a possibility that Data Science is an easier topic to learn than Computer Science, and thus more popular.
Seriously though, I think people are drawn to data science out of a desire to create stories with some underlying support (data/evidence) in order to influence policy or business decisions.
Internally, Berkeley definitely isn't biased toward the intro-level stuff. Quite the opposite. But the most polished, rehearsed, mass-manufactured classes are certainly the gigantic intro-level ones everyone takes.
Is it that there are more courses in data science relative to other topics, or is there just more marketing around these classes? It costs Berkeley essentially nothing to pump out some press around the release of course materials in data science.
/Cognitive scientist
Deep learning is the current shiny toy but I suspect we'll find it isn't actually sufficient for a lot of things we want to do and we'll run into a wall a lot of people aren't expecting.
But I think I'd like some kind of formal, credentialed program that would build on my existing linear algebra + software skills (and address the weaknesses in my statistical understanding that I know are there based on how I felt about my grasp on the related material for even the classes I passed)... and maybe isn't quite as big an investment as a full-fledged master's degree.
Anybody have any suggestions?
It is a big commitment - 6 months full-time or one year part-time.
Depending on what stats you want to do, there are some pretty decent MOOCs. No one is going to claim that Daphne Koller's PGM course is weak in any way, for example [1].
[1] https://www.coursera.org/learn/probabilistic-graphical-model...
Which are the hard courses at Berkeley in CompSci using the site you linked?
If my memory holds, there’s a policy that class averages should be around 3.0-3.3 (B/B+).
I think for data science: Stats 134, CS 189, EE 127, EE 126 are most useful (in this order). Of course, in order to do well in CS 189 you need to have a good background in probability which can either be CS 70/Math 55 (if very well understood), or Stats 134, or EE 126.
This is bad in two ways:
1) The people taking these courses do not learn much for the effort and time they spend. It also gives them the illusion that they know enough because they took a course from a big university.
2) Industry is already so confused in hiring that they hire by name. So even if you take these courses and study in depth on your own, you can't get hired. Someone more qualified can't get hired just because they can't pay $100k to get a degree in machine learning from one of these big universities.
This is really a bad trend and we should spend time on real courses. Everyone knows that TV series are a waste of time, and these courses are like TV series. Stop watching them.
Even if you believe it's pointless, it's pretty clear it's not something everyone else "knows".
Seriously, just upload the lecture videos and put the homework and textbook online. Add a message board and you're golden.
Before Coursera, I was never able to finish anything on MIT OpenCourseWare. The free flow of information needs too much commitment on my end to be digestible.
It was thanks to the structure given by

> "registering" for the class and then following a regimented schedule.

that I managed to start and finish. Disclaimer: I discovered Coursera after grad school.
(There are two ways you can follow the course: the Certificate program is paid, but the Audit program is free.)
We have a course (right, a school application of stuff taught in school!) with two teachers, that is, two sections of the course, each section with its own teacher and its own students. At the end of the two courses, that is, the two sections, we want to compare the teachers. So we give the same test to all of the students from both courses.
Suppose one section had 20 students and the other one, 25 -- the point here is that we don't ask that the two numbers be equal; fine if they are equal, but we're not asking that they be.
So, there were 45 students. So, get a good random number generator and pick 20 students from the 45 and average their scores; also average the scores of the other 25; then take the difference of the two averages.
That was once. It was resampling. Now, do that 1000 times -- remember, we have a computer to do this for us. So, now we have 1000 differences. If you want, then, "live a little" and do that 2000 times. Or, for A students, do all the combinations of 45 students taken 20 at a time. Ah, heck, let's stick closer to being practical and stay with the 1000.
Now, presto, bingo, drum roll please, may I have the envelope with the actual difference in the actual averages of the actual scores in the two classes.
If that actual difference is out in a tail of the empirical distribution of the 1000 differences from the resamplings, then we have a choice to make:
(1) The two teachers did equally well but just by chance in the luck of the draw of the students one of the teachers seemed to do much better than the other one.
(2) The actual difference is so far out in the tail that we don't believe that the two teachers were equally good, reject the hypothesis that there was no difference, called the null hypothesis, and conclude that the teacher with the higher actual average was actually a better teacher.
Sure, it happened that the real reason was that one section of the course started at 7 AM and was over before the sun came up and the other section was at 11 AM when nearly everyone was awake. We like to f'get about such details! Or, sure, we might get criticized for a poorly controlled experiment.
This is also called a statistical hypothesis test or a two sample test. It is a distribution free test because we are making no assumptions about probability distributions of the student scores, etc. Since we are not assuming a probability distribution, we are not assuming a probability distribution with parameters and, thus, have a non-parametric test. Uh, an example of a probability distribution with parameters is the Gaussian where the parameters are mean and standard deviation.
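The whole recipe above fits in a few lines of code. Here's a sketch in Python: the 20/25 split and the 1000 resamples come from the description, but the score lists are invented for illustration.

```python
import random

def permutation_test(scores_a, scores_b, n_resamples=1000, seed=0):
    """Two-sample permutation (resampling) test on the difference of means.

    Pools the scores from both sections, repeatedly reshuffles the pool
    into groups of the original sizes, and records the difference of the
    group averages each time.  Returns the actual difference and the
    empirical two-sided p-value: the fraction of resampled differences
    at least as extreme as the actual one.
    """
    rng = random.Random(seed)
    pool = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    actual = sum(scores_a) / n_a - sum(scores_b) / len(scores_b)
    diffs = []
    for _ in range(n_resamples):
        rng.shuffle(pool)
        group_a, group_b = pool[:n_a], pool[n_a:]
        diffs.append(sum(group_a) / n_a - sum(group_b) / len(group_b))
    extreme = sum(1 for d in diffs if abs(d) >= abs(actual))
    return actual, extreme / n_resamples

# Two made-up sections: 20 students vs. 25 students.
a = [72, 85, 90, 66, 78, 88, 91, 74, 80, 83,
     69, 95, 77, 84, 81, 70, 86, 79, 92, 73]
b = [65, 70, 62, 75, 68, 71, 60, 74, 66, 69, 72, 63, 67,
     76, 61, 70, 64, 73, 68, 65, 71, 62, 69, 66, 74]
diff, p = permutation_test(a, b)
```

If `p` lands out in a tail (small), that's the "actual difference far out in the empirical distribution" case described above, and you'd reject the null hypothesis that the teachers did equally well.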
Such tests go way back in statistics for the social sciences, e.g., educational statistics.
In more recent years, leaders in resampling include B. Efron and P. Diaconis, recently both at Stanford.
Why teach such stuff? Well, some parts of computer science are tweaking old multivariate statistics, especially regression analysis, and calling the results machine learning and/or artificial intelligence, putting out a lot of hype and getting a lot of attention, publicity, students, and maybe consulting gigs. Also the newsies get another source of shocking headlines to get eyeballs for the ad revenue -- write about AI and the old "take over the world ploy"!
So, maybe now some profs of applied statistics, what for a while was called mathematical sciences, etc., or other profs of applied math want to get in on the party. Maybe.
What can be done with resampling tests? I don't know that there is any significant market for such: Long ago I generalized such things to a curious multidimensional case and published the results in Information Sciences. The work was a big improvement on what we were doing in AI at IBM's Watson lab for zero day monitoring of high end server farms and networks. Still, I doubt that my paper has ever been applied.
One of the best areas for applied statistics is the testing of medical drugs. Maybe at times resampling plans have been useful there.
I have a conjecture that resampling plans are closely tied to the now classic result in mathematical statistics that order statistics are always sufficient statistics. Sufficient statistics is cute stuff, from the Radon-Nikodym theorem in measure theory and, in particular, from a 1940s paper of Halmos and Savage, then both at the University of Chicago. Some of the interest is that sample mean and sample variance are sufficient for Gaussian distributed data, and that means that, given such data, you can always do just as well in statistics with only the sample mean and sample variance and otherwise just throw away the data. IIRC E. Dynkin, student of Kolmogorov and Gel'fand, long at Cornell, has a paper that this result for the Gaussian is in a sense unstable: If the distribution is only approximately Gaussian, then the sufficiency claim does not hold.
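A quick sketch of why that Gaussian claim holds (the standard factorization argument, my addition rather than the commenter's): the likelihood depends on the data only through the sample mean and sample variance.

```latex
f(x_1,\dots,x_n;\mu,\sigma^2)
  = (2\pi\sigma^2)^{-n/2}
    \exp\!\Big(-\tfrac{1}{2\sigma^2}\textstyle\sum_{i=1}^n (x_i-\mu)^2\Big)
  = (2\pi\sigma^2)^{-n/2}
    \exp\!\Big(-\tfrac{n}{2\sigma^2}\big[(\bar{x}-\mu)^2 + s^2\big]\Big),
\qquad
\bar{x} = \tfrac{1}{n}\textstyle\sum_i x_i,\quad
s^2 = \tfrac{1}{n}\textstyle\sum_i (x_i-\bar{x})^2.
```

By the Fisher-Neyman factorization criterion, the pair $(\bar{x}, s^2)$ is sufficient: once you have it, the rest of the sample carries no further information about $(\mu, \sigma^2)$.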
Other applications of resampling, such applied math, etc. might be in US national security. E.g., maybe monitoring activities in North Korea and looking for significant changes ....
Maybe there would be applications in A/B testing in ad targeting, but I wouldn't hold my breath looking for a job offer to do such from a big ad firm.
For all I know, some Wall Street hedge fund or some Chicago commodities fund uses such statistics to look for significant changes in the markets or anomalies that might be exploited. I doubt it, but maybe! Once I showed my work in anomaly detection to some people at Morgan Stanley, back before the 2008 crash of The Big Short, and there was some interest for monitoring their many Sun workstations but no interest for trading!
Net, IMHO for such applied math: If you can find a serious application, that is, a serious problem where such applied math gives a powerful, valuable solution, the first good or much better solution, with a good barrier to entry, and cheap, fast, and easy to bring on-line and monetize, then be a company founder and go for it. But I wouldn't look for venture funding for such a project before I had significant, rapidly growing revenue and no longer needed equity funding!
Otherwise look for job offers (1) in US national security, (2) medical research, (3) wherever else. But don't hold your breath while waiting.
Now you may just have gotten enough from about 1/3rd of the Berkeley course!
We teach these methods to our students in intro stats at UC San Diego, and have for as long as I've been here (5 years). Last year a data science program was also created here at UCSD. I've TA'd a flagship course in that program too. It's almost exactly the same content; the major difference, imo, is the faculty personalities. The stats profs are smug, while the data science profs are energetically self-important. They teach the same shit. Self-motivated students with a STEMy personality tend to learn more in the stats courses because the profs drive on hard-core theory; on average, though, students do better in the data science course because the profs are so bombastic the kids walk out of each class thinking they are basically ready to join the fellas over at Waymo on some machine learning projects - maybe even show 'em a thing or two, cutting-edge tricks learned back at the ol' uni.
Yup. Thanks.
> known as the empirical distribution
Yup, and I wrote:
"out in a tail of the empirical distribution"
Yup, "rank" tests, "permutation" tests: With my TeX markup:
E.\ L.\ Lehmann, {\it Nonparametrics: Statistical Methods Based on Ranks,\/}
And, yup, again with my TeX markup,
Bradley Efron, {\it The Jackknife, the Bootstrap, and Other Resampling Plans,\/}
Last time I knew, Roger Wets was at UCSD. He read one of my papers and suggested JOTA where I did publish it!
I'm a pure CS / logician by training, but I've spent a few years trying to expand my expertise into probability theory and stochastic processes. Lots of your advice resonates with me. My MSc advisor recommended I should go through Neveu. He was pretty good, had been a student of Pontryagin.
Neveu is elegant beyond belief. I keep my copy close. I was aimed at Neveu by a star student of E. Cinlar, long at Princeton and before that at Northwestern -- long editor in chief of Mathematics of Operations Research. Neveu was a student of M. Loeve at Berkeley. So was the current darling of machine learning, L. Breiman, because of his Classification and Regression Trees (CART). Breiman's Probability as published by SIAM is generally easier reading than Neveu.
For stochastic processes, there are several relatively different directions to go.
Martingale theory is gorgeous, astounding, amazing, with one of the most powerful inequalities in math, the astounding, tough to believe, martingale convergence theorem, and likely the shortest proof of the strong law of large numbers.
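For reference, one standard statement of that convergence theorem (the $L^1$-bounded form; my paraphrase, not the commenter's):

```latex
\text{If } (X_n) \text{ is a martingale with }
\sup_n \mathbb{E}\,|X_n| < \infty,
\text{ then } X_n \to X_\infty \text{ almost surely, with }
\mathbb{E}\,|X_\infty| < \infty.
```

The strong law of large numbers for i.i.d. integrable variables then drops out by applying this to a suitable (backward) martingale of sample averages.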
Then can do Markov processes more generally. The discrete state space version is important and not too difficult -- Cinlar has a nice introductory text.
A high end direction for Markov processes is potential theory. There are claims that that is the math for exotic options on Wall Street, but I doubt that there have ever been any applications.
There is a big role for second order stationary stochastic processes in electronic engineering. I ran into that for processing ocean wave data for the US Navy. Here the fast Fourier transform added a lot of interest.
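A minimal sketch of what that ocean-wave-style processing looks like, with an assumed sampling rate and a made-up signal (none of these numbers are from the comment): estimate the power spectrum of a record with the FFT and read off the dominant frequency.

```python
import numpy as np

# Hypothetical record: a 0.1 Hz "swell" buried in noise, sampled for
# ten minutes.  For a second-order stationary process the spectrum
# summarizes the autocovariance (Wiener-Khinchin theorem).
fs = 10.0                      # samples per second (assumed)
t = np.arange(0, 600, 1 / fs)  # ten minutes of data
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 0.1 * t) + 0.5 * rng.normal(size=t.size)

# Periodogram estimate of the power spectrum via the real FFT.
spectrum = np.abs(np.fft.rfft(signal)) ** 2 / signal.size
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
peak_freq = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
```

The peak of `spectrum` should sit at (roughly) the 0.1 Hz wave frequency; in practice you'd average over windows (Welch's method) to tame the periodogram's variance.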
And there's more.
Generally, Russia, France, and Japan have long seemed to emphasize stochastic processes more than the US. But by now I suspect that the US is well caught up.
I'd have a tough time believing that very many people with money to hire know enough about high end stochastic processes, or even just Neveu, to hire in those fields. US national security may be about the only hope, that is, outside of academics.
Yes it appears that some of the quantum field theorists in physics are interested in path integrals.
Uh, I'm disorganized here: There is the field of stochastic optimal control!
As usual for advanced applied math, my suggestion is, outside of academics or US national security, find a valuable application and start a business to make money. That is, don't expect to be hired.
Well, there is more code to write, but IMHO that would be for relatively advanced techniques or, say, working with terabytes of data instead of megabytes.
If you want to write code for applied statistics, then maybe so indicate, have a portfolio of code, and contact the usual suspects -- US national security and medical research. I'm not optimistic. I've given my opinion -- find a good application and found a startup to monetize it.
It is true that today there is a WSJ article on how technical, with algorithms for trading, Wall Street has become. The article has next to nothing on what applied math is being used but does have lots of names, maybe some you could contact. Actually, the article mentions that Goldman Sachs (GS) got hot on such applied math. Well, that was about when I wrote Fisher Black, of Black-Scholes, there at GS asking about applied math at GS, and I got back a nice letter from Black saying that he saw no such opportunities. Well, the WSJ article today claims that that time was when GS was getting hot on applied math.
If you want to know about applied math on Wall Street, then try to get an opinion or overview from, say, James Simons.
Again, IMHO, it's academics, US national security, medical research, maybe a few other situations, but best of all, start a business, the money making kind.
To be fair, resampling wasn't the key to our projects, but we were doing a lot of work understanding probability distributions which is not entirely unrelated.
But this brings us back to a much more central topic in data science: the tools and environment DO matter. Hugely.
Reproducibility is central to not just data science but all science. This is facilitated by the use of Free, Open platforms which adhere to common standards.
Imagine trying to debug why someone has a different answer than you when there are countless variant program environments in which they may have obtained it.
At the most basic level this course should be distributing a Docker image or a VM image of some sort in order to ensure that everyone has the same version of the software.
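A minimal sketch of what that could look like, with a hypothetical base image and package pins (none of this is from the course's actual setup):

```dockerfile
# Hypothetical course image - pin the base and every package version so
# all students run the same stack and results are reproducible.
FROM python:3.6-slim
RUN pip install --no-cache-dir \
    numpy==1.14.2 pandas==0.22.0 matplotlib==2.2.2 jupyter==1.0.0
WORKDIR /course
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--allow-root"]
```

One `docker build` plus one `docker run` and every student has an identical environment, instead of N slightly different local installs.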
Even if you do not care about any of the above, please, shed a tear for the student who would like a simple setup.
Thank you.
What's the difference between, say, a Master's program in Computer Science where one studies machine learning and a Master's program in Data Science? Am I wrong for thinking the Data Science program weaker?
Data Science and DevOps are both just labels for things people have been doing under more mundane terms for 40-odd years.
Even Machine Learning is just a trendy buzzword for what used to be called Predictive Statistics.
I've never seen any stats text book or course discuss techniques for dealing with large amounts of data to any significant level, but in data science that is a core part of what you do.
I ran production systems before DevOps and after. Again, it's very different - prior to DevOps, there was no emphasis at all on using software engineering techniques to manage and deploy software. The most you'd get was some scripts, maybe kept in source control if you were lucky.
Now I run an AI company, and a key part of the ML we use involves generating structured text files from images. I guess predictive statistics is technically a correct label, but the tools and techniques are so dramatically different that thinking of them as separate fields is more correct than incorrect.
https://www.edx.org/xseries/data-science-engineering-apacher...
Go to community college. It's ridiculously cheap, and the credits are worth something.
Secondly, a data science course may be offered, but only during the fall semester, when everyone else also wants to sign up for it.
Also, bootcamps can compress two years' worth of junior college curriculum into 8-12 weeks. For someone who never enjoyed being in school, I'll take the bootcamp.
https://www.youtube.com/watch?v=xcgrnZay9Yc&list=PLFeJ2hV8Fy...