Data scientists shouldn’t need to know Kubernetes

(huyenchip.com)

181 points · vtuulos · 4y ago · 114 comments

90 comments · 30 top-level

void_mint4y ago· 14 in thread

Most people involved in tech, including most devs, shouldn't need to know/care about Kubernetes. The reason anyone thinks otherwise is the massive amounts of marketing money vested parties have pumped into sales (read: DevRel/Dev Evangelism, dev influencers).

throwaway8943454y ago

The objective is to minimize how much devs need to know. There are a couple of ways to do that. The first is to pull out the traditional ops skill set into a traditional ops team so the app devs throw code over the fence to the ops team to operate, and hilarity ensues because the ops team is measured on uptime but they can’t affect the first order causes of downtime (code issues), so instead they try to make it harder to ship changes frequently which slows the business.

The other solution is that devs operate the apps themselves. This is infeasible with a traditional VM setup because managing VMs effectively involves tons of specialist knowledge and it’s unreasonable to expect dev teams to master it while also being expert developers.

Enter Kubernetes. Now you have a core DevOps/SRE team managing the “platform” (the Kubernetes cluster and various add-ons such as operators) which gives the application developers a high-level interface for operating their applications. They need to know a bit of Kubernetes, but it’s a whole lot less than mastering the traditional VM-based ops skill set. Moreover, as the Kubernetes ecosystem continues to mature, the surface area with which developers interact becomes smaller.

vp89894y ago

"Enter Kubernetes. Now you have a core DevOps/SRE team managing the “platform” (the Kubernetes cluster and various add-ons such as operators) which gives the application developers a high-level interface for operating their applications."

I've personally changed my opinion on this in the last ~2 years ... observing at work what it takes for people to stand up and manage a Kubernetes platform it really just feels like incredible waste that we have hundreds and thousands of SREs across our industry all building their own unique compute platforms when the public cloud vendors have already done that work.

The Serverless paradigm just seems fundamentally superior to me, but it also seems like it inherently requires vendors to be very opinionated in order to provide a good developer experience with it, which AWS is not ... at least not yet anyway.

zapita4y ago

You don’t need Kubernetes to implement an embedded SRE model or an internal platform. You’re describing a good organizational model but making the mistake of crediting a tool for it.

alexchamberlain4y ago

Not sure I agree to be honest. I don't think most developers should know how to run K8s, but I think most developers should know how to run their code on K8s. These guys aren't idiots - putting abstractions and guide rails in the way is just patronising.

That's not to say everyone has to be an expert either - there's a place for experts to optimise setups etc too.

void_mint4y ago

> Not sure I agree to be honest. I don't think most developers should know how to run K8s, but I think most developers should know how to run their code on K8s.

This is silly. Most devs have too many other things they know they don't know, to also add on something like kubernetes.

tluyben24y ago

Absolutely. It is a timesink and really not very valuable for most devs: they will not ever use it themselves anyway and there is too much to learn while the normal dev stuff already has that as well. In bigger (only marginally bigger than a one person shop) companies you have admins/devops and they don't want you to touch any of it anyway.

strzibny4y ago

I completely agree with you. And if you want backend engineers to know more about ops, sure. But let them learn the groundwork, not force them into K8s. As for data scientists needing K8s knowledge, that's ridiculous to me.

tomrod4y ago

Data scientist here with very recent learning on K8s space. Exposure and general conceptual understanding is extremely helpful to have to assist in design of solutions. However, agreed that expecting me to maintain or lead the ownership of a K8s standup is outside the wheelhouse.

danjac4y ago

They shouldn't have to, true. But enough companies have drunk the Kool-Aid that it's a job requirement and you'll have to learn it anyway, which means developers will try and shoehorn it into their projects whether it makes sense or not so they can have it on their resume, and then companies will need to make it a job requirement when those developers leave and they need to maintain it...

throwaway8943454y ago

What’s the alternative? Devs master a VM/Ops skill set (strictly more work)? Or devs throw code over the wall to an ops team (and progress grinds to a trickle)? https://news.ycombinator.com/item?id=28652561

Bertram_Oglebs4y ago

'...um before all these tools being able to reach similar conclusions ?' (-;

(OT) ^^ somehow related comic: https://ibb.co/JktgqSV

best,

amerkhalid4y ago

100% agreed. Kubernetes/DevOps is a huge cognitive load, way over-engineered for an average project. Kubernetes should not come into the picture unless you can afford a full-time DevOps person for your team. If you can’t, then you are not big enough or haven’t solved a real problem yet.

commandlinefan4y ago

> shouldn't need to know

Hm - maybe shouldn’t need to, but why wouldn’t you want to? Even if it’s not strictly your job/responsibility, it’s always helpful to know how things work when things go wrong.

void_mint4y ago

Because if you followed this logic there would be many lifetimes of things to pay attention to, most of which are just noise surrounding topics you find valuable.

falcolas4y ago· 10 in thread

My opinion is simply: You should understand the environment your code runs in. Be it bare metal, Kubernetes, or anything in-between. How that environment works determines how your code works - or doesn’t work.

Despite our best efforts, we have yet to abstract away the runtime environment. Despite Java’s best efforts.

bwship4y ago

I don't really agree on this. If your data scientists are extracting important information about your data in Python or R, the actual hard work is figuring out the algorithms to run, not what they run on. They develop this code to sift through data in a data warehouse, a database, or flat files and then come up with answers. What servers, or cloud infra, or Kubernetes fleet it then runs on is of zero concern to the actual code they just laid down.

6nf4y ago

Do you believe front-end CSS / HTML designers should understand the entire stack down to the machine code and hardware running the VMs? I don't think I can agree with this, our stacks are too tall these days.

thinkharderdev4y ago

I think this actually gets at an important distinction. Are Data Scientists more like designers or developers? UX designers shouldn't need to know anything about k8s (or any other infrastructure) but developers should. Ultimately if you are not only responsible for building something but also running it in production and maintaining availability then you need to understand the infrastructure it runs on to some degree.

vasco4y ago

That runs in a browser, so they should understand the browser if you follow their logic.

shoo4y ago

i don't think it is necessary or sufficient for all individuals to have a deep understanding of the runtime environment. i agree that if the team needs to ship production code, it would be a good idea if at least one person in the team has a good understanding of how the code runs.

but there are other failure modes -- if everyone on the team is great at writing efficient production code, but no one understands the business context or the problem domain or understands if the problem they're attempting to solve is even vaguely feasible from some kind of theoretical perspective (maybe someone with a decent statistics background could demonstrate the entire premise of the project is flawed, and needs to be rethought, using a blackboard and no computers at all), maybe they'll spend months or years building and deploying a lot of fast, beautiful, completely worthless machinery.

tluyben24y ago

Except that most devs who learn this stuff but do not use it daily (or ever) (and why would they, they are devs), will know just enough to have opinions and too little for them to make sense. You (in general, maybe YOU do) do not understand the env your code runs on: it is layers on layers on layers with millions of LoC in between; you know some abstraction and maybe you know a bit more about this abstraction than others but you still do not understand it really. If you run Java or .NET Core or whatever popular with good support, your day to day programming won't matter for whatever env it runs on; if you write best practice code in those envs, writing different code for whether it runs in k8s or bare metal is... weird in almost every case. Someone in the team should know how to tweak the knobs and if there are things you should not do (use the filesystem for persistence and other trivial things) but the average dev or data scientist really doesn't need to know about it in any significant detail.

But I am curious where you have seen modern runtimes fail and where the code was the issue (not tweaks to the JVM settings); any concrete examples where well written, best practice code worked on the laptop but failed in k8s?

quadrifoliate4y ago

> But I am curious where you have seen modern runtimes fail and where the code was the issue (not tweaks to the JVM settings); any concrete examples where well written, best practice code worked on the laptop but failed in k8s?

Not sure about OP, but most of the times I have seen devs have issues with Kubernetes, it was in tweaking the knobs around deployments, including security. Startup vs. readiness vs. liveness probes, rolling updates, auto-scaling, pod security policies and such are usually all new to developers, and have a lot of different options. Most devs just want "give me the one that works, with good defaults", and need a higher-level abstraction.
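To make the "one that works, with good defaults" idea concrete, here is a minimal Python sketch of such a higher-level helper. It builds a container spec with all three probe types filled in. The field names (`startupProbe`, `readinessProbe`, `livenessProbe`, `httpGet`, `periodSeconds`, `failureThreshold`) are the ones the Kubernetes API uses; the helper name, the `/healthz` path, the image name and the specific thresholds are assumptions chosen for illustration, not recommended values.

```python
import json


def container_with_probes(name, image, port, health_path="/healthz"):
    """Return a container spec dict with startup, readiness and liveness probes.

    The helper name, default path and thresholds are illustrative assumptions.
    """
    def http_get():
        # Build a fresh dict per probe so the three probes do not share state.
        return {"path": health_path, "port": port}

    return {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Startup probe: allows up to 30 checks * 2s = 60s for a slow boot;
        # the other two probes only start once this one has succeeded.
        "startupProbe": {"httpGet": http_get(), "periodSeconds": 2, "failureThreshold": 30},
        # Readiness probe: while failing, the pod is taken out of Service
        # endpoints but is not restarted.
        "readinessProbe": {"httpGet": http_get(), "periodSeconds": 5, "failureThreshold": 3},
        # Liveness probe: after repeated failures the kubelet restarts the container.
        "livenessProbe": {"httpGet": http_get(), "periodSeconds": 10, "failureThreshold": 3},
    }


if __name__ == "__main__":
    # Hypothetical service name and image, used only to show the output shape.
    spec = container_with_probes("model-api", "example.com/model-api:1.0", 8080)
    print(json.dumps(spec, indent=2))
```

A platform team could own a helper like this, so that developers only supply a name, an image and a port.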

pjmlp4y ago

Java was doing alright, but then plenty of people decided they didn't want to go along with it.

We develop on Windows, and deploy on multiple kinds of scenarios and OS stacks, I hardly have to care what lies underneath.

The same applies to .NET, although to a smaller extent, given its Windows focus until quite recently.

harpratap4y ago

>Despite our best efforts, we have yet to abstract away the runtime environment. Despite Java’s best efforts.

I think containers are a pretty good attempt at abstracting away runtime environments, no? Same docker image works on your local docker setup, docker-compose, vanilla kubernetes, managed kubernetes, fancy PaaS like CloudRun, Fargate, Heroku etc

isbvhodnvemrwvn4y ago

That's just running the code, you need to connect to something or have something connected to it, handle failures etc - docker doesn't solve these on its own.

antman4y ago· 5 in thread

There are people, mostly with an IT background, who think that for data science you don’t need to know math and can just monkey-see-monkey-do AutoML based on a tutorial, inspirational MOOCs and libraries that appeared magically out of thin air.

There are people with a math background who think data science is just an extension of statistics, so business, knowledge of scalable information storages, and productization is irrelevant.

There are both kinds of posts here on HN. My take has been to hire math people with some CS MSc, CS people with a data science MSc, and business people that also know sales.

For me that has worked painlessly, but your mileage may vary. I haven’t seen that black swan CV capable in all three disciplines, but I have seen CVs that seem to think that they can tackle every problem because they have read all the towardsds and kaggle tutorials. Marginalization? Kubeflow? POV? Two out of three are usually foreign concepts.

urthor4y ago

It's mostly work ethic I find.

I've met quite a lot of Black Swans, and been employed alongside precisely zero.

I know one hard science PhD who runs their own K8s cluster at home and plays with Linux distros.

They describe themselves as "a statistician who can program."

Generally speaking it's more common for them to come from the math side of the fence. From the IT side I'll say the math is a bit harder than the computer stuff.

It's genuinely, 100% work ethic.

musingsole4y ago

> I know one hard science PhD who runs their own K8s cluster at home and plays with Linux distros.

That's super awesome for that data scientist, but the question for a business is whether you can or should structure yourself in such a way that you NEED employees with that corner-case level of joint expertise.

The answer is you really can't. Individuals have awesome strengths that they developed for reasons particular to them. Use those strengths when you can. But the business has to rely on a common denominator of a role or else it'll never fill it when their unicorn leaves to go backpacking in Europe.

antman4y ago

Agree that work ethic is the most important thing: since complicated qualitative things cannot be measured, trust precedes everything. But work ethic does not complete the puzzle, because people don't always know what they don't know.

For the example you mentioned, I will use a simplification I make to explain levels of expertise in challenging knowledge:

1. ABOUT: Know about something (heard of it, know some examples).
2. KNOW: Know that something well (I now understand it and can leverage it end to end towards a useful thing, and also know its weaknesses).
3. HUMBLE: Realize I did not know many things about it, but now know many ways of using it and can correct and extend other people's work, most of the time.
4. EXPERT: Know why it was structured that way. Contribute to the knowledge/tool itself.

So for that PhD an initial estimate would be a 3 or 4 on the math scale and a 1 or 2 on the Kubernetes scale (I don't know him, of course, so I can be wrong without first discussing). If he works independently, level 2 Kubernetes is pretty great. If he needs to be part of a larger support team, level 3 knowledge based on my (admittedly back-of-the-napkin and ambiguous) categorization might prove to be less risky.

ozim4y ago

And what makes you think that, when presented with a problem, those people who can grasp two concepts cannot get the third one?

Was it shooting questions from the hip on the spot while interviewing them?

Or you hired 10 people and worked with them for at least 6 months to really know what they are capable of?

I think the former, because no one has enough budget to hire people and stick with them for 6 months just to see how they fare.

So what is your N to back up your claim?

Because it sounds like you really have something to say.

antman4y ago

Smart and creative people can grasp a lot of things, but not everything is pure thought. Experience and experimentation time is required, and there are only 24 hours in a day. Also, the DS field has a lot of young people who have not had that much time or opportunity yet.

My N is a few hundred, not all my personal hires. I have visibility because I now do project management office duties (building sub-teams per project), lead most of the interviews on the DS side, and have internal technical consulting duties. The ten you mentioned is approximately my target number of hires for last week and next week.

My claim is based on experience from the academic and the consulting space for global corp (which included consulting for other corps to build their ds teams, rarely though). I hope my claim appears logical and is useful.

tedk-424y ago· 5 in thread

Kubernetes, Linux, CI/CD pipelines, unit testing...<insert tech here/>

To be honest the landscape is constantly changing and people should learn as much as they can.

I call ignorance on these kinds of posts.

etaioinshrdlu4y ago

I don't want to learn just anything, I want to be careful to learn things that won't go out of date quickly.

tedk-424y ago

OK PHP is great for you then. C and Java won't go out of date quickly.

Not the hottest tech out there but they have a long used-by date if that's what your major concern is.

If you believe k8s will just 'go away', you don't really have a good clue about what it tries to solve and are instead getting confused by its complexity. Having been around the block, I can see it sticking around for at least 10 years.

strzibny4y ago

Learn more of the underlying knowledge (which is what I teach in https://deploymentfromscratch.com/) and your knowledge will last longer. Ansible YAMLs or your CI/CD provider YAMLs are just abstractions.

But forcing anyone, especially data scientists into a specific and quite complex tool of the day? Pass.

tedk-424y ago

No-one is forcing anyone to do anything.

Tools are built by people that use them. If your team chooses to deploy their applications on a k8s stack, it's on them to own that and not treat it like a black box.

I'm completely against the entitled belief that a person 'shouldnt need to know how to <x>'.

I can stretch the example in many ways: 1) if you're committing secrets into your source code and claiming 'a data scientist shouldn't need to know about secrets management'; 2) if you're building a data analysis script and you leave it as an undocumented mess that has no unit tests and one day it breaks, you shouldn't claim that 'a data scientist shouldn't need to know about testing'.

Oh cry cry there's a tech that everyone is using but i don't want to learn it / i dislike doing that particular thing / working with that piece of tech.

Build your own damn tech stack/computer if you think you can do it better. Or ask in the job interview if your team is running their data science platform on k8s if you dislike operating apps on it so much and deny the job.
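The two habits named in the comment above (keeping secrets out of source code, and covering analysis code with a test) are small in practice. A minimal Python sketch follows; the environment variable name `DATA_API_TOKEN` and the toy `mean` function are hypothetical, chosen only to illustrate the shape.

```python
import os


def get_api_token(env=None):
    """Read the token from the environment instead of hard-coding it in source.

    DATA_API_TOKEN is a hypothetical variable name used for illustration.
    """
    env = os.environ if env is None else env
    token = env.get("DATA_API_TOKEN")
    if not token:
        raise RuntimeError("DATA_API_TOKEN is not set")
    return token


def mean(values):
    """Tiny stand-in for a real analysis step."""
    if not values:
        raise ValueError("mean() of empty input")
    return sum(values) / len(values)


def test_mean():
    # A test this small is enough to catch a regression in the happy path.
    assert mean([1, 2, 3]) == 2


def test_mean_rejects_empty_input():
    try:
        mean([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

Passing `env` explicitly keeps `get_api_token` testable without touching the real process environment.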

rc_hackernews4y ago

Bought your book awhile ago from someone that mentioned it on my Twitter feed.

I’m really enjoying it! You did a great job on it.

I especially liked the chapter on networking since it was always something I was weak in.

Jensson4y ago· 4 in thread

Data scientists want salaries like software engineers, which is why they get requirements like software engineers. There are plenty of data scientist positions where all you need to know is Excel, but those don't pay nearly as well. And if you look at the typical software engineering position there is almost always a slew of adjacent technologies; it is hard to get a position today where you only have to know one thing.

musingsole4y ago

I don't believe pay directly influences job responsibilities like that. Maybe scale of responsibilities. But more pay doesn't mean you start doing something outside the job description.

The business leaders and managers trying to load kubernetes work on data scientists are doing so because the managers don't know what they're doing, what they want or who they need to get it done. Instead, they have the one hire they got greenlit last year and if that person can't do EVERYTHING, your group is screwed.

Zababa4y ago

That's pretty much what I came to say. Expectations to know all the stack are what software engineers face.

miscaccount4y ago

Not only the stack but all roles, from devops to devsecops to QA to performance and load testing.

Seems they want to replace one team with all responsibilities

piggybox4y ago

"Data scientists want salaries like software engineers" This is a bit weird. In general, DS is still one of the highest-paid jobs in recent years, if you check any job market report.

jrockway4y ago· 4 in thread

I think what's going on here is that tech leadership folks know that the models the scientists develop eventually need to feed into their live product (so need to be "production ready"), but there isn't enough work to have two teams; one to develop the models, and one to run them in production. Thus, the ideal employee is an expert in everything! That's valuable, but not likely to be something you find when both data science and SRE are deep fields where people are very successful only knowing one of them ;)

I work on something called Pachyderm, which is a Kubernetes-based data storage and job execution system that tries to bridge this gap. We have a managed solution (https://hub.pachyderm.com) where we provision your Kubernetes cluster and do all the management (keeping the software up to date, authentication and authorization, etc.) and in fact don't even expose kubectl to you. You'll never see any of the Kubernetes stuff (though you might recognize certain error messages, I suppose). You just supply your code and a specification for how data flows around your pipelines, add your data, and we do the rest. Data scientists can interact with the versioned inputs and outputs through notebooks, but you're getting the full suite of production features behind the scenes -- a history of exactly which data inputs went into which data outputs, incremental processing, seamless autoscaling (set cpus: 8, gpus: 1 in your pipeline specification, and we find you a machine that meets that spec, add it to your cluster in less than a minute, schedule your work there, and remove the machine when the job finishes), etc.

Sorry for the sales pitch. I pretty much never use HN to shill my paid work, but it seems especially relevant to this sort of problem. Maybe you don't need the unicorn employee that is an expert in multiple fields -- focus on the data science and let us actually deal with the ugliness of computers ;)

(And if you do like Kubernetes but don't want to write your own orchestration system, Pachyderm itself is open source.)

marcinzm4y ago

> but there isn't enough work to have two teams

Two teams causes an issue where scientists chuck models over the wall for the engineers to somehow rebuild into a semi-workable approach. The end result isn't great because you can't build good production models without taking production deployment into account. You also can't convert non-production models into production models without understanding the modeling assumptions that happened.

The general result is that the engineers and leadership find the results underwhelming to horrible. The scientists often don't care because what happens on the other side of the wall isn't their problem.

That doesn't mean everyone has to know everything but separating people into teams is not the answer. Have a single team with people of different focuses and areas of expertise.

nerdponx4y ago

Forgive my ignorance, but wasn't (isn't?) Pachyderm a Hadoop data version control tool? Did the product pivot?

samuell4y ago

The (very) short version is that it is a much better alternative to Hadoop, built for the container era.

commandlinefan4y ago

> there isn't enough work to have two teams

There may not be enough work for two teams 100% of the time, but there sure is when TSHTF. Manufacturers understood the need for some slack, but software companies still haven’t figured this out.

Jugurtha4y ago· 3 in thread

We "do ML" for large organizations as a tiny consultancy. The way we've been able to improve the working conditions for ourselves (developers and data scientists) was by focusing on two things:

- Process: we analyzed what worked and what didn't in past projects, continuously auditing and trying to extract learnings. We made sure the people we built for at the client organization were involved. We scoped more thoroughly. We involved upfront the parts of the client organization that could torpedo the project downstream (legal, security, etc.). Made fewer assumptions. Listened more.

- Tooling: we built a machine learning platform[0] to make sure a data scientist doesn't tap on anyone's shoulder to troubleshoot their system, set up their computing environment, or deploy their model. They could do it themselves. Furthermore, it wasn't necessary to get people who could move across the stack.

Changing our processes and the way we do consulting had a huge impact. A badly scoped project will in some way or another create toil downstream and create a situation where you need people to do full-stack and you need "all-hands-on-deck" constantly. That's just bad, and after we ruthlessly reworked the process, we had better results, better relations with clients, better cadence, etc. I emphasize this because we were a larger team at some point, running around working on so many projects simultaneously that everyone was practically burned out.

[0]: https://news.ycombinator.com/item?id=28373127

teruakohatu4y ago

It looks good. Resubmit that to Show HN.

Jugurtha4y ago

Hi, I re-submitted it here: https://news.ycombinator.com/item?id=28777589

Jugurtha4y ago

Thanks. It fell between the cracks on HN, and I didn't want to re-submit it not to be spammy.

Although we technically added multi-cluster Kubernetes support: it was only GKE, and now it runs notebooks and workloads on AWS EKS, Azure AKS, and DigitalOcean as well. I'm not sure it's enough of an improvement according to the Show HN rules to re-submit. Plus I'm reworking the landing page and docs to add more clarity on what this thing does, with gifs showing RTC and all.

Do you have any feedback?

m0zg4y ago· 3 in thread

Increasingly data scientists need to know a thing or two about underlying tech. Otherwise you're limiting yourself to stuff that can be built on a single machine, and that doesn't get you very far. That said, with that list of qualifications they'll be looking for a very long time, especially if they aren't prepared to hire a $400/hr contractor to do all that stuff. Such people exist, there are just very few of them, and they're booked solid months in advance.

savin-goyal4y ago

A single machine can take you remarkably far these days, given the availability of high RAM/Disk/CPU machines in the cloud.

mark_l_watson4y ago

I agree. A huge GCP VPS with a good GPU attached is very inexpensive when you only start it when you are in a work sprint.

Just this week I have been experimenting with SageMaker and SageMaker Studio. Too early for a real evaluation, but it looks like SageMaker Studio hits many requirements: good for experimenting, run large distributed jobs, good model and code versioning tools, easy to publish REST APIs, etc. Just yesterday someone asked me to review 3rd party tools, and I look forward to getting a better understanding of how SageMaker Studio stacks up against turn-key systems.

I have built my career from standing on the shoulders of giants. I am not shy about just using the results in academic papers, using open source libraries, tools and frameworks, etc. that other people have written.

So, I agree with you that so much can be done on a single beefy VPS, but services and frameworks that allow easy use of multiple servers are also important.

m0zg4y ago

In plain old data science? Sure. In deep learning? Nope. Gotta be distributed unless you want to wait until the Sun burns out.

phendrenad24y ago· 3 in thread

Developers in general shouldn't need to know about Kubernetes, but it's become trendy to slash your IT/Ops teams to the bone and instead accept that your developers will just spend all of their time trying to configure GCP.

thinkharderdev4y ago

I don't understand how you would do your job as a developer without understanding the infrastructure it runs on. I agree that it can make sense to have dedicated people do all the infrastructure setup/management/etc, but when you have an application running in production there are a lot of considerations which can't be cleanly separated from underlying infrastructure. Not to mention troubleshooting production issues. When something is not working in prod, the first thing I do is check basic operational stuff with the underlying deployment. Are all the pods still running? Have there been any restarts? If there is some DNS/network error how can I spin up a pod in the cluster to check on various things?
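The first two checks in the comment above (are the pods running, have there been restarts) can be scripted. The sketch below is one possible way to do it in Python, assuming `kubectl` is installed and already configured for the cluster. It reads the JSON that `kubectl get pods -o json` returns and relies on the standard Pod fields `metadata.name`, `status.phase` and `status.containerStatuses[].restartCount`. The function names are made up for illustration.

```python
import json
import subprocess


def summarize_pods(pod_list):
    """Return (name, phase, total_restarts) for each pod in a parsed pod list."""
    summary = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # A pod may have several containers; add up their restart counts.
        restarts = sum(
            cs.get("restartCount", 0) for cs in status.get("containerStatuses", [])
        )
        summary.append((pod["metadata"]["name"], status.get("phase", "Unknown"), restarts))
    return summary


def fetch_pods(namespace):
    """Run kubectl and parse its JSON output (assumes a configured kubectl)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def needs_attention(summary):
    """Pods that are not Running, or that have restarted at least once."""
    return [row for row in summary if row[1] != "Running" or row[2] > 0]
```

Typical use would be `needs_attention(summarize_pods(fetch_pods("default")))`. Keeping the parsing separate from the `kubectl` call means it can be exercised with canned data, without a cluster.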

throwaway8943454y ago

With an Ops team, developers aren’t expected to operate their code. That’s the ops team’s problem. And the ops team is measured on uptime, which is a function of the code itself, which they can’t actually change—devs own that. What the ops team can do is to slow down the rate of deployments (another input to downtime/uptime). Rather than many small deployments, they’ll have larger deployments once or twice a quarter (at best).

So a desire to ship features regularly and preserve agility and quality is the “trendy” that the GP is talking about.

phendrenad24y ago

> When something is not working in prod, the first thing I do is check basic operational stuff with the underlying deployment. Are all the pods still running? Have there been any restarts? If there is some DNS/network error how can I spin up a pod in the cluster to check on various things?

And how much less downtime would you have if domain experts were doing that part?

urthor4y ago· 2 in thread

I don't think it's a particularly new feature of software development that a few highly paid employees who've got the entire stack in their brains are vastly more productive than a vast cross functional team.

urthor4y ago

Also I will say.

There is some fantastic tooling for machine learning.

Databricks, GCP, everyone knows it.

The issue is that the data industry was raised from birth in complete fear of the boogeyman.

The boogeyman is Oracle. And the frankly ridiculous things Oracle did in the bad old days.

Hence most places have a constant internal conflict between "look here are all these brilliant data science tools" and "ah shit, GCP costs a ton of money when some idiot runs a select * query on a join across 5 TB of data."

But there are plenty of great tools.

tomrod4y ago

Can you speak a bit more to this? I dislike Oracle with a passion, but I am not sure how the GCP comment connects.

rjzzleep4y ago· 2 in thread

This is a pretty good post. I completely agree that a data scientist should not need to know Kubernetes.

There is a section about Airflow and while the author doesn't advocate for it, I've very much like it many many times. People still recommend it, but I find it to be an absolute nightmare to deal with.

One thing I have learned dealing with different data science teams is something else though. I have gone through every single pipelining tool(including pachyderm) and stream processing tool that was available at the time. The thing that people forget is that every single one of them has a thing that throws you off of what you actually want to accomplish or has some sort of caveats in your use case.

The important thing to note is that the job of the architect, or whatever you want to call that person, is to provide an infrastructure where the data scientist can just run their code. And no matter which one of these environments you use, you still need to build glue code for your use case, even if that glue code is a Python library with a good distribution mechanism.

tdeck4y ago

> There is a section about Airflow and while the author doesn't advocate for it, I've very much like it many many times. People still recommend it, but I find it to be an absolute nightmare to deal with.

Airflow's UX is just needlessly easter-eggy and bad. The one thing I'd want out of the dashboard is the list of recent job runs and whether they succeeded or failed, so of course that's hidden in such a way that a novice has to click 10 different places to find it. There's also the fact that they chose to call a timestamp "execution time" when it often doesn't correspond to the time the job is executed. Want to add parameters to your task? You better like hand-writing JSON or pasting it into a textbox because apparently that's a weird thing to do, so why bother adding any UI support for it.
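The "execution time" confusion mentioned above can be shown with plain date arithmetic. The sketch below does not call Airflow. It only models the interval-based convention as I understand it for scheduled (not manually triggered) runs: a run is labelled with the start of the period it covers and can only start once that period has ended. The function name is made up for illustration.

```python
from datetime import datetime, timedelta


def earliest_start(execution_date, schedule_interval):
    """Earliest wall-clock time at which a scheduled run labelled with
    `execution_date` can start, under the interval convention described above."""
    return execution_date + schedule_interval


# A daily job labelled 2021-10-01 covers that whole day,
# so it cannot start before midnight on 2021-10-02.
label = datetime(2021, 10, 1)
print(earliest_start(label, timedelta(days=1)))  # 2021-10-02 00:00:00
```

Read this way, the timestamp names the data window rather than the moment the job ran, which is why the two often differ.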

crucialfelix4y ago

I found Argo Workflows (k8s job and pipeline manager) much easier to work with and manage than Airflow. But I know Kubernetes and find it easy, so ..

TruthWillHurt4y ago· 2 in thread

What DO they know? Their Python code is sub-par, a procedural script not suitable for production use. They can't use Git. They don't write tests. They don't understand how to deploy or use CI/CD.

Maybe they should stick to spreadsheets, or upskill a bit so they don't consume so much of the engineers' time.

zwaps4y ago

You pay these people for their PhD-level knowledge of math and stats, because that is a scarce skill: no matter how many Coursera courses one does, you cannot upskill anyone to that level (at least, I have never seen it).

So, if their time is better spent applying that knowledge rather than thinking about infrastructure trivialities, then by all means, pay an engineer to clean up. In the end, that's still more cost-efficient.

That being said, I refuse to believe that anyone leaving university today with a degree in stats/ML/econometrics etc. doesn't know Git and cannot be taught programming good enough that it at least doesn't interfere with operations.

But as soon as you start requiring your experts to do infrastructure, you are either wasting money, or you hired a quotation mark "data scientist" with a degree from medium.com and towardsdatascience.org or whatever - in which case by all means, require them to do engineering duties.

tchalla4y ago

It's not surprising that some scientists aren't the best at engineering practices, given that it's not their speciality, much like some engineers aren't experts in science either. Maybe both scientists and engineers should learn to understand their limitations and collaborate towards achieving a common goal. That would be more productive than being condescending.

tofflos4y ago· 1 in thread

It's a price data scientists have to pay in order to work in rapidly evolving business and solution spaces. Someone within the local organization has to experience all these tools before being able to reach similar conclusions. Many organizations are still struggling to get the data science infrastructure in place so they look for full-stack people to help get the ball rolling and start making progress on some initial set of prioritized business problems.

A few organizations are further along on that journey enabling their data scientists to focus on things other than process and tooling. Full-stack will be in demand until the solution space stabilizes and the bulk of organizations catch up.

hobofromabroad4y ago

That might be true for startups. But larger business organizations are far better off creating a specific heterogeneous team with data scientists, data engineers and ops in one, at least starting out. That way, there is inherent knowledge transfer. You are not artificially limiting your hiring pool and can actually get some T-shaped folks who are experts in a certain domain.

Later on you can then build more specific teams or even more cross-functional ones.

Of course, if you only want to test the waters and check whether DS use cases are viable at all, consider getting a freelancer (or a few) and putting a somewhat technically inclined person in charge. If that's a success, use it to get funding for a proper team.

kureikain4y ago· 1 in thread

I have extensive Airflow experience, and I generally agree that Airflow isn't a good solution. It is good when each step processes a single atomic unit of work. When each step processes multiple files and gets restarted, you have to write code to skip the files that were already processed, for example.

But I want to point out a few things that are wrong in the article to help others evaluate Airflow.
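A minimal sketch of the restart handling described here, in plain Python rather than Airflow's own API. The function name `process_files` and the JSON manifest file are assumptions made for illustration; the idea is only that a step records which files it has finished so a rerun can skip them.

```python
import json
import tempfile
from pathlib import Path


def process_files(files, manifest_path, handler):
    """Run handler on each file once, recording progress so a restart can resume."""
    manifest = Path(manifest_path)
    # Load the names already handled by a previous (possibly interrupted) run.
    done = set(json.loads(manifest.read_text())) if manifest.exists() else set()
    processed_now = []
    for name in files:
        if name in done:
            continue  # finished before the restart, so skip it
        handler(name)
        done.add(name)
        processed_now.append(name)
        # Persist after every file so a crash loses at most one unit of work.
        manifest.write_text(json.dumps(sorted(done)))
    return processed_now


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        manifest = Path(tmp) / "done.json"
        handled = []
        process_files(["a.csv", "b.csv"], manifest, handled.append)
        # A rerun with one extra file only handles the new file.
        print(process_files(["a.csv", "b.csv", "c.csv"], manifest, handled.append))
```

Writing the manifest after every file is a deliberate trade-off in this sketch: it costs extra I/O but keeps the amount of repeated work after a crash small.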

> Second, Airflow’s DAGs are not parameterized, which means you can’t pass parameters into your workflows. So if you want to run the same model with different learning rates, you’ll have to create different workflows.

You can pass parameters to workflows by giving them a JSON config. When triggering a DAG from the UI, you can paste JSON with the right arguments/parameters for your DAG, so you can train a model with different arguments, etc.
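As a rough illustration of that pattern, here is a sketch in plain Python of a step that merges a JSON payload supplied at trigger time over its defaults. It does not use Airflow itself; `DEFAULT_PARAMS` and `resolve_params` are hypothetical names, and how the payload reaches the task is left out.

```python
import json

# Hypothetical defaults for an illustrative training step.
DEFAULT_PARAMS = {"learning_rate": 0.01, "epochs": 10}


def resolve_params(trigger_json):
    """Merge a JSON payload supplied at trigger time over the defaults."""
    overrides = json.loads(trigger_json) if trigger_json else {}
    # Reject misspelled keys early instead of silently ignoring them.
    unknown = set(overrides) - set(DEFAULT_PARAMS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULT_PARAMS, **overrides}
```

With this shape, running the same workflow with a different learning rate means pasting something like `{"learning_rate": 0.1}` at trigger time rather than defining a second workflow.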

> Third, Airflow’s DAGs are static, which means it can’t automatically create new steps at runtime as needed.

You can absolutely create new steps at run time. The point of Airflow is that everything is just Python code that is evaluated to generate DAGs; as long as you generate the DAGs and write the operator, it will happily run and log. It may have trouble rendering in the UI and cause some weird issues (when I last worked with it, tasks would not advance after certain steps, but those are bugs).

You can write an operator, the operator in turn can initiate any other known operators, and point the next steps to those operators. Here is an example: https://stackoverflow.com/questions/41517798/proper-way-to-c...
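The "DAGs are just Python code" point can be sketched without Airflow at all. The example below builds a task graph in a loop, one training task per learning rate, and asks the standard library's `graphlib` for a valid run order. The task names and the `build_graph` helper are made up for illustration and are not Airflow constructs.

```python
from graphlib import TopologicalSorter


def build_graph(learning_rates):
    """Return {task: set_of_upstream_tasks}, with one training task per rate."""
    graph = {"load_data": set()}
    train_tasks = []
    for lr in learning_rates:
        name = f"train_lr_{lr}"
        graph[name] = {"load_data"}  # every training task waits for the data
        train_tasks.append(name)
    # The final step depends on however many training tasks were generated.
    graph["pick_best"] = set(train_tasks)
    return graph


def execution_order(graph):
    """One valid order in which the tasks could run."""
    return list(TopologicalSorter(graph).static_order())
```

Because the number of training tasks comes from the input list, the shape of the graph is decided when the code is evaluated rather than being written out by hand.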

samuell4y ago

The one drawback I did note with Airflow was none of the mentioned ones, but this: It does not allow defining data dependencies at the data level. That is, in terms of individual inputs and outputs of a process or task.

I cover this issue in some detail in a blog post from a few years back: https://rillabs.org/posts/workflows-dataflow-not-task-deps
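To make the distinction concrete, here is a small sketch of dependencies declared at the data level: each task only names the inputs it reads and the outputs it writes, and the task-to-task edges are inferred from those names. The `TASKS` table and file names are hypothetical, and this is not taken from the linked post.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each declares the named data it consumes and produces.
TASKS = {
    "extract": {"inputs": [], "outputs": ["raw.csv"]},
    "clean": {"inputs": ["raw.csv"], "outputs": ["clean.csv"]},
    "train": {"inputs": ["clean.csv"], "outputs": ["model.pkl"]},
    "report": {"inputs": ["clean.csv", "model.pkl"], "outputs": ["report.html"]},
}


def infer_task_graph(tasks):
    """Derive {task: upstream_tasks} by matching each input to its producer."""
    producer = {}
    for name, spec in tasks.items():
        for out in spec["outputs"]:
            producer[out] = name
    return {
        name: {producer[item] for item in spec["inputs"]}
        for name, spec in tasks.items()
    }


if __name__ == "__main__":
    graph = infer_task_graph(TASKS)
    print(list(TopologicalSorter(graph).static_order()))
```

Nobody states "report runs after train" here; that ordering falls out of `report` reading `model.pkl`, which is the kind of wiring the comment says is missing when dependencies are defined only between tasks.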

spicyramen4y ago· 1 in thread

Very limited and unfair comparison between Kubeflow and Metaflow. Metaflow is dependent on AWS (it is mentioned but not emphasized). To me this is a non-starter. It makes sense for Netflix but not for the rest of the world.

vtuulosOP4y ago

As the article mentions, Metaflow will start supporting Kubernetes natively soon, although data scientists don't need to care about it :) Nothing changes in your Metaflow code when you move e.g. from AWS to Azure, so Metaflow isn't fundamentally dependent on AWS in any way.

Netflix is an AWS shop, so naturally we started with AWS integrations.

FpUser4y ago

I am a developer and do not know much about k8s. Well, I know the theory and what it's for, and could learn to use it in practice. However, I have yet to find a single case amongst my clients where all this infrastructure overhead would provide a positive ROI. I do not deal at Google scale, and for normal businesses a single instance of a properly written server deployed on dedicated hardware covers all their needs many times over. It serves as many requests as they can ever hope for without breaking a sweat.

dudeinjapan4y ago

Waiters shouldn't need to know anything about cooking.

However, knowing a bit about cooking might make one a better waiter.

thom4y ago

Full stack data scientists exist. They have certain advantages over others. Specialists exist. They have certain advantages over others. Live your life, be free.

sandGorgon4y ago

I'm kind of surprised at seeing Kubeflow vs Metaflow levels of abstraction, honestly.

If you are indeed talking from a data scientist POV - then the right abstractions here are Dask and Ray Distributed.

Both can run on Kubernetes as the underlying orchestration layer - but are a pythonic interface to distributed data science primitives.

mmarq4y ago

These requests are not unreasonable in organisations that only need to run some simple (from a mathematical standpoint) operations against a complex (from an IT perspective) dataset. Quite often you don't need a full-time statistician or mathematician, but you can make it a full-time job if you hire a sysadmin or a developer who understands statistical distributions and hypothesis testing and put them in charge of the whole data infrastructure.

I'm not saying this is the majority of data scientist jobs, but in some organisations I worked for the data analyst was a guy who ran `SELECT MIN(v), MAX(v), AVG(v) FROM TableX` against a MySQL DB, so they were also in charge of DB administration and data ingestion; otherwise it would not have been a full-time job.
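For a sense of how small that analysis is, the query from this comment (written with a comma between each aggregate) can be tried with Python's built-in `sqlite3` module. Using an in-memory SQLite database instead of MySQL, and the `summarize` helper, are assumptions made so the sketch runs anywhere.

```python
import sqlite3


def summarize(values):
    """Load values into a one-column table and return (min, max, avg)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE TableX (v REAL)")
        conn.executemany("INSERT INTO TableX (v) VALUES (?)", [(v,) for v in values])
        return conn.execute("SELECT MIN(v), MAX(v), AVG(v) FROM TableX").fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    print(summarize([1.0, 2.0, 6.0]))
```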

alxmrs4y ago

My favorite infrastructure abstraction tool in this category is Apache Beam. I like that it lets you think in Python and an explicit Map Reduce DAG. Serialization errors are a bear to deal with. But, the power and composability of the framework make it nearly addictive.

ricklamers4y ago

This post really resonates with why we created Orchest [0]

From the article: "involve two full sets of tools: one for the dev environment, and another for the prod environment"

This is what we think should change. We intend to bring dev and prod into a single cohesive environment. Initially it will be difficult to cover all types of production workloads (like the post mentioned, production is a spectrum). But what we've observed is that through container encapsulation we can create well defined production workloads that we can run on any container orchestrator while shielding the data scientists from that complexity during pipeline development _and_ deployment.

With a container first approach to DAGs it becomes trivial not just to mix library versions but even languages (e.g. feature extraction in Scala and model fitting in Python). In practice, this flexibility has resulted in a significant productivity increase because existing code "just works". No "one virtual environment to rule them all" necessary.

I like how the article does justice to the fact that there's a subtle yet important difference between mere workflow orchestrators and workflow orchestrators that take on meaningful responsibility when it comes to infrastructure. To really unburden the data scientist from having to be a full-stack unicorn you need to hide the underlying stack to the point where it's invisible. In that sense, the OS kernel analogy really works. Similarly, how many data analysts writing SQL have ever worried about database node sharding?

A big problem we see in the space is that there are still way too many leaky abstractions and data scientists end up dealing with architecture & config yet again, for many a task out of their depth. We hope to contribute to a better ecosystem, one where data scientists spend their time looking at the data, relating it to the domain, shipping value generating data pipelines/models, and communicating about results with their stakeholders. Not fighting config & infra.

[0] https://github.com/orchest/orchest

lvl1004y ago

This is laughable. 15 years of DL? I ran neural net models more than 15 years ago. It wasn’t even accepted back then. Heck people looked at you weird if you mentioned Python. As far as I am concerned if you tell me you did DL before 2013 as a “DATA SCIENTIST” you are full of shit.

As far as OP, how do you learn Docker without Kubernetes these days? To me this is like saying you don’t need to learn Windows because all you do is run the solver in Excel.

EastSmith4y ago

Nobody who is not in a system administrator / DevOps role needs to know about it. I do not want to know about it. I am not explaining React reconciliation in my scrum updates, so stop giving me updates about Kubernetes.

justsomeuser4y ago

Sure, they don’t need to know how to schedule their computations on CPUs, as another team member can handle it, but I think the reality is that if you work in software you have to constantly be learning.

tuananh4y ago

By that definition, developers shouldn't need to know Kubernetes either.

However, with the rise of DevOps culture, everyone should know the stack so they can use the platform effectively. Everyone needs to upskill.

sgt1014y ago

I am really puzzled by "production is a spectrum". Production means that the code is run with a support team to an SLA: the support team must have accepted it into service and be confident that they can deal with what might go wrong.

That's production.

alexnewman4y ago

I’ve heard a lot about people who don’t want to learn the stack they program on.

fithisux4y ago

Sooner or later DSes will need to become Full-Stack. Knowing Kubernetes will be an advantage.

streetcat14y ago

There is a reason that operating systems is a mandatory course in any respectable CS program. Kubernetes is no different.

Data scientists should know about Kubernetes as much as they should know how to program.
