The other solution is that devs operate the apps themselves. This is infeasible with a traditional VM setup because managing VMs effectively involves tons of specialist knowledge and it’s unreasonable to expect dev teams to master it while also being expert developers.
Enter Kubernetes. Now you have a core DevOps/SRE team managing the “platform” (the Kubernetes cluster and various add-ons such as operators) which gives the application developers a high-level interface for operating their applications. They need to know a bit of Kubernetes, but it’s a whole lot less than mastering the traditional VM-based ops skill set. Moreover, as the Kubernetes ecosystem continues to mature, the surface area with which developers interact becomes smaller.
I've personally changed my opinion on this in the last ~2 years. Observing at work what it takes for people to stand up and manage a Kubernetes platform, it really just feels like an incredible waste that we have hundreds or thousands of SREs across our industry all building their own unique compute platforms when the public cloud vendors have already done that work.
The Serverless paradigm just seems fundamentally superior to me, but it also seems to inherently require vendors to be very opinionated in order to provide a good developer experience, which AWS is not ... at least not yet anyway.
That's not to say everyone has to be an expert either - there's a place for experts to optimise setups etc too.
This is silly. Most devs have too many other things they know they don't know to also add on something like Kubernetes.
(OT) ^^ somehow related comic: https://ibb.co/JktgqSV
Hm - maybe you shouldn’t need to, but why wouldn’t you want to? Even if it’s not strictly your job/responsibility, it’s always helpful to know how things work when things go wrong.
Despite our best efforts, we have yet to abstract away the runtime environment. Despite Java’s best efforts.
but there are other failure modes -- if everyone on the team is great at writing efficient production code, but no one understands the business context or the problem domain or understands if the problem they're attempting to solve is even vaguely feasible from some kind of theoretical perspective (maybe someone with a decent statistics background could demonstrate the entire premise of the project is flawed, and needs to be rethought, using a blackboard and no computers at all), maybe they'll spend months or years building and deploying a lot of fast, beautiful, completely worthless machinery.
But I am curious where you have seen modern runtimes fail and where the code was the issue (not tweaks to the JVM settings); any concrete examples where well written, best practice code worked on the laptop but failed in k8s?
Not sure about OP, but most of the times I have seen devs have issues with Kubernetes, it was in tweaking the knobs around deployments, including security. Startup vs. readiness vs. liveness probes, rolling updates, auto-scaling, pod security policies and such are usually all new to developers, and have a lot of different options. Most devs just want "give me the one that works, with good defaults", and need a higher-level abstraction.
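To make the "one that works, with good defaults" idea concrete, here is a minimal sketch of a helper a platform team might hand to developers. The field names (startupProbe, readinessProbe, livenessProbe, httpGet, periodSeconds, failureThreshold) are the ones the Kubernetes container spec uses; the helper functions, the /healthz and /ready paths, the port, and the image name are all assumptions made up for illustration, and the numbers are not recommendations.

```python
import json


def http_probe(path, port=8080, period_seconds=10, failure_threshold=3):
    """Build an HTTP GET probe in the shape the Kubernetes container spec uses."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


def container_with_default_probes(name, image, port=8080):
    """Hypothetical helper: one container spec with all three probes filled in."""
    return {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Startup: allows up to 30 failed checks 5s apart before the container
        # is considered failed; the other probes wait until this one succeeds.
        "startupProbe": http_probe("/healthz", port, period_seconds=5, failure_threshold=30),
        # Readiness: a failing pod is taken out of Service endpoints, not restarted.
        "readinessProbe": http_probe("/ready", port),
        # Liveness: a failing container is restarted.
        "livenessProbe": http_probe("/healthz", port),
    }


# Hypothetical service name and image, for illustration only.
container = container_with_default_probes("model-api", "example.com/model-api:1.0")
print(json.dumps(container, indent=2))
```

The point is only that the three probes answer different questions (has it started, can it take traffic, is it stuck), so a wrapper can pick sensible values once instead of every developer choosing them per service.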
We develop on Windows and deploy to multiple kinds of scenarios and OS stacks, and I hardly have to care what lies underneath.
Same applies to .NET, although to a smaller extent, given its Windows focus until quite recently.
I think containers are a pretty good attempt at abstracting away runtime environments, no? Same docker image works on your local docker setup, docker-compose, vanilla kubernetes, managed kubernetes, fancy PaaS like CloudRun, Fargate, Heroku etc
There are people with a math background who think data science is just an extension of statistics, so business, knowledge of scalable information storage, and productization are irrelevant.
There are both kinds of posts here on HN. My take has been to hire math people with some CS MSc, CS people with a data science MSc, and business people who also know sales.
For me that has worked painlessly, but your mileage may vary. I haven’t seen that black swan CV capable in all three disciplines, but I have seen CVs that seem to think they can tackle every problem because they have read all the towardsds and kaggle tutorials. Marginalization? Kubeflow? POV? Two out of three are usually foreign concepts.
I've met quite a lot of Black Swans, and been employed alongside precisely zero.
I know one hard science PhD who runs their own K8s cluster at home and plays with Linux distros.
They describe themselves as "a statistician who can program."
Generally speaking it's more common for them to come from the math side of the fence. From the IT side I'll say the math is a bit harder than the computer stuff.
It's genuinely, 100% work ethic.
That's super awesome for that data scientist, but the question for a business is can/should you structure yourself in such a way that you NEED employees with that corner-case level of joint expertise.
The answer is you really can't. Individuals have awesome strengths that they developed for reasons particular to them. Use those strengths when you can. But the business has to rely on a common denominator of a role or else it'll never fill it when their unicorn leaves to go backpacking in Europe.
For the example you mentioned, I will use a simplification I make to explain levels of expertise in challenging knowledge:
1. ABOUT: Know about something (heard of it, know some examples).
2. KNOW: Know it well (I now understand it and can use it end to end to build something useful, and I also know its weaknesses).
3. HUMBLE: Realize I did not know many things about it, but now know many ways of using it and can correct and extend other people's work, most of the time.
4. EXPERT: Know why it was structured that way. Contribute to the knowledge/tool itself.
So for that PhD an initial estimate would be a 3 or 4 on the math scale and a 1 or 2 on the Kubernetes scale (I don't know him, of course, so I could be wrong without talking to him first). If he works independently, level 2 Kubernetes is pretty great. If he needs to be part of a larger support team, level 3 knowledge, based on my (admittedly back-of-the-napkin and ambiguous) categorization, might prove to be less risky.
Was it shooting questions from the hip on the spot while interviewing them?
Or you hired 10 people and worked with them for at least 6 months to really know what they are capable of?
I think the former, because no one has enough budget to hire people and stick with them for 6 months just to see how they fare.
So what is your N to back up your claim?
Because it sounds like you really have something to say.
My N is a few hundred, not all my personal hires. I have visibility because I now do project management office duties (building sub-teams per project), lead most of the interviews on the DS side, and have internal technical consulting duties. The ten you mentioned is approximately my target number of hires for last week and next week.
My claim is based on experience from the academic space and from consulting for a global corp (which included consulting for other corps to build their DS teams, rarely though). I hope my claim appears logical and is useful.
To be honest the landscape is constantly changing and people should learn as much as they can.
I call ignorance on these kinds of posts.
Not the hottest tech out there, but they have a long use-by date if that's your major concern.
If you believe k8s will just 'go away', you don't really have a good clue about what it tries to solve and are instead getting confused by its complexity. Having been around the block, I can see it sticking around for at least 10 years.
But forcing anyone, especially data scientists into a specific and quite complex tool of the day? Pass.
Tools are built by people that use them. If your team chooses to deploy their applications on a k8s stack, it's on them to own that and not treat it like a black box.
I'm completely against the entitled belief that a person 'shouldn't need to know how to <x>'.
I can stretch the example in many ways: 1) if you commit secrets into your source code, you shouldn't claim that 'a data scientist shouldn't need to know about secrets management' 2) if you build a data analysis script and leave it as an undocumented mess that has no unit tests and one day it breaks, you shouldn't claim that 'a data scientist shouldn't need to know about testing'
Oh cry cry, there's a tech that everyone is using but I don't want to learn it / I dislike doing that particular thing / working with that piece of tech.
Build your own damn tech stack/computer if you think you can do it better. Or ask in the job interview whether the team runs its data science platform on k8s, and decline the job if you dislike operating apps on it so much.
I’m really enjoying it! You did a great job on it.
I especially liked the chapter on networking since it was always something I was weak in.
The business leaders and managers trying to load Kubernetes work on data scientists are doing so because the managers don't know what they're doing, what they want or who they need to get it done. Instead, they have the one hire they got greenlit last year and if that person can't do EVERYTHING, your group is screwed.
Seems they want to replace a team with one person holding all the responsibilities.
I work on something called Pachyderm, which is a Kubernetes-based data storage and job execution system that tries to bridge this gap. We have a managed solution (https://hub.pachyderm.com) where we provision your Kubernetes cluster and do all the management (keeping the software up to date, authentication and authorization, etc.) and in fact don't even expose kubectl to you. You'll never see any of the Kubernetes stuff (though you might recognize certain error messages, I suppose). You just supply your code and a specification for how data flows around your pipelines, add your data, and we do the rest. Data scientists can interact with the versioned inputs and outputs through notebooks, but you're getting the full suite of production features behind the scenes -- a history of exactly which data inputs went into which data outputs, incremental processing, seamless autoscaling (set cpus: 8, gpus: 1 in your pipeline specification, and we find you a machine that meets that spec, add it to your cluster in less than a minute, schedule your work there, and remove the machine when the job finishes), etc.
Sorry for the sales pitch. I pretty much never use HN to shill my paid work, but it seems especially relevant to this sort of problem. Maybe you don't need the unicorn employee that is an expert in multiple fields -- focus on the data science and let us actually deal with the ugliness of computers ;)
(And if you do like Kubernetes but don't want to write your own orchestration system, Pachyderm itself is open source.)
Two teams causes an issue where scientists chuck models over the wall for the engineers to somehow rebuild into a semi-workable approach. The end result isn't great because you can't build good production models without taking production deployment into account. You also can't convert non-production models into production models without understanding the modeling assumptions that happened.
The general result is that the engineers and leadership find the results underwhelming to horrible. The scientists often don't care because what happens on the other side of the wall isn't their problem.
That doesn't mean everyone has to know everything but separating people into teams is not the answer. Have a single team with people of different focuses and areas of expertise.
There may not be enough work for two teams 100% of the time, but there sure is when TSHTF. Manufacturers understood the need for some slack, but software companies still haven’t figured this out.
- Process: we analyzed what worked and what didn't in past projects, continuously auditing and trying to extract learnings. We made sure the people we built for at the client organization were involved. We scoped more thoroughly. We involved, upfront, the parts of the client organization that could torpedo the project downstream (legal, security, etc). Made fewer assumptions. Listened more.
- Tooling: we built a machine learning platform[0] to make sure a data scientist doesn't tap on anyone's shoulder to troubleshoot their system, set up their computing environment, or deploy their model. They could do it themselves. Furthermore, it wasn't necessary to get people who could move across the stack.
Changing our processes and the way we do consulting had a huge impact. A badly scoped project will in some way or another create toil downstream and create a situation where you need people to do full-stack and you need "all-hands-on-deck" constantly. That's just bad, and after we ruthlessly reworked the process, we had better results, better relations with clients, better cadence, etc. I emphasize this because at some point we were a larger team running around working on so many projects simultaneously that everyone was practically burned out.
Although, technically, we added multi-cluster Kubernetes support: it was GKE only, and now it runs notebooks and workloads on AWS EKS, Azure AKS, and DigitalOcean as well. I'm not sure it's enough of an improvement according to the Show HN rules to re-submit. Plus I'm reworking the landing page and docs to add more clarity on what this thing does, with gifs showing RTC and all.
Do you have any feedback?
Just this week I have been experimenting with SageMaker and SageMaker Studio. Too early for a real evaluation, but it looks like SageMaker Studio hits many requirements: good for experimenting, run large distributed jobs, good model and code versioning tools, easy to publish REST APIs, etc. Just yesterday someone asked me to review 3rd party tools, and I look forward to getting a better understanding of how SageMaker Studio stacks up against turn-key systems.
I have built my career from standing on the shoulders of giants. I am not shy about just using the results in academic papers, using open source libraries, tools and frameworks, etc. that other people have written.
So, I agree with you that so much can be done on a single beefy VPS, but services and frameworks that allow easy use of multiple servers are also important.
So a desire to ship features regularly and preserve agility and quality is the “trendy” that the GP is talking about.
And how much less downtime would you have if domain experts were doing that part?
There is some fantastic tooling for machine learning.
Databricks, GCP, everyone knows it.
The issue is that the data industry was raised from birth in complete fear of the boogeyman.
The boogeyman is Oracle. And the frankly ridiculous things Oracle did in the bad old days.
Hence most places have a constant internal conflict between "look here are all these brilliant data science tools" and "ah shit, GCP costs a ton of money when some idiot runs a select * query on a join across 5 TB of data."
But there are plenty of great tools.
There is a section about Airflow and while the author doesn't advocate for it, I've very much like it many many times. People still recommend it, but I find it to be an absolute nightmare to deal with.
One thing I have learned dealing with different data science teams is something else though. I have gone through every single pipelining tool (including Pachyderm) and stream processing tool that was available at the time. The thing that people forget is that every single one of them has something that throws you off from what you actually want to accomplish, or has some sort of caveat for your use case.
The important thing to note is that the job of the architect, or whatever you want to call that person, is to provide an infrastructure where the data scientist can just run their code. And no matter which one of these environments you use, you still need to build glue code for your use case. Even if that glue code is a Python library with a good distribution mechanism.
Airflow's UX is just needlessly easter-eggy and bad. The one thing I'd want out of the dashboard is the list of recent job runs and whether they succeeded or failed, so of course that's hidden in such a way that a novice has to click 10 different places to find it. There's also the fact that they chose to call a timestamp "execution time" when it often doesn't correspond to the time the job is executed. Want to add parameters to your task? You better like hand-writing JSON or pasting it into a textbox because apparently that's a weird thing to do, so why bother adding any UI support for it.
Maybe they should stick to spreadsheets, or upskill a bit so they don't consume so much of the engineers' time.
So, if their time is better spent applying that knowledge rather than thinking about infrastructure trivialities, then by all means, pay an engineer to clean up. In the end, that's still more cost-efficient.
That being said, I refuse to believe that anyone leaving university today with a degree in stats/ML/econometrics etc. doesn't know git and cannot be taught programming good enough that it at least doesn't interfere with operations.
But as soon as you start requiring your experts to do infrastructure, you are either wasting money, or you hired a quote-unquote "data scientist" with a degree from medium.com and towardsdatascience.org or whatever - in which case by all means, require them to do engineering duties.
A few organizations are further along on that journey enabling their data scientists to focus on things other than process and tooling. Full-stack will be in demand until the solution space stabilizes and the bulk of organizations catch up.
Later on you can then build more specific teams or even more cross functional ones.
Of course, if you only want to test the waters and check if DS use cases are viable at all, consider getting a (few) freelancers and put a somewhat technically inclined person in charge. If that's a success, use it to get funding for a proper team.
But I want to point out a few things that are wrong in the article to help others evaluate Airflow.
> Second, Airflow’s DAGs are not parameterized, which means you can’t pass parameters into your workflows. So if you want to run the same model with different learning rates, you’ll have to create different workflows.
You can pass parameters to workflows by giving them a JSON config. When triggering a run from the UI, you can paste JSON with the right arguments/parameters for your DAG. So you can train a model with different arguments, etc.
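As a rough illustration of the trigger-time JSON approach: in Airflow the JSON supplied when triggering a run is exposed to tasks as `dag_run.conf`. The sketch below leaves Airflow out so it stays self-contained and only shows the task-side logic of merging that JSON over defaults. The parameter names, default values, and both functions are hypothetical.

```python
import json

# Hypothetical defaults used when the run is triggered without any config.
DEFAULTS = {"learning_rate": 0.01, "epochs": 10}


def resolve_params(trigger_conf_json, defaults=DEFAULTS):
    """Merge trigger-time JSON over the defaults, rejecting unknown keys."""
    overrides = json.loads(trigger_conf_json) if trigger_conf_json else {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


def train(params):
    """Stand-in for the real training step; it only reports what it would do."""
    return f"training with lr={params['learning_rate']} for {params['epochs']} epochs"


# The string below plays the role of the JSON pasted into the trigger dialog.
print(train(resolve_params('{"learning_rate": 0.001}')))
```

The same workflow definition then serves every learning rate; only the JSON passed at trigger time changes.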
> Third, Airflow’s DAGs are static, which means it can’t automatically create new steps at runtime as needed.
You can absolutely create new steps at run time. The point of Airflow is that everything is just Python code that is evaluated to generate DAGs; as long as you generate the DAGs and write the operator, it will happily run and log. It may have trouble rendering in the UI and cause some weird issues (when I last worked on this, tasks wouldn't advance after certain steps), but those are bugs.
You can write an operator, the operator in turn can initiate any other known operators, and point the next steps to those operators. Here is an example: https://stackoverflow.com/questions/41517798/proper-way-to-c...
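A small sketch of the "it's just Python code that generates the DAG" idea. This is not Airflow's API: a plain dict of task name to upstream tasks stands in for the DAG, and the standard library's `graphlib` stands in for the scheduler. It shows the graph's shape being produced from data when the definition code is evaluated; it does not show an operator spawning further operators in the middle of a run. Task names are made up.

```python
from graphlib import TopologicalSorter


def build_dag(learning_rates):
    """Return {task: set of upstream tasks}, with one training task per learning rate."""
    dag = {"extract": set()}
    train_tasks = []
    for lr in learning_rates:
        task = f"train_lr_{lr}"
        dag[task] = {"extract"}  # every training task waits for extraction
        train_tasks.append(task)
    dag["compare"] = set(train_tasks)  # fan-in: compare once all models are trained
    return dag


dag = build_dag([0.1, 0.01])
# A valid execution order: "extract" first, the training tasks next, "compare" last.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Changing the input list changes the number of steps without touching the definition code, which is the sense in which the graph is not fixed by hand.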
I cover this issue in some detail in a blog post from a few years back: https://rillabs.org/posts/workflows-dataflow-not-task-deps
Netflix is an AWS shop, so naturally we started with AWS integrations.
However, knowing a bit about cooking might make one a better waiter.
If you are indeed talking from a data scientist POV - then the right abstractions here are Dask and Ray Distributed.
Both can run on Kubernetes as the underlying orchestration layer - but are a pythonic interface to distributed data science primitives.
I'm not saying this is the majority of data scientists' jobs, but in some organisations I worked for, the data analyst was a guy who ran `SELECT MIN(v), MAX(v), AVG(v) FROM TableX` against a MySQL DB, so they were also in charge of DB administration and data ingestion; otherwise it would not have been a full-time job.
From the article: "involve two full sets of tools: one for the dev environment, and another for the prod environment"
This is what we think should change. We intend to bring dev and prod into a single cohesive environment. Initially it will be difficult to cover all types of production workloads (like the post mentioned, production is a spectrum). But what we've observed is that through container encapsulation we can create well defined production workloads that we can run on any container orchestrator while shielding the data scientists from that complexity during pipeline development _and_ deployment.
With a container first approach to DAGs it becomes trivial not just to mix library versions but even languages (e.g. feature extraction in Scala and model fitting in Python). In practice, this flexibility has resulted in a significant productivity increase because existing code "just works". No "one virtual environment to rule them all" necessary.
I like how the article does justice to the fact that there's a subtle yet important difference between mere workflow orchestrators and workflow orchestrators that take on meaningful responsibility when it comes to infrastructure. To really unburden the data scientist from having to be a full-stack unicorn you need to hide the underlying stack to the point where it's invisible. In that sense, the OS kernel analogy really works. Similarly, how many data analysts writing SQL have ever worried about database node sharding?
A big problem we see in the space is that there are still way too many leaky abstractions and data scientists end up dealing with architecture & config yet again, for many a task out of their depth. We hope to contribute to a better ecosystem, one where data scientists spend their time looking at the data, relating it to the domain, shipping value generating data pipelines/models, and communicating about results with their stakeholders. Not fighting config & infra.
As far as OP, how do you learn Docker without Kubernetes these days? To me this is like saying you don’t need to learn Windows because all you do is run the solver in Excel.
However, with the rise of DevOps culture, everyone should know the stack so they can use the platform effectively. Everyone needs to upskill.
That's production.
Data scientists should know about Kubernetes as much as they should know how to program.