Building a data team at a mid-stage startup

(erikbern.com)

607 points · squarecog · 4y ago · 89 comments

74 comments · 28 top-level

IMTDb · 4y ago · 17 in thread

What would be the name of the position/profile of someone in charge of building the data warehousing architecture/ETL pipelines?

In my view, they need to make sure the warehouse model is a correct representation of the business and that it can be leveraged to answer basic or not-so-basic questions using SQL. They also need to promote its usage internally by ensuring it is accessible and easy to use, and guide other teams to a more data-oriented mindset.

I feel that this is a specialised position not exactly similar to a developer, but every time I look for "data scientist" I get guys that want to do machine learning prediction models, which is not exactly the same stuff either.

Orou · 4y ago

I would also vote for "data engineer" (it's my current job title).

You very likely don't want a data scientist to be doing a data engineer's job (and they probably don't want to be doing it themselves!). While there are similarities, data engineering tends to be a lot closer to software development than data science. If you're advertising for a data scientist role, don't expect them to be happy if 80% of their job is writing ETL scripts and cleaning datasets.

I think the reason there has been a flattening in data scientist job growth more recently is that lots of companies hired data scientists to build cool ML applications but had no infrastructure in place to support advanced data analysis. These companies didn't realize they needed to walk before they could run, and that what they really wanted was data analysts and engineers to build the foundation for a strong data science function.

Tools like dbt have been great for advancing an ELT approach to managing data pipelines, where modeling for BI tools, business users, and data scientists alike can all happen in the warehouse and ensure consistency in data usage across the company.

ramraj07 · 4y ago

The one issue is that the gamut of experience and ability in a data engineer (and the salaries) is extremely wide, far wider than I’ve seen for any other role. Hiring a good DE is so hard!

dijksterhuis · 4y ago

Seconded.

I was a bit sad to not see any mention of a data engineer anywhere in the article.

Like, if you gave me access to all the prod tables and the warehouse I'd be having a whale of a time and (hopefully) delivering enough business value to automate some of the more regular "English to SQL" translations.

> You very likely don't want a data scientist to be doing a data engineer's job.

100%. This is one of those things that would make "disgruntled ML people" in the article want to leave.

rickeydidio · 4y ago

This is spot on. As someone who has been looking for a data analyst role, I’ve actually read quite a few DS reqs that were geared more towards infrastructure and ETL. Then the flip side with the DE reqs wanting NumPy and Pandas along with the infrastructure and ETL. Weird, right?

sails · 4y ago

IMO data engineer roles are further subset into:

1. kafka / streaming oriented software engineering

2. data warehouse and ETL/ELT development for analytics

teej · 4y ago

A new role has arisen in the last few years that captures much of this responsibility - Analytics Engineer.

This article by Claire Carroll describes the role and motivation for it https://www.getdbt.com/what-is-analytics-engineering/

hobs · 4y ago

I currently do that job as a Data Architect - kind of a mouthful lol but it covers the gamut of understanding the entire business as an abstract set of data flows, being responsible for the ingest and outflows of data, the level of quality in our overarching system, managing data engineers, developers, business folks all accessing said data, at the end of the day explaining what it all means to our clients and devs via standard modeling stuff and more targeted things as needed.

edmundsauto · 4y ago

You mention that you manage data engineers. Where does your role not overlap w/ a data eng?

sischoel · 4y ago

What about "data engineer"? There seem to be a lot of jobs for that title nowadays.

skrtskrt · 4y ago

Yeah we would call this Data Engineer (likely Senior level or up for someone that has had experience building multiple data warehouses) plus the DevOps/SRE work required to stitch all the architecture together

marcinzm · 4y ago

You're mixing up two different tasks as I see it:

* Building/defining the data infrastructure

* Building/defining the schemas

In a traditional ETL infrastructure they are jumbled together but if you do ELT they are not. A data engineer can build the infrastructure but the transformations can be handled better by technical analysts. They're simply one view on the underlying data so the risk is minimal. Analysts query the data day in and day out so they know much better what they need than someone who doesn't.
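
A minimal sketch of the ELT split described above, using Python's built-in sqlite3 as a stand-in for a warehouse. The table `raw_orders`, the view `orders_clean`, and all columns are hypothetical names chosen for illustration; the point is only that the transformation lives in a view over untouched raw data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "EL": land the production rows as-is, with no cleanup (hypothetical table).
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 400, "refunded"), (3, 9900, "paid")],
)

# "T": the transformation is just a view over the raw data, so an analyst
# can change or drop it without touching the loaded rows.
conn.execute(
    """
    CREATE VIEW orders_clean AS
    SELECT id, amount_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE status = 'paid'
    """
)

rows = conn.execute("SELECT id, amount_usd FROM orders_clean ORDER BY id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(rows, raw_count)
```

Because the view never rewrites `raw_orders`, a wrong transformation can be replaced without reloading anything, which is roughly why the risk is described as minimal.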

sjg007 · 4y ago

The bigger issue is adaptability: can you migrate schemas while preserving older clients? Typically that’s done by providing decent middleware. SQL views are one way, APIs are another, etc.

All of that while improving performance.
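
One way the "SQL views as middleware" idea might look, sketched with Python's sqlite3. The migration itself (splitting a `full_name` column into two) and every table and column name here are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# New schema: the name has been split into two columns (hypothetical migration).
conn.execute(
    "CREATE TABLE customers_v2 (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.execute("INSERT INTO customers_v2 VALUES (1, 'Ada', 'Lovelace')")

# Compatibility view: exposes the old table name and the old full_name column,
# so existing client queries keep working unchanged.
conn.execute(
    """
    CREATE VIEW customers AS
    SELECT id, first_name || ' ' || last_name AS full_name
    FROM customers_v2
    """
)

# An "old client" query written against the previous schema.
legacy_rows = conn.execute("SELECT id, full_name FROM customers").fetchall()
print(legacy_rows)
```

The view only covers reads; old clients that write to the table would need something more, such as an API layer.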

mjirv · 4y ago

Analytics Engineer is a clear one for this, as teej said.

The title is strongly associated with the dbt community, so it could imply you’re using dbt for your data modeling (not necessarily a bad thing, as it sounds like it would be a good tool for your use case).

pram · 4y ago

I’ve done this for the past 6 years and my title was “Big Data Infrastructure Engineer” but I don’t think there’s any consistency at companies from what I’ve seen

edmundsauto · 4y ago

This is what data engineers do, although that is also used to describe data ops (maintaining clusters, running kafka, etc.)

herodoturtle · 4y ago

You pretty much described my job in a nutshell, and they call me "the database guy".

tmp_anon_22 · 4y ago

Most common would be a DevOps or SRE on an observability team.

zippy5 · 4y ago · 6 in thread

This was wonderfully written and if you're gonna start a data team, this is how you do it. But I can see that I’m the only one who thought it was crazy to start a data team in the first place.

This company makes 10M and spends 3M on the team and infrastructure to make data a core competency?

A vast majority of the wins discussed were poorly differentiated web / mobile / supply chain analytics, which they could have gotten and set up with 3rd party software for an order of magnitude cheaper.

I can only imagine what this hypothetical startup could have learned if they spent that money actually talking to customers, and running more experiments.

I’ve heard people talk about data as the new oil but for most companies it’s a lot closer to uranium. Hard to find people who can handle / process it correctly, nontrivial security/liabilities if PII is involved, expensive to store and a generally underwhelming return on effort relative to the anticipated utility.

My take away was that startups benefit tremendously from a data advisor role to get the data competency, as well as the educational and cultural benefits, but realistically the data infrastructure and analytics at that scale should have been bought not built. Obviously there are a couple of exceptions, such as regulatory reasons like HIPAA compliance, for which building in-house can be the right choice if no vendor fits your use case.

lifeisstillgood · 4y ago

As someone who reaches for code if they need to blow their nose, what is a 3rd party vendor going to supply that an “English-to-SQL translator” won't do?

(I have not finished the article, but the idea that devs / data scientists can be replaced by some vendors makes me wonder what I have missed)

Edit: Also love the Uranium quote :-)

zippy5 · 4y ago

So my assumption is that for a given business model, like e-commerce or a SaaS business, much of the highest value analysis is fairly standardized and can be templated. For example, breaking down conversion rate by weekly cohort is something that can pretty easily be done in Google Analytics.
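
To give a sense of how templated the weekly-cohort breakdown is, here is a minimal sketch in plain Python. The signup records and the helper names `week_start` and `conversion_by_weekly_cohort` are invented for the example and say nothing about how any analytics product computes it.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical signup records: (user_id, signup_date, converted)
signups = [
    (1, date(2021, 7, 5), True),
    (2, date(2021, 7, 6), False),
    (3, date(2021, 7, 9), True),
    (4, date(2021, 7, 12), False),
    (5, date(2021, 7, 14), True),
    (6, date(2021, 7, 15), False),
    (7, date(2021, 7, 18), False),
]


def week_start(d: date) -> date:
    """Return the Monday of the week containing d."""
    return d - timedelta(days=d.weekday())


def conversion_by_weekly_cohort(records):
    """Map each cohort (week of signup) to its conversion rate."""
    totals = defaultdict(int)
    conversions = defaultdict(int)
    for _user_id, signup_date, converted in records:
        cohort = week_start(signup_date)
        totals[cohort] += 1
        conversions[cohort] += int(converted)
    return {cohort: conversions[cohort] / totals[cohort] for cohort in sorted(totals)}


rates = conversion_by_weekly_cohort(signups)
for cohort, rate in rates.items():
    print(cohort.isoformat(), f"{rate:.0%}")
```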

The problem with English-to-SQL translators, or most coders in general, is the assumptions we make, in particular about the underlying data. For example, say we want to join two tables, so we write a query to join on two columns and often call it correct, which from a logical or schema perspective it is. However, null values, defaults like 0, many-to-one vs. one-to-one relationships, issues with instrumentation such as networking timeouts or bot detection, etc. can all impact the downstream metrics. My point is that when there are 500 lines of SQL in a query such as those mentioned in the article, there are a lot of ways to be mostly correct but cumulatively wrong.
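
A small illustration of two of those pitfalls (a NULL join key and a many-to-one relationship), sketched with Python's sqlite3. The `orders` and `touches` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount INTEGER);
    CREATE TABLE touches (user_id INTEGER, channel TEXT);

    -- Order 3 has a NULL user_id (e.g. an instrumentation gap).
    INSERT INTO orders VALUES (1, 10, 100), (2, 11, 50), (3, NULL, 70);
    -- User 10 has two marketing touches, so the join is many-to-one.
    INSERT INTO touches VALUES (10, 'email'), (10, 'ads'), (11, 'email');
    """
)

true_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Schema-correct join, but order 1 is counted twice and order 3 silently drops.
joined_revenue = conn.execute(
    """
    SELECT SUM(o.amount)
    FROM orders o
    JOIN touches t ON t.user_id = o.user_id
    """
).fetchone()[0]

print(true_revenue, joined_revenue)
```

The two errors partly cancel here (one order double-counted, another dropped), so the joined total looks plausible while still being wrong.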

Like many popular enough open source tools, 3rd party vendors get battle tested, issues get found before you hit them, and they can justify devoting more resources to rigorously ensuring correctness than the average analyst has the time or energy to do, because their business depends on you trusting the outputs.

I’m not saying you couldn’t do all this yourself. But given the sheer number of analytics tools that are reasonably priced, you might have chosen to spend your time on something more specialized like a recommendation system.

fouc · 4y ago

> spends 3M on the team and infrastructure

You're making a pretty big assumption on cost of team & infrastructure there. This company could have 100+ people with that kind of revenue (I've worked at a company this size before). The data team is only about 6 people. The cost of the data team & infrastructure is likely less than $1M

roenxi · 4y ago

Having unique data is quite valuable. If your organisation can make decisions based on signals that other people can't detect then it can gain a decisive edge.

I do wonder at the anecdotes in this article though. In businesses that I've seen, the data team is usually the biggest impediment to a data-driven culture because they have databases full of numbers and no real grasp of how that links to the decision making process that makes the business money.

Beefing up the team doesn't help. In data, as in business more generally, the important thing is to not try to guess what job you're doing and instead spend a lot of time talking to customers about what job they need done. If the data team is where that work happens in a business then that can be helpful - but the grunt work of SQL/reporting/basic analysis is almost never where the value comes from.

hitekker · 4y ago

> My take away was that startups benefit tremendously from a data advisor role to get the data competency, as well as the educational and cultural benefits, but realistically the data infrastructure and analytics at that scale should have been bought not built.

I really like your takeaway about data teams at tech companies. They try to make "data" a core competency of their business, at huge cost for fixed value.

I also appreciated the very subtle implication that the OP is shrouding empire building under an otherwise informative growth story.

chupchap · 4y ago

> it’s a lot closer uranium

Love this analogy!

czep · 4y ago · 6 in thread

This is so eerily familiar I swear I've had many of these exact conversations word for word. The only way this doesn't turn into a complete nightmare of a cluster is if the exec team "gets it". If so, you just might stand a chance at building a data team that gels with the rest of the org.

But if the exec team simply hired you for window-dressing, expect to be treated like a scapegoat and a punching bag. Any mistakes will be your fault. Any wins will be to the credit of the business. The Director of Product will ask to "embed" dedicated DS headcount and you won't have any real power to shape the roadmap. If the exec team doesn't give you equal footing with Product (or Marketing, Finance, and Eng for that matter) then this will rapidly become a soul-sucking job. However, if E-team does give you the authority to call Product's bullshit, and tell Finance to stuff it, and not take direction from Eng leads, then you actually might be able to accomplish something really cool.

PragmaticPulp · 4y ago

This applies to most specialties. Companies tend to have a few teams that lead the charge and expect everyone else to follow. Knowing which teams get the authority and which teams are along for the ride at a company is important for knowing what your job experience will look like.

> However, if E-team does give you the authority to call Product's bullshit, and tell Finance to stuff it, and not take direction from Eng leads

I know this was meant partially in jest, but if you reach the point where you're at odds with all of the teams and departments in the company you may get a lot done in the short term, but long term it's going to be difficult if you don't have some allies in each of those departments. Obviously no one should roll over and take orders from other departments, but sometimes it's necessary to do some give and take to build rapport. It's a balance, not a war.

czep · 4y ago

Thanks for the tips! One mantra I've tried when starting at a new job is "for the first 3 months say yes to everything, for the next 3 months say no to everything." The idea is you first immerse yourself in everything, to find out what works and what doesn't. Then you dedicate time to fix the broken processes so that hopefully when you hit 6 months your team is better positioned to be more efficient. Obviously you can't be too rigid, but it seemed to work for me when I had buy in. Curious if you think that approach sounds good.

nwsm · 4y ago

This was my only complaint about this great article. The CEO was innately "data-driven" which opened a lot of doors.

OTOH, if the execs don't have this priority, no one gets hired to lead and scale a data team and the story never starts.

marcinzm · 4y ago

In my experience much of this is a question of trust, political capital and soft power. Find out the problems that the key players in the business are actually having that you can solve and then solve them. Find out what the key KPIs are for the business and make a plan to improve them and then have a plan to publicize that improvement. And make sure to hire a team that covers your weaknesses rather than exposes them. Don't fight people if you can help it, either they're as competent as you on average or you shouldn't have taken the job. Figure out how to help them and what they need to work more efficiently and then give it to them. Sure there's a ton of politics involved in all of that but that's management in general.

WastingMyTime89 · 4y ago

> However, if E-team does give you the authority to call Product's bullshit, and tell Finance to stuff it, and not take direction from Eng leads, then you actually might be able to accomplish something really cool.

So what's the business case for having a data team independent of product, business and engineering?

Because as I see it the data team is a support function, not a core part of the business. I'm sure it can be cool for you but if you are at odds with all the people actually creating value, what exactly do you bring to the table?

higeorge13 · 4y ago

Engineering builds some schema, creates and uses multiple data stores, message queues, etc. Eventually the queries no longer work properly as the company scales and gets more and larger customers, along with hundreds of other issues. Doesn’t engineering need a proper data engineering team/DBA/you name it to handle those?

GlennS · 4y ago · 6 in thread

I liked this article, but I have two questions:

1. Is it definitely a good idea to build a separate data team, rather than embedding people with analytics knowledge in feature teams?

Is it possible to do the latter, but still end up with a well-curated source-of-truth for your data?

2. Is A/B testing and driving your business by metrics really a good idea?

My (uninformed) impression is that data-driven is responsible for rather a lot of rot:

- Extremely irritating websites.

- Businesses ignoring important things because they can't measure them. (Financialisation, hand-in-hand with the MBA types the author decries.)

dijksterhuis · 4y ago

> Is it possible to do the latter, but still end up with a well-curated source-of-truth for your data?

It's important to get the core centralised data infrastructure up and running (even if it's dirty af) as that helps with the bulk of the data work.

The oft quoted not completely true but kinda true statistic is that 70% of data work is finding, cleaning and storing the data. Analysis and modelling is the easy bit.

You could do it the other way around. Hire some data people in each team and get them to meet up every once in a while.

But I'd wager the central data stuff that makes everyone's life easier will get pushed back behind the "urgent" team work every time.

#ConwaysLaw

Edit: it's possible to do both btw. E.g. Have a bunch of centralised data engineers that do the heavy lifting stuff. With data scientist/analysts embedded in teams doing the fine grained modelling stuff. It's not a binary choice (once things are up and running).

> My (uninformed) impression is that data-driven is responsible for rather a lot of rot.

I agree! I was talking to someone else (not a tech head) the other week and realised why they hate tech so much... User interfaces that just... Don't work.

Showed him a terminal cli and he went nuts over it.

Then again, we're two kinda weird ye olde "back in my day" kinda people... So...

dgb23 · 4y ago

Interesting. I'm a bit of a hybrid CLI/GUI user. There are things that I find easier to do in a CLI (or with text in general) and things where a GUI is more natural.

CLIs are finicky and force you to think in terms of text, whether it is appropriate or not. GUIs can be more expressive and haptic, but are typically very idiosyncratic and can get in the way of things.

The data-driven approach to UI seems a bit crazy?

If I think about the problems of any UI, I think in terms of communication, intent, learning, psychology and aesthetics. All of those things are human to human or human to computer related issues.

I think data-driven (as in statistical data derived from user behavior) approaches are or can be useful in terms of "what" to present, prioritize and so on. But much less so on "how", because I think this should be based on experiences derived from direct interaction and needs to be induced by creativity.

And I mean creativity from both sides, the implementer and the user. One thing that CLIs generally do better is to provide composable tools within an adaptive and simple system (pipes, text etc.), whereas it is hard to impossible to let GUIs talk to each other and compose them into a user-tailored whole.

I think we should empower "non-technical" users with the freedoms and sound principles we have come to enjoy ourselves, instead of letting statistical data dominate their experience.

wheelinsupial · 4y ago

Is driving your business by the highest paid person’s opinion any different than driving it by A/B testing? I see those as two extreme end positions.

A/B testing can help you with optimizing existing processes for incremental improvement, but big bets, which can sometimes have data and sometimes don’t, help with step change improvements.

Even with big bets you need a way to show that it’s better than the previous way. Either by coming up with ways to cheaply test the hypothesis or committing to being “agile” (I hate that term) and continuing to iterate.

What is statistical significance anyways? If the p-value is 0.06 is that good enough? Practical significance is something that also needs to be accounted for.
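
To make the p-value question concrete, here is a minimal sketch of a two-sided, pooled two-proportion z-test in plain Python, with a separate check for practical significance. The traffic numbers and the `min_worthwhile_lift` threshold are made-up assumptions, and this is one common textbook test rather than the only reasonable choice.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical experiment: 5.0% vs 5.4% conversion on 20,000 visitors each.
lift, z, p_value = two_proportion_z_test(1000, 20000, 1080, 20000)

# Practical significance is a business judgment, not a statistical one:
# here we assume a lift below one percentage point is not worth shipping.
min_worthwhile_lift = 0.01
statistically_significant = p_value < 0.05
practically_significant = lift >= min_worthwhile_lift

print(f"lift={lift:.3%} z={z:.2f} p={p_value:.3f}")
print(statistically_significant, practically_significant)
```

With these assumed numbers the result lands just above the conventional 0.05 cutoff, and the lift would fall short of the assumed business threshold either way, which is the distinction the comment is drawing.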

If something can’t be measured, is there a way to find some proxy metric for it?

If not, then you can try to negotiate a pilot study of the problem and have specific criteria to determine success.

Just because something can’t be measured with existing processes doesn’t mean it can’t be measured at all.

For example, there were complaints about systems crashing and having intermittent behavior, and the claim was that’s affecting sales. Technology said nothing in our logs shows any issues, our service center shows no reporting of issues, so we think they are overreacting. We put a team together and went to several different locations to observe the process and get feedback. From the feedback we put together a data collection sheet and went back for a week to collect more data. That finally convinced the Tech team that it was a problem they needed to investigate. They went to the stores, determined it’s true, and amended logging to capture what’s truly going on.

alzaeem · 4y ago

I share the frustration with how many A/B testing driven development processes end up. Leads to a very iterative process with lots of small changes, rather than big bets. Also, trying to get statistical significance from iterative changes when you don’t have a ton of data is problematic.

iamacyborg · 4y ago

I think that’s just down to a lot of folks who think A/B testing is the answer to every problem not necessarily having a background in maths or stats. I see it all the time in marketing teams where people are so conditioned to think of testing as the default that they don’t understand what they’re doing or why.

tiagogm · 4y ago

In my experience AB testing has a time and place - and that is after a certain level of traffic load and product/feature maturity and only to "validate" certain hypotheses.

For low volumes of traffic, AB testing would take ages to yield significant results, and for products still maturing and taking shape there is a lot of "wisdom of crowds" data already available to help make decisions faster (i.e., do you really need an AB test to know that offering a timely promotion to users helps convert?)
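
A rough sketch of why low traffic makes this slow, using the standard normal-approximation sample-size formula for comparing two proportions. The baseline rate, target lift, and the 500 visitors per day are all assumed numbers for illustration.

```python
import math
from statistics import NormalDist


def visitors_per_arm(baseline, lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect an absolute lift."""
    p1, p2 = baseline, baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / lift ** 2)


# Hypothetical: 5% baseline conversion, hoping to detect a 0.5 point lift.
n = visitors_per_arm(baseline=0.05, lift=0.005)

# Hypothetical traffic: 500 visitors per day, split across both variants.
daily_visitors = 500
days_needed = math.ceil(2 * n / daily_visitors)

print(n, days_needed)
```

Under these assumptions the test needs roughly 31,000 visitors per variant, which is about four months of traffic; a bigger expected lift shrinks that quickly, since the requirement scales with one over the lift squared.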

If you've got a young product trying to grow fast, it's a lot more effective to rely on experienced product people and simple off-the-shelf analytics to iterate quickly and take some bets, so one day you get to a point where AB testing "optimisations" start to make sense.

It's quite an interesting topic! I agree with you too - A/B test driven sites tend to culminate in a terrible "cumulative experience" for users.

ttz · 4y ago · 4 in thread

> MBA types

I chuckled. Then cried, because at least his MBA types can use SQL. My MBA types use Excel.

OT: Good article. Like and agree with the push for centralizing data first, then building outwards so external teams can move towards self-service.

munk-a · 4y ago

Building a good process into your company to receive a query, execute it against a read-only database, and shovel the results back to the user as a CSV file will pay dividends and is, honestly, pretty trivial in most cases.

jaggederest · 4y ago

Blazer is my go-to for this kind of thing:

https://github.com/ankane/blazer

Pretty easy to set up and share queries, dashboards, whatever

ttz · 4y ago

Funnily enough, this is what I did, except I built an app where I write the queries as "pre-built" parameterized ones (sanitized, of course).
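
A sketch of what a "pre-built parameterized query" app could look like at its core, using Python's sqlite3. The `REPORTS` catalogue, the `run_report` helper, and the table are hypothetical and not a description of the app mentioned above.

```python
import sqlite3

# Hypothetical catalogue of pre-built reports: users pick a name and supply
# parameters, but never write SQL themselves.
REPORTS = {
    "orders_by_status": (
        "SELECT id, amount FROM orders WHERE status = :status ORDER BY id"
    ),
}


def run_report(conn, name, params):
    """Run a named report with user-supplied parameters."""
    sql = REPORTS[name]  # unknown report names raise KeyError
    # Named placeholders are bound by the driver, not spliced into the SQL text.
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100, "paid"), (2, 50, "refunded")],
)

paid = run_report(conn, "orders_by_status", {"status": "paid"})
# An injection attempt is treated as a literal status value and matches nothing.
injected = run_report(conn, "orders_by_status", {"status": "paid' OR '1'='1"})
print(paid, injected)
```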

People still do a bunch of stuff in Excel, though, and every once in a while, it breaks, and I have to dig through the mess. Excel is great when it's just for yourself and you can manage it... it's a pain when others have to figure out someone else's.

herodoturtle · 4y ago

I'm an MBA type who studied math and computer science, and programs distributed database solutions for a living.

I chuckled too.

plank_time · 4y ago · 2 in thread

This is probably the single best-written and most realistic article I’ve read on HN ever, and I’ve been on HN for a long long time. It’s so realistic I wonder if the author took it from his diary or something. Everything about it is supersaturated with authenticity and teaches better than any other article I’ve read. Kudos to the author, and I would love to see this style of article take off.

maileslin · 4y ago

Erik is a legend in the modern data world. Wrote Luigi and built Spotify's first recommendation engine. He has the ground-level experience to lean on.

alexpetralia · 4y ago

His post on Berkson's Paradox is excellent!

jabagonuts · 4y ago · 2 in thread

Really enjoyed this narrative, but what about the next phase? Going from mid-stage to mature startup?

> Note that you took on a lot of “tech debt” earlier when you started dumping the production database tables straight into the data warehouse.

How do you manage expectations when the year-long honeymoon is over, the business grows tremendously, and the centralized data warehouse reaches a breaking point?

neighbour · 4y ago

Also thought this. Let's hope the author has a SQL in the works as I am keen to hear more.

xpe · 4y ago

I won't be able to REST until I hear the endpoint of this epic saga.

Artgor · 4y ago · 1 in thread

When I started reading this article, I thought that it would be a sad story about another startup failure. The blogpost turned out to be a fascinating story of success. I really liked it.

But after I finished reading it, I realized that it is a sad story if we look at it through the eyes of the data scientists on the team. People were hired to do cool machine learning projects, but it turned out there was no infrastructure for them. After the new boss arrived, they had to work as analysts for months. What is sadder: the new boss dangled a carrot before them several times, but each time the carrot disappeared.

machinelearning · 4y ago

Very interesting perspective. As an early-mid stage startup, you definitely want to invest in generalists who are able to build infrastructure before hiring specialized ICs.

I honestly had flashbacks when the author mentioned the carrot dangling thing. I’ve personally experienced this and as a naive early career swe, I gave the manager the benefit of doubt for a year even though I knew there was no way they could guarantee it. This is just pure manipulation.

The worst part is that he wrote the job description himself and resorted to manipulation to cover up his mistake of hiring for the wrong job role.

civilized · 4y ago · 1 in thread

Wow, a story where things start out a mess and end up a lot better! Can we write one of these for society too?

xpe · 4y ago

There are bright spots. You might enjoy this book: Radical Equations: Civil Rights from Mississippi to the Algebra Project by Bob Moses [1]

See also: The Algebra Project https://algebra.org/wp/

[1]: https://en.wikipedia.org/wiki/Bob_Moses_(activist)

waynesonfire · 4y ago · 1 in thread

TLDR, refine your thoughts.

oliv__ · 4y ago

Refine your mind

plaidfuji · 4y ago

So many gems in this article…

> You notice a lot of the code starts with very complicated preprocessing steps, where data has to be fetched from many different systems. There appears to be several scripts that have to be run manually in the right order to run some of these things.

> “We need to focus on delivering business value as quickly as possible”, you say, but you add that “we might get back to the machine learning stuff soon… let's see”.

So so relatable. But the key insight is a really really key insight.

> What I think makes most sense to push for is a centralization of the reporting structure, but keeping the work management decentralized. Why? Primarily because it creates a much tighter feedback loop between data and decisions. If every question has to go through a central bottleneck, transaction costs will be high. On the other hand, you don't want to decentralize the management. Strong data people want to report into a manager who understands data, not into a business person.

I have the same role at a non-software company, and to me this is nothing short of a complete reimagining of IT. It’s not just, “make sure everyone’s computer works and help them install software,” it’s, “build a model of the business, determine what information flows and metrics are crucial to success, and build an IT and analysis infrastructure around that model.” The CIO will soon be better thought of as the Chief Optimization Officer.

gumby · 4y ago

Great article. The confusion about what team does what is priceless...yet so common!

To provide some sympathy for the folks already working there: you always replace systems well after you've overrun them.

When the ad hoc system works (consider that google spreadsheet at a time when there were three support people and perhaps a dozen customers) you're not going to decide to replace it with something more complicated. Then you're busy growing so you just keep the system going through sheer force of will. You only replace it when the effort is unbearable; at that point you say, frustratedly, "I wish we'd done this sooner."

correlator · 4y ago

Thank you for writing this. I personally just walked into a very similar role and this rang really true. This article made me realize how much more effort I need to put into the data culture side of the role.

simonw · 4y ago

"This is basically a (somewhat cynical) depiction of things that may happen at a lot of companies early in the data maturity stage"

I don't think this is very cynical at all! Feels pretty accurate to me.

herodoturtle · 4y ago

For the last 15 years I've been building (what I consider to be) accessible database solutions, for a bunch of different industries.

This sentence from the article resonated with me:

> You're starting to lay the most basic foundation of what is most critically needed: all the important data, in the same place, easily queryable.

roystonvassey · 4y ago

This is a perfect encapsulation of my career as a data-guy square peg in a round hole, filled with jargon and misplaced understanding of data in general.

Despite all that you read and hear about data science advancing, you’ll be surprised to see how poorly leveraged it is or, worse, how billions of dollars are sought to implement the latest tool that promises to change the world. Tech and data as we imagine them to be at FAANG-type companies are far different from how they are in older industries. It’s not just systems that need upgrading; company cultures do too, and that’s never an easy or fast process. I’ve been in the data analytics space for 16 years now and I still feel, more often than not, that I’m part of the minority, working to demonstrate true data use-cases.

cobertos · 4y ago

Part of me wonders what the long term of a transition like this looks like. Would this company be able to keep its data consumption healthy, or would it drive product changes that might harm its users or lead to dark patterns?

te_chris · 4y ago

This is a good write-up, but for the sort of insights they’re getting they’re overstaffed and overpaying. A combination of a cloud DW (BigQuery, e.g.), cloud ETL (Stitch, Fivetran) and dbt for the T in ELT to build useful reporting tables, along with some sort of SQL-based BI (Mode, in our case), could deliver the same insights for a fraction of the price. Throw in a sub to Heap or similar for ad-hoc product analytics as a cherry on top.

I concede, of course, that they’re rescuing a bad situation, not starting from scratch, but still.

neighbour · 4y ago

Excellent article. For me, the timing couldn't be better as I am about to step into a role not too dissimilar to the one described in the piece. It will be interesting to see if I run into many of the situations the author describes.

AtNightWeCode · 4y ago

I really enjoyed reading this. Very well written. At the companies I've worked at, teams can never read data from the DW, btw.

My experience with A/B tests is that they are way overrated.

On the poor data quality: say you sit on a product like a call center. Frontend developers think it is an excellent idea to store all data in some doc db blob. Then the business wants stats about the number of calls based on users...

Be careful when putting tabular data into doc dbs.
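
A small sketch of the kind of pain this can cause, using JSON strings in plain Python to stand in for document-store blobs. The records and the drifting field names (`user` vs `userId`) are invented; the point is that every downstream count has to rediscover and patch over what a table schema would have enforced.

```python
import json
from collections import Counter

# Hypothetical call records stored as opaque JSON blobs; field names drifted
# over time ("user" vs "userId"), which a table schema would have prevented.
blobs = [
    '{"user": "ann", "duration": 120}',
    '{"userId": "ann", "duration": 30}',
    '{"user": "bob", "duration": 45}',
    '{"duration": 10}',
]


def calls_per_user(raw_blobs):
    """Count calls per user, tolerating the known field-name variants."""
    counts = Counter()
    for raw in raw_blobs:
        doc = json.loads(raw)
        user = doc.get("user") or doc.get("userId") or "<unknown>"
        counts[user] += 1
    return dict(counts)


per_user = calls_per_user(blobs)
print(per_user)
```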

babublacksheep · 4y ago

Extremely relatable content throughout. Especially around teams beating their own drums while the CEO asks questions about metrics. ;)

Will wait for a follow-up post on how decentralised data teams created data silos and how we solve it using data discovery and data standardisation. :P

Disclaimer: I have built decentralised data teams and it scales well.

tsrez4y ago

It's such an interesting and valuable article on building a data team, esp. insightful for organisations starting out. Guess the challenges in traditional/larger companies starting out a data team might look slightly different.

soumyadeb4y ago

Such a great read. Have been in this position in a large public org. Over a year was spent just creating a catalog of all the data the company has and figuring out how to pull it into a data warehouse.

spicyramen4y ago

Can correlate, the author is truly a genius. We had a company mandate to be ML first, we went through a lot of phases and so many conversations happened as described in this amazing piece. Thanks Erik.

mindvirus4y ago

This is a wonderful article, thank you for sharing. I really like the narrative of bringing people with you on the journey, and celebrating the small wins that lead to a good long term outcome.

oliv__4y ago

No snark implied but what a great ad for the author!

This was very fun to read, and an interesting window into the processes and inner workings of a startup that size.

div3rs34y ago

Done well (like here), The Goal-like storytelling is both educational and interesting.

nerdponx4y ago

This is an incredibly valuable writeup. Great job.

IMTDb4y ago· 17 in thread

What would be the name of the position/profile of someone in charge of building the data warehousing architecture/ETL pipelines?

Orou4y ago

I would also vote for "data engineer" (it's my current job title).

ramraj074y ago

The one issue is that the gamut of experience and ability in a data engineer (and the salaries) is extremely wide, far wider than I’ve seen for any other role. Hiring a good DE is so hard!

dijksterhuis4y ago

Seconded.

I was a bit sad to not see any mention of a data engineer anywhere in the article.

> You very likely don't want a data scientist to be doing a data engineer's job.

100%. This is one of those things that would make "disgruntled ML people" in the article want to leave.

rickeydidio4y ago

sails4y ago

IMO data engineer roles are further subdivided into:

1. Kafka / streaming oriented software engineering

2. data warehouse and ETL/ELT development for analytics

teej4y ago

A new role has arisen in the last few years that captures much of this responsibility - Analytics Engineer.

This article by Claire Carroll describes the role and motivation for it https://www.getdbt.com/what-is-analytics-engineering/

hobs4y ago

edmundsauto4y ago

You mention that you manage data engineers. Where does your role not overlap w/ a data eng?

sischoel4y ago

What about "data engineer"? There seem to be a lot of jobs for that title nowadays.

skrtskrt4y ago

marcinzm4y ago

You're mixing up two different tasks as I see it:

* Building/defining the data infrastructure

* Building/defining the schemas

sjg0074y ago

The bigger issue is adaptability: can you migrate schemas while preserving older clients? Typically that’s done by providing decent middleware. SQL views are one way, APIs are another, etc.

All of that while improving performance.
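
A minimal sketch of the SQL-view approach, assuming a hypothetical `users` table whose `full_name` column was split into two columns in a new `users_v2` table (every table and column name here is invented for illustration). It uses Python's built-in `sqlite3` module, so it is only a toy; a real warehouse would have its own view syntax and performance trade-offs:

```python
import sqlite3

# In-memory database, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- New schema (hypothetical): the name is now split into two columns.
    CREATE TABLE users_v2 (
        id INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name TEXT
    );
    INSERT INTO users_v2 VALUES (1, 'Ada', 'Lovelace');

    -- Compatibility view exposing the old shape under the old name,
    -- so older clients keep working without changing their queries.
    CREATE VIEW users AS
        SELECT id, first_name || ' ' || last_name AS full_name
        FROM users_v2;
""")

# An "older client" query written against the pre-migration schema.
legacy_rows = conn.execute("SELECT id, full_name FROM users").fetchall()
print(legacy_rows)
```

The old query keeps working while new code reads `users_v2` directly; the view can be dropped once the older clients have migrated.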

mjirv4y ago

Analytics Engineer is a clear one for this, as teej said.

pram4y ago

I’ve done this for the past 6 years and my title was “Big Data Infrastructure Engineer”, but I don’t think there’s any consistency across companies, from what I’ve seen.

edmundsauto4y ago

This is what data engineers do, although that title is also used to describe data ops (maintaining clusters, running Kafka, etc.)

herodoturtle4y ago

You pretty much described my job in a nutshell, and they call me "the database guy".

tmp_anon_224y ago

Most common would be a DevOps or SRE on an observability team.

zippy54y ago· 6 in thread

This was wonderfully written and if you’re gonna start a data team, this is how you do it. But I can see that I’m the only one who thought it was crazy to start a data team in the first place.

This company makes 10M and spends 3M on the team and infrastructure to make data a core competency?

The vast majority of the wins discussed were poorly differentiated web / mobile / supply chain analytics, which they could have gotten and set up with 3rd party software for an order of magnitude less.

I can only imagine what this hypothetical startup could have learned if they spent that money actually talking to customers, and running more experiments.

lifeisstillgood4y ago

As someone who reaches for code if they need to blow their nose, what is a 3rd party vendor going to supply that “English-to-SQL translators” won’t?

(I have not finished the article, but the idea that devs / data scientists can be replaced by some vendors makes me wonder what I have missed)

Edit: Also love the Uranium quote :-)

zippy54y ago

fouc4y ago

> spends 3M on the team and infrastructure

roenxi4y ago

Having unique data is quite valuable. If your organisation can make decisions based on signals that other people can't detect then it can gain a decisive edge.

hitekker4y ago

I really like your takeaway about data teams at tech companies. They try to make "data" a core competency of their business, at huge cost for fixed value.

I also appreciated the very subtle implication that the OP is shrouding empire building under an otherwise informative growth story.

chupchap4y ago

> it’s a lot closer to uranium

Love this analogy!

czep4y ago· 6 in thread

PragmaticPulp4y ago

> However, if E-team does give you the authority to call Product's bullshit, and tell Finance to stuff it, and not take direction from Eng leads

czep4y ago

nwsm4y ago

This was my only complaint about this great article. The CEO was innately "data-driven" which opened a lot of doors.

OTOH, if the execs don't have this priority, no one gets hired to lead and scale a data team and the story never starts.

marcinzm4y ago

WastingMyTime894y ago

So what's the business case for having a data team independent of product, business and engineering?

higeorge134y ago

GlennS4y ago· 6 in thread

I liked this article, but I have two questions:

1. Is it definitely a good idea to build a separate data team, rather than embedding people with analytics knowledge in feature teams?

Is it possible to do the latter, but still end up with a well-curated source-of-truth for your data?

2. Is A/B testing and driving your business by metrics really a good idea?

My (uninformed) impression is that data-driven is responsible for rather a lot of rot:

- Extremely irritating websites.

- Businesses ignoring important things because they can't measure them. (Financialisation, hand-in-hand with the MBA types the author decries.)

dijksterhuis4y ago

> Is it possible to do the latter, but still end up with a well-curated source-of-truth for your data?

It's important to get the core centralised data infrastructure up and running (even if it's dirty af) as that helps with the bulk of the data work.

The oft-quoted, not-completely-true-but-kinda-true statistic is that 70% of data work is finding, cleaning and storing the data. Analysis and modelling is the easy bit.

You could do it the other way around. Hire some data people in each team and get them to meet up every once in a while.

But I'd wager the central data stuff that makes everyone's life easier will get pushed back behind the "urgent" team work every time.

#ConwaysLaw

> My (uninformed) impression is that data-driven is responsible for rather a lot of rot.

I agree! I was talking to someone else (not a tech head) the other week and realised why they hate tech so much... User interfaces that just... Don't work.

Showed him a terminal cli and he went nuts over it.

Then again, we're two kinda weird ye olde "back in my day" kinda people... So...

dgb234y ago

Interesting. I'm a bit of a hybrid CLI/GUI user. There are things that I find easier to do in a CLI (or with text in general) and things where a GUI is more natural.

The data-driven approach to UI seems a bit crazy?

If I think about the problems of any UI, I think in terms of communication, intent, learning, psychology and aesthetics. All of those things are human to human or human to computer related issues.

I think we should empower "non-technical" users with the freedoms and sound principles we have come to enjoy ourselves, instead of letting statistical data dominate their experience.

wheelinsupial4y ago

Is driving your business by the highest paid person’s opinion any different than driving it by A/B testing? I see those as two extreme end positions.

A/B testing can help you with optimizing existing processes for incremental improvement, but big bets, which can sometimes have data and sometimes don’t, help with step change improvements.

What is statistical significance anyways? If the p-value is 0.06 is that good enough? Practical significance is something that also needs to be accounted for.

If something can’t be measured, is there a way to find some proxy metric for it?

If not, then you can try to negotiate a pilot study of the problem and have specific criteria to determine success.

Just because something can’t be measured with existing processes doesn’t mean it can’t be measured at all.
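
To make the statistical versus practical significance point concrete, here is a rough sketch of a pooled two-proportion z-test using only the Python standard library. The conversion counts are made up for illustration, and a real experiment analysis would likely need more care (sequential testing, multiple comparisons, and so on):

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, two-sided p-value) for a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, p_value


# Hypothetical small test: a 2.5-point lift on 1,000 users per arm.
small_lift, small_p = two_proportion_z_test(100, 1_000, 125, 1_000)

# Hypothetical huge test: a 0.09-point lift on 1,000,000 users per arm.
large_lift, large_p = two_proportion_z_test(100_000, 1_000_000, 100_900, 1_000_000)

print(f"small test: lift={small_lift:.4f}, p={small_p:.3f}")
print(f"large test: lift={large_lift:.4f}, p={large_p:.3f}")
```

With these invented numbers, the small test shows the larger lift but lands above the usual 0.05 cutoff, while the huge test clears the cutoff with a lift that may be too small to matter. Which one is "good enough" is a business call, not something the p-value decides.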

alzaeem4y ago

iamacyborg4y ago

tiagogm4y ago

In my experience AB testing has a time and place - and that is after a certain level of traffic load and product/feature maturity and only to "validate" certain hypotheses.

It's quite an interesting topic! I agree with you too - A/B-test-driven sites tend to culminate in a terrible "cumulative experience" for users.

ttz4y ago· 4 in thread

> MBA types

I chuckled. Then cried, because at least his MBA types can use SQL. My MBA types use Excel.

OT: Good article. Like and agree with the push for centralizing data first, then building outwards so external teams can move towards self-service.

munk-a4y ago

jaggederest4y ago

Blazer is my go-to for this kind of thing:

https://github.com/ankane/blazer

Pretty easy to set up and share queries, dashboards, whatever

ttz4y ago

Funnily enough, this is what I did, except I built an app where I write the queries as "pre-built" parameterized ones (sanitized, of course).

herodoturtle4y ago

I'm an MBA type that studied math and computer science, and programs distributed database solutions for a living.

I chuckled too.

plank_time4y ago· 2 in thread

maileslin4y ago

Erik is a legend in the modern data world. Wrote Luigi and built Spotify's first recommendation engine. He has the ground-level experience to lean on.

alexpetralia4y ago

His post on Berkson's Paradox is excellent!

jabagonuts4y ago· 2 in thread

Really enjoyed this narrative, but what about the next phase? Going from mid-stage to mature startup?

> Note that you took on a lot of “tech debt” earlier when you started dumping the production database tables straight into the data warehouse.

How do you manage expectations when the year-long honeymoon is over, the business grows tremendously, and the centralized data warehouse reaches a breaking point?

neighbour4y ago

Also thought this. Let's hope the author has a SQL in the works as I am keen to hear more.

xpe4y ago

I won't be able to REST until I hear the endpoint of this epic saga.

Artgor4y ago· 1 in thread

When I started reading this article, I thought that it would be a sad story about another startup failure. The blog post turned out to be a fascinating story of success. I really liked it.

machinelearning4y ago

Very interesting perspective. As an early-to-mid-stage startup, you definitely want to invest in generalists who are able to build infrastructure before hiring specialized ICs.

The worst part is that he wrote the job description himself and resorted to manipulation to cover up his mistake of hiring for the wrong job role.

civilized4y ago· 1 in thread

Wow, a story where things start out a mess and end up a lot better! Can we write one of these for society too?

xpe4y ago

There are bright spots. You might enjoy this book: Radical Equations: Civil Rights from Mississippi to the Algebra Project by Bob Moses [1]

See also: The Algebra Project https://algebra.org/wp/

[1]: https://en.wikipedia.org/wiki/Bob_Moses_(activist)

waynesonfire4y ago· 1 in thread

TLDR, refine your thoughts.

oliv__4y ago

Refine your mind

plaidfuji4y ago

So many gems in this article…

> “We need to focus on delivering business value as quickly as possible”, you say, but you add that “we might get back to the machine learning stuff soon… let's see”.

So so relatable. But the key insight is a really really key insight.

gumby4y ago

Great article. The confusion about what team does what is priceless...yet so common!

To provide some sympathy for the folks already working there: you always replace systems well after you've overrun them.

correlator4y ago

simonw4y ago

"This is basically a (somewhat cynical) depiction of things that may happen at a lot of companies early in the data maturity stage"

I don't think this is very cynical at all! Feels pretty accurate to me.

herodoturtle4y ago

For the last 15 years I've been building (what I consider to be) accessible database solutions, for a bunch of different industries.

This sentence from the article resonated with me:

> You're starting to lay the most basic foundation of what is most critically needed: all the important data, in the same place, easily queryable.

roystonvassey4y ago

This is a perfect encapsulation of my career as a data-guy square peg in a round hole, filled with jargon and misplaced understanding of data in general.
