How to Become a Data Engineer in 2021 (opens in new tab)

(khashtamov.com)

264 pointsadilkhash5y ago137 comments

137 comments

94 comments · 26 top-level

josephmosby5y ago· 12 in thread

(source for everything following: I recently hired entry-level data engineers)

The experience required differs dramatically between [semi]structured transactional data moving into data warehouses versus highly unstructured data that the data engineer has to do a lot of munging on.

If you're working in an environment where the data is mostly structured, you will be primarily working in SQL. A LOT of SQL. You'll also need to know a lot about a particular database stack and how to squeeze it. In this scenario, you're probably going to be thinking a lot about job-scheduling workflows, query optimization, data quality. It is a very operations-heavy workflow. There are a lot of tools available to help make this process easier.

If you're working in a highly unstructured data environment, you're going to be munging a lot of this data yourself. The "operations" focus is still useful, but at the entry level data engineer, you're going to be spending a lot more time thinking about writing parsers and basic jobs. If you're focusing your practice time on writing scripts that move data in Structure A in Place X to Structure B in Place Y, you're setting yourself up for success.

I agree with a few other commentators here that Hadoop/Spark isn't being used a lot in their production environments - but - there are a lot of useful concepts in Hadoop/Spark that are helpful for data engineers to be familiar with. While you might not be using those tools on a day-to-day basis, chances are your hiring manager used them when she was in your position and it will give you an opportunity you know a few tools at a deeper level.

dominotw5y ago

Agree 100% with this comment,

Old stack: Hadoop, spark, hive, hdfs.

New stack: kafka/kinesis, fivetran/stitch/singer, airflow/dagster, dbt/dataform, snowflake/redshift

disgruntledphd25y ago

Huh, what replaces Spark in those lists?

For my money, its the best distributed ML system out there, so I'd be interested to know what new hotness I'm missing.

5 more replies

sseppola5y ago

Can you elaborate more on the "roles" of the "new stack"? To me dbt/dataform and airflow/dagster are quite similar, so why do you need one of each? fivetran/stitch/singer are all new

2 more replies

fetanchaud5y ago

Just a thought : what about dremio ?

llbeansandrice5y ago

> I agree with a few other commentators here that Hadoop/Spark isn't being used a lot in their production environments

I guess I'm the odd-man out because that's all I've used for this kind of work. Spark, Hive, Hadoop, Scala, Kafka, etc.

josephmosby5y ago

I should have specified more thoroughly.

I am not seeing Spark being chosen for new data eng roll-outs. It is still very prevalent in existing environments because it still works well. (used at $lastjob myself)

However - I am still seeing a lot of Spark for machine-learning work by data scientists. Distributed ML feels like it is getting split into a different toolkit than distributed DE.

1 more reply

teddyuk5y ago

I'm also the odd one out, so many enterprises moving to spark on databricks.

Mauricebranagh5y ago

It does rather depends what sort of data I bet a data engineer at CERN or JPL has quiet a different set of required skills to say Google or a company playing at data science because its the next big thing.

I should imagine at CERN etc knowing which end of soldering iron gets hot might still be required in some cases.

I recall back in the mumble extracting data from b&w film shot with a high speed camera, by projecting it on to graph paper taped to the wall and manualy marking the position of the "object"

pjmlp5y ago

When I was there, almost 20 years ago, it was all about C++, Python and Fortran, with GUIs being done in Java Swing.

I bet it is still mostly the same, just using Web GUIs nowadays.

valenterry5y ago

We yet have to wait for the proper sweet spot: a language that allows SQL-like handling without the restrictions of SQL.

As many advantages as SQL has, in many cases it gets into the way. The closer you move to moving data (instead of doing analysis), the more it becomes annoying.

On the other hand, current languages (such as python) lack support when it comes to data transformations. Even Scala, which is one of the better languages for this, has severe drawbacks compared to SQL.

Hopefully better type-systems will help us out in the long term, in particular those with dependent types or similar power to describe data relations.

andkenneth5y ago

What's your opinion of LINQ in C#? It's been a while since I've used it but to me it seems like one of the most powerful ways to manipulate data inside of a language.

1 more reply

alexpetralia5y ago

Great points. It depends on where the business is at, the scale of their data, how processed their data is, and the timeliness/accuracy requirements of that data.

u678u5y ago· 10 in thread

Incidentally does anyone have resources for SMALL data? EG a few MB of a time, but requires the same ETL, scheduling, traceability. I'd love some lite versions of big-data tools but needs to be simple, small and cheap.

musingsole5y ago

Most orgs working with small data I've seen will just fall back to the tutorial version of some big data tools (often times just eating the unused infrastructure cost for something like Hadoop when they're generating biweekly reports). Most project managers have a dream of their project scaling up and want to be prepared should a dream become a reality. And if you "under-engineer" (by which I mean specifically engineer for the problem the company is facing), you'll get called out by every armchair developer for not going with the "obvious, best solution."

I'm not bitter; you're bitter. /s

u678u5y ago

Yeah I'm bitter. I want to bring structure, tools and discipline to small scale data gathering, but the big tools are just too time-consuming to get up and keep running with just a few hours.

kthejoker25y ago

Depending on what your sources and sinks are:

* Microsoft SSIS is still there, kind of a granddaddy tool but perfectly capable of single-machine ETL

* Trifacta's Wrangler has a free version with limits

* Talend's Open Studio is free, a little clunky but works fine

* Some new players that I've played around with are Airbyte (immature but evolving quick) and Fivetran (consumption-based pricing model, fairly extensible, but kind of biased about the sources/sinks they're interested in supporting)

* I haven't tried Streamsets or Stitch yet, but I've watched a few videos, again, a little more focused on cloud and streaming data sources than traditional batch ETL, but seem fair enough for those use cases as well

* If you want to roll your own SQL/Python/etc ETL, Airflow and Luigi are good and simple orchestrators/schedulers

The cloud services have pretty cheap consumption-based ETL PaaS offerings, too: Azure Data Factory, Amazon Glue, GCP Cloud Data Fusion

Unless what you're doing is highly bespoke ETL, I'd recommend trying out the new kids on the block and seeing if you can build pipelines that suit your needs from those, because they're at the forefront of a lot of evolving data architecture patterns that are about to dominate the 2020s.

hermitcrab5y ago

Take a look at our https://www.easydatatransform.com tool. It is a drag and drop data munging tool for datasets up to a few million rows. It runs locally on Windows or Mac. You should be able to install it and start transforming your data within a few minutes. It doesn't have a built in scheduler (yet), but you can run it from the command line.

Excel Power Query is also quite lightweight. But is pretty klunky in my (biased) opinion.

u678u5y ago

Thanks there are enough workstation tools, but I want an automated tool that runs on a server.

kfk5y ago

I have been working with small data few years. We built an internal library to move data in/out of systems and the schedule this into jobs. We mostly leverage S3 and Spectrum. The major complexities we found were in scheduling, proxies and fetching raw data from legacy applications.

imcoconut5y ago

No reason you cant still use airflow for the orchestration/depenendecy management/scheduling. And for the processing and storage - Pandas and sqlite.

Also would highly highly recommend looking into kedro (which has airflow integration, or you could just run your pipelines with crontab)

throwaway72815y ago

Take look at luigi, which is a lightweight task orchestrator with minimalistic dependencies.

[1] https://github.com/spotify/luigi

ABeeSea5y ago

In AWS, lambda’s and step functions.

https://aws.amazon.com/step-functions/

u678u5y ago

Thanks, presumably this is similar to AWS Airflow too.

StreamBright5y ago· 7 in thread

2021? More like 2010. Hadoop is getting deprecated rapidly and more companies split their write and read workloads. Separated storage and compute is also popular. Scala is not used that much, I think it is not worth the time investment. More and more companies go for Kotlin instead of Java when these want to tap into the Java ecosystem.

adilkhashOP5y ago

Hadoop is still widely used in enterprises (especially in banks), if you have experience working with Hadoop ecosystems it is a big plus anyway.

StreamBright5y ago

Yes, that is the status quo.

There is also some trends:

https://trends.google.com/trends/explore?date=today%205-y&ge...

pjmlp5y ago

Kotlin's future is tied with Android, on the JVM it will be another Scala in 5 years time.

If JetBrains gets lucky, they might manage to create a cross-platform Kotlin eco-system as they are trying hard to push, as means to sell InteliJ licenses.

Lets see if it doesn't end like like Typesafe.

smattiso5y ago

Are you working in this field? I am looking for a consultant to setup a modern data processing pipeline for a data driven hardware product I am building.

StreamBright5y ago

Yes, I am moving companies to their next data pipeline, that is my specialty. I have added my email to my profile, you can reach out to me.

kentm5y ago

What does your typical stack look like?

StreamBright5y ago

Depends. Just some random mixture of stacks: PrestoDB, S3, Airflow, Luigi, Dremio, Athena, Hive LLAP, EMC Isilon, Kafka.

My favorite so far is S3 + PrestoDB with either ORC or Parquet files. It is a solid DWH solution for most enterprises on the cloud. (Cloud or not is a different discussion). It works for small scale (50TB) to really high scale (50PB). There are some (very few) gotchas and moving parts as opposed to Hadoop + co. You can combine it with Kafka for streaming data and you got yourself a pretty solid data solution.

dominotw5y ago· 7 in thread

I've been in this space last 6 yrs or so and my scala usuage has gone down to zero. Not worth learning scala.

pgoggijr5y ago

This is an anecdote - plenty of firms are using Scala in their data engineering stacks and it's a great tool for the job.

While maybe not strictly necessary per se, it's a great way to get a foot in the door, and provides a great way to foster advanced type systems and functional programming (I personally find it to be a really fun language to write in to boot).

dominotw5y ago

> it's a great tool for the job.

What job can this do that can't be done via sql. dealing with unstructured data?

st1x75y ago

> plenty of firms are using Scala in their data engineering stacks

Isn't that just a result of everyone being into Spark a few years ago?

1 more reply

sidlls5y ago

Scala, when it's not used because it's just what someone learned the ropes with, is the Haskell of data science and machine learning: it's what people use when they want to inflate their credentials and/or egos.

switch0075y ago

What languages are worth learning?

dominotw5y ago

SQL has taken over the space completely. 90% of data munging and transforms happen via SQL.

I would learn python. Its the number one language outside sql.

2 more replies

johanneskanybal5y ago

sql, python, terraform, maybe some basic java. Airflow is pretty common. Whatever the company is migrating away from. As long as you`re good at one of those and can pick up the rest on the fly you should be fine to start out.

edit: Guess this was pretty much in the post.

prions5y ago· 4 in thread

SQL proficiency is important but I wouldn't say it supersedes programming experience. To me, Data Engineering is a specialization of software engineering, and not something like an analyst who writes SQL all day.

As DE has evolved, the role has transitioned away from traditional low code ETL tools towards code heavy tools. Airflow, Dagster, DBT, to name a few.

I work on a small DE team. We don't have the human power to grind out SQL queries for analysts and other teams. Our solutions are platforms and tools we build on top of more fundamental tools that allows other people to get the data themselves. Think tables-as-a-service.

cbdumas5y ago

SQL proficiency is something I've seen developers of all sorts neglect, which I think is a huge mistake. And relegating SQL to something that just an "analyst" does is an even bigger mistake.

Several times over my career I've been brought in on a project where the team was considering replacing their RDBMS entirely with a no-SQL data store (a huge undertaking!) because they were having "performance problems". In many cases the solution is as simple as adding an index or modifying a query to use an index, but the devs regard it as some kind of wizardry to read a query plan.

Grimm15y ago

I spent way too large a portion of my last position teaching developers about indexes, query plans and underlying join types and their impact on performance and memory consumption.

1 more reply

otter-in-a-suit5y ago

I wholeheartedly agree with the "specialization" comment.

Unless you are in a position where you can entirely rely on managed tools that do the work for you and all effort is centered around managing the data, rather than the holistic view of your data pipelines (Talend ETL, Informatica - the "pre-Hadoop" world, if you will, and maybe some modern tools like Snowflake), then a good Data Engineer needs a deep understanding of programming languages, networking, some sysadmin stuff, distributed systems, containerization, statics, and of course a good "architect" view on the ever-growing zoo of tools and languages with different pros and cons.

Given that at the end of the day, most "Data Pipelines" run on distributed Linux machines, I've seen and solved endless issues with Kernel and OS configurations (noexec flags, ulimits, permissions, keyring limits ...), network bottlenecks, hotspotting (both in networks and databases), overflowing partitions, odd issues on odd file systems, bad partition schemes, a myriad of network issues, JVM flags, needs for auditing and other compliance topics, heavily multi-threaded custom implementations that don't use "standard" tools and rely on language features (goroutines, multiprocessing in Python, Threadpools in Java ...), encoding problems, various TLS and other security challenges, and of course, endless use of GNU tools and other CLI-fun and I would not necessarily expect for a pure SQL use case (not discounting the fact that SQL is, in fact, very important).

Not to mention that a lot of jobs / workflows Data Engineers design and write tend to be very, very expensive, especially on managed Clouds - generally a good idea to make sure everything works and your engineers understand what they are doing.

garciasn5y ago

TL;DR: SQL is still a DE's best and most ubiquitous tool/language in the space--hands down.

I've led DE teams for the last decade. I have lived through shifts in toolsets, languages, etc. Regardless of platform, languages, model types, etc, etc, etc, etc, the one constant has been SQL with some sort of of scripting around it.

Right now, it seems Python is the big wrapper language, whether it's via dag or some other means but that's just the preferred method TODAY. Considering SQL has been around for decades and has outlasted just about every other language and system, many of which have opted for a SQL-like interface on top of their system, I would highly recommend DEs be very strong there.

darth_avocado5y ago· 4 in thread

We want all these skills, yet, we'll give you a separate title and pay you less than a software engineer. Meanwhile front end software engineers are still software engineers and get high pay.

black_mage5y ago

I just got the 2020 stats from one of the biggest tech recruiters in my country. On every level DEs outperform SWEs on salary.

alexpetralia5y ago

I don't think data engineers are paid less than software engineers.

darth_avocado5y ago

They are. I should know. I've worked as one for years including big tech companies. For e.g. FB has a lower pay than SWE, lower RSUs etc. and you can only get SWE pay if you transition into one, and that requires you to go through an interview process internally.

2 more replies

tharne5y ago

In my experience, they get paid more.

1 more reply

laichzeit05y ago· 3 in thread

I think it's missing the resources to one of the hardest sections: Data modelling, like Kimball and Data Vault. That, and maybe a section to modern data infrastructure. I'd put a link to [1] and [2] for a quick overview and probably [3] for more detail.

[1] https://www.holistics.io/books/setup-analytics/ [2] https://a16z.com/2020/10/15/the-emerging-architectures-for-m... [3] https://awesomedataengineering.com/

markus_zhang5y ago

This. I also think modern columnar databases and other techniques somehow makes Kimball to be obsolete or relaxed some how, but I could be very wrong.

For example we use Vertica and DBA told us that Vertica loves wide tables with many columns, which doesn't look very Kimball to me. This gives me some trouble as I'm not really show how to model data properly.

mulmen5y ago

> For example we use Vertica and DBA told us that Vertica loves wide tables with many columns, which doesn't look very Kimball to me.

I have heard advice like this from colleagues and frankly I don't buy it. It certainly isn't gospel. I think it's an oversimplification.

Columnar stores love star schemas. You can get away with a single table model too but you still need some kind of dimensional or at least domain-based thinking. Your single table is going to basically be a Kimball model but already joined together.

No database is going to be happy with joining orders and billing. The single table is still just going to be a single fact table, you just degenerate all the dimensions.

Personally I think you can gain a lot of benefit from doing proper stars because you get more sorting options but I'm a Redshift guy so maybe I'm stuck in that headspace.

I'm still waiting for someone to come along and propose something different but honestly Kimball's dimensional mental model still resonates with me. Are there compromises, can you relax the model more? Of course, but you're still going to realize huge benefits from starting with that approach. I don't think there is some "new" way of thinking that really changed the data space. All the innovation is on the compute side.

I have precisely zero Vertica experience so maybe I'm totally missing something. I'd be happy for someone to tell me I'm wrong.

1 more reply

mulmen5y ago

SQL is easy, data is hard.

wheaties5y ago· 3 in thread

...and nothing of basic statistics? Data Science people want to know about your data pipeline and have some quantification of the quality of that data. Also, monitoring data pipelines for data integrity often relies upon a statistical test. You don't need to go as far as Bayesian but you do need to understand when a median goes way off or if it bi-modal, etc.

diehunde5y ago

That should be assumed in the "engineer" part of the role.

ZephyrBlu5y ago

Yeah I would definitely expect an engineer to have a grasp of basic statistics such as mean, median, mode and be able to interpret statistical graphs on a basic level (Modality, skewness, shape, etc) .

adilkhashOP5y ago

a good catch, thanks!

somurzakov5y ago· 3 in thread

advanced proficiency in SQL and in any scripting language of your choice (C#/powershell, python) is enough to be a data engineer on any technical stack: windows/linux, on-prem/cloud, vendor specific/opensource, literally anything.

knur5y ago

I disagree. That's not enough these days.

If you want to build anything mildly interesting, you need to have a solid background on software engineering (building data pipelines in Spark, Flink, etc. goes way beyond knowing SQL), you need to really understand your runtime (e.g. the JVM, and how to tune it when working with massive amounts of data), you need a bit of knowledge about infrastructure, because some of the most specialized and powerful tools do not have yet an established "way of doing things", and the statefulness nature of them make them different from your typical web app deployment.

Maybe if you want to become a data analyst you only need SQL, and I would still doubt it. But data engineering is a bit different.

somurzakov5y ago

I believe what you described is a job of Platform Engineer/Systems Engineer/Data lake Architect, especially JVM aspect of it. The interesting job is in the beginning when you build the cluster initially, or do major extension, after that the ops/maintenance is usually outsourced to cheap labor offshore - so this kinda job is personally not for me.

spark has dataframe API which is similar to pandas api and can be learned in one day, especially if you know python.

same for Airflow and other frameworks, it just a fancy scheduler that anyone can pick up in a couple days.

dominotw5y ago

> building data pipelines in Spark, Flink, etc. goes way beyond knowing SQL

What if you build you data pipelines in sql? curious if you have an example of a data pipeline that needs spark?

ABeeSea5y ago· 2 in thread

I think learning Scala is a bit of a waste of time, but I don’t know everyone’s stack. Maybe it’s a west coast bubble, but serverless seems to be the most popular choice for new ETL stacks even if the rest the cloud tech stack isn’t serverless. AWS tools like kinesis, glue (pyspark), step functions, pipelines, lambdas, etc.

If you are working in that domain, being able to use the CDK in TypeScript becomes way more important than being able to build a Hadoop cluster from scratch using Scala.

ianbutler5y ago

Glue is both more of a pain in the butt than regular old spark with pyspark and way more expensive, from my experience I would seriously question someone suggesting to use it.

We could have been using it wrong, but porting our Glue scripts to standard EMR after our initial POC saved us over 10x the cost and it was substantially faster.

ABeeSea5y ago

Both pricing and start-up times are significantly better in Glue 2.0 (assuming one can migrate). But even on Glue 1.0, orchestrating an ETL process with with several dozen jobs is a non-trivial amount of configuration and labor. (Jobs failures, job restarts, paging, job run history, cloudwatch logs, re-usable infrastructure as code when creating a new jobs, permissions and security, etc) that the increased cost is more than worth it for us.

https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featur...

1 more reply

zaptheimpaler5y ago· 2 in thread

Somewhat outdated view. This may be the current stack, but its outdated now and is slowly being replaced. The new view is not big data pipelines and ETL jobs, its lambda architecture, live aggregations/materialized views and simple SQL queries on large data warehouses that hide the underlying details. The batch model may still apply to ML I guess, but I'm no expert there.

tomnipotent5y ago

This is true for only a very limited subset of data producers that need real-time or near real-time data included in ML models. For 99% of the rest, batch processing is just fine and considerably more economical.

molsongolden5y ago

Any resources/guides you'd recommend?

mywittyname5y ago· 2 in thread

For GCP, our stacks tend to be Composer (Airflow), BigQuery, Cloud Functions, and Tensorflow.

There's the occasional Hadoop/Spark platform out there, but clients using those tend to have older platforms.

smattiso5y ago

What is your product? I am looking for a consultant to help me setup a good process for a data driven product hardware product.

mywittyname5y ago

The work I do is almost entirely Google Anaytics/Ads related. So probably not what you're looking for, but if so, leave your email and I'll reach out!

master_yoda_15y ago· 2 in thread

IMHO first you need to become a programmer then you can become a data engineer. So if you need to start by learning data structure then you are doing something wrong. Also the topics suggested in "Algorithms & Data Structures" could easily be skipped, the information is drastically misleading. We should seriously have some fact checker, otherwise this kind of bullshit article keep trending on HN and people keep wasting their time on learning LSM tree (what the fuck is that in the first place).

markus_zhang5y ago

Can you recommend something else? I'm preparing to go through a full CS education.

master_yoda_15y ago

Coding and building teach you more than taking a course or watching a video. If you don't have any programming background, you can enroll in some coursera or udacity courses to start with. Then go through this course http://web.stanford.edu/class/cs106x/, the course reader is really good. After that for data engineering read this book https://www.amazon.com/Designing-Data-Intensive-Applications.... Also learn some sql. Take some data, feed into sql light db, and ask question and convert question into query. Becoming good at this takes some time. Be patience. The learning curve is like hokey stick, initial phase might have a dip but it accelerate in the later phase. BY ANY CHANCE DO NOT JOIN A BOOTCAMP.

1 more reply

snidane5y ago· 2 in thread

In data engineering your goal is "standardization". You can't afford every team using their unique tech stack, their own databases, coding styles, etc. People leave the company all the time and you as a data engineer always end up with their mess which now becomes your responsibility to maintain. You'd at least be grateful if those people had used the same methods to code stuff as you and your team so that you wouldn't have to become a Bletchley Park decoding expert any time someone leaves. Or you'd hope the tech stack was powerful and flexible enough that other people other than engineer types could pick it up and maintain themselves. They mostly cannot do that, because there is no such powerful system out there. Even when some modern ELT systems get you 80% there, you, data engineer, are still needed to bridge the gap for the 20% of the cases.

Data Engineering really comes down to being a set of hacks and workarounds, because there is no data processing system which you could use in a standardized systematic way that data analysts, engineers, scientists and anyone else could use. It's kind of a blue-collar "dirty job" of the software world, which nobody really wants to do, but which pays the highest.

There are of course other parts to it, such as managing multiple data products in a systematic way, which engineering minds seem to be best suited for. But the core of data engineering in 2020, I believe, is still implementing hacks and gluing several systems together so as to have a standardized processing system.

Snowflake or Databricks Spark bring you closest to the ideal unified system despite all their shortcomings. But still, you sometimes need to process unstructured jsons, extract stuff from html and xml files, unzip a bunch of zip archives and put them into something that these systems recognize and only then you can run sql on it. It is much better than the ETL of the past, where you really had to hack and glue 50% of the system yourself, but it is still nowhere near the ideal system in which you'd simply tell your data analysts: you can do it all yourself, I'm going to show you how. And I won't have to run and maintain a preprocessing job to munge some data into something spark recognizable for you.

It is not that difficult to imagine a world where such a system exists and data engineering is not evem needed. But you can be damn sure, that before this happens, that this position will be here to stay, and will be paying high, when 90% of ML and data science is data engineering and cleaning and all these companies hired a shitton of data science and ML people who are now trying to justify their salaries by desperately trying to do data engineers' job.

valenterry5y ago

> It's kind of a blue-collar "dirty job" of the software world, which nobody really wants to do, but which pays the highest.

Love this quote. It hits the nail on the head. Not sure why it's paid so well though...

markus_zhang5y ago

Why would no one want to do it though? It's way more interesting than my current BA job. I managed to squeeze a bit into the role by opting to do some ETL and it's already very rewarding.

I'd definitely do it even if it's the same pay as BA.

justinzollars5y ago· 2 in thread

Amazon introduced Step Functions, which are very nice to dig into and a helpful skill for Data Engineering.

jimsparkman5y ago

With direct integrations to EMR, Lambda, and Athena, its a great tool for building pipelines and effectively costs nothing on its own and is completely headache free.

justinzollars5y ago

This is exactly how I use it. I left out that important point.

runT1ME5y ago· 1 in thread

I've been approached about various data engineering jobs over the last couple years and the job descriptions have varied wildly. It has been everything from:

1. SQL/analytics wizard, capable of building out dashboards and quickly finding insights in structured data. Oracle/MSSQL/PostGres etc. Maybe even capable of FE development.

2. Pipeline expert, capable of building out data pipelines for transforming data, Flink, Spark, Beam on top of Kafka/Kinesis/Pubsub run from an orchestration engine like Airflow. Even this could span from using mostly pre-built tools wiring together things with a bit of python to move data from A to B, to the other exteme of full fledge Scala engineer writing complex applications that run on these pipelines.

3. Writing infrastructure software for big data pipelines, customizing Spark/Beam/Flink/Kafka and/or writing custom big data tools when out of the box solutions don't work or scale. Some overlap with 2, but really distinguished by it being a full fledged software engineer specializing in the big data ecosystem.

So, are all three of these appropriate to call Data Engineer? Is it mainly #1 and people are getting confused? I would certainly fall into the #3, so I'm always surprised when people approach me about 'SQL transform' type jobs.

ABeeSea5y ago

I’d call 2 and 3 data engineers and 1 either a data analyst or BI developer/engineer depending on technical proficiency.

dibujante5y ago· 1 in thread

"In order to undestand how these systems work I would recommend to know the language in which they are written. The biggest concern with Python is its poor performance hence the knowledge of a more efficient language will be a big plus to your skillset."

What? The Apache stack that's written in Scala recompiles all your code into JVM bytecode, regardless of what language you've written it in. Yes, that includes Scala. Spark isn't actually firing up a python interpreter and running your python code on the data.

fantod5y ago

> In order to undestand how these systems work I would recommend to know the language in which they are written. The biggest concern with Python is its poor performance hence the knowledge of a more efficient language will be a big plus to your skillset.

I think these two sentences are sort of orthogonal to one another. The first, I interpret as saying that it's useful to understand Scala if you're using Spark, essentially because of the law of leaky abstractions [1]. I think you're responding to the second sentence and in that case I agree.

[1] https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...

querulous5y ago· 1 in thread

i see a lot of "spark is dead" talk here. what replaces it for transform inbetween something like kafka and redshift/bigquery?

llbeansandrice5y ago

I agree with both sides here. Spark was the in thing for a while so a lot of places are using it but probably don't need it and could have been better off running various SQL scripts to do some transformations. I worked on a project exactly like this where we should have used SQL scripts instead of Spark.

But I also think that a lot of enterprise pipelines went all in on spark and so now moving to something else (SQL scripts, Snowflake, etc.) just isn't worth it. So Spark is dead, long live Spark.

diehunde5y ago

Nice article. From experience I would say the SQL knowledge should be advanced though. Not intermediate.

sseppola5y ago

Great resource, thanks for sharing it! I will dig deeper into the resources linked here as there's a lot I have never seen before. The main topics are more or less exactly what I've found to be key in this space in the last 2 months trying to wrap my head around data engineering in my new job.

What I'm still trying to grasp is first how to assess the big data tools (Spark/Flink/Synapse/Big Query et.al) for my use cases (mostly ETL). It just seems like Spark wins because it's most used, but I have no idea how to differentiate these tools beyond the general streaming/batch/real-time taglines. Secondly, assessing the "pipeline orchestrator" for our use cases, where like Spark, Airflow usually comes out on top because of usage. Would love to read more about this.

Currently I'm reading Designing Data-Intensive Applications by Kleppman, which is great. I hope this will teach me the fundamentals of this space so it becomes easier to reason about different tools.

freebee165y ago

Im my experience teams operating under the "The AI Hierarchy of Needs" principles are optimized for generating white papers

airbreather5y ago

Your data is only as good as your instrumentation and you usually only get one chance to grab that data, but can have many goes at processing it, do I would argue the bit not covered is the most important.

Nydhal5y ago

Shameless plug to my much simpler (simplistic?) view of things. In this case, I think Data Engineers are the people building systems that solely focus on the data, all the data and nothing but the data.

https://www.linkedin.com/pulse/mapping-data-science-professi...

ectoplasmaboiii5y ago

Is anyone here using kdb+/q for data engineering, specifically outside of finance?

rmelhem5y ago

where I work for, our stack is all about GCP/Airflow/Python/BigQuery ML, for recommender systems. I'm now playing around with Turicreate (Apple) to compare with BQML.

cargoshipit5y ago

I don't recommend it

j / k navigate · click thread line to collapse

137 comments

94 comments · 26 top-level

josephmosby5y ago· 12 in thread

(source for everything following: I recently hired entry-level data engineers)

dominotw5y ago

Agree 100% with this comment,

Old stack: Hadoop, spark, hive, hdfs.

New stack: kafka/kinesis, fivetran/stitch/singer, airflow/dagster, dbt/dataform, snowflake/redshift

disgruntledphd25y ago

Huh, what replaces Spark in those lists?

For my money, its the best distributed ML system out there, so I'd be interested to know what new hotness I'm missing.

5 more replies

sseppola5y ago

Can you elaborate more on the "roles" of the "new stack"? To me dbt/dataform and airflow/dagster are quite similar, so why do you need one of each? fivetran/stitch/singer are all new

2 more replies

fetanchaud5y ago

Just a thought : what about dremio ?

llbeansandrice5y ago

> I agree with a few other commentators here that Hadoop/Spark isn't being used a lot in their production environments

I guess I'm the odd-man out because that's all I've used for this kind of work. Spark, Hive, Hadoop, Scala, Kafka, etc.

josephmosby5y ago

I should have specified more thoroughly.

I am not seeing Spark being chosen for new data eng roll-outs. It is still very prevalent in existing environments because it still works well. (used at $lastjob myself)

However - I am still seeing a lot of Spark for machine-learning work by data scientists. Distributed ML feels like it is getting split into a different toolkit than distributed DE.

1 more reply

teddyuk5y ago

I'm also the odd one out, so many enterprises moving to spark on databricks.

Mauricebranagh5y ago

I should imagine at CERN etc knowing which end of soldering iron gets hot might still be required in some cases.

I recall back in the mumble extracting data from b&w film shot with a high speed camera, by projecting it on to graph paper taped to the wall and manualy marking the position of the "object"

pjmlp5y ago

When I was there, almost 20 years ago, it was all about C++, Python and Fortran, with GUIs being done in Java Swing.

I bet it is still mostly the same, just using Web GUIs nowadays.

valenterry5y ago

We yet have to wait for the proper sweet spot: a language that allows SQL-like handling without the restrictions of SQL.

As many advantages as SQL has, in many cases it gets into the way. The closer you move to moving data (instead of doing analysis), the more it becomes annoying.

Hopefully better type-systems will help us out in the long term, in particular those with dependent types or similar power to describe data relations.

andkenneth5y ago

What's your opinion of LINQ in C#? It's been a while since I've used it but to me it seems like one of the most powerful ways to manipulate data inside of a language.

1 more reply

alexpetralia5y ago

Great points. It depends on where the business is at, the scale of their data, how processed their data is, and the timeliness/accuracy requirements of that data.

u678u5y ago· 10 in thread

musingsole5y ago

I'm not bitter; you're bitter. /s

u678u5y ago

Yeah I'm bitter. I want to bring structure, tools and discipline to small scale data gathering, but the big tools are just too time-consuming to get up and keep running with just a few hours.

kthejoker25y ago

Depending on what your sources and sinks are:

* Microsoft SSIS is still there, kind of a granddaddy tool but perfectly capable of single-machine ETL

* Trifacta's Wrangler has a free version with limits

* Talend's Open Studio is free, a little clunky but works fine

* If you want to roll your own SQL/Python/etc ETL, Airflow and Luigi are good and simple orchestrators/schedulers

The cloud services have pretty cheap consumption-based ETL PaaS offerings, too: Azure Data Factory, Amazon Glue, GCP Cloud Data Fusion

hermitcrab5y ago

Excel Power Query is also quite lightweight. But is pretty klunky in my (biased) opinion.

u678u5y ago

Thanks there are enough workstation tools, but I want an automated tool that runs on a server.

kfk5y ago

imcoconut5y ago

No reason you cant still use airflow for the orchestration/depenendecy management/scheduling. And for the processing and storage - Pandas and sqlite.

Also would highly highly recommend looking into kedro (which has airflow integration, or you could just run your pipelines with crontab)

throwaway72815y ago

Take look at luigi, which is a lightweight task orchestrator with minimalistic dependencies.

[1] https://github.com/spotify/luigi

ABeeSea5y ago

In AWS, lambda’s and step functions.

https://aws.amazon.com/step-functions/

u678u5y ago

Thanks, presumably this is similar to AWS Airflow too.

StreamBright5y ago· 7 in thread

adilkhashOP5y ago

Hadoop is still widely used in enterprises (especially in banks), if you have experience working with Hadoop ecosystems it is a big plus anyway.

StreamBright5y ago

Yes, that is the status quo.

There is also some trends:

https://trends.google.com/trends/explore?date=today%205-y&ge...

pjmlp5y ago

Kotlin's future is tied with Android, on the JVM it will be another Scala in 5 years time.

If JetBrains gets lucky, they might manage to create a cross-platform Kotlin eco-system as they are trying hard to push, as means to sell InteliJ licenses.

Lets see if it doesn't end like like Typesafe.

smattiso5y ago

Are you working in this field? I am looking for a consultant to setup a modern data processing pipeline for a data driven hardware product I am building.

StreamBright5y ago

Yes, I am moving companies to their next data pipeline, that is my specialty. I have added my email to my profile, you can reach out to me.

kentm5y ago

What does your typical stack look like?

StreamBright5y ago

Depends. Just some random mixture of stacks: PrestoDB, S3, Airflow, Luigi, Dremio, Athena, Hive LLAP, EMC Isilon, Kafka.

dominotw5y ago· 7 in thread

I've been in this space last 6 yrs or so and my scala usuage has gone down to zero. Not worth learning scala.

pgoggijr5y ago

This is an anecdote - plenty of firms are using Scala in their data engineering stacks and it's a great tool for the job.

dominotw5y ago

> it's a great tool for the job.

What job can this do that can't be done via sql. dealing with unstructured data?

st1x75y ago

> plenty of firms are using Scala in their data engineering stacks

Isn't that just a result of everyone being into Spark a few years ago?

1 more reply

sidlls5y ago

switch0075y ago

What languages are worth learning?

dominotw5y ago

SQL has taken over the space completely. 90% of data munging and transforms happen via SQL.

I would learn python. Its the number one language outside sql.

2 more replies

johanneskanybal5y ago

edit: Guess this was pretty much in the post.

prions5y ago· 4 in thread

As DE has evolved, the role has transitioned away from traditional low code ETL tools towards code heavy tools. Airflow, Dagster, DBT, to name a few.

cbdumas5y ago

SQL proficiency is something I've seen developers of all sorts neglect, which I think is a huge mistake. And relegating SQL to something that just an "analyst" does is an even bigger mistake.

Grimm15y ago

I spent way too large a portion of my last position teaching developers about indexes, query plans and underlying join types and their impact on performance and memory consumption.

1 more reply

otter-in-a-suit5y ago

I wholeheartedly agree with the "specialization" comment.

garciasn5y ago

TL;DR: SQL is still a DE's best and most ubiquitous tool/language in the space--hands down.

darth_avocado5y ago· 4 in thread

We want all these skills, yet, we'll give you a separate title and pay you less than a software engineer. Meanwhile front end software engineers are still software engineers and get high pay.

black_mage5y ago

I just got the 2020 stats from one of the biggest tech recruiters in my country. On every level DEs outperform SWEs on salary.

alexpetralia5y ago

I don't think data engineers are paid less than software engineers.

darth_avocado5y ago

2 more replies

tharne5y ago

In my experience, they get paid more.

1 more reply

laichzeit05y ago· 3 in thread

[1] https://www.holistics.io/books/setup-analytics/ [2] https://a16z.com/2020/10/15/the-emerging-architectures-for-m... [3] https://awesomedataengineering.com/

markus_zhang5y ago

This. I also think modern columnar databases and other techniques somehow makes Kimball to be obsolete or relaxed some how, but I could be very wrong.

mulmen5y ago

> For example we use Vertica and DBA told us that Vertica loves wide tables with many columns, which doesn't look very Kimball to me.

I have heard advice like this from colleagues and frankly I don't buy it. It certainly isn't gospel. I think it's an oversimplification.

No database is going to be happy with joining orders and billing. The single table is still just going to be a single fact table, you just degenerate all the dimensions.

Personally I think you can gain a lot of benefit from doing proper stars because you get more sorting options but I'm a Redshift guy so maybe I'm stuck in that headspace.

I have precisely zero Vertica experience so maybe I'm totally missing something. I'd be happy for someone to tell me I'm wrong.

1 more reply

mulmen5y ago

SQL is easy, data is hard.

wheaties5y ago· 3 in thread

diehunde5y ago

That should be assumed in the "engineer" part of the role.

ZephyrBlu5y ago

adilkhashOP5y ago

a good catch, thanks!

somurzakov5y ago· 3 in thread

knur5y ago

I disagree. That's not enough these days.

Maybe if you want to become a data analyst you only need SQL, and I would still doubt it. But data engineering is a bit different.

somurzakov5y ago

spark has dataframe API which is similar to pandas api and can be learned in one day, especially if you know python.

same for Airflow and other frameworks, it just a fancy scheduler that anyone can pick up in a couple days.

dominotw5y ago

> building data pipelines in Spark, Flink, etc. goes way beyond knowing SQL

What if you build you data pipelines in sql? curious if you have an example of a data pipeline that needs spark?

ABeeSea5y ago· 2 in thread

If you are working in that domain, being able to use the CDK in TypeScript becomes way more important than being able to build a Hadoop cluster from scratch using Scala.

ianbutler5y ago

Glue is both more of a pain in the butt than regular old spark with pyspark and way more expensive, from my experience I would seriously question someone suggesting to use it.

We could have been using it wrong, but porting our Glue scripts to standard EMR after our initial POC saved us over 10x the cost and it was substantially faster.

ABeeSea5y ago

https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featur...

1 more reply

zaptheimpaler5y ago· 2 in thread

tomnipotent5y ago

molsongolden5y ago

Any resources/guides you'd recommend?

mywittyname5y ago· 2 in thread

For GCP, our stacks tend to be Composer (Airflow), BigQuery, Cloud Functions, and Tensorflow.

There's the occasional Hadoop/Spark platform out there, but clients using those tend to have older platforms.

smattiso5y ago

What is your product? I am looking for a consultant to help me setup a good process for a data driven product hardware product.

mywittyname5y ago

The work I do is almost entirely Google Anaytics/Ads related. So probably not what you're looking for, but if so, leave your email and I'll reach out!

master_yoda_15y ago· 2 in thread

markus_zhang5y ago

Can you recommend something else? I'm preparing to go through a full CS education.

master_yoda_15y ago

1 more reply

snidane5y ago· 2 in thread

valenterry5y ago

> It's kind of a blue-collar "dirty job" of the software world, which nobody really wants to do, but which pays the highest.

Love this quote. It hits the nail on the head. Not sure why it's paid so well though...

markus_zhang5y ago

Why would no one want to do it though? It's way more interesting than my current BA job. I managed to squeeze a bit into the role by opting to do some ETL and it's already very rewarding.

I'd definitely do it even if it's the same pay as BA.

justinzollars5y ago· 2 in thread

Amazon introduced Step Functions, which are very nice to dig into and a helpful skill for Data Engineering.

jimsparkman5y ago

With direct integrations to EMR, Lambda, and Athena, its a great tool for building pipelines and effectively costs nothing on its own and is completely headache free.

justinzollars5y ago

This is exactly how I use it. I left out that important point.

runT1ME5y ago· 1 in thread

I've been approached about various data engineering jobs over the last couple years and the job descriptions have varied wildly. It has been everything from:

1. SQL/analytics wizard, capable of building out dashboards and quickly finding insights in structured data. Oracle/MSSQL/PostGres etc. Maybe even capable of FE development.

ABeeSea5y ago

I’d call 2 and 3 data engineers and 1 either a data analyst or BI developer/engineer depending on technical proficiency.

dibujante5y ago· 1 in thread

fantod5y ago

[1] https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...

querulous5y ago· 1 in thread

i see a lot of "spark is dead" talk here. what replaces it for transform inbetween something like kafka and redshift/bigquery?

llbeansandrice5y ago

But I also think that a lot of enterprise pipelines went all in on spark and so now moving to something else (SQL scripts, Snowflake, etc.) just isn't worth it. So Spark is dead, long live Spark.

diehunde5y ago

Nice article. From experience I would say the SQL knowledge should be advanced though. Not intermediate.

sseppola5y ago

Currently I'm reading Designing Data-Intensive Applications by Kleppman, which is great. I hope this will teach me the fundamentals of this space so it becomes easier to reason about different tools.

freebee165y ago

Im my experience teams operating under the "The AI Hierarchy of Needs" principles are optimized for generating white papers

airbreather5y ago

Nydhal5y ago

https://www.linkedin.com/pulse/mapping-data-science-professi...

ectoplasmaboiii5y ago

Is anyone here using kdb+/q for data engineering, specifically outside of finance?

rmelhem5y ago

where I work for, our stack is all about GCP/Airflow/Python/BigQuery ML, for recommender systems. I'm now playing around with Turicreate (Apple) to compare with BQML.

cargoshipit5y ago

I don't recommend it

j / k navigate · click thread line to collapse