The experience required differs dramatically between [semi]structured transactional data moving into data warehouses versus highly unstructured data that the data engineer has to do a lot of munging on.
If you're working in an environment where the data is mostly structured, you will be working primarily in SQL. A LOT of SQL. You'll also need to know a lot about a particular database stack and how to squeeze it. In this scenario, you're probably going to be thinking a lot about job-scheduling workflows, query optimization, and data quality. It is a very operations-heavy workflow. There are a lot of tools available to help make this process easier.
If you're working in a highly unstructured data environment, you're going to be munging a lot of this data yourself. The "operations" focus is still useful, but as an entry-level data engineer, you're going to be spending a lot more time thinking about writing parsers and basic jobs. If you're focusing your practice time on writing scripts that move data in Structure A in Place X to Structure B in Place Y, you're setting yourself up for success.
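To make that concrete, here's a minimal sketch of such a practice script in Python. The CSV-to-JSON-Lines direction and the field names (id, amount, note) are invented for illustration:

```python
import csv
import io
import json

def csv_to_jsonl(csv_text):
    """Read CSV rows (Structure A) and emit JSON Lines (Structure B)."""
    out = io.StringIO()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Light normalization: drop empty values, cast the numeric field.
        record = {k: v for k, v in row.items() if v != ""}
        if "amount" in record:
            record["amount"] = float(record["amount"])
        out.write(json.dumps(record) + "\n")
    return out.getvalue()

csv_text = "id,amount,note\n1,9.99,first\n2,12.50,\n"
print(csv_to_jsonl(csv_text))
```

In a real job, "Place X" and "Place Y" would be object storage, a queue, or a database rather than in-memory strings, but the shape of the work (parse, normalize, re-serialize) is the same.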
I agree with a few other commenters here that Hadoop/Spark isn't being used a lot in their production environments - but - there are a lot of useful concepts in Hadoop/Spark that are helpful for data engineers to be familiar with. While you might not be using those tools on a day-to-day basis, chances are your hiring manager used them when she was in your position, and it will give you an opportunity to show you know a few tools at a deeper level.
Old stack: Hadoop, Spark, Hive, HDFS.
New stack: Kafka/Kinesis, Fivetran/Stitch/Singer, Airflow/Dagster, dbt/Dataform, Snowflake/Redshift.
For my money, it's the best distributed ML system out there, so I'd be interested to know what new hotness I'm missing.
I guess I'm the odd-man out because that's all I've used for this kind of work. Spark, Hive, Hadoop, Scala, Kafka, etc.
I am not seeing Spark being chosen for new data eng roll-outs. It is still very prevalent in existing environments because it still works well. (used at $lastjob myself)
However - I am still seeing a lot of Spark for machine-learning work by data scientists. Distributed ML feels like it is getting split into a different toolkit than distributed DE.
I should imagine at CERN etc. knowing which end of a soldering iron gets hot might still be required in some cases.
I recall, back in the mumble, extracting data from b&w film shot with a high-speed camera by projecting it onto graph paper taped to the wall and manually marking the position of the "object".
I bet it is still mostly the same, just using Web GUIs nowadays.
As many advantages as SQL has, in many cases it gets in the way. The closer you get to moving data (instead of analyzing it), the more annoying SQL becomes.
On the other hand, current languages (such as Python) lack support when it comes to data transformations. Even Scala, which is one of the better languages for this, has severe drawbacks compared to SQL.
Hopefully better type systems will help us out in the long term, in particular those with dependent types or similar power to describe data relations.
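To illustrate the trade-off, here is the same aggregation done both ways using Python's built-in sqlite3 module; the table and column names are invented for the example:

```python
import sqlite3
from collections import defaultdict

rows = [("us", 10), ("us", 5), ("eu", 7)]

# The SQL way: declarative, one statement.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# The general-purpose-language way: explicit loop and mutable state.
py_totals = defaultdict(int)
for region, amount in rows:
    py_totals[region] += amount

print(sql_totals)        # totals by region, e.g. {'us': 15, 'eu': 7}
print(dict(py_totals))
```

For analysis the SQL version clearly wins; the pain shows up when the job is mostly moving and reshaping records, where the imperative version composes more naturally with the rest of a program.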
[1] https://www.holistics.io/books/setup-analytics/ [2] https://a16z.com/2020/10/15/the-emerging-architectures-for-m... [3] https://awesomedataengineering.com/
For example, we use Vertica, and our DBA told us that Vertica loves wide tables with many columns, which doesn't look very Kimball to me. This gives me some trouble, as I'm not really sure how to model data properly.
I have heard advice like this from colleagues and frankly I don't buy it. It certainly isn't gospel. I think it's an oversimplification.
Columnar stores love star schemas. You can get away with a single table model too but you still need some kind of dimensional or at least domain-based thinking. Your single table is going to basically be a Kimball model but already joined together.
No database is going to be happy with joining orders and billing. The single table is still just going to be a single fact table, you just degenerate all the dimensions.
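As a toy illustration of that point (schema invented, using Python's sqlite3): the "one big table" is just the star schema with the join already materialized.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Kimball-style star: a fact table plus a dimension.
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'us'), (2, 'eu');
    INSERT INTO fact_orders VALUES (100, 1, 9.5), (101, 2, 3.0);

    -- The 'single wide table' is the same model, pre-joined
    -- (the dimension attributes degenerated onto the fact).
    CREATE TABLE wide_orders AS
    SELECT o.order_id, o.amount, c.region
    FROM fact_orders o JOIN dim_customer c USING (customer_id);
""")
print(list(con.execute(
    "SELECT order_id, region, amount FROM wide_orders ORDER BY order_id")))
```

Either way you had to do the dimensional thinking first; the wide table just bakes the join in at load time instead of at query time.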
Personally I think you can gain a lot of benefit from doing proper stars because you get more sorting options but I'm a Redshift guy so maybe I'm stuck in that headspace.
I'm still waiting for someone to come along and propose something different but honestly Kimball's dimensional mental model still resonates with me. Are there compromises, can you relax the model more? Of course, but you're still going to realize huge benefits from starting with that approach. I don't think there is some "new" way of thinking that really changed the data space. All the innovation is on the compute side.
I have precisely zero Vertica experience so maybe I'm totally missing something. I'd be happy for someone to tell me I'm wrong.
As DE has evolved, the role has transitioned away from traditional low-code ETL tools towards code-heavy tools: Airflow, Dagster, and dbt, to name a few.
I work on a small DE team. We don't have the human power to grind out SQL queries for analysts and other teams. Our solutions are platforms and tools we build on top of more fundamental tools that allow other people to get the data themselves. Think tables-as-a-service.
Several times over my career I've been brought in on a project where the team was considering replacing their RDBMS entirely with a no-SQL data store (a huge undertaking!) because they were having "performance problems". In many cases the solution is as simple as adding an index or modifying a query to use an index, but the devs regard it as some kind of wizardry to read a query plan.
Unless you are in a position where you can rely entirely on managed tools that do the work for you, and all effort centers on managing the data rather than on a holistic view of your data pipelines (Talend ETL, Informatica, the "pre-Hadoop" world if you will, and maybe some modern tools like Snowflake), a good data engineer needs a deep understanding of programming languages, networking, some sysadmin skills, distributed systems, containerization, statistics, and of course a good "architect" view of the ever-growing zoo of tools and languages with different pros and cons.
Given that at the end of the day most "data pipelines" run on distributed Linux machines, I've seen and solved endless issues with kernel and OS configurations (noexec flags, ulimits, permissions, keyring limits ...), network bottlenecks, hotspotting (both in networks and databases), overflowing partitions, odd issues on odd file systems, bad partition schemes, a myriad of network issues, JVM flags, needs for auditing and other compliance topics, heavily multi-threaded custom implementations that don't use "standard" tools and rely on language features (goroutines, multiprocessing in Python, thread pools in Java ...), encoding problems, various TLS and other security challenges, and of course endless use of GNU tools and other CLI fun, none of which I would necessarily expect in a pure SQL use case (not discounting the fact that SQL is, in fact, very important).
Not to mention that a lot of the jobs and workflows data engineers design and write tend to be very, very expensive, especially on managed clouds, so it's generally a good idea to make sure everything works and your engineers understand what they are doing.
I've led DE teams for the last decade. I have lived through shifts in toolsets, languages, etc. Regardless of platform, languages, model types, and so on, the one constant has been SQL with some sort of scripting around it.
Right now, it seems Python is the big wrapper language, whether it's via DAGs or some other means, but that's just the preferred method TODAY. Considering SQL has been around for decades and has outlasted just about every other language and system, many of which have opted for a SQL-like interface on top, I would highly recommend DEs be very strong there.
There are also some trends:
https://trends.google.com/trends/explore?date=today%205-y&ge...
If JetBrains gets lucky, they might manage to create the cross-platform Kotlin ecosystem they are pushing so hard for, as a means to sell IntelliJ licenses.
Let's see if it doesn't end up like Typesafe.
My favorite so far is S3 + PrestoDB with either ORC or Parquet files. It is a solid DWH solution for most enterprises on the cloud (cloud or not is a different discussion). It works from small scale (50 TB) to really high scale (50 PB). There are some (but very few) gotchas and moving parts compared to Hadoop and co. You can combine it with Kafka for streaming data, and you've got yourself a pretty solid data solution.
If you are working in that domain, being able to use the CDK in TypeScript becomes way more important than being able to build a Hadoop cluster from scratch using Scala.
We could have been using it wrong, but porting our Glue scripts to standard EMR after our initial POC cut our costs by more than 10x, and it was substantially faster.
https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featur...
While maybe not strictly necessary per se, it's a great way to get a foot in the door, and it provides great exposure to advanced type systems and functional programming (I personally find it to be a really fun language to write in, to boot).
I would learn Python. It's the number one language outside SQL.
edit: Guess this was pretty much in the post.
1. SQL/analytics wizard, capable of building out dashboards and quickly finding insights in structured data. Oracle/MSSQL/Postgres, etc. Maybe even capable of FE development.
2. Pipeline expert, capable of building out data pipelines for transforming data: Flink, Spark, Beam on top of Kafka/Kinesis/Pub/Sub, run from an orchestration engine like Airflow. Even this could span from using mostly pre-built tools, wiring things together with a bit of Python to move data from A to B, to the other extreme of a full-fledged Scala engineer writing complex applications that run on these pipelines.
3. Writing infrastructure software for big data pipelines, customizing Spark/Beam/Flink/Kafka and/or writing custom big data tools when out-of-the-box solutions don't work or scale. Some overlap with 2, but really distinguished by being a full-fledged software engineer specializing in the big data ecosystem.
So, are all three of these appropriate to call Data Engineer? Is it mainly #1 and people are getting confused? I would certainly fall into the #3, so I'm always surprised when people approach me about 'SQL transform' type jobs.
What? The Apache stack that's written in Scala recompiles all your code into JVM bytecode, regardless of what language you've written it in. Yes, that includes Python. Spark isn't actually firing up a Python interpreter and running your Python code on the data.
I think these two sentences are sort of orthogonal to one another. The first, I interpret as saying that it's useful to understand Scala if you're using Spark, essentially because of the law of leaky abstractions [1]. I think you're responding to the second sentence and in that case I agree.
[1] https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...
What I'm still trying to grasp is, first, how to assess the big data tools (Spark/Flink/Synapse/BigQuery et al.) for my use cases (mostly ETL). It just seems like Spark wins because it's the most used, but I have no idea how to differentiate these tools beyond the general streaming/batch/real-time taglines. Second, assessing the "pipeline orchestrator" for our use cases, where, like Spark, Airflow usually comes out on top because of usage. Would love to read more about this.
Currently I'm reading Designing Data-Intensive Applications by Kleppman, which is great. I hope this will teach me the fundamentals of this space so it becomes easier to reason about different tools.
There's the occasional Hadoop/Spark platform out there, but clients using those tend to have older platforms.
I'm not bitter; you're bitter. /s
* Microsoft SSIS is still there, kind of a granddaddy tool but perfectly capable of single-machine ETL
* Trifacta's Wrangler has a free version with limits
* Talend's Open Studio is free, a little clunky but works fine
* Some new players that I've played around with are Airbyte (immature but evolving quickly) and Fivetran (consumption-based pricing model, fairly extensible, but kind of biased about the sources/sinks they're interested in supporting)
* I haven't tried Streamsets or Stitch yet, but I've watched a few videos, again, a little more focused on cloud and streaming data sources than traditional batch ETL, but seem fair enough for those use cases as well
* If you want to roll your own SQL/Python/etc ETL, Airflow and Luigi are good and simple orchestrators/schedulers
The cloud services have pretty cheap consumption-based ETL PaaS offerings, too: Azure Data Factory, AWS Glue, and GCP Cloud Data Fusion.
Unless what you're doing is highly bespoke ETL, I'd recommend trying out the new kids on the block and seeing if you can build pipelines that suit your needs from those, because they're at the forefront of a lot of evolving data architecture patterns that are about to dominate the 2020s.
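For a sense of what orchestrators like Airflow and Luigi are doing at their core, here's a toy dependency-ordered task runner in plain Python. The task names are made up, and real orchestrators add scheduling, retries, backfills, and state tracking on top of this:

```python
# Toy illustration of the core of an Airflow/Luigi-style orchestrator:
# run tasks in dependency order, upstream tasks first.
def run_pipeline(tasks, deps):
    """tasks: name -> callable; deps: name -> list of upstream task names."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)  # recurse into dependencies first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

log = []
tasks = {
    "load": lambda: log.append("load"),
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
}
deps = {"transform": ["extract"], "load": ["transform"]}
order = run_pipeline(tasks, deps)
print(order)  # ['extract', 'transform', 'load']
```

If your pipeline is simple enough that this sketch plus cron would cover it, the heavyweight orchestrators may be overkill.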
Excel Power Query is also quite lightweight, but it's pretty clunky in my (biased) opinion.
Also, I would highly recommend looking into Kedro (which has Airflow integration, or you could just run your pipelines with crontab).
Data engineering really comes down to being a set of hacks and workarounds, because there is no data processing system that data analysts, engineers, scientists, and anyone else could all use in a standardized, systematic way. It's kind of the blue-collar "dirty job" of the software world, which nobody really wants to do, but which pays the best.
There are of course other parts to it, such as managing multiple data products in a systematic way, which engineering minds seem to be best suited for. But the core of data engineering in 2020, I believe, is still implementing hacks and gluing several systems together so as to have a standardized processing system.
Snowflake or Databricks Spark bring you closest to the ideal unified system, despite all their shortcomings. But still, you sometimes need to process unstructured JSON, extract stuff from HTML and XML files, or unzip a bunch of zip archives and put them into something that these systems recognize, and only then can you run SQL on it. It is much better than the ETL of the past, where you really had to hack and glue 50% of the system yourself, but it is still nowhere near the ideal system in which you'd simply tell your data analysts: you can do it all yourself, I'm going to show you how. And I won't have to run and maintain a preprocessing job to munge some data into something Spark-recognizable for you.
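That kind of preprocessing is usually mundane stdlib work. Here's a sketch of the "unzip a bunch of JSON and make it tabular" step in Python, with made-up file and field names:

```python
import io
import json
import zipfile

def zipped_json_to_rows(zip_bytes):
    """Unpack a zip of JSON documents and flatten each one into a flat row."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in sorted(zf.namelist()):
            doc = json.loads(zf.read(name))
            row = {}
            for key, value in doc.items():
                if isinstance(value, dict):
                    # Flatten one level of nesting into column-style keys.
                    for sub, subval in value.items():
                        row[f"{key}_{sub}"] = subval
                else:
                    row[key] = value
            rows.append(row)
    return rows

# Build a tiny zip in memory to stand in for the raw archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.json", json.dumps({"id": 1, "user": {"name": "ada"}}))
    zf.writestr("b.json", json.dumps({"id": 2, "user": {"name": "bob"}}))

print(zipped_json_to_rows(buf.getvalue()))
# [{'id': 1, 'user_name': 'ada'}, {'id': 2, 'user_name': 'bob'}]
```

From flat rows like these it's a short hop to Parquet or a COPY INTO statement, which is where the warehouse takes over.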
It is not that difficult to imagine a world where such a system exists and data engineering is not even needed. But you can be damn sure that before that happens, this position is here to stay, and will keep paying well, as long as 90% of ML and data science is data engineering and cleaning, and all these companies have hired a shitton of data science and ML people who are now trying to justify their salaries by desperately trying to do data engineers' jobs.
Love this quote. It hits the nail on the head. Not sure why it's paid so well though...
I'd definitely do it even if it's the same pay as BA.
But I also think that a lot of enterprise pipelines went all in on spark and so now moving to something else (SQL scripts, Snowflake, etc.) just isn't worth it. So Spark is dead, long live Spark.
https://www.linkedin.com/pulse/mapping-data-science-professi...
If you want to build anything mildly interesting, you need a solid background in software engineering (building data pipelines in Spark, Flink, etc. goes way beyond knowing SQL), you need to really understand your runtime (e.g. the JVM, and how to tune it when working with massive amounts of data), and you need a bit of knowledge about infrastructure, because some of the most specialized and powerful tools do not yet have an established "way of doing things", and their stateful nature makes them different from your typical web app deployment.
Maybe if you want to become a data analyst you only need SQL, and I would still doubt it. But data engineering is a bit different.
Spark has a DataFrame API that is similar to the pandas API and can be learned in a day, especially if you know Python.
Same for Airflow and other frameworks; it's just a fancy scheduler that anyone can pick up in a couple of days.
What if you build your data pipelines in SQL? Curious if you have an example of a data pipeline that needs Spark.