undefined | Better HN

0 pointssomurzakov6y ago0 comments

can I ask a question about your #2. It seems you do ELT with spark/sql doing the T part after loading. Is your loading part high performance or do people even care whether it is fast or not? In my experience, when I extract and load data as is (for example into SQL Server) - it is kinda slow, because the columns have to be wide and generic, to accommodate all the crap that can come in. For example, I noticed that loading 1M rows into nvarchar(2048) is way slower, than into varchar(50). Let's say you have one column that usually does not exceed 50 chars, but sometimes it can be crap data and be 2000 chars. What is the best scenario to ELT it quickly?

What I found is that if data is high quality - then ELT is totally fine, often times it ends up being just EL without much T. But if the data is crap, and you have a lot of wide columns, then even loading it takes time, before we even get to processing stage. In this scenario ETL works much faster.

0 comments

1 comments · 1 top-level

oxfordmale6y ago

There is two approaches we follow. The first approach, which is quite slow, bulk extract the data to S3 and then run our transforms on top of that. If we need high performance, we write a delta streamer that only streams modified records.

j / k navigate · click thread line to collapse