A weakness of these types of SQL questions however is that it's near impossible for the interviewer to provide help/guidance; the questions are often know-it-or-don't questions (especially involving window functions w/ uncommon syntax). A data science interviewer years ago mocked me during a whiteboard test for not knowing the BETWEEN ROW syntax for a window function.
That said, as an IRL data scientist, the amount of times I've had to do a SQL self-join in the past few years can be counted on one hand. And the BETWEEN ROW syntax for a window function.
If you don't get the offer there's no bullet to dodge, right?
</pedandtry>
Sad. One needs a thick skin to interview, because the manner of rejection is far more detrimental than rejection itself. I have seen friends cry unable to believe their luck or themselves. Often, due to negative bias, it takes a lot of confidence away from an individual and they start to feel inadequate, doubly so, if they had prepared super hard and yet failed because they couldn't remember a trivia.
One of the best interviewing advice I got from a mentor was, interviews aren't a pissing-off or a dick-measuring contest but often are. It changed the way I approached the interviews as an interviewee. Also, a good thing that I seeked for and got this advice way before I started interviewing candidates.
That's really reprehensible.
One point I've made repeatedly here on HN is that technical "interviews" have morphed into a form of entrance exam with very little oversight.
If you read about entrance exams at older institutions and professions with entrance exams (the bar, medical and nursing boards, actuarial exams, and so forth), you'll find that a they are often considered among the more stressful events in a person's academic career. They are often (and should be) very rigorous, but I do think a certain unspoken bill of rights has emerged to protect the student as well was the people conducting the exams. For example, entrance exams should have a study path, a known body of knowledge that is getting tested. They should be graded consistently and fairly, by acknowledged experts in the field. There should be a way to file a grievance, and the evaluation metrics must be "transparent" - if not the specific deliberations, then at least the general approach.
Tech interview exams have none of this. They are conducted very, very capriciously, often by people who have limited skills and experience - even if they are experts in their field (which they often aren't), they may not have any idea how to evaluate a candidate.
One basic tenet here is that you don't "mock" a candidate. Seriously, wow.
I've used BETWEEN ROW maybe once or twice in my career in a professional setting. Self-joins more often, but as others have pointed out window functions are more efficient here for writing dashboard ETLs, etc.
Btw, are you minimaxir who wrote gpt-2-simple? I was looking at your tutorial a month ago while putting together a solution for the Kaggle COVID-19 NLP challenge!
This is only useful if you tell candidates to study a book for a week first.
Otherwise you are filtering for narrowminded memorizers, not smart people who can learn and solve problems
Sure OP has presented some example solutions, but most of these solutions can be built up from the basics of tabular data manipulation. Furthermore, there are multiple steps involved that would allow a candidate to show their grasp of these fundamentals and receive help from an interviewer (that doesn’t trivialise the problem).
In the case of percent change in MAU, you don’t need to know how to do a self join, exactly. But rather to identify that you would want to do a self join and on what conditions would be the key
If you had a pseudo code for sets, joins, etc, it'd probably be using more mathematical symbols.
I've given SQL interview questions quite a few times and like to present a sample dataset or schema, and state a problem statement or expected result. Then I collaborate with the candidate to come up with assumptions, and I give gentle guidance if they are stuck or going down the wrong path.
I'm always up-front with the candidate about not looking for perfect syntax, and I am more interested in their problem solving and collaboration skills than the actual SQL they write.
People who ask ask questions about the data and schema usually do much better than those who jump right into the solution. Just like in the real world where really nice SQL or code doesn't matter if it solves the wrong problem.
I'd wager that if you re-wrote those snippets as Ruby or Scala operating against something tabular like CSV files, with all the joins and aggregates and so on done in code, it would look uglier.
* ok sometimes sql (thinking specifically of SELECT statements) gets code-y, e.g. with inline formulas. But generally its more on the specification side.
Codebases I work in continue to use ORMs extensively, but when I need a query of any complexity I'm far more likely to start with raw SQL than to try to make the ORM do what I want. I'm far more likely to "let the database do the work" when I need to do any kind of data analysis or reconciliation. It is very good at what it does, and if you understand how it works it's very simple to do complex things, and quickly. For me, SQL often elicits a response of wonder, and even gratitude for its expressiveness and power.
I also suspect your feelings are rooted in syntactic aesthetics (there's little else in common between SQL and COBOL, for example). I can certainly agree that the syntax of SQL is not what I would choose (though I would not look to Ruby or Scala as examples of my preferences). But SQL is not just syntax, it's also a tool with incredibly expressive capabilities for viewing, analyzing and manipulating data effectively.
To you this is a bad thing, to me, this is exactly why I like SQL. I don't need to be in love with the elegance and majesty of a language I just need it to work and SQL typically just works and 9 times out of 10 it hooks up nicely with whatever high level language I'm using.
And for what it's worth, I've seen quite a few senior level SQL developers write some pythonic SQL queries that would put my regular Python programming to shame.
Don’t get me wrong, I see where SQL shines, but I also see where my (your, their, anyone’s) mind shines and where it will lose its traction and prediction abilities. People are fine with some level of declarativeness, but add some more and it turns into hardest puzzles where formulating it correctly is a hardly provable task itself. While that’s true for imperative code, it breaks along the way of your thinking, and not at randomly ‘optimized’ expressions. UB issues in C language group is essentially the same sort of trouble.
And although it may be a matter of opinion, but I find a query that runs one or more orders of magnitude faster than another query to be a thing of beauty.
There's precious little difference between
set.select { |x| x.a > 10 }
.group_by { |x| x.c }
.map { |c, xs| xs.length }
and select count(*)
from set
where a > 10
group by c
except the latter has more scope for optimization.Whether or not you're actually writing SQL, it's a good conceptual approach.
The ironic thing is that the way developers like to deal with data today is more like how they did in the early days of COBOL, to which SQL was an improvement.
The first computerized databases were navigational, which just means hierarchical objects. You "navigated" through the data to find the parts that were interesting for a given query, just like with JSON you might loop around through the properties. In 1973 Charles Bachman wrote a book called The Programmer as Navigator.
These data structures were insidious, because: (1) you wind up with duplicated data, vulnerable to getting out of sync with itself, and (2) complex queries can get slow. For example, imagine an array of Customer objects. Each has an Orders field, which is an array of Order objects. Each of those has, among other things, fields for the item name and description, and so on. With this structure, it's easy to fetch all the orders of a certain customer, but it's slow and complex for other queries, like the total number of orders for each item. For that, you might duplicate the data into a different structure. It was just like NoSQL, only there was no SQL at the time. It was PreSQL.
Programmers are immediately attracted to such data structures because they are amenable to the first few pages you have in mind to build. It's really easy to run those nested objects through a template and output HTML, and it's straightforward to take data from a form and save it as one of these objects. As your application grows and spirals, though, your original data structures become more and more cumbersome and less suited to the new pages you have to make.
This was a problem in the 1960s and 70s just as much as today, which is why E. F. Codd wrote his papers, most famously "A Relational Model of Data for Large Shared Data Banks." You might say, relational? Those old navigatorial objects sounded like they had lots of relationships. But it is a popular misconception that Relational here meant the relationships among tables (i.e., foreign keys). Dr. Codd was a mathematician, and he meant the mathematical term relation, which is essentially a grid of values, a table. So they were called relational databases not because you could relate one table to another but because they were databases made up of relations (tables).
Tabular data solves the speed problem in navigational data. But now fetching your data is even more tedious, if you have to navigate those tables by hand (Loop through Table A. If the value in cell 12 > 10, then save it to a temporary variable...). But in that very same paper, Codd also proposed a very high-level language for working with the tables. It wasn't called SQL. IBM came up with SQL specifically, after examining Codd's papers (who in fact was a researcher at IBM). Believe me, SQL was an improvement. Codd's original language, called Alpha, was mathematical hieroglyphics. The foundation was solid but the user friendliness was lacking. SQL was an attempt to have the same nature but resemble English instead of Mathematics.
But the two pillars, tabular data structures and a high-level query language, were introduced simultaneously and are both equally part of what makes SQL what it is. Which one would you like to remove?
Chesterton's Fence comes to mind when watching programmers meet SQL:
"In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, 'I don't see the use of this; let us clear it away.' To which the more intelligent type of reformer will do well to answer: 'If you don't see the use of it, I certainly won't let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.'" --- https://en.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fence
> If it helps, you don't really have to type SQL key words in all capitals ;) I stopped years ago.
I continue to urge my team to capitalize SQL keywords. This is because the vast majority of SQL queries in our codebases are embedded in another language, as a string. Syntax highlighting is not available (I know there are some tools for this in some environments for some host languages, but it's not widely available or even remotely a solved problem). Static analysis tools for this scenario are generally hard to come by. Every syntactical hint is a godsend for reading and comprehending these queries. I also encourage quoting every identifier even if it isn't strictly required, and extensively using whitespace to make a query's structure more apparent.
If I were writing and reading SQL under better circumstances, it's quite likely I would have different preferences.
Probably the biggest single determinant of a SQL query's performance is the order of joins. The best join order is dependent on how many rows result after each join, which in turn depends on how selective the predicates are that can be applied to each table, and that depends on the data distribution. The database does this with statistics. That means it can change the join order - the nesting level of your respective loops - as the data distribution changes.
The bigger issue is the lack of performant abstractions - I copy-paste orders of magnitude more code in SQL than in any other language I've used. There are basically just views and functions, and both can carry substantial (read: multiple orders of magnitude) performance penalties compared to the copy-paste approach, especially when you try to nest or compose them. Materialized views rectify this somewhat but they come with various RDBMS-specific limitations and gotchas.
A friend of mine did numerous interviews at a large company, hours out of his time and those of the interviewers, only to be caught up by some inane SQL question asked by a know-nothing after the entire process of interviews had been completed.
Why not ask about obscure regex expressions? Better yet, how about baseball scores? Hair bands from the 80s?
It's time for the valley to get real about how to judge the merit of applicants. The present state of affairs in tech recruiting is a joke.
You may say no, that any competent person can read the book and learn this stuff. But in my hiring experience there are sadly oceans of coders who, say, reach for 100 lines of buggy slow Java code to solve something that could be done in 3 lines of SQL. Once you've hired that person they do not magically turn into an efficient or self-aware programmer.
Such questions don't test for that, they test for rote memorization. I don't learn things by heart I only use rarely, and certainly not to help you with a problem I'm not causing and that such memorization doesn't fix.
Second, the most important skill to DBA or Data Scientist role and to an employer is finding someone that understands the data domain well enough to point out obvious mistakes, bad approaches and thinking to managers and provide leadership, not just access, to information.
Third, "any competent person" is just belittling a skillset that, like programming, can take years or decades to truly understand.
Then again, this depends a lot on your architecture. For instance, if your database is running on expensive hardware with very limited CPU and memory quotas then often it's better to export intensive problems to the application. Also a lot of the more complicated SQL commands are implemented differently in all the dialects of MySQL, MSSQL, Oracle and DB2. Sticking to simpler queries lends itself to cross platform compatibility if you need that.
I think that when you run into someone who knows these in depth things about SQL you either have a smart technically aware person or a rote learner. I think the best bet is to ask them how they would use the different stuff and why. Rote learners will hit a brick wall or answer generically.
If there's one thing I've noticed among the "10x" coders, then it is that they know their environments and tools like their back pockets.
If you can't solve the problems after looking it up, then perhaps you are not a SQL expert. I think these sort of questions are valuable.
In my experience, once you're hitting hundreds of millions of records, implementation details of your database engine will start to matter. A database designed for transactional workloads like Postgres will start to choke on aggregate and window functions, often taking minutes to run instead of milliseconds. A columnar database like Redshift (which exposes a SQL interface) will breeze through it without a sweat.
That query is much simpler, safer, faster, no self-join or windowing necessary, and you can properly handle missing months in your dataset in your higher-level language (which the provided solution doesn't do BTW).
And you're still getting the performance boost from the DB indexes for grouping and sorting.
Also, a lot of these would need to be done in SQL in practice since the data wouldn't fit in memory. Any solution that requires loading the table [login (user_id, timestamp)] into memory probably won't scale very well!
(1) The flavor of SQL I use at work supports macros, which are functions that can take in parameters like a function in R/Python might. So, the SQL is "turbo-charged" in that sense and some of the value-added of switching over to Python/R is diminished.
(Big Query has UDFs, which seem similar: https://cloud.google.com/bigquery/docs/reference/standard-sq...)
(2) Like I mentioned in the doc, I personally use the SQL in these practice problems for ETLs on dashboards showing trends. AFAIK, much easier/efficient to write metrics for daily ETLs in SQL than R or Python, especially if these are top-line metrics like MAU.
1. MoM Percent Change
It's better to use windowing functions, I believe it should be faster than self-join.
2. It seems that the first solution is wrong -- it returns whether "a"-s parent is Root/Inner/Leaf, not "a" itself.
I'd instead add a "has_children" column to the table, and then it would be clear.
Second solution works, but without optimization it's 1 query per row due to nested query -- slow, but not mentioned.
The question just screams for windowing functions, and cumsum is a canonical example for them.
Sorry post author, you'd fail my interview :)
Sorry you feel that way! Thankfully my employer felt differently :)
(You can make "but what if a month is missing?" a latter part of a multi-part interview question)
Generally I would assume that data engineers would have a month of no users set to zero or that I could ask them why that's not the case and note that for future reference.
Your statement is making the assumption you have completely dense data and you can simply offset a number of rows to get the desired denominator. Sparse data is a very common occurrence, and now your MoM/YoY/XoX queries are completely incorrect.
One of the best SQL interview questions is "Explain what is wrong with DISTINCT and how to work around it".
But every time, it interferes badly with any kind of locking (that's DBMS dependent, of course), and imposes a high performance penalty (on every DBMS).
Often better to group explicitly so you know what's actually going on.
You learn enough to do the queries you need for your project, they work then you forget about them as you work on the rest of your project. These skills are perishable. Left outer join? Yeah, I knew what that was some time ago, but not anymore
The days of dedicated SQL programers are mostly gone.
Only in a "Every database has 2 TB of memory and 64 cores"-world is SQL (and database design) a negligible skill.
As I responded to another comment, I have to wonder if that's because you haven't spent the time to become proficient enough to appreciate them. I understand that a lot of programmers decry "magic" and want to get at the underlying steps and components that produce a given result, but literally all of us work at some level of abstraction above that because it would be impossible not to. None of us are directly transforming sand with electricity into computed values.
You're describing being more comfortable with an imperative set of statements than a declarative description of results. Assuming you write automated tests, think of the declarative approach as writing the description (rather than the implementation) of a test, except that the description is less freeform and must conform to a certain syntax and structure so that a machine can write the test for you.
It's "magic", but it's only so much magic because the scope of what it can do for you is limited and well defined.
Except that you know/expect how exactly it will be executed and if it doesn’t, maybe test should not really pass. SQL in general is full of this “we write declarative queries expecting this exact imperative result and investigating if it’s not”. It’s much like an interview question: it may have many different answers, but you must provide the one that satisfies a grumpy plan builder guy. I know sql and use it when it’s shorter/more descriptive, but sometimes you just want to take them rows and do them things, without leaving a database process and its guarantees.
Not much db experience, but the fact that such powerful engines (i.e. acid, indexing, good ods, pooling, etc) are always hidden beyond some cryptic uppercase frontend with a weak execution layer always bothered me. Just give me that postgres-grade index scan directly in C, dammit. /rant (inb4 just write some parts as a sp in a language of choice)
I don't intend to be glib, but rather to illustrate some of the benefits of a database platform like modern RDBMSes. These things are all implemented and battle-tested for you.
If you would prefer a more fine-grained control over execution of queries, you could probably get very far by starting with the SQLite code base and working at a more primitive level through its library.
This is from the perspective of trying to get queries to run in 5 minutes instead of 30 minutes instead of hours or days or forever, not brief transactions measured in milliseconds. And it's not something I figured out on my own, but by paying attention to the guy who never talked but was consistently 10x faster in producing reports than anyone.
The thing you should not do, that I also saw people do, is use procedural PL/SQL or T-SQL to process things in a loop - that can be orders of magnitude slower.
Honestly, Postgres does it right - you can enforce your query plan to any level of detail you want.
Is one column composed of mostly one or two values? Then an index lookup on that column is not very optimal, and the database can use something else.
There is more than one type of join (INNER, LEFT OUTER, etc), and more than one join algorithm (neesed loop, merge, hash, etc). All these change based on the data. Even the join order can have a huge impact on query time, and needs to adapt based on the number of rows you'll pull from each table.
A lot of SQL queries are built from templates, or built by ORMs. Optimisations are critical to turn these templates queries into something efficient.
SQL can also be very expressive, a "NOT EXISTS (SELECT 1 FROM thing WHERE foobar)" could be more readable than doing a join and where clause.
Though interestingly, you get a much more declarative query style with nosql databases and key-value stores. So there are alternatives out there.
- SQL:1999 https://en.wikipedia.org/wiki/SQL:1999#Common_table_expressi...
- SQL standardization https://en.wikipedia.org/wiki/SQL#Interoperability_and_stand...
It is so usefull, reliable and does not change every year.
I've always found explicit temporary tables, where I can add indexes, are often a great solution for performance and readability.
Some databases use lock structures that automatically detect and resolve deadlocks, so from a user standpoint there are no deadlocks but deadlock resolution has visible side effects that are implementation defined.
I highly recommend the SQL Cookbook for newer SQL dev as a quick reference comparison to see how often this is the case for even trivial problems in any RDBMS.
Also, for your language I wouldn't expect all mid tier SQL devs to be able to write a recursive CTE from memory (though its useful, you can just look it up again), but something like breaking apart a query plan on your platform of choice is way more important as that muscle allows you to tell if your CTE was crap or not.
I learned how much when I joined a big tech company. The devs don’t write sql unless processing logs in the warehouse. But everyone from PM to support to data science and marketing all write sql.
What would one of these do?
Is it only useful for ad hoc/analytical queries, or am I missing something?
People can define, describe and provide examples but with SQL, it won't sink in until you try it out yourself and have the "Ah ha" moment. If you have a database server and some test data, try writing a few window function queries or try some online examples.
The data is already sorted, partitioned, in memory .. adding a moving average calc or max value per category is certainly faster than fetching the set to disk, recreating it in an intermediate structure and then calculating the new value with all that partitioning, sorting, etc. to be redone client side.
You can certainly use them in app dev for things like:
* figuring out how many things are ahead of you in a queue * snapshotting (what's changed since the last time you were here) * comparing something to some other sample for outliers or inconsistencies
These may be quasi analytical still, but ultimately they can manifest as properties of an object model like any other property to be developed against.
However certainly if your app is written in Python or Ruby I can see there being a big difference.
The reason I mention analytics is more to do with the ease of scaling out the DB than the nature of the queries. In busy OLTP databases usually one wants to keep the work off the database because they’re difficult and expensive to scale.
The purpose was to make the questions more realistic, since at least in my experience in data analyst interviews the questions are asked in the context of actual business or product situations ... like company leaders, PMs, or others wanting to understand trends in MAU.
If you want to get undeniably good with SQL, this is the book.
Also I would do a lot of them differently. And I don't think the answer to #6 is valid (in T-SQL at least).
But one thing it falls apart are these time series data processing tasks.
It's because of its model of unordered sets (or multisets to be more precise, but still undordered). When you look at majority of those queries and other real life queries they always involve the dimension of time. Well - then that means we have a natural order of the data - why not use an array data structure instead of this unordered bag and throw the sort order out of the window.
SQL realized this and bolted window functions on top of the original relational model. But you still feel its inadequacy when trying to do simple things such as (x join x on x.t=x.t-1) or even the infamous asof joins where you do (x join y on x.t <= y.t).
In array databases with implicit time order both examples are a trivial linear zip operation on the the sorted tables.
In traditional set oriented SQL it results in horrible O(n^2) cross join with a filter or in the better case od equijoin in sort merge join which still has to do two expensive sorts on my massive time series tables and then perform the same trivival linear zip that the array processor would do. Which is also completely useless on unbounded streams.
Also many stackoverflow answers suggest to use binary blobs for time series data and process them locally in your favorite programming language - which points at wrong conceptual model of SQL engines.
Is SQL really so inadequate for real life data problems or have I been living under a rock and Kdb is no longer the only option to do these trivial time oriented tasks?
Most of these 'time series databases' are for processing of structured logs and metrics to be plugged into you favorite system for monitoring site latency.
Asof join is still an open issue in their issue tracker so it is not usable as a time series database.
Could be you dodged a bullet. A company with advanced interview questions may have some ugly SQL code. For jobs that lean heavily on SQL, I expect candidates to know things like windowing & dynamic variables, in SQL & an ORM library. For SWE's, I feel basic SQL is fine.
I'd be happy if they knew what EXPLAIN was, since that impacts production on a daily basis.
Also, the OP is basically "fighting the last battle" again. Most interviews don't filter out candidates based on SQL.
Source: DBA.
It uses React but I imagine there is some other library like ProseMirror here.
Where are the pivot and unpivot questions?
Mode's SQL tutorial uses SUM(CASE ...) and CROSS JOIN to mimic pivots: https://mode.com/sql-tutorial/sql-pivot-table/
PostgreSQL can do it via the tablefunc extension and there's `\crosstabview` which is built into psql.
https://www.postgresql.org/docs/current/tablefunc.html
https://www.postgresql.org/docs/12/app-psql.html#APP-PSQL-ME...
select
node
, (case when parent is null then 'Root'
when exists (
select * from tree c
where c.parent = node
) then 'Inner'
else 'Leaf'
end) "label"
from tree
EDIT: also, in the fourth, it seems like you'd want to partition the window function, who cares about order. Something like sum(cash_flow) over (partition by date) "cumulative_cf" select
t.node,
case
when t.parent is null then 'Root'
when p.parent is null then 'Leaf'
else 'Inner'
end as label
from tree t
left join (select distinct parent from tree) p on t.node = p.parentIn an interview, presumably my logic would, hopefully, shine through minor issues of syntax.
Where would that put me? Maybe "okay to decent" when dealing with "medium-hard" questions?
I would fail utterly at DBA management SQL and stored procedures, my responsibilities skew towards data analysis.
select * from (select adddate('1970-01-01',t4.i10000 + t3.i1000 + t2.i100 + t1.i10 + t0.i) selected_date from (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t0, (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t1, (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t2, (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t3, (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t4) v where selected_date between '2016-01-01' and now()
This works in MySQL / MariaDB.
Previous MAU 1000 Current MAU 2000 Percent Change -100
1. use some underlying implementation detail of the particular RDBMS or proprietary extension to the standard
2. are essentially tricks, like the typical ridiculous interview problem designed to "make you think outside the box". Yes, you can do almost anything in SQL but often you should not.
I get the perspective here is data analysis where you probably need to know more SQL than the standard developer, but I still feel you should be testing for solid understanding of basics, understanding of the relational algebra concepts being used and awareness of some typical gotchas or bad smells. That's it. They'll be able to google the precise syntax for that tricky by date query when you're not guaranteed to have data for every month or whatever on-demand.