Medium-hard SQL interview questions

JohnTHaller 6y ago

I've been using SQL for 24 years and don't think I've ever used BETWEEN ROW before. Did read up on it now, though.

psaux 6y ago

I wrote sql for years for a forex platform used by some of the top dogs (literal). Pretty hardcore sql and had to be very fast, never used BETWEEN ROW.

thom 6y ago

I use it all the time, both in ad hoc queries and in data prep for models. Most commonly just to look into a future window from the current event (e.g. did this pass in a game of soccer lead to a shot). It's quite rare that I'll specifically care about X rows preceding or following, but it does happen that you care about a shorter period than the entire window. Entirely possible people are using graph databases for that sort of thing, but the window syntax is nice.
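
A minimal sketch of that "look ahead from the current event" pattern, run through Python's sqlite3 module (assuming a SQLite build with window functions, 3.25 or later). The `events` table, its columns, and the two-event lookahead are hypothetical choices for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (match_id INTEGER, seq INTEGER, kind TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, 1, "pass"),
        (1, 2, "pass"),
        (1, 3, "shot"),
        (1, 4, "pass"),
        (1, 5, "tackle"),
        (1, 6, "pass"),
    ],
)

# For every event, count the shots among the next two events of the same match.
# The frame starts after the current row, so the event itself is not counted.
rows = conn.execute(
    """
    SELECT seq,
           kind,
           SUM(kind = 'shot') OVER (
               PARTITION BY match_id
               ORDER BY seq
               ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING
           ) AS shots_in_next_two
    FROM events
    ORDER BY seq
    """
).fetchall()

# An empty frame (the last event) yields NULL/None, which is falsy here.
passes_leading_to_shot = [seq for seq, kind, shots in rows if kind == "pass" and shots]
print(passes_leading_to_shot)
```

The frame clause is the only part that differs from an ordinary windowed sum; widening or narrowing the lookahead is a one-line change.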

nogabebop23 6y ago

There are so many specialized aspects that you'll never use outside of an interview or an edge case when you can learn it on-demand.

markus_zhang 6y ago

I have to use it quite often because the dialect I'm using defaults to BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which I have to manually change to BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
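
A small sketch of why that default matters, using SQLite through Python's sqlite3 as a stand-in dialect (the commenter's dialect is not named; SQLite happens to share the same default when ORDER BY is present). The `prices` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (day INTEGER, price INTEGER)")
conn.executemany("INSERT INTO prices VALUES (?, ?)", [(1, 10), (2, 12), (3, 11)])

rows = conn.execute(
    """
    SELECT day,
           -- Default frame: everything up to and including the current row,
           -- so "last value" is just the current row's price.
           LAST_VALUE(price) OVER (ORDER BY day) AS default_frame,
           -- Explicit frame: the whole partition, so every row sees day 3.
           LAST_VALUE(price) OVER (
               ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS whole_partition
    FROM prices
    ORDER BY day
    """
).fetchall()
print(rows)
```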

ignoramous 6y ago

> A data science interviewer years ago mocked me during a whiteboard test for not knowing the BETWEEN ROW syntax for a window function.

Sad. One needs a thick skin to interview, because the manner of rejection is far more detrimental than the rejection itself. I have seen friends cry, unable to believe their luck or themselves. Often, due to negative bias, it takes a lot of confidence away from an individual and they start to feel inadequate, doubly so if they had prepared super hard and yet failed because they couldn't remember a piece of trivia.

One of the best pieces of interviewing advice I got from a mentor was that interviews aren't supposed to be a pissing or dick-measuring contest, but often are. It changed the way I approached interviews as an interviewee. Also, it's a good thing that I sought and got this advice well before I started interviewing candidates.

geebee 6y ago

"mocked me".

That's really reprehensible.

One point I've made repeatedly here on HN is that technical "interviews" have morphed into a form of entrance exam with very little oversight.

If you read about entrance exams at older institutions and professions with entrance exams (the bar, medical and nursing boards, actuarial exams, and so forth), you'll find that they are often considered among the more stressful events in a person's academic career. They are often (and should be) very rigorous, but I do think a certain unspoken bill of rights has emerged to protect the student as well as the people conducting the exams. For example, entrance exams should have a study path, a known body of knowledge that is getting tested. They should be graded consistently and fairly, by acknowledged experts in the field. There should be a way to file a grievance, and the evaluation metrics must be "transparent" - if not the specific deliberations, then at least the general approach.

Tech interview exams have none of this. They are conducted very, very capriciously, often by people who have limited skills and experience - even if they are experts in their field (which they often aren't), they may not have any idea how to evaluate a candidate.

One basic tenet here is that you don't "mock" a candidate. Seriously, wow.

thomzi12 (OP) 6y ago

Sure, that's a fair criticism. That said, you can build multi-step interview problems with SQL (I tried to convey one or two in this doc) such that interviewers can build up towards needing a more advanced window function instead of starting there.

I've used BETWEEN ROW maybe once or twice in my career in a professional setting. Self-joins more often, but as others have pointed out window functions are more efficient here for writing dashboard ETLs, etc.

Btw, are you minimaxir who wrote gpt-2-simple? I was looking at your tutorial a month ago while putting together a solution for the Kaggle COVID-19 NLP challenge!

minimaxir 6y ago

yep! :) Glad you made use of it!

lonelappde 6y ago

Why would you ask interview questions to quiz on skills that can be learned in a week?

This is only useful if you tell candidates to study a book for a week first.

Otherwise you are filtering for narrow-minded memorizers, not smart people who can learn and solve problems.

onemoresoop 6y ago

There's no shame in knowing what it is or the theory behind a concept but not knowing the syntax, especially if the interview is not for a DBA. I'd be honest and ask to look it up online or move on.

truculent 6y ago

I don’t think it’s fair to say these examples are know-it-or-not.

Sure OP has presented some example solutions, but most of these solutions can be built up from the basics of tabular data manipulation. Furthermore, there are multiple steps involved that would allow a candidate to show their grasp of these fundamentals and receive help from an interviewer (that doesn’t trivialise the problem).

xivzgrev 6y ago

Then you’ve only encountered shitty interviewers. The point isn’t if you know some syntax, that can be easily taught / googled. The point is how you think about the problem.

In the case of percent change in MAU, you don’t need to know how to do a self join, exactly, but rather to identify that you would want a self join and what the join conditions would be.
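
For reference, a sketch of what that self join could look like, run through Python's sqlite3. The `logins` table, its columns, and the sample rows are assumptions for illustration, not the article's exact solution; the key idea is joining each month's count to the previous month's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [
        (1, "2020-01-05"), (1, "2020-01-20"), (2, "2020-01-09"),                     # Jan: 2 users
        (1, "2020-02-01"), (2, "2020-02-11"), (3, "2020-02-20"), (4, "2020-02-21"),  # Feb: 4 users
        (1, "2020-03-03"), (3, "2020-03-15"), (5, "2020-03-30"),                     # Mar: 3 users
    ],
)

rows = conn.execute(
    """
    WITH mau AS (
        SELECT date(ts, 'start of month') AS month_start,
               COUNT(DISTINCT user_id)    AS users
        FROM logins
        GROUP BY month_start
    )
    SELECT cur.month_start,
           100.0 * (cur.users - prev.users) / prev.users AS pct_change
    FROM mau AS cur
    JOIN mau AS prev
      -- The join condition is the whole trick: pair each month with the one before it.
      ON prev.month_start = date(cur.month_start, '-1 month')
    ORDER BY cur.month_start
    """
).fetchall()
print(rows)
```

Because this is an inner join, the first month (and any month whose predecessor has no logins) simply drops out of the result.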

WoahNoun 6y ago

Yea, I ask the employee/manager query problem sometimes. I'm not looking for the right answer, but rather how they think through the problem and how they think about depth bounds. Occasionally I'll have a candidate who can throw out an Oracle "CONNECT BY", but it is absolutely not what I'm looking for (even though that's probably the "most correct" answer).
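
Outside Oracle, one common way to express the same hierarchy walk is a recursive CTE. The sketch below uses Python's sqlite3 with a hypothetical `employees` table and an explicit depth bound passed as a parameter; it is one possible answer, not the one the interviewer has in mind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "ceo", None), (2, "vp", 1), (3, "lead", 2), (4, "dev", 3), (5, "intern", 4)],
)

MAX_DEPTH = 3  # hypothetical bound on how far down the org chart to walk

rows = conn.execute(
    """
    WITH RECURSIVE chain(id, name, depth) AS (
        -- Anchor: the root of the hierarchy has no manager.
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- Recursive step: direct reports of anyone already in the chain.
        SELECT e.id, e.name, c.depth + 1
        FROM employees AS e
        JOIN chain AS c ON e.manager_id = c.id
        WHERE c.depth + 1 <= ?
    )
    SELECT name, depth FROM chain ORDER BY depth
    """,
    (MAX_DEPTH,),
).fetchall()
print(rows)
```

The depth column doubles as a guard: without a bound, a cycle in `manager_id` would make the recursion run until the engine gives up.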

vsareto 6y ago

I'd say it's unlikely you know about self joins but don't know at least one syntax for it just because of how we traditionally teach SQL to others. This is the know-it-or-not part: SQL is rarely taught without syntax and syntax is used to illustrate the theories. This is in contrast to regular programming where people just pick up pseudo code eventually.

If you had a pseudo code for sets, joins, etc, it'd probably be using more mathematical symbols.

megaframe 6y ago

I think it depends on your data / query patterns. Self Inner/Left/Right joins are one of my most common queries. I often need to chain joins in various ways to get the desired output.

crazygringo 6y ago

I've never heard of BETWEEN ROW, but self-joins have been a pretty common thing for me, particularly when you're doing analysis on any kind of business-centric table that (rightly) has tons of columns -- e.g. a "users" or "products" or "items" or "orders" table.

jlj 6y ago

> A weakness of these types of SQL questions however is that it's near impossible for the interviewer to provide help/guidance

I've given SQL interview questions quite a few times and like to present a sample dataset or schema, and state a problem statement or expected result. Then I collaborate with the candidate to come up with assumptions, and I give gentle guidance if they are stuck or going down the wrong path.

I'm always up-front with the candidate about not looking for perfect syntax, and I am more interested in their problem solving and collaboration skills than the actual SQL they write.

People who ask questions about the data and schema usually do much better than those who jump right into the solution. Just like in the real world, where really nice SQL or code doesn't matter if it solves the wrong problem.

noelsusman 6y ago

That's pretty funny. I've been doing data science for ten years or so and I've never heard of BETWEEN ROW before.

kyberias 6y ago

As if "data science" would be a field where the full scope of SQL is covered. SQL was developed for the enterprise world.

WoahNoun 6y ago

I use self-joins and windows constantly. It's way easier to have our massive redshift cluster do those calculations in parallel than to try and do it in python on a VM

minimaxir 6y ago

I do use window functions all the time; just not with BETWEEN ROWs. (to your other point, I work with BigQuery on the DB side, but R/dplyr is great at working with window functions as well in a much cleaner manner than Python/pandas).

mycall 6y ago

The problem with SQL windowing functions (OVER ORDER BY, LAG/LEAD, RANK, etc) is that they are nondeterministic.
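
One reading of this claim: when the ORDER BY inside OVER has ties, the relative order of the tied rows is unspecified, so ROW_NUMBER, LAG and LEAD may differ between runs or engines. A minimal sketch with Python's sqlite3 and a hypothetical `scores` table, showing how a tiebreaker column removes the ambiguity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)", [("bob", 90), ("ann", 90), ("cat", 80)]
)

# ORDER BY score alone would leave the order of ann and bob unspecified, so
# ROW_NUMBER() could number them either way. Adding name as a tiebreaker makes
# the ordering total. RANK() gives tied rows the same value regardless.
rows = conn.execute(
    """
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS rn,
           RANK()       OVER (ORDER BY score DESC)       AS rnk
    FROM scores
    ORDER BY rn
    """
).fetchall()
print(rows)
```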

revscat 6y ago

Every time I see an article such as this it reminds me how much I deeply abhor SQL. It is an ugly language, closer in feel to COBOL than something that can at times approach elegance, like Ruby or Scala. With languages like those, you can look at your work after you are done and be proud of it beyond its purely functional aspect. SQL never elicits a response beyond “the task is finished and it does what I want”, typically with a “finally” in there somewhere.

gigatexal 6y ago

Where you see ugliness I see beauty. I guess it is just what clicks for someone and what doesn’t. But together our strengths make for one really good programmer hence why I like working on teams.

fiddlerwoaroof 6y ago

Yeah, my experience is basically the opposite to the GP's as well: for expressing the projection of data I want to work with, there are very few syntaxes I find more elegant than an SQL query. (I have similar thoughts about CSS)

Scene_Cast2 6y ago

With SQL as an analysis / insights tool (as opposed to prod or dashboard use), answering one (nontrivial) question leads to follow-up questions that need a lot of query restructuring.

codeulike 6y ago

SQL isn't really code though, it's more like a specification of what you want. Working out how to actually get what you want is then (usually) the engine's problem. And when you're describing precisely what you want from large datasets with lots of columns, it's always going to look ugly.

I'd wager that if you re-wrote those snippets as Ruby or Scala operating against something tabular like CSV files, with all the joins and aggregates and so on done in code, it would look uglier.

* ok sometimes sql (thinking specifically of SELECT statements) gets code-y, e.g. with inline formulas. But generally it's more on the specification side.

Have you considered that your distaste for SQL has prevented you from becoming proficient enough with it to appreciate it? I used to share your distaste, but as I've been faced with problems where SQL is not only the objectively best solution but also the other available solutions are either impossible (due to, for instance, memory constraints) or significantly harder to understand and maintain (due to SQL's expressiveness), I've come to learn how to use the tool more effectively and in turn I've come to deeply appreciate it.

Codebases I work in continue to use ORMs extensively, but when I need a query of any complexity I'm far more likely to start with raw SQL than to try to make the ORM do what I want. I'm far more likely to "let the database do the work" when I need to do any kind of data analysis or reconciliation. It is very good at what it does, and if you understand how it works it's very simple to do complex things, and quickly. For me, SQL often elicits a response of wonder, and even gratitude for its expressiveness and power.

I also suspect your feelings are rooted in syntactic aesthetics (there's little else in common between SQL and COBOL, for example). I can certainly agree that the syntax of SQL is not what I would choose (though I would not look to Ruby or Scala as examples of my preferences). But SQL is not just syntax, it's also a tool with incredibly expressive capabilities for viewing, analyzing and manipulating data effectively.

BoysenberryPi 6y ago

>SQL never elicits a response beyond “the task is finished and it does what I want”...

To you this is a bad thing, to me, this is exactly why I like SQL. I don't need to be in love with the elegance and majesty of a language I just need it to work and SQL typically just works and 9 times out of 10 it hooks up nicely with whatever high level language I'm using.

And for what it's worth, I've seen quite a few senior level SQL developers write some pythonic SQL queries that would put my regular Python programming to shame.

wruza 6y ago

It can also play a devil’s advocate game with your data. Last time I wrote a 4-level join on complex conditions and CTEs to find the latest effective non-deleted non-property-deactivated maybe-from-a-group-default-markupped product sale price, it worked like a charm, really. Until it broke two months later and started multiplying rows in a frontend. It wasn’t me who designed that schema, if that matters. We had already sold that part of the business by then, and the poor guy who took over its support had to repeat my hair-pulling investigation into this madness and how to coerce it into working, while production suffered badly. If the solution had been imperative-y, it wouldn’t naturally show such an effect (loop over products, look up a price by an algo, maybe presort a few tables). I would have made it like that hands down, but the requirements were to use the database as is and no workaround was suitable. Now imagine this error in a financial analysis before an aggregation, with no clear human-controlled sum checkpoints.

Don’t get me wrong, I see where SQL shines, but I also see where my (your, their, anyone’s) mind shines and where it will lose its traction and predictive abilities. People are fine with some level of declarativeness, but add some more and it turns into the hardest of puzzles, where formulating it correctly is itself a hardly provable task. While that’s true for imperative code too, it breaks along the way of your thinking, and not at randomly ‘optimized’ expressions. UB issues in the C language family are essentially the same sort of trouble.

perl4ever 6y ago

This is a dangerous attitude, because it leads to someone like yourself being assigned a task and doing it in a loop in the procedural language associated with your database, where it could be done in one query that is orders of magnitude faster.

And although it may be a matter of opinion, I find a query that runs one or more orders of magnitude faster than another query to be a thing of beauty.

barrkel 6y ago

If you think in terms of relational algebra, and forget about the SQL syntax, it's easier to see the beauty.

There's precious little difference between

    set.select { |x| x.a > 10 }
      .group_by { |x| x.c }
      .map { |c, xs| xs.length }

and

    select count(*)
    from set
    where a > 10
    group by c

except the latter has more scope for optimization.
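
The equivalence can be checked directly. This sketch does the same filter, group, and count once in plain Python and once in SQL via sqlite3. The sample data is made up, the table is named `items` rather than `set` to avoid the SQL keyword, and the query also selects `c` so the two results can be compared key by key.

```python
import sqlite3
from collections import Counter

data = [(5, "x"), (11, "x"), (12, "y"), (20, "x"), (30, "y"), (3, "z")]

# In-memory version: keep rows with a > 10, group by c, count each group.
in_memory = Counter(c for a, c in data if a > 10)

# SQL version of the same pipeline.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (a INTEGER, c TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)", data)
in_sql = dict(
    conn.execute("SELECT c, COUNT(*) FROM items WHERE a > 10 GROUP BY c")
)

print(dict(in_memory), in_sql)
```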

emmelaich 6y ago

This is key! Programmers can easily tie themselves in knots answering these questions with a 'normal' programming language.

Whether or not you're actually writing SQL, it's a good conceptual approach.

combatentropy 6y ago

If it helps, you don't really have to type SQL key words in all capitals ;) I stopped years ago.

The ironic thing is that the way developers like to deal with data today is more like how they did in the early days of COBOL, to which SQL was an improvement.

The first computerized databases were navigational, which just means hierarchical objects. You "navigated" through the data to find the parts that were interesting for a given query, just like with JSON you might loop around through the properties. In 1973 Charles Bachman gave his Turing Award lecture under the title "The Programmer as Navigator."

These data structures were insidious, because: (1) you wind up with duplicated data, vulnerable to getting out of sync with itself, and (2) complex queries can get slow. For example, imagine an array of Customer objects. Each has an Orders field, which is an array of Order objects. Each of those has, among other things, fields for the item name and description, and so on. With this structure, it's easy to fetch all the orders of a certain customer, but it's slow and complex for other queries, like the total number of orders for each item. For that, you might duplicate the data into a different structure. It was just like NoSQL, only there was no SQL at the time. It was PreSQL.

Programmers are immediately attracted to such data structures because they are amenable to the first few pages you have in mind to build. It's really easy to run those nested objects through a template and output HTML, and it's straightforward to take data from a form and save it as one of these objects. As your application grows and spirals, though, your original data structures become more and more cumbersome and less suited to the new pages you have to make.

This was a problem in the 1960s and 70s just as much as today, which is why E. F. Codd wrote his papers, most famously "A Relational Model of Data for Large Shared Data Banks." You might say, relational? Those old navigational objects sounded like they had lots of relationships. But it is a popular misconception that Relational here meant the relationships among tables (i.e., foreign keys). Dr. Codd was a mathematician, and he meant the mathematical term relation, which is essentially a grid of values, a table. So they were called relational databases not because you could relate one table to another but because they were databases made up of relations (tables).

Tabular data solves the speed problem in navigational data. But now fetching your data is even more tedious, if you have to navigate those tables by hand (Loop through Table A. If the value in cell 12 > 10, then save it to a temporary variable...). But in that very same paper, Codd also proposed a very high-level language for working with the tables. It wasn't called SQL. IBM came up with SQL specifically, after examining the papers of Codd (who was in fact a researcher at IBM). Believe me, SQL was an improvement. Codd's original language, called Alpha, was mathematical hieroglyphics. The foundation was solid but the user friendliness was lacking. SQL was an attempt to have the same nature but resemble English instead of Mathematics.

But the two pillars, tabular data structures and a high-level query language, were introduced simultaneously and are both equally part of what makes SQL what it is. Which one would you like to remove?

Chesterton's Fence comes to mind when watching programmers meet SQL:

"In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, 'I don't see the use of this; let us clear it away.' To which the more intelligent type of reformer will do well to answer: 'If you don't see the use of it, I certainly won't let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.'" --- https://en.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fence

(The rest of your comment is very compelling, just have a nitpick.)

> If it helps, you don't really have to type SQL key words in all capitals ;) I stopped years ago.

I continue to urge my team to capitalize SQL keywords. This is because the vast majority of SQL queries in our codebases are embedded in another language, as a string. Syntax highlighting is not available (I know there are some tools for this in some environments for some host languages, but it's not widely available or even remotely a solved problem). Static analysis tools for this scenario are generally hard to come by. Every syntactical hint is a godsend for reading and comprehending these queries. I also encourage quoting every identifier even if it isn't strictly required, and extensively using whitespace to make a query's structure more apparent.

If I were writing and reading SQL under better circumstances, it's quite likely I would have different preferences.

WrtCdEvrydy 6y ago

SQL is an abstraction and I wish there was something on top of it supporting loops but that's why we have proper programming languages...

https://github.com/fishtown-analytics/dbt

If you think of SQL as specifying (without implementing) a pure function operating on a set—rather than imperative steps or statements operating on lists—you'll have a lot better luck reasoning about how to use it to operate on data. CTEs, window functions and joins can all be used to accomplish things you'd do in a loop in an imperative language. It really is a different paradigm, in the same way as logic, functional and object-oriented programming are different paradigms. But if you embrace the paradigm you will surely learn how to accomplish whatever you would in a paradigm more familiar to you.
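
As a small illustration of a window function standing in for a loop, here is a running total computed both ways: with an accumulating loop in Python and with a windowed SUM in SQL (via sqlite3). The `cash_flows` table and its values are made up for the sketch.

```python
import sqlite3
from itertools import accumulate

flows = [(1, 100), (2, -30), (3, 50), (4, -20)]

# Imperative version: walk the rows in order, carrying a running sum.
running_loop = list(accumulate(amount for _, amount in flows))

# Declarative version: describe the running sum as a window over ordered rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cash_flows (day INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO cash_flows VALUES (?, ?)", flows)
running_sql = [
    total
    for (total,) in conn.execute(
        """
        SELECT SUM(amount) OVER (
                   ORDER BY day
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               )
        FROM cash_flows
        ORDER BY day
        """
    )
]

print(running_loop, running_sql)
```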

barrkel 6y ago

The trouble is that any loop you might write may be looping over the wrong thing.

Probably the biggest single determinant of a SQL query's performance is the order of joins. The best join order is dependent on how many rows result after each join, which in turn depends on how selective the predicates are that can be applied to each table, and that depends on the data distribution. The database does this with statistics. That means it can change the join order - the nesting level of your respective loops - as the data distribution changes.

tfehring 6y ago

You can write loops in (the procedural extensions of) SQL, but IME it's very rare that you'd want to.

The bigger issue is the lack of performant abstractions - I copy-paste orders of magnitude more code in SQL than in any other language I've used. There are basically just views and functions, and both can carry substantial (read: multiple orders of magnitude) performance penalties compared to the copy-paste approach, especially when you try to nest or compose them. Materialized views rectify this somewhat but they come with various RDBMS-specific limitations and gotchas.

rpedela 6y ago

One of the best data tools I have ever used: DBT. The big concept is that everything is a SELECT and handles most DDL and DML under the hood. It also provides the ability to add scripting, such as loops, with Jinja. It is primarily meant for OLAP and ELT, but could be used for some OLTP too.

perl4ever 6y ago

I have rewritten a T-SQL loop that called a query each time through with one query that was a thousand times faster. Seriously.
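
The general shape of that kind of rewrite, sketched with Python's sqlite3 rather than T-SQL. The `customers` and `orders` tables are hypothetical, and the sketch only shows that the two forms return the same answer; it makes no claim about the size of the speedup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
    INSERT INTO orders VALUES (1, 1, 10), (2, 1, 15), (3, 2, 7);
    """
)

# Loop version: one extra query per customer (N + 1 queries in total).
looped = {}
for cust_id, name in conn.execute("SELECT id, name FROM customers").fetchall():
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
        (cust_id,),
    ).fetchone()
    looped[name] = total

# Set-based version: a single join and aggregate does the same work.
set_based = dict(
    conn.execute(
        """
        SELECT c.name, COALESCE(SUM(o.amount), 0)
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        """
    )
)

print(looped, set_based)
```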

te_chris 6y ago

Recursive CTEs in Postgres?

danbmil99 6y ago

This is idiotic. Why in the world would testing for rote memorization of something anyone can look up easily be a reasonable filter for talent and experience in a programming role?

A friend of mine did numerous interviews at a large company, hours out of his time and those of the interviewers, only to be caught up by some inane SQL question asked by a know-nothing after the entire process of interviews had been completed.

Why not ask about obscure regex expressions? Better yet, how about baseball scores? Hair bands from the 80s?

It's time for the valley to get real about how to judge the merit of applicants. The present state of affairs in tech recruiting is a joke.

sgustard 6y ago

"Idiotic" is a strong word. The question is: if candidate A answers these successfully, and candidate B does not, is there a reason to believe that candidate A is a better fit for your SQL data analyst position?

You may say no, that any competent person can read the book and learn this stuff. But in my hiring experience there are sadly oceans of coders who, say, reach for 100 lines of buggy slow Java code to solve something that could be done in 3 lines of SQL. Once you've hired that person they do not magically turn into an efficient or self-aware programmer.

mellow2020 6y ago

> there are sadly oceans of coders who, say, reach for 100 lines of buggy slow Java code to solve something that could be done in 3 lines of SQL

Such questions don't test for that, they test for rote memorization. I don't learn things by heart I only use rarely, and certainly not to help you with a problem I'm not causing and that such memorization doesn't fix.

TheBobinator 6y ago

First of all, in your example, the DB server is doing the other 97 lines of work that java had to do; one approach uses data structures the other uses a SDE designed for data management and access. I think most employers would prefer a developer that can do both.

Second, the most important skill for a DBA or Data Scientist role, and for an employer, is finding someone who understands the data domain well enough to point out obvious mistakes, bad approaches and thinking to managers and provide leadership, not just access, to information.

Third, "any competent person" is just belittling a skillset that, like programming, can take years or decades to truly understand.

foxyv 6y ago

I agree with most of what you are saying. But sometimes it's better to go with Java code instead of creating 3 line SQL commands.

Then again, this depends a lot on your architecture. For instance, if your database is running on expensive hardware with very limited CPU and memory quotas then often it's better to export intensive problems to the application. Also a lot of the more complicated SQL commands are implemented differently in all the dialects of MySQL, MSSQL, Oracle and DB2. Sticking to simpler queries lends itself to cross platform compatibility if you need that.

I think that when you run into someone who knows these in depth things about SQL you either have a smart technically aware person or a rote learner. I think the best bet is to ask them how they would use the different stuff and why. Rote learners will hit a brick wall or answer generically.

raz32dust 6y ago

These examples are not obscure at all. If the job involves working with a lot of complex SQL, then these are some great questions to test the candidate's deep understanding of relational queries. It will be critical in their ability to maintain and write SQL quickly. Time and accuracy is of the essence in many such roles. However, the "if" is very important and often ignored. A web developer doesn't need to be interviewed for complex SQL.

kmonsen 6y ago

That’s because evaluating talent is really hard, and this is the best proxy we have. The dirty truth is that promotions are usually even more arbitrary/political. I’ve seen extremely talented people be denied deserved promotions multiple times and people so bad I have no idea why they have a job get promotions. It’s more about who you know than what you bring to the table.

m12k 6y ago

It isn't the best proxy though, because it's not even trying to simulate the conditions under which real work is done (e.g. in real work, most developers will often look up details, and it's no problem at all - what matters is how they think about solving problems, that they know what they don't know, and that they fit into the team). This is a proxy for 'is this an experienced person with a strong memory?', not 'is this a qualified candidate who would perform well in the position?'. It's quite possible to fit the latter and not the former. If you're the size of e.g. Google so your problem is more to weed out bad candidates quickly rather than catch all good candidates, then sure, go ahead (false negatives are acceptable to completely avoid false positives). If you are a smaller company, then your pool of applicants is quite limited and it's much more important that you don't pass over any qualified candidates that happen to fall into your net.

TrackerFF 6y ago

While I agree that testing someone on concepts is more favorable than testing someone on tool specifics, I absolutely think there's merit to (tool) proficiency.

If there's one thing I've noticed among the "10x" coders, then it is that they know their environments and tools like their back pockets.

jleach82 6y ago

This isn't rote memorization at all. They're typical real-world problems to solve, and the point is to ensure the candidate can "think their way through" the problem to come up with a solution: not to come up with a solution using a specific function or keyword.

cozos 6y ago

A good interview would allow you to read SQL documentation to solve this. If you knew it off the top of your head, you'd ace the interview.

If you can't solve the problems after looking it up, then perhaps you are not a SQL expert. I think these sort of questions are valuable.

PopeDotNinja 6y ago

I can't blame someone for wanting to filter for SQL expertise. I don't know if these questions are a good way to do it. Maybe they'd be good bonus questions for when you're trying to differentiate between two good enough applicants.

geekone 6y ago

Do you have suggestions on how to get real with this?

ryanisnan 6y ago

Great article. I can't help but feel like SQL is a poor choice for some of this stuff, though. More often than not, I find it much easier to pull the raw data into memory, and use a higher level language to do these sorts of queries. I am all for knowing the intricacies of SQL, as the cost for not can be very high, but I'm curious for your opinion here.

bdcravens 6y ago

For small datasets, pretty much any approach will work. Once you hit hundreds of millions of records (which isn't even that big of a dataset), SQL still performs well on pretty modest hardware.

yen223 6y ago

What do you mean by SQL?

In my experience, once you're hitting hundreds of millions of records, implementation details of your database engine will start to matter. A database designed for transactional workloads like Postgres will start to choke on aggregate and window functions, often taking minutes to run instead of milliseconds. A columnar database like Redshift (which exposes a SQL interface) will breeze through it without a sweat.

jmiserez 6y ago

The first example "month over month" will never result in a large dataset if you just do count(distinct user_id) ... group by month order by month. There's only max 12 values per year.

That query is much simpler, safer, faster, no self-join or windowing necessary, and you can properly handle missing months in your dataset in your higher-level language (which the provided solution doesn't do BTW).

And you're still getting the performance boost from the DB indexes for grouping and sorting.
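
A sketch of that split, using Python as the higher-level language: SQL (via sqlite3) does only the grouped distinct count, and the host code fills in missing months and computes the change. The `logins` table, the sample rows, and the helper `month_range` are hypothetical; treating a change from zero as undefined (`None`) is one possible convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [
        (1, "2020-01-05"), (2, "2020-01-20"),                     # Jan: 2 users
        # no logins at all in February
        (1, "2020-03-02"), (2, "2020-03-09"), (3, "2020-03-10"),  # Mar: 3 users
        (1, "2020-04-01"), (2, "2020-04-02"), (3, "2020-04-03"),  # Apr: 3 users
    ],
)

# The database only does the cheap part: one row per month that has data.
rows = conn.execute(
    """
    SELECT strftime('%Y-%m', ts) AS month, COUNT(DISTINCT user_id)
    FROM logins
    GROUP BY month
    ORDER BY month
    """
).fetchall()


def month_range(first, last):
    """Yield 'YYYY-MM' strings from first to last, inclusive."""
    year, month = map(int, first.split("-"))
    end = tuple(map(int, last.split("-")))
    while (year, month) <= end:
        yield f"{year:04d}-{month:02d}"
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


counts = dict(rows)
months = list(month_range(rows[0][0], rows[-1][0]))
mau = [counts.get(m, 0) for m in months]  # missing months become 0

# Percent change versus the previous calendar month; undefined when it was 0.
changes = {}
for m, prev, cur in zip(months[1:], mau, mau[1:]):
    changes[m] = None if prev == 0 else 100.0 * (cur - prev) / prev

print(changes)
```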

tragomaskhalos 6y ago

These are interview questions first and foremost, so always going to be in an artificial context. But for real-world work I would use SQL plus the relevant amount of aggregation/joining to pull rows into memory, then do the remaining manipulation in the higher-level lang. "The relevant amount of aggregation/joining" is obviously a moveable judgement call based on experience, but one thing I have found is that the more complex your SQL becomes the easier it is for cardinality errors to creep in, and of course in general you can't easily see the intermediate results, so dropping into a high-level language can also be safer and less error-prone.

karatestomp 6y ago

That's fine for tiny amounts of data, but it's also how you end up with (for example) Ruby taking four minutes to do what a carefully-crafted SQL query can do in four seconds (yes, this stuff happens a lot in the wild)

vajrabum 6y ago

Yah, somebody handed me a ruby script that looped over a query and was causing a whole bunch of other processes to fail because the result wasn't available in a timely way. My boss was impressed by my having reduced the run time from 22 minutes to 5 by pushing that join into the database.

tfehring 6y ago

I work in R regularly and do most of my more complex data manipulation in R whenever I can, but most of these examples are simple enough that I think it makes sense to keep them in SQL. The only ones that struck me as really weird to do in SQL were the cumulative cash flows and the "histograms"/binning one, though the ones requiring window functions might also be slightly easier in R than in SQL.

Also, a lot of these would need to be done in SQL in practice since the data wouldn't fit in memory. Any solution that requires loading the table [login (user_id, timestamp)] into memory probably won't scale very well!

thomzi12OP6y ago

Good q! +1 to the other replies on this comment. I have two points worth mentioning:

(1) The flavor of SQL I use at work supports macros, which are functions that can take in parameters like a function in R/Python might. So, the SQL is "turbo-charged" in that sense and some of the value-added of switching over to Python/R is diminished.

(Big Query has UDFs, which seem similar: https://cloud.google.com/bigquery/docs/reference/standard-sq...)

(2) Like I mentioned in the doc, I personally use the SQL in these practice problems for ETLs on dashboards showing trends. AFAIK, much easier/efficient to write metrics for daily ETLs in SQL than R or Python, especially if these are top-line metrics like MAU.

Besides the many responses about data size, the premise of your question is baffling to me. This is exactly the kind of stuff SQL is for, it's designed specifically for doing this kind of work. Even on a smaller dataset, the "higher level" languages you speak of will certainly be able to produce the same results, but their implementations will almost certainly be more complex, with more points of failure, a great deal less flexible (what if the dataset grows?), and a whole lot more verbose... and all you get for all of those downsides is a more familiar language.

nerdbaggy6y ago

I used to think that too. This is a good talk about how smart SQL query engines are. It is an hour long, though: https://youtu.be/wTPGW1PNy_Y

thomzi12OP6y ago· 9 in thread

Hey, HN! Since I couldn't find a good resource online for the self-join and window function SQL questions I've encountered over the years in interviews, I made my own study guide which I'm now sharing publicly as a Quip doc. Would love your feedback or thoughts!

mobileexpert6y ago

Leetcode has some SQL problems that I’ve found useful in interview prep: https://leetcode.com/problemset/database/

thomzi12OP6y ago

Great resource! I have personally found the top Leetcode SQL questions to be hit-or-miss without curation (too easy, too hard, doesn't cover topics like window functions / self-joins like I would encounter in interviews, etc.) but you're right, it is a great, interactive interview resource

bogomipz6y ago

I enjoyed reading this. My feedback is that although a few of the problems had links to blog posts that provided some analysis of the problem and solution most of these did not. I think if you added a short analysis or explanation of what makes the problem tricky and the thinking behind the solution it would be very beneficial.

beckingz6y ago

For cumulative averages why not use window functions?

thomzi12OP6y ago

Good catch -- I can add that in as an alternate solution. Thanks!

pwesner6y ago

Great resource. Thanks for sharing. Do you have any more resource recommendations besides the ones in the article? Maybe a book?

data4lyfe6y ago

We try to surface the best SQL interview questions at Interview Query (https://www.interviewquery.com/)

rollinDyno6y ago

Hello, how come you chose to use quip over Google docs? I'd expect you to choose the latter since you work for Google.

thomzi12OP6y ago

Yeah, it's because (1) I started this doc over a year ago when I was still at Salesforce (which owns/employees use Quip) and (2) AFAIK Google Docs still doesn't have native code block support :/

deepsun6y ago· 8 in thread

Checked just the first two answers:

1. MoM Percent Change

It's better to use windowing functions, I believe it should be faster than self-join.

2. It seems that the first solution is wrong -- it returns whether "a"-s parent is Root/Inner/Leaf, not "a" itself.

I'd instead add a "has_children" column to the table, and then it would be clear.

Second solution works, but without optimization it's 1 query per row due to nested query -- slow, but not mentioned.

deepsun6y ago

Answer 4 is very very bad, you're doing O(N^2) JOIN. It's not just slow, it will fail on bigger data.

The question just screams for windowing functions, and cumsum is a canonical example for them.

Sorry post author, you'd fail my interview :)
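For readers following along, the window-function version of a cumulative sum can be sketched as below. This is only an illustration: the `transactions` table, its columns and the sample rows are made up, and it runs on SQLite through Python's `sqlite3` module (window functions need SQLite 3.25 or newer).

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (date TEXT, cash_flow INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("2018-01-01", -1000), ("2018-01-02", -100), ("2018-01-03", 50)],
)

# Running total computed in one ordered pass; no self-join involved.
cumulative = conn.execute(
    """
    SELECT date,
           SUM(cash_flow) OVER (ORDER BY date) AS cumulative_cf
    FROM transactions
    ORDER BY date
    """
).fetchall()
```

Each row carries the sum of its own cash flow plus everything dated before it, which is the same result the self-join produces by pairing every row with all earlier rows.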

thomzi12OP6y ago

Thanks for the feedback! The first section is intentionally about self-joins since they get asked about in interviews, but other people have brought up that window functions are more efficient, so I'll add in those solutions as well.

Sorry you feel that way! Thankfully my employer felt differently :)

lonelappde6y ago

Would you really interview and hire for a job based on SQL tricks instead of spending $50 to give someone a reference book?

heed6y ago

Also for Q1 - what if there is a gap of one or more months? It seems we should first generate a month series based on the min and max months, and left join to the series to account for months with zero activity.

thomzi12OP6y ago

Sure! Practically speaking, as a data analyst, I would probably notice a missing month when plotting the trend of MAU over time.

(You can make "but what if a month is missing?" a latter part of a multi-part interview question)

Generally I would assume that data engineers would have a month of no users set to zero or that I could ask them why that's not the case and note that for future reference.

meritt6y ago

> It's better to use windowing functions, I believe it should be faster than self-join.

Your statement is making the assumption you have completely dense data and you can simply offset a number of rows to get the desired denominator. Sparse data is a very common occurrence, and now your MoM/YoY/XoX queries are completely incorrect.

radiowave6y ago

Good point. I often find the easiest approach is to join against a sequence to 'unsparsify' the data, then use the window function. I'd guess this is likely still faster than a self-join, unless the data is very sparse.
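A rough sketch of the "join against a sequence, then use the window function" idea, next to the naive version that goes wrong on sparse data. The `mau` table and its rows are hypothetical, the month series is generated with a recursive CTE, and the code assumes SQLite 3.25+ via Python's `sqlite3`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical monthly-active-users table; March 2020 is missing on purpose.
conn.execute("CREATE TABLE mau (month TEXT, users INTEGER)")
conn.executemany(
    "INSERT INTO mau VALUES (?, ?)",
    [("2020-01-01", 100), ("2020-02-01", 150), ("2020-04-01", 120)],
)

# Naive: LAG over the sparse rows silently compares April with February.
naive = conn.execute(
    "SELECT month, users, LAG(users) OVER (ORDER BY month) FROM mau ORDER BY month"
).fetchall()

# Densify first: build every month between min and max, left join, then LAG.
dense = conn.execute(
    """
    WITH RECURSIVE bounds(lo, hi) AS (
        SELECT MIN(month), MAX(month) FROM mau
    ),
    months(month) AS (
        SELECT lo FROM bounds
        UNION ALL
        SELECT date(months.month, '+1 month')
        FROM months, bounds
        WHERE months.month < bounds.hi
    ),
    filled AS (
        SELECT months.month AS month, COALESCE(mau.users, 0) AS users
        FROM months
        LEFT JOIN mau ON mau.month = months.month
    )
    SELECT month, users, LAG(users) OVER (ORDER BY month) AS prev_users
    FROM filled
    ORDER BY month
    """
).fetchall()
```

In the densified result March shows up with zero users, so April is compared against March rather than February.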

ScottWhigham6y ago

Yep. Same here. I stopped reading after the second solution. I liked the questions though - just maybe the answers were too myopic.

gtrubetskoy6y ago· 8 in thread

One problem with this article is the number of times the solution involves COUNT(DISTINCT).

One of the best SQL interview questions is "Explain what is wrong with DISTINCT and how to work around it".

0az6y ago

What is wrong with DISTINCT?

barbegal6y ago

DISTINCT generally requires the results to be sorted which has O(n^2) worst performance so it can have a big performance hit on a query. It is best to make your database structure such that queries only return distinct data. E.g. by disallowing duplicates

marcosdumay6y ago

Nearly every time, it's a symptom of bad data normalization.

But every time, it interferes badly with any kind of locking (that's DBMS dependent, of course), and imposes a high performance penalty (on every DBMS).

barrkel6y ago

In order to determine the distinct items, the items need to be deduplicated. Generally that's done in only two ways: a hash table that skips items already seen, or a sort followed by a scan that skips over duplicates. The hash table is O(1), but the sort is easier to make parallel without sharing mutable state and has more established algorithms to use when spilling to disk.
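The two deduplication strategies described above can be sketched in a few lines of Python. This is a toy illustration of the idea, not how any particular database implements DISTINCT; the function names are made up.

```python
def distinct_hash(items):
    """Hash-based dedup: skip anything already seen; keeps first-seen order."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def distinct_sort(items):
    """Sort-based dedup: sort, then scan and skip adjacent duplicates."""
    out = []
    for item in sorted(items):
        if not out or out[-1] != item:
            out.append(item)
    return out
```

For example, `distinct_hash([3, 1, 3, 2, 1])` gives `[3, 1, 2]` while `distinct_sort([3, 1, 3, 2, 1])` gives `[1, 2, 3]`: the same set of values, with the sort-based version also leaving them ordered.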

GlennS6y ago

It covers up bad queries, so you may not see an underlying data duplication problem.

Often better to group explicitly so you know what's actually going on.

cultofmetatron6y ago

seriously.. I'm building out some functionality using plpgsql and have used it. This is going to be haunting my dreams

yread6y ago

Huh? If you have a table with attributeid, sampleid and value how would you count how many samples have a value in any attribute? Exists subquery?

matwood6y ago

In your example you must also have another table, 'sample' with all the samples. So yes, you would use an exists or in subquery with the table you suggested.
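To make the comparison concrete, here is a small sketch of both forms on made-up data. The `sample` and `attribute_value` tables are assumptions based on the columns mentioned above, and "has a value" is taken to mean a non-NULL `value`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema based on the columns named in the thread.
conn.execute("CREATE TABLE sample (sampleid INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE attribute_value (attributeid INTEGER, sampleid INTEGER, value TEXT)"
)
conn.executemany("INSERT INTO sample VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO attribute_value VALUES (?, ?, ?)",
    [(10, 1, "a"), (11, 1, "b"), (10, 2, None), (11, 2, "c")],
)

# COUNT(DISTINCT ...) over the attribute table alone.
count_distinct = conn.execute(
    "SELECT COUNT(DISTINCT sampleid) FROM attribute_value WHERE value IS NOT NULL"
).fetchone()[0]

# EXISTS subquery driven from the sample table.
count_exists = conn.execute(
    """
    SELECT COUNT(*)
    FROM sample s
    WHERE EXISTS (
        SELECT 1 FROM attribute_value av
        WHERE av.sampleid = s.sampleid AND av.value IS NOT NULL
    )
    """
).fetchone()[0]
```

Both return 2 here (samples 1 and 2); which one performs better will depend on the engine, the indexes and the data.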

fnord776y ago· 7 in thread

I think for a lot of people, SQL is a skill that doesn't stick.

You learn enough to do the queries you need for your project, they work then you forget about them as you work on the rest of your project. These skills are perishable. Left outer join? Yeah, I knew what that was some time ago, but not anymore

The days of dedicated SQL programmers are mostly gone.

matwood6y ago

I disagree. I've been programming for over 20 years, and SQL has been one of the only constants used at every job. I'm surprised how many other programmers I run into who don't know SQL or are even scared of learning it. Which is a shame because data will almost always outlive the program(s).

giantDinosaur6y ago

I used to think this was true, back in the days when I used 'Group by' mindlessly and fiddled with the syntax dumbly until I got the right answer. Once you spend some time thinking about the concepts of it all you should do much better at remembering and if you totally forget the SQL syntax that doesn't really matter (Modulo the hideously annoying random differences between the databases themselves, but that's true of anything.)

jleach826y ago

I completely disagree. A good full-stack dev knows SQL pretty well: at least well enough to be able to work through the general questions in the document. Good SQL skills are essential, and the lack of good SQL developers I would attribute mostly to ORMs that abstract it away so mediocre developers can mostly ignore it (which makes the job a whole lot harder for people that actually DO know how a relational database works, and give it the respect it deserves)

doppel6y ago

I really don't believe that to be true. For our application(s) at work, the database is always the hardest part to scale and therefore also the biggest performance bottleneck, and the difference between good and bad SQL can be orders of magnitude in performance impact (on the query itself and locking up the DB for other queries).

Only in a "Every database has 2 TB of memory and 64 cores"-world is SQL (and database design) a negligible skill.

xeromal6y ago

I'm sure data scientists and such probably use SQL at least somewhat regularly.

vangelis6y ago

What I forget about SQL is the little inconsistencies between platforms. It's a standard with a dozen carve outs. Do I use TOP or LIMIT? What's the wildcard syntax?

thomzi12OP6y ago

Many data analysts use exclusively SQL and R/Python

namdnay6y ago· 7 in thread

Very interesting. I never really "got" declarative languages, I remember a very long time ago I was working with Oracle and you could see the "execution plan" for your SQL queries. I kept wondering "why can't I build my queries directly with this?" - it seems so much simpler to my brain than SQL itself.

> I never really "got" declarative languages

As I responded to another comment, I have to wonder if that's because you haven't spent the time to become proficient enough to appreciate them. I understand that a lot of programmers decry "magic" and want to get at the underlying steps and components that produce a given result, but literally all of us work at some level of abstraction above that because it would be impossible not to. None of us are directly transforming sand with electricity into computed values.

You're describing being more comfortable with an imperative set of statements than a declarative description of results. Assuming you write automated tests, think of the declarative approach as writing the description (rather than the implementation) of a test, except that the description is less freeform and must conform to a certain syntax and structure so that a machine can write the test for you.

It's "magic", but it's only so much magic because the scope of what it can do for you is limited and well defined.

wruza6y ago

Assuming you write automated tests, think of the declarative approach as writing the description (rather than the implementation) of a test

Except that you know/expect how exactly it will be executed and if it doesn’t, maybe test should not really pass. SQL in general is full of this “we write declarative queries expecting this exact imperative result and investigating if it’s not”. It’s much like an interview question: it may have many different answers, but you must provide the one that satisfies a grumpy plan builder guy. I know sql and use it when it’s shorter/more descriptive, but sometimes you just want to take them rows and do them things, without leaving a database process and its guarantees.

Not much db experience, but the fact that such powerful engines (i.e. acid, indexing, good ods, pooling, etc) are always hidden beyond some cryptic uppercase frontend with a weak execution layer always bothered me. Just give me that postgres-grade index scan directly in C, dammit. /rant (inb4 just write some parts as a sp in a language of choice)

greggyb6y ago

You can. First, find or implement a library in your language of choice to store indexed data on disk. Then implement some very hairy logic to make sure your read and write operations are ACID. Then choose an algorithm for the query you'd like to write. And then write it.

I don't intend to be glib, but rather to illustrate some of the benefits of a database platform like modern RDBMSes. These things are all implemented and battle-tested for you.

If you would prefer a more fine-grained control over execution of queries, you could probably get very far by starting with the SQLite code base and working at a more primitive level through its library.

TurkTurkleton6y ago

You can't write query plans directly because you would have to manually take into account a number of factors that the query planner considers for you automatically, such as size of the table, statistics on the distribution of values, whether there are indexes that could be used to speed the query, and so on. Some SQL dialects (like Microsoft's T-SQL) do give you some ability to influence the decisions the query planner makes, though, like forcing it to use specific indexes, or forcing it to use scans or seeks.

perl4ever6y ago

As a practical matter, when you can't control the optimizer or affect how the DBAs configure things, you break your huge query into multiple ones with temporary tables.

This is from the perspective of trying to get queries to run in 5 minutes instead of 30 minutes instead of hours or days or forever, not brief transactions measured in milliseconds. And it's not something I figured out on my own, but by paying attention to the guy who never talked but was consistently 10x faster in producing reports than anyone.

The thing you should not do, that I also saw people do, is use procedural PL/SQL or T-SQL to process things in a loop - that can be orders of magnitude slower.

marcosdumay6y ago

And then comes Oracle and insists on applying equality filters first, because, duh, they are fast, and never take any of those other details into consideration.

Honestly, Postgres does it right - you can enforce your query plan to any level of detail you want.

keeperofdakeys6y ago

Data changes, so the best query plan changes with it. SQL was built to handle this data-dependent environment.

Is one column composed of mostly one or two values? Then an index lookup on that column is not very optimal, and the database can use something else.

There is more than one type of join (INNER, LEFT OUTER, etc), and more than one join algorithm (nested loop, merge, hash, etc). All these change based on the data. Even the join order can have a huge impact on query time, and needs to adapt based on the number of rows you'll pull from each table.

A lot of SQL queries are built from templates, or built by ORMs. Optimisations are critical to turn these templates queries into something efficient.

SQL can also be very expressive, a "NOT EXISTS (SELECT 1 FROM thing WHERE foobar)" could be more readable than doing a join and where clause.

Though interestingly, you get a much more declarative query style with nosql databases and key-value stores. So there are alternatives out there.
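As an illustration of the NOT EXISTS point, the sketch below writes the same "rows with no match" question both ways. The `customer` and `orders` tables are hypothetical, and the code runs on SQLite through Python's `sqlite3`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables, for illustration only.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")]
)
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (1,), (3,)])

# Declarative form: customers for which no order exists.
not_exists = conn.execute(
    """
    SELECT c.name
    FROM customer c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
    """
).fetchall()

# Equivalent join-and-filter form (an anti-join written by hand).
anti_join = conn.execute(
    """
    SELECT c.name
    FROM customer c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.customer_id IS NULL
    ORDER BY c.name
    """
).fetchall()
```

Both queries return only `bob`; the planner is free to pick the join algorithm for either, which is the data-dependent part the comment describes.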

mosburger6y ago· 6 in thread

Worth noting that this isn't all ANSI-SQL... e.g. I'm pretty sure WITH is a Postgres thing?

greggyb6y ago

CTEs are implemented by most (all?) major RDBMS platforms and were introduced in the SQL:1999 standard revision.

- SQL:1999 https://en.wikipedia.org/wiki/SQL:1999#Common_table_expressi...

- SQL standardization https://en.wikipedia.org/wiki/SQL#Interoperability_and_stand...

combatentropy6y ago

The WITH clause, otherwise known as Common Table Expressions, is in ANSI SQL99. Common table expressions are supported by all of the major databases: PostgreSQL, Microsoft, Oracle, MySQL, MariaDB, and SQLite. They are also in certain minor ones, like Teradata, DB2, Firebird, and HyperSQL. Recursive WITH clauses are especially useful.

matwood6y ago

Worth noting that MySQL only got CTE's in 8.0 while something like MSSQL has had them since ~2005. The reason this is important is that something like AWS Aurora only supports up to MySQL 5.7, and thus no CTEs. :/

Minor49er6y ago

Not anymore. Other implementations, like MariaDB and SQLite, have adopted common table expressions. They're basically syntactic sugar for subqueries in most implementations, but they can make some queries much more readable

maxlamb6y ago

the WITH clause is now in SQL Server and Oracle

bdcravens6y ago

I don't work with Oracle, but SQL Server has had CTE's since 2005.

oyoun6y ago· 5 in thread

I think SQL is a language to be known by every programmer. With the right query, you may solve a problem that may take 100 lines in other languages.

It is so useful, reliable and does not change every year.

TrackerFF6y ago

With the extreme popularity of pandas, I think a lot of Python programmers would be amazed at how clean and easy-to-read SQL queries look, compared to the (downright) mess that's being written to query pandas dataframes.

bradleyjg6y ago

Depends on the transformation. Dropping a couple of columns in a wide dataset is extraordinarily ugly in sql vs anything else, for example.

thomzi12OP6y ago

Agreed! Pandas is great but at this point instead of using logic to solve data structuring tasks I often find myself googling for an optimized built-in Pandas method to help me out. Leads to less elegant code -- not sure if it is less readable though.

Gatsky6y ago

Well R has packages that buy you similar functionality eg dplyr.

alexilliamson6y ago

Dplyr is IMO the best thing to come from R. The design is so much simpler than pandas and the operations mirror SQL operations so closely.

sk5t6y ago· 5 in thread

I'd ding the over-use of CTEs, when subselects are often more appropriate and better-performing. Kind of a "every problem a nail" thing going on here.

meritt6y ago

Most people have no idea what an optimization fence is and opt for CTEs because they yield "cleaner" queries, despite nearly always killing performance.

I've always found explicit temporary tables, where I can add indexes, are often a great solution for performance and readability.

combatentropy6y ago

Do you mean subselects in the SELECT clause or in the FROM? Usually subselects in the SELECT clause have been the source of dogslowness. Reorganizing them into the FROM clause sped up queries 1,000 times. And I thought CTEs were just a way of moving subselects from the midst of the FROM clause to the top of the statement, where they're easier to understand.

sk5t6y ago

In the FROM clause -- although I understand future versions of pgsql will address this, the "WITH foo AS ... " approach can be horribly wasteful, materializing all the results, whereas "SELECT * FROM ( SELECT f(x) AS a FROM baz ) qq limit 1" lets the optimizer do its thing.

matwood6y ago

I mostly agree. I'd start with the simplest possible query and move from there. I enjoy SQL, but once I start querying large datasets or doing anything remotely complex, the 'right' solution is the one that is fastest. And fast always depends. Typically I have to write up a solution a few different ways, test and tweak.

At least on recent versions of Postgres, CTEs are often equal in performance to subqueries (if they're side-effect free and non-recursive they get inlined).

Lightbody6y ago· 4 in thread

I think these are great. But I think there should be some representation around locking / concurrency / deadlock topics. Those tend to be the hardest because you can’t clearly recreate the right/wrong answer in a local test environment. Speaking as a person who waited far too long in his career to fully appreciate these topics, I wish I had been pushed to learn them much earlier.

jandrewrogers6y ago

Behaviors around locking and concurrency are not standard. Not only do they vary widely across RDBMS implementations, they also vary depending on how you configure a specific RDBMS implementation in many cases.

Some databases use lock structures that automatically detect and resolve deadlocks, so from a user standpoint there are no deadlocks but deadlock resolution has visible side effects that are implementation defined.

hobs6y ago

Yeah but the functions and style used in these answers vary across implementations (date_trunc, etc.)

I highly recommend the SQL Cookbook for newer SQL dev as a quick reference comparison to see how often this is the case for even trivial problems in any RDBMS.

Also, for your language I wouldn't expect all mid tier SQL devs to be able to write a recursive CTE from memory (though its useful, you can just look it up again), but something like breaking apart a query plan on your platform of choice is way more important as that muscle allows you to tell if your CTE was crap or not.

bigtechdataeng6y ago

I suspect more SQL is written to support analytics than online systems. Locking and concurrency are used in a specific type of application, namely oltp.

I learned how much when I joined a big tech company. The devs don’t write sql unless processing logs in the warehouse. But everyone from PM to support to data science and marketing all write sql.

thomzi12OP6y ago

To your point, I personally write SQL for analytics and product/business purposes, not system monitoring

S_A_P6y ago· 4 in thread

I flip back and forth between deep diving in (my case) SQL Server skills and .NET Manipulation. In the world I live, it makes the most sense to do set based manipulation in SQL and logical entity based logic in C#. I work in a unique enterprise niche that has about 4 options based on either Java or .NET. SQL knowledge definitely gives you a leg up for complex reporting, and there are cases where I love being able to debug super quickly when comparing inputs to outputs. However, when I run into a SQL script that is 5000+ lines long and have to debug it, I much prefer the .NET side of the fence. Should someone ever come up with a bridge that gives you .NET level of visibility into the active datasets in a SQL query I would pay them 4 figures without question...

yellowapple6y ago

I take it you've already looked into CLR stored procedures? I haven't used 'em in a whole lot of depth, but they do seem to at least give some of the tools for a "best of both worlds" approach.

beckingz6y ago

5000 + line SQL scripts????

What would one of these do?

S_A_P6y ago

Legacy code from Powerbuilder days. But it does things like calculate month over month inventory, Mark to Market value, Risk, or things related to Energy and commodities trading.

vsareto6y ago

The ones I've seen handle tons and tons of edge cases or do extra validation. The SQL programming languages are pretty verbose in terms of syntax.

cameronh906y ago· 4 in thread

Out of curiosity: what is the use case for SQL window functions in application programming? Unlike most SQL, it doesn’t seem to reduce the order of the data coming back from the server, nor do anything especially faster than can be done on the client - and has the disadvantage of extra load on the database (which is harder to scale).

Is it only useful for ad hoc/analytical queries, or am I missing something?

dntbnmpls6y ago

It allows you to "group by/aggregate/etc" on a row by row basis rather than once over the entire table. So while GROUP BY defines a set to aggregate over the entire table (and hence you get 1 row back per group), window functions create a criterion by which each row gets a set (aka window) to aggregate over for that particular row. In addition, you can define a "frame" over that "window" for even more refined aggregation. I suppose window functions are used more in the OLAP space than OLTP, but they are a very useful part of SQL. In my opinion, window functions are the most impressive and important part of the language.

People can define, describe and provide examples but with SQL, it won't sink in until you try it out yourself and have the "Ah ha" moment. If you have a database server and some test data, try writing a few window function queries or try some online examples.
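For anyone who wants something to try, here is a minimal sketch contrasting GROUP BY with a window and a frame. The `sales` table and its rows are invented for the example, and it assumes SQLite 3.25+ through Python's `sqlite3`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data, for illustration only.
conn.execute("CREATE TABLE sales (category TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("a", 1, 10), ("a", 2, 20), ("a", 3, 30), ("b", 1, 5)],
)

# GROUP BY collapses each category to a single row.
grouped = conn.execute(
    "SELECT category, SUM(amount) FROM sales GROUP BY category ORDER BY category"
).fetchall()

# Window functions keep every row and attach an aggregate to each one.
# The ROWS BETWEEN clause is the frame: here, the previous row and this one.
windowed = conn.execute(
    """
    SELECT category, day, amount,
           SUM(amount) OVER (PARTITION BY category) AS category_total,
           AVG(amount) OVER (
               PARTITION BY category
               ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS moving_avg
    FROM sales
    ORDER BY category, day
    """
).fetchall()
```

`grouped` has two rows, one per category, while `windowed` still has all four rows, each with its category total and a two-row moving average.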

kthejoker26y ago

Why do you think it's not "especially faster" than doing it on the client?

The data is already sorted, partitioned, in memory .. adding a moving average calc or max value per category is certainly faster than fetching the set to disk, recreating it in an intermediate structure and then calculating the new value with all that partitioning, sorting, etc. to be redone client side.

You can certainly use them in app dev for things like:

* figuring out how many things are ahead of you in a queue
* snapshotting (what's changed since the last time you were here)
* comparing something to some other sample for outliers or inconsistencies

These may be quasi analytical still, but ultimately they can manifest as properties of an object model like any other property to be developed against.

cameronh906y ago

I suppose it depends on the language. Normally I’m writing in either Java, C#, Rust or C - and in those I’ve never found that having the DB run a moving average to be any faster, provided the result comes back in the right order. Indeed, if I need to use another column in the result set, the deserialisation overhead means it’s normally slower. For reference I normally write financial time series processing code which is where window functions ostensibly could be quite useful.

However certainly if your app is written in Python or Ruby I can see there being a big difference.

The reason I mention analytics is more to do with the ease of scaling out the DB than the nature of the queries. In busy OLTP databases usually one wants to keep the work off the database because they’re difficult and expensive to scale.

etiennebch6y ago

I use them for cursor based pagination. You can use rank and a where clause to apply the limit. You can also compute max rank to determine whether there is more data to fetch after the limit. Finally, if you do left joins where you may have multiple rows per object, you use dense rank instead of rank
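One possible reading of that pagination approach, sketched on SQLite through Python's `sqlite3`. The `items` table, the `fetch_page` helper and the use of `id` as the cursor column are all assumptions made for the example, not details from the comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with ids 1..7.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(1, 8)])

PAGE_SQL = """
WITH ranked AS (
    -- rank the rows that come after the cursor
    SELECT id, RANK() OVER (ORDER BY id) AS rnk
    FROM items
    WHERE id > :cursor
),
with_max AS (
    -- max rank over the whole remainder tells us how much is left
    SELECT id, rnk, MAX(rnk) OVER () AS max_rnk
    FROM ranked
)
SELECT id, max_rnk
FROM with_max
WHERE rnk <= :page_size
ORDER BY rnk
"""


def fetch_page(conn, cursor, page_size):
    """Return (ids on this page, whether more rows remain after it)."""
    rows = conn.execute(
        PAGE_SQL, {"cursor": cursor, "page_size": page_size}
    ).fetchall()
    ids = [row[0] for row in rows]
    has_more = bool(rows) and rows[0][1] > page_size
    return ids, has_more
```

Calling `fetch_page(conn, 2, 3)` returns ids 3 to 5 and reports more data; the last id on the page then becomes the next cursor.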

xupybd6y ago· 3 in thread

For me the hardest part of the first question is understanding the acronyms. I think MoM is month on month. But MAU, no idea.

thomzi12OP6y ago

Hey! MAU = Monthly Active Users. Sorry about that, I think this could be described as data analyst/analytics jargon. Good feedback to spell these out, thanks!

ziftface6y ago

I agree, it almost seems they were put there to add artificial noise to the problem. It doesn't seem like you're selecting for the right skills here.

thomzi12OP6y ago

Sorry about that -- in real life you would just ask the interviewer what MAU meant if they didn't spell it out.

The purpose was to make the questions more realistic, since at least in my experience in data analyst interviews the questions are asked in the context of actual business or product situations ... like company leaders, PMs, or others wanting to understand trends in MAU.

zozbot2346y ago· 2 in thread

No questions/examples featuring recursive CTE's? They tend to come up in anything involving queries over trees or graphs. They're also a relatively new feature where having some examples to show how they work may be quite helpful.
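As a small example of the kind of query meant here, the sketch below walks a tree with a recursive CTE. The `node` table and its rows are hypothetical, and the code runs on SQLite through Python's `sqlite3`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical adjacency-list tree: 1 is the root, 2 and 3 its children, 4 under 2.
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.executemany(
    "INSERT INTO node VALUES (?, ?)", [(1, None), (2, 1), (3, 1), (4, 2)]
)

subtree = conn.execute(
    """
    WITH RECURSIVE subtree(id, depth) AS (
        -- anchor: the starting node
        SELECT id, 0 FROM node WHERE id = ?
        UNION ALL
        -- recursive step: children of anything already found
        SELECT node.id, subtree.depth + 1
        FROM node
        JOIN subtree ON node.parent_id = subtree.id
    )
    SELECT id, depth FROM subtree ORDER BY depth, id
    """,
    (1,),
).fetchall()
```

The anchor row seeds the result and the recursive step keeps adding children until no new rows appear, so `subtree` lists every descendant of node 1 with its depth.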

magicalhippo6y ago

Took me some time to wrap my head around writing a recursive query. Also had to study the docs a while to get "merge into" to work right, for doing combined insert/update.

hotsauceror6y ago

One of the first tasks I give associate DBAs is writing a recursive CTE to get a blocking chain. It’s an interesting and useful exercise.

vasilakisfil6y ago· 2 in thread

I always thought that I suck in SQL but if these are medium to hard then I am not that bad actually.

chucky_z6y ago

SQL for Smarties is the book that has proven, without a doubt, I am bad at SQL. https://www.amazon.com/Joe-Celkos-SQL-Smarties-Programming/d...

If you want to get undeniably good with SQL, this is the book.

taberiand6y ago

These are easy enough to do in your head.

Also I would do a lot of them differently. And I don't think the answer to #6 is valid (in T-SQL at least).

hotsauceror6y ago· 2 in thread

On RDBMS platforms without function indexes, some of these queries will force a row-by-row execution over your entire table. Enjoy running SELECT ... FROM a INNER JOIN b ON DATEPART(month, a.date) = b.whatever on a table with 500 million rows in it.

paulryanrogers6y ago

Thankfully even MySQL has them, at least as of version 8

hotsauceror6y ago

Yeah, I’m jealous of all the postgresql and MySQL people. SQL Server doesn’t have them and this kind of query would cause pain.

https://github.com/timescale/timescaledb/issues/271

snidane6y ago· 2 in thread

SQL is better than 99% of the nosql alternatives out there.

But one place where it falls apart is these time series data processing tasks.

It's because of its model of unordered sets (or multisets to be more precise, but still unordered). When you look at the majority of those queries and other real life queries they always involve the dimension of time. Well - then that means we have a natural order of the data - why not use an array data structure instead of this unordered bag and throw the sort order out of the window.

SQL realized this and bolted window functions on top of the original relational model. But you still feel its inadequacy when trying to do simple things such as (x join x on x.t=x.t-1) or even the infamous asof joins where you do (x join y on x.t <= y.t).

In array databases with implicit time order both examples are a trivial linear zip operation on the sorted tables.

In traditional set oriented SQL it results in a horrible O(n^2) cross join with a filter or, in the better case of an equijoin, in a sort-merge join, which still has to do two expensive sorts on my massive time series tables and then perform the same trivial linear zip that the array processor would do. Which is also completely useless on unbounded streams.

Also many stackoverflow answers suggest to use binary blobs for time series data and process them locally in your favorite programming language - which points at wrong conceptual model of SQL engines.

Is SQL really so inadequate for real life data problems or have I been living under a rock and Kdb is no longer the only option to do these trivial time oriented tasks?
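To show what the "trivial linear zip" looks like, here is a toy as-of join in plain Python over two lists already sorted by time. It is a sketch under assumptions: the function name and the trade/quote framing are invented, and it uses the common backward convention (latest right-hand row at or before each left-hand row).

```python
def asof_join(left, right):
    """As-of join of two lists of (t, value) tuples, both sorted by t.

    For each left row, attach the value of the latest right row whose
    t is <= the left row's t (None if there is no such row).
    Runs in O(len(left) + len(right)) because both cursors only move forward.
    """
    out = []
    j = -1  # index of the latest right row known to be <= current left t
    for t, left_value in left:
        while j + 1 < len(right) and right[j + 1][0] <= t:
            j += 1
        out.append((t, left_value, right[j][1] if j >= 0 else None))
    return out


# Hypothetical inputs: trades and quotes keyed by an integer timestamp.
trades = [(1, "t1"), (5, "t2"), (9, "t3")]
quotes = [(2, 100), (4, 101), (8, 102)]
joined = asof_join(trades, quotes)
```

Because both inputs are already ordered, each list is scanned once; there is no sort and no cross join, which is the contrast with the set-oriented plan being drawn above.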

3dbrows6y ago

Have you seen TimescaleDB? https://www.timescale.com/

snidane6y ago

I've seen it.

Most of these 'time series databases' are for processing of structured logs and metrics to be plugged into your favorite system for monitoring site latency.

Asof join is still an open issue in their issue tracker so it is not usable as a time series database.

https://www.postgresql.org/docs/current/tablefunc.html

iblaine6y ago· 2 in thread

>After an interview in 2017 went poorly — mostly due to me floundering at the more difficult SQL questions they asked me

Could be you dodged a bullet. A company with advanced interview questions may have some ugly SQL code. For jobs that lean heavily on SQL, I expect candidates to know things like windowing & dynamic variables, in SQL & an ORM library. For SWE's, I feel basic SQL is fine.

redis_mlc6y ago

I agree. Using advanced SQL as a hiring signal would be silly in that it just discards a lot of programming candidates for little benefit.

I'd be happy if they knew what EXPLAIN was, since that impacts production on a daily basis.

Also, the OP is basically "fighting the last battle" again. Most interviews don't filter out candidates based on SQL.

Source: DBA.

jerglingu6y ago

Yes! Window functions and a decent demonstration of datasets with varying granularity, on top of the basic join/aggregation assessment, is all that is needed in the SQL portion of interviews. And if you're an engineer, attention to redefining the query plan through appropriate use of temp tables. Anything else is a poor use of precious interview time for both parties, in my opinion.

ollien6y ago· 2 in thread

This is neat, but it would be really helpful if these examples included some kind of sample output of what was expected.

thomzi12OP6y ago

Some of them have sample output but yes, it would be helpful to extend this to all examples. Good feedback!

toohotatopic6y ago

Then you can expect your candidates to exactly provide that answer.

pier256y ago· 2 in thread

Off topic but... does anyone know what they used for the rich text editor?

It uses React but I imagine there is some other library like ProseMirror here.

WoahNoun6y ago

It's likely all/mostly custom. Quip is Salesforce's Word/Docs competitor. (It's pretty good, tbh.)

pier256y ago

It rendered that doc pretty fast although editing is not enabled.

I've never used Office online but Google docs is terrible performance wise.

gigatexal6y ago· 2 in thread

Yup those questions are a stretch for me too! Love that though.

Where are the pivot and unpivot questions?

thomzi12OP6y ago

Hmm, haven't seen those a ton at work / in interviews! Would love to see some examples though if you have any.

Mode's SQL tutorial uses SUM(CASE ...) and CROSS JOIN to mimic pivots: https://mode.com/sql-tutorial/sql-pivot-table/
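
A minimal sketch of the SUM(CASE ...) pivot technique, run through Python's built-in sqlite3 module so it is self-contained. The `sales` table and its columns are hypothetical; the same SQL shape should carry over to other engines, though that is an assumption rather than something tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER);  -- hypothetical table
    INSERT INTO sales VALUES
        ('east', 'Q1', 10), ('east', 'Q2', 20),
        ('west', 'Q1', 5),  ('west', 'Q1', 7), ('west', 'Q2', 1);
""")

# One SUM(CASE ...) per output column turns quarter values into columns.
pivot_rows = conn.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
print(pivot_rows)  # [('east', 10, 20), ('west', 12, 1)]
```

The obvious limitation is that the output columns have to be listed by hand, which is what dedicated PIVOT syntax or crosstab helpers avoid.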

petepete6y ago

I think they're specific to Oracle and SQL Server. It's a pity because they can be very useful.

PostgreSQL can do it via the tablefunc extension and there's `\crosstabview` which is built into psql.

https://www.postgresql.org/docs/12/app-psql.html#APP-PSQL-ME...

arh686y ago· 1 in thread

For the second one, it seems most natural to reach for exists, or something (I have not tried this code..)

  select
    t.node
    , (case when t.parent is null then 'Root'
            when exists (
              select * from tree c
              -- use t.node: a bare "node" here would bind to c.node
              where c.parent = t.node
            ) then 'Inner'
            else 'Leaf'
      end) "label"
  from tree t

EDIT: also, in the fourth, it seems like you'd want to partition the window function, who cares about order. Something like

  sum(cash_flow) over (partition by date) "cumulative_cf"

combatentropy6y ago

A subquery in the from-clause is usually lighter than one in the select-clause. I just so happen to have a table with this structure and hundreds of thousands of rows. This way runs in half the time. It is surprising, in fact, that the difference isn't wider, because Postgres's explain-command says it costs 1/100 as much.

  select
      t.node,
      case
          when t.parent is null then 'Root'
          when p.parent is null then 'Leaf'
          else 'Inner'
      end as label
  from tree t
      left join (select distinct parent from tree) p on t.node = p.parent
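
A small check, assuming a hypothetical four-node `tree` table in SQLite, that an EXISTS-based labelling and a LEFT JOIN-based labelling give the same Root/Inner/Leaf result. It says nothing about relative speed, which depends on the engine and the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tree (node INTEGER, parent INTEGER);  -- hypothetical sample data
    INSERT INTO tree VALUES (1, NULL), (2, 1), (3, 1), (4, 2);
""")

# Correlated EXISTS: a node is Inner if some row names it as parent.
exists_labels = conn.execute("""
    SELECT t.node,
           CASE WHEN t.parent IS NULL THEN 'Root'
                WHEN EXISTS (SELECT 1 FROM tree c WHERE c.parent = t.node) THEN 'Inner'
                ELSE 'Leaf'
           END AS label
    FROM tree t
    ORDER BY t.node
""").fetchall()

# LEFT JOIN against the distinct set of parents: no match means Leaf.
join_labels = conn.execute("""
    SELECT t.node,
           CASE WHEN t.parent IS NULL THEN 'Root'
                WHEN p.parent IS NULL THEN 'Leaf'
                ELSE 'Inner'
           END AS label
    FROM tree t
    LEFT JOIN (SELECT DISTINCT parent FROM tree) p ON t.node = p.parent
    ORDER BY t.node
""").fetchall()

print(exists_labels)
print(join_labels)
```

Note that the NULL parent in the distinct subquery never matches any node in the join condition, so it does not produce spurious rows.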

ineedasername6y ago· 1 in thread

I think I'd be able to do all of these in my daily work, probably not as efficiently, with minor references to syntax guides (I don't use window functions often enough).

In an interview, presumably my logic would, hopefully, shine through minor issues of syntax.

Where would that put me? Maybe "okay to decent" when dealing with "medium-hard" questions?

I would fail utterly at DBA management SQL and stored procedures, my responsibilities skew towards data analysis.

dmurray6y ago

Yeah, I'd be something similar and also wouldn't consider myself a database expert. I think this question list is tilted more towards "can you express this complicated abstract transformation in legal SQL" than towards the practical day-to-day considerations of building and working with databases.

userbinator6y ago· 1 in thread

A bit off-topic, but this is another one of those sites that show nothing without JS, and looking at the source reveals something a little more amusing than usual: content that's full of nullness.

mrspeaker6y ago

At least they do it with a bit of flair though: the thin blue "loading" bar works without JS!

beckingz6y ago· 1 in thread

For creating a table of dates, what are our thoughts on:

select * from (select adddate('1970-01-01',t4.i10000 + t3.i1000 + t2.i100 + t1.i10 + t0.i) selected_date from (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t0, (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t1, (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t2, (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t3, (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t4) v where selected_date between '2016-01-01' and now()

This works in MySQL / MariaDB.

klysm6y ago

Oh yeah that’s super readable
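
For comparison, a minimal sketch of generating a date table with a recursive CTE instead of cross-joined digit tables. It runs on SQLite through Python's sqlite3 module and uses SQLite's `date(day, '+1 day')` for date arithmetic; other engines that support `WITH RECURSIVE` would need their own date functions, and the short date range is only an illustrative assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The anchor row seeds the series; the recursive step adds one day until the end date.
dates = [row[0] for row in conn.execute("""
    WITH RECURSIVE d(day) AS (
        SELECT '2016-01-01'
        UNION ALL
        SELECT date(day, '+1 day') FROM d WHERE day < '2016-01-05'
    )
    SELECT day FROM d
""")]
print(dates)
```

ISO-formatted date strings sort lexicographically, which is why the plain string comparison in the WHERE clause is enough to stop the recursion.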

cryptozeus6y ago· 1 in thread

It would be helpful to show output result for each.

thomzi12OP6y ago

Yep, Ollien made a similar point. Currently some of the questions have expected output but not all.

anonfunction6y ago· 1 in thread

The first solution for MAU has the wrong sign for the percentage change column:

Previous MAU: 1000, Current MAU: 2000, Percent Change: -100

thomzi12OP6y ago

Thanks, anonfunction! I've corrected it
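
A minimal sketch of the sign convention, assuming percent change is defined as 100 * (current - previous) / previous, so going from 1000 to 2000 users is +100. It uses LAG in SQLite (window functions need SQLite 3.25 or later) on a hypothetical pre-aggregated `mau` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mau (month TEXT, users INTEGER);  -- hypothetical monthly totals
    INSERT INTO mau VALUES ('2020-01', 1000), ('2020-02', 2000), ('2020-03', 1500);
""")

# LAG(users) is the previous month's value; it is NULL for the first month.
changes = conn.execute("""
    SELECT month,
           users,
           100.0 * (users - LAG(users) OVER (ORDER BY month))
                 / LAG(users) OVER (ORDER BY month) AS pct_change
    FROM mau
    ORDER BY month
""").fetchall()
print(changes)
```

The first month has no predecessor, so its percent change comes back as NULL (None in Python) rather than zero.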

ridaj6y ago

The first few answers are unidiomatic where I work. Analytical functions would be vastly preferable to self joins, especially in the case of the join with an inequality that is proposed to compute running totals, which I assume would have horrible performance on large datasets
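
A minimal sketch comparing the two ways of computing a running total: a window function and a self-join on an inequality. The `cash` table is hypothetical and the example runs on SQLite 3.25 or later; it only shows that the results agree, not how the plans scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cash (day TEXT, cash_flow INTEGER);  -- hypothetical daily cash flows
    INSERT INTO cash VALUES ('2020-01-01', 10), ('2020-01-02', -4), ('2020-01-03', 7);
""")

# Window version: one ordered pass, summing everything up to the current row.
window_totals = conn.execute("""
    SELECT day, SUM(cash_flow) OVER (ORDER BY day) AS cumulative_cf
    FROM cash
    ORDER BY day
""").fetchall()

# Self-join version: every row is paired with all earlier rows, then aggregated.
join_totals = conn.execute("""
    SELECT a.day, SUM(b.cash_flow) AS cumulative_cf
    FROM cash a
    JOIN cash b ON b.day <= a.day
    GROUP BY a.day
    ORDER BY a.day
""").fetchall()

print(window_totals)
print(join_totals)
```

The self-join materializes roughly n*(n+1)/2 row pairs before grouping, which is where the concern about large datasets comes from.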

nogabebop236y ago

The problem with hard SQL problems is that they often fall into one of two camps:

1. use some underlying implementation detail of the particular RDBMS or proprietary extension to the standard

2. are essentially tricks, like the typical ridiculous interview problem designed to "make you think outside the box". Yes, you can do almost anything in SQL but often you should not.

I get the perspective here is data analysis where you probably need to know more SQL than the standard developer, but I still feel you should be testing for solid understanding of basics, understanding of the relational algebra concepts being used and awareness of some typical gotchas or bad smells. That's it. They'll be able to google the precise syntax for that tricky by date query when you're not guaranteed to have data for every month or whatever on-demand.

nurettin6y ago

This is more like "we want to make sure you can use recursive CTE" questions. To add some variety to medium-"hard" SQL questions you could add some lateral joins, window functions (especially lag if you want to get creative) and compound logic statements in where clauses.

dzonga6y ago

Some of the problems with SQL come from it being written to solve problems when hardware was expensive: BCNF and all the normal forms, etc. When doing analytics you want a flat table, that's it. And when working with a flat table for analytics, there are other tools better for analysis than SQL, e.g. pandas, or the SQL-like languages used by column databases. Once you have a flat table, you no longer have to do joins etc.

ojr6y ago

I’ve done my fair share of complex SQL queries and complex data migrations; asking about JOIN during a random interview is unfair unless documentation is allowed.

sadhana12346y ago

goood one

dang6y ago

We've changed the URL from https://gstudent.quip.com/2gwZArKuWk7W to what that redirects to.

287 comments

195 comments · 37 top-level

minimaxir6y ago· 25 in thread

That said, as an IRL data scientist, the amount of times I've had to do a SQL self-join in the past few years can be counted on one hand. And the BETWEEN ROW syntax for a window function.

steelframe6y ago

ses19846y ago

If you do get an offer, and choose not to take it, that's dodging a bullet.

If you don't get the offer there's no bullet to dodge, right?

</pedandtry>

geebee6y ago

"mocked me".

That's really reprehensible.

One point I've made repeatedly here on HN is that technical "interviews" have morphed into a form of entrance exam with very little oversight.

One basic tenet here is that you don't "mock" a candidate. Seriously, wow.

thomzi12OP6y ago

Btw, are you minimaxir who wrote gpt-2-simple? I was looking at your tutorial a month ago while putting together a solution for the Kaggle COVID-19 NLP challenge!

minimaxir6y ago

yep! :) Glad you made use of it!

lonelappde6y ago

Why would you ask interview questions to quiz on skills that can be learned in a week?

This is only useful if you tell candidates to study a book for a week first.

Otherwise you are filtering for narrowminded memorizers, not smart people who can learn and solve problems

onemoresoop6y ago

There's no shame in knowing what it is, or the theory behind a concept, but not knowing the syntax, especially if the interview is not for a DBA. I'd be honest and ask to look it up online or move on.

truculent6y ago

I don’t think it’s fair to say these examples are know-it-or-not.

xivzgrev6y ago

Then you’ve only encountered shitty interviewers. The point isn’t if you know some syntax, that can be easily taught / googled. The point is how you think about the problem.

In the case of percent change in MAU, you don’t need to know how to do a self join, exactly. But rather to identify that you would want to do a self join and on what conditions would be the key

WoahNoun6y ago

vsareto6y ago

If you had a pseudo code for sets, joins, etc, it'd probably be using more mathematical symbols.

megaframe6y ago

I think it depends on your data / query patterns. Self Inner/Left/Right joins are one of my most common queries. I often need to chain joins in various ways to get the desired output.

crazygringo6y ago

jlj6y ago

> A weakness of these types of SQL questions however is that it's near impossible for the interviewer to provide help/guidance

I'm always up-front with the candidate about not looking for perfect syntax, and I am more interested in their problem solving and collaboration skills than the actual SQL they write.

noelsusman6y ago

That's pretty funny. I've been doing data science for ten years or so and I've never heard of BETWEEN ROW before.

kyberias6y ago

As if "data science" would be a field where the full scope of SQL is covered. SQL was developed for the enterprise world.

WoahNoun6y ago

I use self-joins and windows constantly. It's way easier to have our massive redshift cluster do those calculations in parallel than to try and do it in python on a VM

minimaxir6y ago

mycall6y ago

The problem with SQL windowing functions (OVER ORDER BY, LAG/LEAD, RANK, etc) is that they are nondeterministic.

revscat6y ago· 19 in thread

gigatexal6y ago

Where you see ugliness I see beauty. I guess it is just what clicks for someone and what doesn’t. But together our strengths make for one really good programmer hence why I like working on teams.

fiddlerwoaroof6y ago

Scene_Cast26y ago

With SQL as an analysis / insights tool (as opposed to prod or dashboard use), answering one (nontrivial) question leads to follow-up questions that need a lot of query restructuring.

codeulike6y ago

I'd wager that if you re-wrote those snippets as Ruby or Scala operating against something tabular like CSV files, with all the joins and aggregates and so on done in code, it would look uglier.

* ok sometimes sql (thinking specifically of SELECT statements) gets code-y, e.g. with inline formulas. But generally it's more on the specification side.

BoysenberryPi6y ago

>SQL never elicits a response beyond “the task is finished and it does what I want”...

And for what it's worth, I've seen quite a few senior level SQL developers write some pythonic SQL queries that would put my regular Python programming to shame.

wruza6y ago

perl4ever6y ago

Although it may be a matter of opinion, I find a query that runs one or more orders of magnitude faster than another query to be a thing of beauty.

barrkel6y ago

If you think in terms of relational algebra, and forget about the SQL syntax, it's easier to see the beauty.

There's precious little difference between

    set.select { |x| x.a > 10 }
      .group_by { |x| x.c }
      .map { |c, xs| xs.length }

and

    select count(*)
    from set
    where a > 10
    group by c

except the latter has more scope for optimization.

emmelaich6y ago

This is key! Programmers can easily tie themselves in knots answering these questions with a 'normal' programming language.

Whether or not you're actually writing SQL, it's a good conceptual approach.

combatentropy6y ago

If it helps, you don't really have to type SQL key words in all capitals ;) I stopped years ago.

The ironic thing is that the way developers like to deal with data today is more like how they did in the early days of COBOL, to which SQL was an improvement.

Chesterton's Fence comes to mind when watching programmers meet SQL:

(The rest of your comment is very compelling, just have a nitpick.)

> If it helps, you don't really have to type SQL key words in all capitals ;) I stopped years ago.

If I were writing and reading SQL under better circumstances, it's quite likely I would have different preferences.

WrtCdEvrydy6y ago

SQL is an abstraction and I wish there was something on top of it supporting loops but that's why we have proper programming languages...

https://github.com/fishtown-analytics/dbt

barrkel6y ago

The trouble is that any loop you might write may be looping over the wrong thing.

tfehring6y ago

You can write loops in (the procedural extensions of) SQL, but IME it's very rare that you'd want to.

rpedela6y ago

perl4ever6y ago

I have rewritten a T-SQL loop that called a query each time through with one query that was a thousand times faster. Seriously.

te_chris6y ago

Recursive CTEs in Postgres?

danbmil996y ago· 12 in thread

This is idiotic. Why in the world would testing for rote memorization of something anyone can look up easily be a reasonable filter for talent and experience in a programming role?

Why not ask about obscure regex expressions? Better yet, how about baseball scores? Hair bands from the 80s?

It's time for the valley to get real about how to judge the merit of applicants. The present state of affairs in tech recruiting is a joke.

sgustard6y ago

mellow20206y ago

> there are sadly oceans of coders who, say, reach for 100 lines of buggy slow Java code to solve something that could be done in 3 lines of SQL

TheBobinator6y ago

Third, "any competent person" is just belittling a skillset that, like programming, can take years or decades to truly understand.

foxyv6y ago

I agree with most of what you are saying. But sometimes it's better to go with Java code instead of creating 3 line SQL commands.

raz32dust6y ago

kmonsen6y ago

m12k6y ago

TrackerFF6y ago

While I agree that testing someone on concepts is more favorable than testing someone on tool specifics, I absolutely think there's merit to (tool) proficiency.

If there's one thing I've noticed among the "10x" coders, then it is that they know their environments and tools like their back pockets.

jleach826y ago

cozos6y ago

A good interview would allow you to read SQL documentation to solve this. If you knew it off the top of your head, you'd ace the interview.

If you can't solve the problems after looking it up, then perhaps you are not a SQL expert. I think these sort of questions are valuable.

PopeDotNinja6y ago

geekone6y ago

Do you have suggestions on how to get real with this?

ryanisnan6y ago· 10 in thread

bdcravens6y ago

For small datasets, pretty much any approach will work. Once you hit hundreds of millions of records (which isn't even that big of a dataset), SQL still performs well on pretty modest hardware.

yen2236y ago

What do you mean by SQL?

jmiserez6y ago

The first example "month over month" will never result in a large dataset if you just do count(distinct user_id) ... group by month order by month. There are at most 12 values per year.

And you're still getting the performance boost from the DB indexes for grouping and sorting.

tragomaskhalos6y ago

karatestomp6y ago

vajrabum6y ago

tfehring6y ago

thomzi12OP6y ago

Good q! +1 to the other replies on this comment. I have two points worth mentioning:

(Big Query has UDFs, which seem similar: https://cloud.google.com/bigquery/docs/reference/standard-sq...)

nerdbaggy6y ago

I used to think that too. This is a good talk about how smart SQL query engines are. It is an hour long, though: https://youtu.be/wTPGW1PNy_Y

thomzi12OP6y ago· 9 in thread

mobileexpert6y ago

Leetcode has some SQL problems that I’ve found useful in interview prep: https://leetcode.com/problemset/database/

thomzi12OP6y ago

bogomipz6y ago

beckingz6y ago

For cumulative averages why not use window functions?

thomzi12OP6y ago

Good catch -- I can add that in as an alternate solution. Thanks!

pwesner6y ago

Great resource. Thanks for sharing. Do you have any more resource recommendations besides the ones in the article? Maybe a book?

data4lyfe6y ago

We try to surface the best SQL interview questions at Interview Query (https://www.interviewquery.com/)

rollinDyno6y ago

Hello, how come you chose to use quip over Google docs? I'd expect you to choose the latter since you work for Google.

thomzi12OP6y ago

Yeah, it's because (1) I started this doc over a year ago when I was still at Salesforce (which owns/employees use Quip) and (2) AFAIK Google Docs still doesn't have native code block support :/

deepsun6y ago· 8 in thread

Checked just the first two answers:

1. MoM Percent Change

It's better to use windowing functions, I believe it should be faster than self-join.

2. It seems that the first solution is wrong -- it returns whether "a"-s parent is Root/Inner/Leaf, not "a" itself.

I'd instead add a "has_children" column to the table, and then it would be clear.

Second solution works, but without optimization it's 1 query per row due to nested query -- slow, but not mentioned.

deepsun6y ago

Answer 4 is very very bad, you're doing O(N^2) JOIN. It's not just slow, it will fail on bigger data.

The question just screams for windowing functions, and cumsum is a canonical example for them.

Sorry post author, you'd fail my interview :)

thomzi12OP6y ago

Sorry you feel that way! Thankfully my employer felt differently :)

lonelappde6y ago

Would you really interview and hire for a job based on SQL tricks instead of spending $50 to give someone a reference book?

heed6y ago

thomzi12OP6y ago

Sure! Practically speaking, as a data analyst, I would probably notice a missing month when plotting the trend of MAU over time.

(You can make "but what if a month is missing?" a latter part of a multi-part interview question)

Generally I would assume that data engineers would have a month of no users set to zero or that I could ask them why that's not the case and note that for future reference.

meritt6y ago

> It's better to use windowing functions, I believe it should be faster than self-join.

radiowave6y ago

ScottWhigham6y ago

Yep. Same here. I stopped reading after the second solution. I liked the questions though - just maybe the answers were too myopic.

gtrubetskoy6y ago· 8 in thread

One problem with this article is the number of times the solution involves COUNT(DISTINCT).

One of the best SQL interview questions is "Explain what is wrong with DISTINCT and how to work around it".

0az6y ago

What is wrong with DISTINCT?

barbegal6y ago

marcosdumay6y ago

Nearly every time, it's a symptom of bad data normalization.

But every time, it interferes badly with any kind of locking (that's DBMS dependent, of course), and imposes a high performance penalty (on every DBMS).

barrkel6y ago

GlennS6y ago

It covers up bad queries, so you may not see an underlying data duplication problem.

Often better to group explicitly so you know what's actually going on.

cultofmetatron6y ago

seriously.. I'm building out some functionality using plpgsql and have used it. This is going to be haunting my dreams

yread6y ago

Huh? If you have a table with attributeid, sampleid and value how would you count how many samples have a value in any attribute? Exists subquery?

matwood6y ago

In your example you must also have another table, 'sample' with all the samples. So yes, you would use an exists or in subquery with the table you suggested.
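
A minimal sketch of the two approaches being discussed, assuming a hypothetical `sample` table alongside a `measurement` table shaped like (attributeid, sampleid, value). Both queries count samples that have at least one non-NULL value; the example runs on SQLite and makes no claim about which plan is faster on a given engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (sampleid INTEGER PRIMARY KEY);                           -- hypothetical
    CREATE TABLE measurement (attributeid INTEGER, sampleid INTEGER, value REAL); -- hypothetical
    INSERT INTO sample VALUES (1), (2), (3);
    INSERT INTO measurement VALUES (10, 1, 0.5), (11, 1, 0.7), (10, 2, NULL), (11, 3, 1.2);
""")

# COUNT(DISTINCT ...) de-duplicates sample ids after scanning the measurements.
distinct_count = conn.execute("""
    SELECT COUNT(DISTINCT sampleid) FROM measurement WHERE value IS NOT NULL
""").fetchone()[0]

# EXISTS asks, once per sample, whether any qualifying measurement is present.
exists_count = conn.execute("""
    SELECT COUNT(*)
    FROM sample s
    WHERE EXISTS (SELECT 1 FROM measurement m
                  WHERE m.sampleid = s.sampleid AND m.value IS NOT NULL)
""").fetchone()[0]

print(distinct_count, exists_count)
```

Sample 1 has two values but is counted once by both queries, and sample 2 is excluded because its only value is NULL.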

fnord776y ago· 7 in thread

I think for a lot of people, SQL is a skill that doesn't stick.

The days of dedicated SQL programmers are mostly gone.

matwood6y ago

giantDinosaur6y ago

jleach826y ago

doppel6y ago

Only in a "Every database has 2 TB of memory and 64 cores"-world is SQL (and database design) a negligible skill.

xeromal6y ago

I'm sure data scientists and such probably use SQL at least somewhat regularly.

vangelis6y ago

What I forget about SQL is the little inconsistencies between platforms. It's a standard with a dozen carve outs. Do I use TOP or LIMIT? What's the wildcard syntax?

thomzi12OP6y ago

Many data analysts use exclusively SQL and R/Python

namdnay6y ago· 7 in thread

> I never really "got" declarative languages

It's "magic", but it's only so much magic because the scope of what it can do for you is limited and well defined.

wruza6y ago

Assuming you write automated tests, think of the declarative approach as writing the description (rather than the implementation) of a test

greggyb6y ago

I don't intend to be glib, but rather to illustrate some of the benefits of a database platform like modern RDBMSes. These things are all implemented and battle-tested for you.

TurkTurkleton6y ago

perl4ever6y ago

As a practical matter, when you can't control the optimizer or affect how the DBAs configure things, you break your huge query into multiple ones with temporary tables.

The thing you should not do, that I also saw people do, is use procedural PL/SQL or T-SQL to process things in a loop - that can be orders of magnitude slower.

marcosdumay6y ago

And then comes Oracle and insists on applying equality filters first, because, duh, they are fast, and never take any of those other details into consideration.

Honestly, Postgres does it right - you can enforce your query plan to any level of detail you want.

keeperofdakeys6y ago

Data changes, so the best query plan changes with it. SQL was built to handle this data-dependent environment.

Is one column composed of mostly one or two values? Then an index lookup on that column is not very optimal, and the database can use something else.

A lot of SQL queries are built from templates, or built by ORMs. Optimisations are critical to turn these templates queries into something efficient.

SQL can also be very expressive, a "NOT EXISTS (SELECT 1 FROM thing WHERE foobar)" could be more readable than doing a join and where clause.

Though interestingly, you get a much more declarative query style with nosql databases and key-value stores. So there are alternatives out there.
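
A minimal sketch of the readability point about NOT EXISTS, set next to the equivalent "join and where clause" form. The `customer` and `orders` tables are hypothetical, and the example runs on SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);        -- hypothetical
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER); -- hypothetical
    INSERT INTO customer VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO orders VALUES (100, 1), (101, 1), (102, 3);
""")

# Reads almost like the requirement: customers for whom no order exists.
not_exists = conn.execute("""
    SELECT c.name
    FROM customer c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
""").fetchall()

# Same result via an outer join, keeping only rows where the join found nothing.
left_join = conn.execute("""
    SELECT c.name
    FROM customer c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.id IS NULL
    ORDER BY c.name
""").fetchall()

print(not_exists, left_join)
```

Whether the two forms end up with the same plan is up to the optimizer, which is the data-dependent behaviour described above.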

mosburger6y ago· 6 in thread

Worth noting that this isn't all ANSI-SQL... e.g. I'm pretty sure WITH is a Postgres thing?

greggyb6y ago

CTEs are implemented by most (all?) major RDBMS platforms and were introduced in the SQL:1999 standard revision.

- SQL:1999 https://en.wikipedia.org/wiki/SQL:1999#Common_table_expressi...

- SQL standardization https://en.wikipedia.org/wiki/SQL#Interoperability_and_stand...

combatentropy6y ago

matwood6y ago

Minor49er6y ago

maxlamb6y ago

the WITH clause is now in SQL Server and Oracle

bdcravens6y ago

I don't work with Oracle, but SQL Server has had CTE's since 2005.

oyoun6y ago· 5 in thread

I think SQL is a language to be known by every programmer. With the right query, you may solve a problem that may take 100 lines in other languages.

It is so useful and reliable, and it does not change every year.

TrackerFF6y ago

bradleyjg6y ago

Depends on the transformation. Dropping a couple of columns in a wide dataset is extraordinarily ugly in sql vs anything else, for example.

thomzi12OP6y ago

Gatsky6y ago

Well R has packages that buy you similar functionality eg dplyr.

alexilliamson6y ago

Dplyr is IMO the best thing to come from R. The design is so much simpler than pandas and the operations mirror SQL operations so closely.

sk5t6y ago· 5 in thread

I'd ding the over-use of CTEs, when subselects are often more appropriate and better-performing. Kind of a "every problem a nail" thing going on here.

meritt6y ago

Most people have no idea what an optimization fence is and opt for CTEs because they yield "cleaner" queries, despite nearly always killing performance.

I've always found explicit temporary tables, where I can add indexes, are often a great solution for performance and readability.

combatentropy6y ago

sk5t6y ago

matwood6y ago

At least on recent versions of Postgres (12 and later), CTEs are often equal in performance to subqueries (if they're side-effect free, non-recursive, and referenced only once, they get inlined).

Lightbody6y ago· 4 in thread

jandrewrogers6y ago

hobs6y ago

Yeah but the functions and style used in these answers vary across implementations (date_trunc, etc.)

I highly recommend the SQL Cookbook for newer SQL dev as a quick reference comparison to see how often this is the case for even trivial problems in any RDBMS.

bigtechdataeng6y ago

I suspect more SQL is written to support analytics than online systems. Locking and concurrency are used in a specific type of application, namely oltp.

I learned how much when I joined a big tech company. The devs don’t write sql unless processing logs in the warehouse. But everyone from PM to support to data science and marketing all write sql.

thomzi12OP6y ago

To your point, I personally write SQL for analytics and product/business purposes, not system monitoring

S_A_P6y ago· 4 in thread

yellowapple6y ago

I take it you've already looked into CLR stored procedures? I haven't used 'em in a whole lot of depth, but they do seem to at least give some of the tools for a "best of both worlds" approach.

beckingz6y ago

5000 + line SQL scripts????

What would one of these do?

S_A_P6y ago

Legacy code from Powerbuilder days. But it does things like calculate month over month inventory, Mark to Market value, Risk, or things related to Energy and commodities trading.

vsareto6y ago

The ones I've seen handle tons and tons of edge cases or do extra validation. The SQL programming languages are pretty verbose in terms of syntax.

cameronh906y ago· 4 in thread

Is it only useful for ad hoc/analytical queries, or am I missing something?

dntbnmpls6y ago

kthejoker26y ago

Why do you think it's not "especially faster" than doing it on the client?

You can certainly use them in app dev for things like:

* figuring out how many things are ahead of you in a queue
* snapshotting (what's changed since the last time you were here)
* comparing something to some other sample for outliers or inconsistencies

These may be quasi analytical still, but ultimately they can manifest as properties of an object model like any other property to be developed against.
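
A minimal sketch of the first use case, "how many things are ahead of you in a queue", using ROW_NUMBER in SQLite (3.25 or later). The `queue` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE queue (ticket TEXT, enqueued_at INTEGER);  -- hypothetical queue table
    INSERT INTO queue VALUES ('t1', 100), ('t2', 105), ('t3', 103);
""")

# ROW_NUMBER() numbers rows from 1 in enqueue order, so subtracting 1
# gives the count of tickets ahead of each one.
positions = conn.execute("""
    SELECT ticket,
           ROW_NUMBER() OVER (ORDER BY enqueued_at) - 1 AS ahead
    FROM queue
    ORDER BY enqueued_at
""").fetchall()
print(positions)  # [('t1', 0), ('t3', 1), ('t2', 2)]
```

The `ahead` value can then be exposed as an ordinary property on whatever object the application maps a queue entry to.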

cameronh906y ago

However certainly if your app is written in Python or Ruby I can see there being a big difference.

etiennebch6y ago

xupybd6y ago· 3 in thread

For me the hardest part of the first question is understanding the acronyms. I think MoM is month on month. But MAU, no idea.

thomzi12OP6y ago

Hey! MAU = Monthly Active Users. Sorry about that, I think this could be described as data analyst/analytics jargon. Good feedback to spell these out, thanks!

ziftface6y ago

I agree, it almost seems they were put there to add artificial noise to the problem. It doesn't seem like you're selecting for the right skills here.

thomzi12OP6y ago

Sorry about that -- in real life you would just ask the interviewer what MAU meant if they didn't spell it out.

zozbot2346y ago· 2 in thread

magicalhippo6y ago

Took me some time to wrap my head around writing a recursive query. Also had to study the docs a while to get "merge into" to work right, for doing combined insert/update.

hotsauceror6y ago

One of the first tasks I give associate DBAs is writing a recursive CTE to get a blocking chain. It’s an interesting and useful exercise.

vasilakisfil6y ago· 2 in thread

I always thought that I suck in SQL but if these are medium to hard then I am not that bad actually.

chucky_z6y ago

SQL for Smarties is the book that has proven, without a doubt, I am bad at SQL. https://www.amazon.com/Joe-Celkos-SQL-Smarties-Programming/d...

If you want to get undeniably good with SQL, this is the book.

taberiand6y ago

These are easy enough to do in your head.

Also I would do a lot of them differently. And I don't think the answer to #6 is valid (in T-SQL at least).

hotsauceror6y ago· 2 in thread

paulryanrogers6y ago

Thankfully even MySQL has them, at least as of version 8

hotsauceror6y ago

Yeah, I’m jealous of all the postgresql and MySQL people. SQL Server doesn’t have them and this kind of query would cause pain.

https://github.com/timescale/timescaledb/issues/271
