SQL Guide for Data Science Interivews

Overview
As a data scientist, you will be working with data a great deal of your time and so will have to write many queries to retrieve it and derive meaningful information from it. SQL is one of the core skills to query data at any scale.
General Workflow
In most workflows, quick slicing and dicing of data in SQL comprises the first step in analytics, and hence being able to efficiently write basic SQL queries is a crucial skill. Although many think that SQL simply involves using SELECTs and JOINs, powerful SQL workflows include many other extremely useful operators. For example, utilizing subqueries allows manipulation of subsets of data thereby allowing other operations to be performed later on in an analysis, and window functions allow data to be cut without using a GROUP BY to explicitly combine rows. The questions asked during interviews involving SQL are usually quite practical to the company at hand. For instance, a company such as Facebook might ask about user or app analytics, whereas one such as Amazon might ask about products and purchases instead.
Basic commands
Here are some basic but important SQL commands you should definitely know before your interview. We’ll cover SQL syntax and code later on with examples.
- CREATE TABLE: Creates a table in a relational database and, depending on what database you use (e.g., MySQL), can also be used to define the table’s schema.
- INSERT: Inserts a row or a set of rows at one time into a given table.
- UPDATE: Modifies already-existing data.
- DELETE: Removes a row (or a group of rows) from a database.
- SELECT: Selects certain columns from a table, a part of most queries.
- INNER JOIN: Combines multiple tables and will preserve the rows where column values match in the tables being combined. The word INNER is optional and is rarely used.
- LEFT OUTER JOIN: Combines multiple tables by matching on the column names provided, while preserving all the rows from the first table of the join.
- FULL OUTER JOIN: Combines multiple tables by matching on the columns provided, while preserving all rows. Note: The word OUTER is optional.
- GROUP BY: Groups/aggregates rows having the specified contents of a specific column or set of columns.
- WHERE: Provides a condition on which to filter before any grouping is applied.
- HAVING: Provides a condition on which to filter after any grouping is applied.
- ORDER BY: Sorts results in ascending or descending order according to the contents of a specific column or set of columns.
- DISTINCT: Returns only distinct values.
In practice, interviewers typically assess that you can formulate the correct query to use when asked a simple question (often concerning a business use case).
Useful Tidbits
Generally, SQL queries involve the following concepts:
Aggregation: Uses COUNT, SUM, or any aggregation function over any set of rows:
SELECT COUNT(*) FROM users ...
Joins: Frequently used in queries involving multiple tables. For example, the following query obtains the total time spent by username:
SELECT users.name, SUM(sessions.time_spent) AS total_time_spent FROM users u JOIN sessions s ON u.user_id = s.user_id GROUP BY 1
Filtering: SQL contains various ways to compare rows, the most common of which use = and <> (not equal), >, and <, along with regex and other types of logical and filtering clauses such as OR and AND. For example, the example below uses LIKE to match user names on regex:
SELECT * FROM users WHERE username <> 'unknown' OR username LIKE '%123%'
Common Table Expressions (CTEs) and Subqueries:
CTEs define a query and then allow it to be referenced later using an alias. They provide a very useful means of breaking up large queries into more manageable pieces and are also useful when the various queries being made follow a natural order. For example, below is a CTE which gets a subset of users, prior to another analysis being done for those same users:
WITH (SELECT * FROM users WHERE join_date > '2020-01-01') AS u SELECT u.name, SUM(sessions.time_spent) AS total_time_spent FROM u JOIN sessions s ON u.user_id = s.user_id GROUP BY 1
Subqueries are inline in the query itself and can be used only once:
SELECT * FROM users AS u JOIN (SELECT user_id, time_spent FROM sessions WHERE session_date > '2020-01-01') AS s ON u.user_id = s.user_id
Query questions usually involve a set of tables relevant to the company. For example, at Facebook, such questions would revolve around querying user and app usage data, whereas, at Uber, they would be about rider and driver data.
Window Functions
Window functions perform calculations across a set of rows (much like an aggregation function) but do not group those rows (as aggregation functions do). Instead, they involve not only an aggregation function but also an OVER clause and allow the window to be narrowed by using PARTITION BY:
SELECT *, SUM(time_spent) OVER
(PARTITION BY user_id ORDER BY join_time ASC)
FROM sessions
This particular example yields total time spent at the row level (without a GROUP BY per user_id being needed) within session data.
Note that window functions and aggregation functions cannot be used in the same query. Both, however, are handy in various scenarios. You'll want to involve window functions whenever aggregation is needed, but with the rows maintained.
Sample Interview Questions
Easy:
- Assume you have an events table (app_id, event_id, timestamp) on app analytics. Write a query to get the click-through rate per app in 2020.
- Assume you are given a transactions table (order_id, user_id, product_id, spend, transaction_date) for spending activity by product type. Write a query to calculate the cumulative spending for each product over time in chronological order.
- Assume you are given a table of transactions (user_id, spend, transaction_date) from users. Write a query to obtain the third purchase of every user.
Medium:
- Assume you have a table of user logins (user_id, login_date) containing information on user logins. Write a query to obtain the number of reactivated users (i.e., those who didn't log in the previous month, who then logged-in during the current month).
- Assume you are given a table user sessions (session_id, start_time, end_time). A session is considered to be concurrent with another user’s session if they overlap. Write a query to obtain the user session that is concurrent with the largest number of other user sessions.
- Assume you are given the table below containing information on user transactions (transaction_id, user_id, amount, transaction_date) for a particular business. Write a query to obtain the account’s rolling 7-day balance.
Sample Selected Solutions
Easy:
First, we obtain the purchase sequence numbers for each user. We can do this by using the ROW_NUMBER window function, where we PARTITION by the user_id and ORDER by the transaction_date fields, calling the resulting field a purchase number. From there, we can simply take all purchases having a purchase number equal to 3.
WITH nums AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_date) AS purchase_num FROM transactions ) SELECT * FROM nums WHERE purchase_num = 3
Medium:
First, we need to obtain the total daily transactions using a simple SUM and GROUP BY operation. Having the daily transactions, we then perform a self join on the table using the condition that the transaction date for one transaction occurs within 7 days of the other, which we can check by using the DATE_ADD function along with a condition that the earlier date doesn't precede the later date:
WITH daily_transactions AS ( SELECT transaction_date, SUM(amount) AS total_amount FROM user_transactions GROUP BY transaction_date) SELECT t2.transaction_date SUM(t1.amount) AS weekly_rolling_total FROM daily_transactions t1 INNER JOIN daily_transactions t2 ON t1.transaction_date > DATE_ADD('DAY', -7, t2.transaction_date) AND t1.transaction_date <= t2.transaction_date GROUP BY t2.transaction_date ORDER BY t2.transaction_date ASC
Are you interviewing for data science jobs or are you trying to hone your data science skills? Check out our newsletter, Data Science Prep, to get data science interview questions straight to your inbox.