SQL Guide for Data Science Interivews

Data Science Prep, the best interview prep for data science jobs.

Overview

As a data scientist, you will be working with data a great deal of your time and so will have to write many queries to retrieve it and derive meaningful information from it. SQL is one of the core skills to query data at any scale.

General Workflow

In most workflows, quick slicing and dicing of data in SQL comprises the first step in analytics, and hence being able to efficiently write basic SQL queries is a crucial skill. Although many think that SQL simply involves using SELECTs and JOINs, powerful SQL workflows include many other extremely useful operators. For example, utilizing subqueries allows manipulation of subsets of data thereby allowing other operations to be performed later on in an analysis, and window functions allow data to be cut without using a GROUP BY to explicitly combine rows. The questions asked during interviews involving SQL are usually quite practical to the company at hand. For instance, a company such as Facebook might ask about user or app analytics, whereas one such as Amazon might ask about products and purchases instead.

Basic commands

Here are some basic but important SQL commands you should definitely know before your interview. We’ll cover SQL syntax and code later on with examples.

  • CREATE TABLE: Creates a table in a relational database and, depending on what database you use (e.g., MySQL), can also be used to define the table’s schema.
  • INSERT: Inserts a row or a set of rows at one time into a given table.
  • UPDATE: Modifies already-existing data.
  • DELETE: Removes a row (or a group of rows) from a database.
  • SELECT: Selects certain columns from a table, a part of most queries.
  • INNER JOIN: Combines multiple tables and will preserve the rows where column values match in the tables being combined. The word INNER is optional and is rarely used.
  • LEFT OUTER JOIN: Combines multiple tables by matching on the column names provided, while preserving all the rows from the first table of the join.
  • FULL OUTER JOIN: Combines multiple tables by matching on the columns provided, while preserving all rows. Note: The word OUTER is optional.
  • GROUP BY: Groups/aggregates rows having the specified contents of a specific column or set of columns.
  • WHERE: Provides a condition on which to filter before any grouping is applied.
  • HAVING: Provides a condition on which to filter after any grouping is applied.
  • ORDER BY: Sorts results in ascending or descending order according to the contents of a specific column or set of columns.
  • DISTINCT: Returns only distinct values.

In practice, interviewers typically assess that you can formulate the correct query to use when asked a simple question (often concerning a business use case).

Useful Tidbits

Generally, SQL queries involve the following concepts:

  1. Aggregation: Uses COUNT, SUM, or any aggregation function over any set of rows:

     SELECT COUNT(*) FROM users ...
  2. Joins: Frequently used in queries involving multiple tables. For example, the following query obtains the total time spent by username:

     SELECT users.name, SUM(sessions.time_spent) AS total_time_spent
     FROM users u JOIN sessions s ON u.user_id = s.user_id
     GROUP BY 1
  3. Filtering: SQL contains various ways to compare rows, the most common of which use = and <> (not equal), >, and <, along with regex and other types of logical and filtering clauses such as OR and AND. For example, the example below uses LIKE to match user names on regex:

     SELECT * FROM users
     WHERE username <> 'unknown' OR username LIKE '%123%'
  4. Common Table Expressions (CTEs) and Subqueries:

    CTEs define a query and then allow it to be referenced later using an alias. They provide a very useful means of breaking up large queries into more manageable pieces and are also useful when the various queries being made follow a natural order. For example, below is a CTE which gets a subset of users, prior to another analysis being done for those same users:

     WITH (SELECT * FROM users WHERE join_date > '2020-01-01') AS u
    
     SELECT u.name, SUM(sessions.time_spent) AS total_time_spent
     FROM u JOIN sessions s ON u.user_id = s.user_id
     GROUP BY 1

    Subqueries are inline in the query itself and can be used only once:

     SELECT * FROM users AS u JOIN
     (SELECT user_id, time_spent FROM sessions
         WHERE session_date > '2020-01-01') AS s
     ON u.user_id = s.user_id

Query questions usually involve a set of tables relevant to the company. For example, at Facebook, such questions would revolve around querying user and app usage data, whereas, at Uber, they would be about rider and driver data.

Window Functions

Window functions perform calculations across a set of rows (much like an aggregation function) but do not group those rows (as aggregation functions do). Instead, they involve not only an aggregation function but also an OVER clause and allow the window to be narrowed by using PARTITION BY:

SELECT *, SUM(time_spent) OVER
(PARTITION BY user_id ORDER BY join_time ASC)
FROM sessions

This particular example yields total time spent at the row level (without a GROUP BY per user_id being needed) within session data.

Note that window functions and aggregation functions cannot be used in the same query. Both, however, are handy in various scenarios. You'll want to involve window functions whenever aggregation is needed, but with the rows maintained.

Sample Interview Questions

Easy:

  1. Assume you have an events table (app_id, event_id, timestamp) on app analytics. Write a query to get the click-through rate per app in 2020.
  2. Assume you are given a transactions table (order_id, user_id, product_id, spend, transaction_date) for spending activity by product type. Write a query to calculate the cumulative spending for each product over time in chronological order.
  3. Assume you are given a table of transactions (user_id, spend, transaction_date) from users. Write a query to obtain the third purchase of every user.

Medium:

  1. Assume you have a table of user logins (user_id, login_date) containing information on user logins. Write a query to obtain the number of reactivated users (i.e., those who didn't log in the previous month, who then logged-in during the current month).
  2. Assume you are given a table user sessions (session_id, start_time, end_time). A session is considered to be concurrent with another user’s session if they overlap. Write a query to obtain the user session that is concurrent with the largest number of other user sessions.
  3. Assume you are given the table below containing information on user transactions (transaction_id, user_id, amount, transaction_date) for a particular business. Write a query to obtain the account’s rolling 7-day balance.

Sample Selected Solutions

Easy:

  1. First, we obtain the purchase sequence numbers for each user. We can do this by using the ROW_NUMBER window function, where we PARTITION by the user_id and ORDER by the transaction_date fields, calling the resulting field a purchase number. From there, we can simply take all purchases having a purchase number equal to 3.

     WITH nums AS (
         SELECT *,
             ROW_NUMBER() OVER (PARTITION BY user_id
               ORDER BY transaction_date)
                   AS purchase_num
         FROM transactions
     )
    
     SELECT *
     FROM nums
     WHERE purchase_num = 3

Medium:

  1. First, we need to obtain the total daily transactions using a simple SUM and GROUP BY operation. Having the daily transactions, we then perform a self join on the table using the condition that the transaction date for one transaction occurs within 7 days of the other, which we can check by using the DATE_ADD function along with a condition that the earlier date doesn't precede the later date:

     WITH daily_transactions AS (
         SELECT transaction_date, SUM(amount) AS total_amount
         FROM user_transactions
         GROUP BY transaction_date)
    
     SELECT t2.transaction_date
         SUM(t1.amount) AS weekly_rolling_total
     FROM daily_transactions t1
     INNER JOIN daily_transactions t2
         ON t1.transaction_date > DATE_ADD('DAY', -7, t2.transaction_date)
             AND t1.transaction_date <= t2.transaction_date
     GROUP BY t2.transaction_date
     ORDER BY t2.transaction_date ASC

Are you interviewing for data science jobs or are you trying to hone your data science skills? Check out our newsletter, Data Science Prep, to get data science interview questions straight to your inbox.

Ace your data science interview

Start thinking like a data scientist by solving real interview questions in your inbox!