Coding Guide for Data Science Interivews

Overview
Coding is important for data scientists, especially for those involved in putting their analyses into production. As such, many companies will test coding ability through basic data structures and algorithms questions. Topics include: arrays, trees, etc., along with algorithms such as BFS/DFS and dynamic programming. As with coding interviews, you are expected to understand runtime and space efficiency. Since there are many topics and resources on data structures and algorithms, this post will serve as a brief guide to common topics.
General Approach
It is important to understand the question at hand and clarify any assumptions that need to be made. Examples would include: asking about inputs, outputs, and basic edge cases (missing or null inputs). Once you understand what is being asked, you want to start with a "brute-force" solution which might be inefficient but will still solve the problem. Then you want to work towards a more optimized solution and start coding when you have an idea of how the more optimal solution might look like. It is good to explain your code while coding since interviewers are often more interested in your thought process rather than just the end code result (which still matters of course). Make sure that you name variables appropriately, use helper functions where needed, and can provide test cases as needed to ensure code correctness. Along the way, your interviewer may give you feedback which should guide the remainder of the interview.
Time and Space Complexity
Determining the runtime and space usage (how much memory is used) of an algorithm is essential for coding interviews. Generally we care about the worst case scenario for performance. The most common technique, Big O notation, generally describes the "worst-case upper bound," i.e. the maximum runtime or space usage of an algorithm, relative to the input size N.
For example, consider an algorithm running on an array of size N. Here are examples of runtime complexities, from fastest to slowest:
- O(1): Constant time. Example: getting a value at a particular index from an array
- O(log N): Logarithmic time. Example: binary search on a sorted array
- O(N): Linear time. Example: using a for-loop to traverse through an array
- O(N log N): Log-linear time. Example: running mergesort on an array
- O(N^2): Quadratic time. Example: iterating over every pair of elements in an array using a double for-loop
- O(2^N): Exponential time. Example: recursively generating all binary numbers that are N digits long
- O(N!): Factorial time. Example: generating all permutations of an array
The same Big-O runtime analysis concepts apply analogously to space complexity. For example, if we need to store a copy of an input array with N elements, that would be an additional O(N) space. If we wanted to store an adjacency matrix among N nodes, we would need O(N^2) space to keep the N-by-N sized matrix.
For a basic example for both runtime and space complexity analysis, we can look at binary search, where we are searching for a particular value within a sorted array. The code which implements this algorithm is below (with an extra set of conditions that returns the closest if the exact value is not found):
def binary_search(a, k):
lo, hi = 0, len(a) - 1
best = lo
while lo <= hi:
mid = lo + (hi - lo) // 2
if a[mid] < k:
lo = mid + 1
elif a[mid] > k:
hi = mid - 1
else:
best = mid
break
if abs(a[mid] - k) < abs(a[best] - k):
best = mid
return best
If we start binary search with an input of N elements, then at the next iteration, we would only need to search through N/2 elements, and so on. The runtime complexity for binary search is O(log N) since at each iteration we cut the remaining search space in half. The space complexity would simply be O(N) since the array is size N, and we do not need auxiliary space.
Arrays
An array is a series of elements stored sequentially in memory. Arrays are optimal for accessing elements at particular indices, with an O(1) access and index time. However, they are slower for searching and deleting a specific value, with an O(N) runtime, unless sorted.
Common array interview questions include:
- moving all the negative elements to one side of an array
- merging two sorted arrays
- finding specific sub-sequences of elements within the array, like the longest consecutive subsequence or the maximum product subarray.
For jobs where Python knowledge is important, interviews may cover list comprehensions, due to their expressiveness and ubiquity in codebases. As an example, below, we use a list comprehension to create a list of the first 10 positive even numbers. Then, we use another list comprehension to find the cumulative sum of the first list:
a = [x*2 for x in range(1, 11)] # list creation
c = [sum(a[:x]) for x in range(len(a)+1)] # cumulative sum
Linked Lists
A linked list is composed of nodes with data that have pointers to other nodes. The first node is called the head, and the last node is called the tail. Linked lists can be circular, where the tail points to the head. They can also be doubly-linked, where each node has a reference to both the previous and next nodes. Linked lists are optimal for insertion and deletion, with O(1) insertion time at the head or tail, but are worse for indexing and searching, with a runtime complexity of O(N) for indexing and O(N) for search.
Common linked list questions include:
- reversing a linked list
- detecting a cycle in a linked list
- removing duplicates from a sorted linked list
- checking if a linked list represents a palindrome.
As an example, below we reverse a linked list. Said another way, given the input linked list 4 → 1 → 3 → 2, we want to write a function which returns 2 → 3 → 1 → 4. To implement this, we first start with a basic node class:
class Node:
def __init__(self, val):
self.val = val
self.next = None
Then we create the LinkedList class, along with the method to reverse its elements. The reverse function iterates through each node of the linked list. At each step, it does a series of swaps between the pointers of the current node and its neighbors.
class LinkedList:
def __init__(self):
self.head = None
def reverse(self):
prev = None
curr = self.head
while curr:
next = curr.next
curr.next = prev
prev = curr
curr = next
self.head = prev
Stacks & Queues
A stack is a data structure that allows adding and removing elements in a last-in, first-out (LIFO) order. This means that the element that is added last is the first element to be removed. Another name for adding and removing elements from a stack is pushing and popping. Stacks are usually implemented using an array or linked list. In contrast, queue is a data structure that allows adding and removing elements in a first-in, first-out (FIFO) order. Queues are also usually implemented using an array or linked list. The main difference between a stack and a queue is the removal order: in the stack, there is a LIFO order, whereas in a queue it's a FIFO order. Stacks are generally used in recursive operations, whereas queues are used in more iterative processes. Common stacks and queues interview questions include:
- writing a parser to evaluate regular expressions (regex)
- evaluating a math formula using order of operations rules
- running a breadth-first or depth-first search through a graph
An example interview problem that uses a stack is determining whether a string has balanced parentheses (every type of left-side parentheses is accompanied by valid right-side parentheses). For instance, the string "({}((){}))" is correctly balanced, whereas the string "{}(})" is not balanced due to the last character ')'. The algorithm is as follows:
If we see starting parentheses (a left-side one), we push it onto the stack
If we see an ending parentheses (a right-side one), we check that it's the same type as the most recently seen left-side parentheses on the stack.
If the parentheses are of the same type, then we pop from the stack. If they don't match, we return false since the parentheses are mismatched.
We continue until we're done parsing the input, and the stack is empty (every pair of parentheses was correctly accounted for), in which case, we return true, or the stack is not empty, in which case, we return false.
def check_balance(s):
left_side = ["(", "{", "["] # left parentheses
right_side = [")", "}", "]"] # right parentheses
stack = [] # stack
for i in s:
if i in left_side:
stack.append(i) # push onto stack
elif i in right_side:
pos = right_side.index(i) # get right char
# check match
if len(stack) == 0 or (left_side[pos] != stack[len(stack)-1]):
return False
else:
stack.pop() # remove and continue
if len(stack) == 0: # balanced
return True
else:
return False
Hash Maps
A hash map stores key-value pairs. For every key, a hash map uses a hash function to compute an index, which locates the bucket where that key's corresponding value is stored. In Python, a dictionary offers support for key-value pairs and has the same functionality as a hash map.
While a hash function aims to map each key to a unique index, there can be collisions where different keys have the same index. With a good hash function, you can expect that the elements are overall distributed evenly throughout the hash map so any lookups, insertions, or deletions for a key take O(1) time.
Due to its optimal runtime properties, hash maps make a frequent appearance in coding interview questions. Common hash map questions include:
- finding the unions or intersection of two lists
- finding the frequency of each word in a piece of text
- finding four elements a, b, c and d in a list such that a+b = c+d.
An example interview question that uses a hash map is determining whether an array contains two elements that sum up to some value k. For example, say we have a list [3, 1, 4, 2, 6, 9] and k = 11. In this case, we return true since 2 and 9 sum up to 11. The brute-force method to solving this problem is to use a double for-loop and sum up every pair of numbers in the array, which provides an O(N^2) solution. But, by using a hash map, we only have to iterate through the array with a single for-loop. In the for-loop, for each element we'd check whether the complement of the number (target - that number) exists in the hash map, achieving an O(N) solution:
def check_sum(a, target):
d = {} # create dictionary
for i in a:
if (target - i) in d: # check hashmap
return True
else:
d[i] = i # add to hashmap
return False
Trees
A tree is a basic data structure that has a root node and subtrees of children nodes. The most basic kind of tree is a binary tree, where each node has at most two children nodes. Binary trees can be implemented with a left and right child node, like below:
class TreeNode:
def __init__(self, val):
self.val = val
self.left = None
self.right = None
There are various types of traversals and basic operations that can be performed on trees. For example, in an in-order traversal, we first process the left subtree of a node, then process the current node, and finally process the right subtree:
def inorder(node):
if node is None:
return []
else:
return inorder(node.left) + [node.val] + inorder(node.right)
For searching, insertion, and deletion, the worst-case runtime for a binary tree is O(N), where N is the number of nodes in the tree. Common tree questions asked include writing functions to get various properties of a tree, like the depth of a tree or the number of leaves in a tree. Usually, tree questions the boil down to traversing through the tree and recursively passing some data in a top-down or a bottoms-up manner. Coding interview problems often cover two specific types of trees: Binary Search Trees and Heaps.
Binary Search Trees
A binary search tree (BST) is composed of a series of nodes, where any node in a left subtree is smaller than or equal to the root, and any node in the right subtree is larger than or equal to the root. When BSTs are height-balanced, so that no one leaf is much deeper than another leaf from the root, searching for elements becomes very efficient. By cutting the search space in half at each iteration, BSTs support search, insertion, and deletion in O(log N) runtime.
Common BST questions include:
- testing if a binary tree has the BST property
- finding the k-th largest element in a BST
- finding the lowest common ancestor between two nodes (the closest common node to two input nodes such that both input nodes are descendants of that node)
An example implementation of a BST using the TreeNode class, with an insert function, is as follows:
class TreeNode:
def __init__(self, val):
self.val = val
self.left = None
self.right = None
class BST:
def __init__(self, val):
self.root = TreeNode(val)
def insert(self, node, val):
if node is not None:
if val < node.val:
if node.left is None:
node.left = TreeNode(val)
else:
self.insert(node.left, val)
else:
if node.right is None:
node.right = TreeNode(val)
else:
self.insert(node.right, val)
else:
self.root = TreeNode(val)
return
Heaps
Another common type of tree data structure is a heap. A max-heap is a type of heap where each parent node is greater than or equal to any child node. As such, the largest value in a max-heap is the root value of the tree, which can be looked up in O(1) time. Similarly, for a min-heap, each parent node is smaller than or equal to any child node, and the smallest value lies at the root of the tree and can be accessed in constant time.
To maintain the heap property, there is a sequence of operations known as "heapify" whereby values are "bubbled up/down" within the tree based on what value is being inserted or deleted. The runtime of this heapify operation is the height of the tree, O(log N). Inserting or deleting is O(log N) because the heapify operation runs to maintain the heap property. The search runtime is O(N) since every node may need to be checked in the worst case scenario.
Commonly asked heap interview questions include:
- finding the K largest or smallest elements within an array
- finding the current median value in a stream of numbers
- sorting an almost-sorted array (where elements are just a few places off from their correct spot)
To demonstrate the use of heaps, below we find the k-largest elements in a list, using the heapq package in Python:
import heapq
k = 5
a = [13, 5, 2, 6, 10, 9, 7, 4, 3]
heapq.heapify(a) # creates heap
heapq.nlargest(k, a) # finds k-largest
Graphs
Graphs are composed of nodes (vertices) and a set of edges that connect the nodes. Edges are often represented in two ways: an adjacency matrix or an adjacency list. In the case of an adjacency matrix, there is a n-by-n matrix for N nodes where the entries are either boolean values (denoting an edge is present or not between two nodes) or a weight. In contrast, an adjacency list simply stores a list of neighbors for each node. An example implementation of a graph is below. The Vertex class uses an adjacency list to keep track of its neighbors with weights:
class Vertex:
def __init__(self, val):
self.val = val
self.neighbors = {}
def add_to_neighbors(self, neighbor, w): # add to neighbor dict with weight
self.neighbors[neighbor] = w
def get_neighbors(self):
return self.neighbors.keys()
class Graph:
def __init__(self):
self.vertices = {}
def add_vertex(self, val):
new_vertex = Vertex(val)
self.vertices[val] = new_vertex
return new_vertex
def add_edge(self, u, v, weight): # add edge from u to v
if u not in self.vertices:
self.add_vertex(Vertex(u))
if v not in self.vertices:
self.add_vertex(Vertex(v))
self.vertices[u].add_to_neighbors(self.vertices[v], weight)
def get_vertex(self, val):
return self.vertices[val]
def get_vertices(self):
return self.vertices.keys()
def __iter__(self):
return iter(self.vertices)
BFS and DFS
Interview problems related to graphs usually involve traversing the graph, either in a breadth-first-search (BFS) or in a depth-first-search (DFS). In BFS, you start with one node and add it to a queue. For each node in the queue, you process the node and add its neighbors to the queue. In DFS, you also start with one node, but instead of processing all the neighbors next, you recursively process each neighbor one by one. Below is an example of BFS using a queue for iterative processing:
def bfs(graph, v):
n = len(graph.vertices)
visited = [False] * (n+1)
queue = []
queue.append(v)
visited[v] = True
while queue:
curr = queue.pop(0)
for i in graph.get_vertex(curr):
if visited[i.val] == False:
queue.append(i.val)
visited[i.val] = True
return visited
Below is a primitive example of DFS, where visited tracks the set of processed nodes, and v is the node currently being processed:
def dfs_helper(graph, v, visited):
visited.add(v)
for neighbor in graph.get_vertex(v).get_neighbors():
if neighbor.val not in visited:
dfs_helper(graph, neighbor.val, visited)
return visited
def dfs(graph, v):
visited = set()
return dfs_helper(graph, v, visited)
The runtime for both is O(E+V), where E is the number of edges in the graph and V is the number of vertices. For most basic use cases, DFS and BFS are interchange-able.
Recursion
A recursive function is one that calls itself directly or indirectly. It can be broken down into two parts: the recursive case (the part which calls itself) and the base case (which terminates the recursion).
For an example of a recursive algorithm, consider the classic problem of producing Fibonacci numbers (0, 1, 1, 2, 3, 5, 8, etc.):
def fib(n):
if n == 0:
return 0
if n == 1 or n == 2:
return 1
else:
return fib(n-1)+fib(n-2)
If you draw the tree of recursive calls, you will see that many times, fib(n) for a particular n is called repeatedly, which leads to an undesirable exponential runtime and memory usage.
Dynamic Programming
Dynamic programming is when results of subproblems are stored to speed up recursive computations. For example, the fibonacci example can we written as follows where we store previous results in an array:
dp = [0] * 1000
def fib(n):
if n == 0 or n == 1:
return n
else:
dp[n] = fib(n-1) + fib(n-2)
return dp[n]
Two properties are needed for dynamic programming (DP) to be applicable:
optimal substructure
overlapping subproblems
Optimal substructure means that the problem can be solved optimally by breaking it down into subproblems and solving those subproblems. Overlapping subproblems indicates that the problem can be broken down into subproblems that are then reusable in other subproblems. Note that the Fibonacci satisfies the two requirements because:
fib(n) is found by solving the subproblems fib(n-1) and fib(n-2)
subproblems are overlapping: for both fib(n) and fib(n-1) require having solved fib(n-2)
Dynamic programming can be implemented with a top-down or bottom-up approach. In a top-down approach, we start with the top and break the problem into smaller chunks (as in the Fibonacci example). In practice, the top-down approaches often are implemented with recursion, and the intermediate results are cached in a hash table.
In contrast, in a bottom-up manner, we start with the smaller problems and continue till the top. These solutions are often implemented iteratively, where an n-dimensional array is used to cache previous results. An example for a bottom-up Fibonacci example would be:
def fib(n):
dp = [0 for _ in range(n+1)]
dp[1] = 1
for i in range(2, n+1):
dp[i] = dp[i-1] + dp[i-2]
return dp[n]
For a more complicated example, of dynamic programming, consider the knapsack problem: where we are choosing between N items, each with some positive value and positive weight. We want to maximize the amount of value obtained by selecting items with a total weight of no more than W.
The core logic is in deciding whether to take an item or not. If an item meets the weight requirement, then we either:
take the item, and get a new optimal maximum value, and now have a new weight constraint (current weight minus item's weight)
do not take the item, and continue with the current optimal maximum value and the current weight constraint
An example implementation is below:
def knapsackDP(values, weights, max_weight):
n = len(values)
dp = [[0 for x in range(max_weight+1)] for x in range(n+1)] # 2d-array
for i in range(n+1):
for w in range(max_weight+1):
if weights[i-1] <= w: # meets weight requirement
dp[i][w] = max(values[i-1]+dp[i-1][w-weights[i-1]],
dp[i-1][w]) # either take value or not
else:
dp[i][w] = dp[i-1][w] # not taking value
return dp[n][max_weight]
Sorting
Two different sorting algorithms - mergesort and quicksort - come up in coding interviews from time to time. While it's rare to be directly asked to code up these algorithms (code excluded below for simplicity), both sorts are often modified to solve a problem or used as an intermediate step within a larger coding interview solution. This is because sorting the input can expose some structure which makes the problem much easier.
Mergesort
Mergesort uses a "divide-and-conquer" approach as follows:
- Repeatedly divide the input into smaller subarrays, until a base case of a single element is reached (this single element subarray is considered sorted).
- Repeatedly merge the smaller sorted subarrays into bigger sorted subarrays, until the whole input is merged back together.
Overall, mergesort has a runtime complexity of O(N log N) and also requires O(N log N) space to support the auxiliary merging steps. An example implementation of mergesort which utilizes a helper function for merging sub-arrays.
Quicksort
Quicksort selects an arbitrary pivot element and puts all elements smaller than the pivot to the left of the pivot, and larger elements to the right of the pivot. The exchanging of elements occurs through swaps. This process of selecting a pivot and swapping the left and right elements occurs recursively, until the base case is hit, when just two elements are swapped into their correct relative order.
The worst-case runtime for quicksort is O(N^2), although in practice the expected runtime is closer to O(N log N) through picking a "smart" pivot whereby the elements are roughly divided into equal halves upon each iteration.
Sample Interview Questions
Easy Problems:
- Given an integer array, return the maximum product of any three numbers in the array. For example, for A = [1, 3, 4, 5] you should return 60 while for B = [-2, -4, 5, 3] you should return 40.
- Given a list of coordinates, write a function to find the k closest points (measured by Euclidean distance) to the origin. For example, if k = 3 and the points are: [[2, -1], [3, 2], [4, 1], [-1, -1], [-2, 2]] then return [[-1, -1], [2, -1], [-2, 2]].
- Given a binary tree, write a function to determine whether the tree is a mirror image of itself. Two trees are a mirror image of each other if their root values are the same and the left subtree is a mirror image of the right subtree.
Medium Problems:
- Given two strings A and B, write a function to return a list of all the start indices within A where the substring of A is an anagram of B. For example, if A = "abcdcbac" and B = "abc" then you want to return [0, 4, 5] since those are the starting indices of substrings of A that are anagrams of B.
- Given a binary tree, write a function to determine the diameter of the tree, which is the longest path between any two nodes.
- Given two lists X and Y, return their correlation.
Hard Problems:
- Given a number n, return the number of lists of consecutive positive integers that sum up to n. For example, for n = 9, return 3 since the consecutive positive integer lists are: [2, 3, 4], [4, 5], and [9]. Can you solve this in linear time?
- Given a continuous stream of integers, write a class with functions to add new integers to the stream, and a function to calculate the median at any time.
- Given a string with left and right parentheses, write a function to determine the length of the longest well-formed substring. For example, if the input string is ")(())(" then return 4 since the longest well-formed substring is "(())".
Are you interviewing for data science jobs or are you trying to hone your data science skills? Check out our newsletter, Data Science Prep, to get data science interview questions straight to your inbox.