Interview Cheat Sheet – Data Science & Machine Learning


Based on feedback shared by clients and candidates, I’ve created a Data Science and Machine Learning Cheat Sheet for junior/entry-level roles. The cheat sheet contains some of the most commonly asked questions and a brief answer to each. These answers should be expanded upon with examples taken from your previous experience.

What is Data Science?

Data Science combines various tools, algorithms, and machine learning principles to discover patterns in raw data. It differs from a traditional statistician’s role in that it focuses on predicting trends rather than explaining them.

Can you give examples of some of the key Python skills needed for data analysis?

  • Knowledge of lists, dictionaries, tuples, and sets.
  • Advanced knowledge of N-dimensional NumPy Arrays.
  • Advanced knowledge of Pandas dataframes.
  • Comfortable performing element-wise vector and matrix operations on NumPy arrays.
  • Working knowledge of scikit-learn.
  • Comfortable profiling the performance of a Python script and optimizing bottlenecks.
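
As a quick illustration of the element-wise and matrix operations mentioned above, here is a minimal sketch. The array values are made up for the example; only standard NumPy operators are used.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# Element-wise operations act on matching positions of the two arrays.
elementwise_sum = a + b
elementwise_product = a * b

# The @ operator computes a dot product for 1-D arrays.
dot_product = a @ b

m = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([1.0, 1.0])

# For a 2-D array and a 1-D array, @ is a matrix-vector product.
matrix_vector = m @ v

# Reductions along an axis: axis=0 averages down each column.
column_means = m.mean(axis=0)
```

In an interview it is worth being able to explain the difference between `a * b` (element-wise) and `a @ b` (dot or matrix product), since the two are easily confused.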

Why is data cleaning important for analysis?

Data gathered from multiple sources often contains duplicates, missing values, and inconsistent formats. Cleaning transforms it into a consistent format that data analysts or data scientists can work with, which in turn helps to increase the accuracy of machine learning models.
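
A minimal sketch of what cleaning can look like in Pandas is shown below. The column names, the sample rows, and the choice to fill missing values with the median are all assumptions made for illustration; the right choices depend on the data set.

```python
import pandas as pd

# Hypothetical raw data merged from several sources.
raw = pd.DataFrame({
    "customer": [" Alice", "bob ", "Alice", "Carol", None],
    "spend": ["10.5", "n/a", "10.5", "7", "3"],
})

def clean(df):
    out = df.copy()
    # Normalise whitespace and capitalisation in the text column.
    out["customer"] = out["customer"].str.strip().str.title()
    # Convert to numbers; unparseable entries such as "n/a" become NaN.
    out["spend"] = pd.to_numeric(out["spend"], errors="coerce")
    # Drop rows with no customer, then drop exact duplicate rows.
    out = out.dropna(subset=["customer"]).drop_duplicates()
    # Fill remaining missing spend values with the median (one option of many).
    out = out.assign(spend=out["spend"].fillna(out["spend"].median()))
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Be ready to justify each step: dropping rows loses information, and filling with the median changes the distribution, so neither is automatically the right call.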

What steps are involved in a typical analytics project?

  • Understand the business problem.
  • Explore the data and become familiar with it.
  • Clean the data.
  • Run the model, analyze the result and amend the approach until the best possible outcome is achieved.
  • Validate the model using a new data set.
  • Deploy the model and track its results to analyze its performance over time.
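
The modelling and validation steps above can be sketched in a few lines. This is a simplified illustration, assuming synthetic data and a straight-line model; the variable names (`visits`, `spend`) are hypothetical, and a real project would use its own data and model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: spend grows roughly linearly with visits, plus noise.
visits = rng.uniform(0.0, 10.0, size=200)
spend = 3.0 * visits + 5.0 + rng.normal(0.0, 1.0, size=200)

# Keep the last 50 rows aside as "new" data the model never sees in training.
x_train, x_new = visits[:150], visits[150:]
y_train, y_new = spend[:150], spend[150:]

# Run the model: fit a straight line by least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def r_squared(x, y):
    """Share of the variance in y explained by the fitted line."""
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

train_r2 = r_squared(x_train, y_train)
new_r2 = r_squared(x_new, y_new)  # validation on the held-out data
print(f"train R^2 = {train_r2:.3f}, new-data R^2 = {new_r2:.3f}")
```

A large gap between the two scores would be a signal to revisit the approach before deploying.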

What is Selection Bias?

Selection bias is an error that occurs when a researcher decides who is going to be studied instead of the selection process being random. This may result in the distortion of statistical analysis, due to the method of collecting samples. If the selection bias is not considered, then some conclusions of the study may not be accurate.

There are various types of selection bias including sampling bias (a non-random sample of a population), time interval (a trial may be terminated early at an extreme value), data (subsets of data chosen to support a conclusion) and attrition (discounting data that did not run to completion).
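
The effect of sampling bias can be shown with a small simulation. This is a sketch under made-up assumptions: a hypothetical population of satisfaction scores, and a survey that only reaches people who score above 55.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: 10,000 satisfaction scores centred on 50.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Random sample: every member has the same chance of being chosen.
random_sample = rng.sample(population, 500)

# Biased sample: only members scoring above 55 can be reached,
# for example because only satisfied customers answer the survey.
reachable = [score for score in population if score > 55]
biased_sample = rng.sample(reachable, 500)

population_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print(f"population: {population_mean:.1f}")
print(f"random sample: {random_mean:.1f}")
print(f"biased sample: {biased_mean:.1f}")
```

The random sample's mean stays close to the population mean, while the biased sample overstates it no matter how many people are surveyed.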

What is the purpose of A/B Testing?

A/B testing, also known as split testing, is a marketing experiment wherein you “split” your audience to test a number of variations of a campaign to determine which performs better.

A/B testing can be valuable because audiences behave differently. Something that works for one company may not necessarily work for another. It is useful for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads.
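
To decide whether one variation really performs better, the conversion rates of the two groups are usually compared with a significance test. The sketch below uses a two-proportion z-test, which is one common choice rather than the only one; the visitor and conversion counts are invented for the example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical campaign: 5,000 visitors per variation.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone; the threshold (often 0.05) should be fixed before the test is run.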

What are the differences between overfitting and underfitting?

Overfitting is a statistical modeling error which occurs when a function is too closely fit to a limited set of data points. It is caused by a model being excessively complex. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

Underfitting refers to a model that can neither model the training data nor generalize to new data. It arises when a statistical model or machine learning algorithm is too simple to capture the underlying trend of the data. An underfit model shows low variance but high bias, whereas an overfit model shows the opposite: low bias but high variance.
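
A small experiment makes the contrast concrete. The sketch below fits polynomials of increasing degree to noisy data; the sine-shaped trend, the noise level, and the degrees chosen are all assumptions for illustration. Training error can only fall as the model becomes more complex, but error on unseen data typically stops improving and then worsens.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical underlying trend the models are trying to recover.
    return np.sin(2 * np.pi * x)

# A small noisy training set and a separate test set from the same process.
x_train = np.linspace(0.0, 1.0, 15)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.02, 0.98, 50)
y_test = true_fn(x_test) + rng.normal(0.0, 0.2, x_test.size)

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return float(train_mse), float(test_mse)

# Degree 1 is too simple, degree 3 is a reasonable match, degree 10 is very flexible.
results = {degree: fit_and_score(degree) for degree in (1, 3, 10)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The straight line (degree 1) has high error on both sets, which is underfitting. The degree 10 model has very low training error but a noticeably higher test error, which is the signature of overfitting.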

What is Cluster Sampling?

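Cluster sampling is used in statistics when the target population is spread across a wide area and simple random sampling is impractical. The whole population is subdivided into clusters, or groups, and a random selection of those clusters is then made. Data is collected from every member of the chosen clusters, or from a random sample within them. This differs from stratified sampling, where a random sample is drawn from every group.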
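
Here is a minimal sketch of one-stage cluster sampling, in which whole clusters are picked at random and every member of a picked cluster is included. The schools and pupils are hypothetical, as is the helper name `cluster_sample`.

```python
import random

rng = random.Random(42)

# Hypothetical population: 20 schools (the clusters) with 30 pupils each.
population = {
    f"school_{i}": [f"school_{i}_pupil_{j}" for j in range(30)]
    for i in range(20)
}

def cluster_sample(clusters, n_clusters, rng):
    """Pick n_clusters whole clusters at random and return all their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [member for name in chosen for member in clusters[name]]
    return chosen, members

chosen_schools, sample = cluster_sample(population, 4, rng)
print(chosen_schools)
print(len(sample), "pupils sampled")
```

Only four schools need to be visited, which is the practical appeal of the method when the population is geographically spread out.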

What is Systematic Sampling?

Systematic sampling is a probability sampling technique which is often chosen for its simplicity and its periodic quality. The researcher randomly picks the first item or subject from the population and then selects every k-th item after it, where k is a fixed sampling interval. In the circular variant, the list is treated as circular, so once the end of the list is reached, counting continues from the top.
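
A sketch of the circular variant is shown below. The helper name `systematic_sample` and the choice of interval (population size divided by sample size, rounded down) are assumptions for the example.

```python
import random

def systematic_sample(items, n, rng):
    """Pick a random start, then take every k-th item, wrapping around the list."""
    k = len(items) // n  # sampling interval
    if k < 1:
        raise ValueError("sample size cannot exceed population size")
    start = rng.randrange(len(items))
    return [items[(start + i * k) % len(items)] for i in range(n)]

rng = random.Random(7)
population = list(range(1, 101))  # hypothetical subjects numbered 1 to 100
sample = systematic_sample(population, 10, rng)
print(sample)
```

One caveat worth mentioning in an interview: if the list itself has a repeating pattern whose period matches the interval, the sample can be badly unrepresentative.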

What is Machine Learning?

Machine Learning is the study and construction of algorithms that can learn from and make predictions on data. Closely related to computational statistics, it is used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics.

What is the difference between Supervised and Unsupervised learning?

Supervised learning uses labeled data: each training example pairs an input with a known output. It is typically used for classification, which maps inputs to output labels, or for regression, which maps inputs to a continuous output. Common algorithms in supervised learning include logistic regression, naive Bayes, support vector machines, artificial neural networks, and random forests.

The most common tasks within unsupervised learning are clustering, representation learning, and density estimation. In all of these cases, the researcher will be looking to learn the inherent structure of data without using explicitly provided labels. Some common algorithms include k-means clustering, principal component analysis, and autoencoders. Since no labels are provided, there is usually no single, direct way to compare model performance in most unsupervised learning methods.
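
The difference can be seen by running both kinds of algorithm on the same data. This is a simplified sketch: the two well-separated blobs are synthetic, the supervised model is a basic nearest-centroid classifier, and the k-means routine uses a simple farthest-point initialisation chosen for this example rather than a production-grade one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two well-separated blobs of 50 points each.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[8.0, 8.0], scale=0.5, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)  # labels, used by the supervised model only

def distances(points, centroids):
    """Distance from every point to every centroid."""
    return np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Supervised: learn one centroid per label, then classify by nearest centroid.
def fit_nearest_centroid(X, y):
    labels = np.unique(y)
    centroids = np.array([X[y == label].mean(axis=0) for label in labels])
    return labels, centroids

def predict_nearest_centroid(model, points):
    labels, centroids = model
    return labels[distances(points, centroids).argmin(axis=1)]

# Unsupervised: k-means never sees y and must discover the groups itself.
def k_means(X, k, n_iter=20, seed=0):
    init_rng = np.random.default_rng(seed)
    centroids = [X[init_rng.integers(len(X))]]
    while len(centroids) < k:
        # Next starting centroid: the point farthest from those already chosen.
        nearest = distances(X, np.array(centroids)).min(axis=1)
        centroids.append(X[nearest.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        assignment = distances(X, centroids).argmin(axis=1)
        centroids = np.array([
            X[assignment == j].mean(axis=0) if np.any(assignment == j) else centroids[j]
            for j in range(k)
        ])
    return assignment, centroids

model = fit_nearest_centroid(X, y)
new_points = np.array([[0.2, -0.1], [7.9, 8.3]])
predicted = predict_nearest_centroid(model, new_points)

clusters, _ = k_means(X, k=2)
# Cluster ids are arbitrary, so compare up to swapping 0 and 1.
agreement = max(np.mean(clusters == y), np.mean(clusters != y))
```

Note that the labels are needed even to score the clustering here, which illustrates why evaluating unsupervised methods is harder when no labels exist.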

What is Deep Learning?

Deep learning is a machine learning technique that teaches computers to learn by example. Deep learning is a key technology behind driverless cars and voice control in devices like phones, tablets and hands-free speakers. Deep learning is getting lots of attention as it’s achieving results that were not possible before.

In deep learning, a model learns to perform classification tasks directly from images, text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained by using a large set of labeled data and neural network architectures that contain many layers.

What are Artificial Neural Networks?

Artificial neural networks are one of the main tools used in machine learning. They are brain-inspired systems which are intended to replicate the way humans learn. Neural networks consist of input and output layers, as well as (in most cases) a hidden layer consisting of units that transform the input into something that the output layer can use. They are excellent tools for finding patterns which are too complex or numerous for a human programmer to extract and teach the machine to recognize.

Artificial neural networks work on a principle loosely modeled on biological neural networks. Each unit takes its inputs, computes a weighted sum of them plus a bias, and passes the result through an activation function.
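
The weighted sum, bias, and activation function can be sketched as a single forward pass through a tiny network. The layer sizes, the weights, and the choice of a sigmoid activation are all hypothetical values picked for illustration; in practice the weights are learned during training.

```python
import numpy as np

def sigmoid(z):
    """Activation function that squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass through a network with a single hidden layer."""
    # Each hidden unit: weighted sum of the inputs, plus bias, then activation.
    hidden = sigmoid(w_hidden @ x + b_hidden)
    # The output unit applies the same recipe to the hidden layer's values.
    output = sigmoid(w_out @ hidden + b_out)
    return hidden, output

# Hypothetical network: 2 inputs, 3 hidden units, 1 output.
x = np.array([0.5, -1.0])
w_hidden = np.array([[0.1, 0.4],
                     [-0.3, 0.2],
                     [0.7, -0.5]])
b_hidden = np.array([0.0, 0.1, -0.1])
w_out = np.array([[0.6, -0.2, 0.3]])
b_out = np.array([0.05])

hidden, output = forward(x, w_hidden, b_hidden, w_out, b_out)
print("hidden layer:", hidden)
print("output:", output)
```

Training a network means adjusting the weights and biases so that the output moves closer to the desired value for each example.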

Working with companies internationally, I provide consultancy and recruitment services to help them harness data, design and build their ideal data architecture, use advanced analytics to unlock the full value of their data and uncover valuable insights.