10 Interview Questions For Aspiring Data Scientists
By Kamal Jacob
Data science has emerged as one of the most rapidly adopted technology areas of the digital age, and the demand for data science skills has grown along with it. In a Forbes article, IBM predicts a 28% increase in demand for data scientists by 2020. The article also points to a consistent rise in pay scales for data scientists.
While the demand is clear, cracking a data science interview can seem daunting if you are not well prepared. Every company or startup runs its interviews in a different format: some ask extremely technical questions, while others give you real-life problems or assignments to solve. To help you prepare, here are the most popular questions you are likely to face in a data science interview:
Q1) How do you distinguish between supervised and unsupervised learning?
Ans) This is one of the most common questions, and the answer is best kept simple and straightforward. The following table shows the differences between supervised and unsupervised learning:
Supervised learning | Unsupervised learning
1. Requires labeled input data. | 1. Doesn’t require labeled input data.
2. Uses a training data set. | 2. Uses the input data set.
3. Enables classification and regression. | 3. Enables clustering, density estimation, and dimension reduction.
Q2) How do you explain a normal distribution and what are its characteristics?
Ans) A normal distribution represents a data set in which a majority of values cluster in the center of the range while the rest of the data is symmetrically arranged towards the right and left extremes.
If you plot this kind of distribution, it takes the form of a bell-shaped curve.
Image source: https://towardsdatascience.com/understanding-the-68-95-99-7-rule-for-a-normal-distribution-b7b7cbf760c2
Properties of the normal distribution
1. Values of the mean, median, and mode are equal.
2. It forms a bell-shaped curve which is symmetric about the mean.
3. The total area under the curve is equal to one.
4. The curve approaches the x-axis but never meets or touches the x-axis.
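If the interviewer asks you to back these properties up, a few lines of Python can help. The snippet below is only a sketch using the standard library: the helper names (normal_pdf, normal_cdf, prob_within) are made up for this illustration, and it checks the 68-95-99.7 rule that the image source above refers to.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Probability that a normal variable is <= x, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_within(k, mu=0.0, sigma=1.0):
    """Probability of landing within k standard deviations of the mean."""
    return normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)


# Share of values within 1, 2 and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")

# Symmetry about the mean: the density is the same at equal distances on both sides.
print(normal_pdf(-1.3) == normal_pdf(1.3))

# The curve gets close to the x-axis far from the mean but stays above it.
print(normal_pdf(10.0) > 0.0)
```

The three probabilities come out at roughly 68%, 95% and 99.7%, which is where the rule gets its name.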
Q3) What is logistic regression? Please explain briefly
Ans) Logistic regression is a technique for predicting a binary outcome from a linear combination of predictor variables. It models the probability of the outcome and describes the relationship between a binary dependent variable and one or more independent variables.
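To make the "linear combination" part concrete, here is a minimal sketch in plain Python. The weights and bias are hypothetical numbers picked for illustration (in practice they are learned from training data), and the function names are not from any library.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1 (numerically stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, bias):
    """Probability of the positive class from a linear combination of the predictors."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a binary outcome (1 or 0)."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical coefficients, assumed here instead of being fitted to data.
weights = [0.8, -1.2]
bias = -0.5

print(predict_proba([2.0, 0.5], weights, bias))
print(predict_label([2.0, 0.5], weights, bias))
print(predict_label([0.0, 2.0], weights, bias))
```

The linear part can produce any number, and the sigmoid squeezes it into the 0 to 1 range so that it can be read as a probability.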
Q4) What is a confusion matrix?
Ans) A confusion matrix for a binary classifier is a 2x2 matrix that shows how many of its predictions were true positives, false positives, true negatives, and false negatives. An example of a confusion matrix is given in the image below:
To create a confusion matrix, a binary classifier, which labels each observation as positive or negative, is evaluated on a test data set for which the correct labels are known. Its predicted labels are then compared with the correct ones.
This classifier produces four outcomes:
a) True positive(TP) — Correct positive prediction
b) False positive(FP) — Incorrect positive prediction
c) True negative(TN) — Correct negative prediction
d) False negative(FN) — Incorrect negative prediction
Basic measures derived from the confusion matrix, where P = TP+FN is the number of actual positives and N = TN+FP is the number of actual negatives:
1. Error rate = (FP+FN)/(P+N)
2. Accuracy = (TP+TN)/(P+N)
3. Sensitivity (recall or true positive rate) = TP/P
4. Specificity (true negative rate) = TN/N
5. Precision (positive predictive value) = TP/(TP+FP)
6. F-score (weighted harmonic mean of precision and recall) = (1+b²)(PREC·REC)/(b²·PREC+REC), where b is commonly 0.5, 1, or 2.
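Here is a small sketch of how these measures can be computed by hand. The labels are made-up example data, the function names are hypothetical, and the F-score uses the standard form with (1 + b²) in the numerator. It also assumes that no denominator is zero, meaning the data contains at least one actual positive, one actual negative, and one positive prediction.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN and FN by comparing predicted labels with the correct ones."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def measures(tp, fp, tn, fn, beta=1.0):
    """Basic measures derived from the four counts (assumes non-zero denominators)."""
    p = tp + fn  # actual positives
    n = tn + fp  # actual negatives
    precision = tp / (tp + fp)
    recall = tp / p
    b2 = beta * beta
    return {
        "error_rate": (fp + fn) / (p + n),
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "f_score": (1 + b2) * precision * recall / (b2 * precision + recall),
    }


# Made-up test labels: 1 is the positive class, 0 the negative class.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn)
for name, value in measures(tp, fp, tn, fn).items():
    print(f"{name}: {value:.3f}")
```

With b = 1 the F-score weighs precision and recall equally. A value of b below 1 favors precision, and a value above 1 favors recall.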
Q5) Please explain univariate, bivariate and multivariate analysis.
Ans) Univariate, bivariate, and multivariate analysis are analysis techniques that use one, two, or more than two variables respectively.
The most important use of univariate analysis is to summarize the data and find patterns within a dataset based on just one variable.
Bivariate analysis, on the other hand, examines the relationship between two variables. Tools such as chi-squared tests and t-tests can be used to check whether the two variables are related.
Multivariate analysis is used when more than two variables in the dataset are analyzed together.
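A quick way to show the difference between the first two in an interview is a tiny example. The sketch below uses made-up data (hours studied and test scores) and a hand-written Pearson correlation as one simple bivariate measure. It is an illustration only, not a replacement for the statistical tests mentioned above.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation between two variables, a simple bivariate measure."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up data for illustration.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

# Univariate analysis: summarize one variable on its own.
print("mean score:", statistics.mean(scores))
print("median score:", statistics.median(scores))
print("score standard deviation:", round(statistics.stdev(scores), 2))

# Bivariate analysis: look at how two variables move together.
print("correlation between hours and scores:", round(pearson(hours, scores), 4))
```

Multivariate analysis follows the same idea with three or more variables, for example by adding sleep hours and attendance to the data above.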
Q6) Define Linear regression in statistics.
Ans) Linear regression is one of the most popular methods used in analytics. It describes the relationship between a dependent variable and one or more independent variables. The main advantages of using linear regression are listed below:
1. It helps understand the strength and direction of the relationship in the data.
2. It helps in making sure that the data model is valid and useful for the scenario.
3. It finds extensive usage in cause-effect scenarios. For instance, it helps estimate the effect of a certain action on various outcomes.
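For the simplest case of one independent variable, the fit can be sketched in a few lines. This example assumes ordinary least squares, which is the usual fitting method but is not named above, and it uses made-up data points and a hypothetical fit_line helper.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up observations of an independent variable x and a dependent variable y.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")

# Use the fitted line to estimate y for a new value of x.
print(f"estimate at x = 6: {intercept + slope * 6:.2f}")
```

The sign of the slope gives the direction of the relationship, and its size tells you how much y is expected to change for each unit change in x.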
Q7) What are Recommender Systems?
Ans) Recommender systems are a specific type of information filtering system used to predict the preference or rating a user is likely to give a product. They are extensively used in e-commerce, book catalogs, movie catalogs, news, research, music catalogs, blogs, etc.
Q8) What is the difference between Cluster and Systematic Sampling?
Ans) Cluster sampling and systematic sampling are both sampling techniques, suited to different use cases.
Cluster sampling focuses on reducing the cost of obtaining a sample and is performed in two steps. Although it is assumed to be less accurate than other sampling procedures, it is used when it is difficult to compile a list of the entire population. It divides the population into groups (clusters), randomly selects some of those clusters, and then draws a further random sample from the selected clusters for analysis.
Systematic sampling, on the other hand, is a probability sampling method in which sample members are selected from a large population using a random starting point and a fixed, periodic interval. It is simple, yet it follows a defined selection process: only the starting point is random, so randomness is minimized and the entire population is evenly sampled.
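Both techniques are easy to sketch with Python's random module. The function names, the region-based clusters, and the population of numbered records below are all made up for illustration. The systematic sampler also assumes the sample size is no larger than the population.

```python
import random


def systematic_sample(population, sample_size, rng):
    """Pick a random starting point, then take every step-th member after it."""
    step = len(population) // sample_size  # fixed, periodic interval
    start = rng.randrange(step)            # random starting point
    return [population[start + i * step] for i in range(sample_size)]


def cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Step 1: randomly choose clusters. Step 2: randomly sample within each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        sample.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Made-up population of 100 numbered records.
population = list(range(100))
print(systematic_sample(population, 10, rng))

# The same records grouped into hypothetical regional clusters.
clusters = {
    "north": population[0:25],
    "south": population[25:50],
    "east": population[50:75],
    "west": population[75:100],
}
print(cluster_sample(clusters, 2, 5, rng))
```

Notice that the systematic sample is spread across the whole population, while the cluster sample only ever contains records from the two regions that were picked.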
Q9) What are decision trees?
Ans) A decision tree is a supervised machine learning algorithm used primarily for regression and classification. It works by incrementally breaking the data set down into smaller and smaller subsets while building the associated tree. The final outcome is a tree made up of decision nodes and leaf nodes. An interesting aspect of decision trees is that they can handle both categorical and numerical data.
Q10) What are entropy and information gain in the decision tree algorithm?
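To picture decision nodes and leaf nodes, it can help to write a tiny tree out by hand. The tree below is a made-up example and was not learned from data. The dictionary layout and the predict function are just one possible way to represent it, with one categorical split and two numerical splits.

```python
# A decision node is a dict; a leaf node is simply the predicted label (a string).
tree = {
    "feature": "outlook",  # categorical split: one branch per category
    "branches": {
        "overcast": "play",
        "sunny": {
            "feature": "humidity",  # numerical split: compare with a threshold
            "threshold": 70,
            "branches": {"le": "play", "gt": "stay home"},
        },
        "rainy": {
            "feature": "wind_kmh",
            "threshold": 30,
            "branches": {"le": "play", "gt": "stay home"},
        },
    },
}


def predict(node, row):
    """Walk from the root through decision nodes until a leaf node is reached."""
    while isinstance(node, dict):
        value = row[node["feature"]]
        if "threshold" in node:
            branch = "le" if value <= node["threshold"] else "gt"
        else:
            branch = value
        node = node["branches"][branch]
    return node


print(predict(tree, {"outlook": "sunny", "humidity": 85, "wind_kmh": 10}))
print(predict(tree, {"outlook": "overcast"}))
print(predict(tree, {"outlook": "rainy", "humidity": 60, "wind_kmh": 12}))
```

Each split sends a row into a smaller subset of the data, which is the incremental breakdown described in the answer.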
Ans) Entropy and information gain are concepts used in ID3, a core algorithm for creating a decision tree.
The ID3 algorithm measures the homogeneity of a sample data set using entropy. Entropy is zero for a completely homogeneous sample and one for a sample divided equally between two classes.
Information gain is the reduction in entropy after a data set is split on an attribute. Constructing a decision tree is all about finding the attributes that return the highest information gain.
The following equation is used to calculate the information gain, where Entropy(T) is the entropy of the data set T before the split and Entropy(T,X) is the weighted entropy of the subsets after T is split on attribute X:
Gain(T,X) = Entropy(T) - Entropy(T,X)
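The equation is easier to remember once you have worked it through on a tiny example. The sketch below uses a made-up four-row data set and hypothetical helper names. It assumes the usual base-2 entropy and takes Entropy(T,X) to be the size-weighted average entropy of the subsets produced by splitting on X.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Gain(T, X): entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    entropy_after = sum(len(group) / total * entropy(group)
                        for group in groups.values())
    return entropy(labels) - entropy_after


# Made-up data: should we play outside today?
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
]
labels = ["no", "no", "yes", "yes"]

print("entropy of a homogeneous sample:", entropy(["yes", "yes", "yes"]))
print("entropy of an evenly divided sample:", entropy(labels))
for attribute in ("outlook", "windy"):
    print(f"gain for {attribute}:", information_gain(rows, labels, attribute))
```

In this toy data the outlook attribute separates the classes perfectly while windy tells us nothing, so a tree built on information gain would split on outlook first.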
These are some of the most commonly asked questions in a data science interview. To answer them well, it is very important that you understand the basics of data science. If you are a fresher looking to build a career in data science, it is best to learn from experts by taking an online course in Data Science from Manipal ProLearn.
The course is designed to provide you with in-depth, detailed knowledge of the concepts, tools, and algorithms of Data Science, AI, and Machine Learning. With a combination of instructor-led online training and real case studies, you will be able to brush up your skills and earn a certification that is highly valued across industries. What else? You get free Python learning classes too!
Successful completion of the course will ensure you can answer all the above interview questions, and many more, to land a job in the field of data science.