

Data Science Interview Questions and Answers


Data Science Interview Questions and Answers for beginners and experts. List of frequently asked Data Science Interview Questions with answers by Besant Technologies.
We hope these Data Science interview questions and answers are useful and help you get the best job in the data science industry. They were prepared by Data Science professionals based on the expectations of MNC companies. Stay tuned: we will update this page with new Data Science interview questions and answers frequently. If you want practical Data Science training, see Data Science Training in Chennai and Data Science Training in Bangalore.

Best Data Science Interview Questions and Answers

Besant Technologies supports students by providing Data Science interview questions and answers for job placements. Data Science is one of the most important courses at present because of the many job openings and high salaries for Data Science and related jobs. We also provide Data Science online training for students around the world through Gangboard. These are top Data Science interview questions and answers, prepared by our institute’s experienced trainers.

Data Science Interview Questions and Answers for Placements

Here is the list of the most frequently asked Data Science interview questions and answers in technical interviews. The questions are aimed at intermediate to somewhat advanced Data Science professionals, but beginners and freshers should also be able to follow the answers and explanations given here.

In this post, you will get the top 150+ Data Science interview questions and answers, which will be helpful to those who are preparing for jobs.

Q1. (Given a Dataset) Analyze this dataset and give me a model that can predict this response variable.

Start by fitting a simple model (multivariate regression, logistic regression), do some feature engineering accordingly, and then try more complicated models. Always split the dataset into train, validation, and test sets, and use cross validation to check performance.
Determine whether the problem is classification or regression.
Favor simple models that run quickly and that you can easily explain.
Mention cross validation as a means to evaluate the model.
Plot and visualize the data.
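As a rough illustration of the split-then-cross-validate advice, here is a minimal sketch using only the Python standard library. The helper name `k_fold_indices`, the 100-row dataset size, and the 20-row test hold-out are assumptions made for the example, not part of the original answer.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs that partition range(n_rows) into k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of roughly equal size
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

# Hold out a final test set first, then cross-validate on the remaining rows.
all_rows = list(range(100))          # hypothetical dataset of 100 row ids
random.Random(42).shuffle(all_rows)
test_rows, dev_rows = all_rows[:20], all_rows[20:]
splits = list(k_fold_indices(len(dev_rows), k=5))
```

Each model candidate would be fitted on `train` and scored on `valid` for every fold; the held-out `test_rows` are touched only once, at the very end.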

Q2. What could be some issues if the distribution of the test data is significantly different than the distribution of the training data?

A model that has high training accuracy might have low test accuracy. Without further knowledge, it is hard to know which dataset represents the population, so the generalizability of the algorithm is hard to measure. This can be partly mitigated by repeated splitting of the data into train and test sets (as in cross validation).
A change in data distribution between training and test is called dataset shift. If the train and test data have different distributions, a classifier fitted to the train data is likely to generalize poorly to the test data.
This issue can be reduced by using a more general learning method.
This can occur when:
P(y|x) is the same but P(x) is different (covariate shift).
P(y|x) is different (concept shift).
The causes can be:
Training samples are obtained in a biased way (sample selection bias).
Train differs from test because of temporal or spatial changes (non-stationary environments).
Solution to covariate shift:
Importance-weighted cross validation.

Q3. What are some ways I can make my model more robust to outliers?

We can use regularization such as L1 or L2 to reduce variance (at the cost of more bias).
Changes to the algorithm:
Use tree-based methods instead of regression methods, as they are more resistant to outliers. For statistical tests, use non-parametric tests instead of parametric ones.
Use robust error metrics such as MAE or Huber loss instead of MSE.
Changes to the data:
Winsorize the data.
Transform the data (e.g. log).
Remove outliers only if you’re certain they’re anomalies not worth predicting.
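Winsorizing can be sketched in a few lines. This is a minimal illustration that uses a crude nearest-rank percentile; the function name `winsorize` and the 10%/90% cut-offs are assumptions chosen for the example.

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clip values to the given lower and upper percentiles (nearest-rank, rounded down)."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_pct * (n - 1))]
    high = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, low), high) for v in values]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]   # one extreme outlier
clipped = winsorize(data, 0.1, 0.9)        # the outlier is pulled down to 9
```

Unlike deleting the row, winsorizing keeps the observation but limits how much it can pull the fit.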

Q4. What are some differences you would expect in a model that minimizes squared error, versus a model that minimizes absolute error? In which cases would each error metric be appropriate?

MSE penalizes outliers more heavily. MAE is more robust in that sense, but is harder to fit because the absolute value is not differentiable at zero, so there is no closed-form solution and the optimization is less smooth. So when the data contain outliers that should not dominate the fit and the extra fitting cost is acceptable, use MAE; otherwise use MSE.
MSE: the gradient is easy to compute. MAE: fitting typically needs linear programming or subgradient methods.
MAE is more robust to outliers. If the consequences of large errors are great, use MSE.
Minimizing MSE corresponds to maximum likelihood under Gaussian errors.
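A small numeric check makes the difference concrete: for a constant prediction, squared error is minimized by the mean and absolute error by the median, which is why one outlier drags the MSE fit much further. The sample values and the grid of candidate predictions are assumptions made for this sketch.

```python
from statistics import mean, median

def mse(values, c):
    """Mean squared error of predicting the constant c for every value."""
    return sum((v - c) ** 2 for v in values) / len(values)

def mae(values, c):
    """Mean absolute error of predicting the constant c for every value."""
    return sum(abs(v - c) for v in values) / len(values)

values = [1, 2, 3, 4, 100]                       # 100 is an outlier
candidates = [x / 10 for x in range(0, 1001)]    # grid 0.0, 0.1, ..., 100.0
best_mse = min(candidates, key=lambda c: mse(values, c))   # lands on the mean
best_mae = min(candidates, key=lambda c: mae(values, c))   # lands on the median
```

Here the MSE-optimal constant sits at the mean (22), far from most of the data, while the MAE-optimal constant stays at the median (3).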

Q5. What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced? What if there are more than 2 groups?

Accuracy: the proportion of instances you predict correctly. Pros: intuitive, easy to explain. Cons: works poorly when the class labels are imbalanced and the signal from the data is weak.
AUROC: plot FPR on the x axis and TPR on the y axis for different thresholds. Given a random positive instance and a random negative instance, the AUC is the probability that the model ranks the positive one higher. Pros: works well for testing the ability to distinguish the two classes. Cons: you can’t interpret predictions as probabilities (because AUC is determined by rankings), so it can’t explain the uncertainty of the model.
Logloss/deviance: Pros: an error metric based on probabilities. Cons: very sensitive to confident false positives and false negatives.
When there are more than 2 groups, we can compute k one-vs-rest binary log losses and add them up. Some metrics, like AUC, are defined directly only for the binary case.
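The ranking interpretation of AUC can be computed directly by comparing every positive with every negative. This is a minimal sketch for small inputs (it is quadratic in the number of examples); the function name `auc_by_pairs` and the toy labels and scores are assumptions for illustration.

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
auc = auc_by_pairs(labels, scores)                # 3 of 6 pairs are ordered correctly
perfect = auc_by_pairs([1, 1, 0], [0.9, 0.8, 0.1])
```

Because only the ordering of the scores matters, rescaling the scores monotonically leaves the AUC unchanged.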

Q6. What are various ways to predict a binary response variable? Can you compare two of them and tell me when one would be more appropriate? What’s the difference between these? (SVM, Logistic Regression, Naive Bayes, Decision Tree, etc.)

Things to look at: N, P, linearly separable?, features independent?, likely to overfit?, speed, performance, memory usage
Logistic Regression:
features roughly linear, problem roughly linearly separable
robust to noise; use L1 or L2 regularization for model selection and to avoid overfitting
the output comes as probabilities
efficient, and the computation can be distributed
can be used as a baseline for other algorithms
(-) can hardly handle categorical features without encoding them first
SVM:
with a nonlinear kernel, can deal with problems that are not linearly separable
(-) slow to train; for most industry-scale applications, not really efficient
Naive Bayes:
computationally efficient when P is large, alleviating the curse of dimensionality
works surprisingly well in some cases even if the independence condition doesn’t hold
with word frequencies as features, the independence assumption can be seen as reasonable, so the algorithm can be used in text categorization
(-) assumes the features are conditionally independent of each other given the class
Tree Ensembles:
good for large N and large P, can deal with categorical features very well
non-parametric, so fairly robust to outliers
GBTs usually work better but the parameters are harder to tune
RF works out of the box, but usually performs worse than GBT
Deep Learning:
works well for some classification tasks (e.g. image)
used to squeeze extra performance out of the problem

Q7. What is regularization and where might it be helpful? What is an example of using regularization in a model?

Regularization is useful for reducing variance in the model, that is, for avoiding overfitting. For example, we can use L1 regularization in Lasso regression to penalize large coefficients.

Q8. Why might it be preferable to include fewer predictors over many?

Adding irrelevant features increases the model’s tendency to overfit because those features introduce more noise. When two variables are correlated, their coefficients can be harder to interpret, for example in regression.
Curse of dimensionality.
Adding random noise makes the model more complicated without making it more useful.
Computational cost.

Q9. Given training data on tweets and their retweets, how would you predict the number of retweets of a given tweet after 7 days after only observing 2 days’ worth of data?

Build a time series model on the training data with a seven-day cycle, and then apply it to a new tweet with only 2 days of data.
Build a regression function to estimate the number of retweets as a function of time t.
To determine whether a single regression function is enough, see if there are clusters in the retweet trends.
If a single function is not enough, add features to the regression function.
Features + number of retweets on the first and the second day -> predict the seventh day.

Q10. How could you collect and analyze data to use social media to predict the weather?

We can collect social media data using the Twitter, Facebook, and Instagram APIs. Then, for Twitter for example, we can construct features from each tweet, e.g. the tweet date, the number of favorites and retweets, and of course features created from the tweet content itself. Then use a multivariate time series model to predict the weather.

Q11. How would you construct a feed to show relevant content for a site that involves user interactions with items?

We can do so by building a recommendation engine. The easiest approach is to show content that is popular with other users, which is still a valid strategy if, for example, the content is news articles. To be more accurate, we can build content-based filtering or collaborative filtering. If there’s enough user usage data, we can try collaborative filtering and recommend content that similar users have consumed. If there isn’t, we can recommend similar items based on a vectorization of the items (content-based filtering).

Q12. How would you design the people you may know feature on LinkedIn or Facebook?

Find strongly related but unconnected people in a weighted connection graph.
Define similarity as how strongly two people are connected.
Given a certain feature, we can calculate the similarity based on:
friend connections (neighbors)
check-ins: people being at the same location all the time
same college, workplace
Randomly drop edges from the graph and test how well the algorithm recovers them.
ref. News Feed Optimization
Affinity score: how close the content creator and the user are
Weight: weight for the edge type (comment, like, tag, etc.), with emphasis on features the company wants to promote
Time decay: the older, the less important

Q13. How would you predict who someone may want to send a Snapchat or Gmail to?

For each user, assign every other contact a score of how likely the user is to send them an email.
The rest is feature engineering:
number of past emails, how many responses, the last time they exchanged an email, whether the last email ends with a question mark, features about the other users, etc.
People to whom the user sent the most emails in the past, weighted by time decay.

Q14. How would you suggest to a franchise where to open a new store?

build a master dataset with local demographic information available for each location.
local income levels, proximity to traffic, weather, population density, proximity to other businesses
a reference dataset on local, regional, and national macroeconomic conditions (e.g. unemployment, inflation, prime interest rate, etc.)
any data on the local franchise owner-operators, to the degree the manager
identify a set of KPIs acceptable to the management that had requested the analysis concerning the most desirable factors surrounding a franchise
quarterly operating profit, ROI, EVA, pay-down rate, etc.
run econometric models to understand the relative significance of each variable
run machine learning algorithms to predict the performance of each location candidate

Q15. In a search engine, given partial data on what the user has typed, how would you predict the user’s eventual search query?

Based on how often words have followed a given sequence of words in past queries, we can construct conditional probabilities for the next sequences of words (n-gram model). The sequences with the highest conditional probabilities show up as top candidates.
To further improve this algorithm:
put more weight on past sequences that showed up more recently and near the user’s location, to account for trends
show the user’s own recent searches that match the partial data
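The n-gram idea can be sketched with a bigram model, the simplest case, where the next word is conditioned only on the last word typed. The function names `train_bigrams` and `suggest` and the tiny query history are assumptions invented for this example.

```python
from collections import Counter, defaultdict

def train_bigrams(queries):
    """Count, for each word, which words followed it in past queries."""
    following = defaultdict(Counter)
    for query in queries:
        words = query.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def suggest(following, last_word, top_k=3):
    """Return up to top_k (next_word, conditional_probability) pairs."""
    counts = following.get(last_word.lower(), Counter())
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(top_k)]

history = [
    "data science jobs",
    "data science course",
    "data science jobs",
    "data engineer salary",
]
model = train_bigrams(history)
suggestions = suggest(model, "science")   # "jobs" ranks above "course"
```

Recency or location weighting would replace the `+= 1` increment with a weight that depends on when and where the past query was issued.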

Q16. Given a database of all previous alumni donations to your university, how would you predict which recent alumni are most likely to donate?

Based on frequency and amount of donations, graduation year, major, etc, construct a supervised regression (or binary classification) algorithm.

Q17. You’re Uber and you want to design a heatmap to recommend to drivers where to wait for a passenger. How would you approach this?

Based on past passenger pickup locations around the same time of day and day of the week (month, year), construct a heatmap of expected pickup demand.
Based on the number of past pickups:
account for periodicity (seasonal, monthly, weekly, daily, hourly)
special events (concerts, festivals, etc.) from tweets

Q18. How would you build a model to predict a March Madness bracket?

Build one feature vector each for team A and team B. Take the difference of the two vectors and use it as the input to a model that predicts the probability that team A wins. Train the model on past tournament data and make predictions for the new tournament by running the trained model for each round.
Some extensions:
Experiment with different ways of consolidating the two team vectors into one (e.g. concatenating, averaging, etc.).
Consider using an RNN-type model that looks at time series data.

Q19. You want to run a regression to predict the probability of a flight delay, but there are flights with delays of up to 12 hours that are really messing up your model. How can you address this?

This is equivalent to making the model more robust to outliers.

Probability

Q20. What is the key assumption for Naive Bayes?

The Naïve Bayes assumption is that all features are conditionally independent of each other given the class, and that each contributes equally to the outcome. Reality rarely supports this, but surprisingly a Naïve Bayes model often works well for classification problems.

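Q21. Bobo the amoeba has a 25%, 25%, and 50% chance of producing 0, 1, or 2 offspring, respectively. Each of Bobo’s descendants also has the same probabilities. What is the probability that Bobo’s lineage dies out?

Let p be the extinction probability. Then p = 1/4 + (1/4)p + (1/2)p^2, whose roots are p = 1/2 and p = 1. The extinction probability is the smaller root, so p = 1/2.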
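The same value can be reached numerically: iterating the offspring generating function from 0 converges to the smallest non-negative fixed point, which is the extinction probability. The function name and the iteration count below are assumptions made for this sketch.

```python
def extinction_probability(offspring_probs, iterations=200):
    """Iterate q -> sum_k p_k * q**k starting from 0; converges to the smallest fixed point."""
    q = 0.0
    for _ in range(iterations):
        q = sum(p * q ** k for k, p in enumerate(offspring_probs))
    return q

# P(0 offspring) = 0.25, P(1) = 0.25, P(2) = 0.5
bobo = extinction_probability([0.25, 0.25, 0.5])
```

Starting from 0 matters: starting the iteration at 1 would stay at the other root, p = 1.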

Q22. In any 15-minute interval, there is a 20% probability that you will see at least one shooting star. What is the probability that you see at least one shooting star in the period of an hour?

1-(0.8)^4 ≈ 0.59, assuming the four 15-minute intervals are independent. Or, we can model the sightings as a Poisson process, which gives the same answer.
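Both routes can be checked in a few lines. In the Poisson version, the rate is backed out from the 20% figure; the variable names are assumptions for this sketch.

```python
import math

p_quarter = 0.20                               # P(at least one star in 15 minutes)

# Route 1: four independent 15-minute intervals, none of which shows a star.
p_hour_independent = 1 - (1 - p_quarter) ** 4

# Route 2: Poisson process. P(no star in 15 min) = exp(-rate), so rate = -ln(0.8).
rate_per_quarter = -math.log(1 - p_quarter)
p_hour_poisson = 1 - math.exp(-4 * rate_per_quarter)
```

The two numbers agree because a Poisson process has independent increments, so the hour is exactly four independent quarter-hours.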

Q23. What is the skewed Distribution & uniform distribution?

In a uniform distribution the data are spread equally across the range. In a right- or left-skewed distribution the data are concentrated on one side of the plot, with a longer tail on the other side (the right side for right-skewed data).

Q24. How can you get a fair coin toss if someone hands you a coin that is weighted to come up heads more often than tails?

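Flip twice: if the result is HT, call it heads; if TH, call it tails; if HH or TT, discard the pair and flip twice again. HT and TH are equally likely, each with probability p(1-p).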
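This trick (often attributed to von Neumann) is easy to simulate. The 70% heads bias, the seed, and the function names are assumptions chosen for the sketch.

```python
import random

def biased_flip(rng, p_heads=0.7):
    """One flip of the weighted coin."""
    return "H" if rng.random() < p_heads else "T"

def fair_flip(rng, p_heads=0.7):
    """Flip in pairs; keep the first flip of the first pair that disagrees (HT -> H, TH -> T)."""
    while True:
        first, second = biased_flip(rng, p_heads), biased_flip(rng, p_heads)
        if first != second:
            return first

rng = random.Random(0)
flips = [fair_flip(rng) for _ in range(20000)]
share_heads = flips.count("H") / len(flips)    # close to 0.5 despite the bias
```

The price is extra flips: with a 70/30 coin, only 42% of pairs are usable, so each fair result costs about 4.8 biased flips on average.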

Q25. You have a 50-50 mixture of two normal distributions with the same standard deviation. How far apart do the means need to be in order for this distribution to be bimodal?

The means need to be more than two standard deviations apart.

Q26. Given draws from a normal distribution with known parameters, how can you simulate draws from a uniform distribution?

Plug each draw into the CDF of that same normal distribution; the resulting values are uniformly distributed on [0, 1] (probability integral transform).
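A minimal sketch of this transform with the standard library follows; the parameters mu = 10 and sigma = 2, the seed, and the sample size are assumptions for illustration.

```python
import random
from statistics import NormalDist

mu, sigma = 10.0, 2.0                      # known parameters of the normal distribution
dist = NormalDist(mu, sigma)

rng = random.Random(1)
normal_draws = [rng.gauss(mu, sigma) for _ in range(10000)]

# Applying the distribution's own CDF to each draw yields Uniform(0, 1) values.
uniform_draws = [dist.cdf(x) for x in normal_draws]
uniform_mean = sum(uniform_draws) / len(uniform_draws)
```

The known parameters matter: using the CDF of a different normal distribution would not give uniform output.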

Q27. A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls?
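The source leaves this answer blank. Under the usual assumptions (each child is independently a boy or a girl with probability 1/2), the answer is 1/3, which a direct enumeration of the four equally likely families confirms. The variable names are assumptions for this sketch.

```python
from fractions import Fraction
from itertools import product

# All equally likely (older, younger) combinations: BB, BG, GB, GG.
families = list(product("BG", repeat=2))

at_least_one_girl = [f for f in families if "G" in f]
two_girls = [f for f in at_least_one_girl if f == ("G", "G")]

p_two_girls = Fraction(len(two_girls), len(at_least_one_girl))
```

Three families (BG, GB, GG) satisfy the condition and only one of them has two girls.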


Q28. You have a group of couples that decide to have children until they have their first girl, after which they stop having children. What is the expected gender ratio of the children that are born? What is the expected number of children each couple will have?

The gender ratio is 1:1 and the expected number of children is 2. Let X be the number of children until the first girl (each birth is a girl with probability 1/2). X follows a geometric distribution with success probability 1/2, so E[X] = 1/(1/2) = 2.

Q29. How many ways can you split 12 people into 3 teams of 4?

Assigning people to three labelled teams is a multinomial count with n=12 and k=3: 12!/(4!4!4!) = 34650. Since the teams are indistinguishable, divide by 3! to get 5775.
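The count can be verified by choosing the teams one after another and then removing the ordering of the three teams. The variable names are assumptions for this sketch.

```python
from math import comb, factorial

# Choose team 1, then team 2, then team 3 from whoever is left (teams labelled).
labelled_teams = comb(12, 4) * comb(8, 4) * comb(4, 4)

# The three teams are interchangeable, so divide out their 3! orderings.
unlabelled_teams = labelled_teams // factorial(3)
```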

Q30. Your hash function assigns each object to a number between 1:10, each with equal probability. With 10 objects, what is the probability of a hash collision? What is the expected number of hash collisions? What is the expected number of hashes that are unused?

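The probability of a hash collision: 1-(10!/10^10) ≈ 0.9996
The expected number of hashes that are unused: 10*(9/10)^10 ≈ 3.49
The expected number of hash collisions: if a collision is an object landing in an already occupied slot, this is 10 minus the expected number of occupied slots, i.e. 10-(10-10*(9/10)^10) = 10*(9/10)^10 ≈ 3.49. If a collision is instead counted per pair of objects sharing a slot, it is 10C2*(1/10) = 4.5.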
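These quantities are quick to compute. Note that "number of collisions" has more than one common definition; the sketch below computes two of them, and the variable names are assumptions for illustration.

```python
from math import comb, factorial

n_slots = n_objects = 10

# No collision means all 10 objects land in distinct slots: 10! of the 10^10 assignments.
p_collision = 1 - factorial(n_objects) / n_slots ** n_objects

# Each slot stays empty with probability (9/10)^10; sum over slots by linearity.
expected_unused = n_slots * (1 - 1 / n_slots) ** n_objects

# Definition A: objects that land in an already occupied slot.
expected_occupied = n_slots - expected_unused
expected_extra_objects = n_objects - expected_occupied

# Definition B: pairs of objects sharing a slot (each pair collides with prob. 1/10).
expected_colliding_pairs = comb(n_objects, 2) / n_slots
```

With as many objects as slots, definition A coincides with the expected number of unused slots.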

Q31. You call 2 UberX’s and 3 Lyfts. If the time that each takes to reach you is IID, what is the probability that all the Lyfts arrive first? What is the probability that all the UberX’s arrive first?

Lyfts arrive first: 2!*3!/5! = 1/10
UberX’s arrive first: same, 1/10

Q32. I write a program that should print out all the numbers from 1 to 300, but prints Fizz instead if the number is divisible by 3, Buzz instead if the number is divisible by 5, and FizzBuzz if the number is divisible by both 3 and 5. What is the total number of numbers that are either Fizzed, Buzzed, or FizzBuzzed?
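The source leaves this answer blank. By inclusion-exclusion, 300/3 + 300/5 - 300/15 = 100 + 60 - 20 = 140 numbers are replaced. A direct count confirms it; the variable names are assumptions for this sketch.

```python
numbers = range(1, 301)

fizz_only = sum(1 for n in numbers if n % 3 == 0 and n % 5 != 0)
buzz_only = sum(1 for n in numbers if n % 5 == 0 and n % 3 != 0)
fizzbuzz = sum(1 for n in numbers if n % 15 == 0)

total_replaced = fizz_only + buzz_only + fizzbuzz
```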


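Q33. On a dating site, users can select 5 out of 24 adjectives to describe themselves. A match is declared between two users if they match on at least 4 adjectives. If Alice and Bob randomly pick adjectives, what is the probability that they form a match?

Fix Alice’s 5 adjectives. Bob has 24C5 possible picks, of which 1 matches all five and 5*(24-5) = 95 match exactly four, so the probability is (1+5*(24-5))/24C5 = 96/42504 = 4/1771.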
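The fraction can be checked exactly with integer arithmetic. The variable names are assumptions for this sketch.

```python
from fractions import Fraction
from math import comb

# Bob's picks that share all 5, or exactly 4, of Alice's adjectives.
match_five = comb(5, 5) * comb(19, 0)
match_four = comb(5, 4) * comb(19, 1)

p_match = Fraction(match_five + match_four, comb(24, 5))
```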

Q34. A lazy high school senior types up application and envelopes to n different colleges, but puts the applications randomly into the envelopes. What is the expected number of applications that went to the right college?
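The source leaves this answer blank. By linearity of expectation, each application has probability 1/n of landing in its own envelope, so the expected number of correct ones is n * (1/n) = 1 for any n. The brute-force check below enumerates every permutation for small n; the function name is an assumption for this sketch.

```python
from fractions import Fraction
from itertools import permutations

def expected_correct(n):
    """Exact expected number of fixed points of a uniformly random permutation of n items."""
    total = count = 0
    for perm in permutations(range(n)):
        total += sum(1 for i, envelope in enumerate(perm) if i == envelope)
        count += 1
    return Fraction(total, count)

results = {n: expected_correct(n) for n in range(1, 7)}
```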


Q35. Let’s say you have a very tall father. On average, what would you expect the height of his son to be? Taller, equal, or shorter? What if you had a very short father?

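Shorter than his father, though still taller than average: regression to the mean. By the same argument, the son of a very short father is expected to be taller than his father.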

Q36. What’s the expected number of coin flips until you get two heads in a row?

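6. Let E be the expected number of flips. Conditioning on how the sequence starts (T, HT, or HH): E = (1/2)(1+E) + (1/4)(2+E) + (1/4)(2), which gives E = 6. By symmetry this equals the expected number of coin flips until you get two tails in a row.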
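For a fair coin the expected number of flips until two heads in a row works out to 6, and a quick simulation is a useful sanity check. The seed, the trial count, and the function name are assumptions for this sketch.

```python
import random

def flips_until_two_heads(rng):
    """Flip a fair coin until two consecutive heads appear; return the number of flips."""
    streak = flips = 0
    while streak < 2:
        flips += 1
        streak = streak + 1 if rng.random() < 0.5 else 0
    return flips

rng = random.Random(0)
trials = 100000
average = sum(flips_until_two_heads(rng) for _ in range(trials)) / trials
```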

Q37. Let’s say we play a game where I keep flipping a coin until I get heads. If the first time I get heads is on the nth coin, then I pay you 2n-1 dollars. How much would you pay me to play this game?

Less than $3. The number of flips n is geometric with mean 2, so the expected payout is 2*2-1 = $3.

Q38. You have two coins, one of which is fair and comes up heads with a probability 1/2, and the other which is biased and comes up heads with probability 3/4. You randomly pick coin and flip it twice, and get heads both times. What is the probability that you picked the fair coin?

4/13. By Bayes’ rule: (1/2 * 1/4) / (1/2 * 1/4 + 1/2 * 9/16) = 4/13.

Data Analysis

Q39. Let’s say you’re building the recommended music engine at Spotify to recommend people music based on past listening history. How would you approach this problem?

collaborative filtering

Q40. What is R^2? What are some other metrics that could be better than R^2 and why?

A goodness-of-fit measure: variance explained by the regression / total variance.
The more predictors you add, the higher R^2 becomes.
Hence use adjusted R^2, which adjusts for the degrees of freedom,
or error metrics computed on held-out data.
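To make the adjustment concrete, here is a minimal sketch that computes R^2 and adjusted R^2 from predictions. The toy data and the assumed count of 2 predictors are made up for the example.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R^2 for the number of predictors used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.1, 7.3, 8.9, 10.9]       # hypothetical model output
r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_obs=len(y_true), n_predictors=2)
```

Adjusted R^2 is never larger than R^2 and can fall when a new predictor adds too little explanatory power.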

Q41. What is the curse of dimensionality?

High dimensionality makes clustering hard, because having lots of dimensions means that all points are “far away” from each other.
For example, to cover a fixed fraction of the volume of the data, we need to capture an ever wider range of each variable as the number of variables increases.
All samples are close to an edge of the sample space. This is bad news because prediction is much more difficult near the edges of the training sample.
The sampling density decreases exponentially as p increases, so the data become much sparser unless we collect significantly more data.
Dimensionality reduction, for example PCA, can help.

Q42. Is more data always better?

It depends on the quality of your data, for example, if your data is biased, just getting more data won’t help.
It depends on your model. If your model suffers from high bias, getting more data won’t improve your test results beyond a point. You’d need to add more features, etc.
Also there’s a tradeoff between having more data and the additional storage, computational power, memory it requires. Hence, always think about the cost of having more data.

Q43. What are advantages of plotting your data before performing analysis?

Data sets have errors. You won’t find them all but you might find some. That 212 year old man. That 9 foot tall woman.
Variables can have skewness, outliers etc. Then the arithmetic mean might not be useful. Which means the standard deviation isn’t useful.
Variables can be multimodal! If a variable is multimodal then anything based on its mean or median is going to be suspect.

Q44. How can you make sure that you don’t analyze something that ends up meaningless?

Proper exploratory data analysis.
In every data analysis task, there’s the exploratory phase where you’re just graphing things, testing things on small sets of the data, summarizing simple statistics, and getting rough ideas of what hypotheses you might want to pursue further.

Then there’s the exploitatory phase, where you look deeply into a set of hypotheses.

The exploratory phase will generate lots of possible hypotheses, and the exploitatory phase will let you really understand a few of them. Balance the two and you’ll prevent yourself from wasting time on many things that end up meaningless, although not all.

Q45. What is the role of trial and error in data analysis? What is the role of making a hypothesis before diving in?

Data analysis is a repetition of setting up a new hypothesis and trying to refute the null hypothesis.

The scientific method is eminently inductive: we elaborate a hypothesis, test it and refute it or not. As a result, we come up with new hypotheses which are in turn tested and so on. This is an iterative process, as science always is.

Q46. How can you determine which features are the most important in your model?

Run the features through a Gradient Boosting Machine or Random Forest to generate plots of relative importance and information gain for each feature in the ensembles.
Look at the order in which variables are added in forward variable selection.

Q47. How do you deal with some of your predictors being missing?

Remove rows with missing values – This works well if 1) the values are missing at random and 2) you don’t lose too much of the dataset after doing so.
Build another predictive model to predict the missing values – This could be a whole project in itself, so simple techniques are usually used here.
Use a model that can incorporate missing data – For example, tree-based methods whose implementation handles missing values natively.

Q48. You have several variables that are positively correlated with your response, and you think combining all of the variables could give you a good prediction of your response. However, you see that in the multiple linear regression, one of the weights on the predictors is negative. What could be the issue?

The likely issue is multicollinearity: a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. Individual coefficients then become unstable and can even flip sign.
Options:
Leave the model as is, despite multicollinearity. Multicollinearity doesn’t affect how well the fitted model extrapolates to new data, provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based.
Use principal component regression.

Q49. Let’s say you’re given an unfeasible amount of predictors in a predictive modeling task. What are some ways to make the prediction more feasible?


Q50. Now you have a feasible amount of predictors, but you’re fairly sure that you don’t need all of them. How would you perform feature selection on the dataset?

ridge / lasso / elastic net regression
Univariate Feature Selection where a statistical test is applied to each feature individually. You retain only the best features according to the test outcome scores
“Recursive Feature Elimination”:
First, train a model with all the features and evaluate its performance on held-out data.
Then drop, say, the 10% weakest features (e.g. the features with the smallest absolute coefficients in a linear model) and retrain on the remaining features.
Iterate until you observe a sharp drop in the predictive accuracy of the model.

Q51. Your linear regression didn’t run and communicates that there are an infinite number of best estimates for the regression coefficients. What could be wrong?

p > n: there are more predictors than observations.
If some of the explanatory variables are perfectly correlated (positively or negatively), then the coefficients are not unique.

Q52. You run your regression on different subsets of your data, and find that in each subset, the beta value for a certain variable varies wildly. What could be the issue here?

The dataset might be heterogeneous. In which case, it is recommended to cluster datasets into different subsets wisely, and then draw different models for different subsets. Or, use models like non parametric models (trees) which can deal with heterogeneity quite nicely.

What is the main idea behind ensemble learning? If I had many different models that predicted the same response variable, what might I want to do to incorporate all of the models? Would you expect this to perform better than an individual model or worse?

The assumption is that a group of weak learners can be combined to form a strong learner.
Hence the combined model is expected to perform better than an individual model.
average out biases
reduce variance
Bagging works because some underlying learning algorithms are unstable: slightly different inputs lead to very different outputs. If you can take advantage of this instability by running multiple instances, it can be shown that the reduced instability leads to lower error. If you want to understand why, the original bagging paper ( http://www.springerlink.com/cont…) has a section called “why bagging works”.
Boosting works because of the focus on better defining the “decision edge”. By reweighting examples near the margin (the positive and negative examples) you get a reduced error (see http://citeseerx.ist.psu.edu/vie…)
Use the outputs of your models as inputs to a meta-model.
For example, if you’re doing binary classification, you can use all the probability outputs of your individual models as inputs to a final logistic regression (or any model, really) that can combine the probability estimates.

One very important point is to make sure that the output of your models are out-of-sample predictions. This means that the predicted value for any row in your dataframe should NOT depend on the actual value for that row.

Q53. Given that you have wifi data in your office, how would you determine which rooms and areas are underutilized and overutilized?

If more data is used in one room than in others, that room is likely overutilized. Account for the room capacity and normalize the data accordingly.

Q54. How would you quantify the influence of a Twitter user?

Use something like PageRank, with each user corresponding to a web page and following a user equivalent to linking to that page.

Q55. You have 100 mathletes and 100 math problems. Each mathlete gets to choose 10 problems to solve. Given data on who got what problem correct, how would you rank the problems in terms of difficulty?

One way you could do this is by storing a “skill level” for each user and a “difficulty level” for each problem. We assume that the probability that a user solves a problem depends only on the skill of the user and the difficulty of the problem. Then we maximize the likelihood of the data to find the hidden skill and difficulty levels.
The Rasch model for dichotomous data takes the form:
\Pr\{X_{ni}=1\} = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)},
where \beta_n is the ability of person n and \delta_i is the difficulty of item i.
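The Rasch probability is a logistic function of ability minus difficulty. The sketch below only evaluates that probability; fitting the abilities and difficulties by maximum likelihood is not shown. The function name and the example values are assumptions.

```python
import math

def rasch_probability(ability, difficulty):
    """P(person solves item) = exp(b - d) / (1 + exp(b - d)), written in logistic form."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

even_match = rasch_probability(ability=0.0, difficulty=0.0)    # equal skill and difficulty
easy_item = rasch_probability(ability=1.0, difficulty=-1.0)
hard_item = rasch_probability(ability=1.0, difficulty=3.0)
```

Ranking the problems then amounts to sorting them by their fitted difficulty parameter.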

Q56. You have 5000 people that rank 10 sushis in terms of saltiness. How would you aggregate this data to estimate the true saltiness rank of each sushi?

Some people would take the mean rank of each sushi. If I wanted something simple, I would use the median, since ranks are (strictly speaking) ordinal and not interval, so adding them is a bit risque (but people do it all the time and you probably won’t be far wrong).

Q57. Given data on congressional bills and which congressional representatives co-sponsored the bills, how would you determine which other representatives are most similar to yours in voting behavior? How would you evaluate who is the most liberal? Most republican? Most bipartisan?

Collaborative filtering: given your votes, calculate the similarity to each representative and select the most similar one.
For the liberal and republican groups, find the mean vector of each and find the representative closest to that center point.

Q58. How would you come up with an algorithm to detect plagiarism in online content?

Reduce the text to a more compact form (e.g. fingerprinting, bag of words), then compare it with other texts by calculating the similarity.

Q59. You have data on all purchases of customers at a grocery store. Describe to me how you would program an algorithm that would cluster the customers into groups. How would you determine the appropriate number of clusters to include?

Choose a small value of k that still has a low SSE (elbow method).

Statistical Inference

Q60. In an A/B test, how can you check if assignment to the various buckets was truly random?

Plot the distributions of multiple features for both A and B and make sure that they have the same shape. More rigorously, we can conduct a permutation test to see if the distributions are the same.
MANOVA to compare different means

Q61. What might be the benefits of running an A/A test, where you have two buckets that are exposed to the exact same product?

Verify the sampling algorithm is random.

Q62. What would be the hazards of letting users sneak a peek at the other bucket in an A/B test?

The user might not act the same as they would have had they not seen the other bucket. You are essentially adding an additional variable, whether the user peeked at the other bucket, which is not random across groups.

Q63. What would be some issues if blogs decide to cover one of your experimental groups?

Same as the previous question. The above problem can happen on a larger scale.

Q64. How would you conduct an A/B test on an opt-in feature?


Q65. How would you run an A/B test for many variants, say 20 or more?

Use one control and 20 treatment groups, provided the sample size for each group is big enough.
With this many comparisons, false positives become likely (the multiple comparisons problem). Ways to attempt to correct for this include changing your confidence level (e.g. Bonferroni correction) or doing family-wide tests before you dive into the individual metrics (e.g. Fisher’s Protected LSD).
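The Bonferroni correction simply divides the significance level by the number of comparisons. A minimal sketch follows; the function name and the 20 p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant only if p <= alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values for 20 treatment-vs-control comparisons.
p_values = [0.001, 0.004, 0.03] + [0.2] * 17
flags = bonferroni_significant(p_values)      # per-test threshold is 0.05 / 20 = 0.0025
uncorrected = [p <= 0.05 for p in p_values]
```

Here three variants look significant at the naive 5% level, but only one survives the correction. Bonferroni is conservative, so it trades false positives for reduced power.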

Q66. How would you run an A/B test if the observations are extremely right-skewed?

lower the variability by modifying the KPI
cap values
percentile metrics
log transform

Q67. I have two different experiments that both change the sign-up button to my website. I want to test them at the same time. What kinds of things should I keep in mind?

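If the two experiments are mutually exclusive (each user sees at most one of them), running them at the same time is fine.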

Q68. What is a p-value? What is the difference between type-1 and type-2 error?

type-1 error: rejecting H0 when H0 is true
type-2 error: not rejecting H0 when Ha is true
Q49. You are AirBnB and you want to test the hypothesis that a greater number of photographs increases the chances that a buyer selects the listing. How would you test this hypothesis?

For randomly selected listings with more than one picture, hide one random picture for group A and show all pictures for group B. Compare the booking rates of the two groups.
Ask someone for more details.

Q69. How would you design an experiment to determine the impact of latency on user engagement?

The best way I know to quantify the impact of performance is to isolate just that factor using a slowdown experiment, i.e., add a delay in an A/B test.

Q70. What is maximum likelihood estimation? Could there be any case where it doesn’t exist?

A method for parameter optimization (fitting a model). We choose parameters so as to maximize the likelihood function (how likely the observed data are under the model with those parameters).
Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters. MLE can be seen as a special case of the maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters, or as a variant of the MAP that ignores the prior and which therefore is unregularized.
The MLE may fail to exist, for example for Gaussian mixtures (where the likelihood is unbounded) and for some non-parametric models.

Q71. What’s the difference between a MAP, MOM, MLE estimator? In which cases would you want to use each?

MAP chooses the parameter value that maximizes the posterior distribution, which combines the prior distribution with the likelihood of the data. MLE is a special case of MAP where the prior is an uninformative uniform distribution.
MOM sets the sample moments equal to the theoretical moments and solves for the parameters. MOM is not used much anymore because maximum likelihood estimators have a higher probability of being close to the quantities to be estimated and are more often unbiased.

Q72. What is a confidence interval and how do you interpret it?

For example, a 95% confidence interval is an interval constructed in such a way that, if the same sampling procedure were repeated many times, the constructed intervals would include the true mean 95% of the time.
If confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level.
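
The repeated-sampling interpretation above can be checked with a small simulation. This is a sketch under assumptions: the population is normal with a made-up mean of 50 and standard deviation of 10, and the interval uses the normal approximation with z = 1.96.

```python
import random
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Normal-approximation 95% confidence interval for the mean."""
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - z * standard_error, mean + z * standard_error

rng = random.Random(0)
true_mean, trials, covered = 50.0, 1000, 0
for _ in range(trials):
    # Draw a fresh sample and check whether its interval contains the truth.
    sample = [rng.gauss(true_mean, 10) for _ in range(100)]
    low, high = mean_confidence_interval(sample)
    covered += low <= true_mean <= high

coverage = covered / trials  # expected to be close to 0.95
```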

Q73. What is unbiasedness as a property of an estimator? Is this always a desirable property when performing inference? What about in data analysis or predictive modeling?

Unbiasedness means that the expectation of the estimator is equal to the population value we are estimating. This is desirable in inference because the goal is to explain the dataset as accurately as possible. However, this is not always desirable for data analysis or predictive modeling as there is the bias variance tradeoff. We sometimes want to prioritize the generalizability and avoid overfitting by reducing variance and thus increasing bias.
OTHER Important Data Science Interview Questions and Answers

Q74. What is the difference between population and sample in data?

Sample is the set of people who participated in your study, whereas the population is the set of people to whom you want to generalize the results. For example – if you want to study obesity among children in India and you study 1000 children, then those 1000 become the sample, whereas all the children in the country are the population.

A sample is a subset of the population.

Q75. What is the difference between sample and sample frame?

Sample frame is the set of people you wanted to study, whereas sample is the set of people who actually participated in your study. Ex – If you sent a marketing survey link to 300 people through email and only 100 participated in the survey, then the 300 are the sample frame and the 100 are the sample.

Sample is a subset of the sample frame. Both sample and sample frame are subsets of the population.

Q76. What is the difference between univariate, bivariate and multivariate analysis?

Univariate analysis is performed on one variable, bivariate on two variables and multivariate analysis on more than two variables.

Q77. What is the difference between interpolation and extrapolation?

Extrapolation is the estimation of values outside the observed range, such as future values based on the trend observed in the past. Interpolation is the estimation of missing values that lie between two known values in a sequence.

Q78. What is precision and recall?

Precision is the percentage of predicted positives that are actually positive, and recall is the percentage of actual positives that were correctly predicted as positive.

Q79. What is confusion matrix?
  • Confusion matrix is a table which contains information about predicted values and actual values in a classification model
  • It has four parts, namely true positive, true negative, false positive and false negative
  • It can be used to calculate accuracy, precision and recall
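
A small sketch of how the four cells of a confusion matrix lead to accuracy, precision and recall. The labels below are invented for the example (1 = positive, 0 = negative), and confusion_counts is a hypothetical helper.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels encoded as 1 and 0."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical ground truth and model predictions.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)   # of everything predicted positive, how much was right
recall = tp / (tp + fn)      # of everything actually positive, how much was found
```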
Q80. What is hypothesis testing?

While performing an experiment, hypothesis testing is used to analyze the various factors that are assumed to have an impact on the outcome of the experiment.

A hypothesis is some kind of assumption, and hypothesis testing is used to determine whether the data support the stated hypothesis or not.

The initial assumption is called the null hypothesis and the opposite is called the alternate hypothesis.

Q81. What is a p-value in statistics?

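In hypothesis testing, the p-value helps us arrive at a conclusion. It is the probability of observing data at least as extreme as what was seen, assuming the null hypothesis is true. When the p-value is small (below the chosen significance level), the null hypothesis is rejected in favour of the alternate. When the p-value is large, we fail to reject the null hypothesis.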

Q82. What is difference between Type-I error and Type-II error in hypothesis testing?

Type-I error is when we reject a null hypothesis that is actually true. It represents a false positive.
Type-II error is when we fail to reject a null hypothesis that is actually false. It represents a false negative.

Q83. What are the different types of missing value treatment?
  • Deletion of values
  • Guess the value
  • Average Substitution
  • Regression based substitution
  • Multiple Imputation
Q84. What is gradient descent?

When building a statistical model, the objective is to reduce the value of the cost function associated with the model. Gradient descent is an iterative optimization technique used to find a minimum of the cost function.
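
A minimal sketch of gradient descent on a one-dimensional cost function. The cost (x - 3)^2, the learning rate and the step count are assumptions chosen only to keep the example small.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# Hypothetical cost function cost(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

Each step moves x a fraction of the way towards 3, where the gradient is zero and the cost is lowest.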

Q85. What is difference between supervised and unsupervised learning algorithms?

Supervised learning is the class of algorithms in which the model is trained on data where the outcome is explicitly labelled. Ex. Regression, Classification
In unsupervised learning no output is given, and the algorithm is made to learn the structure implicitly. Ex. Association, Clustering

Q86. What is the need for regularization in model building?

Regularization is used to penalize model complexity (for example, large coefficients). It predominantly helps in solving the overfitting problem.

Q87. What is the bias and variance tradeoff?

High bias is an underlying error from wrong assumptions that makes the model underfit. High variance in a model means noise in the data has been taken too seriously by the model, which will result in overfitting.

Typically we would like to have a model with low bias and low variance

Q88. How to solve overfitting?
  • Introduce Regularization
  • Perform Cross Validation
  • Reduce the number of features
  • Increase the number of entries
  • Ensembling
Q89. How will you detect the presence of overfitting?

When you build a model that has very high accuracy on the training data set and very low prediction accuracy on the test data set, it is an indicator of overfitting.

Q90. How do you determine the number of clusters in k-means clustering?

Elbow method (plotting the percentage of variance explained against the number of clusters)
Gap Statistic
Silhouette method
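
The elbow method can be sketched with a basic k-means loop that records the SSE for each k. This is an illustration only, using the standard library: the two-dimensional points are synthetic, the helper name kmeans_sse is hypothetical, and a real project would more likely use an existing clustering library.

```python
import random

def kmeans_sse(points, k, iterations=20, seed=0):
    """Run basic k-means (Lloyd's algorithm) and return the final SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return sum(
        min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids)
        for p in points
    )

# Synthetic data: three well-separated groups of 30 points each.
rng = random.Random(1)
centers = [(0, 0), (10, 10), (0, 10)]
points = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1)) for cx, cy in centers for _ in range(30)]

sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 7)}
```

Plotting sse_by_k against k and looking for the point where the curve stops dropping sharply gives the "elbow"; for data like this it would be expected around k = 3.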

Q91. What is the difference between causality and correlation?

Correlation is a measure that helps us understand the relationship between two or more variables.
Causation represents a causal relationship between two events. It is also known as cause and effect.
Causation means there is correlation, but correlation doesn’t necessarily mean causation.

Q92. Explain normal distribution?

Normal distribution is a bell-shaped curve that represents the distribution of data around its mean. Many natural processes approximately follow the normal distribution.
Most data points tend to be concentrated around the mean. The further a point is from the mean, the less likely it is to appear.

Q93. What are the different ways of performing aggregation in python using pandas?

Group by function
Pivot function
Aggregate function

Q94. How do you merge two lists and get only the unique values?

a = [1, 2, 3, 4]
b = [1, 2, 5, 6]
A = list(set(a + b))

Q95. How to save and retrieve model objects in python?

By using a library called pickle you can train any model and store the object in a pickle file.
When needed in future you can retrieve the object and use the model for prediction.
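
A short sketch of the save-and-restore round trip with pickle. The "model" here is just a hypothetical dictionary of fitted coefficients standing in for a real trained model object, and the file name model.pkl is an arbitrary choice.

```python
import os
import pickle
import tempfile

# Hypothetical fitted model, represented as a plain dict of coefficients.
model = {"slope": 2.0, "intercept": 1.0}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save the object to a pickle file.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later: load it back and use it for prediction.
with open(path, "rb") as f:
    restored = pickle.load(f)

prediction = restored["slope"] * 4 + restored["intercept"]
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.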

Q96. What is an anomaly and how is it different from outliers?

Anomaly detection is the identification of items or events that do not fit the expected pattern or the other items in a dataset. Outliers are valid data points that lie outside the norm, whereas anomalies are invalid data points created by a process different from the one that created the other data points.

Q97. What is an ensemble learning?

Ensemble learning is the art of combining more than one model to predict the final outcome of an experiment. Commonly used ensemble techniques are bagging, boosting and stacking.

Q98. Name a few libraries that are used in Python for data analysis?

Scikit learn
Matplotlib / seaborn

Q99. What are the different types of data?

Data is broadly classified into two types: 1) Numerical 2) Categorical
Numerical variables are further classified into discrete and continuous data
Categorical variables are further classified into Binary, Nominal and Ordinal data

Q100. What is a lambda function in python?

Lambda functions are used to create small, one-time anonymous functions in Python. They enable the programmer to create functions without a name, in a single expression.
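
A few typical uses of lambda, shown as a small sketch; the variable names and sample values are arbitrary.

```python
# A lambda bound to a name behaves like a tiny one-expression function.
square = lambda x: x * x

# More commonly a lambda is passed inline as an argument.
words = ["banana", "kiwi", "apple"]
by_length = sorted(words, key=lambda w: len(w))
evens = list(filter(lambda n: n % 2 == 0, range(10)))
```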

Q101. What are the different sampling methods?
  • Random Sampling
  • Systematic Sampling
  • Stratified Sampling
  • Quota Sampling
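
The four sampling methods listed above can be sketched with the standard library. The population of 100 unit IDs and the two strata are hypothetical, and the quota example simply takes the first units of each group to show that no randomization is involved.

```python
import random

rng = random.Random(0)
population = list(range(1, 101))   # hypothetical population of 100 unit IDs
sample_size = 10

# Random sampling: every unit has the same chance of selection.
random_sample = rng.sample(population, sample_size)

# Systematic sampling: pick a random start, then take every k-th unit.
step = len(population) // sample_size
start = rng.randrange(step)
systematic_sample = population[start::step]

# Stratified sampling: sample randomly within each stratum, in proportion to its size.
strata = {"low": population[:50], "high": population[50:]}
stratified_sample = [
    unit
    for members in strata.values()
    for unit in rng.sample(members, sample_size * len(members) // len(population))
]

# Quota sampling: fill a fixed quota per group without randomization.
quota_sample = [unit for members in strata.values() for unit in members[:5]]
```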
Q102. Common Data Quality Issues
  • Missing Values
  • Noise in the Data Set
  • Outliers
  • Mixture of Different Languages (like English and Chinese)
  • Range Constraints
Q103. What is the difference between supervised learning and un-supervised learning?

Supervised learning: the target variable is available; the algorithm learns from the training data and applies what it learned to test data (unseen data).

Unsupervised learning: the target variable is not available, and the algorithm does not need to learn anything beforehand.

Q104. What is Imbalanced Data Set and how to handle them? Name Few Examples?
  • Fraud detection
  • Disease screening

Imbalanced Data Set means that the population of one class is far larger than that of the other (e.g. Non-Fraud – 99% and Fraud – 1%).

An imbalanced dataset can be handled by oversampling, undersampling, or a penalized (cost-sensitive) machine learning algorithm.
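
Random oversampling, one of the options above, can be sketched as follows. The row IDs, class labels and the helper random_oversample are assumptions for illustration; dedicated libraries offer more refined variants.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical data: 99 non-fraud rows and a single fraud row.
rows = list(range(100))
labels = ["non-fraud"] * 99 + ["fraud"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the test data.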

Q105. If you are dealing with 10M Data, then will you go for Machine learning (or) Deep learning Algorithm?
  • Classical machine learning algorithms suit small and medium-sized data well, and some of them can take a huge amount of time to train on large data.
  • Deep learning algorithms generally need a large amount of data, and their training time is kept manageable with the help of GPUs (parallel processing).
Q106. Examples of Supervised learning algorithm?
  • Linear Regression and Logistic Regression
  • Decision Trees and Random Forest
  • SVM
  • Naïve Bayes
  • XGBoost
Q107. In Logistic Regression, if you want to know the best features in your dataset then what you would do?

Apply a stepwise selection function (such as step() in R), which calculates the AIC for different combinations of features and returns the best set of features for the dataset.

Q108. What is Feature Engineering? Explain with Example?

Feature engineering is the process of using domain knowledge of the data to create features for machine learning algorithm to work

  • Adding new columns derived from existing columns (or) removing columns
  • Outlier Detection
  • Normalization etc
Q109. How to select the important features in the given data set?
  • In Logistic Regression, we can use step() which gives AIC score of set of features
  • In Decision Tree, We can use information gain(which internally uses entropy)
  • In Random Forest, We can use varImpPlot
Q110. When does multicollinearity problem occur and how to handle it?

It exists when 2 or more predictors are highly correlated with each other.

Example: if the data set contains both the grades and the marks of 2nd PUC, both columns capture the same trend, which can make the coefficient estimates unstable and hamper training speed and time. So we need to check whether multicollinearity exists by using the VIF (Variance Inflation Factor).

Note: a common rule of thumb is that a Variance Inflation Factor of more than 4 indicates a multicollinearity problem (some practitioners use 5 or 10 as the cut-off).

Q111. What is Variance inflation Factors (VIF)?

It measures how much the variance of an estimated regression coefficient is inflated compared to when the predictor variables are not linearly related.
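
For predictor j, VIF is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. With only two predictors this R² is just their squared correlation, which keeps the sketch below short. The marks, grades and hours_slept values are invented for the example.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical predictors: marks and grades carry almost the same information.
marks = [35, 48, 52, 61, 67, 74, 80, 88, 93]
grades = [1, 2, 2, 3, 3, 4, 4, 5, 5]
hours_slept = [7, 6, 8, 5, 7, 6, 8, 7, 6]

vif_marks_grades = vif_two_predictors(marks, grades)       # far above 4
vif_marks_sleep = vif_two_predictors(marks, hours_slept)   # close to 1
```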

Q112. Examples of Parametric machine learning algorithm and non-parametric machine learning algorithm
  • Parametric machine learning algorithm– Linear Regression, Logistic Regression
  • Non-Parametric machine learning algorithm – Decision Trees, SVM, Neural Network
Q113. What are parametric and non-parametric machine learning algorithm? And their importance?

Algorithms that do not make strong assumptions about the form of the mapping function are non-parametric; they are free to learn any functional form from the training data. Algorithms that make strong assumptions are parametric, and they involve two steps:

  1. select the form for the function and
  2. learn the coefficients for the function from training data.
Q114. When do linear and logistic regression perform better, generally?

They work better when we remove attributes that are unrelated to the output variable and attributes that are highly correlated with each other.

Q115. Why is naïve Bayes called “naïve”?

Reason: It assumes that the input variables are independent of each other, but in the real world this is often unrealistic, since features are frequently dependent on each other.

Q116. Give some example for false positive, false negative, true positive, true negative
  • False Positive – A cancer screening test comes back positive, but you don’t have cancer
  • False Negative – A cancer screening test comes back negative, but you have cancer
  • True Positive – A Cancer Screening test comes back positive, and you have cancer
  • True Negative – A Cancer Screening test comes back negative, and you don’t have cancer
Q117. What is Sensitivity and Specificity?

Sensitivity means “proportion of actual positives that are correctly classified”, in other words the “True Positive Rate”.

Specificity means “proportion of actual negatives that are correctly classified”, in other words the “True Negative Rate”.

Q118. When to use Logistic Regression and when to use Linear Regression?

If you are dealing with a classification problem like (Yes/No, Fraud/Non Fraud, Sports/Music/Dance) then use Logistic Regression.

If you are predicting a continuous numeric value, then go for Linear Regression.

Q119. What are the different imputation algorithms available?

An imputation algorithm means “replacing the blank values with some estimated values”.

  • Mean imputation
  • Median Imputation
  • MICE
  • miss forest
  • Amelia
Q120. What is AIC(Akaike Information Criteria)?

AIC is the logistic regression analogue of adjusted R².

AIC is a measure of fit that penalizes the model for the number of model coefficients. Therefore, we prefer the model with the minimum AIC value.

Q121. Suppose you have 10 samples, where 8 are positive and 2 are negative, how to calculate Entropy (important to know)

E(S) = –(8/10)·log2(8/10) – (2/10)·log2(2/10) ≈ 0.722

Note: the log is to base 2.
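
The same calculation as a short sketch; the entropy helper is a hypothetical name, and it takes the class counts (8 positive, 2 negative) from the question.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e = entropy([8, 2])        # roughly 0.722 bits
balanced = entropy([5, 5])  # a 50/50 split has the maximum entropy of 1 bit
```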

Q122. What is a perceptron in Machine Learning?

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers: it decides whether an input belongs to one of two classes by thresholding a weighted sum of the input features.
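
A minimal sketch of the perceptron learning rule, trained on the logical AND function. The data set, learning rate and epoch count are assumptions chosen to keep the example tiny.

```python
def train_perceptron(samples, labels, learning_rate=1.0, epochs=20):
    """Classic perceptron rule: nudge the weights after every misclassification."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            prediction = 1 if activation > 0 else 0
            error = target - prediction   # 0 when the prediction is correct
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Logical AND is linearly separable, so the perceptron can learn it.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]

weights, bias = train_perceptron(inputs, and_labels)
predictions = [predict(weights, bias, x) for x in inputs]
```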

Q123. How to ensure we are not over fitting the model?
  • Keep the attributes/Columns which are really important
  • Use K-Fold cross validation techniques
  • Make use of dropout in case of a neural network
Q124. How is the root node chosen in the Decision Tree Algorithm?

The mathematical formula “Entropy” (through information gain) is used for choosing the root node of the tree.

Q125. What are the different Backend Process available in Keras?
  • TensorFlow
  • Theano
  • CNTK
Q126. Name a few Deep Learning frameworks
  • TensorFlow
  • Theano
  • Lasagne
  • mxnet
  • blocks
  • Keras
  • CNTK
  • TFLearn
Q127. How to split the data with equal set of classes in both training and testing data?

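Use stratified splitting, for example StratifiedShuffleSplit from scikit-learn, so that the class proportions are the same in the training and testing data.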
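
To show the idea without any dependency, here is a standard-library sketch of a stratified split. It is not the scikit-learn implementation; the function stratified_split and the 80/20 label mix are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split row indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical labels: 80% "yes" and 20% "no".
labels = ["yes"] * 80 + ["no"] * 20
train_idx, test_idx = stratified_split(labels)

train_no_share = sum(labels[i] == "no" for i in train_idx) / len(train_idx)
test_no_share = sum(labels[i] == "no" for i in test_idx) / len(test_idx)
```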

Q128. What do you mean by giving “epoch = 1” in neural network?

It means traversing the data set one time.

Q129. What do you mean by Ensemble Model? When to use?

Ensemble Model is a combination of Different Models to predict correctly and with good accuracy.

Ensemble learning is used when you build component classifiers that are more accurate and independent from each other.

Q130. When will you use SVM and when to use Random Forest?
  • SVM can be used if the data is outlier free, whereas Random Forest can be used even if it has outliers (tree-based splits are less sensitive to extreme values).
  • SVM suits text classification models well, and Random Forest suits binomial/multinomial classification problems.
  • Random Forest reduces the overfitting problem by averaging many trees built on bootstrap samples (bagging).
Q131. Applications of Machine Learning?
  • Self Driving Cars
  • Image Classification
  • Text Classification
  • Search Engine
  • Banking, Healthcare Domain
Q132. If you are given a use case – “Predict whether the transaction is fraud (or) not fraud”, which algorithm would you choose?

Logistic Regression

Q133. If you are given a use case – “Predict the house price range in the coming years”, which algorithm would you choose?

Linear Regression

Q134. What is the underlying mathematical knowledge behind Naïve Bayes?

Bayes Theorem

Q135. When to use Random Forest and when to Use XGBoost?

If you want all the cores of your system to be utilized, then go for XGBoost (since it supports parallel processing), and if your data is small then go for Random Forest.

Q136. If your model gives 90% accuracy on training data and 60% accuracy on test data, what problem are you facing?

Overfitting. It can be reduced by many methods, such as tree pruning or removing noisy, overly detailed information from the data set.

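Q137. In Google, if you type “How are”, it gives you recommendations such as “How are you” / “How do you do”. What is this based on?

This kind of suggestion is usually produced by a language model over past queries (for example, n-gram next-word prediction); collaborative filtering, which recommends items based on the behaviour of similar users, is more typical of product recommendation engines.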

Q138. What is margin, kernels, Regularization in SVM?
  • Margin – Distance between the hyper plane and closest data points is referred as “margin”
  • Kernels – functions that determine how the data is transformed before separation; three commonly used kernels are i) Linear, ii) Radial, iii) Polynomial
  • Regularization – The Regularization parameter (often termed as C parameter in python’s sklearn library) tells the SVM optimization how much you want to avoid misclassifying each training example
Q139. What is Boosting? Explain how Boosting works?

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.

  • After the first tree is created, its performance on each training instance is used to weight how much attention the next tree should pay to each training instance, giving more weight to the misclassified ones.
  • Models are created one after the other, each updating the weights on the training instances.
Q140. What is Null Deviance and Residual Deviance (Logistic Regression Concept)?

Null deviance indicates how well the response is predicted by a model with nothing but an intercept.

Residual deviance indicates how well the response is predicted by a model after adding the independent variables.

The lower the value, the better the model.

Q141. What are the different methods to split the tree in a decision tree?

Information gain and gini index

Q142. What is the weakness for Decision Tree Algorithm?

Less suitable for predicting continuous target variables, since the output is piecewise constant

Prone to overfitting and unstable on small data

Q143. Why do we use PCA(Principal Components Analysis)?

PCA is an important feature extraction technique used for dimensionality reduction.

Q144. With an imbalanced data set, will you
  • Calculate the Accuracy only? (or)
  • Precision, Recall, F1 Score separately

We need to calculate Precision, Recall and F1 Score separately, because accuracy alone is misleading when one class dominates.

Q146. Steps involved in Decision Tree and finding the root node for the tree

Step 1: How to find the Root Node

Use information gain to understand how much information each attribute carries w.r.t. the target variable, and place the attribute with the highest information gain as the root node.

Step 2: How to Find the Information Gain

Apply the entropy formula to calculate information gain: Gain(T, X) = Entropy(T) – Entropy(T, X), where T represents the target variable and X represents a feature.

Step 3: Identification of the Next Node

Based on the information gain values obtained from the above steps, identify the attribute with the next highest information gain and place it as the next decision node below the root.

Step 4: Predicted Outcome

Recursively repeat Step 3 until we reach the leaf (terminal) nodes, which hold our predicted target variable.

Step 5: Tree Pruning and optimization for good results

Pruning helps to reduce the size of decision trees by removing sections of the tree, to avoid overfitting.
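
The root-node selection in Steps 1 and 2 can be sketched as follows. The tiny weather-style data set, the attribute names and the helper functions are all hypothetical; Entropy(T, X) is computed as the size-weighted entropy of the target within each value of X.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(T, X) = Entropy(T) - weighted entropy of T within each value of X."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical toy data: "outlook" separates the target perfectly, "windy" barely helps.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "overcast", "windy": True},
    {"outlook": "overcast", "windy": False},
    {"outlook": "overcast", "windy": True},
]
labels = ["no", "no", "no", "yes", "yes", "yes"]

gains = {attr: information_gain(rows, labels, attr) for attr in ("outlook", "windy")}
root = max(gains, key=gains.get)   # attribute with the highest information gain
```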

Q147. What is hyper plane in SVM?

It is a line (in two dimensions; a plane or hyperplane in higher dimensions) that splits the input variable space, and it is selected to best separate the points in the input variable space by their class (0/1, yes/no).

Q148. Explain Bigram with an Example?

Eg: I Love Data Science

Bigram – (I Love) (Love Data) (Data Science)
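
A small sketch that produces the bigrams from the example sentence; the ngrams helper is a hypothetical name and it splits on whitespace only.

```python
def ngrams(text, n=2):
    """Return the list of n-grams (as tuples of words) in a sentence."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("I Love Data Science")
```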

Q149. What are the different activation functions in neural network?

ReLU, Leaky ReLU, Softmax, Sigmoid
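
The four activation functions can be written in a few lines each. This is a plain-Python sketch for scalar inputs (a list for softmax); the 0.01 slope for Leaky ReLU is a common default used here as an assumption.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Keeps a small gradient for negative inputs instead of zeroing them.
    return x if x > 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    # Subtracting the maximum avoids overflow without changing the result.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])
```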

Q150. Which Algorithm Suits a Text Classification Problem?

SVM and Naïve Bayes; neural network models built with frameworks such as Keras, Theano, CNTK or TFLearn (TensorFlow) can also be used.

Q151. You are given a train data set having lot of columns and rows. How do you reduce the dimension of this data?
  • Principal Component Analysis(PCA) would help us here which can explain the maximum variance in the data set.
  • We can also check the correlation for numerical data, remove the problem of multicollinearity (if it exists) and remove some of the columns which may not impact the model.
  • We can create multiple dataset and execute them batch wise.
Q152. You are given a data set on fraud detection. Classification model achieved accuracy of 95%. Is it good?

Accuracy of 95% looks good, but we have to check the following items:

  • what the class distribution of the dataset for the classification problem was
  • whether Sensitivity and Specificity are acceptable
  • if there are only a few minority-class cases and they are not correctly classified, then it is a problem even though accuracy is high

In addition, because this is fraud detection, we need to be careful in prediction (i.e. not wrongly predicting a fraud as non-fraud).

Q153. What is prior probability and likelihood?

Prior probability:
The proportion of each class of the dependent variable in the data set.

Likelihood:
It is the probability of classifying a given observation as ‘1’ in the presence of some other variable.

Q154. How can we know if your data is suffering from low bias and high variance?

Low bias with high variance shows up as a very low training error but a much higher validation error. The Random Forest algorithm can be used to tackle a high variance problem. In cases of low bias and high variance, L1/L2 regularization can also help.

Q155. How is kNN different from kmeans clustering?

K-means partitions a data set into clusters that are homogeneous, with the points in each cluster close to each other, whereas KNN tries to classify an unlabelled observation based on its K nearest neighbours.

Q156. Random Forest has 1000 trees, Training error: 0.0 and validation error is 20.00. What is the issue here?

It is a classic example of overfitting. The model is not performing well on the unseen data. We may have to tune our model using cross validation and other techniques to overcome overfitting.

Q157. Data set consisting of variables having more than 30% missing values? How will you deal with them?

We can remove them, if it does not impact our model
We can apply imputation techniques (like MICE, MISSFOREST, AMELIA) to fill in the missing values

Q158. What do you understand by Type I vs. Type II error?

Type I error occurs when – “we classify a value as positive, when the actual value is negative”

(False Positive)

Type II error occurs when – “we classify a value as negative, when the actual value is positive”

(False Negative)

Q159. Based on the dataset, how will you know which algorithm to apply?
  • If it is a classification problem, then we can use logistic regression, decision trees, etc.
  • If it is a regression problem, then we can use Linear Regression.
  • If it is a clustering problem, we can use k-means (KNN, despite the similar name, is a supervised algorithm).
  • We can also apply XGBoost or Random Forest for better accuracy.
Q160. Why is normalization important?

A data set can have one column in the range (10,000/20,000) while another column has data in the range (1, 2, 3). Clearly these two columns are on different scales, so many algorithms cannot analyse the trend accurately. We can apply normalization here, for example min-max normalization (i.e. converting each column to a 0-1 scale).
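
Min-max normalization maps each value to (value - min) / (max - min). A minimal sketch, with two made-up columns on very different scales:

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum becomes 0 and the maximum becomes 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]   # constant column: nothing to scale
    return [(v - low) / (high - low) for v in values]

# Hypothetical columns on different scales.
salary = [10000, 15000, 20000]
rating = [1, 2, 3]

scaled_salary = min_max_scale(salary)
scaled_rating = min_max_scale(rating)
```

After scaling, both columns span the same 0-1 range, so neither dominates distance-based or gradient-based algorithms purely because of its units.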

Q161. What is Data Science?

Informally, it is the way to quantify your intuitions.
Technically, Data Science is a combination of Machine Learning, Deep Learning and Artificial
Intelligence, where Deep Learning is a subset of AI.

Q162. What is Machine Learning?

Machine learning is the process of generating predictive power from past data (memory). A model
that is trained once can fail in the future if your data distribution changes.

Q163. What is Deep Learning?

Deep Learning is a subset of machine learning that uses neural networks with many layers to learn
representations directly from data. It generally works better the more data and training iterations it
gets, but like any model it can still degrade if your data distribution changes.

Q164. Where to use R & Python?

R is often used when the data is structured. Python is efficient at handling unstructured data. R can
struggle with high-volume data. Python backends such as Theano/TensorFlow make it easy to process such
data fast compared with R.

Q165. Which Algorithms are used to do a Binary classification?

Logistic Regression, KNN, Random Forest, CART and C50 are a few algorithms that can perform binary
classification.

Q166. Which Algorithms are used to do a Multinomial classification?

Naïve Bayes, Random Forest are widely used for multinomial classification.

Q167. What is LOGIT function?

The LOGIT function is the log of the odds. The odds are the probability of success divided
by the probability of failure. The inverse of the logit gives the final probability value of your binary
classification, and we use the ROC curve to choose the cut-off value for that probability.
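
The logit and its inverse (the sigmoid) as a short sketch; the function names and the sample probability 0.8 are arbitrary.

```python
import math

def logit(p):
    """Log of the odds p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

log_odds = logit(0.8)               # odds of 4 to 1
probability = inverse_logit(log_odds)
```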

Q168. What are all the pre-processing steps that are highly recommended?

• Structural Analysis
• Outlier Analysis
• Missing value treatments
• Feature engineering

Q169. What is Normal Distribution?

Data in which Mean = Median = Mode and whose histogram follows a symmetric bell-shaped curve is called normally
distributed data. (Mean = Median = Mode alone is necessary but not sufficient.)

Q170. What is empirical Rule?

Empirical Rule says that whenever data is normally distributed, your data should be
distributed in the following way:
68 percent of your data lies within plus or minus 1 standard deviation of the mean
95 percent of your data lies within plus or minus 2 standard deviations of the mean
99.7 percent of your data lies within plus or minus 3 standard deviations of the mean
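
These percentages follow from the normal distribution itself: the share within k standard deviations of the mean equals erf(k / √2). A short check using the standard library (share_within is a hypothetical helper name):

```python
import math

def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

shares = {k: share_within(k) for k in (1, 2, 3)}
```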

Q171. What is Regression problem statement?

With the help of independent variables (X), we predict the target variable (Y). If your target variable
can take infinitely many (continuous) values, then the problem falls under the Regression problem statement.

Q172. What are all the Error metrics for Regression problem statement?

Standard error metrics are RMSE & MAPE.
RMSE: Root Mean Squared Error (the square root of the mean of the squared errors).
MAPE: Mean Absolute Percent Error (the mean of the absolute errors, expressed as a percentage of the actual values).
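
Both metrics in a few lines, computed on made-up actual and predicted values:

```python
import math

# Hypothetical actual values and model predictions.
actual = [100, 200, 300, 400]
predicted = [110, 190, 330, 360]

errors = [a - p for a, p in zip(actual, predicted)]

# RMSE: square the errors, average them, take the square root.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# MAPE: average of |error| / |actual|, expressed as a percentage.
mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / len(errors)
```

RMSE is in the units of the target and punishes large errors more, while MAPE is scale-free but undefined when an actual value is zero.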

Q173. What is R value in Linear regression?

R is the correlation coefficient, which lies in the range of -1 to 1. If its absolute value is close to 1, it means
that the independent variable is highly correlated with your target variable.
In simple linear regression it is given by the formula: (slope * standard deviation(X)) / standard deviation(Y)
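
The formula can be checked numerically for a simple one-predictor regression. The x and y values below are invented; the slope is the ordinary least-squares slope.

```python
import statistics

# Hypothetical data with a strong positive linear relationship.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

slope = sxy / sxx                                   # least-squares slope
r_direct = sxy / (sxx * syy) ** 0.5                 # Pearson correlation
r_from_slope = slope * statistics.stdev(x) / statistics.stdev(y)
```

Both routes give the same value, which confirms that R is just the slope rescaled by the ratio of the two standard deviations.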

Q174. What is an Outlier?

An outlier is an observation that lies at an abnormal distance from other values. In a sense, this
definition leaves it up to the analyst (or a consensus process) to decide what will be considered
abnormal.
Example: data – (2,1,1,3,4,2,1,4,5,6,2,6,8,9,64,1,7,9)
Only one data point does not fit the distribution. All other data points are within the
range of 1-9, but one data point has a value of 64, which can be considered an influential data
point.

Q175. What are all the mechanisms which can identify Outliers?

Box plot is the standard mechanism which can be used in the univariate Analysis.
Scatter plot can be used for Bi-variate Analysis.

Q176. How can we treat Outliers?

Outliers should be investigated first. The investigation should ask: what is the reason
behind the outlier value? Is it possible to correct the value manually based on our investigation? If
it can’t be corrected manually, remove the observation if the value deviates heavily. If the
deviation is low, we can keep the outlier as it is and proceed.

Q177. What are all the standard imputations that can be carried for missing value treatments?

Mean, median and mode are the standard replacements for missing values.
• Central Imputations
• KNN Imputations

Q178. What is the formula for calculating Upper whisker & Lower whisker value in Box plot?

Upper Whisker: Q3 + 1.5(IQR)
Lower Whisker: Q1 – 1.5(IQR)
IQR: Inter-Quartile Range. Which is given by Q3 – Q1.
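
Applying these formulas to the example data from the outlier question gives a quick way to flag outliers. This sketch uses statistics.quantiles with its default method; other quartile conventions give slightly different whisker values.

```python
import statistics

# Example data from the outlier question above.
data = [2, 1, 1, 3, 4, 2, 1, 4, 5, 6, 2, 6, 8, 9, 64, 1, 7, 9]

q1, _, q3 = statistics.quantiles(data, n=4)   # quartile cut points
iqr = q3 - q1

lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# Anything outside the whiskers is treated as an outlier.
outliers = [v for v in data if v < lower_whisker or v > upper_whisker]
```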

Q179. What is the default data type of input() function in Python?

The default datatype is string. type(input()) will be string.

Q180. What is the output of the following code?

for i in range(10):




Ans: 9

Q181. What is the result of the following code?

j = 'a'

for i in range(2):


Ans: ‘aaaa’

Q182. What is the result of the following code?

L = ()


Ans: Attribute error – tuple has no attribute append

Q183. What is the data type that can be used to store the name and password in pairs?

a = [i+j for i,j in zip(range( 10),range(1,11))]

print(a)

Ans: [1,3,5,7,9,11,13,15,17,19]

Q184. What is the difference between set and tuple?
  • A set has no duplicate entries, whereas a tuple can have duplicate values.
  • Set is mutable, whereas tuple is immutable.
Q185. What is the function used to convert a list to a dictionary?
  • fromkeys() is the function used to convert a list, set, tuple to a dictionary.
  • The values from the list/set/tuple become the keys of the dictionary.
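
A brief sketch of dict.fromkeys; the names in the list are arbitrary.

```python
names = ["alice", "bob", "carol"]

# Every element becomes a key; all keys share the given default value.
login_counts = dict.fromkeys(names, 0)

# Without a second argument the values default to None.
placeholders = dict.fromkeys(names)
```

Because all keys share the same value object, a mutable default such as a list would be shared between keys, so immutable defaults are the safer choice.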
Q179. Code to print the odd numbers from 1 to 100 both inclusive?

print([i for i in range(1,101) if i%2!=0])

Q186. Code to print the prime numbers from 1 to 100?

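prime = []
for i in range(2, 101):
    count = 0
    # A divisor, if any, must be no larger than the square root of i.
    for j in range(2, int((i**0.5)+1)):
        if i % j == 0:
            count += 1
    if count == 0:
        prime.append(i)
print(prime)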

Q187. What is the output of the following code?

a = 0

if a:




Ans: ‘false’ – since 0 is interpreted as false.

What is the output of the following code?

a = 'hello '

if a:


Ans: ‘true’ – since any non-empty string is interpreted as true.
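The truth-value rule behind both answers can be checked directly with bool(). The sample values below are chosen only for illustration.

```python
samples = [0, 1, "", "hello ", [], [0], None]

# Map the printed form of each value to the result of bool(value).
truthiness = {repr(value): bool(value) for value in samples}

for text, flag in truthiness.items():
    print(text, "->", flag)
```

Zero, empty containers, the empty string and None are false; everything else in the sample is true.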

Q188. What is the main algorithm to reduce the cost function of Linear regression?

Gradient descent algorithm is used to minimize the cost function of linear regression.
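A minimal sketch of gradient descent on the mean squared error of a straight line is given below. The function name `fit_line_gd`, the learning rate, the number of epochs and the toy data are assumptions chosen for illustration, not tuned values.

```python
def fit_line_gd(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every point with the current parameters.
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to m and c.
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (2 / n) * sum(errors)
        # Step against the gradient.
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]          # generated from y = 2x + 1
m, c = fit_line_gd(xs, ys)
print(round(m, 3), round(c, 3))
```

If the learning rate is too large the updates diverge, and if it is too small many more epochs are needed.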

Q189. What is the cost function of decision tree algorithm?

Gini index or entropy can be used as the splitting criterion (impurity measure) of a classification decision tree; the algorithm picks the split that reduces impurity the most.
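Both measures can be computed from the class labels that fall into a node. The sketch below uses hypothetical function names and labels; entropy is taken in bits (log base 2).

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

mixed = ["yes", "yes", "no", "no"]     # evenly mixed node
pure = ["yes", "yes", "yes", "yes"]    # node with a single class

print(gini(mixed), entropy(mixed))
```

A pure node scores zero on both measures, while an evenly mixed two-class node gives the maximum values (0.5 for Gini, 1 bit for entropy).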

Q190. How can you generate a random number between 1 – 7 with only a die?

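Roll the die twice. This gives 36 equally likely ordered outcomes. Discard one of them (re-roll if it
occurs) and split the remaining 35 outcomes into 7 groups of 5, one group for each number from 1 to 7.

Multinomial Naïve Bayes is used for classification of more than two classes.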

Q191. Why we use feature selection?

We use feature selection to filter out variables that are not necessary for prediction or that do not
carry sufficient data. Filtering these variables improves the performance of the model. The main
families of feature selection techniques are wrapper methods, embedded methods and filter methods.

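Q192. How many data structures are there in Pandas? What are they? What is the difference between them?

Three data structures.
1. Series: a one-dimensional labelled array. It has a single dtype; mixed values are stored under the object dtype.
2. DataFrame: a two-dimensional table of rows and columns, where each column can have its own dtype.
3. Panel: three-dimensional data. Panel is deprecated and has been removed from recent pandas versions, so current pandas effectively offers Series and DataFrame.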

Q193. What is label-encoder?

Label encoder is used for converting categorical values into numerical values. It labels the classes
with integers from 0 to n_classes - 1. It is mainly used for categorical variables having two or more
different classes.
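The idea can be sketched in plain Python as below. The function name `label_encode` and the sample values are assumptions; assigning codes in sorted class order is a choice made here that mirrors how scikit-learn's LabelEncoder orders its classes.

```python
def label_encode(values):
    """Map each distinct class to an integer from 0 to n_classes - 1."""
    classes = sorted(set(values))                        # assumed ordering: sorted
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

sizes = ["small", "large", "medium", "small"]            # hypothetical column
codes, mapping = label_encode(sizes)
print(codes, mapping)
```

The integer codes follow alphabetical order, not any real ordering of the categories.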

Q194. What is the difference between loc and iloc?

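loc is label-based while iloc is integer position-based. loc selects data using row labels and column
labels, while iloc selects data using integer row and column positions. Both are indexers used with
square brackets, for example df.loc['a', 'x'] and df.iloc[0, 0]. A loc slice includes its end label,
whereas an iloc slice excludes its end position.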

Q195. What are the feature selection techniques in wrapper methods?

1. Forward feature selection starts with no features and adds one feature at a time.

2. Backward feature elimination starts with all features present and removes one feature at a time.

3. Bi-directional elimination (stepwise selection) is a combination of the forward and backward feature
selection methods.

Q196. In Python, are dictionaries ordered or unordered?

Dictionaries are ordered: they preserve insertion order. This is guaranteed by the language from
Python 3.7; in CPython 3.6 it was an implementation detail. Before that, Python dictionaries were
unordered. However, collections.OrderedDict, introduced in Python 3.1, was already available to keep
the insertion order of the items.
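A short sketch of the behaviour on Python 3.7 or later, with made-up keys. It also shows two things OrderedDict still adds on top of a plain dict: reordering helpers and order-sensitive equality.

```python
from collections import OrderedDict

scores = {}
scores["ravi"] = 72
scores["asha"] = 91
scores["meena"] = 85
insertion_order = list(scores)            # keys come back in insertion order

ordered = OrderedDict(scores)
ordered.move_to_end("ravi")               # OrderedDict offers reordering helpers
reordered = list(ordered)

# Plain dicts ignore order when compared; OrderedDicts do not.
same_as_dict = {"a": 1, "b": 2} == {"b": 2, "a": 1}
same_as_ordered = OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1)

print(insertion_order, reordered, same_as_dict, same_as_ordered)
```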

Q197. Discuss bagging classifier.

Bagging classifier is an ensemble machine learning algorithm. It fits several copies of a base
estimator (a decision tree by default in scikit-learn), each on a random sample of the training data.
The max_samples hyperparameter sets the number of samples used to fit each decision tree. It then
aggregates the individual models by voting to obtain a final prediction.
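The mechanics of bagging (bootstrap samples plus majority voting) can be sketched without any library. To keep the example short, the base estimator here is a nearest-class-mean rule standing in for a decision tree; that substitution, the function names, the `seed` parameter and the toy data are all assumptions of this sketch.

```python
import random
from collections import Counter

def fit_centroids(sample):
    """Base estimator: mean x value per class (a stand-in for a decision tree)."""
    groups = {}
    for x, label in sample:
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}

def predict_one(centroids, x):
    """Predict the class whose mean is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def bagging_fit(data, n_estimators=25, max_samples=None, seed=0):
    """Fit one base estimator per bootstrap sample of size max_samples."""
    rng = random.Random(seed)
    size = max_samples or len(data)
    # Each estimator sees its own sample drawn with replacement.
    return [fit_centroids(rng.choices(data, k=size)) for _ in range(n_estimators)]

def bagging_predict(models, x):
    """Aggregate the base estimators by majority vote."""
    votes = Counter(predict_one(model, x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data: class 0 for x in 0..4, class 1 for x in 5..9.
data = [(x, 0) for x in range(5)] + [(x, 1) for x in range(5, 10)]
models = bagging_fit(data)
print(bagging_predict(models, 2), bagging_predict(models, 8))
```

Because each estimator is trained on a different resample, their individual errors tend to differ, and voting averages much of that variance away.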

Q198. What is data cleansing?

Data cleansing is the process of fixing or removing duplicate and inappropriate data from a
table/record/database. The steps involved in cleaning data are removing duplicates, fixing structural
errors, removing outliers and handling missing or incorrect data. It helps to improve data quality
and supports quicker decision-making.

Q199. What are the python libraries for Data Visualization?


Q200. What are outliers?

Outliers are data values that are distant from the other observations. An extremely high (or low)
outlier increases variability and shifts the mean, which can lead to incorrect results.

Q201. What is data standardization?

Standardization is a technique used in machine learning to prevent large differences in scale
between input features. We use StandardScaler in scikit-learn to rescale each feature to a mean of
zero and a standard deviation of one, using z = (x - mean) / standard deviation. Input values that
follow a Gaussian distribution with differing means and standard deviations are thereby transformed
to a standard Gaussian distribution.
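The transformation itself is simple enough to sketch with the standard library. The function name `standardize` and the sample values are assumptions; the population standard deviation is used, which is also what scikit-learn's scaler uses.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)      # population standard deviation
    return [(v - mean) / std for v in values]

heights = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical feature: mean 5, std 2
scaled = standardize(heights)
print(scaled)
```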

Q202. Discuss k-means clustering.

K-means clustering is an unsupervised learning algorithm that belongs to the centroid-based
clustering algorithms. It partitions a set of points into K groups. Each group has a centroid, and
every point is assigned to the group whose centroid is nearest to it.
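The assign-then-update loop can be sketched for one-dimensional points as follows. The function name `k_means_1d`, the explicit starting centroids and the sample points are assumptions made to keep the example small and deterministic; real implementations usually pick the starting centroids randomly.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Cluster 1-D points around the given starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:    # stop when nothing moves
            break
        centroids = new_centroids
    return centroids, clusters

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, clusters = k_means_1d(points, centroids=[1.0, 1.5])
print(centroids, clusters)
```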

Q203. What is confusion matrix?

Confusion matrix is used to evaluate how a classification model is performing. For a binary problem
it is a 2×2 table with actual values as columns and predicted values as rows, so that it holds the
true positive, true negative, false positive and false negative counts. We need the confusion matrix
to compute our model’s accuracy, precision (how many of the predicted positive cases are actually
positive), recall and F1-score.
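A sketch of how the four counts and the derived metrics are obtained for a binary problem is shown below. The function name `confusion_counts` and the labels are made up for illustration, and the sketch assumes at least one predicted and one actual positive so that no division by zero occurs.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical predicted labels

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, tn, fp, fn, accuracy, precision, recall, f1)
```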

Q204. What is a nominal variable?

Nominal variables are categorical variables whose categories cannot be ordered, e.g. sex, colours.

Q205. What is an ordinal variable?

Ordinal variables are categorical variables whose categories can be ordered from lowest to highest,
e.g. the growth stages of a human.

Q206. How to handle missing data?

Missing data reduces model performance. We can handle missing data by removing the affected rows or
columns, or by imputation techniques such as filling with zero or imputing the mean, median or mode.

Q207. Define one-hot encoding.

One-hot encoding is used to convert categorical data into numerical data in the form of 0s and 1s.
It is the process of creating dummy variables, one per category. One-hot encoding is mostly used for
nominal categorical variables, where the categories have no natural order.
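A plain-Python sketch of the idea follows. The function name `one_hot`, the sorted column order and the sample colours are assumptions for this example.

```python
def one_hot(values):
    """Return the category list and one 0/1 row per input value."""
    categories = sorted(set(values))     # one dummy column per category
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows

colors = ["red", "green", "blue", "green"]   # hypothetical nominal column
categories, rows = one_hot(colors)
print(categories)
for row in rows:
    print(row)
```

Each row contains exactly one 1, in the column of its category.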

Q208. What is R2 score?

R2 score is also known as the coefficient of determination and is calculated by
R2 = 1 - SSres / SStot
SSres = the sum of squares of the residual errors (actual value minus predicted value).
SStot = the total sum of squares (actual value minus the mean of the actual values).

R2 score tells how closely our model fits the actual data. It is at most 1 and usually lies
between 0 and 1; it becomes negative when the model fits worse than always predicting the mean.
A high R2 score means our model performance is good.
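The formula translates directly into code. In this sketch the function name `r2_score` and the actual and predicted values are made up for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SSres / SStot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total sum of squares
    return 1 - ss_res / ss_tot

actual = [3, -0.5, 2, 7]
predicted = [2.5, 0.0, 2, 8]
score = r2_score(actual, predicted)
print(round(score, 4))
```

Here SSres is 1.5 and SStot is 29.1875, so the score is close to 1, meaning the predictions track the actual values well.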

Q209. Discuss linear regression.

Linear regression is used to find the relationship between an independent variable (X)
and a dependent variable (Y). It is done by fitting a linear equation
to the observed data. A linear regression line has an equation of the form
Y = mX + c
Y = dependent variable
m = slope
X = independent variable
c = intercept for a given line.
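For a single independent variable, the slope and intercept of the least-squares line have a closed form, sketched below. The function name `fit_line` and the data points are assumptions for this example.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for Y = mX + c."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx              # slope
    c = mean_y - m * mean_x      # intercept: the line passes through the means
    return m, c

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, c = fit_line(xs, ys)
print(round(m, 2), round(c, 2))
```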

Q210. What are the different types of clustering methods?

1. Hierarchical clustering methods (algorithms: DIANA, AGNES, hclust etc.)
2. Partitioning methods (algorithms: k-means, k-medians, k-modes)
3. Distribution-based clustering methods, also called model-based methods (algorithms: Gaussian Mixture Models, DBCLASD)
4. Density-based clustering methods (algorithms: DENCAST, DBSCAN)
5. Fuzzy clustering methods (algorithms: Fuzzy C-Means, Rough k-means)
6. Constraint-based clustering methods (algorithms: Decision Trees, Random Forest, Gradient Boosting)

Q211. What is Data Preprocessing?

Data preprocessing is the process of converting raw data into understandable, usable data. It
involves data cleaning, data transformation and data reduction. Data preprocessing is a data
mining technique that helps to obtain quality data.

Q212. What is correlation?

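Correlation describes how two variables are related to each other: an increase or decrease in one
variable is associated with an increase or decrease in the other. Correlation alone does not show
that one variable causes the change in the other.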
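The most common measure is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand; the function name `pearson` and the sample data are assumptions for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]      # increases together with hours
falling = [10, 8, 6, 4, 2]     # decreases as hours increase

print(pearson(hours, rising), pearson(hours, falling))
```

A value near 1 means the variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.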

Q213. What is data Munging?

Data munging (also known as data wrangling) is the process of cleaning and structuring raw data
and transforming it into a desired format, which helps with analysis and supports quicker
decisions.

Q214. What is data analysis?

Data analysis is the process of collecting, cleaning, structuring and analysing raw data
to learn more about it and make smarter decisions.

Q215. What is data mining?

Data mining is the process of analysing raw data and extracting information from it by uncovering
patterns and relationships, which helps us to learn more about the dataset and analyse it in
different ways.

Q216. Name some data mining techniques.

Tracking patterns
Classification Analysis
Outlier detection/Anomaly
Association Rule Mining
Clustering Analysis
Regression Analysis
Decision trees

Q217. What is precision and how to calculate it?

Precision is the fraction of predicted positive cases that are actually positive.
Precision is calculated by,

Precision = TP/ (TP + FP)
where TP is the True positive and FP is the False Positive

Q218. What is overfitting and underfitting?

Overfitting occurs when the model fits the training data too closely, including its noise, so it performs well on the training data but poorly on new data.
Underfitting occurs when the model is too simple to capture the pattern in the data, so it performs poorly on both the training data and new data.

Q219. What are the types in Data Analysis?

Descriptive Analysis
Predictive Analysis
Diagnostic Analysis
Prescriptive Analysis
Qualitative Analysis
Quantitative Analysis

Q220. What is recall and how to calculate it?

Recall is the fraction of actual positive cases that are predicted as positive.
Recall is calculated by,
Recall = TP/(TP + FN)
where TP is the True Positive and FN is the False Negative

Besant Technologies helps students get top-quality training and provides 100% placement assistance. Our tutors are highly skilled professionals who coach students and also support them with interview preparation. We provide Data Science training in Chennai and Data Science training in Bangalore. These Data Science interview questions and answers will help you build complete knowledge and land the job.

These Data Science interview questions and answers are prepared by tutors through research and analysis, and by collecting questions asked at large companies. Go through these Data Science interview questions and answers, and contact us if you have any doubts about them.

