# Data Science with Python Interview Questions and Answers

Q1) What is supervised learning?

Ans:  Supervised learning means learning from labelled examples. The data contains both the input features and the known output, and we train a model to predict that output from the feature columns.

Q2) What is exploratory data analysis?

Ans:  Exploratory data analysis (EDA) is the first look at a dataset using summary statistics and plots. It is used to understand distributions, spot missing values and outliers, and find relationships between variables before modelling.

Q3) Which is your favorite algorithm and why?

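Ans:  Linear regression

Reasons:

1. It is simple to implement and easy to interpret
2. It is a strong baseline for many problems with a continuous target
3. It can be improved through feature engineering and tuning (for example, regularization)

Q4) Where is logistic regression used?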

Ans:  Logistic regression is used when the target variable is categorical rather than continuous, for example email spam detection or any yes/no outcome.

It models the probability that an observation belongs to a class, and the class is then assigned by applying a threshold to that probability.

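Q5) Which command gives the basic statistics of the data in one go?

Ans:  The pandas method .describe(), which reports the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column.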
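
As a minimal sketch, assuming pandas is installed, the call looks like this. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29],
    "salary": [30000, 52000, 61000, 75000, 41000],
})

# describe() summarises every numeric column in one call:
# count, mean, std, min, 25%, 50%, 75%, max.
summary = df.describe()
print(summary)
```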

Q6) What is the p-value?

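Ans:  The p-value is the probability of observing data at least as extreme as the data actually observed, assuming the null hypothesis is true.

A high p-value means the data are consistent with the null hypothesis. A low p-value (commonly below 0.05) is evidence against it.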
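
A small worked example can make this concrete. The sketch below assumes a null hypothesis that a coin is fair and computes a one-sided p-value for seeing 9 or more heads in 10 tosses. The function name and the numbers are illustrative only.

```python
from math import comb

def binomial_p_value(heads: int, tosses: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= heads) when X ~ Binomial(tosses, p_null)."""
    return sum(
        comb(tosses, k) * p_null**k * (1 - p_null) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )

# Null hypothesis: the coin is fair. Observed: 9 heads in 10 tosses.
p = binomial_p_value(heads=9, tosses=10)
print(round(p, 4))  # 0.0107
```

Because this p-value is below 0.05, the result would usually be treated as evidence against the fair-coin hypothesis.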

Q7) What is the confusion matrix?

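Ans:  A confusion matrix is a table for classification problems that counts true positives, false positives, true negatives and false negatives. Metrics such as accuracy, precision and recall are derived from it, and on test data they show how well the model generalizes.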

Q8) What is hypothesis testing?

Ans:  Hypothesis testing involves a null hypothesis and an alternative hypothesis. We state an assumption and treat it as the null hypothesis, and then, based on statistical inference from the sample, we either reject the null hypothesis or fail to reject it.

Q9) What is imputation?

Ans:  Imputation is the replacement of missing or null values with a value that suits the data sample. For skewed (non-normal) numeric data, the median is generally used because it is not distorted by extreme values. For nearly normally distributed data, the mean is a reasonable choice. For categorical data, the mode is commonly used.

Sometimes the choice also depends on the dataset, and the fill value can be a domain-specific constant or a value drawn at random from the observed values.
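
A minimal sketch of median and mode imputation, assuming pandas is installed. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data; None marks a missing value.
df = pd.DataFrame({
    "income": [30.0, 32.0, None, 35.0, 400.0],   # skewed by one large value
    "city": ["Pune", "Delhi", "Pune", None, "Pune"],
})

# Skewed numeric column: fill with the median, which the extreme value does not distort.
df["income"] = df["income"].fillna(df["income"].median())

# Categorical column: fill with the most frequent value (the mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

Here the missing income becomes 33.5 (the median of the observed values) rather than the mean, which the value 400 would pull far upwards.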

Q10) What are outliers?

Ans:  Outliers are data points that are much smaller or much larger than the bulk of the data points in the dataset.

Q11) What is precision and recall?

Ans:  In simple words, precision tells you, out of everything the model predicted as positive, how much is actually positive. Recall tells you, out of all the actual positives, how many the model predicted correctly.
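
As a sketch, both metrics can be computed from the four cells of a confusion matrix. The labels below are made up for illustration.

```python
# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# The four cells of the confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of actual positives, how many were found

print(tp, fp, fn, tn)       # 3 1 1 5
print(precision, recall)    # 0.75 0.75
```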

Q12) Describe k-NN algorithm

Ans:  k-NN (k-nearest neighbours) is a supervised algorithm in which we fix a constant k and classify a new data point by looking at the k training points closest to it. The point is assigned to the class that is most common among those k neighbours.
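
A minimal sketch of the idea in plain Python, assuming Euclidean distance and a simple majority vote. The function name and the sample points are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query point.
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    # Majority vote among the labels of the k closest points.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, query=(2.0, 2.0)))  # A
print(knn_predict(points, labels, query=(6.0, 7.0)))  # B
```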

Q13) What is bias-variance trade-off?

Ans:  The bias-variance trade-off is like a see-saw. Reducing a model's variance usually increases its bias, and vice versa. When the model is simple and uses few features, it shows high bias and low variance. When the model is complex or has too many features, it shows low bias and high variance.

Q14) Explain underfitting and overfitting

Ans:  Underfitting is when the model is too simple to capture the pattern in the data, so it predicts poorly even on the training data. Overfitting is when the model tries to fit each and every training point, including noise, so it cannot generalize and fails on new, unseen data.

Q15) Which plot is suitable to detect outliers?

Ans: Box plot
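
A box plot typically draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and marks points outside them as outliers. The sketch below applies that rule numerically; the data values are made up, and the quartile method is an assumption because plotting libraries may compute quartiles slightly differently.

```python
from statistics import quantiles

# Hypothetical measurements with one extreme value.
values = [12, 13, 14, 15, 15, 16, 17, 18, 95]

# First and third quartiles (the edges of the box).
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Whisker limits used by the usual 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [95]
```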

Q16) Why is computation with NumPy faster than with Python lists or loops?
• NumPy uses vectorized operations that act on whole arrays at once instead of looping element by element in Python
• NumPy's core is written in C, and that compiled code runs behind the scenes
• NumPy arrays are more compact than lists: they store elements of a single type in one contiguous block, so they take less memory
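
The points above can be sketched as follows, assuming NumPy is installed. Both versions compute the same squares; the vectorized one runs its loop in compiled code, and the difference can be measured with the standard timeit module.

```python
import numpy as np

values = list(range(1_000))

# Pure Python: the interpreter handles one element at a time.
squared_loop = [v * v for v in values]

# NumPy: a single vectorized expression over the whole array.
arr = np.array(values)
squared_vec = arr * arr

print(squared_vec[:5])  # [ 0  1  4  9 16]
```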

Q17) Is it possible to do positional and label-based indexing on a data frame (df)?

Yes, it is possible by using the indexers .loc[] and .iloc[]. .loc[] selects by the row and column labels, and .iloc[] selects by integer position.
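
A short sketch of the difference, assuming pandas is installed. The row labels and scores are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"score": [88, 92, 79]},
    index=["alice", "bob", "carol"],  # hypothetical row labels
)

# Label-based: row labelled "bob", column labelled "score".
by_label = df.loc["bob", "score"]

# Position-based: second row, first column (positions start at 0).
by_position = df.iloc[1, 0]

print(by_label, by_position)  # 92 92
```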

Q18) Is it a good practice to replace the null values present in the data frame?

It depends on how much data is missing. As a rule of thumb, the data is still valuable when less than about 30 to 40% of the values are null; when more than 50% are null, we do not get much valuable information from it. Imputation adds information to a dataset, but it should not exaggerate it, so it is often better to leave the missing values in place and continue with the analysis than to manipulate the available information.

Q19) What are the different types of joins, and which join is preferred for merging datasets?

The common types are inner join, left join, right join and full outer join. An inner join keeps only the rows whose keys match in both datasets. A full outer join is generally preferred when no rows should be lost, because it keeps the rows common to both datasets along with the rows that appear in only one of them. The right choice depends on which rows the analysis needs.
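
As a sketch, the four join types map onto the how argument of pandas merge. The two tables below are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing the key column "id".
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ravi", "Meera"]})
orders = pd.DataFrame({"id": [2, 3, 4], "amount": [250, 400, 150]})

inner = customers.merge(orders, on="id", how="inner")  # ids in both: 2, 3
left = customers.merge(orders, on="id", how="left")    # all customers: 1, 2, 3
right = customers.merge(orders, on="id", how="right")  # all orders: 2, 3, 4
outer = customers.merge(orders, on="id", how="outer")  # everything: 1, 2, 3, 4

print(len(inner), len(left), len(right), len(outer))   # 2 3 3 4
```

Rows without a match on the other side get NaN in the columns that come from that side.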

Q20) What is the difference between ordered and unordered categorical variables?

An ordered categorical variable has a natural ordering or hierarchy among its values, for example low, medium and high salary, or the months of the year. An unordered categorical variable has no notion of high or low, for example colors or types of loans.
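
One way to express this distinction in code, assuming pandas is installed, is the ordered flag of a pandas Categorical. The category names are illustrative.

```python
import pandas as pd

# Ordered: the categories carry a ranking, low < medium < high.
salary_band = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)

# Unordered: the categories are only names.
colour = pd.Categorical(["red", "blue", "red"])

print(salary_band.min(), salary_band.max())  # low high
print(colour.ordered)                        # False
```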

Q21) Which one is preferred to describe the characteristics of a group or population, mean or median, and why?

The median is usually preferred when the data is skewed or contains outliers. The mean aggregates the total quantity and reports what each member would have if it were evenly distributed, so a few extreme values can pull it far from the typical value. For roughly symmetric data without outliers, the mean and median are close and either works.
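
A quick illustration with made-up salary figures shows how one outlier moves the mean but barely moves the median.

```python
from statistics import mean, median

salaries = [30, 32, 35, 38, 40]   # hypothetical salaries, in thousands
with_outlier = salaries + [500]   # one very high earner joins the group

print(mean(salaries), median(salaries))          # 35 35
print(mean(with_outlier), median(with_outlier))  # 112.5 36.5
```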

Q22) Correlation or covariance: which one describes the relationship between two variables more cleanly?

Correlation measures both the strength and the direction of the linear relationship between two variables on a fixed scale from -1 to 1. Covariance reliably indicates only the direction, because its magnitude depends on the units of the variables. Correlation is therefore usually preferred.
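
The sketch below illustrates the unit dependence with hypothetical height and weight data: changing heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged. The helper functions are written out by hand for clarity.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

height_m = [1.5, 1.6, 1.7, 1.8]          # hypothetical heights in metres
weight_kg = [50, 58, 66, 80]             # hypothetical weights
height_cm = [h * 100 for h in height_m]  # same heights in centimetres

print(covariance(height_m, weight_kg), covariance(height_cm, weight_kg))
print(correlation(height_m, weight_kg), correlation(height_cm, weight_kg))
```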

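Q23) What is conditional probability, or a dependent event?

A dependent event is an event whose probability is affected by events that have already happened. Conditional probability is the probability of an event A given that another event B has occurred, written P(A|B) = P(A and B) / P(B).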

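If there are ‘n’ people in a group, how many a. total handshakes (counted once for each person involved) and b. unique handshakes are possible?

a. n(n-1), because each of the n people shakes hands with the other n-1
b. n(n-1)/2, because each handshake is counted twice in a.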
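
Both formulas can be checked by brute force for a small group. The sketch below uses n = 5 as an arbitrary example.

```python
from itertools import combinations, permutations

n = 5
people = range(n)

# Unordered pairs: each handshake counted once.
unique_handshakes = len(list(combinations(people, 2)))

# Ordered pairs: each handshake counted once per participant.
total_handshakes = len(list(permutations(people, 2)))

print(total_handshakes, unique_handshakes)  # 20 10
```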

Q24) Give some examples of events where the binomial distribution is applicable
• Tossing a coin ‘n’ times and counting the heads
• Asking n randomly selected people whether they are older than 30 years
• Drawing a ball from a bag 3 times, putting it back after each draw, and counting the red balls
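
In each of these cases there is a fixed number of independent trials with the same success probability, so the chance of exactly k successes follows the binomial formula. A minimal sketch for the coin example, with an illustrative function name:

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 5 tosses of a fair coin.
print(binomial_pmf(3, 5, 0.5))  # 0.3125
```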

Q25) Is failure to reject the null hypothesis the same as accepting the null hypothesis?

No, the two are not the same. If there is not sufficient evidence to support the alternative hypothesis, we fail to reject the null hypothesis, but that does not mean we have accepted it or shown it to be true.

Q26) How can you differentiate supervised and unsupervised ML in terms of output?

For a given dataset, if we already know what the correct output looks like (the data is labelled), it is supervised ML. In unsupervised ML there are no labelled outputs, so we do not know in advance what the output will look like.

Q27) What is RSS in regression?

RSS stands for residual sum of squares. It is the sum of the squared residuals, where a residual is the difference between an observed value and the value the model predicts for it.

Q28) How does the R2 score describe the strength of the ML model?

The R2 score tells you how well the fitted line explains the given data. For example, an R2 score of 0.73 means the model explains 73% of the variance present in the target variable.
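
As a sketch, R2 can be computed from the residual sum of squares (RSS) and the total sum of squares (TSS) as 1 - RSS/TSS. The observed and predicted values below are made up for illustration.

```python
# Hypothetical observed values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

mean_y = sum(y_true) / len(y_true)

# RSS: squared differences between observed and predicted values.
rss = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))

# TSS: squared differences between observed values and their mean.
tss = sum((y - mean_y) ** 2 for y in y_true)

r2 = 1 - rss / tss
print(rss, tss, r2)  # 0.75 20.0 0.9625
```

Here the model explains 96.25% of the variance in the observed values.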