Data Analysis Workout 12 - Linear Regression Essentials: Understanding Key Assumptions

EnterpriseDNA · September 20, 2023, 6:36am

Title: Linear Regression Essentials: Understanding Key Assumptions

Description:

Linear regression is foundational in data analysis. Dive into its core assumptions to ensure robust and meaningful results in this concise workout.

Scenario:

Imagine you’re analyzing house prices using various factors like size, age, and location. You opt for linear regression. How do you ensure your analysis is sound?

Objectives:

By the end of this workout, you should be able to:

Recognize the core assumptions behind linear regression.
Understand the implications if these assumptions are not met.
Conceptualize the steps to validate these assumptions.

Interactive Task:

Given your understanding of real estate, answer the following:

If you plot residuals against predicted house prices and notice a funnel shape, which assumption is likely violated?

Your Answer: ________________________

If two predictor variables, say house size and number of rooms, provide the same information in predicting price, which assumption is at risk?

Your Answer: ________________________

Why is it crucial to ensure the residuals of your model are approximately normally distributed?

Your Answer: ________________________

Questions:

Which assumption is crucial for hypothesis testing in linear regression?

i) Linearity of relationships

ii) Independence of residuals

iii) Homoscedasticity

iv) Normal distribution of residuals

What does the assumption of “independence of residuals” imply in the context of time series data?

i) Residuals should show a clear trend over time.

ii) Residuals from one time point should not be predictive of residuals from another time point.

iii) Residuals should always increase over time.

iv) All residuals should be identical.


Duration: 20 minutes

Difficulty: Intermediate

Period:
This workout will be released on Wednesday, September 20, 2023, and will end on Thursday, October 05, 2023. But you can always come back to any of the workouts and solve them.

Keith · October 1, 2023, 11:45pm

Hi @EnterpriseDNA.

Here is my solution to this workout:

Interactive Task:

If you plot residuals against predicted house prices and notice a funnel shape, which assumption is likely violated?

Answer:
If you notice a funnel shape when you plot residuals against predicted house prices, it suggests that the assumption of homoscedasticity is likely violated. Homoscedasticity is the assumption that the variance of the errors is constant across all levels of the independent variables. In other words, the spread of residuals should be approximately the same across all predicted values.

A funnel shape indicates heteroscedasticity, which means the spread of residuals varies across predicted values. The OLS coefficient estimates remain unbiased, but they are no longer efficient and the usual standard errors are biased, so confidence intervals and hypothesis tests about the predictors can be misleading.

To address this issue, you might consider applying a transformation to the response variable (like a log transformation), using weighted least squares instead of ordinary least squares, or using robust standard errors.
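As a rough illustration of the funnel and of the log-transformation remedy, here is a minimal sketch on simulated data. The variable names, the price formula, and the helper functions are all made up for the example, and comparing residual spread between the lowest and highest thirds of the fitted values is only an informal numeric stand-in for looking at the plot, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
size = rng.uniform(50, 300, n)  # hypothetical house sizes
# Assumed data-generating process: multiplicative noise, so errors grow with price.
price = 1000 * size * np.exp(rng.normal(0, 0.25, n))

def fit_residuals(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares; return fitted values and residuals."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted

def spread_ratio(fitted, resid):
    """Residual std in the top third of fitted values divided by that in the bottom third."""
    order = np.argsort(fitted)
    third = len(order) // 3
    low = resid[order[:third]].std()
    high = resid[order[-third:]].std()
    return high / low

raw_ratio = spread_ratio(*fit_residuals(size, price))
log_ratio = spread_ratio(*fit_residuals(np.log(size), np.log(price)))

print(f"spread ratio, raw prices: {raw_ratio:.2f}")
print(f"spread ratio, log prices: {log_ratio:.2f}")
```

A ratio well above 1 on the raw scale corresponds to the funnel shape; a ratio near 1 after taking logs suggests the transformation has stabilized the variance for this particular simulated data.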

If two predictor variables, say house size and number of rooms, provide the same information in predicting price, which assumption is at risk?

Answer:
If two predictor variables, such as house size and number of rooms, provide the same information in predicting price, the assumption of no multicollinearity is at risk.

Multicollinearity refers to a situation where two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this case, it becomes difficult for the model to estimate the relationship between each predictor variable and the outcome independently because the predictors are related to each other.

The presence of multicollinearity can lead to unstable estimates of the regression coefficients, which can make it difficult to ascertain the effect of individual predictors on the response variable. It can also result in wider confidence intervals for effect estimates, leading to a lack of statistical significance for important variables.

To detect multicollinearity, you can use the variance inflation factor (VIF), a correlation matrix, or scatter plots of the predictors. To deal with it, you might consider dropping one of the variables, combining the correlated variables into one, or using regularization techniques such as ridge regression.
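The VIF mentioned above can be computed by hand: regress each predictor on the others and take 1 / (1 - R²). The sketch below does this with NumPy on simulated data; the `vif` helper and the size/rooms/age columns are hypothetical, and the common "VIF above 5 or 10" cut-off is a rule of thumb rather than a strict threshold.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only, no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(1)
n = 400
size = rng.uniform(50, 300, n)             # hypothetical house size
rooms = size / 25 + rng.normal(0, 0.5, n)  # rooms closely tracks size (assumed)
age = rng.uniform(0, 80, n)                # unrelated to the other two

vifs = vif(np.column_stack([size, rooms, age]))
for name, value in zip(["size", "rooms", "age"], vifs):
    print(f"VIF {name}: {value:.1f}")
```

In this simulated setup, size and rooms should show large VIFs while age stays close to 1, which is the pattern you would expect when two predictors carry nearly the same information.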

Why is it crucial to ensure the residuals of your model are approximately normally distributed?

Answer:
The assumption of normally distributed residuals is crucial in linear regression for several reasons:

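Confidence Intervals and Hypothesis Tests: The normality assumption is needed to make exact inferences about the regression parameters, such as constructing confidence intervals or conducting t and F tests. These inferences rely on the sampling distribution of the regression coefficients, which is normal when the errors are normal. In large samples the central limit theorem makes the coefficient estimates approximately normal anyway, so the assumption matters most for small samples.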
Model Accuracy: If the residuals are not normally distributed, it could indicate that the model is not correctly specified. For example, a non-linear relationship may exist between the predictors and the response variable, or important interaction effects might be missing.
Predictive Performance: Non-normal residuals do not by themselves bias point predictions, but they make prediction intervals based on the normal distribution unreliable. Strong skewness or heavy tails in the residuals can also signal that the model's errors are systematically off in certain cases, which leads to inaccurate predictions for those cases.
Outlier Detection: Normality of residuals also helps in detecting outliers. When the residuals are approximately normal, observations with unusually large standardized residuals (for example, beyond about three standard deviations) stand out as potential outliers.

To check for normality, you can use graphical methods (like Q-Q plots or histograms) or statistical tests (like the Shapiro-Wilk test). If the residuals are not normally distributed, you might consider transforming the response variable or using a different type of model that does not make this assumption.
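For a quick numeric check without plotting, one option is a moment-based statistic. The sketch below uses the Jarque-Bera statistic, which is a different test from Shapiro-Wilk and is swapped in here only because it is easy to compute with NumPy alone. The data, the `ols_residuals` helper, and the `jarque_bera` helper are all assumptions made for the example. Under normality the statistic is approximately chi-squared with 2 degrees of freedom, so values far above about 6 point to non-normal residuals.

```python
import numpy as np

def ols_residuals(x, y):
    """Residuals from an ordinary least squares fit of y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def jarque_bera(resid):
    """Jarque-Bera statistic: n/6 * (skewness^2 + excess_kurtosis^2 / 4)."""
    n = resid.size
    z = (resid - resid.mean()) / resid.std()
    skew = np.mean(z ** 3)
    excess_kurt = np.mean(z ** 4) - 3.0
    return n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)

rng = np.random.default_rng(2)
n = 1000
size = rng.uniform(50, 300, n)  # hypothetical house sizes
# Two simulated price series: one with normal errors, one with right-skewed errors.
price_normal = 50_000 + 1_000 * size + rng.normal(0, 20_000, n)
price_skewed = 50_000 + 1_000 * size + 20_000 * (rng.exponential(1.0, n) - 1.0)

jb_normal = jarque_bera(ols_residuals(size, price_normal))
jb_skewed = jarque_bera(ols_residuals(size, price_skewed))

print(f"Jarque-Bera, normal errors: {jb_normal:.1f}")
print(f"Jarque-Bera, skewed errors: {jb_skewed:.1f}")
```

A Q-Q plot of the same residuals would tell a similar story visually: points close to the diagonal for the first model, and a curved upper tail for the second.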

Questions:

Which assumption is crucial for hypothesis testing in linear regression?
Answer:

iv) Normal distribution of residuals

What does the assumption of “independence of residuals” imply in the context of time series data?

Answer:

ii) Residuals from one time point should not be predictive of residuals from another time point.
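One common way to check this for time-ordered residuals is the Durbin-Watson statistic, which is close to 2 when consecutive residuals are uncorrelated and moves toward 0 under positive autocorrelation. Below is a minimal sketch on simulated residuals; the AR(1) coefficient of 0.8 and the `durbin_watson` helper are assumptions for illustration only.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)
n = 500
white = rng.normal(0, 1, n)  # independent residuals

# Autocorrelated residuals: each value carries over 80% of the previous one (assumed).
ar1 = np.empty(n)
ar1[0] = white[0]
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

dw_white = durbin_watson(white)
dw_ar = durbin_watson(ar1)

print(f"Durbin-Watson, independent residuals:    {dw_white:.2f}")
print(f"Durbin-Watson, autocorrelated residuals: {dw_ar:.2f}")
```

A value well below 2, as in the second case, means the residual at one time point helps predict the next one, which is exactly what the independence assumption rules out.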

Thanks for the workout.
Keith