Saturday 7 October 2017

edX Analytics Edge Unit 2 Summary

Linear regression can be used for prediction in many fields, such as wine prices and sports statistics. In linear regression, we predict a dependent variable based on one or more independent variables.

Consider one-variable linear regression first, which can be plotted as a simple x vs y graph. Here x is the independent variable and y is the dependent variable to be predicted. If we take the average of the dependent variable and draw a horizontal line at that value, we get a baseline model. The goal is to design a model better than the baseline by drawing a predictive line through the data.

Neither the baseline nor the linear model we design will give perfect values for y.
The difference between the actual y value and the predicted y value is the residual (error). The sum of the squared residuals over all data points is the SSE (Sum of Squared Errors).
The difference between the actual y value and the average y value is the error of the baseline model. The sum of these squared differences over all data points is the SST (Total Sum of Squares).

Root Mean Squared Error: RMSE = sqrt(SSE/N)
where N is the number of data points.
R2 = 1 - SSE/SST

If SSE = SST, then R2 = 0. This is the baseline model case.
When SSE < SST, our model reduces the error, which is our goal. The smaller SSE gets relative to SST, the closer R2 gets to 1. So when designing the model, we should look for a higher R2 value.
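
A quick sketch of these quantities in R, using made-up numbers for the actual values y and some model's predictions yhat:

y = c(10, 12, 15, 18, 20)      # actual values (made up)
yhat = c(11, 12, 14, 17, 21)   # a model's predictions (made up)
SSE = sum((y - yhat)^2)        # model error: 4
SST = sum((y - mean(y))^2)     # baseline error around the mean (15): 68
RMSE = sqrt(SSE / length(y))   # about 0.894
R2 = 1 - SSE/SST               # about 0.94, since SSE is much smaller than SST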

If we consider multiple-variable linear regression, we can write y as follows:
y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn + error
Here b0 is the intercept,
b1, b2, ..., bn are the coefficients,
and error is the residual.

Our goal is to find the best coefficients.

Multiple variables can impact y. As we add more variables, the R2 value improves. But as we keep adding more and more variables, the amount by which R2 improves decreases. So we need to make sure we are not adding irrelevant variables that just complicate the model.

In R:

m1 = lm(y ~ x1+x2, data=DF1)
DF1 is a data frame read from a csv file.
x1, x2 are the independent variables and y is the dependent variable to be predicted.
m1 is the fitted model.
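
A minimal end-to-end sketch (the file name and column names here are hypothetical):

DF1 = read.csv("train.csv")        # hypothetical csv with columns y, x1, x2
m1 = lm(y ~ x1 + x2, data = DF1)   # fit the linear regression
summary(m1)                        # coefficients, significance, R2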

summary(m1) gives the intercept, coefficients, and residuals. It also shows which variables are significant for the model.
We can also see the R2 and adjusted R2 values. Adjusted R2 is the R2 value adjusted for the number of independent variables used relative to the number of data points. Adding an irrelevant variable will make the adjusted R2 drop.
Std. Error is the standard error of the coefficient estimate, i.e., a measure of how much the estimate is likely to vary.
t value = Estimate / Std. Error. Its absolute value needs to be high.
Pr(>|t|) is the probability of seeing a t value that extreme if the true coefficient were 0. We want this to be as small as possible. Based on these values, significance stars appear beside each variable:
*** - P < 0.001
** - 0.001 < P < 0.01
* - 0.01 < P < 0.05
. - 0.05 < P < 0.1
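
The same numbers can be pulled out programmatically: coef(summary(m1)) returns the coefficient table as a matrix (using the model m1 from above):

coefs = coef(summary(m1))   # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "t value"]          # t statistic for each coefficient
coefs[, "Pr(>|t|)"]         # p-values; smaller means more significant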


SSE = sum(m1$residuals^2)
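
The training-set RMSE then follows directly from the residuals (again assuming m1 and DF1 from above):

RMSE = sqrt(sum(m1$residuals^2) / nrow(DF1))   # sqrt(SSE/N) on the training data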

cor(DF1$V1, DF1$V2) - gives the correlation between two variables.
cor(DF1) - gives a table of correlations among all variables (all columns must be numeric).

A high correlation between two independent variables (multicollinearity) may give misleading variable-significance values. In general, if the correlation is > 0.7 or < -0.7, the variables should be chosen carefully, checking which performs better.
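
One way to flag such pairs, sketched on the hypothetical DF1 (all columns assumed numeric):

cors = cor(DF1)                           # full correlation matrix
cors[lower.tri(cors, diag = TRUE)] = NA   # keep each pair only once
which(abs(cors) > 0.7, arr.ind = TRUE)    # pairs that deserve a closer look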

Training data- data used to build a model
Test data- new data to test how the model will perform
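
The course typically provides separate training and test csv files, but a random split can be sketched in base R (DF, the 70/30 ratio, and the seed are all assumptions):

set.seed(1)                                     # make the split reproducible
idx = sample(nrow(DF), floor(0.7 * nrow(DF)))   # 70% of row indices for training
DF1 = DF[idx, ]                                 # training data
DFT1 = DF[-idx, ]                               # test data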

p1= predict(m1, newdata=DFT1)
DFT1 is the data frame with the test data.
m1 is the model we built using the training data.

SSE = sum((DFT1$y - p1)^2)
SST = sum((DFT1$y - mean(DF1$y))^2)
Note that the baseline prediction for the test set is the mean of the dependent variable in the training data (DF1), since the test outcomes are not available when predicting.

R2 = 1 - SSE/SST

On test data, the model can even give a negative R2 value, indicating that it performs worse than the baseline.

m2 = step(m1) simplifies the model by removing variables one at a time based on AIC. It gives a simpler model, but not necessarily the best one.

