Linear regression is useful in predicting
continuous outcomes. But if we consider
categorical data like quality of health care-poor/good, election results-republican/democrat, etc, we have to round the outcome to 0/1. We generally use expert assessment in such cases. But this is time consuming and can not possible consider huge amounts of data.
Logistic/Logit regression is useful in these cases where it predicts the
probability of a variable being true.
We use
Logistic Response Function to predict the probability that y=1
P(y=1)=1/1+e^-(b0+b1x1+b2x2+...+bkxk)
Positive coefficients b1,b2..bk will increase the probability of y=1 and negative coefficients increase the probability of y=0.
Odds=P(y=1)/P(y=0)
if Odds>1, P(y=1) is more
P(y=0)=1-P(y=1)
Based on this,
log(Odds)=b0+b1x1+...+bkxk. This is called the
logit function.
We need to have a baseline model just like linear regression. In classification problems,
baseline model predicts the
most frequent outcome - which means the highest frequency between y=1 and y=0. This can be easily obtained using
table(DF1$y). You can calculate
accuracy using highest frequency/total number of observations. Our goal is to be better than this accuracy.
Earlier we split training and test data using some condition on an independent variable in subset function. For example,
DFTr= subset(DF1, V1==x).
Now, we will
randomly split the data into training and test data. For this, we need to install caTools R package.
install.packages("caTools")
Select a CRAN mirror and download the package.
library(caTools) -- loads the library.
set.seed(10)- initializes the random generator.
split= sample.split(DF$y, SplitRatio=0.75)
In this case, data will be split such that 75% of the data is training data and 25% is test data.
Though the data is randomly split, it is split such that
y is balanced in both data sets. For example, if y=1 for 75% of the initial data, then data is split such that y=1 for 75% of training data and 75% of test data.
DFTr= subset(DF1, split==TRUE)
DFTe= subset(DF1, split==FALSE)
Just like lm(), we have glm() here which stands for
generalized linear model.
m1=glm(y ~ v1+v2, data=DFTr, family=binomial)
family=binomial gives a logistic regression model.
summary(m1)
This also gives
AIC which is similar to adjusted R2 in linear model - number of variables used w.r.t number of observations. AIC should be less.
PTr=predict(m1,type="response")
type="response" gives probabilities.
You can find the
average prediction for y=1 using
tapply(PTr, DFTr$y,mean)
To convert probabilities to predictions, we can use a
threshold.
If P(y=1)>=t, then y=1.
If P(y=1)<t, then y=0.
Select threshold based on what type of errors are better for you.
- If threshold is very high, then we predict y=1 very rarely. So the errors here would be actual outcome being y=1, but our predict is y=0.
- If threshold is very small, then we predict y=1 very often. Errors here would be actual outcome being y=0, but our predict id y=1.
If there's no preference, we can simply choose t=0.5.
We can make this
quantitative by using
confusion/classification matrix. This has actual outcomes as rows and predicted outcomes as columns.
|
|
Predicted=0
|
Predicted=1
|
|
Actual=0
|
True Negatives(TN)
|
False Positives(FP)
|
|
Actual=1
|
False Negatives(FN)
|
True Positives(TP)
|
TN and TP are the ones we got right. FN and FP are the ones which we got wrong.
Sensitivity/True Positive Rate = TP/TP+FN
False Negative Error Rate=1-sensitivity
Specificity/True Negative Rate = TN/TN+FP
False Positive Error Rate=1-specificity
Accuracy=TN+TP/N
Error rate=FP+FN/N
If t is high, specificity will be high and sensitivity will be low and vice versa.
We can get the confusion matrix using
table(DFTr$y, PTr>t).
We can get a
visualization of threshold values using
Receiver Operator Characteristic(ROC) curve which has
sensitivity on y-axis and 1-specificity(False Positive Rate) on x-axis. ROC curve starts at (0,0) or t=1 and ends at (1,1) or t=0.
To get ROC curve in R, we need to install ROCR package.
install.packages("ROCR")
library(ROCR)
ROCRPred = prediction(PTr, DFTr$y)
ROCRPerf = performance(ROCRPred, "tpr","fpr")
tpr and fpr are arguments for x and y axis.
plot(ROCRPerf, colorize=TRUE,print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,1.7))
This plot will be colorized and will have threshold labels adjacent to the curve.
auc=as.numeric(performance(ROCRPred,"auc")@y.values)
This gives the
Area Under Curve AUC which is an absolute measure of
quality of prediction.
We can then test our model on test data just like we did on training data.
PTe = predict(m1,type="response",newdata=DFTe)
In case of
missing data, we can use
multiple imputation. For this we need to install and load "
mice" package.
set.seed(144)
I1= complete(mice(DF1)
We can add the variables from I1 to DF1 that has missing data.
We can improve baseline model by using a
sign function that returns 1 if it is passed a positive number, -1 if it is passed a negative number and 0 if it is passed 0.
table(DFTe$y,sign(DFTe$v1))