Logistic Regression Using R

1. What is Logistic Regression?
Logistic regression is a machine learning algorithm used for classification problems. It estimates the probability that an instance belongs to a given class. If the estimated probability is greater than a chosen threshold, the model predicts that the instance belongs to that class; otherwise, it predicts that it does not, as shown in Fig. 1. This makes it a binary classifier. Logistic regression is used when the dependent variable takes values such as 0/1, true/false, or yes/no.
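
The thresholding rule described above can be sketched in a few lines of R. The probabilities and the threshold of 0.5 below are made-up values for illustration, not output from any fitted model.

```r
# Hypothetical predicted probabilities for five instances (illustrative values only)
probs <- c(0.10, 0.45, 0.50, 0.72, 0.95)

# Decision threshold; 0.5 is a common default, not a value fixed by the method
threshold <- 0.5

# Predict class 1 when the probability exceeds the threshold, otherwise class 0
predicted_class <- ifelse(probs > threshold, 1, 0)
print(predicted_class)
```

Note that a probability exactly equal to the threshold is assigned to class 0 here, because the comparison is strict.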

Example 1

Suppose we want to know whether a candidate will pass an entrance exam. The candidate's result depends on attendance in class, the teacher-student ratio, the teacher's knowledge, and the student's interest in the subject. These are the independent variables, and the result is the dependent variable. The result takes the value yes or no, so this is a binary classification problem.
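
As a rough sketch of how such a problem might be set up in R, the snippet below simulates data for two of the predictors and fits a logistic regression with glm(). The column names (attendance, interest, pass), the sample size, and the coefficients used to simulate the data are all assumptions made for illustration.

```r
set.seed(1)  # reproducible simulated data

# Simulated candidates; column names and effect sizes are assumptions for illustration
n <- 300
exam <- data.frame(
  attendance = runif(n, 40, 100),   # percent of classes attended
  interest   = runif(n, 1, 10)      # self-rated interest in the subject
)
true_prob <- plogis(-8 + 0.08 * exam$attendance + 0.4 * exam$interest)
exam$pass <- rbinom(n, size = 1, prob = true_prob)

# Binary outcome modeled with a logistic regression
model <- glm(pass ~ attendance + interest, data = exam, family = binomial)
coef(model)

# Estimated probability of passing for a new, hypothetical candidate
new_candidate <- data.frame(attendance = 85, interest = 7)
p_pass <- predict(model, new_candidate, type = "response")
p_pass
```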

2. Why Logistic Regression, Not Linear Regression?
Linear regression models the relationship between the dependent variable and the independent variables by fitting a straight line, as shown in the figure.

In linear regression, the predicted value of Y can fall outside the range 0 to 1. As discussed earlier, logistic regression outputs a probability, and a probability always lies between 0 and 1. The logistic function is defined as:

1 / (1 + e^-value)

where e is the base of the natural logarithm and value is the number you want to transform. The output of this function always lies between 0 and 1.
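
The logistic function is simple to write directly in R. The helper name logistic below is our own; base R's plogis() computes the same quantity.

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, 1)
logistic <- function(value) {
  1 / (1 + exp(-value))
}

values <- c(-10, -1, 0, 1, 10)
logistic(values)

# Base R's plogis() computes the same function
all.equal(logistic(values), plogis(values))
```

Large negative inputs map close to 0, large positive inputs map close to 1, and an input of 0 maps to exactly 0.5.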

The equation of linear regression is

Y=B0+B1X1+…+BpXp

The logistic function is applied to convert the output to the 0 to 1 range:

P(Y=1) = 1 / (1 + exp(-(B0 + B1X1 + … + BpXp)))
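
To see the difference in practice, the sketch below fits both a straight line and a logistic model to the same simulated 0/1 outcome and compares the range of their predictions. The data, the seed, and the coefficients used to simulate them are assumptions for illustration only.

```r
set.seed(42)  # fixed seed so the simulated data are reproducible

# Simulated data (an assumption for illustration): one predictor x, binary outcome y
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(0.5 + 1.5 * x))
d <- data.frame(x = x, y = y)

# Straight-line fit: predictions are not constrained to [0, 1]
linear_fit <- lm(y ~ x, data = d)
p_lin <- predict(linear_fit, d)
range(p_lin)

# Logistic fit: type = "response" returns probabilities, always within [0, 1]
logistic_fit <- glm(y ~ x, data = d, family = binomial)
p_log <- predict(logistic_fit, d, type = "response")
range(p_log)
```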

We then rearrange the equation so that the linear term is on the right-hand side:

log(P(Y=1) / (1 - P(Y=1))) = B0 + B1X1 + … + BpXp

where log(P(Y=1) / (1 - P(Y=1))) is called the log-odds (or logit), and P(Y=1) / (1 - P(Y=1)) is the odds.
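
A quick numerical check confirms that taking the log of the odds recovers the linear term. The coefficients b0 and b1 and the predictor value x1 are hypothetical numbers picked only for this check.

```r
# Hypothetical coefficients and predictor value, chosen only for illustration
b0 <- -1.0
b1 <- 0.8
x1 <- 2.5

linear_term <- b0 + b1 * x1          # B0 + B1*X1
p <- 1 / (1 + exp(-linear_term))     # P(Y = 1)
log_odds <- log(p / (1 - p))         # log(P / (1 - P))

# The log-odds recovers the linear term (up to floating-point error)
all.equal(log_odds, linear_term)
```

Base R also provides qlogis(), which computes log(p / (1 - p)) directly.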

4. How to find the threshold value

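# Predicted probabilities on the training data
res <- predict(model, training, type = "response")

library(ROCR)

# Pair the predictions with the true labels
ROCRPred <- prediction(res, training$target)

# True positive rate versus false positive rate at every cutoff
ROCRPerf <- performance(ROCRPred, "tpr", "fpr")

# ROC curve, colored by cutoff, with cutoffs labeled from 0.1 in steps of 0.1
plot(ROCRPerf, colorize = TRUE, print.cutoffs.at = seq(0.1, by = 0.1))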

When selecting the threshold value, we should make sure that the true positive rate is as high as possible and the false negative rate is as low as possible. If a person has the disease but the model predicts that they do not, it could cost that person's life.

The plot shows that the true positive rate increases if we take threshold = 0.4.
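
The same trade-off can be inspected numerically without a plot. The sketch below uses made-up labels and probabilities (not the data behind the plot above) and a hypothetical helper, rates_at, to compute the true positive rate and false positive rate at a few candidate thresholds.

```r
# Made-up labels and predicted probabilities (NOT the article's data)
actual <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
prob   <- c(0.92, 0.81, 0.55, 0.45, 0.30, 0.62, 0.41, 0.35, 0.20, 0.08)

# True positive rate and false positive rate at a given threshold
rates_at <- function(threshold) {
  predicted <- prob > threshold
  c(threshold = threshold,
    tpr = sum(predicted & actual == 1) / sum(actual == 1),
    fpr = sum(predicted & actual == 0) / sum(actual == 0))
}

# One row per candidate threshold
t(sapply(c(0.3, 0.4, 0.5, 0.6), rates_at))
```

Lowering the threshold raises the true positive rate, but the false positive rate rises with it, which is why the choice has to be weighed against the cost of each kind of error.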

# Predicted probabilities on the test data
res <- predict(model, testing, type = "response")

# Confusion matrix at threshold 0.4
table(Actualvalue = testing$target, Predictedvalue = res > 0.4)

Here, we can see that the number of false negatives decreases from 7 to 5.

Accuracy of the model

The accuracy of the model is the same whether we use a threshold of 0.4 or 0.5, but with a threshold of 0.4 the number of false negatives decreases. So, it is better to take 0.4 as the threshold value. The choice of threshold depends on the use case. In medical problems, our focus is on reducing false negatives, because if a person has the disease but the model predicts that they do not, it could cost that person's life.
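
A small sketch shows how two thresholds can give the same accuracy but different numbers of false negatives. The labels and probabilities are made up for illustration and are not the article's test set; summarise_threshold is a hypothetical helper.

```r
# Made-up labels and predicted probabilities (NOT the article's data)
actual <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
prob   <- c(0.92, 0.81, 0.55, 0.45, 0.30, 0.62, 0.41, 0.35, 0.20, 0.08)

# Accuracy and error counts at a given threshold
summarise_threshold <- function(threshold) {
  predicted <- as.integer(prob > threshold)
  c(threshold = threshold,
    accuracy = mean(predicted == actual),
    false_negatives = sum(predicted == 0 & actual == 1),
    false_positives = sum(predicted == 1 & actual == 0))
}

rbind(summarise_threshold(0.5), summarise_threshold(0.4))
```

On this toy data the accuracy is identical at both thresholds, while the lower threshold trades one false negative for one extra false positive.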