Application of Logistic Regression using R Programming
Experience of Statswork.com
In this blog, we will look at the practical application of Logistic Regression using R in detail. Performing logistic regression is not an easy task, as it demands that certain assumptions be satisfied, much as Linear Regression does. Before reading further, I would suggest you read up on and understand how Linear Regression works.
Now, let us
understand what Logistic regression is.
Logistic Regression Analysis
Suppose you have medical data on whether a patient has high blood pressure (BP) or not, along with other observations such as age, smoking habits, weight or body mass index (BMI), blood cholesterol levels, diastolic and systolic BP values, gender, etc.
Source: adapted from Nimmala et al., 2018[1]
The first thing to do is select the response or dependent variable; in this case, whether a patient has high BP or not is the response (dependent) variable (see Table 1 for the cut-offs), and the others are explanatory or independent variables. The dependent variable needs to be coded as 0 and 1, while an explanatory variable can be a continuous or an ordinal variable (a short coding sketch follows Table 1).
Table 1. Blood pressure ranges (mmHg)

| BP        | Low | Normal | Borderline | High |
|-----------|-----|--------|------------|------|
| Systolic  | <90 | 90–130 | 131–140    | >140 |
| Diastolic | <60 | 60–80  | 81–90      | >90  |
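As an illustration, the response could be coded from the systolic cut-off in Table 1. This is a minimal sketch using a hypothetical data frame bp.data, not part of the original data:

bp.data <- data.frame(Systolic = c(85, 120, 135, 150))
# code the response: 1 = high BP (systolic > 140), 0 = otherwise
bp.data$highBP <- ifelse(bp.data$Systolic > 140, 1, 0)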
In such a scenario, one wishes to predict the outcome by applying a suitable model, namely the logistic regression model.
Definition:
Logistic regression is a type of predictive model which can be used when the target variable is a categorical variable with two categories. Logistic regression is used to predict the probability of occurrence of an event by fitting the data to a logistic curve. Like many forms of regression analysis, it makes use of predictor variables, which may be either numerical or categorical.
Logistic regression predicts the presence or absence of a characteristic or outcome based on the values of a set of predictor variables; in this case, the presence of high BP. It is suitable for models where the dependent variable is dichotomous. Logistic regression coefficients can be used to estimate odds ratios (OR) for each of the independent variables in the model.
The
main aim of logistic regression is to find the best fitting model to describe
the relationship between the dichotomous characteristic of interest and a set
of independent (predictor or explanatory) variables. Simply put, it predicts the probability of occurrence of an event by fitting data to a logit function.
Logistic regression generates the
coefficients of a formula to predict a logit transformation of the probability
of the presence of the characteristic of interest:
Logit(p) = b0 + b1X1 + b2X2 + … + bkXk

where p is the probability of the presence of the characteristic of interest. The logit transformation is defined as the logged odds:

Odds = p / (1 − p)

and

Logit(p) = log(p / (1 − p)) = log(p) − log(1 − p)
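Inverting the logit gives the familiar S-shaped logistic curve, p = 1 / (1 + e^−(b0 + b1X1 + … + bkXk)). A minimal R sketch of this curve, for intuition only:

# plot the logistic curve p = 1 / (1 + exp(-logit))
logit <- seq(-6, 6, by = 0.1)
p <- 1 / (1 + exp(-logit))
plot(logit, p, type = "l", xlab = "Logit(p)", ylab = "p")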
Rather than choosing parameters that minimize the sum of squared errors (as in linear regression), logistic regression chooses parameters that maximize the likelihood of observing the sample values. The table below shows a prediction model for childhood high blood pressure.
Table 2. Prediction model for childhood high blood pressure. Adapted from Hamoen et al., 2018[2]
Logistic regression in R
R makes it easy to fit a logistic regression model using the function glm(). Now I will explain how to fit a binary logistic model to the Titanic dataset that is available on Kaggle. The first thing is to frame the objective of the study. In this case, one wishes to predict the Survived variable (1 = survived, 0 = did not survive) from the other independent variables.
The first step is to load the training data into the console using the function read.csv():

train.data <- read.csv('train.csv', header = TRUE, na.strings = c(""))
Before moving on to the fitting part, one should check whether there are missing data. This can be done with the sapply() function in R:
sapply(train.data, function(x) sum(is.na(x)))

PassengerId    Survived      Pclass        Name         Sex
          0           0           0           0           0
        Age       SibSp       Parch      Ticket        Fare
        177           0           0           0           0
      Cabin    Embarked
        687           2
From the above result of the sapply() call, we can see that the variable Cabin has many missing entries (687 of 891 rows), so we will drop it. The variable PassengerId is merely a row index and, like Name and Ticket, is not relevant to the objective of the study, so these are dropped as well.
We can omit those variables and make a subset of the data with the subset() function:

data <- subset(train.data, select = c(2, 3, 5, 6, 7, 8, 10, 12))

where the numeric values in the select argument are the column indices in the data file, i.e. Survived, Pclass, Sex, Age, SibSp, Parch, Fare and Embarked.
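Equivalently, and arguably more readably, the same columns can be selected by name:

# the same subset, selecting columns by name instead of by index
data <- subset(train.data,
               select = c(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked))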
The next step is to deal with the remaining missing entries in the subset data. There are different ways to replace the NAs, for example with the mean or the median of the variable. Here, I used the mean of Age to replace its NAs:

data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
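One further step is implied by the row indices used in the split below: the two rows with a missing Embarked value have to be removed, leaving 891 − 2 = 889 rows. A minimal sketch:

# drop the 2 rows with missing Embarked, leaving 889 rows
data <- data[!is.na(data$Embarked), ]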
Fitting Logistic Regression Model
Before fitting the model, we split the data into two sets, a training set and a testing set. The training set is used to fit the model, while the latter is used for testing purposes.

traindata <- data[1:800, ]
testdata <- data[801:889, ]
Let’s start fitting the model in R. Make sure you specify family = binomial in the glm() function, since our response is binary.

lrmodel <- glm(Survived ~ ., family = binomial(link = 'logit'), data = traindata)
The result of the model can be obtained
using the following command:
summary(lrmodel)
Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"),
    data = traindata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6064  -0.5954  -0.4254   0.6220   2.4165

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.137627   0.594998   8.635  < 2e-16 ***
Pclass      -1.087156   0.151168  -7.192 6.40e-13 ***
Sexmale     -2.756819   0.212026 -13.002  < 2e-16 ***
Age         -0.037267   0.008195  -4.547 5.43e-06 ***
SibSp       -0.292920   0.114642  -2.555   0.0106 *
Parch       -0.116576   0.128127  -0.910   0.3629
Fare         0.001528   0.002353   0.649   0.5160
EmbarkedQ   -0.002656   0.400882  -0.007   0.9947
EmbarkedS   -0.318786   0.252960  -1.260   0.2076
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1065.39  on 799  degrees of freedom
Residual deviance:  709.39  on 791  degrees of freedom
AIC: 727.39

Number of Fisher Scoring iterations: 5
The next serious part is to interpret the result of the model. From the p-values in the model summary, the variables Parch, Fare and Embarked are not statistically significant. The variable Sex has the lowest p-value, which indicates a strong association between the sex of a passenger and the chance of survival. It is important to note that the response in the model is the log odds, which can be written as ln(odds) = ln(p/(1−p)) = b0 + b1x1 + b2x2 + … + bkxk.
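Since the coefficients are on the log-odds scale, exponentiating them gives the odds ratios mentioned earlier. A quick sketch; the example value is taken from the summary output above:

# odds ratios: exponentiate the log-odds coefficients
exp(coef(lrmodel))
# e.g. exp(-2.7568) ~ 0.064 for Sexmale: being male multiplies the
# odds of survival by roughly 0.06, the other variables held fixed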
To analyse the deviance, we use the anova() function in R:

anova(lrmodel, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Survived

Terms added sequentially (first to last)

         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                       799    1065.39
Pclass    1   83.607       798     981.79 < 2.2e-16 ***
Sex       1  240.014       797     741.77 < 2.2e-16 ***
Age       1   17.495       796     724.28 2.881e-05 ***
SibSp     1   10.842       795     713.43  0.000992 ***
Parch     1    0.863       794     712.57  0.352873
Fare      1    0.994       793     711.58  0.318717
Embarked  2    2.187       791     709.39  0.334990
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The difference between the null deviance and the residual deviance shows how the model performs against the null model (a model with only the intercept): the wider the gap, the better the model. A large p-value for a term indicates that the model without that variable explains about the same amount of variation, i.e. the variable adds little. Finally, when comparing candidate models, the one with the lower AIC is preferred.
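For instance, as a sketch that is not part of the original analysis, one could refit the model without the non-significant terms and compare AICs:

# refit without Parch, Fare and Embarked, then compare AICs
lrmodel2 <- glm(Survived ~ Pclass + Sex + Age + SibSp,
                family = binomial(link = 'logit'), data = traindata)
AIC(lrmodel, lrmodel2)  # the model with the lower AIC is preferred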
The next important step is to predict on a new set of data, which R enables us to do with the predict() function. Let us fix a threshold for the new response data: if P(y = 1 | X) > 0.5, then y = 1; otherwise y = 0. The threshold can be changed according to the researcher's needs. The following table presents the accuracy-related formulas that are especially common when machine learning algorithms are applied.
Table 3. Measures and formulas (TP, TN, FP, FN = true/false positives and negatives; P and N = the numbers of actual positives and negatives)

| Measure                | Formula                                          |
|------------------------|--------------------------------------------------|
| Classifier accuracy    | (TP + TN) / (P + N)                              |
| Classifier error rate  | (FP + FN) / (P + N)                              |
| Recall                 | TP / P                                           |
| Precision              | TP / (TP + FP)                                   |
| F-measure              | (2 × precision × recall) / (precision + recall)  |
fit.results <- predict(lrmodel, newdata = subset(testdata, select = c(2, 3, 4, 5, 6, 7, 8)),
                       type = 'response')
fit.results <- ifelse(fit.results > 0.5, 1, 0)
Error <- mean(fit.results != testdata$Survived)
print(paste('Accuracy', 1 - Error))
"Accuracy 0.842696629213483"
An accuracy of 84.27% on the test set is actually a good result. Note that I split the data manually; if you split the data in a different way, the accuracy of the model will change.
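To complement accuracy, the other measures from Table 3 can be computed from the confusion matrix of the predictions. A minimal sketch:

# confusion matrix: predicted vs. actual classes on the test set
cm <- table(predicted = fit.results, actual = testdata$Survived)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)   # P = TP + FN, the actual positives
f.measure <- 2 * precision * recall / (precision + recall)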
Finally, let us plot the ROC curve and compute the AUC (area under the curve), typical performance measures for a binary classifier, using the ROCR package:

library(ROCR)
A <- predict(lrmodel, newdata = subset(testdata, select = c(2, 3, 4, 5, 6, 7, 8)),
             type = "response")
Pre <- prediction(A, testdata$Survived)
Pre1 <- performance(Pre, measure = "tpr", x.measure = "fpr")
plot(Pre1)
auc <- performance(Pre, measure = "auc")
auc <- auc@y.values[[1]]
auc
[1] 0.8647186
Well, we have come to the end of this tour! I hope you can now run a logistic regression analysis in R on your own. Further, I suggest you try the same Titanic dataset and check that you get the same results before exploring other datasets.