November 20, 2019

Meetup Presentations

Regression so far…

At this point we have covered:

  • Simple linear regression
    • Relationship between numerical response and a numerical or categorical predictor
  • Multiple regression
    • Relationship between numerical response and multiple numerical and/or categorical predictors

What we haven’t seen is what to do when the predictors are weird (nonlinear, complicated dependence structure, etc.) or when the response is weird (categorical, count data, etc.).

Odds

Odds are another way of quantifying the probability of an event, commonly used in gambling (and logistic regression).

Odds

For some event \(E\),

\[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1-P(E)}\]

Similarly, if we are told the odds of \(E\) are \(x\) to \(y\), then

\[\text{odds}(E) = \frac{x}{y} = \frac{x/(x+y)}{y/(x+y)} \]

which implies

\[P(E) = x/(x+y),\quad P(E^c) = y/(x+y)\]
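For example, odds of 3 to 1 correspond to \(P(E) = 3/4\). A quick check in R (odds_to_prob is a hypothetical helper, not part of the slides):

odds_to_prob <- function(x, y) { x / (x + y) }   # P(E) given odds of x to y
odds_to_prob(3, 1)
## 0.75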

Generalized Linear Models

Generalized linear models (GLMs) are a generalization of OLS that allows the response variable (i.e. the dependent variable) to have an error distribution other than the normal distribution. Logistic regression is just one type of GLM, specifically for dichotomous response variables that follow a binomial distribution.

All generalized linear models have the following three characteristics:

  1. A probability distribution describing the outcome variable
  2. A linear model
    \(\eta = \beta_0+\beta_1 X_1 + \cdots + \beta_n X_n\)
  3. A link function that relates the linear model to the parameter of the outcome distribution
    \(g(p) = \eta\) or \(p = g^{-1}(\eta)\)
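In R, a link function and its inverse travel together. A small illustration of \(g\) and \(g^{-1}\) using base R's make.link():

lnk <- make.link('logit')
lnk$linkfun(0.5)   ## g(p): logit(0.5) = log(1) = 0
lnk$linkinv(0)     ## g^{-1}(eta): back to 0.5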

Logistic Regression

Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.

We assume a binomial distribution produced the outcome variable, and we therefore want to model \(p\), the probability of success for a given set of predictors.

To finish specifying the logistic model we just need to establish a reasonable link function that connects \(\eta\) to \(p\). There are a variety of options, but the most commonly used is the logit function.

Logit function

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right),\text{ for } 0 < p < 1\]
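Base R already provides the logit and its inverse as qlogis() and plogis() (a quick check; the values in the comments are approximate):

qlogis(0.75)    ## log(0.75 / 0.25) = log(3), about 1.099
plogis(1.099)   ## inverse logit, back to about 0.75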

The Logistic Function

\[\sigma(t) = \frac{e^t}{e^t+1} = \frac{1}{1+e^{-t}}\]

logistic <- function(t) { return(1 / (1 + exp(-t))) }
df <- data.frame(x=seq(-4, 4, by=0.01))   # grid of t values
df$sigma_t <- logistic(df$x)              # sigma(t) at each grid point
plot(df$x, df$sigma_t, type='l')          # the familiar S-shaped curve

t as a Linear Function

\[ t = \beta_0 + \beta_1 x \]

The logistic function can now be rewritten as

\[F(x) = \frac{1}{1+e^{-(\beta_0+\beta_1 x)}}\]

Similar to OLS, we wish to find the best-fitting parameters. However, instead of minimizing the sum of squared residuals, we estimate the coefficients by maximizing the likelihood function.

Example: Hours Studying Predicting Passing

study <- data.frame(
    Hours=c(0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,
            3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50),
    Pass=c(0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1)
)
lr.out <- glm(Pass ~ Hours, data=study, family=binomial(link='logit'))
lr.out
## 
## Call:  glm(formula = Pass ~ Hours, family = binomial(link = "logit"), 
##     data = study)
## 
## Coefficients:
## (Intercept)        Hours  
##      -4.078        1.505  
## 
## Degrees of Freedom: 19 Total (i.e. Null);  18 Residual
## Null Deviance:       27.73 
## Residual Deviance: 16.06     AIC: 20.06
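To make the maximum likelihood idea from earlier concrete, here is a minimal sketch that maximizes the binomial log-likelihood directly with optim() and recovers essentially the same coefficients (neg_log_lik is an illustrative name, not from the slides):

neg_log_lik <- function(beta, x, y) {
    p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))   # inverse logit of the linear predictor
    -sum(y * log(p) + (1 - y) * log(1 - p))        # negative binomial log-likelihood
}
optim(c(0, 0), neg_log_lik, x=study$Hours, y=study$Pass)$par
## roughly -4.08 and 1.50, matching glm()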

Model

\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times \text{Hours}\]
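The same coefficients can be read directly off the fitted object:

coef(lr.out)
## (Intercept) about -4.078; Hours about 1.505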

Plotting the Results

library(ggplot2)

binomial_smooth <- function(...) {
    # geom_smooth wrapper that fits a logistic regression curve
    geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}

ggplot(study, aes(x=Hours, y=Pass)) + geom_point() + binomial_smooth(se=FALSE)

Prediction

What are the odds (and probability) of passing for a student who studied zero hours?

\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times 0\] \[\frac{p}{1-p} = \exp(-4.078) = 0.0169\] \[p = \frac{0.0169}{1.0169} = 0.017\]

What are the odds (and probability) of passing for a student who studied 4 hours?

\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times 4\] \[\frac{p}{1-p} = \exp(1.942) = 6.97\] \[p = \frac{6.97}{7.97} = 0.875\]
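Both hand calculations can be confirmed with predict(), where type='response' returns probabilities instead of log odds (using the lr.out fit from above; values are approximate):

predict(lr.out, newdata=data.frame(Hours=c(0, 4)), type='response')
## roughly 0.017 and 0.874, matching the hand calculations up to rounding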

Fitted Values

study[1,]
##   Hours Pass
## 1   0.5    0
logistic <- function(x, b0, b1) {
    # inverse logit of the linear predictor b0 + b1 * x
    return(1 / (1 + exp(-1 * (b0 + b1 * x))))
}
logistic(.5, b0=-4.078, b1=1.505)
## [1] 0.03470667

Of course, the fitted() function gives the same result:

study$fitted <- fitted(lr.out)
study[1,]
##   Hours Pass     fitted
## 1   0.5    0 0.03471034

But note: fitted() returns predictions on the response scale, so these values are probabilities, not log odds. To get the log odds (the linear predictor), use predict(), which defaults to the link scale.
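A quick comparison of the two scales (a small sketch using the lr.out model from above; values are approximate):

predict(lr.out, newdata=data.frame(Hours=0.5))                    ## log odds, about -3.33
predict(lr.out, newdata=data.frame(Hours=0.5), type='response')   ## probability, about 0.035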