- Bryan Persaud
- Alain Kuiete Tchoupou
- Emmanuel Hayble-Gomes
- Jai Jeffryes
- Anil Akyildirim
- Shovan Biswas
November 20, 2019
At this point we have covered ordinary least squares (OLS) regression.
What we haven’t seen is what to do when the predictors are weird (nonlinear, complicated dependence structure, etc.) or when the response is weird (categorical, count data, etc.).
Odds are another way of quantifying the probability of an event, commonly used in gambling (and logistic regression).
Odds
For some event \(E\),
\[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1-P(E)}\]
Similarly, if we are told the odds of \(E\) are \(x\) to \(y\), then
\[\text{odds}(E) = \frac{x}{y} = \frac{x/(x+y)}{y/(x+y)} \]
which implies
\[P(E) = x/(x+y),\quad P(E^c) = y/(x+y)\]
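For example, odds of 3 to 1 correspond to a probability of 0.75. A minimal sketch in R of the two conversions above (the helper names odds and prob_from_odds are just for illustration):

# Odds of an event from its probability: P(E) / (1 - P(E))
odds <- function(p) { p / (1 - p) }

# Probability of an event whose odds are quoted as "x to y": x / (x + y)
prob_from_odds <- function(x, y) { x / (x + y) }

odds(0.75)            # 3, i.e. odds of 3 to 1
prob_from_odds(3, 1)  # 0.75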
A generalized linear model (GLM) is a generalization of OLS that allows the response variable (i.e. the dependent variable) to have an error distribution that is not normal. Logistic regression is just one type of GLM, specifically for dichotomous response variables that follow a binomial distribution.
All generalized linear models have the following three characteristics:
- A probability distribution describing the outcome variable
- A linear predictor \(\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\)
- A link function that relates the linear predictor to the parameter of the outcome distribution
Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.
We assume a binomial distribution produced the outcome variable, and we therefore want to model \(p\), the probability of success for a given set of predictors.
To finish specifying the logistic model we just need to establish a reasonable link function that connects \(\eta\) to \(p\). There are a variety of options, but the most commonly used is the logit function.
Logit function
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right),\text{ for } 0 < p < 1\]
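As a quick check of the definition (not part of the original example), the logit can be computed directly or with base R’s built-in qlogis, and inverted with plogis:

# Logit written straight from the definition
logit <- function(p) { log(p / (1 - p)) }

logit(0.75)       # 1.098612
qlogis(0.75)      # same value, from base R
plogis(1.098612)  # back to (approximately) 0.75 via the inverse logit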
The inverse of the logit is the logistic (sigmoid) function, which maps any real number \(t\) back to a value between 0 and 1:
\[\sigma(t) = \frac{e^t}{e^t + 1} = \frac{1}{1 + e^{-t}}\]
# The logistic (inverse logit) function
logistic <- function(t) {
  return(1 / (1 + exp(-t)))
}

# Evaluate on a grid and plot the S-shaped curve
df <- data.frame(x = seq(-4, 4, by = 0.01))
df$sigma_t <- logistic(df$x)
plot(df$x, df$sigma_t)
Now let \(t\) be a linear function of a single explanatory variable \(x\):
\[t = \beta_0 + \beta_1 x\]
The logistic function can now be rewritten as
\[F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\]
Similar to OLS, we wish to minimize the errors. However, instead of minimizing the sum of squared residuals, we estimate the coefficients by maximizing the likelihood function.
study <- data.frame(
  Hours = c(0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00,
            3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50),
  Pass  = c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1)
)

# Fit a logistic regression of Pass on Hours studied
lr.out <- glm(Pass ~ Hours, data = study, family = binomial(link = 'logit'))
lr.out
## 
## Call:  glm(formula = Pass ~ Hours, family = binomial(link = "logit"), 
##     data = study)
## 
## Coefficients:
## (Intercept)        Hours  
##      -4.078        1.505  
## 
## Degrees of Freedom: 19 Total (i.e. Null);  18 Residual
## Null Deviance:       27.73 
## Residual Deviance: 16.06     AIC: 20.06
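To make the maximum likelihood idea concrete, here is a minimal sketch (not part of the original example) that maximizes the binomial log-likelihood for the study data directly with optim; the estimates should come out close to the glm coefficients above. The helper neg_log_lik is just for illustration.

# Negative log-likelihood of the logistic model for the study data
neg_log_lik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * study$Hours)))
  -sum(study$Pass * log(p) + (1 - study$Pass) * log(1 - p))
}

# Minimizing the negative log-likelihood maximizes the likelihood
optim(c(0, 0), neg_log_lik)$par  # approximately -4.08 and 1.50, matching glm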
Model
\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times \text{Hours}\]
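One common way to read these coefficients is on the odds scale: exponentiating them (assuming the lr.out model fitted above) gives the odds of passing with zero hours of study and the multiplicative change in the odds for each additional hour.

coef(lr.out)       # coefficients on the log-odds scale
exp(coef(lr.out))  # about 0.017 (baseline odds) and 4.5 (odds ratio per extra hour)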
library(ggplot2)

# Wrapper around geom_smooth that fits a logistic (binomial) GLM
binomial_smooth <- function(...) {
  geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}

ggplot(study, aes(x = Hours, y = Pass)) +
  geom_point() +
  binomial_smooth(se = FALSE)
Odds (or probability) of passing if studied zero hours?
\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times 0\]
\[\frac{p}{1-p} = \exp(-4.078) = 0.0169\]
\[p = \frac{0.0169}{1.0169} = 0.017\]
Odds (or probability) of passing if studied 4 hours?
\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times 4\]
\[\frac{p}{1-p} = \exp(1.942) = 6.97\]
\[p = \frac{6.97}{7.97} = 0.875\]
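Both hand calculations can be verified in R with the built-in inverse logit plogis (using the rounded coefficient estimates from above):

plogis(-4.078 + 1.505 * 0)  # about 0.017, probability of passing after 0 hours
plogis(-4.078 + 1.505 * 4)  # about 0.875, probability of passing after 4 hours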
We can also look at the fitted probability for an individual observation. Consider the first student in the data set:
study[1,]
##   Hours Pass
## 1   0.5    0
# Logistic function with the estimated coefficients plugged in
logistic <- function(x, b0, b1) {
  return(1 / (1 + exp(-1 * (b0 + b1 * x))))
}

logistic(.5, b0 = -4.078, b1 = 1.505)
## [1] 0.03470667
Of course, the fitted function will do the same:
study$fitted <- fitted(lr.out)
study[1,]
##   Hours Pass     fitted
## 1   0.5    0 0.03471034
But remember, the model itself is specified on the log-odds scale; fitted returns values already transformed back to probabilities (the response scale), while predict defaults to the log-odds (link) scale.
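A quick illustration of the two scales for the first student (again assuming the lr.out model above):

predict(lr.out)[1]                     # link scale: log odds, about -3.33
predict(lr.out, type = "response")[1]  # response scale: probability, about 0.035, matching fitted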