Content
Know this before you continue reading
- Sigmoid function
- The basic idea of forecasting
Binary logistic regression
- A brief review of linear models
- From linear regression to binary logistic regression
- How do we estimate parameters
- Gradient descent
- Newton’s method
Know this before you continue reading
Part 1: Sigmoid function
The starting point of binary logistic regression is the sigmoid function
sigmoid(x) = 1 / (1 + exp{-x})
The sigmoid function maps any real number into the interval (0, 1), so its output can be interpreted as a probability.
(Figure: the S-shaped curve of the sigmoid function, rising from 0 towards 1.)
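As a quick illustration, here is a minimal NumPy sketch of this function; the name sigmoid and the sample inputs are our own choices, not from the article:

import numpy as np

def sigmoid(x):
    # maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ≈ [0.0000454  0.5  0.9999546]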
Part 2: The basic idea of forecasting
Note that in logistic regression we do not directly output the category, but a probability value.
We first choose a threshold according to the situation, usually 0.5.
For example, in the binary model (categories 0 and 1), if the output is P(y = 1) = 0.75, then since 0.75 > 0.5 we would say y belongs to category 1.
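A minimal sketch of this decision rule; the probability 0.75 and threshold 0.5 come from the text, while the variable names are our own:

p = 0.75                                     # model output P(y = 1)
threshold = 0.5
predicted_class = 1 if p > threshold else 0  # here: 1
print(predicted_class)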
Binary logistic regression
Part 1: A brief review of the linear model
To explain binary logistic regression, we first need to understand what a linear model is.
A linear model is based on two assumptions
- Suppose that there is a linear relationship between y and X
- yi (i = 1, 2, …, n) are independent and identically distributed
The linear relationship between y and X can be written as
y = Xβ + ε
The dependent variable y is an nx1 vector
The independent variable X (with the intercept column) is an nx(m+1) matrix
The parameter β (with the intercept term) is an (m+1)x1 vector
ε is the nx1 error term
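As an illustration of these dimensions, here is a minimal NumPy sketch with made-up sizes (n = 5 samples, m = 2 features) and random data; the variable names are our own:

import numpy as np

n, m = 5, 2
X = np.hstack([np.ones((n, 1)), np.random.randn(n, m)])   # nx(m+1); first column is the intercept
beta = np.random.randn(m + 1, 1)                            # (m+1)x1 parameter vector
epsilon = np.random.randn(n, 1)                             # nx1 error term
y = X @ beta + epsilon                                      # nx1 dependent variable
print(X.shape, beta.shape, y.shape)                         # (5, 3) (3, 1) (5, 1)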
Part 2: From linear regression to binary logistic regression
Our hypothesis is that the yi (yi = 0 or 1) are independent and identically distributed, each following a Bernoulli distribution.
First of all, we want to use linear regression to estimate yi. For the Bernoulli distribution, the mean of yi is P(yi = 1), therefore our prediction model is
P(yi = 1) = xiβ
But the disadvantage of this is that the value of xiβ may fall outside of [0, 1]. Inspired by the sigmoid function, we change the prediction model into
P(yi = 1) = 1 / (1 + exp{-xiβ})
Here β = (β0, β1, …, βm)′ and xi = (1, xi1, xi2, …, xim), that is, xiβ = β0 + xi1β1 + … + ximβm.
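For example, with hypothetical values β = (−1, 2)′ and xi = (1, 0.5), we get xiβ = −1 + 2 × 0.5 = 0, so the predicted probability is P(yi = 1) = 1/(1 + exp{0}) = 0.5.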
Part 3: How to estimate the parameters
Now we know that yi follows a Bernoulli distribution. Since the distribution is known, we can use the maximum likelihood method to estimate the parameters. The likelihood function is
L(β) = ∏_{i=1}^n P(yi)
Note pi = P(yi = 1) = 1/(1+exp{-xiβ}).
Since each yi is Bernoulli, P(yi) = pi^yi · (1 − pi)^(1−yi), so we find the following equation:
L(β) = ∏_{i=1}^n pi^yi · (1 − pi)^(1−yi)
To simplify the calculation, take the logarithm on both sides of the above formula, so we have the log-likelihood
log L(β) = Σ_{i=1}^n [ yi·log(pi) + (1 − yi)·log(1 − pi) ]   (1)
Substituting pi = 1/(1+exp{-xiβ}), this can equivalently be written as
log L(β) = Σ_{i=1}^n [ yi·xiβ − log(1 + exp{xiβ}) ]   (2)
Then we need to maximize the likelihood function. But here comes a question: why do we need to maximize it?
- Because our hypothesis is that the yi are independent and identically distributed, the likelihood function is actually the joint probability of the observed yi. When this joint probability is at its maximum, the observed data are most likely to occur under the model, and that is why we want to maximize the likelihood function.
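As a small made-up example: for three observations y = (1, 0, 1) with predicted probabilities p = (0.8, 0.3, 0.6), the likelihood is 0.8 × (1 − 0.3) × 0.6 = 0.336 and the log-likelihood is log(0.336) ≈ −1.09; maximum likelihood looks for the β that makes this value as large as possible.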
I think most readers who clicked on this page are interested in machine learning, so let's use a machine-learning term, the cost function, to continue our math journey. To get the cost function, we only need to make a small change to (1): take its negative and average it over the n samples,
J(β) = −(1/n) · Σ_{i=1}^n [ yi·log(pi) + (1 − yi)·log(1 − pi) ]   (3)
In essence, minimizing the cost function (3) is the same as maximizing the average of the log-likelihood.
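A minimal NumPy sketch of the cost function (3); the function name cost and the vectorized layout are our own choices:

import numpy as np

def cost(X, y, beta):
    # pi = 1 / (1 + exp{-xi beta}), computed for all i at once
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    # (3): the negative average log-likelihood
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))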
If we want to constrain the estimation further, we can also add a penalty term to (3). This step is called regularization. The common penalty terms are the L1 norm and the (squared) L2 norm, which give the lasso and ridge penalties respectively.
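As a sketch (here λ > 0 is a tuning parameter chosen by the user, and the intercept β0 is usually not penalized): the lasso version minimizes J(β) + λ·(|β1| + … + |βm|), while the ridge version minimizes J(β) + λ·(β1² + … + βm²).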
The remaining problem is how to minimize the cost function and obtain the corresponding parameter β.
There are two common methods for computing β: one is gradient descent, the other is Newton's method.
Part 4: Gradient Descent
Take the derivative of the cost function (3). Since (3) = −(2)/n, we get
∂J(β)/∂βj = (1/n) · Σ_{i=1}^n (pi − yi)·xij,   j = 0, 1, …, m   (4)
Like before, pi = P(yi = 1) = 1/(1+exp{-xiβ}).
xi and yi are known, so we give β an initial value and then update β by the formula below
βj ← βj − l · (1/n) · Σ_{i=1}^n (pi − yi)·xij
Written in matrix form, we have
β ← β − (l/n) · X′(p − y)
l is the learning rate, X is the nx(m+1) matrix, y is the nx1 vector, and p is the nx1 vector (p = [p1, p2, …, pn]′).
The following is a simple implementation of gradient descent; the calculation of p is omitted.
import numpy as np

def update_beta(X, y, beta, l):
    '''
    X: (n, m+1) matrix
    y: (n, 1) vector
    beta: (m+1, 1) vector
    l: learning rate
    '''
    n = len(y)
    # 1 - from (4) we can get p; the helper is named cal_p
    p = cal_p(X, beta)
    # 2 - part of (4): X'(p - y)
    gradient = np.dot(X.T, p - y)
    # 3 - take the average cost derivative for each feature
    gradient /= n
    # 4 - multiply the gradient by our learning rate
    gradient *= l
    # 5 - subtract from our weights to minimize the cost
    beta -= gradient
    return beta
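The helper cal_p is omitted above. Below is a minimal sketch of it, consistent with pi = 1/(1 + exp{-xiβ}), together with a toy training loop that reuses update_beta from the code above; the data, the number of iterations, and the learning rate are arbitrary illustrations:

import numpy as np

def cal_p(X, beta):
    # p = 1 / (1 + exp{-X beta}); an (n, 1) vector of P(yi = 1)
    return 1.0 / (1.0 + np.exp(-np.dot(X, beta)))

# toy usage with made-up data: 4 samples, one feature plus an intercept column
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -0.5], [1.0, -1.5]])
y = np.array([[1.0], [1.0], [0.0], [0.0]])
beta = np.zeros((2, 1))
for _ in range(1000):
    beta = update_beta(X, y, beta, l=0.1)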
Part 5: Newton's Method
We can say this method is inspired by the Taylor series. Expanding f(x) around a point x0 gives
f(x) = f(x0) + f′(x0)(x − x0) + f″(x0)(x − x0)²/2! + …
We want to find an x that satisfies f(x) = 0; that is, the end point of Newton's method is that f(x) converges to 0.
Here we only take the first two terms
h(x) = f(x0) + f′(x0)(x − x0)
h(x) is an approximation of f(x), so we require f(x) ≈ h(x) = 0, and then we have
x = x0 − f(x0)/f′(x0)
The iteration would be
x(k+1) = x(k) − f(x(k))/f′(x(k))
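As a small generic illustration of this iteration; the example function f(x) = x² − 2, whose positive root is √2, is our own choice:

def newton(f, f_prime, x0, tol=1e-10, max_iter=100):
    # iterate x <- x - f(x) / f'(x) until f(x) is close to 0
    x = x0
    for _ in range(max_iter):
        x = x - f(x) / f_prime(x)
        if abs(f(x)) < tol:
            break
    return x

print(newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))   # ≈ 1.41421356..., the root of x² − 2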
In our binary logistic regression model, the function we drive to zero is the derivative of the cost function (3) (at the minimum of the cost, the gradient (4) equals zero), and the parameter β update can be described as
β(k+1) = β(k) − H(β(k))^(−1) · ∂J(β(k))/∂β
where H(β) is the Hessian matrix (the matrix of second derivatives) of the cost function.
This iteration continues until the gradient of the cost function approaches zero, that is, until the cost stops decreasing. This is quite different from gradient descent, which uses only the gradient and a learning rate rather than second-derivative information; under either method the cost function itself does not necessarily tend to zero.
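For completeness, here is a minimal NumPy sketch of a single Newton update for β. It uses the standard gradient and Hessian of the cost (3), namely g = (1/n)·X′(p − y) and H = (1/n)·X′WX with W = diag(pi(1 − pi)); the function name newton_step is our own, and the sketch assumes H is invertible:

import numpy as np

def newton_step(X, y, beta):
    # one Newton update: beta <- beta - H^{-1} g
    n = len(y)
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # (n, 1) vector of P(yi = 1)
    g = X.T @ (p - y) / n                 # gradient of the cost (3), as in (4)
    W = np.diagflat(p * (1 - p))          # (n, n) diagonal matrix with pi(1 - pi)
    H = X.T @ W @ X / n                   # Hessian of the cost (3)
    return beta - np.linalg.solve(H, g)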
The End
Thanks for reading 🌸