Binomial Logistic Regression · Math explained

Kisaragi
6 min read · Mar 29, 2021


Photo by Markus Spiske on Unsplash

Content

Know this before continuing

  • Sigmoid function
  • The basic idea of forecasting

Binomial logistic regression

  • A brief review of linear models
  • From linear regression to binary logistic regression
  • How do we estimate parameters
  • Gradient descent
  • Newton’s method

Know this before continuing

Part 1: Sigmoid function

The starting point of binary logistic regression is the sigmoid function
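
For any real input z, the sigmoid function is defined as

σ(z) = 1 / (1 + exp{-z})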

The sigmoid function maps any real number into the interval (0,1), which means its output always lies between 0 and 1, so it can be used to predict probabilities.

Here is the image of the sigmoid function: an S-shaped curve that approaches 0 for large negative inputs, approaches 1 for large positive inputs, and equals 0.5 at zero.

Part 2: The basic idea of forecasting

Note that in logistic regression we do not directly output the category, but a probability value.

We first determine a threshold according to the situation at hand, usually set at 0.5.

For example, in the binary model (categories 0 and 1), if the output is P(y = 1) = 0.75 (0.75 > 0.5), then we say y belongs to category 1.
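
As a small illustration, here is what that decision rule looks like in code; the function name classify is just a hypothetical helper for this post, not part of any library:

def classify(p, threshold=0.5):
    # p is the predicted probability P(y = 1); return category 1 if it exceeds the threshold
    return 1 if p > threshold else 0

print(classify(0.75))  # prints 1, matching the example above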

Binary logistic regression

Part 1: A brief review of the linear model

To explain binary logistic regression, we first need to understand what a linear model is.

A linear model is based on two hypotheses

  • Suppose that there is a linear relationship between y and X
  • The yi ( i = 1, 2, 3, …, n ) are independent and identically distributed

The linear relationship between y and X
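
In matrix form this relationship is usually written as

y = Xβ + ε

where ε is the nx1 vector of error terms.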

The dependent variable y is an nx1 vector

The independent variable X, with an intercept column, is an nx(m+1) matrix

The parameter β, with the intercept term, is an (m+1)x1 vector

Part 2: From linear regression to binary logistic regression

Our hypothesis is that the yi (yi = 0 or 1) independently and identically follow a Bernoulli distribution.

First of all, we want to use linear regression to estimate yi. For a Bernoulli distribution, the mean of yi is P(yi = 1), therefore our prediction model is
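
P(yi = 1) = E(yi) = xiβ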

But the disadvantage of this is that the value of xiβ may fall outside of [0,1]. Inspired by the sigmoid function, we change the prediction model to
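
P(yi = 1) = 1 / (1 + exp{-xiβ})

so that the predicted probability always lies strictly between 0 and 1.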

β = (β0, β1, …, βm)′, xi = (1, xi1, xi2, …, xim), that is, xiβ = β0 + xi1β1 + … + ximβm

Part 3: How to estimate the parameters

Now we know yi follows a Bernoulli distribution. Since the distribution is known, we consider using the maximum likelihood method to estimate the parameters. The likelihood function is
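
the product of the individual probabilities, since the yi are independent:

L(β) = P(y1) P(y2) ⋯ P(yn)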

Note pi=P(yi=1)=1/(1+exp{-xiβ})

Substituting the Bernoulli probability mass function pi^yi (1 - pi)^(1 - yi) for each P(yi), we obtain the following equation:
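
L(β) = ∏ pi^yi (1 - pi)^(1 - yi)   (1)

with the product running over i = 1, …, n.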

To simplify the calculation, take the logarithm of both sides of the above formula, so we have
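
ln L(β) = Σ [ yi ln(pi) + (1 - yi) ln(1 - pi) ]   (2)

with the sum running over i = 1, …, n.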

Then we need to maximize the likelihood function. But here comes the question: why do we need to maximize it?

  • Because our hypothesis is that the yi are independent and identically distributed, the likelihood function is actually the joint density function of the yi. When the joint density is at its maximum, the probability of observing these yi is at its maximum, which is when the observed outcome is most likely to occur. That's why we want to maximize the likelihood function.

I think most of the readers who clicked on this page are interested in machine learning, so let's use a machine learning term, the cost function, to continue our math journey. To get the cost function, we only need to make a small change to (1).
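
Dividing the log-likelihood (2) by the sample size n and flipping the sign gives the cost function

J(β) = -(1/n) Σ [ yi ln(pi) + (1 - yi) ln(1 - pi) ]   (3)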

In essence, the cost function (3) is the negative of the averaged log-likelihood, so maximizing the log-likelihood is the same as minimizing the cost function (3).

If we want to make the estimation more careful, we can also add a penalty term on top of (3). This step is called regularization. The common penalty terms are the L1 norm and the L2 norm, which are also called lasso and ridge respectively.
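
As an illustration (λ here is a tuning constant chosen by the user; it is not part of (3) itself), the ridge-regularized cost is

J(β) + λ Σ βj²

and the lasso version uses λ Σ |βj| instead, with the sums running over j = 1, …, m.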

The remaining problem is how to minimize the cost function and obtain the corresponding parameter β.

There are two common methods to calculate the parameter β, one is Gradient Descent, the other is Newton’s Method.

Part 4: Gradient Descent

Take the derivative of the cost function (3); since (3) = -(2)/n, we have
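
∂J/∂βj = -(1/n) Σ (yi - pi) xij   (4)

for j = 0, 1, …, m, where the sum runs over i = 1, …, n and xi0 = 1 corresponds to the intercept.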

Like before, pi=P(yi=1)=1/(1+exp{-xiβ})

xi and yi are known; we give β an initial value.

Then we can update β by the formula below
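
βj := βj - l · ∂J/∂βj = βj + (l/n) Σ (yi - pi) xij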

Writing it in matrix form, we have
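
β := β + (l/n) X′(y - p)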

l is the learning rate, X is the nx(m+1) matrix, y is the nx1 vector, and p is the nx1 vector (p = [ p1, p2, …, pn ]′).

The following is a simple implementation of one gradient descent update; the calculation of p is omitted.

import numpy as np

def update_beta(X, y, beta, l):
    '''
    X: (n, m+1) matrix
    y: (n, 1) vector
    beta: (m+1, 1) vector
    l: learning rate
    '''
    n = len(y)
    # 1 - from (4) we can get p; name the helper cal_p (its calculation is omitted here)
    p = cal_p(X, beta)
    # 2 - part of (4): X'(p - y) points in the direction of increasing cost
    gradient = np.dot(X.T, p - y)
    # 3 - take the average cost derivative for each feature
    gradient /= n
    # 4 - multiply the gradient by our learning rate
    gradient *= l
    # 5 - subtract from our weights to minimize cost
    beta -= gradient
    return beta
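
Since the calculation of p is omitted above, here is a minimal sketch of what a helper like cal_p could look like, assuming it simply applies the sigmoid to every row of Xβ; the name cal_p is taken from the snippet above, but this body is an assumption, not the author's code.

def cal_p(X, beta):
    '''
    X: (n, m+1) matrix
    beta: (m+1, 1) vector
    returns: (n, 1) vector with entries pi = 1 / (1 + exp{-xi beta})
    '''
    return 1.0 / (1.0 + np.exp(-np.dot(X, beta)))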

Part 5: Newton's Method

We can say this method is inspired by the Taylor series.
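
Expanding f(x) around a point x0 gives

f(x) = f(x0) + f′(x0)(x - x0) + f″(x0)(x - x0)²/2! + …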

We want to find the x that satisfies f(x) = 0; that is, the end point of Newton's method is that f(x) converges to 0.

Here we only take the first two terms and call the resulting approximation h(x)
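
h(x) = f(x0) + f′(x0)(x - x0)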

Since h(x) is an approximation of f(x), we require h(x) = 0 in place of f(x) = 0; solving for x, we have
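
x = x0 - f(x0) / f′(x0)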

The iteration would be
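
x_{k+1} = x_k - f(x_k) / f′(x_k), for k = 0, 1, 2, …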

In our binary logistic regression model, the role of f(x) is played by the derivative of the cost function (we look for the β at which the gradient of (3) equals zero), and the parameter β update can be described as
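
One standard way to write this update (the symbols H and W are introduced here only for notation and do not appear earlier in the post) is the Newton-Raphson step

β := β - H⁻¹ ∇J(β)

where ∇J(β) = -(1/n) X′(y - p) is the gradient from (4) written in matrix form, and H = (1/n) X′WX, with W = diag( p1(1 - p1), …, pn(1 - pn) ), is the matrix of second derivatives of the cost function.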

This iteration continues until f, here the derivative of the cost function, approaches zero, that is, until the cost can no longer be decreased; note that the cost function itself does not necessarily tend to zero, in Newton's method or in gradient descent.

The End

Thanks for reading 🌸



Kisaragi

A student studying Statistics, Mathematics, Machine Learning and Deep Learning.