Logistic Regression – The Groundwork for Deep Learning
This article gives a basic overview of one of the most popular machine learning algorithms. But before jumping in, if you are an aspiring data scientist, don't forget to check out our advanced data science courses in Bangalore. You can get the code for this article here. This article is part of our student article initiative and was written by Shashank.

Logistic Regression is a supervised classification algorithm. It might be confusing at first why it is named a regression algorithm even though it performs classification: the algorithm actually outputs the probability of an event occurring, which is a continuous value (later converted to either y = 0 or y = 1 based on a threshold value), and this is why 'regression' appears in the name. Examples include email spam filtering, image classification, and bank fraud detection.
Before jumping into the intricacies of Logistic Regression, let's try fitting a linear model to our classification data and see how it performs. Consider a simple dataset where the feature is the salary of a loan applicant and y represents whether the loan is approved. This is a binary classification problem, with the two classes being y = 0 (loan not approved) and y = 1 (loan approved).
It's easy to deduce from the plot that only those applicants whose salary is greater than 10,000 would have their applications approved. If we fit a linear model to this plot using the Ordinary Least Squares (OLS) method, it'd look like this:
The general form of this linear model is y = θ₀ + θ₁ · salary, with the parameters θ₀ and θ₁ chosen to minimize the squared error.
If we set a threshold value of y = 0.5, as seen in the graph, then all the salary values greater than 10,000 would be approved for a loan according to this linear model. The model actually does a good job of capturing the relationship between the feature and the label. But there's a catch: what if there are outliers in our data?
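The idea above can be sketched in a few lines. This is a minimal illustration with a hypothetical salary column (in thousands) standing in for the article's dataset: fit a straight line to the 0/1 labels with least squares and read off the salary where the line crosses 0.5.

```python
import numpy as np

# Hypothetical salaries (in thousands) and loan outcomes:
# applicants earning more than 10 are approved (y = 1).
salary = np.array([2, 4, 6, 8, 9, 11, 12, 14, 16, 18], dtype=float)
y = (salary > 10).astype(float)

# Ordinary Least Squares fit of a straight line to the 0/1 labels.
slope, intercept = np.polyfit(salary, y, deg=1)

# The salary at which the fitted line crosses y = 0.5 acts as the
# decision threshold under this linear model.
threshold = (0.5 - intercept) / slope
print(round(threshold, 1))  # → 10.0 for this symmetric toy data
```

Adding a few large-salary outliers to `salary` and refitting shifts this threshold to the right, which is exactly the failure mode described next.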
Let’s now plot the same data along with outliers and see how the linear model performs.
The model now tilts a little to accommodate the outliers. This shifts the threshold salary from 10,000 to around 13,000. As a result, the new linear model would misclassify all the salaries between 10,000 and 13,000 as y = 0 when in fact they should have been classified as y = 1. This is where the linear model fails to deliver.

Another drawback of using linear models for classification is that a linear model does not always return values between 0 and 1, and any value outside the range [0, 1] makes no sense, because a probability always lies between 0 and 1. To solve this, we need a function that takes the linear equation's score and outputs a number between 0 and 1, representing the probability that the observation belongs to class y = 1. This function is called the sigmoid function or logistic function:

σ(z) = 1 / (1 + e^(−z))
It is worth noting from the graph that the sigmoid function:
– is almost linear between z = −3 and z = 3
– equals 0.5 when z = 0
– approaches 0 asymptotically as z decreases below −3
– approaches 1 asymptotically as z increases above 3
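The properties above are easy to verify numerically with a small sketch of the sigmoid function:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5 exactly
print(sigmoid(-3))   # ≈ 0.047, already close to 0
print(sigmoid(3))    # ≈ 0.953, already close to 1
```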
To get a better understanding, let's use a built-in dataset called 'penguins' from Seaborn; it has data about the bill measurements of penguins along with the name of the species. It'd be interesting to see if we can separate the 'Adelie' and 'Gentoo' species using logistic regression on just the bill depth and bill length. Bill length and bill depth would be the two features and the species would be the label. This is what the scatter plot looks like:
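The plot above can be reproduced with a short script (this assumes Seaborn is installed; `load_dataset` fetches the data over the network on first use):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the built-in penguins dataset and drop rows with missing values.
penguins = sns.load_dataset("penguins").dropna()

# Keep only the two species we want to separate.
two = penguins[penguins["species"].isin(["Adelie", "Gentoo"])]

# Scatter plot of the two bill features, colored by species.
sns.scatterplot(data=two, x="bill_length_mm", y="bill_depth_mm", hue="species")
plt.show()
```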
The two classes can be separated by a straight line, as seen in the graph, and the equation of this line takes the general form θ₀ + θ₁·x₁ + θ₂·x₂ = 0, where x₁ and x₂ are the bill length and bill depth.
This line that separates the two classes is called the ‘Decision Boundary’.
The decision boundary is a property of the weights and not a property of the dataset. The parameter values that define the decision boundary can be easily determined by fitting a logistic regression model to the data and then extracting the coefficients and intercept.
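This fit-then-extract step can be sketched as follows. For a self-contained example the snippet uses two synthetic Gaussian clusters as a stand-in for the penguin features; with the real dataset you would pass the two bill-measurement columns instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for (bill length, bill depth) of two species:
# two well-separated clusters, labeled 0 ("Adelie") and 1 ("Gentoo").
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([39, 18], 1.0, size=(50, 2)),
               rng.normal([47, 15], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegression().fit(X, y)

# theta_1, theta_2 and theta_0 of the decision boundary
# theta_0 + theta_1*x1 + theta_2*x2 = 0:
print(model.coef_, model.intercept_)
```

The boundary line can then be drawn by solving the extracted equation for x₂ as a function of x₁.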
Let's discuss the decision boundary in more detail. A logistic regression model gives us the probability of a point belonging to a certain class. In our penguin example, when the linear score θᵀx is passed through the sigmoid function, it returns the probability p that the point belongs to the 'Gentoo' species. The probability that the point belongs to 'Adelie' is simply 1 − p.
Logistic Regression provides us with a probability space which can be divided into three parts based on the decision boundary:
– All the points that lie to the right of the decision boundary have θᵀx > 0 and hence σ(θᵀx) > 0.5, meaning they have a higher probability of belonging to the 'Gentoo' category.
– Similarly, all the points that lie to the left of the decision boundary have θᵀx < 0 and hence σ(θᵀx) < 0.5, meaning they have a higher probability of belonging to the 'Adelie' category.
– All the points that lie exactly on the decision boundary have equal probability of belonging to each class. But since we predict y = 1 whenever σ(θᵀx) ≥ 0.5, all the points on the decision boundary are assigned to the 'Gentoo' class.
This was just a simple example where there were only two categories and they could be separated by a straight line, but in real life we come across much more complex problems. If the two classes required a more complex decision boundary, say a circle, then instead of a purely linear equation we would add polynomial features to increase the model's complexity.
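A minimal sketch of that idea, using synthetic data where one class sits inside a circle: expanding the two raw features into degree-2 polynomial terms lets the still-linear score θᵀφ(x) describe a circular boundary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Synthetic data: class 1 inside the unit circle, class 0 outside.
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Degree-2 polynomial features (x1^2, x1*x2, x2^2, ...) make the
# linear score capable of describing a circular decision boundary.
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))  # near-perfect accuracy on this toy data
```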
A cost function gives an estimate of how badly a model is performing. But before we define the cost function for logistic regression, let’s have a look at two graphs that give us an idea about what we are looking for.
A close look at these graphs makes it clear what they signify.
When the actual label is y = 1 but our model predicts otherwise and says ŷ = 0, the model is misclassifying the point and there is a huge error (or cost). This is what is happening in the first plot: the cost tends to infinity as ŷ approaches 0 while in reality y = 1, and the cost keeps decreasing as the model produces probability values closer to 1. The second plot shows the mirror image for y = 0: the cost tends to infinity as ŷ approaches 1 and shrinks as ŷ approaches 0. Both cases can be written as a single simplified cost function for one point:

Cost(ŷ, y) = −y·log(ŷ) − (1 − y)·log(1 − ŷ)
This simplified cost function gives us the error at a single point. The average of this error across all the points gives us the total error of the model and it’s called Cross Entropy Loss.
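The cross-entropy loss can be sketched directly from its definition (the `eps` clipping is a standard numerical guard against log(0)):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Average logistic cost over all points.

    Per-point cost: -y*log(y_hat) - (1 - y)*log(1 - y_hat).
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])
good = np.array([0.9, 0.1, 0.8, 0.2])  # confident, mostly correct
bad = np.array([0.1, 0.9, 0.2, 0.8])   # confident, mostly wrong
print(cross_entropy(y_true, good))  # small loss
print(cross_entropy(y_true, bad))   # much larger loss
```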
Logistic Regression is a classification algorithm used when the label is a categorical variable. A logistic regression model consists of two parts: a linear equation and an activation function (the sigmoid function). The more complex the decision boundary, the more complex the linear equation needs to be. The score produced by the linear equation is passed through the sigmoid function, which gives the probability that the point belongs to class y = 1. The parameters that define the model are optimized through a technique called Gradient Descent, and the model's performance is measured with the cross-entropy loss function.
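The whole pipeline described in this summary can be sketched end to end with plain batch gradient descent on the cross-entropy loss. This is a minimal illustration on a separable 1-D toy problem, not a production implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=5000):
    """Batch gradient descent on the cross-entropy loss.

    X: (n, d) feature matrix, y: (n,) labels in {0, 1}.
    Returns theta of shape (d + 1,), intercept first.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        y_hat = sigmoid(Xb @ theta)
        grad = Xb.T @ (y_hat - y) / len(y)  # gradient of cross-entropy
        theta -= lr * grad
    return theta

# Toy 1-D example: points above 0 belong to class 1.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic(X, y)
preds = (sigmoid(np.hstack([np.ones((6, 1)), X]) @ theta) > 0.5).astype(int)
print(preds)  # matches y on this separable toy data
```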
Thanks for reading! See you soon in the next article.