Regression analysis Loss Function
In the last article, we discussed the fundamentals of regression analysis and saw why the mean of the normal distribution matters for machine learning models. In this article, we focus on the mathematics behind the regression analysis loss function.
Normal distribution of the dependent variable
In the last article, we saw this image and discussed why the blue line is a better approximation of our data than the green line.
If you are new to the normal distribution and statistics, I recommend getting a good understanding of the normal distribution and of concepts such as mean, median, standard deviation, and probability before reading about regression analysis. We have taken the example of Bangalore city, where the x-axis represents the area in sqft and the y-axis represents the house price. We want to predict the mean price for a specific value of the independent variable, i.e., the mean of the distribution
P (y | x = x1 ).
y = Price we want to predict, we expect this to be the mean
x = Range of all possible areas in sqft
x1 = Specific area in sqft for which you want to predict the mean, e.g., 500 sqft
This expression defines what we are looking for in the best model. Let’s break it down further. Area in sqft is a continuous quantity that can take a range of values. For our city, let’s say it ranges from 500 sqft up to 2000 sqft and we only have data points at multiples of 500. That means our independent variable takes only 4 values, i.e., 500, 1000, 1500, and 2000. In reality it can be any positive value, but we make this assumption for simplicity. As we discussed before, the price on the y-axis follows a normal distribution for each of our 4 area values. The diagram below shows the normal distributions for our dummy data. The first normal distribution is for 500 sqft and the last one is for 2000 sqft.
Here, the blue points represent the observed data. As you can see, for a fixed value of the independent variable, the dependent variable, i.e., the price, follows a normal distribution. The orange point is the mean of each distribution, which is what we want to predict. What is a regression model here? Well, notice that the orange points are aligned in a straight line. Yes, the green line is our desired model. Once you have the green line, you can predict the price for any area in sqft. Remember, the green line, the orange points, and the normal distributions will not be given. We only get the blue data points as our observations, and we need to use that information to find the green line.
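To make the picture concrete, here is a minimal sketch that generates dummy data of this kind. The "true" line (intercept 10, slope 0.05), the noise level, and the price unit are assumptions chosen purely for illustration; they are not taken from real Bangalore data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "true" line (the green line): price = 10 + 0.05 * area
TRUE_INTERCEPT = 10.0
TRUE_SLOPE = 0.05
NOISE_STD = 5.0  # assumed spread of prices around the mean

areas = [500, 1000, 1500, 2000]


def sample_prices(area, n=200):
    """Draw n prices for one area from a normal distribution centred on the line."""
    mean_price = TRUE_INTERCEPT + TRUE_SLOPE * area
    return [random.gauss(mean_price, NOISE_STD) for _ in range(n)]


# The blue points: observed prices for each of the 4 areas
observed = {area: sample_prices(area) for area in areas}

# The sample mean per area estimates the orange point for that area
estimated_means = {area: statistics.mean(prices) for area, prices in observed.items()}

for area in areas:
    true_mean = TRUE_INTERCEPT + TRUE_SLOPE * area
    print(area, round(estimated_means[area], 2), true_mean)
```

With many observations per area, the sample means land close to the line. With only a handful of points per area they would scatter much more, which is the difficulty real data presents.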
Unlike the clean normal distributions in this example, real-world data will be vague and messy. It will contain outliers, and sometimes we will have only one data point for a given area in sqft, which makes it difficult to predict the mean.
**It is the right time for you to understand the T-distribution, as it can help you to predict the mean even if you have very few data points.**
So, how will we get the mean? To be precise, if I know that the dependent variable changes linearly with the independent variable, how can I get this green line? We can use some basic maths to solve this task.
Simple Linear Regression (SLR)
Once we understand the pattern in our data and confirm that it can be generalized by a straight line, we need the equation y = mx + c to represent our model. Here, x is the feature and y is the target. For a given x, y can take infinitely many values depending on m and c. Here m is the slope and c is the intercept, i.e., the value of y at x = 0. Since m and c can take infinitely many values, we can end up with arbitrary lines that are a very bad approximation of how the price changes with the area. Since the green line has to pass through the means of all the distributions, there is exactly one such line, which we get for one particular pair of values of m and c. Here are some definitions that will be used later for reference.
For our single feature X and single label Y, our equation is Y = θ0 + θ1 X
Here θ0 and θ1 are called the parameters of the equation, and we need to find their optimum values to get our machine learning model. Any other values give a different line, which we call a hypothesis. So, in general, we start with a hypothesis, and the model is the special hypothesis whose θ values are optimized to capture how the dependent variable changes with the independent variable. This hypothesis is linear and has no higher-degree polynomial terms. This model is called Simple Linear Regression (SLR).
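The idea that each choice of parameters gives a different line can be sketched in a few lines of Python. The function name `hypothesis` and the two parameter pairs are illustrative assumptions, not values from the article.

```python
def hypothesis(x, theta0, theta1):
    """Predicted price for area x under the line y = theta0 + theta1 * x."""
    return theta0 + theta1 * x


areas = [500, 1000, 1500, 2000]

# Two arbitrary parameter choices give two different lines (two hypotheses)
line_a = [hypothesis(x, theta0=10.0, theta1=0.05) for x in areas]
line_b = [hypothesis(x, theta0=50.0, theta1=0.01) for x in areas]

print(line_a)
print(line_b)
```

Both are valid hypotheses; only the one whose parameters best match the data is the model.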
If our data looks like the table below, where we have 4 features, then the linear regression equation becomes:
Y = θ0 + θ1 X1 + θ2 X2 + θ3 X3 + θ4 X4
A couple of important observations before moving forward.
- 1. The equation is still linear, but our model is no longer a straight line. It is a hyperplane. In 2D a hyperplane is a line, and in 3D it is a flat plane, like a sheet of paper. Since we cannot visualize more than three dimensions, we use the general term hyperplane. The above equation is called multiple linear regression.
- 2. To get to the optimum hyperplane we need to adjust the values of θ0 to θ4.
- 3. Here, x1, x2, x3, x4 are the features, i.e., the values given to us. We can’t pass the features to the equation in their current form. For example, Locality is a text feature, so it has to be converted to numerical values before it is passed to the model. The data is also not normalized, which can affect our model. These things fall under feature engineering and will be covered in separate articles.
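The four-feature equation can be sketched as an intercept plus a weighted sum. In the snippet below the feature meanings, the example row, and the θ values are all hypothetical, and the locality is assumed to be already encoded as a number.

```python
def predict(features, thetas):
    """Linear hypothesis: thetas[0] + thetas[1]*x1 + ... + thetas[k]*xk."""
    if len(thetas) != len(features) + 1:
        raise ValueError("need one theta per feature plus the intercept")
    intercept, weights = thetas[0], thetas[1:]
    return intercept + sum(w * x for w, x in zip(weights, features))


# Hypothetical row: area in sqft, bedrooms, bathrooms, locality (already numeric)
row = [1000.0, 2.0, 2.0, 1.0]

# theta0..theta4, arbitrary values for illustration only
thetas = [5.0, 0.04, 3.0, 2.0, 10.0]

price = predict(row, thetas)
print(price)
```

Changing any of the five θ values tilts or shifts the hyperplane, which is why all of them have to be adjusted together.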
The challenge here is finding the right values of θ. A basic approach is to start with random parameters and then adjust their values until we reach the green line. To do that, we need a metric that tells us how bad our hypothesis is and how close we are getting to our machine learning model after each adjustment.
Loss In Regression – Mean Square Error (MSE)
Let’s consider the single-feature, single-label example we have discussed. As shown in the figure, we have two lines: the green line, which is the model we want, and the orange line, which is the hypothesis. The orange line has random parameters and needs to be optimized. But why do we say the orange line is a bad model in the first place?
The answer is simple. If we assume the orange line is the model, then the values that lie on the line are our predictions. In fig-3 the blue points are the observations for a given area in sqft and the orange points are the predictions. If the distance between the orange and blue points, which is the distance between observation and prediction, is too large, we have probably selected the wrong model and need to optimize the parameters to find a better one. As the figure below shows, the vertical orange lines represent the distance between prediction and observation, and it is quite large.
These vertical orange lines represent the error in the hypothesis. So how do we know how bad our hypothesis is? Well, every time you change a parameter of the hypothesis, you change these vertical orange lines. Adding all the distances gives the total error. However, observations can lie on either side of the line, so the differences can be positive or negative. When summed, positive and negative differences can cancel out and make the total error look small even though the hypothesis fits badly. So, we square each difference to make it non-negative. Summing the squared differences gives the sum of squared errors (SSE), and dividing that sum by the number of observations n gives the mean squared error (MSE).
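Written out in standard notation, where n is the number of observations, y_i is the i-th observed price, and ŷ_i is the corresponding prediction from the hypothesis (the index i is notation assumed here):

$$
\text{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\hat{y}_i = \theta_0 + \theta_1 x_i
$$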
The square removes the negative distances, and we divide the total loss by n to get the average error per prediction. If a hypothesis has a lower MSE, it is closer to the green line. The green line, or best-fit line, has the lowest MSE. This is the metric we are going to use to judge how good or bad our model is. In the MSE equation, ŷ is the predicted value, i.e., the data points we get from the orange line, and we already know that the orange line depends on the parameters θ. So, when we change the parameter values, the loss changes. Since the MSE depends on the square of θ, plotting the loss against θ gives a parabolic curve.
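As a rough illustration of why squaring matters, the sketch below compares two hypotheses on four made-up observations. The prices and parameter values are assumptions for illustration. The flat line has signed errors that cancel to exactly zero, yet its MSE is large.

```python
def mse(observed, predicted):
    """Mean squared error between observed values and predictions."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / len(observed)


areas = [500, 1000, 1500, 2000]
prices = [36.0, 58.0, 87.0, 109.0]  # made-up observations


def predict_all(theta0, theta1):
    """Predictions of the line y = theta0 + theta1 * x for every area."""
    return [theta0 + theta1 * x for x in areas]


good = predict_all(10.0, 0.05)  # a line close to the data
flat = predict_all(72.5, 0.0)   # a flat line at the average price

# Signed errors of the flat line cancel out, hiding how bad it is
signed_sum_flat = sum(y - y_hat for y, y_hat in zip(prices, flat))

print("signed error sum, flat line:", signed_sum_flat)
print("MSE, flat line:", mse(prices, flat))
print("MSE, good line:", mse(prices, good))
```

The plain sum would rate the flat line as perfect, while the MSE correctly ranks it far below the line that follows the data.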
The benefit of the parabolic curve is evident. It has a single minimum, which is also the global minimum, marked by the dotted line. We want the parameter value at which the dotted line meets the x-axis. Any θ other than the optimum value θo is just a hypothesis. So, in a nutshell, we are looking for θo. The process of finding θo is called optimization in machine learning.
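The shape of the curve can be checked numerically. The sketch below holds the intercept fixed (an assumption made only to get a one-dimensional curve), scans a grid of candidate slopes on made-up data, and picks the one with the lowest MSE. This brute-force scan is purely illustrative and is not one of the two methods listed next.

```python
areas = [500, 1000, 1500, 2000]
prices = [36.0, 58.0, 87.0, 109.0]  # made-up observations
THETA0 = 10.0  # intercept held fixed so the loss depends on one parameter only


def mse_for_slope(theta1):
    """MSE of the line y = THETA0 + theta1 * x on the data above."""
    return sum((y - (THETA0 + theta1 * x)) ** 2 for x, y in zip(areas, prices)) / len(areas)


# Candidate slopes from 0.000 to 0.100 in steps of 0.001
candidates = [i / 1000 for i in range(101)]
losses = [mse_for_slope(t) for t in candidates]

# The candidate at the bottom of the parabola
best_index = losses.index(min(losses))
best_slope = candidates[best_index]

print("best slope on the grid:", best_slope)
print("loss at both ends and at the minimum:", losses[0], losses[-1], losses[best_index])
```

The loss falls as the slope approaches the best value and rises again on the other side, so the grid has exactly one lowest point.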
We can get to θo in two ways.
- 1. Ordinary Least Squares
- 2. Gradient Descent
In this article, we covered the MSE loss, which is a common regression analysis loss function. In the next article, we will learn about ordinary least squares and gradient descent.
Thanks for reading the article. We will upload the next article soon.