05 Oct

Regression Analysis – Introduction

In this article, we will learn the difference between regression and classification. This is an introductory article on the basics of regression analysis, and we will publish more articles in this series soon.

Supervised learning is a major branch of machine learning. In supervised learning, both features and labels are available at training time. We use the features and labels to fit a machine learning model, which is then validated on testing data. Don’t worry if you are not familiar with these terms: just think of training data as the data from which we want to learn a pattern, and testing data as the data we use to check how good our model is. More on training data and testing data later.

Regression Vs Classification

Supervised learning is broadly divided into two categories: regression and classification. To fully understand the difference between the two, one needs to understand the difference between continuous and discrete variables. A continuous variable can, in principle, take infinitely many values within a range. An example is time: a person’s time of birth can be recorded by year, month, day, hour, minute, second, and so on down to picoseconds! Another example is length: a person’s height can be 6 feet, 6.002 feet, or 6.0024009 feet, depending on how precise the measuring scale is. A discrete variable, in contrast, takes only countable values: a hand has 5 fingers and a die has 6 faces!

Let’s say we want to predict the price of a house given its area in sqft as the feature. Here the area is the independent variable and the price depends on it. In a real-world scenario, the price depends on many more factors than area, but for simplicity we consider a single feature, i.e., area in sqft, and the label is price. Both price and area happen to be continuous here, but what makes this a regression problem is that the label is continuous; the features may or may not be continuous. If the label were discrete instead, for example whether a house gets sold or not, it would be a classification problem.
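To make the rule concrete, here is a minimal sketch that guesses the task from the label values alone. The function name task_type and both tiny datasets are made up for illustration, and the type check is only a rough heuristic: prices stored as whole numbers would be misread as discrete, so in practice the choice remains a modelling decision.

```python
def task_type(labels):
    """Rough heuristic: guess the task from the Python type of the labels."""
    # Float labels (prices, heights, durations) are treated as continuous.
    if all(isinstance(y, float) for y in labels):
        return "regression"
    # Anything else (strings, booleans, integer class codes) is treated as discrete.
    return "classification"


house_prices = [45.5, 60.0, 72.25]        # made-up prices: continuous label
sold_or_not = ["sold", "unsold", "sold"]  # made-up categories: discrete label

print(task_type(house_prices))  # regression
print(task_type(sold_or_not))   # classification
```

Note that the features play no role in the decision; only the label does.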

Multivariate Equations in Machine Learning

Let’s take a concrete example. In table 1, can you guess which variables are continuous and which are categorical? Area and price are continuous, whereas the number of bedrooms, amenities, and locality are categorical (strictly speaking, the number of bedrooms is a discrete count). The red outline marks the features, and the blue column, i.e. price, is our label. Since we have more than one feature, we call this a multivariate regression problem (statistics texts often call it multiple regression). Our first example, with area as the single feature, is a univariate regression problem. In both cases we have a single label; with more than one label it would be called multilabel regression (also known as multi-output regression).

table-1 (regression analysis data table)
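As a rough sketch of how such a table becomes model input, the snippet below turns a few rows into numeric feature vectors. The rows, column names, and values are all made up (they are not the contents of table 1), and one-hot encoding is just one common way to represent a categorical column such as locality.

```python
# Hypothetical rows in the spirit of table 1; every value here is made up.
rows = [
    {"area_sqft": 500.0, "bedrooms": 1, "locality": "A", "price": 25.0},
    {"area_sqft": 900.0, "bedrooms": 2, "locality": "B", "price": 48.0},
    {"area_sqft": 1200.0, "bedrooms": 3, "locality": "A", "price": 61.0},
]

# Fixed, sorted list of category values so every row gets the same columns.
localities = sorted({row["locality"] for row in rows})


def to_features(row):
    """Continuous columns stay as numbers; locality becomes one 0/1 column per value."""
    one_hot = [1.0 if row["locality"] == loc else 0.0 for loc in localities]
    return [row["area_sqft"], float(row["bedrooms"])] + one_hot


X_multi = [to_features(row) for row in rows]   # several features per row
X_uni = [[row["area_sqft"]] for row in rows]   # a single feature per row
y = [row["price"] for row in rows]             # one label per row

print(X_multi[0])  # [500.0, 1.0, 1.0, 0.0]
print(X_uni[0])    # [500.0]
```

The label list y is the same in both setups; only the number of feature columns changes.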

To keep things simple, let’s go back to our first scenario, where we have only one feature, i.e. area in sqft, and one label, i.e. price. Here is a plot of our data, where the x-axis represents the independent variable and the y-axis represents the dependent variable.

 

fig-2

The concept of variance in regression

In regression analysis, a successful machine learning model explains the variation in the dependent variable in terms of the variation in the independent variable. In the above figure, as we move along the positive direction of the x-axis, a straight-line pattern emerges in the dependent variable. A good machine learning model generalizes this pattern. If the blue points were scattered randomly, changes in the independent variable would produce no pattern in the dependent variable, and regression analysis would not give us a useful model.
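One common way to put a number on "explained variation" is the coefficient of determination, R². The sketch below is only an illustration under made-up assumptions: the data is synthetic, the slope of 0.05 and the noise level are arbitrary, and fit_line and r_squared are helper names introduced here.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Share of the variance in ys that the fitted straight line accounts for."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_residual / ss_total


rng = random.Random(0)
areas = [500.0 + 50.0 * i for i in range(30)]

# Synthetic prices that follow a straight line plus a little noise.
prices_linear = [0.05 * a + rng.gauss(0, 2) for a in areas]
# Synthetic prices with no relation to area at all.
prices_random = [rng.uniform(20, 100) for _ in areas]

r2_linear = r_squared(areas, prices_linear)
r2_random = r_squared(areas, prices_random)
print(r2_linear, r2_random)
```

With a real pattern in the data, R² should come out close to 1; for the random prices it should stay close to 0, which is the case where regression has nothing to learn.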

Luckily, for our data, a straight-line pattern exists. Each blue point represents a row in our training data. For higher-dimensional data such as table 1, the points live in a high-dimensional coordinate system, which is difficult to visualize, so we will stick to a single-feature, single-label setup. Remember, our approach and math stay the same irrespective of the dimension.

As we can observe, as the independent variable changes, a straight-line pattern emerges for the dependent variable, and this pattern is what we want our ML model to predict. But some people may ask why it has to be a straight line; it could be a more complex curve too!

fig-3

Although the green line goes through every point, it can’t be our regression model. To understand why, let me ask a few questions. Let’s say we are predicting housing prices for the Indian city of Bangalore. How did we get the data in the first place?

Example of data distribution

Data collection can be of two types: primary or secondary. In primary data collection, the data scientists or statisticians collect fresh data themselves, for example by sending people out to survey the target population. In secondary data collection, the data comes from existing sources gathered by someone else, for example from the internet. In our case, let’s assume it is primary data collection, i.e. we sent 10 people to collect house prices and areas in sqft from different developers in Bangalore.

In the city, different developers sell apartments of the same size at different prices, depending on their brand value. Some developers sell 500 sqft for a very high price and some for very little, but most prices are close to the mean. As you may have guessed, we can model the price for 500 sqft as a normal distribution, and likewise for any other area. So our data is just a sample from these distributions. When searching for the pattern, we are really searching for the mean of the distribution at each area. That is one reason people suggest collecting more data: with more data we can estimate the mean more confidently!
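The effect of sample size can be simulated in a few lines. This is a sketch under made-up assumptions: TRUE_MEAN and TRUE_SD are hypothetical numbers for the price of a 500 sqft apartment, and the normal distribution is assumed rather than measured.

```python
import random
import statistics

rng = random.Random(42)
TRUE_MEAN = 30.0  # hypothetical mean price for 500 sqft (made-up units)
TRUE_SD = 5.0     # hypothetical spread caused by brand value


def survey_mean(n):
    """Mean price from one simulated survey of n developers."""
    return statistics.fmean([rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)])


def spread_of_means(n, repeats=200):
    """How much the survey mean varies if we repeat the same survey many times."""
    return statistics.stdev([survey_mean(n) for _ in range(repeats)])


spread_small = spread_of_means(10)    # surveys of 10 developers each
spread_large = spread_of_means(1000)  # surveys of 1000 developers each
print(spread_small, spread_large)
```

The means from the larger surveys should cluster much more tightly around TRUE_MEAN than those from the small surveys, which is what "predicting the mean confidently" amounts to.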

The green line goes through the sample data points, which may or may not lie at the mean. So we should not select the green line; rather, we are looking for a generalized pattern, which here is the blue line.
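The argument can be checked numerically with a small sketch. Everything here is an assumption made for illustration: the true mean price is taken to be 0.05 times the area, the sampled prices deviate from it by a fixed amount alternately above and below, the "green line" is modelled as the polynomial that passes through every sample point, and the "blue line" as a least-squares straight line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept (our stand-in for the blue line)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def lagrange(xs, ys, x):
    """Value at x of the polynomial through every sample point (the green curve)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def true_mean(area):
    """Hypothetical mean price for a given area (made-up relationship)."""
    return 0.05 * area


areas = [500.0 + 100.0 * i for i in range(8)]
# Sampled prices sit alternately 2 units above and below the true mean.
prices = [true_mean(a) + (2.0 if i % 2 == 0 else -2.0) for i, a in enumerate(areas)]

slope, intercept = fit_line(areas, prices)

# On the sample itself the green curve is perfect.
green_train_err = max(abs(lagrange(areas, prices, a) - p) for a, p in zip(areas, prices))

# Between the sample points we compare both models against the true mean.
midpoints = [a + 50.0 for a in areas[:-1]]
green_mse = sum((lagrange(areas, prices, x) - true_mean(x)) ** 2 for x in midpoints) / len(midpoints)
blue_mse = sum((slope * x + intercept - true_mean(x)) ** 2 for x in midpoints) / len(midpoints)
print(green_train_err, green_mse, blue_mse)
```

The green curve reproduces the sample with no error, yet between the sample points it swings far away from the mean, while the straight line stays close to it.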

fig-4

The pattern doesn’t always have to be a straight line. In figure 4, the pattern is closer to a parabola than to a straight line, and the blue line is clearly a better approximation of the data than the orange line. Since this article is getting a bit long, we will continue with the concepts of regression analysis in the next article, where I will explain how to formulate the parametric equations and how to optimize our model. Here is a good read about parametric equations before jumping to the next article. If you are searching for a data science course in Bangalore, you can check this link.
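For readers who want to try the parabolic case right away, here is a rough sketch that fits both a straight line and a parabola to the same synthetic data and compares their errors. The data, the helper names polyfit and mse, and the choice of solving the normal equations directly are all assumptions made for this illustration; it is not the optimization method the next article will cover.

```python
import random


def polyfit(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... + c[degree]*x**degree."""
    m = degree + 1
    # Normal equations A c = b, with A[j][k] = sum of x**(j+k) and b[j] = sum of y * x**j.
    A = [[sum(x ** (j + k) for x in xs) for k in range(m)] for j in range(m)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for k in range(col, m):
                A[r][k] -= factor * A[col][k]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for j in reversed(range(m)):
        tail = sum(A[j][k] * coeffs[k] for k in range(j + 1, m))
        coeffs[j] = (b[j] - tail) / A[j][j]
    return coeffs


def mse(xs, ys, coeffs):
    """Mean squared error of the polynomial with the given coefficients."""
    def predict(x):
        return sum(c * x ** k for k, c in enumerate(coeffs))
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(1)
xs = [0.5 + 0.1 * i for i in range(16)]  # area in thousands of sqft (made up)
# Synthetic parabolic pattern plus a little noise.
ys = [10.0 + 20.0 * (x - 1.2) ** 2 + rng.gauss(0, 0.3) for x in xs]

line = polyfit(xs, ys, 1)       # straight-line model
parabola = polyfit(xs, ys, 2)   # parabolic model

mse_line = mse(xs, ys, line)
mse_parabola = mse(xs, ys, parabola)
print(mse_line, mse_parabola)
```

On data like this the parabola should leave a much smaller error than the straight line, matching what figure 4 shows visually.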