Machine Learning Models: Linear Regression

Linear Regression

Linear regression is one of the simplest machine learning algorithms and is usually the first one you will learn in any course on the subject. However, its simplicity does not deprive it of any power. On the contrary, despite its simple nature it is still renowned for its predictive ability; according to the International Business Machines Corporation (IBM), “Linear regression is the most commonly used method of predictive analysis”. It is a type of supervised learning model, meaning it learns from a training set in which every example comes with a label, also known as a target, in order to make predictions. Specifically, linear regression tries to fit a straight line through the given data points to best model the relationship between the inputs and the targets. To do this, it calculates a weighted sum of the inputs plus a bias term.

The bias term is usually denoted θ₀, with the weights for each input denoted θ₁ through θₙ for n input values. Thus, linear regression can be represented as a linear function ŷ = θ₁x₁ + … + θₙxₙ + θ₀, with ŷ being the prediction. Training the model means finding the parameters (θ) that best fit the training data. To train the model, we first find the difference between the expected value (the actual target value) and the predicted value (the model’s output) for a data point of the training set: yᵢ − ŷᵢ. This is called the error for that particular prediction. The error for one specific point isn’t too helpful on its own. What matters most is the total error the model has made, which is known as the cost. The equation used to find the cost of a given model is known as the cost function. The cost function associated with a linear regression model is the Mean Squared Error (MSE). In simpler terms, this just means taking the average (mean) of all of the squared errors, hence the name. For a training set of m data points:

MSE(θ) = (1/m) Σᵢ (yᵢ − ŷᵢ)²

This gives the cost of the model: how far its predictions are from the target values. The next step is to find out how to minimize it. Training a linear regression model consists of finding the values of the weights and the bias that give the smallest possible MSE on the training set.
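To make the weighted sum and the cost concrete, here is a minimal sketch in NumPy. The function names (predict, mean_squared_error) and the toy numbers are my own and purely illustrative; they are not taken from the book or from any library.

```python
import numpy as np

def predict(X, theta0, theta):
    """Weighted sum of the inputs plus the bias term."""
    return X @ theta + theta0

def mean_squared_error(y_true, y_pred):
    """Average of the squared errors over all data points."""
    errors = y_true - y_pred
    return float(np.mean(errors ** 2))

# Toy data (made up for illustration): 3 data points, 2 features each
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([8.0, 5.0, 10.0])

theta0 = 1.0                  # bias term
theta = np.array([2.0, 3.0])  # one weight per feature

y_hat = predict(X, theta0, theta)    # [9., 5., 10.]
cost = mean_squared_error(y, y_hat)  # only the first point is off, by 1, so 1/3
print(y_hat, cost)
```

Only one of the three predictions misses its target, and it misses by 1, so the mean of the squared errors is 1/3.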

Normal Equation

One way to find the ideal set of parameters for a linear regression model is the normal equation. This is a closed-form equation that gives the result directly, in contrast to the iterative method discussed later. To compute it, start by arranging all of the features (x) for every data point in the training data into a matrix (X), with each row representing one recorded instance, usually with an extra column of ones so that the bias θ₀ is found along with the weights. Then create a vector (y) containing all of the target values of the training set. Afterwards, complete the equation:

θ = (XᵀX)⁻¹ Xᵀ y

In other words, multiply the transpose of your matrix X by the matrix itself, take the inverse, and multiply that by the product of the transpose of X and the vector y. This gives you the optimal values of θ, the ones that minimize the cost function. It is an effective way to compute the optimal θ when the number of inputs isn’t that large. However, as the number of features or instances grows, this computation becomes slower and less efficient. This brings up the next common way to train this model, and many others as well.
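Here is a small, self-contained sketch of that computation on synthetic data, so it runs without any download. The data-generating line (y = 4 + 3x plus noise) and the variable names are assumptions of mine for illustration. The last line cross-checks the result against NumPy's least-squares solver, which finds the same solution without forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): y = 4 + 3x plus a little noise
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + 0.1 * rng.standard_normal((m, 1))

# Prepend a column of ones so the bias is learned together with the weight
X_b = np.c_[np.ones((m, 1)), X]

# Normal equation: inverse of (X^T X), times X^T, times y
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Cross-check with NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_best.ravel())   # close to the true values 4 and 3
print(theta_lstsq.ravel())
```

Both approaches should agree to within floating-point error, and both should land near the bias 4 and weight 3 used to generate the data.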

Gradient Descent

Gradient descent is a common optimization algorithm used across a wide variety of machine learning models. The idea behind it is to iteratively change the parameters of the model in order to minimize the overall cost. To do this, the algorithm calculates how much the cost function changes if you change a parameter slightly, by computing the partial derivative of the cost with respect to that parameter.

The best analogy I’ve seen for gradient descent is by Luis Serrano in the Math for Machine Learning & Data Science Specialization hosted by DeepLearning.AI on Coursera. To summarize: imagine you are in a really hot room and want to get to the coldest spot possible. You might take a step in some direction and see whether it is hotter or colder than where you were before. You would keep doing this until every spot you could step to next is hotter than the one you are standing on; at that point you have found the coldest spot in the room. This is essentially how gradient descent works: it takes steps, whose size is dictated by the learning rate, toward the minimum of the cost function. The number of steps taken, or iterations of the training algorithm, is known as the number of epochs.

To implement gradient descent, you compute the partial derivative of the cost function with respect to each parameter θⱼ, using the equation:

∂MSE(θ)/∂θⱼ = (2/m) Σᵢ (θᵀxᵢ − yᵢ) xᵢⱼ

Here m is the number of training instances, xᵢ is the feature vector of the i-th instance (with a leading 1 for the bias term), yᵢ is its target, and xᵢⱼ is its j-th feature.

Note that θᵀx is another commonly used way of writing the prediction ŷ. It expresses the prediction as the product of the transpose of the parameter vector θ and the feature vector x. Instead of computing these partial derivatives individually, a common method is to use batch gradient descent, which calculates all of the derivatives over the whole training set at each step. This involves creating a vector of gradients:

∇MSE(θ) = (2/m) Xᵀ(Xθ − y)
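One way to convince yourself that the gradient vector really measures "how much the cost changes if you change a parameter slightly" is to nudge each parameter by a tiny amount and compare. The sketch below does that on random, made-up data; the helper names (mse, mse_gradient) are mine and purely illustrative.

```python
import numpy as np

def mse(theta, X_b, y):
    """Mean squared error of the linear model X_b @ theta."""
    m = len(X_b)
    return float(np.sum((X_b @ theta - y) ** 2) / m)

def mse_gradient(theta, X_b, y):
    """Vector of partial derivatives of the MSE, computed over the whole set."""
    m = len(X_b)
    return 2 / m * X_b.T @ (X_b @ theta - y)

# Random data (made up): 20 instances, a bias column plus 2 features
rng = np.random.default_rng(1)
X_b = np.c_[np.ones((20, 1)), rng.random((20, 2))]
y = rng.random((20, 1))
theta = rng.standard_normal((3, 1))

analytic = mse_gradient(theta, X_b, y)

# Numerical estimate: nudge one parameter at a time and watch the cost change
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (mse(theta + step, X_b, y) - mse(theta - step, X_b, y)) / (2 * eps)

print(np.hstack([analytic, numeric]))  # the two columns should match closely
```

If the two columns agree, the vectorized formula is computing the same partial derivatives you would get by perturbing each parameter by hand.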

Once you have the gradient vector, you use it to step in the correct direction, which is downhill, opposite to the gradient. This is where the learning rate comes into play. The step consists of subtracting from your parameter vector θ the product of the gradient vector and the learning rate. This is how the learning rate controls the size of each step. The equation looks like this:

θ ← θ − η ∇MSE(θ)

Here η represents the learning rate. The size of the learning rate is important: too large and you may continuously jump over the lowest point, too small and it may take forever to converge on it. However, since the MSE is a convex function, the algorithm is guaranteed to get close to the global minimum (the lowest point) with a small learning rate, as long as you wait long enough (and run through enough epochs). Thus it is usually safer to start with a smaller learning rate and experiment from there to find what works best.
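To see the effect of the learning rate, here is a rough sketch of batch gradient descent run with three different values on the same synthetic data. The function name batch_gradient_descent, the data, and the specific rates are assumptions of mine chosen for illustration, not recommended settings.

```python
import numpy as np

def batch_gradient_descent(X_b, y, eta, epochs, seed=42):
    """Repeat: compute the MSE gradient on the full set, step against it."""
    m = len(X_b)
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((X_b.shape[1], 1))  # random starting point
    for _ in range(epochs):
        gradients = 2 / m * X_b.T @ (X_b @ theta - y)
        theta = theta - eta * gradients
    return theta

# Synthetic data (illustrative only): y = 4 + 3x plus a little noise
rng = np.random.default_rng(0)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + 0.1 * rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

# Exact least-squares solution, used as the reference
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)

theta_good = batch_gradient_descent(X_b, y, eta=0.1, epochs=1000)
theta_slow = batch_gradient_descent(X_b, y, eta=0.0001, epochs=1000)
theta_big = batch_gradient_descent(X_b, y, eta=1.0, epochs=50)

print("eta=0.1    error:", np.linalg.norm(theta_good - theta_exact))
print("eta=0.0001 error:", np.linalg.norm(theta_slow - theta_exact))
print("eta=1.0    error:", np.linalg.norm(theta_big - theta_exact))
```

On this data the middle-of-the-road rate reaches the reference solution, the tiny rate is still far away after the same number of epochs, and the large rate overshoots further with every step instead of settling down.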

Code Implementation

To fully demonstrate how and when to implement and train a linear regression model, I will go through the steps of a regression/prediction task. The dataset and the project itself come from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron; however, all code presented here was typed and published solely by me. The author has expressed his intent to keep the code for the datasets and projects available as open source through his GitHub, which is linked in the reference list at the bottom of this text.

This project assumes you want to find out whether money correlates with happiness. To find out, you collect data on the life satisfaction of certain countries along with their GDP (gross domestic product) per capita. Your goal is to see whether there is some correlation and, if so, to create a model that can predict someone’s expected happiness based on their country’s GDP per capita.

We will start by importing the necessary libraries:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

Next, we go through the steps of downloading and preprocessing the data. This includes opening the data and determining our features/targets. Since we want to be able to determine happiness given money, it stands to reason that our features would be the GDP of the country and our targets will be the life satisfaction of that country.

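# Base URL of the book's data repository
data = "https://github.com/ageron/data/raw/main/"
life_satisfaction = pd.read_csv(data + "lifesat/lifesat.csv")
# Feature: GDP per capita; target: life satisfaction (both kept as 2D arrays)
X = life_satisfaction[["GDP per capita (USD)"]].values
y = life_satisfaction[["Life satisfaction"]].values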

We will then visualize our data, an important step before model selection.

# Scatter plot of life satisfaction against GDP per capita
life_satisfaction.plot(kind="scatter", grid=True,
                       x="GDP per capita (USD)", y="Life satisfaction")
plt.axis([23_500, 62_500, 4, 9])
plt.show()

While the points don’t fall on a perfectly straight line, there definitely looks to be a linear correlation between the GDP and the life satisfaction of the given countries. Therefore a linear regression model should do well at making predictions. Now that we know which model we want to use, let’s train (or fit) the model to the training set. We’ll use the normal equation first:

Normal Equation Implementation:

from sklearn.preprocessing import add_dummy_feature

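# Prepend a column of ones so the bias term is learned along with the weight
X_b = add_dummy_feature(X)
# Normal equation: inverse of (X^T X), times X^T, times y
best_theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y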

This is a code implementation of the same normal equation explained above, with the output assigned to a variable called “best_theta”. Although the actual numbers themselves aren’t all that informative, let’s look at what the normal equation gives us anyway:

best_theta

Now that we have the optimal parameter values, we can start making predictions. Given Cyprus’ GDP per capita of 37,655.2, what would the model predict its life satisfaction to be?

X_new = [[37_655.2]]
X_new_b = add_dummy_feature(X_new)
y_prediction = X_new_b @ best_theta
y_prediction

Our model expects Cyprus to have a life satisfaction rating of about 6.3. Now let’s look at a visualization of the prediction line that the normal equation has fit to our data:

life_satisfaction.plot(kind="scatter", grid=True,
                       x="GDP per capita (USD)", y="Life satisfaction")
# Prediction line: slope * GDP + intercept, drawn in red
plt.plot(X, best_theta[1] * X + best_theta[0], "r-")
plt.axis([23_500, 62_500, 4, 9])
plt.show()

And that is pretty much all there is to linear regression with the normal equation. Now let’s look at solving the same problem with gradient descent.

Gradient Descent Implementation:

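alpha = 0.1    # learning rate (η in the text)
epochs = 1000  # number of gradient descent steps
m = len(X_b)   # number of training instances

np.random.seed(42)
theta = np.random.randn(2, 1)  # random initial parameters

for epoch in range(epochs):
    # Gradient of the MSE over the whole training set (batch gradient descent)
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    # Step against the gradient, scaled by the learning rate
    theta = theta - alpha * gradients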

This randomizes the initial values of our parameter vector, computes the gradient of the cost function at each step, and updates the parameters for the number of epochs we set. Now let’s see what parameters the algorithm returns this time.

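When gradient descent converges, it arrives at the same parameters as the normal equation. One caveat: with a feature as large as raw GDP per capita (values in the tens of thousands), a learning rate of 0.1 is far too large for the updates to settle, so in practice the feature is scaled first or the learning rate is reduced drastically. Remember, gradient descent is useful for many models, not just linear regression. There is one more way to implement and train this model, and it’s the easiest.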
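Here is a sketch of how the comparison with the normal equation can be made to work when the feature is as large as a GDP figure. Everything in it is an assumption of mine for illustration: the data is synthetic (a made-up line plus noise in the same numeric range as the plot above), and standardizing the feature before gradient descent is a technique I am swapping in, not something shown in the original code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for the GDP data: one feature in the tens of thousands
m = 30
X = rng.uniform(23_500, 62_500, size=(m, 1))
y = 4.0 + 5e-5 * X + 0.3 * rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

# Reference: exact least-squares solution on the raw feature
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)

alpha, epochs = 0.1, 1000

# On the raw feature, just five steps with alpha = 0.1 already blow up
theta_raw = np.zeros((2, 1))
for _ in range(5):
    theta_raw = theta_raw - alpha * (2 / m * X_b.T @ (X_b @ theta_raw - y))

# Standardize the feature (zero mean, unit variance), then run the same loop
mu, sigma = X.mean(), X.std()
X_scaled_b = np.c_[np.ones((m, 1)), (X - mu) / sigma]
theta = np.random.default_rng(42).standard_normal((2, 1))
for epoch in range(epochs):
    gradients = 2 / m * X_scaled_b.T @ (X_scaled_b @ theta - y)
    theta = theta - alpha * gradients

# Map the parameters back to the original (unscaled) feature
slope = theta[1, 0] / sigma
intercept = theta[0, 0] - slope * mu
theta_gd = np.array([[intercept], [slope]])

print("raw feature, 5 steps:", theta_raw.ravel())
print("scaled gradient descent:", theta_gd.ravel())
print("least squares:", theta_exact.ravel())
```

With the feature standardized, the same learning rate and epoch count converge, and after mapping back to the original scale the parameters agree with the closed-form solution.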

Scikit-Learn Implementation:

model = LinearRegression()
model.fit(X, y)

model.intercept_, model.coef_

The Scikit-Learn library already has a built-in linear regression model. Using the “.fit” method, we can train the model in one line of code. This way, the previous methods reduce to just three lines of code. Easier is not always better, however: although it is useful to have most of the details abstracted away, it is always important to understand what is going on beneath the simplified code. That said, this is the most common way you will see a linear regression model implemented and trained, as most people won’t need to code a model from scratch. If you ever do need to, now you know how.

Reference List

IBM: https://www.ibm.com/docs/en/db2oc?topic=procedures-linear-regression

Hands-On Machine Learning, 3rd Edition, GitHub: https://github.com/ageron/handson-ml3/tree/main
