Intro

In this post I will recap Module 2 of the Machine Learning Zoomcamp.

The posts for the previous modules, together with the points I received for each module’s homework, are listed below:

  1. Machine Learning Zoomcamp Module 1 - points received: 9 (7/7 for questions + 2 bonus for learning in public)

The gist of the module

The problem tackled in this module was the prediction of car prices based on a dataset containing car features.

This type of problem is called a regression problem: the goal is to predict a continuous value, in this case the price of a car, from a set of car features using a Machine Learning model. The model used for this problem was Linear Regression.

To get a real taste of how such a problem could be solved, the model was implemented from scratch using Python and NumPy, while the dataset was explored and manipulated using Pandas and Matplotlib.

The dataset

As with all things in Machine Learning, the first step is to get the data and make sense of it.

The dataset used in this module was the Car Features and MSRP from Kaggle.

Note that I will be using another dataset, the one from the homework, to illustrate the concepts discussed in this post. While the datasets are different, the process is the same.

As is often the case, the dataset may require some cleaning and preprocessing before it can be used to train a model. One such step was normalizing the column names: to get rid of inconsistencies, all column names were converted to lowercase and spaces were replaced with underscores.

import pandas as pd

url = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv'
df = pd.read_csv(url)

# normalize the column names: lowercase, with whitespace replaced by underscores
cols = df.columns.str.lower().str.replace(r'\s+', '_', regex=True)
df.columns = cols

A similar conversion process can be applied to the values in the columns, to make sure that the data is consistent and easy to work with.

# apply the same normalization to the values of the string (object) columns
string_columns = list(df.dtypes[df.dtypes == 'object'].index)
for col in string_columns:
    df[col] = df[col].str.lower().str.replace(r'\s+', '_', regex=True)

Another common preprocessing step is to handle missing values. For instance, if a column contains missing values, one could replace them with the mean of the column.

# fill missing values in the screen column with the column mean
df['screen'] = df['screen'].fillna(df['screen'].mean())

Looking at the target variable, i.e. the variable we want to predict, is also important. In this case, the target variable is the price (the car price in the module’s dataset, final_price in the homework dataset). It is a good idea to check the distribution of the target variable, as it can give us some insights into the data. One way to do this is to plot a histogram of the target variable, using a library like Seaborn.

import seaborn as sns

# histogram of the target variable
fp = df.final_price
sns.histplot(fp, bins=50)

One immediate observation could be that the data is not normally distributed. This can be a problem, as the Linear Regression model works better with normally distributed data. One way to address this issue is to apply a transformation to the target variable, such as the logarithmic transformation.

import numpy as np

# log1p applies log(1 + x), which also handles values of zero gracefully
fp = np.log1p(df.final_price)

Plotting the histogram of the transformed target variable should show a distribution that is closer to normal.
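
The same Seaborn call from above can be reused on the transformed values (a minimal sketch, reusing the fp variable that now holds the log-transformed prices):

# histogram of the log-transformed target
sns.histplot(fp, bins=50)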

Before training the model, we have to split the data into training, validation and testing sets. The training set is used to train the model, the validation set is used to tune the hyperparameters of the model, and the testing set is used to evaluate the model’s performance. A common split is 60% for training, 20% for validation, and 20% for testing.

Without relying on libraries like Scikit-Learn, we can split the data by shuffling the indices of the dataset and then splitting the indices into the three sets.
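
A minimal sketch of such a split, assuming the 60/20/20 proportions mentioned above and a fixed seed for reproducibility:

n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

# shuffle the indices so that the split is random but reproducible
idx = np.arange(n)
np.random.seed(42)
np.random.shuffle(idx)

df_shuffled = df.iloc[idx]

df_train = df_shuffled.iloc[:n_train].reset_index(drop=True)
df_val = df_shuffled.iloc[n_train:n_train + n_val].reset_index(drop=True)
df_test = df_shuffled.iloc[n_train + n_val:].reset_index(drop=True)

# extract the (log-transformed) target and drop it from the feature frames
y_train = np.log1p(df_train.final_price.values)
y_val = np.log1p(df_val.final_price.values)
y_test = np.log1p(df_test.final_price.values)

del df_train['final_price']
del df_val['final_price']
del df_test['final_price']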

Training the model

Training the model refers to the process by which an algorithm learns the relationship between the input features and the target variable. In this case, the algorithm used was the Linear Regression model. The artifacts of the training process are the weights and the bias of the model.

Linear Regression is a simple model that tries to find the best line that fits the data. The line is defined by the equation y = w1*x1 + w2*x2 + ... + wn*xn + b, where w1, w2, ..., wn are the weights, x1, x2, ..., xn are the input features, and b is the bias. In the module, the bias was named w0. It is also called the intercept: the value of the target variable when all the input features are zero, i.e. when there is no influence from the input features.

For a single feature, x, and its weight, w, the equation becomes y = w*x + b. This is nothing else than the equation of a line, where w is the slope of the line, and b is the y-intercept. The intercept is where the line crosses the y-axis, and the slope is the rate at which the line rises or falls, or mathematically, the change in y divided by the change in x.

Considering we have many observations, we can arrange the input features in a matrix, X, and the target variable in a vector, y. The equation of the model becomes y = X*w + b, where X is a matrix of shape (n_samples, n_features), w is a vector of shape (n_features, 1), and b is a scalar.

Now we are talking about matrix-vector multiplication, which means we can absorb the bias into the weights: we add a column of ones to the matrix X, and the bias becomes the weight associated with that column. This way, the equation becomes y = X*w, where X is a matrix of shape (n_samples, n_features + 1), w is a vector of shape (n_features + 1, 1), and y is a vector of shape (n_samples, 1).

If the matrix X is invertible (i.e. it is a square matrix and its determinant is not zero), we can find the weights, w, by multiplying the inverse of X with y: w = X^-1*y.

If X is not invertible (which is usually the case, since X is rarely square), we make use of the transpose of X, X^T: multiplying X^T by X gives a square matrix, X^T*X, also called the Gram matrix, which may be invertible. If it is, the weights can be found as w = (X^T*X)^-1*X^T*y, where ^-1 denotes the matrix inverse.
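
This formula, known as the normal equation, translates directly into NumPy. Below is a minimal sketch of an unregularized training function, mirroring the structure of the regularized version shown later in this post:

def train_linear_regression(X, y):
  ones = np.ones(X.shape[0]) # column of ones for the bias term
  X = np.column_stack([ones, X]) # prepend it to the feature matrix

  XTX = X.T.dot(X) # Gram matrix
  XTX_inv = np.linalg.inv(XTX) # inverse of the Gram matrix
  w = XTX_inv.dot(X.T).dot(y) # normal equation: w = (X^T X)^-1 X^T y

  return w[0], w[1:] # bias and weights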

If the Gram matrix is not invertible, we need to talk about regularization, a bit later on in this post.

Feature engineering

In order to improve the model performance, we can engineer the features. This means that we can create new features from the existing ones, or transform the existing features in a way that makes them more informative. This basically means that the feature matrix, X, can contain more columns than the original dataset.

For example, if one of the features is the status of a laptop, i.e. whether it is new or refurbished, we can add two new columns to the feature matrix, one for each status, let’s say, is_new and is_refurbished. If the laptop is new, the value of is_new is 1, and the value of is_refurbished is 0. If the laptop is refurbished, the value of is_new is 0, and the value of is_refurbished is 1.
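
A minimal sketch of how such binary columns could be built, assuming a status column holding the values new and refurbished (as normalized earlier):

# one-hot encode the status column into two binary features
df['is_new'] = (df.status == 'new').astype(int)
df['is_refurbished'] = (df.status == 'refurbished').astype(int)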

Regularization

We saw that the Gram matrix may not be invertible. This happens when the matrix is singular, i.e. it has a determinant of zero. This can happen when there are more features than observations, or when the features are collinear, i.e. linearly dependent. Collinearity just means that one or more columns in the feature matrix can be expressed as a linear combination of the other columns. For instance, if we have two columns, x1 and x2, and x2 = 2*x1, then the two columns are collinear.

One way to address this issue is to add a factor to the diagonal of the Gram matrix, called the regularization term. This term is a scalar, r, multiplied with the identity matrix, I, and added to the Gram matrix. The equation becomes w = (X^T*X + r*I)^-1*X^T*y.

This equation can be nicely expressed in Python code:

def train_linear_regression_reg(X, y, r=0.0):
  ones = np.ones(X.shape[0]) # create a vector of ones
  X = np.column_stack([ones, X]) # add the vector of ones as a column to the matrix X

  XTX = X.T.dot(X) # calculate the Gram matrix
  reg = r * np.eye(XTX.shape[0]) # create the regularization term
  XTX = XTX + reg # add the regularization term to the Gram matrix

  XTX_inv = np.linalg.inv(XTX) # calculate the inverse of the Gram matrix
  w = XTX_inv.dot(X.T).dot(y) # calculate the weights

  return w[0], w[1:] # return the bias and the weights

Model evaluation

Once we have our model trained, i.e. we know the weights and the bias, we can evaluate its performance. One way to do this is to calculate the root mean squared error (RMSE) of the model. The RMSE is a measure of the difference between the predicted values and the actual values. The lower the RMSE, the better the model.

The RMSE is calculated as the square root of the mean of the squared differences between the predicted values and the actual values. The equation is RMSE = sqrt(mean((y_pred - y)^2)).

def rmse(y, y_pred):
  se = (y - y_pred) ** 2 # squared errors
  mse = se.mean() # mean squared error
  return np.sqrt(mse) # root mean squared error

We can do this for the validation set and the testing set, and compare the RMSE values. If the RMSE of the testing set is much higher than the RMSE of the validation set, it could be a sign of overfitting, i.e. the model is too complex and it is fitting the noise in the data, rather than the underlying pattern.
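
A sketch of such a comparison, assuming feature matrices X_train, X_val, and X_test (and the corresponding targets) were built from the split described earlier; the regularization value 0.001 is just an illustration:

# train with a small regularization term, then compare validation and test RMSE
w0, w = train_linear_regression_reg(X_train, y_train, r=0.001)

rmse_val = rmse(y_val, X_val.dot(w) + w0)
rmse_test = rmse(y_test, X_test.dot(w) + w0)
print(rmse_val, rmse_test)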

Once we have a fairly good model, we can use it to make predictions on new data.

Using the model

Using the model just means applying the weights and the bias to the new data. The equation is the same as the one used to train the model, y = X*w + b, where X is the feature matrix of the new data, w is the weights of the model, and b is the bias.

def predict(X, w, b):
  return X.dot(w) + b
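
For example, assuming a hypothetical feature matrix X_new built from new data (with the same columns as the training matrix) and the w0 and w returned by training, predictions in the original price scale can be obtained by inverting the log transformation:

y_new_log = predict(X_new, w, w0) # predictions in log space
y_new = np.expm1(y_new_log) # expm1 undoes the earlier log1p transformation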

Conclusion

In this module, we learned how to train a Linear Regression model from scratch, using Python and NumPy. We also learned how to preprocess the data, engineer features, and evaluate the model, and we saw how regularization helps address the issue of collinearity.

The homework code can be found here.