Machine Learning Zoomcamp: Module 3 Recap

Intro

This post is a recap of the Machine Learning Zoomcamp Module 3.

Below are the posts for the previous modules:

Machine Learning Zoomcamp Module 1 - points received: 9 (7/7 for questions + 2 bonus for learning in public)
Machine Learning Zoomcamp Module 2 - points received: 5 (5/6 for questions + 0 bonus for learning in public)

For module 2 I managed to score 5 points out of 6 for the questions. While the code and the result I got were correct, I interpreted the result incorrectly, choosing the wrong answer. Oh well, it happens…

The gist of the module

While the previous module was all about regression, i.e. predicting a continuous value, this module is about classification, i.e. predicting a category. As we have just two categories, we are talking about binary classification.

The dataset

As with all machine learning problems, we begin with the dataset. In this module, the dataset is used to predict customer churn, i.e. whether a customer is likely to leave or stay. The homework uses a different dataset, where the task is to predict whether a customer has subscribed to a term deposit.

Data preparation and EDA

One of the first steps in implementing a machine learning model is to prepare the data and perform exploratory data analysis (EDA).

Regarding data preparation, we need to:

bring the column names and string values to a common format, i.e. lowercase, no spaces, etc.
perform some numerical conversions
handle missing values
encode categorical variables, e.g. yes/no to 1/0
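
The preparation steps above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the course's actual code: the column names (Customer ID, Total Charges, Churn), the values, and the prepare helper are all hypothetical, and filling missing values with 0 is just one possible choice.

```python
import pandas as pd

# Toy data standing in for a real dataset (hypothetical columns and values).
df_raw = pd.DataFrame({
    "Customer ID": ["A-1", "B-2", "C-3"],
    "Total Charges": ["29.85", " ", "108.15"],
    "Churn": ["Yes", "No", "Yes"],
})

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Column names: lowercase, spaces replaced by underscores.
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    # Numerical conversion: strings that cannot be parsed become NaN,
    # and the resulting missing values are filled with 0 (an assumption).
    df["total_charges"] = pd.to_numeric(df["total_charges"], errors="coerce").fillna(0)
    # String values: same normalization as for the column names.
    for col in ["customer_id", "churn"]:
        df[col] = df[col].str.lower().str.replace(" ", "_")
    # Encode the yes/no target as 1/0.
    df["churn"] = (df["churn"] == "yes").astype(int)
    return df

df_prepared = prepare(df_raw)
print(df_prepared)
```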

Regarding EDA, we need to:

check the distribution of the target variable
check the distribution of the features
calculate the correlation between the features
calculate the mutual information between the features and the target variable
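
As a rough sketch of these checks, the snippet below runs them on a tiny made-up table. The values in df_toy are invented for illustration; only the column names are borrowed from the homework dataset. It assumes the target y is already encoded as 1/0 and uses mutual_info_score from scikit-learn for the categorical features.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Invented toy data; "y" is the binary target, already encoded as 1/0.
df_toy = pd.DataFrame({
    "housing": ["yes", "yes", "no", "no", "yes", "no"],
    "contact": ["cellular", "unknown", "cellular", "unknown", "cellular", "unknown"],
    "age": [25, 32, 47, 51, 38, 60],
    "balance": [100, 250, 900, 1200, 400, 1500],
    "y": [0, 0, 1, 1, 0, 1],
})

# Distribution of the target: share of positive examples.
target_rate = df_toy["y"].mean()

# Correlation between the numerical features.
corr = df_toy[["age", "balance"]].corr()

# Mutual information between each categorical feature and the target.
mi = {
    col: mutual_info_score(df_toy[col], df_toy["y"])
    for col in ["housing", "contact"]
}

print(target_rate)
print(corr)
print(mi)
```

In this toy table, housing determines y completely, so its mutual information is much higher than that of contact. On real data the differences are usually far less clear-cut.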

Training a model

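The next step is to train a model. In this module, we used logistic regression. Logistic regression is a linear model: it computes a weighted sum of the features and passes it through the sigmoid function to obtain a probability between 0 and 1. In our case the outcome is binary, so this is the probability of the positive class.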
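
To make the idea of a linear model that outputs a probability more concrete, here is a minimal NumPy sketch. The weights, bias, and feature values are made up for illustration, and predict_proba_linear is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba_linear(x, w, w0):
    # Linear score (bias + weighted sum of features) passed through the sigmoid.
    return sigmoid(w0 + np.dot(x, w))

# Made-up bias, weights, and one feature vector.
w0 = -1.0
w = np.array([0.5, 2.0])
x = np.array([2.0, 0.0])

p = predict_proba_linear(x, w, w0)  # score is -1.0 + 0.5 * 2.0 + 2.0 * 0.0 = 0.0
print(p)
```

A score of exactly zero corresponds to a probability of 0.5; large positive scores approach 1 and large negative scores approach 0.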

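Before proceeding with the training, we need to split the data into training, validation, and test sets. The training set is used to train the model, the validation set is used to tune the hyperparameters, and the test set is used for the final evaluation of the model. The two calls below produce a 60%/20%/20% split: the first one sets 40% of the data aside, and the second one divides that part in half.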

from sklearn.model_selection import train_test_split

df_train, df_remaining = train_test_split(df, test_size=0.4, random_state=42)

df_val, df_test = train_test_split(df_remaining, test_size=0.5, random_state=42)

y_train = df_train.y.values
y_val = df_val.y.values
y_test = df_test.y.values
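
As a quick sanity check of the two-step split, the sketch below applies the same two calls to a made-up DataFrame of 100 rows (df_demo is hypothetical) and verifies the resulting sizes and that no row ends up in more than one part.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with a dummy feature and a dummy binary target.
df_demo = pd.DataFrame({"x": range(100), "y": [0, 1] * 50})

# Step 1: keep 60% for training, set 40% aside.
train_part, remaining = train_test_split(df_demo, test_size=0.4, random_state=42)
# Step 2: divide the remaining 40% equally into validation and test.
val_part, test_part = train_test_split(remaining, test_size=0.5, random_state=42)

sizes = (len(train_part), len(val_part), len(test_part))

# The three parts should not share any rows.
all_indices = set(train_part.index) | set(val_part.index) | set(test_part.index)

print(sizes)
print(len(all_indices))
```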

Performing one-hot encoding

Before training the model, we need to perform one-hot encoding on the categorical variables. One-hot encoding turns each categorical variable into a set of binary columns, one per category, where the column matching the row's category is 1 and the others are 0. This gives the model the numerical input it requires. Below we use the homework dataset to perform one-hot encoding.

from sklearn.feature_extraction import DictVectorizer

numerical_features = ['balance', 'previous', 'pdays', 'campaign', 'age', 'day', 'duration']

categorical_features = ['education', 'housing', 'month', 'contact', 'marital', 'poutcome', 'job']

dict_train = df_train[numerical_features+categorical_features].to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(dict_train)

After the one-hot encoding, the feature names are returned by the dv.get_feature_names_out() method and look like this:

[
  'age',
  'balance',
  'campaign',
  'contact=cellular',
  'contact=telephone',
  'contact=unknown',
  'day',
  'duration',
  'education=primary',
  'education=secondary',
  'education=tertiary',
  'education=unknown',
  'housing=no',
  'housing=yes',
  ...
  'poutcome=failure',
  'poutcome=other',
  'poutcome=success',
  'poutcome=unknown',
  'previous'
]

Training a logistic regression model

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

The model is trained on the training feature matrix and the target variable. The solver parameter is set to liblinear, which is a good choice for small datasets. The C parameter is the inverse of the regularization strength (smaller values mean stronger regularization); it is set to 1.0, which is the default value. The max_iter parameter is set to 1000, which is the maximum number of iterations the solver may take to converge. The random_state parameter is set to 42 to ensure reproducibility.
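
To get a feeling for what the C parameter does, the sketch below fits the same model with several values of C on synthetic data generated by make_classification (not the homework dataset) and compares the size of the learned weights. The chosen C values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    clf.fit(X_demo, y_demo)
    # Euclidean norm of the weight vector: a rough measure of how large the weights are.
    norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in norms.items():
    print(C, round(norm, 3))
```

Stronger regularization (small C) pulls the weights toward zero, which is why the norm for C=0.01 comes out smaller than the one for C=100.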

Making predictions

After training the model, we can make predictions using the validation set.

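# Use the same feature columns as for training (this also leaves out the target column y)
dict_val = df_val[numerical_features+categorical_features].to_dict(orient='records')
# Reuse the DictVectorizer fitted on the training set; do not fit it again
X_val = dv.transform(dict_val)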
y_pred = model.predict_proba(X_val)[:, 1]
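
A DictVectorizer that was fitted on the training data should be applied to new data with transform, so that the columns line up with what the model was trained on. The sketch below, using made-up records, shows what happens otherwise: fitting a new vectorizer on the validation records produces a matrix with a different set of columns.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented records; the validation record contains a category unseen in training.
train_records = [
    {"contact": "cellular", "age": 30},
    {"contact": "telephone", "age": 41},
]
val_records = [{"contact": "unknown", "age": 52}]

dv_demo = DictVectorizer(sparse=False)
X_tr = dv_demo.fit_transform(train_records)

# Reusing the fitted vectorizer keeps the training columns;
# the unseen category 'contact=unknown' is silently ignored.
X_va = dv_demo.transform(val_records)

# Fitting a new vectorizer on the validation data yields different columns.
X_refit = DictVectorizer(sparse=False).fit_transform(val_records)

print(X_tr.shape, X_va.shape, X_refit.shape)
```

With mismatched columns the model either raises an error because of the different number of features or, worse, silently applies its weights to the wrong features.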

One last step is to make a decision based on the predicted probabilities. In this case, we use a threshold of 0.5 to decide whether a customer has subscribed to a term deposit or not.

y_decision = y_pred >= 0.5

After this step, we can calculate the accuracy of the model, i.e. the share of correct decisions on the validation set.

(y_val == y_decision).mean()
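
The threshold of 0.5 is a convention rather than a requirement. As a small sketch on made-up labels and probabilities (y_true and y_prob are invented), the snippet below computes accuracy for a few thresholds with accuracy_score from scikit-learn, which gives the same result as the manual mean of correct decisions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented true labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])

scores = {}
for t in [0.3, 0.5, 0.7]:
    # Turn probabilities into hard 0/1 decisions for this threshold.
    decision = (y_prob >= t).astype(int)
    scores[t] = accuracy_score(y_true, decision)

# The manual computation matches accuracy_score.
manual = ((y_prob >= 0.5).astype(int) == y_true).mean()

print(scores)
print(manual)
```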

We can do one final check by calculating the accuracy of the model using the test set.

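# Same feature columns as for training; the target column y is left out
dict_test = df_test[numerical_features+categorical_features].to_dict(orient='records')
# Reuse the DictVectorizer fitted on the training set; do not fit it again
X_test = dv.transform(dict_test)
y_pred = model.predict_proba(X_test)[:, 1]
y_decision = y_pred >= 0.5
(y_test == y_decision).mean()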

Conclusion

This module was a great introduction to binary classification. We learned how to prepare the data, perform EDA, train a model, and make predictions. We also learned how to evaluate the model using the accuracy metric.

The homework code can be found here.