Intro

In one of my previous posts, I mentioned that I was taking the Machine Learning Zoomcamp course. I have now completed the first module of the course and want to share a recap of what I learned. I’ll also go one step further and add some extra bits of information here and there.

What is Machine Learning?

To answer this, we need to understand what learning is. Learning is the process of acquiring knowledge or skills through experience, study, or being taught. A simple Google search would give you this definition. Wikipedia defines learning similarly, adding that “[…]the ability to learn is possessed by humans, non-human animals, and some machines[…]”. When learning is done by machines, we refer to it as Machine Learning (ML). In simple terms, ML is the process of teaching a computer to learn from data. The computer uses the data to identify patterns and make decisions without being explicitly programmed to do so.

Why do we need ML?

Why would we want a machine to learn? Why is a rule-based system not enough?

The answer is two-fold. First, we cannot anticipate all possible future situations. Think of a robot navigating a maze, for instance. Second, we sometimes just don’t know how to implement the solution. For example, consider the task of recognizing handwritten digits. Without ML algorithms, this would be a very challenging task to implement.

Recognizing people’s faces in images is another good example. We humans do this effortlessly, but writing a computer program to do the same is not trivial without ML.

A rule-based system requires humans to write down all the rules for the computer to follow, whereas with ML, the computer can learn the rules from the data. Let’s consider two other examples: spam detection and car price prediction. Applying a rule-based system to these tasks would be very difficult. In the case of spam detection, the rules would have to change constantly as spammers come up with new ways to bypass the system. In the case of car price prediction, so many factors affect the price of a car that writing down all the rules by hand would be impractical. With ML, the computer can learn the rules from the data and make predictions based on them.
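
To make the spam example concrete, here is a minimal sketch of a hand-written rule-based filter. The keyword list and the function name `is_spam_rule_based` are made up for illustration. The point is only to show how easily a fixed rule is bypassed.

```python
# Hypothetical hand-written rules: every keyword was chosen by a human.
SPAM_KEYWORDS = {"free", "winner", "prize"}


def is_spam_rule_based(message: str) -> bool:
    """Flag a message as spam if it contains any hard-coded keyword."""
    words = message.lower().split()
    # Strip common punctuation so "prize!" still matches "prize".
    return any(word.strip(".,!?") in SPAM_KEYWORDS for word in words)


# The rule catches the obvious message...
print(is_spam_rule_based("You are a WINNER, claim your prize!"))
# ...but a trivial spelling change slips through until a human adds a new rule.
print(is_spam_rule_based("You are a w1nner, claim your pr1ze!"))
```

Every new trick needs another manually written rule, which is exactly the maintenance burden that learning from data is meant to avoid.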

How does ML work?

To put it simply, ML works by learning from examples. The computer is given a set of examples, and it uses these examples to build a model, which is then used to make predictions on new data. The set of examples is called the training data, and the process of building the model is called training. The model is a mathematical representation of the patterns in the data.
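
As a rough illustration of training and prediction, the sketch below fits a straight line to a handful of made-up examples using ordinary least squares. The data and the function names `train` and `predict` are assumptions for this post; real projects would typically use a library instead.

```python
def train(examples):
    """Fit y = slope * x + intercept to (x, y) pairs by ordinary least squares."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    variance = sum((x - mean_x) ** 2 for x, _ in examples)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept  # the "model" is just these two numbers


def predict(model, x):
    """Apply the trained model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: (size in square meters, price in thousands).
training_data = [(50, 150), (70, 210), (90, 270), (110, 330)]

model = train(training_data)   # training: build the model from examples
print(predict(model, 80))      # prediction on an input not in the training data
```

Here the model is nothing more than a slope and an intercept, which is what "a mathematical representation of the patterns in the data" means in the simplest case.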

Types of ML

There are three main types of ML: supervised learning, unsupervised learning, and reinforcement learning.

This post will touch upon supervised learning, while the other two types will be covered in future posts.

Supervised ML (SML)

In supervised learning, the computer observes input-output pairs and learns a function that maps new inputs to outputs. Each input is described by one or more features, while the output is called the label (or target).

For example, if we are trying to predict the price of a house based on its size, the size of the house would be the input or the feature, and the price would be the output or the label. Of course, we would need a lot more features than just the size of the house to make an accurate prediction.

We put all the features together in a feature matrix, usually called X, where each row represents a different example and each column represents a different feature. The labels are put in a target vector, usually called y. For each example (row) in the feature matrix, there is a corresponding label in the target vector. The model is then a function that takes X as input and tries to predict values as close as possible to y. Formally, we can write this as f(X) ≈ y.
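
Here is a small sketch of what X, y, and f could look like in plain Python. The numbers, the `weights`, and the `bias` are invented for illustration. In practice the parameters are learned during training rather than typed in by hand.

```python
# Made-up feature matrix X: one row per house, columns = [size_m2, bedrooms].
X = [
    [50, 1],
    [70, 2],
    [90, 3],
]

# Target vector y: one price (in thousands) for each row of X.
y = [152, 223, 284]

# Hypothetical model parameters, assumed here instead of being learned.
weights = [3.0, 5.0]
bias = 0.0


def f(X):
    """Return one prediction per row of the feature matrix."""
    return [sum(w * v for w, v in zip(weights, row)) + bias for row in X]


predictions = f(X)
# How far each prediction is from its label; f(X) ≈ y means these stay small.
errors = [p - t for p, t in zip(predictions, y)]
print(predictions)
print(errors)
```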

If matrix, vector, or function are not familiar terms to you, this is where maths comes in, but to keep things simple, you can think of a matrix as a table, a vector as a list of numbers, and a function as a rule that takes an input and gives an output.

Let’s see some types of supervised learning problems:

  • Regression: The output is a continuous value (the price of a house)
  • Classification: The output is a discrete category (spam or not spam, the latter also known as ham)
  • Ranking: The output is a ranking of items (ranking search results)
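
The difference between the three problem types shows up mainly in what the model returns. The toy functions below are hypothetical stand-ins for trained models and only illustrate the shape of each output.

```python
def predict_price(size_m2: float) -> float:
    """Regression: returns a continuous number."""
    return 3.0 * size_m2


def classify_message(spam_score: float) -> str:
    """Classification: returns one of a fixed set of categories."""
    return "spam" if spam_score >= 0.5 else "ham"


def rank_results(scores: dict) -> list:
    """Ranking: returns the items ordered from most to least relevant."""
    return sorted(scores, key=scores.get, reverse=True)


print(predict_price(80))
print(classify_message(0.9))
print(rank_results({"page_a": 0.2, "page_b": 0.9, "page_c": 0.5}))
```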

It’s all about the data

In ML, data is everything. The quality of the data is crucial to the success of the model. If the data is bad, the model will be bad. If the data is good, the model can be good, provided no other errors have been made along the way.

You might ask: is there a process to follow when working with data? The answer is yes. It is called CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining. The process consists of six phases:

  • Business Understanding, i.e. understanding the problem you are trying to solve
  • Data Understanding, i.e. understanding the data you have
  • Data Preparation, i.e. preparing the data for modeling
  • Modeling, i.e. building the model
  • Evaluation, i.e. evaluating the model
  • Deployment, i.e. deploying the model

The process is iterative, meaning that you might have to go back and forth between the phases until you get the desired results.

Fortunately, there are tools and libraries that can help you with each phase of the process. Future posts will cover some of these tools and libraries.

Choosing a model

There are many types of models used in machine learning, each suitable for different tasks like classification, regression, clustering, and more. Here are some commonly used machine learning models:

  • Linear Regression
  • Logistic Regression
  • Decision Trees
  • Random Forests
  • Support Vector Machines
  • Neural Networks
  • Convolutional Neural Networks
  • Recurrent Neural Networks
  • K-Nearest Neighbors
  • K-Means Clustering
  • Principal Component Analysis
  • Gradient Boosting
  • Naive Bayes
  • etc.

Considering the vast number of models available, how do you choose the right one for your task? The answer is, it depends. It depends on the problem you are trying to solve, the data you have, and the resources you have available. In general, it is a good idea to start with a simple model and then move on to more complex models if needed.

One way to choose a model is to try different models and see which one performs best on your data. This process is called model selection.

For this we need to divide our data into multiple sets.

One approach is to divide it into two sets: the training set and the test set. The training set is used to train the model, while the test set is used to evaluate the model.
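
A two-way split can be sketched in a few lines of plain Python. The helper name `split_train_test`, the 80/20 ratio, and the seed are assumptions made for this example. scikit-learn also ships a ready-made `train_test_split` function in `sklearn.model_selection`.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=42):
    """Shuffle the examples and hold out a fraction of them for testing."""
    shuffled = list(examples)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


examples = list(range(10))                  # stand-in for ten real examples
train_set, test_set = split_train_test(examples)
print(len(train_set), len(test_set))
```

Shuffling before splitting matters when the data is stored in some meaningful order, for instance sorted by date or by price.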

Another approach is to divide the data into three sets: the training set, the validation set, and the test set.

The training set is used to train the model, and the validation set is used to evaluate it. Based on that evaluation, the model may be tweaked and trained again. This train-evaluate-tweak cycle is repeated, and the best model is chosen based on its performance on the validation set. Only then is the chosen model evaluated on the test set, which gives an estimate of how it will perform on data it has never seen.
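
The sketch below walks through this three-way workflow on toy data. Everything in it is an assumption made for illustration: the 60/20/20 split, the two deliberately simple candidate models, and mean absolute error as the metric.

```python
import random


def split_three_ways(examples, val_fraction=0.2, test_fraction=0.2, seed=1):
    """Shuffle and split into training, validation, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val_set = shuffled[:n_val]
    test_set = shuffled[n_val:n_val + n_test]
    train_set = shuffled[n_val + n_test:]
    return train_set, val_set, test_set


def mean_abs_error(model, examples):
    """Average absolute difference between predictions and labels."""
    return sum(abs(model(x) - y) for x, y in examples) / len(examples)


def fit_constant(train_set):
    """Candidate 1: always predict the mean label seen in training."""
    mean_y = sum(y for _, y in train_set) / len(train_set)
    return lambda x: mean_y


def fit_proportional(train_set):
    """Candidate 2: y = slope * x, fitted by least squares through the origin."""
    slope = sum(x * y for x, y in train_set) / sum(x * x for x, _ in train_set)
    return lambda x: slope * x


# Toy data where the label is exactly three times the feature.
data = [(x, 3 * x) for x in range(1, 21)]
train_set, val_set, test_set = split_three_ways(data)

# Train every candidate on the training set only.
candidates = {
    "constant": fit_constant(train_set),
    "proportional": fit_proportional(train_set),
}

# Compare candidates on the validation set and keep the best one.
val_errors = {name: mean_abs_error(m, val_set) for name, m in candidates.items()}
best_name = min(val_errors, key=val_errors.get)

# Touch the test set once, at the very end, for the final estimate.
test_error = mean_abs_error(candidates[best_name], test_set)
print(best_name, test_error)
```

The test set plays no part in choosing the model, which is what keeps the final number an honest estimate.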

Tools and Libraries

Previously I mentioned that there are tools and libraries that can help you with each phase of the CRISP-DM process. Two of the most commonly used libraries in Python are NumPy and Pandas. Pandas is used for data manipulation and analysis, while NumPy is used for numerical computing.

See the official NumPy and Pandas documentation for more information.
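
As a first taste, the snippet below loads a tiny made-up table into a Pandas DataFrame and converts it to NumPy arrays. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: three houses with two features and a price.
df = pd.DataFrame({
    "size_m2": [50, 70, 90],
    "bedrooms": [1, 2, 3],
    "price": [152, 223, 284],
})

# Pandas: select columns and compute summary statistics.
mean_price = df["price"].mean()

# NumPy: turn the table into a feature matrix X and a target vector y.
X = df[["size_m2", "bedrooms"]].to_numpy()
y = df["price"].to_numpy()

print(X.shape)          # (rows, columns) of the feature matrix
print(mean_price)
print(np.log1p(y))      # element-wise math on the whole vector at once
```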

To make use of these tools, you need to set up a Python environment. One way to do this is to use Anaconda, a distribution of Python that comes with many of the tools and libraries commonly used for data science and machine learning. Once you have Anaconda installed, you can optionally create a virtual environment, activate it, install the libraries you need, and start working on your project.

conda create -n myenvironment python=3.11
conda activate myenvironment
conda install numpy pandas scikit-learn seaborn jupyter
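
Once the environment is active, a quick way to check that the libraries are importable is a small script like the one below. The helper name `find_missing` is made up for this post. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the packages installed above (scikit-learn imports as sklearn).
REQUIRED = ["numpy", "pandas", "sklearn", "seaborn"]


def find_missing(packages):
    """Return the packages that cannot be found in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
print("Missing packages:", missing or "none")
```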

If running all this locally is not possible for you, you can use Google Colab, which is a free cloud-based service that allows you to run Python code in the browser, or even set up everything yourself using some cloud service like AWS, Azure, or GCP.

Conclusion

In this post, I covered the basics of machine learning, the types of machine learning, the importance of data, the CRISP-DM process, choosing a model, and some tools and libraries that can help you with machine learning.