Understanding Machine Learning: A Beginner's Guide to Key Terms
Chapter 1: The Knowledge Gap
Machine Learning (ML) has surged in popularity, leading many to assume that everyone is familiar with its concepts. Terms like "training a model," "training set," and "cost function" are frequently tossed around without consideration for whether everyone comprehends them. It often feels as if you should inherently know these terms.
Here I am, diving into the discussion of machine learning without even defining it at the outset. My aim today is to bridge this knowledge gap by clarifying some of the fundamental terms associated with machine learning that you are expected to understand.
What Is Machine Learning?
The concept of machine learning was first defined by Arthur Samuel in 1959 as follows:
"Machine Learning is the field of study that allows computers to learn without being explicitly programmed." — Arthur Samuel
But how do machines manage to learn without direct programming? Imagine having countless science books stored on your computer; does that mean it possesses knowledge? You could test this by posing a question like, "Why is water wet?" and seeing whether it provides a coherent answer.
One of the initial applications of machine learning was in spam filtering. Nowadays, spam filters are so advanced that most users can't recall the last time they manually flagged an email as spam.
Before we delve deeper into machine learning, consider how you might write a program to identify spam emails without utilizing machine learning.
You would begin by analyzing numerous examples of spam and identifying recurring patterns, such as urgency phrases like "act now" or "limited time offer," along with other words that spammers commonly use. You would then write rules that check for these phrases and filter out any email that matches.
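To make the rule-based approach concrete, here is a minimal sketch in Python. The phrase list is invented purely for illustration and is nowhere near what a real filter would use.

```python
# Toy rule-based spam check: flag an email if it contains any
# hand-written "spammy" phrase (the list is illustrative, not real).
SPAM_PHRASES = ["act now", "limited time offer", "you have won", "free money"]

def looks_like_spam(email_text: str) -> bool:
    text = email_text.lower()
    return any(phrase in text for phrase in SPAM_PHRASES)

print(looks_like_spam("ACT NOW to claim your prize!"))      # True
print(looks_like_spam("Minutes from yesterday's meeting"))  # False
```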
However, spammers constantly evolve their tactics to slip past filters, so your list of rules grows longer and more convoluted with every adaptation.
How Machine Learning Changes the Game
In contrast, a machine learning-based spam filter learns which words appear unusually often in spam compared to legitimate email (often referred to as 'ham') and automatically flags messages containing them.
If spammers adopt new tactics and users start manually flagging those messages, the machine learning algorithm picks up on the change and adjusts its filtering accordingly, without anyone rewriting rules.
This adaptability makes machine learning programs simpler to code and maintain.
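As a rough idea of what this looks like in code, here is a minimal sketch using the scikit-learn library (one common Python choice, not the only one); the four example emails and their labels are invented for illustration.

```python
# Minimal learned spam filter: CountVectorizer turns emails into word
# counts, MultinomialNB learns which words are unusually frequent in spam.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "act now limited time offer",      # spam
    "free money click here",           # spam
    "meeting agenda for monday",       # ham
    "lunch tomorrow at noon",          # ham
]
labels = [1, 1, 0, 0]                  # 1 = spam, 0 = ham

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)                              # learn from examples
print(model.predict(["limited time free offer"]))      # likely [1] (spam)
```

When users flag new spam, those messages simply become additional training examples; retraining on them is what lets the filter adapt without new hand-written rules.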
Machine learning revolves around identifying patterns within vast amounts of raw data (numbers and text) and can extract valuable insights even for hard problems such as predicting stock prices.
Features and Targets in Machine Learning
All machine learning algorithms aim to forecast future outcomes based on existing data. For instance, to create a spam filter, thousands of emails are fed into the algorithm. Each email possesses various characteristics such as word count, sentence count, subject line, and body content.
When predicting outcomes, these characteristics are categorized into two groups: features and the target. The target attribute (also known as the target variable) represents the value you wish to predict—in this case, whether an email is spam or not.
Conversely, features are all the attributes that aid in predicting the target variable. For example, if predicting house prices, the target would be the sale price, while features could include the number of bedrooms, lot size, and proximity to public transport.
When you input existing data into an ML algorithm, you specify which attributes are features and which one is the target. The algorithm examines all features concurrently to identify patterns, learning what combinations of features correspond to specific target variables.
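Here is a small sketch of how features and a target are typically laid out, using pandas; the column names and values are made up for the house-price example.

```python
# Features (X) are the columns used to predict; the target (y) is the
# column we want to predict. All values below are invented.
import pandas as pd

data = pd.DataFrame({
    "bedrooms":         [2, 3, 4],
    "lot_size_sqft":    [4000, 5500, 7200],
    "miles_to_transit": [0.5, 1.2, 3.0],
    "sale_price":       [250_000, 320_000, 410_000],
})

X = data[["bedrooms", "lot_size_sqft", "miles_to_transit"]]  # features
y = data["sale_price"]                                       # target
```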
The nature of the target variable influences the choice of ML algorithm. Generally, algorithms that predict numerical outcomes are termed Regression algorithms, while those that classify data into categories are called Classification algorithms. For instance, Regression is used for predicting house prices, whereas Classification is used for determining email types.
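A small sketch of that distinction, again using scikit-learn with invented data: a numeric target calls for a regression estimator, a categorical target for a classifier.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: numeric target (house price in $1000s from bedrooms, sqft)
X_houses = [[2, 900], [3, 1400], [4, 2000]]
y_prices = [250, 320, 410]
price_model = LinearRegression().fit(X_houses, y_prices)

# Classification: categorical target (1 = spam, 0 = ham)
X_emails = [[8, 3], [6, 2], [0, 1], [1, 0]]   # e.g. [spammy words, exclamation marks]
y_labels = [1, 1, 0, 0]
spam_model = LogisticRegression().fit(X_emails, y_labels)
```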
Understanding Models in Machine Learning
You'll often hear the terms machine learning model and machine learning algorithm used interchangeably, but it's crucial to understand their distinctions.
An ML algorithm is essentially a collection of complex mathematical and statistical techniques. Without data and computational resources, it cannot perform any predictive functions.
An ML model, however, is an algorithm that has been trained on data and is ready to generate predictions for new, similar data.
So, what does "training" mean? It refers to the process of feeding data into an algorithm (also known as fitting a model). This is where the algorithm learns: it applies its mathematical rules to find patterns in the input data and stores what it learns as internal parameters.
Most commonly used algorithms are already implemented in Python libraries such as scikit-learn, which provide standard application programming interfaces (APIs) so you can load and train ML algorithms with minimal code.
In summary, an algorithm is merely a raw, untrained set of mathematical rules, while a model is a trained program ready for predictions.
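The difference shows up directly in code: before training, the algorithm object cannot predict anything; after fitting, it can. A sketch with scikit-learn (the feature values are invented):

```python
from sklearn.linear_model import LinearRegression
from sklearn.exceptions import NotFittedError

algorithm = LinearRegression()            # raw, untrained algorithm
try:
    algorithm.predict([[3, 1400]])        # no training yet -> error
except NotFittedError:
    print("Untrained algorithm: it cannot make predictions")

X = [[2, 900], [3, 1400], [4, 2000]]      # bedrooms, square feet (invented)
y = [250_000, 320_000, 410_000]           # sale prices (invented)
model = algorithm.fit(X, y)               # training turns it into a model

print(model.predict([[3, 1500]]))         # prediction for unseen data
```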
Training and Testing Sets
Now that we've established that machine learning is focused on predicting future values, how do we ensure that a model's predictions are reliable? In other words, how can we assess its performance?
To achieve this, ML engineers divide the available data into two sets: the training set and the test set. The algorithm is trained on the training set, which includes all features and their corresponding labels.
The test set also contains features and labels, but to evaluate prediction accuracy, you provide only the features to the trained model, allowing it to predict the corresponding target labels.
For example, suppose you have 10,000 images of three different rose species, each labeled accordingly. To build a model that classifies these roses, you typically allocate 80% of the data for training.
To assess the model's learning effectiveness, you present the remaining 20% of images for prediction, this time without their labels. After the model makes its predictions, you compare them to the actual labels of those images.
If the model's accuracy is sufficiently high, you can confidently use it to classify future rose images into their respective species.
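In code, the split and the accuracy check might look like the following sketch; since no real rose images are at hand, a synthetic three-class dataset from scikit-learn stands in for them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 1,000 samples, 3 classes (think: 3 rose species)
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_classes=3, random_state=0)

# 80% of the data for training, 20% held back for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict labels for the unseen test features, then compare with the truth
predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))
```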
Supervised vs. Unsupervised Learning
Machine learning models are further categorized based on their training data. The first category is supervised learning, which requires corresponding labels alongside relevant features to establish connections. This is akin to showing a baby an apple and labeling it as such so that the child can identify it in the future.
Supervised learning models operate similarly. For example, to predict the price of a house with 1,000 square feet and three bedrooms, the model must have been trained on numerous examples of houses with similar features, each paired with its sale price (the label).
Because these labels are collected and assigned by humans, this kind of training is referred to as supervised: a person effectively supervises the learning by supplying the correct answers.
However, finding neatly labeled datasets can be challenging. Most data is unstructured and unlabeled, like social media posts. To tackle this, unsupervised learning algorithms come into play.
These algorithms are often more intricate and powerful. The programmer runs the algorithm on the training data and lets it discover structure on its own, for example by grouping similar items together, rather than being told the right answers. A well-known example is the use of neural networks in cutting-edge technologies such as image and speech recognition, autonomous driving, anomaly detection, and sales forecasting.
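As a taste of unsupervised learning, here is a minimal clustering sketch: k-means groups unlabeled points into clusters without ever seeing a label (the blob data is synthetic).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate unlabeled data containing three hidden groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# The algorithm is told only how many clusters to look for, never the labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])   # cluster assignments it discovered on its own
```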
Conclusion
While this overview touches on some foundational aspects of machine learning, it is merely a glimpse into a vast and fascinating field that continues to draw in millions of professionals. I hope this introduction equips you with the basic knowledge needed to navigate more intricate ML topics. Whenever you encounter complex terminology in the machine learning realm, I recommend consulting the ML glossary created by Google Developers. Until next time!