Mastering Polynomial Regression with Python: A Complete Guide
Introduction to Polynomial Regression
In my journey of writing tutorials on machine learning, deep learning, data visualization, and statistical analysis, I've realized that I haven't given much coverage to the simpler machine learning pipelines, which remain quite valuable. While more advanced tools and libraries exist, these foundational techniques are still relevant. This article covers the essentials of polynomial regression, its implementation with the scikit-learn library in Python, and how beginners can recognize and address overfitting.
Understanding Polynomial Regression
Polynomial regression is a fundamental machine learning technique that remains applicable in many business scenarios. It addresses a key limitation of linear regression, which assumes a strictly linear relationship between the input and output variables.
In mathematical terms, linear regression is described by the equation:
Y = C + BX
In this equation, Y represents the output variable, X denotes the input variable, C is the intercept, and B is the slope. In machine learning, we often use different terminology:
h = θ0 + θ1X
Where h stands for the hypothesis or predicted output, X is the input variable, θ1 is the coefficient, and θ0 is the bias term. In practice, however, the relationship between input and output variables is often not linear.
To illustrate, consider a case with a single input variable. In polynomial regression, we construct additional features from the input variable X by raising it to successive powers, which yields a polynomial hypothesis that can model more complex relationships:
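h = θ0 + θ1X + θ2X² + ... + θnXⁿ
Here, n is the degree of the polynomial: higher degrees can fit more complex curves. With several input variables, the expansion also includes interaction terms between them (for example, a term proportional to the product of two inputs).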
For those seeking more in-depth information about polynomial regression, you can refer to this link: Implementation of Polynomial Regression.
Implementation Using scikit-learn
We will now implement polynomial regression with the scikit-learn library in Python, using the insurance dataset from Kaggle. Here’s how to get started:
import pandas as pd
df = pd.read_csv("insurance.csv")
df.head()
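The df.head() call displays the first five rows. The dataset describes insurance customers through columns such as age, sex, bmi, children, smoker, and region, along with the medical charges billed to each person, which we will later predict.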
Data Preparation
It’s crucial to first check the dataset for null values, since their presence can hinder the performance of machine learning models. The following code counts the null values in each column of the DataFrame:
df.isna().sum()
Great news! We found no null values in the dataset. However, since machine learning models require numerical inputs, we need to convert categorical string values into numeric ones:
df['sex'] = df['sex'].replace({'female': 1, 'male': 2})
df['smoker'] = df['smoker'].replace({'yes': 1, 'no': 2})
df['region'] = df['region'].replace({'southwest': 1, 'southeast': 2, 'northwest': 3, 'northeast': 4})
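As an aside, mapping region to the integers 1 through 4 imposes an artificial ordering on what is really a nominal category. A common alternative is one-hot encoding, which you could use instead of the replace calls above; a minimal sketch applied to the original string-valued DataFrame:
# alternative to the integer mapping: one-hot encode the nominal columns
df_encoded = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)
For simplicity, we will continue with the integer encoding in this tutorial.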
This completes our data preparation.
Defining Input Features and Target Variable
In this exercise, our objective is to predict 'charges' based on the other variables in the dataset. Therefore, 'charges' will serve as our output variable, while all other variables will constitute our input features:
X = df.drop(columns=['charges'])
y = df['charges']
Here, X represents the input features and y stands for the output variable.
Splitting the Data for Training and Testing
To effectively train and evaluate our model, we need to separate the dataset into training and testing sets. The scikit-learn library provides a convenient train_test_split method:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
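Here, test_size=0.25 holds out a quarter of the rows for testing, and random_state=1 fixes the shuffle so the split is reproducible.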
Scaling the Data
It's advisable to scale the data so that all features are on a similar scale; this matters especially in polynomial regression, where raising features to higher powers amplifies differences in magnitude. We will employ the StandardScaler from scikit-learn, which standardizes each feature to zero mean and unit variance. Note that the scaler is fit on the training data only and then applied to the test data, so no information leaks from the test set:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Model Development
Since polynomial regression builds upon linear regression, we need imports for both. Compared to plain linear regression, the development process involves a few additional steps:
- Create a PolynomialFeatures transformer with a specified degree; it generates all polynomial and interaction terms of the input features up to that degree.
- Fit the transformer on the training features, then use it to transform both the training and testing data.
- Train a LinearRegression model on the expanded features.
Here’s the code for this process, starting with a deliberately high degree of 6:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=6)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
from sklearn.linear_model import LinearRegression
lin = LinearRegression()
lin.fit(X_train_poly, y_train)
The model training is now complete!
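As an aside, scikit-learn's Pipeline can chain the scaling, feature expansion, and regression into a single estimator, so the fit/transform bookkeeping is handled for you. A minimal sketch, equivalent to the steps above and reusing the classes already imported:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=6), LinearRegression())
model.fit(X_train, y_train)  # raw features go in; scaling and expansion happen inside
We will keep the explicit step-by-step version in this tutorial so each stage stays visible.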
Model Evaluation
We will assess the model's performance using both training and testing datasets. Starting with the test data, we can predict 'charges' as follows:
y_pred = lin.predict(X_test_poly)
To calculate the mean absolute error, i.e. the average absolute difference between the predicted and actual charges:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)
Output:
285296796712.7246
This mean absolute error is enormous! With six input features expanded to degree 6, PolynomialFeatures generates 924 columns (including the bias term), and such a high-degree model extrapolates wildly on unseen data. Now, let's see how the model performs on the training data:
y_pred_train = lin.predict(X_train_poly)
mean_absolute_error(y_train, y_pred_train)
Output:
1970.8913236804585
The mean absolute error on the training data is smaller by several orders of magnitude. The model has essentially memorized the training data, a classic sign of overfitting, and consequently performs poorly on new data.
Addressing Overfitting
To tackle overfitting, we start with hyperparameter tuning. Here, the polynomial degree is the key hyperparameter: by experimenting with different values, we can find one that performs well on both the training and testing datasets. In this instance, we will reduce the degree in the PolynomialFeatures method from 6 to 3.
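Rather than guessing, you could compare candidate degrees side by side. A minimal sketch, reusing the scaled data and imports from above (note that picking the degree by looking at the test error leaks information; a separate validation split or cross-validation would be more rigorous):
for degree in range(1, 7):
    poly_d = PolynomialFeatures(degree=degree)
    X_tr = poly_d.fit_transform(X_train_scaled)
    X_te = poly_d.transform(X_test_scaled)
    model = LinearRegression().fit(X_tr, y_train)
    # print the train and test MAE for each degree
    print(degree,
          mean_absolute_error(y_train, model.predict(X_tr)),
          mean_absolute_error(y_test, model.predict(X_te)))
Guided by this kind of comparison, we rebuild the features with degree 3: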
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
lin.fit(X_train_poly, y_train)
Let’s evaluate the mean absolute error again for both datasets:
y_pred = lin.predict(X_test_poly)
mean_absolute_error(y_test, y_pred)
Output:
2819.746326567164
This is a significant improvement! Next, we should also check the training dataset performance:
y_pred_train = lin.predict(X_train_poly)
mean_absolute_error(y_train, y_pred_train)
Output:
2818.997792755733
Amazing! The mean absolute errors for the training and testing datasets are now closely aligned, at roughly 2819 and 2820 respectively, which suggests the model is no longer overfitting.
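Tuning the degree is not the only remedy for overfitting. Regularized linear models such as ridge regression shrink large coefficients and are a common next step when polynomial features overfit. A minimal sketch, reusing the degree-3 features from above:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)  # alpha controls the regularization strength
ridge.fit(X_train_poly, y_train)
mean_absolute_error(y_test, ridge.predict(X_test_poly))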
Conclusion
This tutorial walked through polynomial regression and its implementation using the scikit-learn library in Python. We also discussed how to detect overfitting and explored ways to mitigate it. I hope you found this information beneficial!
Feel free to connect with me on Twitter and like my Facebook page.
Additional Resources
For further insights, check out the video titled "Polynomial Regression in Python - sklearn" on YouTube.
You may also find valuable information in the video "Lab 22 - Polynomial Regression using Python" on YouTube.