Mastering Polynomial Regression with Python: A Complete Guide
Introduction to Polynomial Regression
In my journey of writing tutorials on machine learning, deep learning, data visualization, and statistical analysis, I've realized that I haven't given much coverage to the simpler machine learning pipelines, which remain quite valuable. While more advanced tools and libraries exist, these foundational techniques are still relevant. This article covers the essentials of polynomial regression, its implementation with the scikit-learn library in Python, and how beginners can recognize and address overfitting.
Understanding Polynomial Regression
Polynomial regression is a fundamental machine learning technique that remains applicable in many business scenarios. It addresses a key limitation of linear regression, which assumes a strictly linear relationship between the input and output variables.
In mathematical terms, linear regression is described by the equation:
Y = C + BX
In this equation, Y represents the output variable, X denotes the input variable, C is the intercept, and B is the slope. In machine learning, we often use different terminology:
h = θ0 + θ1X
Where h stands for the hypothesis or predicted output, X is the input variable, θ1 is the coefficient, and θ0 is the bias term. In practice, however, the relationship between input and output variables is often not linear.
To illustrate, consider a case with a single input variable. In polynomial regression, we construct additional features from the input variable X by raising it to successive powers, which yields a polynomial hypothesis that can model more complex relationships:
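h = θ0 + θ1X + θ2X² + ... + θnXⁿ
Here, n is the degree of the polynomial: higher degrees can fit more complex curves. With several input variables, the expansion also includes interaction terms between them (for example, a term proportional to the product of two inputs).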
For those seeking more in-depth information about polynomial regression, you can refer to this link: Implementation of Polynomial Regression.
Implementation Using scikit-learn
We will now implement polynomial regression with the scikit-learn library in Python, using the insurance dataset from Kaggle. Here’s how to get started:
import pandas as pd
df = pd.read_csv("insurance.csv")
df.head()
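The df.head() call displays the first five rows. The dataset describes insurance customers through columns such as age, sex, bmi, children, smoker, and region, along with the medical charges billed to each person, which we will later predict.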
Data Preparation
It’s crucial to first check the dataset for null values, since their presence can hinder the performance of machine learning models. The following code counts the null values in each column of the DataFrame:
df.isna().sum()
Great news! We found no null values in the dataset. However, since machine learning models require numerical inputs, we need to convert categorical string values into numeric ones:
df['sex'] = df['sex'].replace({'female': 1, 'male': 2})
df['smoker'] = df['smoker'].replace({'yes': 1, 'no': 2})
df['region'] = df['region'].replace({'southwest': 1, 'southeast': 2, 'northwest': 3, 'northeast': 4})
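As an aside, mapping region to the integers 1 through 4 imposes an artificial ordering on what is really a nominal category. A common alternative is one-hot encoding, which you could use instead of the replace calls above; a minimal sketch applied to the original string-valued DataFrame:
# alternative to the integer mapping: one-hot encode the nominal columns
df_encoded = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)
For simplicity, we will continue with the integer encoding in this tutorial.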
This completes our data preparation.
Defining Input Features and Target Variable
In this exercise, our objective is to predict 'charges' based on the other variables in the dataset. Therefore, 'charges' will serve as our output variable, while all other variables will constitute our input features:
X = df.drop(columns=['charges'])
y = df['charges']
Here, X represents the input features and y stands for the output variable.
Splitting the Data for Training and Testing
To effectively train and evaluate our model, we need to separate the dataset into training and testing sets. The scikit-learn library provides a convenient train_test_split method:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
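Here, test_size=0.25 holds out a quarter of the rows for testing, and random_state=1 fixes the shuffle so the split is reproducible.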
Scaling the Data
It's advisable to scale the data so that all features are on a similar scale; this matters especially in polynomial regression, where raising features to higher powers amplifies differences in magnitude. We will employ the StandardScaler from scikit-learn, which standardizes each feature to zero mean and unit variance. Note that the scaler is fit on the training data only and then applied to the test data, so no information leaks from the test set:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Model Development
Since polynomial regression builds upon linear regression, we need imports for both. Compared to plain linear regression, the development process involves a few additional steps:
- Create a PolynomialFeatures transformer with a specified degree; it generates all polynomial and interaction terms of the input features up to that degree.
- Fit the transformer on the training features, then use it to transform both the training and testing data.
- Train a LinearRegression model on the expanded features.
Here’s the code for this process, starting with a deliberately high degree of 6:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=6)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
from sklearn.linear_model import LinearRegression
lin = LinearRegression()
lin.fit(X_train_poly, y_train)
The model training is now complete!
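As an aside, scikit-learn's Pipeline can chain the scaling, feature expansion, and regression into a single estimator, so the fit/transform bookkeeping is handled for you. A minimal sketch, equivalent to the steps above and reusing the classes already imported:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=6), LinearRegression())
model.fit(X_train, y_train)  # raw features go in; scaling and expansion happen inside
We will keep the explicit step-by-step version in this tutorial so each stage stays visible.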
Model Evaluation
We will assess the model's performance using both training and testing datasets. Starting with the test data, we can predict 'charges' as follows:
y_pred = lin.predict(X_test_poly)
To calculate the mean absolute error, i.e. the average absolute difference between the predicted and actual charges:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)
Output:
285296796712.7246
This mean absolute error is enormous! With six input features expanded to degree 6, PolynomialFeatures generates 924 columns (including the bias term), and such a high-degree model extrapolates wildly on unseen data. Now, let's see how the model performs on the training data:
y_pred_train = lin.predict(X_train_poly)
mean_absolute_error(y_train, y_pred_train)
Output:
1970.8913236804585
The mean absolute error on the training data is smaller by several orders of magnitude. The model has essentially memorized the training data, a classic sign of overfitting, and consequently performs poorly on new data.
Addressing Overfitting
To tackle overfitting, we start with hyperparameter tuning. Here, the polynomial degree is the key hyperparameter: by experimenting with different values, we can find one that performs well on both the training and testing datasets. In this instance, we will reduce the degree in the PolynomialFeatures method from 6 to 3.
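Rather than guessing, you could compare candidate degrees side by side. A minimal sketch, reusing the scaled data and imports from above (note that picking the degree by looking at the test error leaks information; a separate validation split or cross-validation would be more rigorous):
for degree in range(1, 7):
    poly_d = PolynomialFeatures(degree=degree)
    X_tr = poly_d.fit_transform(X_train_scaled)
    X_te = poly_d.transform(X_test_scaled)
    model = LinearRegression().fit(X_tr, y_train)
    # print the train and test MAE for each degree
    print(degree,
          mean_absolute_error(y_train, model.predict(X_tr)),
          mean_absolute_error(y_test, model.predict(X_te)))
Guided by this kind of comparison, we rebuild the features with degree 3: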
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
lin.fit(X_train_poly, y_train)
Let’s evaluate the mean absolute error again for both datasets:
y_pred = lin.predict(X_test_poly)
mean_absolute_error(y_test, y_pred)
Output:
2819.746326567164
This is a significant improvement! Next, we should also check the training dataset performance:
y_pred_train = lin.predict(X_train_poly)
mean_absolute_error(y_train, y_pred_train)
Output:
2818.997792755733
Amazing! The mean absolute errors for the training and testing datasets are now closely aligned, at roughly 2819 and 2820 respectively, which suggests the model is no longer overfitting.
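Tuning the degree is not the only remedy for overfitting. Regularized linear models such as ridge regression shrink large coefficients and are a common next step when polynomial features overfit. A minimal sketch, reusing the degree-3 features from above:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)  # alpha controls the regularization strength
ridge.fit(X_train_poly, y_train)
mean_absolute_error(y_test, ridge.predict(X_test_poly))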
Conclusion
This tutorial walked through polynomial regression and its implementation using the scikit-learn library in Python. We also discussed how to detect overfitting and explored ways to mitigate it. I hope you found this information beneficial!
Feel free to connect with me on Twitter and like my Facebook page.
Additional Resources
For further insights, check out the video titled "Polynomial Regression in Python - sklearn" on YouTube.
You may also find valuable information in the video "Lab 22 - Polynomial Regression using Python" on YouTube.