How to Save and Load Machine Learning Models in Python

In this tutorial, you will learn:

  • How to create a basic linear regression model
  • How to save and load an ML model using the Pickle module
  • How to save and load an ML model using the Joblib module

Background and Motivation

Over the past years, Machine Learning (ML) has grown in importance with easy access to data and increasing computational power. Better ML models help to determine future events and decipher consumer trends with greater precision.

For example, models built with ML libraries such as Scikit-learn and Keras help in the diagnosis and detection of skin cancer with high accuracy. Likewise, regression and time series models are widely used in demand forecasting.

ML models go through multiple iterations before they reach the desired level of accuracy, so developing a model takes a considerable amount of time and resources.

Consider a situation where for every scenario, one has to build an ML model from scratch. Do you think it would be even fruitful to consider the ML option for predictions?

Solution Overview

Python offers a solution: with its serialization modules, you can save a trained ML model and load it again at a later stage to predict an outcome.

In this article, we will study two different methods of saving and loading our ML models using Python:

  • Pickle
  • Joblib

Packages required (follow our installation guides):

  1. Scikit-Learn
  2. Pickle (part of the Python standard library, so no separate installation is needed)
  3. SciPy

Method 1: Pickle

In Python, the Pickle module serializes and deserializes an object structure using binary protocols.

ℹ️ Info: Pickling is the process in which an object hierarchy is converted into a byte stream, and unpickling is the exact reverse where the stored byte stream is reconverted into an object.
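
As a minimal sketch of that round trip, the snippet below pickles a plain dictionary to bytes in memory and restores it. The dictionary `settings` and its keys are made up for illustration; `pickle.dumps` and `pickle.loads` are the in-memory counterparts of the file-based `dump` and `load` functions.

```python
import pickle

# Hypothetical object hierarchy: a dict holding a list and a nested dict
settings = {"name": "demo", "weights": [0.1, 0.2, 0.3], "meta": {"version": 1}}

# Pickling: object hierarchy -> byte stream
payload = pickle.dumps(settings)
print(type(payload))

# Unpickling: byte stream -> a new object with the same content
restored = pickle.loads(payload)
print(restored == settings)   # compares content
print(restored is settings)   # compares identity
```

The restored object is equal to the original but is a separate copy, which is the behavior we rely on when a model is reloaded in a later session.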

To demonstrate the Pickle module, we will perform the steps below:

  • Build a plain vanilla linear regression model
  • Save the model (serialization or pickling)
  • Load the saved model (deserialization or unpickling)

Step 1: Load all required packages from sklearn and pickle

# Import Packages
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import pickle

Step 2: Load data set

Scikit-learn ships with several built-in datasets, such as the wine data set.

# Load Wine Data
Wine_data = load_wine()

Step 3: Split wine data

# Split data into test and train datasets
X_train, X_test, Y_train, Y_test = train_test_split(
    Wine_data.data,
    Wine_data.target,
    test_size=0.33,
    random_state=3,
    stratify=Wine_data.target,
)

The train_test_split function splits the wine dataset into two sets: train and test. The model uses the train set to find the optimal weights for the regression. The test set provides an unbiased measure of the model’s effectiveness.

Test size can be any value between 0 and 1. A value of 0.33 means that 33% of the data is held out as test data and 67% is used for training the model.

Random state seeds the shuffling that happens before the split, so the same value always produces the same train and test sets.

Stratify keeps the proportion of each class in the train and test sets close to its proportion in the full data set.
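
The effect of these two parameters can be checked with a short sketch. It splits the wine data twice with the same settings as above and then compares class shares; the variable names (`wine`, `y_test_a`, `y_test_b`, and so on) are chosen here for illustration only.

```python
from collections import Counter

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

# Same random_state twice -> the two test sets should be identical
_, _, _, y_test_a = train_test_split(
    wine.data, wine.target, test_size=0.33, random_state=3, stratify=wine.target
)
_, _, _, y_test_b = train_test_split(
    wine.data, wine.target, test_size=0.33, random_state=3, stratify=wine.target
)
same_split = bool((y_test_a == y_test_b).all())
print(same_split)

# Class shares in the full data set versus the stratified test set
full_counts = Counter(wine.target.tolist())
test_counts = Counter(y_test_a.tolist())
for label in sorted(full_counts):
    full_share = full_counts[label] / len(wine.target)
    test_share = test_counts[label] / len(y_test_a)
    print(label, round(full_share, 2), round(test_share, 2))
```

With stratification, the two share columns should be close to each other for every class.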

Step 4: Create and Train Linear Regression Model

# Initiate Linear Regression and train the model
lreg = LinearRegression().fit(X_train,Y_train)

Step 5: Evaluate R squared for the train and the test data

# Evaluate R squared for train and test data
print(str(lreg.score(X_train,Y_train)))
print(str(lreg.score(X_test,Y_test)))

0.914045280377521
0.855120280462724

The shell returned the above values for R squared, which measures how well the model fits the data. The closer the score is to 1, the better the fit.

P.S. Since the training dataset was used to build the model, it typically scores higher than the test dataset.
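
For reference, the score method of a scikit-learn regressor returns the coefficient of determination, defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand in plain Python; the helper `r_squared` and the sample numbers are made up for illustration and are not taken from the wine model.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot


# Made-up targets and predictions, for illustration only
y_true = [0, 1, 2, 1, 0]
y_pred = [0.1, 0.9, 1.8, 1.2, 0.0]

print(r_squared(y_true, y_pred))
print(r_squared(y_true, y_true))  # perfect predictions give a score of 1
```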

Step 6: Save the model using Pickle

# Save model using dump function of pickle
pck_file = "Pck_LR_Model.pkl"

with open(pck_file, 'wb') as file:  
    pickle.dump(lreg, file)

The dump function writes the linear regression model to the pickle file.

The pickle file 'Pck_LR_Model.pkl' will be stored in the current working directory.

Step 7: Load the model and evaluate R squared

# Reload model using load function of pickle
with open(pck_file,'rb') as file:
    Pickled_LR = pickle.load(file)

# Validate the R squared value on the test data; it should match the original model
print(str(Pickled_LR.score(X_test,Y_test)))

The load function reloads the model into the Pickled_LR object.
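
Beyond comparing scores, one way to confirm that nothing was lost is to compare the fitted coefficients and the predictions directly. The following is a self-contained sketch under a few assumptions: it fits its own model on the full wine data, writes to a temporary directory instead of the working directory, and uses the hypothetical file name `check_model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LinearRegression

wine = load_wine()
model = LinearRegression().fit(wine.data, wine.target)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "check_model.pkl")  # hypothetical file name
    with open(path, "wb") as file:
        pickle.dump(model, file)
    with open(path, "rb") as file:
        reloaded = pickle.load(file)

# The reloaded model should carry the same fitted parameters...
same_coef = np.array_equal(model.coef_, reloaded.coef_)
# ...and therefore produce the same predictions
same_pred = np.allclose(model.predict(wine.data), reloaded.predict(wine.data))
print(same_coef, same_pred)
```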

Method 2: Joblib

Joblib is a set of tools for lightweight pipelining in Python. Its dump and load functions are optimized for Python objects that carry large amounts of data, in particular large NumPy arrays.

Now let’s see how Joblib saves our existing Linear Regression model and reloads it at a later stage for future use.

Step 1: Import joblib

# Import joblib
import joblib

Step 2: Save the model using the dump() function

# Save linear regression model using joblib
jlib_file = "Jlib_LR_Model.pkl"
joblib.dump(lreg,jlib_file)

Step 3: Reload the model using the load() function

# Reload model using load function of joblib
Joblib_LR = joblib.load(jlib_file)

# Validate the R squared value on the test data; it should match the original model
print(str(Joblib_LR.score(X_test,Y_test)))
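
The dump function of joblib also accepts a compress argument. The sketch below compares file sizes with and without compression; `big_object` is a hypothetical stand-in for a model with large array attributes, the file names are made up, and level 3 is just one reasonable choice rather than a recommendation from this tutorial.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical stand-in for a model that holds a large array
big_object = {"weights": np.zeros((1000, 1000))}

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "raw.joblib")
    small_path = os.path.join(tmp_dir, "compressed.joblib")

    joblib.dump(big_object, raw_path)                # default: no compression
    joblib.dump(big_object, small_path, compress=3)  # compression level 3

    raw_size = os.path.getsize(raw_path)
    small_size = os.path.getsize(small_path)
    restored = joblib.load(small_path)

print(raw_size, small_size)
print(np.array_equal(restored["weights"], big_object["weights"]))
```

An array of zeros compresses unusually well, so the size difference for a real model will normally be smaller. Higher compression levels trade slower saving and loading for smaller files.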

Summary

In this article, we learned two methods to save and load an ML model:

  • Pickle: Serializes and deserializes Python objects
  • Joblib: Efficiently saves and loads Python objects that hold large data, with optional compression