Hyperparameter Tuning Using Pipeline: End-to-End ML (Part 2/3)

Mandula Thrimanne · Published in Analytics Vidhya
7 min read · Mar 25, 2024


Hyperparameter tuning while cooking (AI-generated image by Canva ‘MagicDesign’)

Did you know that a data scientist spends up to 80% of their time on data preparation, leaving only a fraction for actual model building and refinement? If you're a data professional, or someone working toward becoming one, you know this from experience. As exciting as building ML models and creating visualizations is, data professionals who invest a considerable amount of time in collecting and preparing data get more fruitful results than those who don't. Because…

“Garbage in, garbage out” — every data person ever at all times

But that leaves us with only 20% of our time to allocate to model building and fine-tuning, which pushes us to find more efficient ways to do it without losing quality. Think of hyperparameter tuning as repeatedly cooking the same meal with different ingredients to find the right amount of each ingredient for the best version of that meal. What if you could combine this repeated process into one task to find your perfect dish? * Italian chef's kiss * Enter Scikit-learn's Pipeline, a tool that streamlines the model-building process and maximizes productivity.

Making the perfect dish using Pipeline (AI-generated image by Canva ‘MagicDesign’)

Advantages of using Pipeline in Machine Learning

  • Seamless Integration of Preprocessing and Modeling

Pipeline allows you to combine multiple preprocessing steps and a final estimator into a single object. This integration ensures that you apply all preprocessing steps consistently during both training and testing, avoiding data leakage and making the workflow more manageable.

  • Simplified Code and Workflow

With Pipeline, you can write cleaner and more concise code by chaining together sequential operations. Instead of manually applying preprocessing steps before fitting the model, you define a Pipeline object with the required steps.

  • Automatic Parameter Tuning (which will be demonstrated in this article)

Pipeline integrates seamlessly with Scikit-learn's GridSearchCV for hyperparameter tuning. You specify a grid of parameter values, and GridSearchCV efficiently searches through all combinations of preprocessing and model parameters, as sketched below.
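To make that last point concrete before we get to the F1 data, here is a minimal, self-contained sketch (using scikit-learn's make_classification toy data, not the images from this project) of a Pipeline plugged into GridSearchCV. make_pipeline names each step after its lowercase class name, so the SVC's C parameter is addressed as 'svc__C'; the same 'stepname__parameter' convention explains the prefixes used later in this article.

# Minimal sketch (toy data, not the F1 dataset) of GridSearchCV over a Pipeline
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=42)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': [1, 10], 'svc__kernel': ['rbf', 'linear']}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_toy, y_toy)   # the scaler is re-fit on each training fold, so there is no data leakage
print(grid.best_params_, grid.best_score_)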

Picking up from where we left off

In the previous article (part 1/3), we learned how to scrape images from Google using Selenium and preprocess the image data using cv2, which gave us a comprehensive set of images to train our model. The first thing we do before creating a Pipeline is to convert our F1 racer names to numeric labels and store them in a dictionary, as shown below.

{'alex albon': 0,
'carlos sainz': 1,
'charles leclerc': 2,
'daniel ricciardo': 3,
'esteban ocon': 4,
'fernando alonso': 5,
'george russell': 6,
'kevin magnussen': 7,
'lance stroll': 8,
'lando norris': 9,
'lewis hamilton': 10,
'logan sargeant': 11,
'max verstappen': 12,
'nico hulkenberg': 13,
'oscar piastri': 14,
'pierre gasly': 15,
'sergio perez': 16,
'valtteri bottas': 17,
'yuki tsunoda': 18,
'zhou guanyu': 19}
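The article doesn't show how this mapping is built; one simple way, assuming the racer names are available from part 1 in folder_path_dict['Name'] (an assumption, the original may build it differently), is to enumerate the sorted names:

# Hypothetical sketch: build the name-to-label mapping from the racer names collected in part 1
class_dict = {name: label for label, name in enumerate(sorted(folder_path_dict['Name']))}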

The next step is to create the X and y variables for our model. When working with images rather than plain numbers, some resizing and reshaping need to be done before adding them to the X and y arrays.

# Resizing and reshaping images

import cv2
import numpy as np

X, y = [], []
for racer_name, training_files in zip(folder_path_dict['Name'], folder_path_dict['Paths']):
    for training_image in training_files:

        img = cv2.imread(training_image)
        if img is None:
            continue

        scaled_raw_img = cv2.resize(img, (32, 32))

        final_img = scaled_raw_img.reshape(32 * 32 * 3, 1)

        X.append(final_img)
        y.append(class_dict[racer_name])

# Reshaping X and changing numbers to float

X = np.array(X).reshape(len(X), 3072).astype(float)

Resizing

Resizing means changing the dimensions of an image, and it is a common image preprocessing step for many reasons. For this example, we reduce our images to 32x32 to be more efficient with time. Since we will test multiple machine learning models with a variety of parameter inputs, Pipeline will use the same set of images many times (14 times in this instance), which strains memory and significantly increases runtime. The trick is to find the "sweet spot": the dimensions beyond which a further increase no longer improves the model's performance much.

Note: Time to run varies depending on the size of your dataset

Reshaping

When we first create the list X as a collection of 'final_img' arrays (as per the code above), we end up with a list of 1895 arrays, each of shape 3072x1. Since most ML models expect a single NumPy array rather than a list of arrays, we convert this list to an array (lists can't be reshaped) and then reshape it into an array of shape 1895x3072.

If we think of arrays as matrices, this is what the transformation looks like.
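The pipeline code later in the article also assumes the data has already been split into training and test sets (X_train_1, X_test_1, y_train_1, y_test_1). The split itself isn't shown; a minimal sketch using scikit-learn's train_test_split (the split ratio, random_state, and stratification are assumptions):

# Assumed (not shown in the article): split the reshaped data into train/test sets
from sklearn.model_selection import train_test_split

X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)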

Creating the Pipeline

There will be 14 models across three machine learning techniques (SVC, Random Forest, Logistic Regression) that we want to experiment with to find the best-performing model for our dataset. Imagine the amount of code duplication if you wanted to test these models out one by one.

14 models to test

This is where Pipeline comes in, automating the entire model-building and cross-validation process to give us the best-performing combination of parameters in each category. The first step is to recreate the above table as a dictionary, shown below, which we will use as an input to the Pipeline.

from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto', probability=True),
        'params': {
            'svc__C': [1, 10, 100, 1000],
            'svc__kernel': ['rbf', 'linear']
        }
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params': {
            'randomforestclassifier__n_estimators': [1, 5, 10]
        }
    },
    'logistic_regression': {
        'model': LogisticRegression(solver='liblinear', multi_class='auto'),
        'params': {
            'logisticregression__C': [1, 5, 10]
        }
    }
}

After creating the above dictionary with your desired models and parameters, we loop over the models and parameters to make the process even more efficient. The pipeline consists of a StandardScaler() for preprocessing and the model specified in mp['model']. As mentioned above, we use GridSearchCV for hyperparameter tuning; note that the keys in each 'params' grid are prefixed with the lowercase class name of the pipeline step (e.g. 'svc__C'), which is how GridSearchCV routes each parameter to the right step created by make_pipeline. best_estimators[algo] stores the best fitted pipeline for each ML model, and if a data frame view with the scores is preferable, that can easily be obtained as well.

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

scores = []
best_estimators = {}
for algo, mp in model_params.items():
    pipe = make_pipeline(StandardScaler(), mp['model'])
    clf = GridSearchCV(pipe, mp['params'], cv=5, return_train_score=False)
    clf.fit(X_train_1, y_train_1)

    # finding the best estimator in each model
    best_estimators[algo] = clf.best_estimator_

    # collecting the best score and parameters for a summary data frame
    scores.append({
        'model': algo,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

df = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])
df
Output from the above code
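The scores list above only records the best score per algorithm. If you also want the cross-validated score of every individual parameter combination, each fitted GridSearchCV stores it in cv_results_; a couple of lines that could be added inside the loop body (after clf.fit) would collect it:

# Lines that could be added inside the loop above (after clf.fit) to keep the
# score of every parameter combination; define all_combo_scores = [] before the loop.
per_combo = pd.DataFrame(clf.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]
per_combo['model'] = algo
all_combo_scores.append(per_combo)

# After the loop, pd.concat(all_combo_scores, ignore_index=True) gives the full breakdown.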

Once we find our "perfect" combination of parameters for each model, we can simply use the 'best_estimators' variable to find the best-performing model out of the three models we tested. Through that, we find that the logistic regression algorithm has the highest accuracy when the C parameter is 1.

print(f"Accuracy of the SVM Model is {best_estimators['svm'].score(X_test_1,y_test_1)}")
print(f"Accuracy of the Random Forest Model is {best_estimators['random_forest'].score(X_test_1,y_test_1)}")
print(f"Accuracy of the Logistic Regression Model is {best_estimators['logistic_regression'].score(X_test_1,y_test_1)}")

We can test our model by randomly selecting some images from the test dataset using the following functions, which compare the actual F1 driver in the image to the result predicted by the model. As you can see from the results below, our model makes decent predictions most of the time but gets it wrong roughly a third of the time.

import numpy as np
import matplotlib.pyplot as plt

# creating a function to get the key when a value is given

def get_key_from_value(dictionary, value):
    for key, val in dictionary.items():
        if val == value:
            return key
    return None

# plot a sample of the image being tested

def plot_sample(X, y, dicti, index):
    plt.figure(figsize=(15, 3))
    plt.imshow(X[index])
    plt.xlabel(get_key_from_value(dicti, y[index]))


def check_output(x):

    # predicting the driver
    prediction = np.round(best_estimators['logistic_regression'].predict(np.expand_dims(X_test_1[x], axis=0))[0])

    # finding the probabilities assigned to each driver by the LR model
    percentages = np.round(best_estimators['logistic_regression'].predict_proba(X_test_1[x].reshape(1, -1)) * 100, 1)[0]

    # finding the highest probability from the above list
    highest_perc = np.max(percentages)

    # formatting
    converted_array = np.array(["{:.2f}".format(number) for number in percentages])
    converted_highest_perc = "{:.2f}%".format(highest_perc)

    driver_name = f"The predicted driver is {get_key_from_value(class_dict, prediction)} with a {converted_highest_perc} probability"

    # plotting the actual (unflattened) test image with its true label
    plot_sample(X_test, y_test, class_dict, x)

    return driver_name
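With these helpers in place, spot-checking the model comes down to calling check_output on a few test indices; a small sketch (the number of samples to check is arbitrary):

# Spot-check the model on a few randomly chosen test images
import random

for idx in random.sample(range(len(X_test_1)), 5):
    print(check_output(idx))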

Since we stuck to a 32x32 image size to improve efficiency, we can try a 64x64 image size with the logistic regression model to see whether it gains some accuracy. However, the increase in image size ends up decreasing the accuracy of the model by 2%. If we want to improve the model further, increasing the size of the dataset and exploring CNN techniques seem to be the most practical next steps. But for now, we will take what we can get and move forward to the web app development stage of our project.
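For reference, the 64x64 experiment mentioned above can be reproduced by rerunning the preprocessing loop with the larger size and refitting a fresh copy of the best logistic regression pipeline. This is a sketch under the same assumptions as before (folder_path_dict, class_dict, and an 80/20 split), not the exact code used for the article:

# Hypothetical sketch: rebuild the dataset at 64x64 and refit the logistic regression pipeline
import cv2
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split

X_64, y_64 = [], []
for racer_name, training_files in zip(folder_path_dict['Name'], folder_path_dict['Paths']):
    for training_image in training_files:
        img = cv2.imread(training_image)
        if img is None:
            continue
        X_64.append(cv2.resize(img, (64, 64)).reshape(64 * 64 * 3))
        y_64.append(class_dict[racer_name])

X_64 = np.array(X_64).astype(float)
X_train_64, X_test_64, y_train_64, y_test_64 = train_test_split(
    X_64, y_64, test_size=0.2, random_state=42, stratify=y_64)

# clone() copies the tuned pipeline (StandardScaler + LogisticRegression with the best C), unfitted
lr_64 = clone(best_estimators['logistic_regression'])
lr_64.fit(X_train_64, y_train_64)
print(f"Accuracy at 64x64: {lr_64.score(X_test_64, y_test_64)}")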
