How to Scrape Images from Google Using Python Selenium: End-to-End ML (Part 1/3)

Mandula Thrimanne
6 min read · Feb 6, 2024
Source: Sky Sports

It’s your friend’s birthday in two days — which means you only have 48 hours to find all the embarrassing photos of them to post on Instagram. Maybe you still scroll down until your thumb gets tired to find funny photos of your sweet friend that you took 5 years ago, and that’s fine. But you may already know that most phones these days make it so much easier to embarrass your friend in such scenarios by having an entire folder dedicated to that person and their goofiness. Thank you, Technology!

Meme 1 (yes, there’s more)

That “technology” is given to us by state-of-the-art facial recognition algorithms that combine computer vision, machine learning, and pattern recognition techniques to analyze and identify faces in images. What if I told you that you — yes, you — could also build a machine-learning model from scratch that will recognize a person when you upload an image of them? Wouldn’t that be a super cool project to embark on?

Since I’m a fan of Formula 1, I decided to build a web app that, when someone uploads an image of an F1 driver, tells you who the driver is with a certain probability. For example, the output would look something like, “This is Lewis Hamilton with a 97% probability”. You can adapt this project to images of anything you find interesting. As the title suggests, part 1 of this three-part series is about scraping data from the web and cleaning that data for our model, because a big driver of any good ML model is good data, and a lot of it.

Data Collection

The web scraping component of this project breaks down into two parts: downloading links of images from HTML “<img>” tags, and then saving those images to our local directory using HTTP GET requests. It is as easy as it sounds, I promise. So, let’s go through the code together.

Meme 2 (maybe there’s more — only one way to find out)

Downloading links of images

The script below runs a set of keywords through a function that opens a Chrome browser, clicks on the first 300 Google Images results for each keyword, extracts the src link of each image, and saves the links to separate Excel files. Writing scraping scripts surely comes with experience, but if I could give you one big fat tip to make your life easier, it would be this: always try finding the element with CSS selectors first. CSS selectors let you target specific HTML elements based on their attributes, position in the document tree, or relationship with other elements. In my experience, the element you’re looking for can be found 95% of the time using CSS selectors alone. This narrows your focus to a single element, rather than spending too much time writing XPaths.
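The original script is embedded as a gist, so here is a minimal sketch of the same idea, assuming Selenium 4.x and a local Chrome install. The helper names (`build_search_url`, `scrape_image_links`, `save_links_to_csv`) are mine, and the bare `img` CSS selector is illustrative: Google’s class names change frequently, so you would refine the selector in your browser’s dev tools.

```python
# Sketch of the link-collection step. Google's markup changes often,
# so the CSS selectors below are placeholders, not guaranteed to work as-is.
import csv
import time
from urllib.parse import quote_plus


def build_search_url(keyword: str) -> str:
    """Build a Google Images search URL for a keyword (tbm=isch = image search)."""
    return f"https://www.google.com/search?q={quote_plus(keyword)}&tbm=isch"


def scrape_image_links(keyword: str, max_images: int = 300) -> list:
    """Open Chrome, click through thumbnails, and collect full-size src links."""
    # Imported inside the function so the URL helper stays usable without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get(build_search_url(keyword))
    links = []
    thumbnails = driver.find_elements(By.CSS_SELECTOR, "img")  # illustrative selector
    for thumb in thumbnails[:max_images]:
        try:
            thumb.click()          # open the preview pane
            time.sleep(1)          # give the full-size image time to load
            for img in driver.find_elements(By.CSS_SELECTOR, "img"):
                src = img.get_attribute("src")
                if src and src.startswith("http"):
                    links.append(src)
                    break
        except Exception:
            continue               # skip thumbnails that fail to open
    driver.quit()
    return links


def save_links_to_csv(links: list, path: str) -> None:
    """Write one src link per row, mirroring the 'separate Excel files' step."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows([link] for link in links)
```

The one-second sleep is a crude wait; in a real run you would swap it for Selenium’s `WebDriverWait` with an expected condition so slow-loading previews don’t hand you thumbnail-sized src links.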

Meme 3 (keep scrolling)

Saving Images

Once we run the script above to extract links of images, we can save images to our local directory using HTTP GET requests.
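The download step can be sketched with just the standard library. The filename scheme (`keyword_index.jpg`) and the `safe_filename`/`download_images` helper names are my assumptions, not the article’s exact convention; the User-Agent header is there because many hosts reject requests without one.

```python
# Sketch of the image-saving step using HTTP GET requests (stdlib only).
import os
import urllib.request


def safe_filename(keyword: str, index: int) -> str:
    """Derive a filesystem-safe name like 'lewis_hamilton_007.jpg'."""
    stem = "_".join(keyword.lower().split())
    return f"{stem}_{index:03d}.jpg"


def download_images(links, keyword, out_dir="images"):
    """Fetch each src link with an HTTP GET and write the raw bytes to disk."""
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(links):
        try:
            req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
            with urllib.request.urlopen(req, timeout=10) as resp:
                data = resp.read()
            with open(os.path.join(out_dir, safe_filename(keyword, i)), "wb") as f:
                f.write(data)
        except Exception as exc:
            # Some scraped links are dead or base64 data URIs; skip them.
            print(f"Skipping link {i}: {exc}")
```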

Data Cleaning

Good job! Your script just downloaded thousands of images from Google without you having to lift a finger. Doesn’t that feel good? However, the journey doesn’t end there. I urge you to go to Google right now and google someone you love and admire (someone who’s actually popular, so no, not your crush). Since our focus is identifying someone by their face, you’ll notice there’s a lot of noise in these pictures. In the case of F1 drivers, be it their partners, team members, cars, pets, or even the driver themselves in a helmet, we don’t want it (at least not as input for our model).

Noisy Data

So, our data cleaning step is to find front-facing images where the face is visible, and then crop those images to include just the face. We call this cropped region the ROI, or region of interest. Instead of analyzing the entire image, focusing on the ROI can improve the efficiency and accuracy of image classification algorithms by concentrating on the most relevant part of the image: in this case, the face of a driver.

Understanding the function

Let’s break down the function step by step:

  • Reading the Image: The function reads the image located at the specified image_path using OpenCV's cv2.imread() function. The image is then stored in the variable img.
  • Converting to Grayscale: The function converts the color image (img) to grayscale using OpenCV's cv2.cvtColor() function. We convert the image to grayscale because grayscale images are easier to process for face and eye detection algorithms.
  • Face Detection: The function uses a pre-trained Haar Cascade classifier (face_cascade) to detect faces in the grayscale image. This is done using the detectMultiScale() function, which returns a list of rectangles representing the detected faces. Then you loop through the coordinates of the rectangles. Each rectangle contains the coordinates (x, y) of the top-left corner, as well as the width (w) and height (h) of the face bounding box.

You can download the required pre-trained Haar Cascade files from here

  • Region of Interest (ROI) Extraction: Within the loop, the function extracts a region of interest (ROI) from both the grayscale and color versions of the image using the coordinates (x, y, w, h) of the detected face. This region corresponds to the area containing the detected face.
  • Eye Detection: Inside the ROI, the function uses another pre-trained Haar Cascade classifier (eye_cascade) to detect eyes. This is done using the detectMultiScale() function, which returns a list of rectangles representing the detected eyes within the face region.
  • Checking for Two or More Eyes: The function checks if the number of detected eyes (len(eyes)) within the face region is greater than or equal to 2. If so, it indicates that the face has at least two eyes. This check is carried out so that we don’t include side angles of racers in our model.
  • Returning Cropped Image: If the condition is met (i.e., if there are at least two eyes detected), the function returns the color ROI (roi_color) containing the face with at least two eyes.

Tada! There you have it. You have successfully cropped all images where drivers’ faces were visible. However, the job’s not fully done yet. Notice how we had other people in pictures with the person of interest. Therefore, we need to manually (I know, I know, the PAIN) delete those images of people whom we don’t want the model to be trained on. As you can see below, we don’t want our model to think Shakira, Toto Wolff, and Lewis Hamilton all look the same — duh!

If I have kept you interested enough to read up to this point, I hope you learned something new. Maybe you can follow me along to see where this ends. I’ll be back with part 2 of this project soon. Till then, keep scraping away and learning new things!

Meme 4 (expect more memes soon)

