To create mosaics we need a lot of images, and the more the better. Unless we already have a large library, we need to find them somewhere, and one logical source is the Internet Movie Database (IMDB). Select a movie or series and go to a photo gallery, like this one for Star Trek: The Next Generation. We could click on each image and save it one by one, but that would take ages: for a gallery the size of Star Trek TNG's, at roughly 10 seconds per image, it would take almost 10 hours to download everything manually. I don't know about you, but I have more interesting things to do than clicking and saving for 10 hours, so let's write a Python script.
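The 10-hour figure is easy to sanity-check. A quick back-of-the-envelope calculation (the image count of ~3,500 is an assumption, back-derived from the stated time and per-image cost) looks like this:

```python
# Rough time estimate for manually downloading a gallery.
# The image count (~3,500) is an assumed figure, inferred from the
# "almost 10 hours at ~10 seconds per image" estimate in the text.
image_count = 3500
seconds_per_image = 10  # click, wait for the viewer, save
total_hours = image_count * seconds_per_image / 3600
print(round(total_hours, 1))  # roughly 9.7 hours
```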
Python has a library that will help us parse HTML, it’s called BeautifulSoup. Let’s get started.
import os                             # file system access
import shutil                         # file operations (streaming a download to disk)
import requests                       # HTTP requests
from bs4 import BeautifulSoup         # HTML parsing
from urllib.request import urlopen    # fetch a page from a URL
Next, we need to set up some variables: how many pages the gallery has and which page to start from. The start page is there in case an error happens, so we can resume at that page instead of downloading the entire gallery all over again. This is also where we enter the address of the desired gallery.
start_page = 0    # resume point in case a run is interrupted
paggination = 71  # number of pages in the gallery
base_url = "https://www.imdb.com/title/tt0092455/mediaindex/"
mediaview_url = "https://www.imdb.com"  # helper for building image-page URLs
We need a folder to place the images in. We could name it ourselves by changing the folder variable, but here we derive the name from part of the address.
#create directory based on base_url
folder = base_url.replace("https://www.imdb.com/title/", "")
if not os.path.exists(folder):
    os.makedirs(folder)
folder = "./" + folder + "/"
print("Created Directory:", folder)
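To illustrate what that snippet produces (a standalone sketch, independent of the script above): stripping the prefix leaves a relative path like tt0092455/mediaindex/, and os.makedirs creates the whole nested directory tree in one call.

```python
import os
import tempfile

base_url = "https://www.imdb.com/title/tt0092455/mediaindex/"
folder = base_url.replace("https://www.imdb.com/title/", "")
print(folder)  # tt0092455/mediaindex/

# os.makedirs handles the nested path in a single call; demonstrated
# here inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, folder)
    os.makedirs(target)
    print(os.path.isdir(target))  # True
```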
Next we create the main loop that goes through the gallery pages, downloads each one, and collects the links to the individual pictures.
for x in range(start_page, paggination):
    if x > 0:
        url = base_url + "?page=" + str(x)
    else:
        url = base_url
    print()
    print("Scraping from:", url)
    htmldata = urlopen(url)
    soup = BeautifulSoup(htmldata, 'html.parser')
    images = soup.find_all(class_='media_index_thumb_list')
    links = images[0].find_all('a')
    print("Found:", len(links), "images")
Then we loop over the thumbnail links and retrieve the page that holds the large image.
    for index, link in enumerate(links):
        thumb = link['href']
        url = mediaview_url + thumb
        print(url)
        imagedata = urlopen(url)
        individual_soup = BeautifulSoup(imagedata, 'html.parser')
        img = individual_soup.find('img')
        if img is None:  # test before subscripting, or a missing tag raises TypeError
            print("No image found")
            continue
        found = img['src']
        print("Downloading " + found)
        file_name = folder + found.split('/')[-1]
Once we get the page with the large image, we hit a small problem. The IMDB media viewer has no unique identifier for large images, and it appears to buffer one or two images in advance for smooth browsing. To keep things simple, we download the first image we find on the large-image page. We also test whether a file with the same name already exists; if it does, the script appends an incrementing counter g to the name until no file with that name is found. This can still produce duplicate images, which we will address later on.
        try:
            res = requests.get(found, stream=True)
            if res.status_code == 200:
                exists = os.path.isfile(file_name)
                g = 0
                while exists:
                    print("file exists:", file_name)
                    g += 1
                    file_name = folder + str(index) + "_" + str(g) + "_" + found.split('/')[-1]
                    exists = os.path.isfile(file_name)
                with open(file_name, 'wb') as f:
                    shutil.copyfileobj(res.raw, f)
                saved = os.path.isfile(file_name)
                if not saved:
                    print("Error Saving >>>> ", saved)
                else:
                    print('Image successfully Downloaded: ', file_name)
            else:
                print('Image couldn\'t be retrieved')
        except Exception as e:
            print(e)
And that’s it, now you can scrape all images from the IMDB gallery for use in AI training or to create mosaics.
Q is not amused.
You can find the complete code on GitHub.
Now, it's possible that you will end up with duplicates, and they are a pain to remove when you have several thousand images of varying sizes, both in dimensions and on disk.
To fix this we will use AntiDupl.Net. Just make sure you download version 2.3.9 and not 2.3.10, as the latter has some issues. Go to the Search menu and select one search path; this makes it easier to quickly search a single folder. Next, click the folder icon labeled Open, select your folder, and click play. The program compares all images within the selected folder and displays the matches. You can go through them one by one and delete them, or select them all and choose which duplicate you wish to delete.