A multi-label classifier, part 1: collecting images


March 9, 2023


March 14, 2023


In 2015, building a computer system that could recognize birds was considered so difficult that it inspired an XKCD joke.

I want to make a bird detector that works fairly reliably with any sort of input image.

A multi-label classifier for birds, cats, and dogs should be a good starting point.

It’s going to take me a lot more than 5 minutes to make a good bird detector, but hopefully I won’t need to hire a research team for five years!

These are my main references:

About halfway through this project, I realized it would be too much for a single blog post, so I’m splitting it into eight posts:

  1. collecting images
  2. reusable functions
  3. Google image search
  4. cleaning up the dataset
  5. finding duplicate images
  6. training the model
  7. deploying with fastai
  8. deploying without fastai

Search engines

I tried a few different search engines to find suitable images:

Install requirements

!pip install -qq -U fastai ipywidgets ipynbname pillow pillow-avif-plugin inflect
!pip install -qq -U selenium webdriver_manager retry
!pip install -qq -U duckduckgo_search
!pip install -qq -U clip-retrieval
!pip install -qq protobuf==3.20.0
!jupyter nbextension enable --py --sys-prefix widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK

Import required libs

%load_ext autoreload
%autoreload 2
import pillow_avif
from fastai.vision.all import *

I collected some reusable functions in a mini library ucm.py; see part 2: reusable functions.

from ucm import *
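For readers following along without ucm.py, here is a minimal sketch of the three helpers that the search loops in this post call. The names `powerset`, `seq_diff`, and `join_a_foo_and_a_bar` come from the code in this post, but the bodies are my assumption, inferred from the queries printed further down; the real implementations are in part 2 and may differ.

```python
from itertools import chain, combinations

def powerset(items):
    """Yield every subset of items as a tuple, smallest subsets first."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def seq_diff(seq, exclude):
    """Return the elements of seq that are not in exclude, keeping their order."""
    return [x for x in seq if x not in exclude]

def join_a_foo_and_a_bar(words):
    """Join nouns as 'a foo and a bar' (assumed: article chosen by first letter)."""
    return " and ".join(("an " if w[0].lower() in "aeiou" else "a ") + w for w in words)

labels = ["bird", "cat", "dog"]
subsets = list(powerset(labels))
print(len(subsets))                            # 8 combinations, including the empty one
print(seq_diff(labels, ("bird",)))             # ['cat', 'dog']
print(join_a_foo_and_a_bar(("bird", "cat")))   # a bird and a cat
```

With three labels this gives eight combinations, which is why eight queries appear in the output of each search loop.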

General setup

data = Path("bird_cat_dog")
null_query = "photo of outdoors"
labels = ["bird", "cat", "dog"]
query_prefix = "photo of "
samples_per_query = 200

Find images using DuckDuckGo

from duckduckgo_search import ddg_images
engine = "ddg"
for comb in powerset(labels):
    others = seq_diff(labels, comb)
    dirname = "_".join(comb) + "_"
    path = data/engine/dirname
    query = query_prefix + join_a_foo_and_a_bar(comb) if comb else null_query
    query += " " + " ".join("-"+x for x in others)
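    try:
        # exist_ok=False: an existing directory marks this query as already downloaded
        path.mkdir(parents=True, exist_ok=False)
    except FileExistsError as e:
        print(f"already downloaded: {query}")
        continue
    print(f"downloading: {query}")
    # prefer Creative Commons images to avoid stock photos, but too few CC images show all three animals
    # "any" selects any Creative Commons license; None turns the license filter off
    license = "any" if len(comb) < 3 else None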
    results = ddg_images(query, max_results=samples_per_query, license_image=license)
    urls = [r["image"] for r in results]
    download_images(dest=path, urls=urls)
already downloaded: photo of outdoors -bird -cat -dog
already downloaded: photo of a bird -cat -dog
already downloaded: photo of a cat -bird -dog
already downloaded: photo of a dog -bird -cat
already downloaded: photo of a bird and a cat -dog
already downloaded: photo of a bird and a dog -cat
already downloaded: photo of a cat and a dog -bird
already downloaded: photo of a bird and a cat and a dog 
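Each directory name encodes its label set: the labels joined with underscores, plus a trailing underscore, so the no-label case becomes "_". As a rough sketch of how those names could be decoded again later, the hypothetical helpers below (`labels_from_dirname` and `multi_hot` are my own names, not part of ucm.py) turn a directory name back into labels and into a 0/1 vector.

```python
def labels_from_dirname(dirname):
    """Split a name like 'bird_cat_' into ['bird', 'cat']; '_' gives []."""
    return [part for part in dirname.split("_") if part]

def multi_hot(found, all_labels):
    """Return one 0/1 flag per known label, in the order of all_labels."""
    return [int(label in found) for label in all_labels]

all_labels = ["bird", "cat", "dog"]
for name in ["_", "bird_", "bird_dog_", "bird_cat_dog_"]:
    found = labels_from_dirname(name)
    print(name, found, multi_hot(found, all_labels))
```

This only works while no label contains an underscore itself, which holds for bird, cat, and dog.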

Find images using LAION

The client deduplicates its results, so each query returns fewer than samples_per_query images, typically around 75% of the number requested.
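As a rough worked example of that shortfall, the arithmetic below estimates how many images survive deduplication and how many would have to be requested to end up with a given number. The 75% figure is only the approximate rate I observed, and `expected_kept` and `requests_needed` are hypothetical helper names for illustration.

```python
import math

def expected_kept(requested, keep_fraction=0.75):
    """Estimate how many results remain after deduplication."""
    return round(requested * keep_fraction)

def requests_needed(target, keep_fraction=0.75):
    """Estimate how many results to request to keep about `target` of them."""
    return math.ceil(target / keep_fraction)

print(expected_kept(200))    # 150
print(requests_needed(200))  # 267
```

I did not over-request in this post; I simply accepted the smaller result sets.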

from clip_retrieval.clip_client import ClipClient
engine = "laion"
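laion = ClipClient(
    # ClipClient requires url and indice_name, but they are missing from this listing;
    # the two values below are assumptions (the public LAION knn service), not necessarily the originals
    url="https://knn.laion.ai/knn-service",
    indice_name="laion5B-L-14",
    aesthetic_score=0, aesthetic_weight=0,
    # assumed: request samples_per_query results per query, as the note on deduplication implies
    num_images=samples_per_query,
)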
for comb in powerset(labels):
    dirname = "_".join(comb) + "_"
    path = data/engine/dirname
    query = query_prefix + ", ".join(comb) if comb else null_query
    try:
        # exist_ok=False: an existing directory marks this query as already downloaded
        path.mkdir(parents=True, exist_ok=False)
    except FileExistsError as e:
        print(f"already downloaded: {query}")
        continue
    print(f"downloading: {query}")
    results = laion.query(text=query)
    urls = [r["url"] for r in results]
    download_images(dest=path, urls=urls)
downloading: photo of outdoors
downloading: photo of bird
downloading: photo of cat
downloading: photo of dog
downloading: photo of bird, cat
downloading: photo of bird, dog
downloading: photo of cat, dog
downloading: photo of bird, cat, dog
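After a download run it is worth checking how many images actually arrived in each folder, since failed downloads and deduplication both reduce the count. The sketch below is one way to do that with pathlib alone; `count_images` is a hypothetical helper, and it assumes the data/engine/dirname layout used above and that downloaded files keep a recognisable image extension.

```python
from pathlib import Path

# Extensions treated as images (an assumption; extend as needed)
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".avif"}

def count_images(root):
    """Map 'engine/dirname' to the number of image files directly inside it."""
    counts = {}
    for folder in sorted(Path(root).glob("*/*")):
        if folder.is_dir():
            n = sum(1 for f in folder.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES)
            counts[f"{folder.parent.name}/{folder.name}"] = n
    return counts

# Usage: prints nothing if the directory does not exist yet
for name, n in count_images("bird_cat_dog").items():
    print(f"{n:4d}  {name}")
```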

references for LAION clip retrieval

Some thoughts

I wanted to avoid stock photos, so I asked DuckDuckGo for Creative Commons licensed images wherever enough were available, which was every query except the one for all three animals. I noticed that these free images tend to be of better quality than the others. This is consistent with Meta AI’s finding with LLaMA, that “it is possible to train state-of-the-art models using publicly available datasets exclusively”.

This project continues in part 2: reusable functions


!mvdata bird_cat_dog
mkdir: created directory '/home/sam/ai/data/blog'
mkdir: created directory '/home/sam/ai/data/blog/posts'
mkdir: created directory '/home/sam/ai/data/blog/posts/multilabel'
mv   renamed 'bird_cat_dog' -> '/home/sam/ai/data/blog/posts/multilabel/bird_cat_dog'
ln   'bird_cat_dog' -> '/home/sam/ai/data/blog/posts/multilabel/bird_cat_dog'
!mv ~/ai/data/blog/posts/multilabel/{bird_cat_dog,bird_cat_dog.orig}