Usage

Minimal example

The package is designed as a library. Here is a minimal example of what you can do (examples/example_api_minimal.py):

#!/usr/bin/python3

# Minimal example. Use the convenience function io.get_image_data() without any
# extra arguments.

from imagecluster import calc, io as icio, postproc

# The bottleneck is calc.fingerprints(), which is called inside this function;
# all other operations are very fast. get_image_data() writes fingerprints to
# disk and loads them again instead of re-calculating them.
images,fingerprints,timestamps = icio.get_image_data('pics/')

# Run clustering on the fingerprints. Select clusters with similarity index
# sim=0.5.
clusters = calc.cluster(fingerprints, sim=0.5)

# Create dirs with links to images. Dirs represent the clusters the images
# belong to.
postproc.make_links(clusters, 'pics/imagecluster/clusters')

# Plot images arranged in clusters.
postproc.visualize(clusters, images)

Have a look at the clusters, which make_links() represents as dirs containing symlinks to the relevant files.

$ tree pics/imagecluster/clusters
pics/imagecluster/clusters
├── cluster_with_2
│   ├── cluster_0
│   │   ├── 140700.jpg -> /path/to/pics/140700.jpg
│   │   └── 140701.jpg -> /path/to/pics/140701.jpg
│   ├── cluster_1
│   │   ├── 140100.jpg -> /path/to/pics/140100.jpg
│   │   └── 140101.jpg -> /path/to/pics/140101.jpg
│   ├── cluster_2
│   │   ├── 140600.jpg -> /path/to/pics/140600.jpg
│   │   └── 140601.jpg -> /path/to/pics/140601.jpg
│   ├── cluster_3
│   │   ├── 140400.jpg -> /path/to/pics/140400.jpg
│   │   └── 140401.jpg -> /path/to/pics/140401.jpg
│   ├── cluster_4
│   │   ├── 140000.jpg -> /path/to/pics/140000.jpg
│   │   └── 140001.jpg -> /path/to/pics/140001.jpg
│   ├── cluster_5
│   │   ├── 140501.jpg -> /path/to/pics/140501.jpg
│   │   └── 140502.jpg -> /path/to/pics/140502.jpg
│   ├── cluster_6
│   │   ├── 140300.jpg -> /path/to/pics/140300.jpg
│   │   └── 140301.jpg -> /path/to/pics/140301.jpg
│   └── cluster_7
│       ├── 140200.jpg -> /path/to/pics/140200.jpg
│       └── 140201.jpg -> /path/to/pics/140201.jpg
└── cluster_with_3
    └── cluster_0
        ├── 140801.jpg -> /path/to/pics/140801.jpg
        ├── 140802.jpg -> /path/to/pics/140802.jpg
        └── 140803.jpg -> /path/to/pics/140803.jpg
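
To inspect the same layout from Python instead of tree, a minimal sketch using only the standard library could look like the following. The helper name list_clusters is hypothetical and not part of imagecluster; it only assumes the cluster_with_N/cluster_M directory naming shown above.

```python
from pathlib import Path


def list_clusters(root):
    """Map 'cluster_with_N/cluster_M' to the sorted names of its entries.

    Hypothetical helper, not part of imagecluster. It only relies on the
    directory naming shown in the tree output above.
    """
    result = {}
    for size_dir in sorted(Path(root).glob("cluster_with_*")):
        for cluster_dir in sorted(size_dir.glob("cluster_*")):
            key = f"{size_dir.name}/{cluster_dir.name}"
            # Each entry is a symlink named after the image it points to.
            result[key] = sorted(p.name for p in cluster_dir.iterdir())
    return result
```

Calling list_clusters('pics/imagecluster/clusters') on the tree above would give one entry per cluster dir, for instance the key cluster_with_3/cluster_0 mapped to its three file names.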

Here is a visual representation made by visualize().

[Figure: ../_images/clusters.png]

So there are eight clusters with 2 images each, and one with 3 images.

For this example, we use a very small subset of the Holiday image dataset: 25 images (all named 140*.jpg) out of the 1491 images in the dataset. See examples/inria_holiday.sh for how to select such a subset:

#!/bin/sh

# select 25 images
#   ./this.sh jpg/100*
#
# select 274 images
#   ./this.sh jpg/10*

if ! [ -d jpg ]; then
    for name in jpg1 jpg2; do
        wget ftp://ftp.inrialpes.fr/pub/lear/douze/data/${name}.tar.gz
        tar -xzf ${name}.tar.gz
    done
fi

mkdir -p pics
rm -rf pics/*
for x in "$@"; do
    f=$(echo "$x" | sed -re 's|jpg/||')
    ln -s "$(readlink -f "jpg/$f")" "pics/$f"
done

echo "#images: $(ls pics | wc -l)"

$ /path/to/imagecluster/examples/inria_holiday.sh jpg/140*

Here is the result of using a larger subset of 292 images from the same dataset (inria_holiday.sh jpg/14*):

[Figure: ../_images/clusters_many.png]

You may have noticed that in the 25-image example above, only 19 out of 25 images are put into clusters. The remaining 6 are not assigned to any cluster. Technically, they are in clusters of size 1, which we don’t report by default (unless you use calc.cluster(..., min_csize=1)). You can now start lowering sim to find a good balance between clustering accuracy and the tolerable amount of dissimilarity among images within a cluster. See Clustering and similarity index.
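
The 19-out-of-25 figure can be checked with a few lines of Python. This is only a sketch: it assumes that the clusters object is a dict mapping cluster size to a list of clusters, each cluster being a list of file names, which matches the cluster_with_N/cluster_M layout above but should be verified against the calc.cluster() documentation. The helper count_clustered and the file names are hypothetical.

```python
def count_clustered(clusters):
    """Return the number of images that belong to any reported cluster."""
    return sum(len(cluster) for group in clusters.values() for cluster in group)


# Hypothetical data mirroring the tree above: eight clusters of 2, one of 3.
# The assumed layout is {cluster_size: [[filename, ...], ...]}.
clusters = {
    2: [[f"pics/pair{i}_a.jpg", f"pics/pair{i}_b.jpg"] for i in range(8)],
    3: [["pics/140801.jpg", "pics/140802.jpg", "pics/140803.jpg"]],
}

n_total = 25
n_clustered = count_clustered(clusters)
print(n_clustered, n_total - n_clustered)  # 19 6
```

The second number is the count of images that end up in clusters of size 1 and are therefore not reported by default.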

Detailed example

This example shows all low-level functions and how to use time distance scaling. Use the latter if (i) pure content-based clustering puts similar but temporally uncorrelated images into the same cluster and (ii) you have meaningful timestamp data such as EXIF tags or correct file timestamps. Copying files can reset file timestamps, so use cp -a or rsync -a to preserve them. See Content and time distance.

#!/usr/bin/python3

# Detailed API example. We show which functions are called inside
# get_image_data() (read_images(), get_model(), fingerprints(), pca(),
# read_timestamps()) and show more options such as time distance scaling.

from imagecluster import calc, io as icio, postproc


##images,fingerprints,timestamps = icio.get_image_data(
##    'pics/',
##    pca_kwds=dict(n_components=0.95),
##    img_kwds=dict(size=(224,224)))

# Create image database in memory. This helps to feed images to the NN model
# quickly.
images = icio.read_images('pics/', size=(224,224))

# Create Keras NN model.
model = calc.get_model()

# Feed images through the model and extract fingerprints (feature vectors).
fingerprints = calc.fingerprints(images, model)

# Optionally run a PCA on the fingerprints to compress the dimensions. Use a
# cumulative explained variance ratio of 0.95.
fingerprints = calc.pca(fingerprints, n_components=0.95)

# Read image timestamps. They are needed to calculate the time distance, which
# can optionally be used in clustering.
timestamps = icio.read_timestamps('pics/')

# Run clustering on the fingerprints. Select clusters with similarity index
# sim=0.5. Mix 80% content distance with 20% timestamp distance (alpha=0.2).
clusters = calc.cluster(fingerprints, sim=0.5, timestamps=timestamps, alpha=0.2)

# Create dirs with links to images. Dirs represent the clusters the images
# belong to.
postproc.make_links(clusters, 'pics/imagecluster/clusters')

# Plot images arranged in clusters and save plot.
fig,ax = postproc.plot_clusters(clusters, images)
fig.savefig('foo.png')
postproc.plt.show()
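
To make the "80% content distance, 20% timestamp distance" comment in the example above more concrete, here is a small pure-Python sketch of such a weighted mix. It is an illustration only, not the library's implementation: the function name mix_distances, the sample numbers, and the choice to scale each distance list by its maximum before mixing are all assumptions. See Content and time distance for how imagecluster actually defines this.

```python
def mix_distances(content, time, alpha):
    """Blend two lists of pairwise distances after scaling each to [0, 1].

    Hypothetical helper: alpha=0 uses content distance only, alpha=1 uses
    time distance only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Guard against division by zero when all distances are zero.
    c_max = max(content) or 1.0
    t_max = max(time) or 1.0
    return [
        (1.0 - alpha) * c / c_max + alpha * t / t_max
        for c, t in zip(content, time)
    ]


# Hypothetical pairwise distances for three images (pairs 0-1, 0-2, 1-2).
content = [2.0, 8.0, 10.0]      # fingerprint (content) distances
time = [60.0, 3600.0, 3540.0]   # seconds between timestamps

mixed = mix_distances(content, time, alpha=0.2)
```

With alpha=0.2, a pair of images taken an hour apart is pushed further apart than its content distance alone would suggest, while a pair taken a minute apart stays close.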