Fast Bird Part Localization for Fine-grained Recognition

1. Abstract *

In this work we introduce a new, faster method for localizing the parts of birds. Part localization is an important step in fine-grained recognition because the discriminative features are highly local. The state-of-the-art method of [1] uses R-CNN [2] to localize parts. R-CNN performs on the order of 1000 forward passes of the network per image, which makes it very slow. Compared to the state of the art [1], our method is at least two orders of magnitude faster while achieving comparable categorization accuracy.

2. Publication

Fast Bird Part Localization for Fine-Grained Categorization

Yaser Souri, Shohreh Kasaei

The Third Workshop on Fine-Grained Visual Categorization (FGVC3) in conjunction with CVPR 2015

pdf | poster | code

3. Part Localization as Pixel Classification

3.1. Classifying pixels

We pose the problem of finding a part as classifying which pixels belong to that part. Below are example results of our system classifying whether each pixel belongs to the head of a bird. The redder a pixel is, the higher the probability the system assigns to it belonging to the head. Exact probability values can be read from the color bar.

Example Output 1 Example Output 2 Example Output 3

To train each part localizer (pixel classifier) we need three things:

  • feature representation for each pixel
  • a set of positive and negative example pixels
  • a classifier

3.1.1. Feature representation for each pixel (similar to [3])

We use an off-the-shelf Convolutional Neural Network (specifically CaffeNet, a network almost identical to AlexNet) that has been pre-trained on the ILSVRC 2012 classification task, and we run only a single forward pass of its convolutional layers (conv1 - conv5). After up-sampling each feature map to the image size, the activations at each pixel location give a good feature representation (a hypercolumn [3]) for every pixel in the image.

Feature Extraction for each pixel

Hypercolumn of each image
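
The up-sampling and stacking step can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the feature maps are toy arrays standing in for the conv1 - conv5 outputs, and nearest-neighbour up-sampling is used for brevity (the interpolation actually used in this work is not specified here).

```python
import numpy as np

def upsample_nearest(feature_map, out_height, out_width):
    """Nearest-neighbour up-sampling of a (C, h, w) map to (C, H, W).

    Assumption: nearest-neighbour is a simplification chosen for this sketch.
    """
    _, h, w = feature_map.shape
    rows = np.arange(out_height) * h // out_height  # source row for each output row
    cols = np.arange(out_width) * w // out_width    # source column for each output column
    return feature_map[:, rows[:, None], cols[None, :]]

def hypercolumns(feature_maps, out_height, out_width):
    """Stack up-sampled maps and return one feature vector per pixel.

    feature_maps: list of arrays shaped (C_i, h_i, w_i), one per conv layer.
    Returns an array shaped (H * W, sum of C_i), rows in row-major pixel order.
    """
    upsampled = [upsample_nearest(f, out_height, out_width) for f in feature_maps]
    stacked = np.concatenate(upsampled, axis=0)      # (sum C_i, H, W)
    return stacked.reshape(stacked.shape[0], -1).T   # (H * W, sum C_i)

# Toy stand-ins for two convolutional layers of different resolution.
layer_a = np.arange(2 * 2 * 2, dtype=float).reshape(2, 2, 2)
layer_b = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)
features = hypercolumns([layer_a, layer_b], out_height=8, out_width=8)
```

Each row of `features` is the descriptor of one pixel, so a whole image is described after a single pass over the feature maps.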

3.1.2. Positive and negative training pixels

For each part that we want to localize, we mine a large set of positive and negative pixels from the training set of the CUB-200-2011 dataset. Since the CUB dataset provides part location annotations, we use them during training.

Points for head localization

Points for body localization

Points for bounding box localization
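
One way to mine such pixels is sketched below. It rests on an assumption that is not stated in this work: positives are drawn from inside an annotated part box and negatives from outside it. The function name, the box format (x_min, y_min, x_max, y_max, inclusive) and the sample counts are all hypothetical.

```python
import numpy as np

def sample_training_pixels(image_shape, part_box, num_per_class, rng):
    """Sample (row, col) coordinates of positive and negative pixels.

    Assumption: pixels inside part_box are positives, all others negatives.
    """
    height, width = image_shape
    x_min, y_min, x_max, y_max = part_box
    inside = np.zeros((height, width), dtype=bool)
    inside[y_min:y_max + 1, x_min:x_max + 1] = True

    positives = np.argwhere(inside)    # coordinates inside the box
    negatives = np.argwhere(~inside)   # coordinates outside the box

    # Sample without replacement, capped by how many pixels are available.
    pos_idx = rng.choice(len(positives), size=min(num_per_class, len(positives)), replace=False)
    neg_idx = rng.choice(len(negatives), size=min(num_per_class, len(negatives)), replace=False)
    return positives[pos_idx], negatives[neg_idx]

rng = np.random.default_rng(0)
pos, neg = sample_training_pixels((20, 30), (5, 4, 14, 9), num_per_class=50, rng=rng)
```

The sampled coordinates can then be used to look up the per-pixel features and build the training set for one part.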

3.1.3. Pixel Classifier

To classify pixels we need a fast probabilistic classifier, so we chose a Random Forest.
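
As an illustration, a pixel classifier of this kind could be set up with scikit-learn's `RandomForestClassifier`. The library choice, the helper names and the number of trees are assumptions made for this sketch, not details taken from this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_pixel_classifier(features, labels, num_trees=20, seed=0):
    """Fit a random forest on per-pixel features (label 1 = part, 0 = not part).

    num_trees and seed are illustrative values, not tuned settings.
    """
    classifier = RandomForestClassifier(n_estimators=num_trees, random_state=seed)
    classifier.fit(features, labels)
    return classifier

def probability_map(classifier, pixel_features, height, width):
    """Return an (height, width) map of P(pixel belongs to the part).

    pixel_features holds one row per pixel in row-major order.
    """
    part_column = list(classifier.classes_).index(1)
    probabilities = classifier.predict_proba(pixel_features)[:, part_column]
    return probabilities.reshape(height, width)
```

Because `predict_proba` scores all pixels of an image in one call, the probability map is obtained without any additional network passes.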

3.1.4. Results

Below is a set of results from our pixel classifier trained to localize bird heads.

Success

Outputs 1 to 39

Failure

Outputs 1 to 10

3.1.5. Properties

Our pixel classifier has some very interesting properties.

Ability to localize multiple part instances without any computational overhead.

Output 1 Output 2 Output 3

Ability to localize parts on pencil line drawings and cartoons.

Output 1 Output 2 Output 3

Ability to localize parts on novel classes that the model has not seen in the training set. This is expected, since the problem is fine-grained recognition and all classes are sub-classes of one general class (e.g. bird).

Output 1 Output 2

Ability to indicate the absence of a part by assigning low probabilities.

Output 1 Output 2

3.2. Estimating the bounding box around each part

From the per-pixel classification scores we can extract the bounding box around each part in a few steps:

  • Threshold the pixel probability map
  • Apply morphological filtering to remove noise
  • Find the largest connected component
  • Find the box around that connected component
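
The steps above can be sketched with NumPy and `scipy.ndimage`. The threshold of 0.5, the use of a binary opening as the morphological filter, and the function name are assumptions made for illustration; the values used in this work are not given here.

```python
import numpy as np
from scipy import ndimage

def part_bounding_box(prob_map, threshold=0.5, opening_iterations=1):
    """Return (x_min, y_min, x_max, y_max) of the largest blob, or None.

    threshold and opening_iterations are illustrative, not tuned values.
    """
    # 1. Threshold the pixel probability map.
    mask = prob_map >= threshold

    # 2. Morphological filtering (here: binary opening) to remove small noise.
    mask = ndimage.binary_opening(mask, iterations=opening_iterations)

    # 3. Find the largest connected component.
    labels, num_components = ndimage.label(mask)
    if num_components == 0:
        return None  # no confident pixels: the part is treated as absent
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0  # label 0 is background
    largest = sizes.argmax()

    # 4. Find the box around that connected component.
    rows, cols = np.nonzero(labels == largest)
    return int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())

# Toy probability map: one confident blob plus a single noisy pixel.
prob = np.zeros((10, 10))
prob[2:6, 3:8] = 0.9
prob[8, 1] = 0.95
box = part_bounding_box(prob)
```

Returning `None` when no pixel survives the threshold and filtering gives a simple way to report that a part is absent.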

3.2.1. Results

Below is a set of results from our part localizer trained to localize bird heads.

Outputs 1 to 20

Below are some images on which the head, the body and the bounding box are localized using our system.

Outputs 1 to 8

Below you can see the results of our system applied to consecutive frames of a video.

4. Fine-grained categorization

After localizing the parts, we use an approach very similar to [1] for categorization. As categorization features we combine features from the head, the body and the bounding box of the bird. We use a fine-tuned CaffeNet network and take its fc7 activations as features (as in [1]). Below is an overview of the method.

Categorization pipeline overview

4.1. Categorization accuracy **

Method | Training annotations | Testing annotations | Mean accuracy
[1]    | bounding box + parts | -                   | 73.89%
Ours   | bounding box + parts | -                   | 72.02% (±0.33)

5. References and Notes

* Some facts may be outdated, since a few months have passed since this work was published.

** As other works on the same dataset have indicated, using a better network such as VGG-19 can dramatically improve performance. Using VGG-19 we can improve the mean accuracy to 82%. Unfortunately, this work cannot be extended to datasets where part location annotations are not available. This is the main limitation of this work.

[1] Zhang, Ning, et al. "Part-based R-CNNs for fine-grained category detection." ECCV 2014.

[2] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

[3] Hariharan, Bharath, et al. "Hypercolumns for object segmentation and fine-grained localization." CVPR 2015.