In this work we introduce a new, faster method for localizing the parts of birds. Part localization is an important step in fine-grained recognition because the discriminative features are highly local. The state-of-the-art method of Zhang et al. uses R-CNN (Girshick et al.) to localize parts. R-CNN runs on the order of 1,000 forward passes of the network per image, which makes it very slow. Compared to this state-of-the-art method, ours is at least two orders of magnitude faster while achieving comparable categorization accuracy.
We pose the problem of finding a part as classifying the pixels that belong to that part. Below are example results of our system classifying whether each pixel belongs to the head of a bird. The redder a pixel, the more confident the system is that it belongs to the head. Approximate probability values can be read off the color bar.
To train each part localizer (pixel classifier) we need three things:
Pixel features. We use an off-the-shelf convolutional neural network (CaffeNet, which is almost identical to AlexNet) pre-trained on the ILSVRC 2012 classification task, and run only a single forward pass through its convolutional layers (conv1 to conv5). Up-sampling each feature map to the image size gives a feature representation for every pixel in the image.
Training pixels. For each part we want to localize, we mine a large set of positive and negative pixels from the training set of the CUB-200-2011 dataset. Note that the CUB dataset includes part location annotations, which we use during training.
A classifier. We needed a fast, probabilistic classifier for the pixels, so we chose a Random Forest.
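The first two ingredients can be sketched in a few lines. The snippet below is a minimal, dependency-free illustration under stated assumptions, not the implementation used in this work. It up-samples low-resolution feature maps to image size with nearest-neighbour interpolation (the interpolation scheme is an assumption, since the text only says the maps are up-sampled), stacks them into one feature vector per pixel, and samples positive and negative pixels from a binary part mask. The mask, the sampling scheme and the helper names `upsample_nearest`, `pixel_features` and `mine_pixels` are all hypothetical.

```python
import random


def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour up-sampling of one 2-D feature map (a list of rows)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def pixel_features(feature_maps, out_h, out_w):
    """Return features[y][x]: one value per up-sampled feature map."""
    upsampled = [upsample_nearest(m, out_h, out_w) for m in feature_maps]
    return [
        [[m[y][x] for m in upsampled] for x in range(out_w)]
        for y in range(out_h)
    ]


def mine_pixels(features, part_mask, n_per_class, seed=0):
    """Sample up to n_per_class positive and negative pixels (assumed scheme)."""
    rng = random.Random(seed)
    coords = [
        (y, x)
        for y in range(len(part_mask))
        for x in range(len(part_mask[0]))
    ]
    pos = [c for c in coords if part_mask[c[0]][c[1]]]
    neg = [c for c in coords if not part_mask[c[0]][c[1]]]
    chosen = rng.sample(pos, min(n_per_class, len(pos)))
    chosen += rng.sample(neg, min(n_per_class, len(neg)))
    X = [features[y][x] for y, x in chosen]
    labels = [1 if part_mask[y][x] else 0 for y, x in chosen]
    return X, labels


# Toy example: two 2x2 "feature maps" standing in for conv outputs, 4x4 image.
maps = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.5], [0.5, 1.0]]]
feats = pixel_features(maps, 4, 4)

# Hypothetical "head" mask: the top-left 2x2 block of the image.
mask = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
X, labels = mine_pixels(feats, mask, n_per_class=3)
print(len(X), labels)  # 6 [1, 1, 1, 0, 0, 0]
```

The resulting feature vectors and labels are the kind of input a probabilistic classifier such as a Random Forest would be trained on. With real data, array libraries would replace the nested lists.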
Below is a set of results from our pixel classifier, trained to localize the heads of birds.
Our pixel classifier has several interesting properties:
Ability to localize multiple part instances without any computational overhead.
Ability to localize parts on pencil line drawings and cartoons.
Ability to localize parts on novel classes that the model did not see during training. This is expected, since in fine-grained recognition all classes are sub-classes of one general class (e.g. bird).
Ability to detect the absence of a part, which is signaled by low probabilities.
From the per-pixel classification scores we can easily extract a bounding box around each part.
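As a rough illustration of this step, the sketch below thresholds a probability map and returns the tight box around the pixels that pass. The threshold value of 0.5 and the function name `part_bounding_box` are assumptions made for the example. The text does not state how the box is actually derived from the scores.

```python
def part_bounding_box(prob_map, threshold=0.5):
    """Tight box (x_min, y_min, x_max, y_max) around pixels with p >= threshold.

    Returns None when no pixel passes, i.e. the part is treated as absent.
    """
    hits = [
        (x, y)
        for y, row in enumerate(prob_map)
        for x, p in enumerate(row)
        if p >= threshold
    ]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return min(xs), min(ys), max(xs), max(ys)


# Toy 4x4 probability map for the "head" part.
head_probs = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.9, 0.0],
    [0.0, 0.7, 0.6, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
print(part_bounding_box(head_probs))        # (1, 1, 2, 2)
print(part_bounding_box(head_probs, 0.95))  # None
```

Returning `None` when nothing passes the threshold mirrors the property that an absent part shows up as low probabilities.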
Below is a set of results from our part localizer, trained to localize the heads of birds.
Below are some images in which the head, body and bounding box have been localized using our system.
Below you can see the results of our system applied to consecutive frames of a video.
After localizing the parts, we use an approach very similar to that of Zhang et al. for categorization. The categorization features combine features from the head, the body and the bounding box of the bird. We use a fine-tuned CaffeNet network and take the fc7 activations as features (same as ). Below is an overview of the method.
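One simple way to combine the three regions is to concatenate their fc7 vectors into a single descriptor for a downstream classifier. This is an assumption, since the text above does not say how the features are merged. The function name `combine_region_features` and the placeholder vectors are illustrative only. The 4096 dimension is the size of the fc7 layer in CaffeNet.

```python
FC7_DIM = 4096  # size of CaffeNet's fc7 layer


def combine_region_features(head, body, whole_bird):
    """Concatenate the three region descriptors in a fixed order."""
    regions = (("head", head), ("body", body), ("whole_bird", whole_bird))
    for name, vec in regions:
        if len(vec) != FC7_DIM:
            raise ValueError(
                f"{name} feature has length {len(vec)}, expected {FC7_DIM}"
            )
    return list(head) + list(body) + list(whole_bird)


# Placeholder vectors standing in for real fc7 activations.
head_feat = [0.1] * FC7_DIM
body_feat = [0.2] * FC7_DIM
bird_feat = [0.3] * FC7_DIM

descriptor = combine_region_features(head_feat, body_feat, bird_feat)
print(len(descriptor))  # 12288
```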
| Method | Training annotations | Testing annotations | Mean accuracy |
|---|---|---|---|
| Zhang et al. | bounding box + parts | - | 73.89% |
| Ours | bounding box + parts | - | 72.02% (±0.33) |
* Some facts might be out of date, since a few months have passed since this work was published.
** As other works on the same dataset have indicated, using a stronger network such as VGG-19 can dramatically improve performance. With VGG-19 we reach 82% mean accuracy. Unfortunately, this work cannot be extended to datasets for which part location annotations are not available; this is its main limitation.
 Zhang, Ning, et al. "Part-based R-CNNs for fine-grained category detection." ECCV 2014.
 Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
 Hariharan, Bharath, et al. "Hypercolumns for object segmentation and fine-grained localization." CVPR 2015.