Object detection models are typically trained to detect only a limited number of object classes. The widely used MS COCO (Microsoft Common Objects in Context) dataset, for example, contains only eighty classes, ranging from person to toothbrush. Extending this set is cumbersome: it typically involves collecting a range of images for each new object class, labeling them, and fine-tuning an existing model. But what if there were an easy way to teach a model new categories out of the box? Or even to extend it with thousands of categories? That is exactly what the recently published Detic model promises. In this blog post we discuss how this new model works and assess its strengths and weaknesses by testing it against a number of potential use cases.
Object detection is composed of two sub-problems: finding the object (localization) and identifying it (classification). Conventional methods couple these sub-problems and consequently rely on box labels for all classes. However, detection datasets remain much smaller than image classification datasets, both in size and in number of object classes (vocabulary). Classification datasets have richer vocabularies because they are larger and easier to collect. Incorporating image classification data into the training of a detector's classifier allows the detector's vocabulary to expand from hundreds to tens of thousands of concepts! Detic (Detector with image classes) does exactly this, by using image-level supervision in addition to detection supervision. The model also decouples the localization and classification sub-problems.
Detic is the first known model that trains a detector on all twenty-one thousand classes of the ImageNet dataset. Consequently, it can detect a large number of objects, which makes it a very suitable baseline model for a wide variety of tasks. Furthermore, to generalise to larger vocabularies, Detic leverages open-ended CLIP (Contrastive Language-Image Pre-training) embeddings. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time, the learned text encoder embeds the names or descriptions of the target dataset’s classes. By using embeddings of words instead of a fixed set of object categories, a classifier is created that can recognize concepts without having explicitly seen any examples of them. This is also called zero-shot learning. My colleague Juta has a really neat write-up on the subject of CLIP and multimodal AI in general. Go check it out!
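To make the zero-shot idea a bit more concrete, here is a minimal sketch of classifying by comparing an image embedding against class-name embeddings. It is not Detic's or CLIP's actual code: the three-dimensional vectors are hand-made stand-ins for what the real encoders would produce, and the function names are our own.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_zero_shot(image_embedding, class_embeddings):
    """Return the class name whose embedding is closest to the image embedding."""
    scores = {
        name: cosine_similarity(image_embedding, embedding)
        for name, embedding in class_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# ASSUMPTION: toy, hand-made vectors. A real text encoder would output
# much higher-dimensional embeddings for each class name.
TOY_CLASS_EMBEDDINGS = {
    "skateboard": [0.9, 0.1, 0.0],
    "dog": [0.0, 1.0, 0.1],
    "toothbrush": [0.1, 0.0, 1.0],
}

# ASSUMPTION: stand-in for the image encoder's output on one image region.
toy_image_embedding = [0.8, 0.2, 0.1]

label, scores = classify_zero_shot(toy_image_embedding, TOY_CLASS_EMBEDDINGS)
print(label)
```

Because the classifier is just a lookup over embeddings, changing the vocabulary only means changing the dictionary of class embeddings; nothing has to be retrained.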
Combining these methods, the model is trained on a mix of detection data and image-labeled data. When using detection data, Detic uses standard detection losses to train the classifier (W) and the box prediction branch (B) of a detector. When using image-labeled data, only the classifier is trained, using a modified classification loss. This loss is applied to the features extracted from the largest proposal predicted by the network. Similar to CLIP, the model can use a custom vocabulary of class embeddings at prediction time to generalise to different classes, without retraining.
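The image-level part of this training scheme can be sketched roughly as follows. This is an illustration under our own assumptions, not Detic's implementation: we assume the modified classification loss is a per-class binary cross-entropy, and all function names and numbers are made up for the example.

```python
import math


def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes get area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def select_largest_proposal(proposals):
    """Index of the proposal with the largest area."""
    return max(range(len(proposals)), key=lambda i: box_area(proposals[i]))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def image_level_loss(proposals, class_logits, image_label_index):
    """Binary cross-entropy on the classifier scores of the largest proposal.

    ASSUMPTION: the image-level label supervises only the largest proposal,
    with one sigmoid output per class.
    """
    idx = select_largest_proposal(proposals)
    loss = 0.0
    for class_index, logit in enumerate(class_logits[idx]):
        p = sigmoid(logit)
        target = 1.0 if class_index == image_label_index else 0.0
        loss -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return idx, loss


# Toy example: three proposals, two classes, image labeled with class 0.
proposals = [(0, 0, 10, 10), (5, 5, 100, 80), (20, 20, 40, 40)]
class_logits = [[0.1, 0.2], [2.0, -1.0], [0.0, 0.0]]
chosen, loss = image_level_loss(proposals, class_logits, image_label_index=0)
print(chosen, round(loss, 4))
```

Note that no box label is needed here: the image-level label only touches the classifier scores, which is why the box prediction branch is left alone for this kind of data.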
Imagine you would like to automatically analyze a social media feed for occurrences of certain objects. Object detectors trained for specific use cases in sports, animal detection, healthcare etc. would not give the desired results in this context, because social media posts are not limited to certain topics and their images contain a wide variety of possible objects. Since traditional detectors have a rather small vocabulary, you are constrained in how many categories you can look for. This finite set of classes also limits the analysis you can perform on these types of images. Because Detic significantly expands the vocabulary, it presents itself as a good solution to this problem.
So now you know that Detic is able to detect a broad range of object classes because the ImageNet dataset has up to 21K classes. But the use of image classification data comes with another benefit: more fine-grained class detections, without having to add more (fine-grained) labels to the detection dataset. While an object detection dataset such as LVIS contains a “dog” class, its labels do not specify the exact breed of dog. For a conventional detector to distinguish between different breeds, you would have to manually add or change box-labeled data for these fine-grained categories. With Detic, adding image-labeled data of the different breeds, as contained in ImageNet, suffices. This allows the model to detect and name the dogs by their correct breed, as seen in the example.
The capabilities of the model don’t stop there. As already explained, the use of CLIP allows us to work with custom vocabularies containing classes of which the model has never seen an image. In the example on the left, the model looked for the class “hoverboard”, which is not in any of the training datasets. Because the embedding of this word lies close to that of the word “skateboard”, the model is still able to find this object that it has never seen before! The same goes for the word “bust” in the image on the right: CLIP again enables the model to correctly identify this unfamiliar object.
We have seen the model perform on different images, but we also applied it to some videos, where it achieves similar results. To showcase the possibilities of Detic on moving images, we made a simple tool that extracts the frames from a video, runs detection on each frame separately (again with a vocabulary that can be custom-made) and finally stitches the frames back together. It is also possible to save the detected objects in each frame to a text file, which makes it easier to analyze the whole video.
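The per-frame part of such a tool could look roughly like the sketch below. This is not our actual tool: decoding the video into frames and encoding the annotated frames back into a video (for instance with a library such as OpenCV) is left out, and `fake_detector` is a hypothetical stand-in for a call to Detic that simply "detects" the vocabulary words present in a toy frame.

```python
import os
import tempfile


def detect_over_frames(frames, detect_fn, vocabulary):
    """Run the detector on every frame and collect (frame_index, detections)."""
    results = []
    for frame_index, frame in enumerate(frames):
        detections = detect_fn(frame, vocabulary)
        results.append((frame_index, detections))
    return results


def write_detections(results, path):
    """Write one line per frame, listing each detected label with its score."""
    with open(path, "w", encoding="utf-8") as f:
        for frame_index, detections in results:
            labels = ", ".join(f"{label}:{score:.2f}" for label, score in detections)
            f.write(f"frame {frame_index}: {labels}\n")


def fake_detector(frame, vocabulary):
    """HYPOTHETICAL stand-in for the real model.

    A toy 'frame' is just a set of object names; every vocabulary word
    found in it is reported with score 1.0.
    """
    return [(name, 1.0) for name in vocabulary if name in frame]


# Toy 'video' of three frames and a custom vocabulary.
frames = [{"dog", "bench"}, {"dog"}, set()]
results = detect_over_frames(frames, fake_detector, ["dog", "hoverboard"])

out_path = os.path.join(tempfile.mkdtemp(), "detections.txt")
write_detections(results, out_path)
```

Keeping the detector behind a plain function argument means the same loop works for any vocabulary or model, and the text file gives a simple per-frame record to analyze afterwards.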
Detection models are usually trained on specific data for a certain use case, but Detic has a very broad range of applications. It is the first known model with such a large vocabulary of object classes. It can be extended to more fine-grained detection simply by adding more image-labeled classes, and by using CLIP embeddings the model can also handle zero-shot detection tasks. There is of course a lot left to explore with these sorts of models. One limitation we found appears when using a custom vocabulary containing descriptions such as “a person sitting on a bench”. The model often fails to detect the whole context and instead picks up only a single word from the sentence. This is probably because Detic extracts image labels from captions using a naive text match and is therefore mostly trained on single words.
Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., & Misra, I. (2022). Detecting Twenty-thousand Classes using Image-level Supervision. arXiv preprint arXiv:2201.02605.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. CoRR, abs/2103.00020. https://arxiv.org/abs/2103.00020
Redmon, J., & Farhadi, A. (2018). YOLOv3: An Incremental Improvement. CoRR, abs/1804.02767. http://arxiv.org/abs/1804.02767