Grounded Segment Anything: From Objects to Parts

The University of Hong Kong
*Equal contribution

We expand the Segment Anything Model (SAM) to support text prompt input. The text prompt can be object-level (e.g., dog) or part-level (e.g., dog head). Furthermore, we build a ChatGPT-based dialogue system that flexibly calls various segmentation models when receiving instructions in the form of natural language.

1. Motivation

From last year's ChatGPT to this year's GPT-4, the entire AI world has been celebrating NLP and multimodality. In the CV community, the Segment Anything Model (SAM) is one of the few recent works that catches the eye: a huge amount of data, versatile point/box/mask prompts, a fluent demo experience, and amazing generalization ability. However, SAM also has one drawback: the text prompt is not supported in its open-source code, although the paper mentions that it has been achieved.

We believe the text prompt is missing from the open-source code not because the FAIR author team is keeping it secret, but because text prompts are too ambiguous to be displayed properly in the demo, especially in the open world. For the SAM project, pushing class-agnostic segmentation quality to unprecedented heights is impressive enough, while the text prompt deserves a separate discussion of its own. From this point of view, SAM does not end CV; rather, it removes the segmentation bottleneck for subsequent recognition tasks and opens up greater possibilities for visual recognition and vision-language multimodal tasks. So the next step is to Segment Anything and Name It!

2. Method: Segment Anything and Name It

2.1 A straightforward method: SAM + CLIP R-CNN

To enable SAM to perform category recognition, the most straightforward implementation is to combine SAM and CLIP: SAM provides region proposals, the corresponding image patches are cropped from the original image (usually with the region enlarged by 1.5-2x to capture surrounding context), and these patches are sent to CLIP (Contrastive Language-Image Pre-Training) for classification. The core idea of this implementation inherits from R-CNN: R-CNN adopts a traditional region proposal method such as selective search, whereas this implementation takes the output of SAM as proposals; R-CNN classifies over a closed set, whereas this implementation performs open-set classification. In issue #6 of SAM's GitHub repo, multiple community members have implemented this solution.
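For reference, a minimal sketch of this pipeline is shown below, assuming the segment-anything and openai/CLIP packages; the checkpoint path, the vocabulary, and the 1.5x enlargement factor are placeholder assumptions rather than the exact settings of any released code.

```python
# A minimal sketch of the SAM + CLIP "R-CNN" pipeline (placeholder paths and vocabulary).
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. SAM provides class-agnostic region proposals.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)

# 2. CLIP classifies each cropped proposal against a text vocabulary.
clip_model, preprocess = clip.load("ViT-L/14", device=device)
vocabulary = ["dog", "cat", "person", "car"]
with torch.no_grad():
    text_features = clip_model.encode_text(clip.tokenize(vocabulary).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

image = np.array(Image.open("input.jpg").convert("RGB"))
h, w = image.shape[:2]

for proposal in mask_generator.generate(image):
    x, y, bw, bh = proposal["bbox"]  # proposal box in XYWH format
    # Enlarge the box ~1.5x to keep surrounding context, as described above.
    cx, cy = x + bw / 2, y + bh / 2
    bw, bh = bw * 1.5, bh * 1.5
    x0, y0 = max(int(cx - bw / 2), 0), max(int(cy - bh / 2), 0)
    x1, y1 = min(int(cx + bw / 2), w), min(int(cy + bh / 2), h)

    patch = preprocess(Image.fromarray(image[y0:y1, x0:x1])).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = clip_model.encode_image(patch)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        scores = (image_features @ text_features.T).softmax(dim=-1)
    print(vocabulary[scores.argmax().item()], proposal["bbox"])
```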

2.2 A fast method: SAM + CLIP Fast R-CNN

Considering the evolution from R-CNN to Fast R-CNN, SAM + CLIP admits a more efficient implementation: all region proposals provided by SAM share one feature map, so the image passes through CLIP only once. Code implementation: sam_clip_fast_rcnn. This implementation works well in many scenarios, because (1) SAM provides high-quality region proposals, and (2) CLIP itself has some region classification capability, especially the larger CLIP models.
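A minimal sketch of the shared-feature-map idea is below, under the assumption that the CLIP visual encoder has been split into a spatial feature extractor (`clip_backbone`) and a pooling + projection head (`clip_pool_head`); both names are hypothetical stand-ins, and the actual code lives in sam_clip_fast_rcnn.

```python
# Sketch of the "Fast R-CNN" variant: one backbone pass, RoIAlign per proposal.
# clip_backbone / clip_pool_head are hypothetical pieces of a split CLIP encoder.
import torch
from torchvision.ops import roi_align

def classify_proposals(image, boxes_xyxy, clip_backbone, clip_pool_head, text_features):
    """image: (1, 3, H, W) tensor; boxes_xyxy: (K, 4) tensor in image coordinates."""
    # 1. The whole image passes through the CLIP visual backbone only once.
    feature_map = clip_backbone(image)                      # (1, C, H', W')
    spatial_scale = feature_map.shape[-1] / image.shape[-1]

    # 2. Every SAM proposal shares that feature map via RoIAlign.
    rois = roi_align(feature_map, [boxes_xyxy], output_size=7,
                     spatial_scale=spatial_scale, aligned=True)  # (K, C, 7, 7)

    # 3. Pool each region into the CLIP embedding space and score it against the text.
    region_embeddings = clip_pool_head(rois)                # (K, D), D = CLIP embed dim
    region_embeddings = region_embeddings / region_embeddings.norm(dim=-1, keepdim=True)
    return (region_embeddings @ text_features.T).softmax(dim=-1)  # (K, num_classes)
```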

2.3 A practical method: Open-Vocabulary Object Detection + SAM

A practical solution is to use the classification module of an Open-Vocabulary Object Detection (OVOD) model to classify the masks output by SAM. Another, similar solution is to send the box output of OVOD as a prompt to SAM to get the mask. The core difference between these two solutions lies in which model the region proposals come from. In terms of the final recognition result, using the region proposals extracted by OVOD itself or those provided by SAM should be similar, because the bottleneck lies in the recognition module of OVOD: its recognition ability on region proposals it did not extract itself should be very weak. Even if SAM provides more diverse proposals, OVOD cannot recognize them.

There are many excellent implementations of OVOD + SAM in the community, such as Grounded SAM, which is based on Grounding DINO and SAM. Learning from these open-source solutions, we build a solution based on GLIP (Grounded Language-Image Pre-training). Code implementation: demo_glip_sam.
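The skeleton of such a pipeline roughly looks like the sketch below. The SAM calls follow the segment-anything API; `glip_demo.inference(...)` is a hypothetical wrapper around a GLIP detector that returns XYXY boxes and labels for the text prompt, not the exact interface of demo_glip_sam.

```python
# A minimal sketch of the OVOD + SAM pipeline (detector wrapper is hypothetical).
import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to(device)
predictor = SamPredictor(sam)

image = np.array(Image.open("input.jpg").convert("RGB"))
predictor.set_image(image)

# 1. The open-vocabulary detector turns the text prompt into boxes + labels.
boxes, labels = glip_demo.inference(image, caption="dog. frisbee.")  # hypothetical API

# 2. Each detected box is sent to SAM as a box prompt to obtain a mask.
results = []
for box, label in zip(boxes, labels):
    masks, scores, _ = predictor.predict(box=np.asarray(box), multimask_output=False)
    results.append({"label": label, "mask": masks[0], "score": float(scores[0])})
```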

Compared with the sequential OVOD + SAM, why not train an end-to-end segmentation and recognition model? We think that for an end-to-end model to outperform the model combination, its classification module would need to be trained with object category labels at the same magnitude of data (about 10M) and computing resources (about 256 A100 GPUs) as SAM. Emmm....

2.4 An advanced method: Open-Vocabulary Part Detection + SAM

The advantage of SAM over previous segmentation models lies not only in its high-quality segmentation, but also in greatly extending segmentation to fine-grained targets; in other words, segmentation can go from objects to object parts. Accordingly, if open-vocabulary part detection models are available, category recognition for object parts can be realized as well. We happen to be working on a similar project, VLPart, so we integrate its model with SAM. Code implementation: demo_vlpart_sam. Beyond dog head, parts such as dog leg, dog paw, and dog nose can also be successfully segmented and recognized.
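The part-level pipeline mirrors the object-level sketch above; only the detector and the vocabulary change. In the snippet below, `vlpart_demo.inference(...)` is a hypothetical wrapper around a VLPart-style part detector, and `predictor`/`image` are reused from the previous sketch.

```python
# Part-level variant of the OVOD + SAM sketch (detector wrapper is hypothetical).
part_vocabulary = "dog head. dog leg. dog paw. dog nose."
boxes, labels = vlpart_demo.inference(image, caption=part_vocabulary)  # hypothetical API

for box, label in zip(boxes, labels):
    masks, _, _ = predictor.predict(box=np.asarray(box), multimask_output=False)
    # Each mask is now tagged with a part-level category such as "dog head".
```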

3. ChatBot for Segment Anything and Name It

Interacting with AI models through natural language is the biggest surprise that AI has brought to humans in recent months. Following this trend, we develop a ChatGPT-based dialogue system, borrowing from Visual ChatGPT. Our system supports receiving images from users, class-agnostic segmentation, and object-level and part-level segmentation via text prompts. Code implementation: chatbot.
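A minimal sketch of the dispatch logic is given below. In the actual chatbot the routing is done by the language model, as in Visual ChatGPT; the tool registry, the `llm_router` call, and the three tool functions here are simplified, hypothetical stand-ins.

```python
# Sketch of query-to-tool dispatch; the real routing is handled by the LLM.
TOOLS = {
    "segment anything in the image": segment_everything,        # SAM, class-agnostic
    "segment objects named in the text": glip_sam,               # GLIP + SAM
    "segment object parts named in the text": vlpart_sam,        # VLPart + SAM
}

def handle_query(image, query, llm_router):
    # The LLM reads the tool descriptions and the user query, picks a tool,
    # and extracts the text prompt (e.g. "dog head") to pass along.
    tool_name, text_prompt = llm_router(query, list(TOOLS.keys()))
    return TOOLS[tool_name](image, text_prompt)
```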

The current implementation selectively calls SAM, GLIP_SAM, or VLPart_SAM according to the user's query. Although it can satisfy various needs, it is obviously not a "unified" solution. A foundation vision model that can segment any mask at both the object level and the part level, and that natively supports point/box/mask/text prompts, is an exciting research direction.

4. Exploration

In this blog, we provide a preliminary solution to support the Segment Anything Model with text prompts. This brings up an interesting follow-up question: if a visual model can segment the mask for any given text prompt, can it also output the caption for any given mask area? Perhaps not. Finding the corresponding mask area in the image for any text prompt is a discriminative problem, whereas outputting the text description for any mask area is a generative problem, a.k.a. the dense captioning problem. From the generative perspective, using a dense caption model to build a segment-and-name-anything visual system is a brand-new direction compared with this blog, and we will continue to explore it. (Maybe one day it can be realized by enumerating all possible text prompts.)

Acknowledgments

A large part of the code is borrowed from segment-anything, CLIP, GLIP, Grounded-Segment-Anything, and Visual ChatGPT. Many thanks for their wonderful work.