CLIP as RNN: segment countless visual concepts without training endeavor

Sun S, Li R, Gu X, Li S, Torr P

16 September 2024

Conference paper

Journal:

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

IEEE

pp.

13171 - 13182

Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. The recurrent unit is a two-stage segmenter built upon a frozen VLM. Thus, our model retains the VLM’s broad vocabulary space and equips it with segmentation ability. Experiments show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of data samples, and sets the new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.
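
The recurrent filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`recurrent_filter`, `toy_segment`), the `(mask, score)` interface, the threshold value, and the softmax-based toy scores are all assumptions made for illustration. A fixed (frozen) segmenter is called repeatedly, text queries whose score falls below a threshold are dropped, and the loop stops once the set of queries no longer changes.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: maps (image, texts) to one (mask, score) pair per text.
Segmenter = Callable[[object, List[str]], Dict[str, Tuple[object, float]]]


def recurrent_filter(
    image: object,
    texts: List[str],
    segment: Segmenter,
    threshold: float = 0.2,   # assumed value, chosen for the toy example
    max_steps: int = 10,
) -> Tuple[List[str], Dict[str, object]]:
    """Repeatedly segment and drop low-scoring texts until the set is stable."""
    current = list(texts)
    results: Dict[str, Tuple[object, float]] = {}
    for _ in range(max_steps):
        if not current:
            return [], {}
        # The segmenter itself is never updated; only its text input changes.
        results = segment(image, current)
        kept = [t for t in current if results[t][1] >= threshold]
        if kept == current:   # nothing was filtered out: converged
            break
        current = kept
    masks = {t: results[t][0] for t in current if t in results}
    return current, masks


def toy_segment(image: object, texts: List[str]) -> Dict[str, Tuple[object, float]]:
    """Stand-in for a frozen VLM segmenter: softmax over made-up affinities."""
    affinity = {"cat": 3.0, "dog": 2.5, "car": 0.5, "tree": 0.2}
    weights = [math.exp(affinity.get(t, 0.0)) for t in texts]
    total = sum(weights)
    # Masks are omitted (None) because this toy has no real image.
    return {t: (None, w / total) for t, w in zip(texts, weights)}


kept, masks = recurrent_filter(None, ["cat", "dog", "car", "tree"], toy_segment)
```

Because the toy scores are normalized over the current queries, removing distractors changes the scores of the remaining texts on the next pass, which is why the procedure is iterated instead of being applied once.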

Keywords:

training

semantic segmentation

vocabulary

semantics

computer vision

visualization

filters

DOI

10.1109/cvpr52733.2024.01251
