Referring expression comprehension aims to localize a natural language description in an image. State-of-the-art CNN-based methods for this problem use location priors to reduce inaccuracies in cross-modal alignment. Recent Transformer-based models cast aside this idea, making the case for steering away from hand-designed components. In this work, we propose LUNA, which uses language as continuing anchors to guide box prediction in a Transformer decoder, and thus show that language-guided location priors can be effectively exploited in a Transformer-based architecture. Our method first initializes an anchor box from the input expression via a small "proto-decoder," and then uses this anchor and its refined successors as location guidance in a modified Transformer decoder. At each decoder layer, the anchor box is first used as a query for gathering multi-modal context, and then updated based on the gathered context, producing the next, refined anchor. Finally, a lightweight assessment pathway evaluates the quality of all produced anchors and dynamically yields the final prediction. This approach allows box decoding to be conditioned on learned anchors, which facilitates accurate grounding, as we show in the experiments. Our method outperforms existing state-of-the-art methods on the ReferIt Game, RefCOCO/+/g, and Flickr30K Entities datasets.
visual grounding, vision-and-language, referring expression comprehension
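To make the anchor-as-query idea concrete, below is a minimal, illustrative sketch of one anchor-guided decoder layer: the current anchor box is encoded into the query, the query attends over fused visual-linguistic tokens, and a small head refines the anchor for the next layer. All module names, dimensions, the sinusoidal box encoding, and the logit-space box update are assumptions made for illustration, not the paper's exact implementation.

```python
# Illustrative sketch (not the authors' code) of one anchor-guided decoder layer.
import torch
import torch.nn as nn


def box_to_embedding(box, dim):
    """Encode a normalized (cx, cy, w, h) anchor with sinusoidal features."""
    # box: (B, 4) in [0, 1]; requires dim divisible by 8.
    freqs = torch.arange(dim // 8, device=box.device, dtype=box.dtype)
    freqs = 10000.0 ** (-freqs / (dim // 8))
    angles = box.unsqueeze(-1) * freqs                      # (B, 4, dim//8)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (B, 4, dim//4)
    return emb.flatten(1)                                   # (B, dim)


class AnchorGuidedDecoderLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Predicts a box offset in logit space (an assumed design choice,
        # common in iterative box-refinement decoders).
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 4))

    def forward(self, query, anchor, memory):
        # query:  (B, 1, dim)  content query carried across layers
        # anchor: (B, 4)       current anchor box, normalized (cx, cy, w, h)
        # memory: (B, N, dim)  fused visual-linguistic tokens
        q = query + box_to_embedding(anchor, query.size(-1)).unsqueeze(1)
        ctx, _ = self.cross_attn(q, memory, memory)   # anchor-conditioned context
        query = self.norm1(query + ctx)
        query = self.norm2(query + self.ffn(query))
        # Refine the anchor based on the gathered context.
        logit = torch.logit(anchor.clamp(1e-4, 1 - 1e-4))
        new_anchor = torch.sigmoid(logit + self.box_head(query.squeeze(1)))
        return query, new_anchor


if __name__ == "__main__":
    B, N, dim = 2, 400, 256
    layer = AnchorGuidedDecoderLayer(dim)
    query = torch.randn(B, 1, dim)
    anchor = torch.rand(B, 4)            # e.g. produced by a "proto-decoder"
    memory = torch.randn(B, N, dim)
    query, anchor = layer(query, anchor, memory)
    print(anchor.shape)                  # torch.Size([2, 4])
```

Stacking several such layers yields a sequence of progressively refined anchors, each of which could then be scored by a lightweight assessment pathway to select the final box; the scoring head itself is not sketched here.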