LUNA: language as continuing anchors for referring expression comprehension

Liang Y, Fan J, Li Z, Wang J, Torr PHS, Yang Z, Huang S-L, Tang Y

Referring expression comprehension aims to localize a natural language description in an image. Using location priors to reduce inaccuracies in cross-modal alignment is the state of the art among CNN-based methods for this problem. Recent Transformer-based models cast this idea aside, making the case for steering away from hand-designed components. In this work, we propose LUNA, which uses language as continuing anchors to guide box prediction in a Transformer decoder, showing that language-guided location priors can be effectively exploited in a Transformer-based architecture. Our method first initializes an anchor box from the input expression via a small "proto-decoder," and then uses this anchor and its refined successors as location guidance in a modified Transformer decoder. At each decoder layer, the anchor box is first used as a query for gathering multi-modal context and then updated based on the gathered context, producing the next, refined anchor. Finally, a lightweight assessment pathway evaluates the quality of all produced anchors and dynamically selects the final prediction. This approach conditions box decoding on learned anchors, which facilitates accurate grounding, as we show in the experiments. Our method outperforms existing state-of-the-art methods on the ReferIt Game, RefCOCO/+/g, and Flickr30K Entities datasets.
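
The anchor-refinement loop described in the abstract can be pictured with a minimal sketch. This is an illustrative reconstruction, not the authors' implementation: the module names (ProtoDecoder, AnchorDecoderLayer, AnchorAssessor), the mean-pooled language embedding, the single-query cross-attention, and the inverse-sigmoid box update are all assumptions made for illustration.

```python
# Illustrative sketch of language-initialized, iteratively refined anchors.
# All module names and design details below are assumptions, not LUNA's released code.
import torch
import torch.nn as nn


def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))


class ProtoDecoder(nn.Module):
    """Initializes an anchor box (cx, cy, w, h) from language features (assumed mean pooling)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, 4))

    def forward(self, lang_feats):             # lang_feats: (B, L, D)
        pooled = lang_feats.mean(dim=1)        # pool the expression into one vector
        return self.head(pooled).sigmoid()     # normalized box in [0, 1]


class AnchorDecoderLayer(nn.Module):
    """Uses the current anchor as a query to gather multi-modal context, then refines the anchor."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.box_embed = nn.Linear(4, d_model)                      # anchor -> query embedding
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.refine = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                    nn.Linear(d_model, 4))

    def forward(self, anchor, mm_feats):       # anchor: (B, 4), mm_feats: (B, N, D)
        query = self.box_embed(anchor).unsqueeze(1)                 # (B, 1, D)
        ctx, _ = self.cross_attn(query, mm_feats, mm_feats)         # gather context around the anchor
        ctx = ctx.squeeze(1)                                        # (B, D)
        delta = self.refine(ctx)                                    # predicted box offsets
        refined = torch.sigmoid(inverse_sigmoid(anchor) + delta)    # next, refined anchor
        return refined, ctx


class AnchorAssessor(nn.Module):
    """Lightweight pathway that scores every produced anchor and returns the best-scored box."""
    def __init__(self, d_model=256):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, anchors, contexts):      # lists of (B, 4) and (B, D), one per decoder layer
        scores = torch.stack([self.score(c) for c in contexts], dim=1).squeeze(-1)  # (B, T)
        best = scores.argmax(dim=1)                                                  # (B,)
        boxes = torch.stack(anchors, dim=1)                                          # (B, T, 4)
        return boxes[torch.arange(boxes.size(0)), best]
```

Under these assumptions, a forward pass would chain ProtoDecoder, then a stack of AnchorDecoderLayer modules (each consuming the previous anchor and emitting a refined one), and finally AnchorAssessor, which dynamically selects the best-scored anchor as the predicted box.
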

Keywords: visual grounding, vision-and-language, referring expression comprehension