Amplifying key cues for human-object-interaction detection

Liu Y, Chen Q, Zisserman A

Human-object interaction (HOI) detection aims to detect and recognise how people interact with the objects that surround them. This is challenging because different interaction categories are often distinguished only by very subtle visual differences in the scene. In this paper we introduce two methods to amplify key cues in the image, and also a method to combine these and other cues when considering the interaction between a human and an object. First, we introduce an encoding mechanism for representing the fine-grained spatial layout of the human and object (a subtle cue), together with semantic context (a complementary cue, represented by text embeddings of surrounding objects). Second, we use plausible future movements of humans and objects as a cue to constrain the space of possible interactions. Third, we use a gate and memory architecture as a fusion module to combine the cues. We demonstrate that these three improvements lead to performance that exceeds prior HOI methods by a considerable margin across standard benchmarks.
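
To make the third component concrete, below is a minimal PyTorch sketch of a gate-and-memory style fusion module. This is an illustrative assumption, not the paper's implementation: the class name GatedCueFusion, the shared cue dimensionality, and the sequential gated-update rule are all hypothetical. The sketch only shows the general idea of folding several cue features into a memory vector through learned gates.

import torch
import torch.nn as nn

class GatedCueFusion(nn.Module):
    """Hypothetical gate-and-memory fusion: each cue is folded into a
    running memory vector, with a learned gate controlling how much of
    the memory that cue is allowed to overwrite."""

    def __init__(self, dim: int):
        super().__init__()
        # Gate and candidate update are both computed from the
        # concatenation of the current memory and the incoming cue.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())

    def forward(self, cues: list[torch.Tensor]) -> torch.Tensor:
        # cues: list of (batch, dim) feature vectors, one per cue
        # (e.g. spatial layout, semantic context, predicted motion).
        memory = torch.zeros_like(cues[0])
        for cue in cues:
            joint = torch.cat([memory, cue], dim=-1)
            g = self.gate(joint)                # per-dimension gate in [0, 1]
            u = self.update(joint)              # candidate update from this cue
            memory = (1 - g) * memory + g * u   # gated memory write
        return memory  # fused representation for interaction classification

In use, one would encode each cue to the same dimensionality and call, say, fusion([spatial_feat, semantic_feat, motion_feat]); the gate lets an uninformative cue leave the memory largely unchanged, which is the appeal of a gated design over plain concatenation.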