No "zero-shot" without exponential data: pretraining concept frequency determines multimodal model performance

Udandarao V, Prabhu AP, Ghosh A, Sharma Y, Torr P, Bibi A, Albanie S, Bethge M

Web-crawled pretraining datasets underlie the impressive “zero-shot” evaluation performance of
multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation.
However, it is unclear how meaningful the notion of “zero-shot” generalization is for such multimodal
models, as it is not known to what extent their pretraining datasets encompass the downstream concepts
targeted during "zero-shot" evaluation. In this work, we ask: How is the performance of multimodal
models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets?
We comprehensively investigate this question across 34 models and five standard pretraining datasets
(CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data
artifacts. We consistently find that, far from exhibiting “zero-shot” generalization, multimodal models
require exponentially more data to achieve linear improvements in downstream “zero-shot” performance,
following a sample-inefficient log-linear scaling trend. This trend persists even when controlling for
sample-level similarity between pretraining and downstream datasets [79], and testing on purely synthetic
data distributions [51]. Furthermore, when benchmarking models on long-tailed data sampled according to
our analysis, we find that multimodal models across the board perform poorly. We contribute this
long-tail test set as the Let it Wag! benchmark to further research in this direction. Taken together, our
study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization
capabilities under large-scale training paradigms remains to be found.
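
As a rough illustration of the reported log-linear trend (a sketch only; the fit parameters $a$ and $b$ are illustrative and not values from the paper), "zero-shot" performance $P(c)$ on a downstream concept $c$ would scale with its pretraining frequency $f(c)$ as
$$
P(c) \;\approx\; a \log f(c) + b,
$$
so that each additive (linear) gain in performance requires a multiplicative, i.e. exponential, increase in the number of pretraining samples containing $c$.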