Clustering generative adversarial networks for story visualization

Li B, Torr PHS, Lukasiewicz T

Story visualization aims to generate a series of images, one per sentence, that semantically match a given sequence of sentences, while keeping the output images within a story consistent with each other. Current methods generate story images with a heavy architecture consisting of two generative adversarial networks (GANs), one for image quality and one for story consistency, and additionally rely on segmentation masks or auxiliary captioning networks. In this paper, we aim to build a concise, single-GAN-based network that depends neither on additional semantic information nor on captioning networks. To achieve this, we propose an approach for story visualization based on contrastive learning and clustering learning. Our network utilizes contrastive losses between language and visual information to maximize the mutual information between them, and further extends them with clustering learning during training to capture semantic similarity across modalities. As a result, the discriminator provides the generator with comprehensive feedback on both image quality and story consistency simultaneously, allowing a single-GAN-based network to produce high-quality synthetic results. Extensive experiments on two datasets demonstrate that our single-GAN-based network has fewer total parameters than previous methods yet substantially outperforms them, improving the FID from 78.64 to 39.17 and the FSD from 94.53 to 41.18 on Pororo-SV, and establishing a strong benchmark of 76.51 FID and 19.74 FSD on Abstract Scenes.
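
To illustrate the cross-modal contrastive objective described above, the following is a minimal sketch of a symmetric InfoNCE-style loss between sentence and image embeddings. The function names, tensor shapes, and temperature value are illustrative assumptions for exposition, not the authors' implementation.

# Illustrative sketch only: an InfoNCE-style contrastive loss between
# sentence and image embeddings, one plausible instantiation of the
# cross-modal contrastive objective described in the abstract.
# All names, shapes, and the temperature are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb, image_emb, temperature=0.1):
    """text_emb, image_emb: (batch, dim) embeddings of matched sentence-image pairs."""
    # L2-normalize so that dot products become cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    # Pairwise similarity matrix; diagonal entries correspond to matched pairs.
    logits = text_emb @ image_emb.t() / temperature
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Symmetric loss over both retrieval directions (text-to-image, image-to-text).
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2i + loss_i2t)

# Example usage with random features standing in for encoder outputs.
if __name__ == "__main__":
    text_features = torch.randn(8, 256)
    image_features = torch.randn(8, 256)
    print(cross_modal_contrastive_loss(text_features, image_features))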