Learning Affective Explanations for
Real-World Visual Data


  • [04/03/2023] This work has been provisionally accepted to CVPR-2023.
  • [12/22/2022] You can now browse a slightly filtered version of Affection's annotations!


Real-world images often convey emotional intent, i.e., the photographer tries to capture and promote an emotionally interesting story. In this work, we explore the emotional reactions that real-world images tend to induce by using natural language as the medium to express the rationale behind an affective response to a given visual stimulus. To embark on this journey, we introduce and share with the research community a large-scale dataset that contains emotional reactions and free-form textual explanations for 85K publicly available images, analyzed by 6,283 annotators who were asked to indicate and explain how and why they felt in a particular way when observing a particular image, producing a total of 526K responses. Even though emotional reactions are subjective and sensitive to context (personal mood, social status, past experiences) – we show that there is significant common ground to capture potentially plausible emotional responses with large support in the subject population. In light of this key observation, we ask the following questions: i) Can we develop multi-modal neural networks that provide reasonable affective responses to real-world visual data, explained with language? ii) Can we steer such methods towards creating explanations with varying degrees of pragmatic language or justifying different emotional reactions while adapting to the underlying visual stimulus? Finally, iii) How can we evaluate the performance of such methods for this novel task? With this work, we take the first steps to partially address all of these questions, thus paving the way for richer, more human-centric, and emotionally-aware image analysis systems.

Summary of Main Contributions and Findings

  1. We curate and share Affection, a large-scale dataset of emotional reactions to real-world images, along with free-form language explanations behind them.

  2. We introduce the task of Affective Explanation Captioning (AEC) and correspondingly develop neural speakers that can create plausible utterances explaining emotions grounded in real-world images.

  3. Our best neural speaker passes an emotional Turing test 65% of the time; i.e., third-party human judges are that likely to take its generations for utterances made by other humans rather than by machines.

  4. Our developed neural listeners and CLIP-based studies indicate that Affection contains a significant amount of discriminative references, enabling the identification of the underlying images from the affective explanations alone. Building on this observation, we also experiment with and provide pragmatic neural speaker variants.
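At its core, a CLIP-based pragmatic variant of this kind can be thought of as re-ranking candidate explanations by their similarity to the image. The sketch below illustrates that idea with plain NumPy over precomputed embeddings; the helper name `rerank_by_image_similarity` and the toy vectors are illustrative stand-ins (not part of our released code) — in practice the embeddings would come from CLIP's image and text encoders.

```python
import numpy as np

def rerank_by_image_similarity(image_emb, caption_embs):
    """Order candidate explanations by cosine similarity to the image.

    image_emb:    (d,) embedding of the visual stimulus (e.g., from a
                  CLIP image encoder).
    caption_embs: (n, d) embeddings of n candidate explanations sampled
                  from a neural speaker (e.g., from a CLIP text encoder).
    Returns the candidate indices, most image-aligned first.
    """
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = caps @ img               # cosine similarity per candidate
    return np.argsort(-sims)        # descending similarity

# Toy example: the second candidate is the most aligned with the image.
image = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the image
    [0.9, 0.1, 0.0],   # well aligned
    [0.5, 0.5, 0.0],   # partially aligned
])
order = rerank_by_image_similarity(image, candidates)
print(order[0])  # → 1
```

Selecting (or up-weighting) the candidate that CLIP scores highest against the image is what pushes the speaker's output toward more discriminative, visually grounded explanations.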

Qualitative Neural Speaking Results

Examples of neural speaker generations on unseen images from our emotion-grounded, pragmatic speaker variant. The top row includes generations that reflect a positive sentiment, while the bottom row showcases generations grounded on similar visual subjects (object classes) e.g., another dog, food item, etc., that give rise to negative emotions. Remarkably, this neural speaker appears to take into account the underlying fine-grained visual differences to properly modulate its output, providing strong explanatory power behind the emotional reactions. Note, also, how the explanations can include purely human-centric semantics (‘nostalgic of my childhood’, ‘love coffee’), and use explicit psychological assessments (‘feel content/excited/disgusted’, ‘is depressing’).
Examples of neural speaker generations with an emotion-grounded speaker variant on unseen test images. The grounding emotion (shown in boldface) is predicted at inference time by a separately trained image-to-emotion classifier. We ground the speaker’s generation with two emotions for each image, corresponding to the most likely (top row) and second most likely (bottom row) predictions. As seen in this figure, this variant provides a degree of control over the output by aligning it with the requested/input emotion.
Effect of boosting the pragmatic content of neural speaker generations via CLIP. Aside from often correcting the identity of shown objects/actions (right-most image is indeed taken inside an airport), this pragmatic variant tends to use more visual details in its explanations (‘standing in the sand’), and perhaps more importantly to expand the explanation to include non-visual but valid associations (e.g., ‘take a nap’, or ‘do not like crowds’).
Failure modes of neural speakers. The two left-most examples show generic problems that all neural variants might suffer from: e.g., misidentifying the underlying visual elements (example A) or making nonsensical emotional judgments (example B). While the third example (C) is sensible, it highlights how an emotion-grounded variant can over-focus on the underlying emotion and miss crucial visual details (e.g., the shown fence). On the contrary, the pragmatic variant (example D) can overcompensate by including dubious visual details (the default neural speaker simply mentions the zebras in this example). For more details, see our Sup. Mat.
Legal Disclaimer: The images shown in the above qualitative results come from the image-centric datasets underlying our Affection dataset, which are described in detail in Section 3 of our manuscript; i.e., the datasets of MS-COCO, Emotional-Machines, Flickr30kEntities, Visual Genome, and those included in the work of Quanzeng et al. We do not hold or claim any ownership or copyright over these images.

The Affection Dataset

The Affection dataset is provided under specific terms of use. We are in the process of releasing it. If you are interested in downloading the data, please first fill out the corresponding form accepting the terms. We will email you once the data are ready.

Meanwhile, you can quickly browse a slightly filtered version of Affection's annotations!

Important Disclaimer: Affection is a real-world dataset containing the opinions and sentiments of thousands of people. Thus, we expect it to include text with certain biases, factual inaccuracies, and possibly foul language. The provided neural networks are also likely biased and inaccurate, similar to their training data. Neither their output nor the subjective opinions and sentiments present in Affection express, in any way, the personal views and preferences of the authors. Please use our work responsibly.


To contact all authors, please use affective.explanations@gmail.com, or their individual emails.


If you find our work useful in your research, please consider citing:

    @article{achlioptas2022affection,
        title={{Affection}: Learning Affective Explanations for
                          Real-World Visual Data},
        author={Achlioptas, Panos and Ovsjanikov, Maks
                and Guibas, Leonidas and Tulyakov, Sergey},
        journal={Computing Research Repository (CoRR)},
    }


P.A. wants to thank Professors James Gross, Noah Goodman, and Dan Jurafsky for their initial discussions and the ample motivation they provided for exploring this research direction. He also wants to thank Ashish Mehta for fruitful discussions on alexithymia, Grygorii Kozhemiak for helping design Affection's logos and webpage, and Ishan Gupta for his help building Affection's browser. Last but not least, the authors want to emphasize their gratitude to all the hard-working Amazon Mechanical Turkers without whom this work would be impossible.

Parts of this work were supported by the ERC Starting Grant No. 758800 (EXPROTEA), the ANR AI Chair AIGRETTE, and a Vannevar Bush Faculty Fellowship.