KiVA: Kid-inspired Visual Analogies for Large Multimodal Models

1University of California, Berkeley, 2Boston University, 3Google DeepMind, 4Toyota Technological Institute at Chicago
Figure 1: Example analogies from KiVA and KiVA-adults.

Our benchmark evaluates analogical reasoning in five basic visual domains. KiVA demands generalization to a different object and is solvable by young children. KiVA-adults demands further generalization, to a different object and to a different starting value, and is solvable by adults.




KiVA

Sample trials from KiVA.




KiVA-adults

Sample trials from KiVA-adults, which is solvable by adults but not by young children.

Abstract

This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children.

A “visual analogy” is an abstract rule inferred from one image and applied to another. While benchmarks exist for testing visual reasoning in LMMs, they require advanced skills and omit basic visual analogies that even young children can make. Inspired by developmental psychology, we propose Kid-inspired Visual Analogies (KiVA), composed of 4,300 visual transformations of everyday objects, to test LMMs on visual analogical reasoning and compare them to children and adults. We structure the evaluation into three stages: identifying what changed (e.g., color, number), specifying how it changed (e.g., one object was added), and applying the rule to new scenarios.

Our findings show that while GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS identify the “what” effectively, they struggle with quantifying the “how” and with extrapolating the rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. Additionally, the strongest tested model, GPT-o1, performs better on tasks involving simple, surface-level visual attributes such as color and size, which also elicit quicker response times from human adults. Conversely, more complex tasks such as number, rotation, and reflection, which require more extensive cognitive processing and an understanding of extrinsic spatial properties of the physical world, remain far more challenging. Altogether, these findings highlight the limitations of training models on data that consists primarily of 2D images and text.

Our Query Pipeline

Figure 3: Query pipeline.

Models are asked to classify what changed, specify how it changed, and visually extrapolate by applying the same change to a new object.
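
To make the three-stage protocol concrete, here is a minimal sketch of how a single trial could be queried, assuming a generic multimodal chat API wrapped by a hypothetical query_model(images, prompt) helper. The prompt wording and data layout are illustrative assumptions, not the exact prompts or code used in KiVA.

from dataclasses import dataclass, field

@dataclass
class Trial:
    train_before: str                 # "before" image of the training object
    train_after: str                  # "after" image showing the transformation
    test_before: str                  # "before" image of a new object
    test_options: list = field(default_factory=list)  # candidate "after" images (A, B, C)

def query_model(images, prompt):
    """Hypothetical wrapper around a multimodal model API; returns the model's text answer."""
    raise NotImplementedError

def run_trial(trial: Trial):
    train = [trial.train_before, trial.train_after]

    # Stage 1: classify WHAT changed (color, size, number, rotation, or reflection).
    what = query_model(
        train,
        "Between these two images, which property of the object changed: "
        "color, size, number, rotation, or reflection?",
    )

    # Stage 2: specify HOW it changed within that domain.
    how = query_model(
        train,
        f"You said the {what} changed. Describe exactly how it changed "
        "(e.g., turned red, became bigger, one object was added, rotated 90 degrees).",
    )

    # Stage 3: visually extrapolate by applying the same change to a new object.
    choice = query_model(
        train + [trial.test_before] + trial.test_options,
        "Apply the same transformation to the new object. "
        "Which candidate image (A, B, or C) shows the correct result?",
    )
    return what, how, choice

Each stage conditions on the same training pair, mirroring the classify, specify, and extrapolate order described above.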

Our Findings


Models' visual extrapolation performance depends on the visual domain: they are weaker in the number and spatial (rotation and reflection) domains.





On KiVA visual extrapolation, GPT-o1's error scores correlate with children's error scores and with adults' response times.





Unlike humans, models get worse as reasoning complexity increases from verbal description to visual extrapolation in KiVA.





Models become more inconsistent as reasoning complexity increases, choosing different answers when the same trial's options are presented in a different randomized order.
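
As a rough illustration of how such consistency can be checked, the sketch below re-asks the same multiple-choice question with the options shuffled and tests whether the model keeps selecting the same underlying option. The query_model helper, prompt wording, and answer parsing are assumptions for illustration, not the paper's exact procedure.

import random

def is_consistent(query_model, images, options, repeats=3, seed=0):
    """Ask the same multiple-choice question several times with the options
    shuffled; return True only if the model picks the same underlying option
    every time. query_model(images, prompt) is assumed to return a letter."""
    rng = random.Random(seed)
    picked = set()
    for _ in range(repeats):
        order = options[:]                       # shuffle a copy, keep the original order intact
        rng.shuffle(order)
        prompt = "Which option shows the correct result?\n" + "\n".join(
            f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(order)
        )
        letter = query_model(images, prompt).strip().upper()[:1]
        picked.add(order[ord(letter) - ord("A")])  # map the letter back to the underlying option
    return len(picked) == 1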





When models solve KiVA above chance level (GPT-o1 in all domains; GPT-4V in the color and size domains), successful visual extrapolation is contingent on first solving verbal classification or specification correctly.
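
One way to read this contingency is as a conditional accuracy: among trials where the verbal stages were solved, how often is the visual extrapolation also correct, compared with trials where the verbal stages failed? A minimal sketch, assuming per-trial records with hypothetical boolean fields rather than the paper's actual analysis code:

def extrapolation_accuracy_given_verbal(trials, verbal_correct=True):
    """trials: iterable of dicts with boolean keys 'verbal_correct'
    (classification or specification solved) and 'extrapolation_correct'.
    Returns extrapolation accuracy conditioned on verbal-stage success or failure."""
    subset = [t for t in trials if t["verbal_correct"] == verbal_correct]
    if not subset:
        return None
    return sum(t["extrapolation_correct"] for t in subset) / len(subset)

# A large gap between extrapolation_accuracy_given_verbal(records, True) and
# extrapolation_accuracy_given_verbal(records, False) would indicate that
# extrapolation success depends on getting the verbal stages right.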





BibTeX

@inproceedings{yiu2025kiva,
  title={Ki{VA}: Kid-inspired Visual Analogies for Testing Large Multimodal Models},
  author={Eunice Yiu and Maan Qraitem and Anisa Noor Majhi and Charlie Wong and Yutong Bai and Shiry Ginosar and Alison Gopnik and Kate Saenko},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=vNATZfmY6R}
}