KiVA: Kid-inspired Visual Analogies for Large Multimodal Models

This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children.

A “visual analogy” is an abstract rule inferred from one image and applied to another. While benchmarks exist for testing visual reasoning in LMMs, they require advanced skills and omit basic visual analogies that even young children can make. Inspired by developmental psychology, we propose Kid-inspired Visual Analogies (KiVA), composed of 4,300 visual transformations of everyday objects to test LMMs on visual analogical reasoning and compare them to children and adults. We structure the evaluation into three stages: identifying what changed (e.g., color or number), identifying how it changed (e.g., added one object), and applying the rule to new scenarios.

Our findings show that while GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS identify the “what” effectively, they struggle with quantifying the “how” and extrapolating this rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. Additionally, the strongest tested model, GPT-o1, performs better in tasks involving simple surface-level visual attributes like color and size, correlating with quicker human adult response times. Conversely, more complex tasks such as number, rotation, and reflection, which necessitate extensive cognitive processing and understanding of extrinsic spatial properties in the physical world, present more significant challenges. Altogether, these findings highlight the limitations of training models on data that primarily consists of 2D images and text.

Our benchmark evaluates analogical reasoning in five basic visual domains. KiVA demands generalization to a different object and is solvable by young children. KiVA-adults demands further generalization, to a different object and to a different starting value, and is solvable by adults.


KiVA-adults

KiVA-adults is solvable by adults but not by young children.

Our Query Pipeline

Models are asked to classify what changed, specify how it changed, and visually extrapolate by applying the same change to a new object.
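The three stages can be sketched as an ordered sequence of multiple-choice queries per trial. This is only an illustrative sketch: the `Trial` class, the prompt wording, and the option strings are hypothetical and are not taken from the KiVA code or stimuli.

```python
from dataclasses import dataclass

# Stage order follows the pipeline described above.
STAGES = ("classification", "specification", "extrapolation")

# Hypothetical prompt wording, for illustration only.
PROMPTS = {
    "classification": "What changed between the two images?",
    "specification": "How did it change?",
    "extrapolation": "Which option applies the same change to the new object?",
}


@dataclass(frozen=True)
class Trial:
    domain: str    # e.g. "color", "size", "number"
    options: dict  # stage name -> list of answer options
    answers: dict  # stage name -> correct option


def build_queries(trial):
    """Return (stage, prompt, options) tuples in pipeline order."""
    return [(stage, PROMPTS[stage], trial.options[stage]) for stage in STAGES]


def score(trial, responses):
    """Mark each stage correct (True) or incorrect (False)."""
    return {stage: responses.get(stage) == trial.answers[stage] for stage in STAGES}


# Hypothetical number-domain trial: one object is added.
example = Trial(
    domain="number",
    options={
        "classification": ["number", "size", "no change"],
        "specification": ["added one", "removed one", "no change"],
        "extrapolation": ["A", "B", "C"],
    },
    answers={
        "classification": "number",
        "specification": "added one",
        "extrapolation": "B",
    },
)
queries = build_queries(example)
result = score(example, {"classification": "number",
                         "specification": "removed one",
                         "extrapolation": "B"})
```

Scoring each stage separately makes it possible to see where a response first goes wrong, rather than only whether the final extrapolation is correct.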

Our Findings

Models' visual extrapolation performance depends on the visual domain: performance is weaker in the number and spatial domains.

In the visual extrapolation of KiVA, GPT-o1's error scores correlate with children's error scores and adults' response times.

Models get worse with increasing reasoning complexity from verbal description to visual extrapolation in KiVA, unlike humans.

Models become more inconsistent (selecting different answers within the same trial when the order of the choices is randomized) with increasing reasoning complexity.
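One way to make the notion of inconsistency concrete is sketched below. It assumes that a trial is repeated with its options shuffled and that a response is recorded by the content of the chosen option rather than by its position. The function names and the number of repetitions are assumptions for illustration, not the authors' evaluation code.

```python
import random


def shuffled_presentations(options, n_repeats, seed=0):
    """Present the same options n_repeats times, each in a random order."""
    rng = random.Random(seed)
    presentations = []
    for _ in range(n_repeats):
        order = list(options)
        rng.shuffle(order)
        presentations.append(order)
    return presentations


def is_consistent(chosen_contents):
    """True if the same option content was chosen in every repetition."""
    return len(set(chosen_contents)) == 1


def inconsistency_rate(trials):
    """Fraction of trials with at least two different choices."""
    flags = [not is_consistent(choices) for choices in trials]
    return sum(flags) / len(flags)


# Placeholder responses (not real data): two trials, three repetitions each.
toy_trials = [
    ["added one", "added one", "added one"],    # consistent
    ["added one", "removed one", "added one"],  # inconsistent
]
rate = inconsistency_rate(toy_trials)
```

Comparing chosen content instead of option position keeps the measure from being confounded by a positional bias, such as a model that always picks the first option.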

Successful visual extrapolation is contingent on solving verbal classification or specification correctly when models are solving KiVA above chance level (refer to GPT-o1 in all domains and GPT-4V in color and size domains).
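The contingency described above can be checked by comparing extrapolation accuracy on trials where an earlier verbal stage was answered correctly against trials where it was not. The sketch below assumes per-trial records of stage correctness. The record layout and the toy values are hypothetical placeholders, not results from the paper.

```python
def mean_or_none(values):
    """Mean of a list of booleans, or None if the list is empty."""
    return sum(values) / len(values) if values else None


def conditional_accuracy(records, given_stage, target_stage="extrapolation"):
    """Accuracy on target_stage, split by correctness on given_stage.

    Returns (accuracy when given_stage was correct,
             accuracy when given_stage was incorrect).
    """
    when_correct = [r[target_stage] for r in records if r[given_stage]]
    when_wrong = [r[target_stage] for r in records if not r[given_stage]]
    return mean_or_none(when_correct), mean_or_none(when_wrong)


# Placeholder records (not real data): True means the stage was answered correctly.
toy_records = [
    {"classification": True, "specification": True, "extrapolation": True},
    {"classification": True, "specification": False, "extrapolation": False},
    {"classification": False, "specification": False, "extrapolation": False},
    {"classification": False, "specification": False, "extrapolation": False},
]
acc_given_classification = conditional_accuracy(toy_records, "classification")
acc_given_specification = conditional_accuracy(toy_records, "specification")
```

A large gap between the two values in each pair would be in line with extrapolation depending on the earlier stage. Near-equal values would suggest the stages are solved independently.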

