
Evaluating Model Perception of Color Illusions in Photorealistic Scenes

University of California, Berkeley

Contrast Illusion


contrast1

Is the right side of the bottle (B) darker than the left (A), or maybe not?

contrast2

How do the colors of the walls (A, B) on the left and right sides of the man compare? Is B darker?

contrast3

Which side of the walls, left (A) or right (B), is darker in color?

contrast4

How do the shadow colors of the bushes (A, B) on the left and right sides of the girl compare?

contrast5

How do the shadow colors on the road's left and right sides (A, B) compare? Is B darker?

contrast6

Is the mirrored surface B darker in color than A? Are you sure?


Stripe Illusion


stripe1

Does the blue color of the books in the right column look duller compared to the left column?

stripe2

Do the yellow flowers on the right side seem more vibrant in color?

stripe3

How do the colors of the carrots on the left and right columns compare?

stripe4

Are the colors of the pink shelves in the bottom left and top right the same, or do they differ?

stripe5

Is the color of the violets at the bottom darker than at the top?

stripe6

How do the colors of the strawberry jam on the bread at the top and bottom compare?

Filter Illusion


filter1

What color is the bus in the image?

filter2

What color are the scattered flowers on the ground?

filter3

What color is the body of the airplane in the image?

filter4

What color are the T-shirts worn by the people in the image?

filter5

What is the color of the hat cake?

filter6

What is the color of the fire hydrant in the image?



Abstract

We study the perception of color illusions by vision-language models. Color illusions, in which the human visual system perceives color differently from the actual pixel color, are well studied in human vision. However, it remains underexplored whether vision-language models (VLMs), trained on large-scale human data, exhibit similar perceptual biases when confronted with such color illusions. We propose an automated framework for generating color illusion images, resulting in RCID (Realistic Color Illusion Dataset), a dataset of 19,000 realistic illusion images. Our experiments show that all studied VLMs exhibit perceptual biases similar to human vision. Finally, we train a model that can distinguish between human perception and actual pixel differences.

Contributions

  1. We propose an automated framework for generating realistic illusion images and create a large, realistic dataset of color illusion images, named Realistic Color Illusion Dataset (RCID), to enhance the fairness and accuracy of model testing.
  2. We investigate the underlying mechanisms of color illusions in VLMs, highlighting the combined influence of the vision system and prior knowledge. We also explore how external prompts and instruction tuning affect the models' performance on these illusions.
  3. We propose a simple training method that enables models to understand human perception while also recognizing the actual pixel values.

Illusion Dataset Construction

Some prior works have also explored VLMs' performance on visual illusions. However, existing datasets share a drawback: they directly use illusion images found on the web. Most of these images (e.g., 60% of the IllusionVQA dataset) are well-known examples of these illusions; thus, VLMs have likely memorized human-like behavioral responses to them rather than relying on perceptual reasoning.

Additionally, the limited scale restricts the depth and variety of analyses that can be conducted. Therefore, we generated a larger-scale color illusion dataset, embedding color illusions into realistic world scenarios.

The construction of our dataset involves three steps:

  1. Image Generation. For contrast and stripe illusions, we use procedural code to generate simple illusion images, which are then processed by ControlNet to create realistic illusion images (a minimal sketch of the procedural stage follows this list). For filter illusions, we directly apply contrasting color filters to the original images. Each type of illusion also includes a corresponding control group without any illusions for comparison.
  2. Question Generation. We use GPT-4o to generate image-specific questions that are designed to evaluate the model's understanding of the illusion.
  3. Human Feedback. We collect human participants' feedback on these images and adjust the original classification of “illusion” and “non-illusion” based on whether participants are deceived.
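
To make the procedural stage of step 1 concrete, below is a minimal sketch, assuming numpy and Pillow; the function name make_contrast_illusion and all parameter values are illustrative, and the realistic rendering with ControlNet is a separate downstream step not shown here.

    import numpy as np
    from PIL import Image

    def make_contrast_illusion(size=512, bg_left=60, bg_right=200, target=130):
        """Simultaneous-contrast stimulus: two identical gray rectangles (A, B)
        placed on backgrounds of different brightness, so that A and B appear
        to differ in color even though their pixel values are equal."""
        img = np.zeros((size, size), dtype=np.uint8)
        img[:, : size // 2] = bg_left            # darker left background
        img[:, size // 2 :] = bg_right           # brighter right background

        h, w = size // 4, size // 6              # rectangle dimensions
        top = size // 2 - h // 2
        for cx in (size // 4, 3 * size // 4):    # center of each half
            img[top : top + h, cx - w // 2 : cx + w // 2] = target  # same gray

        return Image.fromarray(img)

    make_contrast_illusion().save("contrast_stimulus.png")

A non-illusion control image can be produced the same way by setting both background halves to the same brightness.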

Data statistics of RCID

Main Results

We evaluate a range of open-source vision-language models on generated illusion and non-illusion images in our development set, with questions explicitly asking for color judgments 'Based on pixel values' or 'Based on human perception'. The results show that, after fine-tuning on non-illusion images, these models achieve high accuracy (75%) on non-illusion images, while their accuracy on illusion images is significantly lower. We find that explicitly querying for judgments 'Based on pixel values' or 'Based on human perception' does not significantly change model performance. In addition, models often produce responses that are completely inaccurate, matching neither the pixel values nor human perception. This suggests that while the models are misled by color illusions to some extent, they still struggle to fully model human perception.
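
As a concrete picture of this dual scoring, here is a hedged sketch; the record fields model_answer, pixel_label, and human_label are hypothetical names for our annotations, and on non-illusion images the two labels coincide.

    def score(records):
        """Classify each response as pixel-accurate, human-like, or neither.
        On illusion images pixel_label and human_label differ; on
        non-illusion images they coincide."""
        counts = {"pixel": 0, "human": 0, "neither": 0}
        for r in records:
            ans = r["model_answer"].strip().lower()
            if ans == r["pixel_label"].lower():
                counts["pixel"] += 1
            elif ans == r["human_label"].lower():
                counts["human"] += 1
            else:
                counts["neither"] += 1       # matches neither ground truth
        total = max(len(records), 1)
        return {k: v / total for k, v in counts.items()}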

Experiment Analysis

1. Factors affecting the strength of color illusions

We explore a range of visual factors that may influence the strength of color illusions and compare whether these factors affect human perception and VLMs in the same way. We focus on three potential influencing factors: the orientation of the illusion, the contrast between foreground and background colors (for contrast illusions only), and the number of stripes (for stripe illusions only). Overall, our findings indicate that these factors significantly impact the strength of the illusion. For example, altering certain factors, such as increasing the color contrast between the foreground and background, can turn a non-illusion image into an illusion image. These effects are consistent across both humans and VLMs. In these experiments, we use LLaVA-1.5 (7B) as our tested model.
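
The manipulation itself is straightforward; the sketch below varies the foreground-background contrast using the hypothetical make_contrast_illusion() from the dataset-construction sketch above, with the actual VLM queries elided.

    # Sweep the background brightness gap; larger gaps should strengthen
    # the illusion (and can turn a non-illusion stimulus into an illusion).
    for delta in (20, 60, 100, 140):
        stim = make_contrast_illusion(bg_left=128 - delta // 2,
                                      bg_right=128 + delta // 2)
        stim.save(f"contrast_delta_{delta:03d}.png")
        # query the VLM on stim and record whether it is deceived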

Deception rates of humans and VLMs across different structural patterns:

Error rates of humans and LLaVA across different conditions in color illusions:

2. The impact of prompting methods and fine-tuning on model performance

We investigate whether external prompts can influence VLM responses. Unlike the main experiments illustrated in Figure X, here we aim to identify the model's bias toward predicting answers based on either human-like perception or pixel-level values, without explicitly specifying this distinction in the question.

To achieve this, we evaluate several prompting methods: (1) a simple prompt that focuses solely on questions about color comparisons; (2) chain of thought (CoT), where the model is guided to first consider factors in the image that might affect color perception; (3) emphasizing the illusion by explicitly informing the model that the image contains an illusion, without specifying its type; and (4) providing few-shot examples, using either no-illusion (NI) answers or human-like (HL) answers as 3-shot examples.
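
The sketch below gives illustrative versions of these prompt templates; the exact wording used in our experiments may differ, and the example questions and answers are placeholders.

    BASE_Q = "Which rectangle, A or B, is darker in color?"

    # Placeholder 3-shot examples: NI answers report actual pixel values,
    # HL answers report human-like perception.
    EXAMPLES_NI = ["Q: <example question>\nA: They are the same color."] * 3
    EXAMPLES_HL = ["Q: <example question>\nA: B looks darker."] * 3

    PROMPTS = {
        "simple": BASE_Q,
        "cot": ("First consider any factors in the image that might affect "
                "color perception, then answer: " + BASE_Q),
        "illusion_hint": "This image contains a visual illusion. " + BASE_Q,
        "few_shot_NI": "\n\n".join(EXAMPLES_NI) + "\n\n" + BASE_Q,
        "few_shot_HL": "\n\n".join(EXAMPLES_HL) + "\n\n" + BASE_Q,
    }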

We also explore whether fine-tuning the model on no-illusion (NI) or human-like (HL) examples can improve its understanding of color illusions.

3. Purely vision-based models (e.g., CNNs) can also be deceived by visual illusions

To investigate whether VLMs' perceptual biases originate from their visual components, we test a range of purely visual models. We design a simple classification task based on rectangle contrast illusions. In this task, we generate backgrounds with varying brightness on both sides and place a rectangle on each side. The vision model is trained as a three-way classifier on 6,000 non-illusion images to predict which rectangle appears darker, or whether their colors are identical, stopping when the loss shows no significant decrease. It is then tested on 1,000 non-illusion and 1,000 illusion images. A minimal sketch of this setup follows.
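
The sketch assumes PyTorch; the architecture and hyperparameters are illustrative rather than the exact configuration used in our experiments, and the data loader is hypothetical.

    import torch
    import torch.nn as nn

    class RectClassifier(nn.Module):
        """Three-way classifier: left rectangle darker, right rectangle
        darker, or identical colors."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4),
            )
            self.head = nn.Linear(32 * 4 * 4, 3)

        def forward(self, x):
            return self.head(self.features(x).flatten(1))

    model = RectClassifier()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    # Train on 6,000 non-illusion images (hypothetical loader), stopping
    # when the loss plateaus; then test on non-illusion and illusion sets.
    # for images, labels in non_illusion_loader:
    #     opt.zero_grad()
    #     loss_fn(model(images), labels).backward()
    #     opt.step()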

BibTeX


        @misc{mao2024evaluatingmodelperceptioncolor,
          title={Evaluating Model Perception of Color Illusions in Photorealistic Scenes}, 
          author={Lingjun Mao and Zineng Tang and Alane Suhr},
          year={2024},
          eprint={2412.06184},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2412.06184}, 
        }