New research shows that aligning a model’s visual representations with human knowledge can make it more useful, robust, and reliable.
“Visual” artificial intelligence (AI) is everywhere. We use it to classify photos, identify unknown flowers, and steer cars. But these powerful systems don’t always “see” the world in the same way that we do, and sometimes behave in surprising ways. For example, an AI system that can identify hundreds of car makes and models may not be able to capture what cars and airplanes have in common: They are both large vehicles made primarily of metal.
To better understand these differences, today we publish a new paper in Nature that analyzes important ways AI systems organize the visual world differently from humans. We present methods to better align these systems with human knowledge and show that addressing these inconsistencies improves the systems’ robustness and generalizability.
This work is a step toward building more intuitive and trustworthy AI systems.
Why AI gets the “odd one out” wrong
When you see a cat, your brain builds a mental representation that captures everything about it, from low-level features like color and fur to more abstract concepts like “catness.” AI vision models also build representations: they map images to points in a high-dimensional space where similar items (such as two sheep) sit close together and different items (such as a sheep and a cake) sit far apart.
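As a concrete (if simplified) illustration, the sketch below uses NumPy with made-up three-dimensional vectors standing in for a real model’s embeddings; the item names and numbers are purely hypothetical, and real embeddings typically have hundreds or thousands of dimensions:

```python
import numpy as np

# Illustrative stand-ins for the high-dimensional embeddings a vision
# model would produce for each image.
embeddings = {
    "sheep_1": np.array([0.9, 0.1, 0.0]),
    "sheep_2": np.array([0.8, 0.2, 0.1]),
    "cake":    np.array([0.0, 0.2, 0.9]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embeddings: near 1.0 = very similar, near 0.0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similar items sit close together in the space (high similarity)...
print(cosine_similarity(embeddings["sheep_1"], embeddings["sheep_2"]))  # ~0.98
# ...while different items sit far apart (low similarity).
print(cosine_similarity(embeddings["sheep_1"], embeddings["cake"]))     # ~0.02
```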
To understand how human and model representations differ in structure, we used a classic “odd-one-out” task from cognitive science, asking both humans and models to choose which of three given images does not fit with the other two. The choice reveals which two items the chooser considers most similar.
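Continuing the sketch above, one minimal way to pose this task to a model is to compare the similarity of each pair of embeddings and report the item left out of the most similar pair (a hypothetical illustration, not the paper’s exact procedure):

```python
from itertools import combinations

def odd_one_out(items: dict) -> str:
    """Return the item that is NOT in the most similar pair of the triplet."""
    names = list(items)
    assert len(names) == 3, "The task is defined over exactly three images."
    # Find the pair whose embeddings are closest to each other...
    best_pair = max(
        combinations(names, 2),
        key=lambda pair: cosine_similarity(items[pair[0]], items[pair[1]]),
    )
    # ...and report the remaining item as the odd one out.
    return next(name for name in names if name not in best_pair)

print(odd_one_out(embeddings))  # "cake": the two sheep form the most similar pair
```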
Sometimes everyone agrees. Given a tapir, a sheep, and a birthday cake, both humans and models reliably choose the cake as the odd one out. In other cases, there is no single correct answer, and people and models disagree.
Interestingly, we found many cases where humans strongly agreed on the answer but the AI models chose differently. In the third example below, most people agree that the starfish is the odd one out. However, many vision models latch onto superficial features such as background color and texture and choose the cat instead.