Over the past decades, computer scientists have created increasingly advanced artificial intelligence (AI) models, some of which can perform similarly to humans on specific tasks. The extent to which these models truly “think” and analyze information like humans, however, is still a heated topic of discussion.
Researchers at the Max Planck Institute for Biological Cybernetics, the Institute for Human-Centered AI at Helmholtz Munich and the University of Tübingen recently set out to better understand the extent to which multi-modal large language models (LLMs), a promising class of AI models, grasp complex interactions and relationships in visual cognition tasks.
Their findings, published in Nature Machine Intelligence, show that while some LLMs perform well in tasks that entail processing and interpreting data, they often fail to capture the more intricate relationships that humans grasp with ease.
“We were inspired by an influential paper by Brenden M. Lake and others, which outlined key cognitive components required for machine learning models to be considered human-like,” Luca M. Schulze Buschoff and Elif Akata, co-authors of the paper, told Tech Xplore.
“When we began our project, there was promising progress in vision language models that can process both language and images. However, many questions remained about whether these models could perform human-like visual reasoning.”
The main objective of the recent study by Buschoff, Akata and their colleagues was to assess the ability of multi-modal LLMs to grasp specific aspects of visual processing tasks, such as intuitive physics, causal relationships and the intuitive understanding of people’s preferences. This could, in turn, help shed light on whether the models’ capabilities can genuinely be considered human-like.
To determine this, the researchers carried out a series of controlled experiments in which they tested the models on tasks derived from past psychology studies. This approach to testing AI was pioneered in an earlier paper by Marcel Binz and Eric Schulz, published in PNAS.
“For example, to test their understanding of intuitive physics, we gave the models images of block towers and asked them to judge whether a given tower is stable or not,” explained Buschoff and Akata.
“For causal reasoning and intuitive psychology, the models needed to infer relationships between events or understand the preferences of other agents. We then evaluated their performance and compared it with that of human participants who took part in the same experiments.”
By comparing the models’ responses on these tasks with those given by human participants, the researchers could pinpoint where the models aligned with humans and where they fell short.
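As an illustration only (not the authors’ actual pipeline), a minimal sketch of how such an evaluation might be set up for the block-tower stability task could look like the following. The `query_vision_model` function is a hypothetical placeholder for whichever vision-language model API is being tested, and the human judgments are assumed to be available from the psychology experiments the tasks were derived from:

```python
# Illustrative sketch: score a vision-language model's yes/no stability
# judgments on block-tower images against ground truth and human answers.
# `query_vision_model` is a hypothetical placeholder, not a real API.

from typing import Callable, List

PROMPT = "Here is a tower of blocks. Will it remain stable, yes or no?"

def evaluate_stability_task(
    query_vision_model: Callable[[str, str], str],  # (image_path, prompt) -> answer text
    image_paths: List[str],
    ground_truth: List[bool],      # physically correct stability labels
    human_judgments: List[bool],   # majority human answer per image (assumed available)
) -> dict:
    """Compare a model's stability judgments with ground truth and with humans."""
    model_answers = []
    for path in image_paths:
        reply = query_vision_model(path, PROMPT).strip().lower()
        model_answers.append(reply.startswith("yes"))

    n = len(image_paths)
    model_accuracy = sum(m == g for m, g in zip(model_answers, ground_truth)) / n
    human_accuracy = sum(h == g for h, g in zip(human_judgments, ground_truth)) / n
    agreement = sum(m == h for m, h in zip(model_answers, human_judgments)) / n

    return {
        "model_accuracy": model_accuracy,    # raw task performance
        "human_accuracy": human_accuracy,    # baseline from human participants
        "model_human_agreement": agreement,  # alignment with human responses
    }
```

Scoring the model against both the physically correct labels and the human answers separates raw task performance from alignment with human behavior, which mirrors the kind of comparison described above.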
Overall, their findings showed that although some models were good at processing basic visual data, they still struggled to emulate more intricate aspects of human cognition.
“At this point it is not clear whether this is something that can be solved by scale and more diversity in the training data,” said Buschoff and Akata.
“This feeds into a larger debate on the kinds of inductive biases these models need to be outfitted with. For instance, some argue that these models need to be equipped with basic processing modules, such as a physics engine, so that they achieve a general and robust understanding of the physical world. This even goes back to findings showing that children can predict some physical processes from an early age.”
The recent work by Buschoff, Akata and their colleagues offers valuable new insight into the extent to which current state-of-the-art multi-modal LLMs exhibit human-like cognitive skills. So far, the team has tested models pre-trained on large datasets, but they plan to run additional tests on models fine-tuned on the same types of tasks used in the experiments.
“Our early results with fine-tuning show that they do become a lot better at the specific task they are trained on,” added Buschoff and Akata.
“However, these improvements don’t always translate to a broader, more generalized understanding across different tasks, which is something that humans do remarkably well.”
More information:
Luca M. Schulze Buschoff et al., Visual cognition in multimodal large language models, Nature Machine Intelligence (2025). DOI: 10.1038/s42256-024-00963-y