Novel technique overcomes spurious correlations problem in AI

AI models often rely on “spurious correlations,” making decisions based on unimportant and potentially misleading information. Researchers have now discovered these learned spurious correlations can be traced to a very small subset of the training data and have demonstrated a technique that overcomes the problem. The work has been published on the arXiv preprint server.

“This technique is novel in that it can be used even when you have no idea what spurious correlations the AI is relying on,” says Jung-Eun Kim, corresponding author of a paper on the work and an assistant professor of computer science at North Carolina State University.

“If you already have a good idea of what the spurious features are, our technique is an efficient and effective way to address the problem. However, even if you are simply having performance issues, but don’t understand why, you could still use our technique to determine whether a spurious correlation exists and resolve that issue.”

Spurious correlations are generally caused by simplicity bias during AI training. Practitioners use datasets to train AI models to perform specific tasks. For example, an AI model could be trained to identify photographs of dogs. The training dataset would include pictures of dogs, each labeled so the AI is told that a dog is in the photo.

During the training process, the AI will begin identifying specific features that it can use to identify dogs. However, if many of the dogs in the photos are wearing collars, the AI may use collars as a simple way to identify dogs, because a collar is generally a less complex feature than a dog's ears or fur. This is how simplicity bias can cause spurious correlations.

“And if the AI uses collars as the factor it uses to identify dogs, the AI may identify cats wearing collars as dogs,” Kim says.

Conventional techniques for addressing problems caused by spurious correlations rely on practitioners being able to identify the spurious features that are causing the problem. They can then address this by modifying the datasets used to train the AI model. For example, practitioners might increase the weight given to photos in the dataset that include dogs that are not wearing collars.
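
The reweighting idea described above can be illustrated with a minimal sketch. It assumes each training photo has already been annotated with a group name (here the hypothetical labels "dog_collar" and "dog_no_collar"), which is exactly the kind of prior knowledge the conventional approach depends on. The function name and the weighting formula are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def group_balanced_weights(groups):
    """Return one weight per sample so each group has the same total weight.

    `groups` is a list of hypothetical group annotations, one per photo.
    Samples from rare groups receive larger weights.
    """
    counts = Counter(groups)
    n_samples = len(groups)
    n_groups = len(counts)
    # Weight is inversely proportional to how common the sample's group is.
    return [n_samples / (n_groups * counts[g]) for g in groups]


# Hypothetical annotations: three collared dogs and one dog without a collar.
groups = ["dog_collar", "dog_collar", "dog_collar", "dog_no_collar"]
weights = group_balanced_weights(groups)
print(weights)
```

In this toy case the single photo of a dog without a collar is weighted three times as heavily as each collared photo, so the two groups contribute equally during training. The sketch only works when the spurious feature is known in advance.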

However, in their new work, the researchers demonstrate that it is not always possible to identify the spurious features that are causing problems. In those cases, conventional techniques for addressing spurious correlations are ineffective.

The paper, “Severing Spurious Correlations with Data Pruning,” will be presented at the International Conference on Learning Representations (ICLR), being held in Singapore from April 24–28. The first author of the paper is Varun Mulchandani, a Ph.D. student at NC State.

“Our goal with this work was to develop a technique that allows us to sever spurious correlations even when we know nothing about those spurious features,” Kim says.

The new technique relies on removing a small portion of the data used to train the AI model.

“There can be significant variation in the data samples included in training datasets,” Kim says. “Some of the samples can be very simple, while others may be very complex. And we can measure how ‘difficult’ each sample is based on how the model behaved during training.

“Our hypothesis was that the most difficult samples in the dataset can be noisy and ambiguous, and are most likely to force a network to rely on irrelevant information that hurts a model’s performance,” Kim explains.

“By eliminating a small sliver of the training data that is difficult to understand, you are also eliminating the hard data samples that contain spurious features. This elimination overcomes the spurious correlations problem, without causing significant adverse effects.”
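
A minimal sketch of this kind of pruning is shown below. The article says only that difficulty is measured from how the model behaved during training, so the specific score used here (each sample's average training loss across epochs) is an assumption for illustration and may differ from the measure used in the paper. The function name `prune_hardest` and the example numbers are likewise hypothetical.

```python
def prune_hardest(loss_history, fraction):
    """Return indices of samples to keep after dropping the hardest ones.

    loss_history[i] holds the losses recorded for sample i, one per epoch.
    Difficulty is ASSUMED to be the mean loss; this is an illustrative
    stand-in, not necessarily the paper's score.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    difficulty = [sum(losses) / len(losses) for losses in loss_history]
    n_prune = int(len(difficulty) * fraction)
    # Rank sample indices from hardest to easiest.
    ranked = sorted(range(len(difficulty)),
                    key=lambda i: difficulty[i], reverse=True)
    pruned = set(ranked[:n_prune])
    return [i for i in range(len(difficulty)) if i not in pruned]


# Hypothetical per-epoch losses for five training samples.
loss_history = [
    [0.9, 0.2, 0.1],
    [1.2, 1.1, 1.3],  # stays hard throughout training
    [0.8, 0.3, 0.1],
    [0.7, 0.4, 0.2],
    [1.0, 0.5, 0.3],
]
kept = prune_hardest(loss_history, fraction=0.2)
print(kept)
```

Here the one sample whose loss never falls is dropped, and the model would then be retrained on the remaining four. Note that no annotation of spurious features is needed at any point, which is the property Kim highlights.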

The researchers demonstrated that the new technique achieves state-of-the-art results, improving performance even when compared with previous work on models where the spurious features were identifiable.

More information:
Varun Mulchandani et al, Severing Spurious Correlations with Data Pruning, arXiv (2025). DOI: 10.48550/arxiv.2503.18258

Journal information:
arXiv


Provided by
North Carolina State University


Citation:
Novel technique overcomes spurious correlations problem in AI (2025, April 18), retrieved 18 April 2025
