Multimodal Models · University of California, Davis
When Vision Speaks for Sound: The Audio-Visual Clever Hans Effect
Top video models look like they hear audio but often just guess it from the visuals. This paper's THUD probes catch the cheat, and a 10K-sample fix lifts audio grounding by 28 points.