¹Graduate School of Science and Technology, University of Tsukuba
²Tsukuba Institute for Advanced Research
³Center for Artificial Intelligence Research, Department of Computer Science, University of Tsukuba
Vision Language Models (VLMs) pose a challenge for deep learning interpretability given their billions of parameters and autoregressive reasoning processes. Building on the Frame Representation Hypothesis from language models, we introduce a novel interpretability approach for the visual domain, enabling systematic extraction and analysis of learned concepts. Our framework reveals inherent biases and vulnerabilities in VLMs caused by a lack of proper safety alignment between their visual and textual components. By combining our interpretability techniques with attack methods, we demonstrate these vulnerabilities, achieving an average 97.7% attack success rate on the Qwen2-VL and Llama 3.2 Vision models. This combination of interpretability analysis and security testing provides crucial insights for developing safer, more transparent models, offering a path toward more trustworthy visual AI systems.
Extended the Frame Representation Hypothesis to the visual domain, enabling systematic interpretability of VLM outputs.
Developed a gradient-free approach combining interpretability with attack methods, achieving a 97.7% average attack success rate (ASR) on state-of-the-art VLMs.
Released a vulnerability analysis dataset, manually verified by native or fluent speakers across 8 languages.
Extract semantic concepts from VLMs using Frame Representations (see the sketch after this list)
Use Concept Guided Decoding (CGD) to steer model outputs
Expose vulnerabilities and biases in model behavior
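To make the pipeline concrete, here is a minimal, self-contained sketch of the first two steps under simplifying assumptions: a toy vocabulary and a random unembedding matrix stand in for a real VLM, and `word_frame`, `concept_vector`, `cgd_step`, and the bias strength `alpha` are illustrative names and parameters, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a real VLM's tokenizer and unembedding (decoder) matrix.
vocab = ["<bos>", "cat", "dog", "pet", "animal", "car", "road", "drive"]
tok = {w: i for i, w in enumerate(vocab)}
d_model = 16
U = rng.normal(size=(len(vocab), d_model))  # one row per token
U /= np.linalg.norm(U, axis=1, keepdims=True)

def word_frame(word_tokens):
    """Frame of a (possibly multi-token) word: the ordered stack of its
    token vectors, following the Frame Representation Hypothesis."""
    return np.stack([U[tok[t]] for t in word_tokens])

def concept_vector(words):
    """A concept as the centroid of the frames of words that share it
    (token vectors averaged within each frame, then across words)."""
    centroid = np.mean([word_frame(w).mean(axis=0) for w in words], axis=0)
    return centroid / np.linalg.norm(centroid)

def cgd_step(logits, concept, alpha=4.0):
    """One Concept Guided Decoding step: bias next-token logits toward
    tokens whose vectors align with the concept direction. No gradients
    are required, which is what makes the approach gradient-free."""
    return logits + alpha * (U @ concept)

# Build a hypothetical "animal" concept and steer one decoding step.
c_animal = concept_vector([["cat"], ["dog"], ["pet"], ["animal"]])
base_logits = rng.normal(size=len(vocab))  # pretend model output
guided_logits = cgd_step(base_logits, c_animal)
print("unguided pick:", vocab[int(base_logits.argmax())])
print("guided pick:  ", vocab[int(guided_logits.argmax())])
```

Applying the guided step at every decoding position nudges generation toward the chosen concept, which is how steering can surface biases in model behavior or, with a harmful concept, probe the gap in safety alignment between the visual and textual components.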
Examples demonstrating how different concepts guide the model to produce specific outputs, revealing inherent biases in vision-language models.
Attack success rates (ASR) by model and attack configuration:

| Model | Text Only | CGD | Image Attack | Combined |
|---|---|---|---|---|
| Qwen2-VL 2B | 26.8% | 81.8% | 97.6% | 99.8% |
| Qwen2-VL 7B | 19.8% | 85.6% | 91.6% | 100% |
| Qwen2-VL 72B | 23.4% | 94.6% | 96.0% | 99.6% |
| Llama 3.2 11B | 54.2% | 90.4% | 83.4% | 94.8% |
| Llama 3.2 90B | 66.4% | 94.0% | 77.4% | 94.4% |
Average attack success rate (mean of the Combined column): 97.7%
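As a quick arithmetic check, the headline number is the mean of the Combined column:

```python
combined = [99.8, 100.0, 99.6, 94.8, 94.4]  # Combined column, top to bottom
print(f"{sum(combined) / len(combined):.1f}%")  # -> 97.7%
```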
@inproceedings{valois2025vision,
  title={Vision Language Model Interpretability with Concept Guided Decoding},
  author={Valois, Pedro H. V. and Satav, Dipesh and de Campos, Rodrigo A. P. and Pratamasunu, Gulpi Q. O. and Fukui, Kazuhiro},
  booktitle={2025 IEEE International Conference on Image Processing (ICIP)},
  pages={397--402},
  year={2025},
  organization={IEEE},
  doi={10.1109/ICIP55913.2025.11084299}
}