ICIP 2025 • Anchorage, Alaska

Vision Language Model Interpretability with Concept Guided Decoding

Pedro H. V. Valois1
Dipesh Satav1
Rodrigo A. P. de Campos1
Gulpi Q. O. Pratamasunu1
Kazuhiro Fukui1,2,3

1Graduate School of Science and Technology, University of Tsukuba

2Tsukuba Institute for Advanced Research

3Center for Artificial Intelligence Research, Department of Computer Science, University of Tsukuba

Abstract

Vision Language Models (VLMs) are challenging for the field of Deep Learning Interpretability, given their billions of parameters and recurrence-based reasoning processes. Building on the Frame Representation Hypothesis from language models, we introduce a novel interpretability approach for the visual domain that enables systematic extraction and analysis of learned concepts. Our framework reveals inherent biases and vulnerabilities in VLMs caused by a lack of proper safety alignment between their visual and textual components. By combining our interpretability techniques with attack methods, we demonstrate model vulnerabilities, achieving an average 97.7% attack success rate on the Qwen2-VL and Llama 3.2 Vision model families. This combination of interpretability analysis and security testing provides crucial insights for developing safer, more transparent models, offering a path toward more trustworthy visual AI systems.

Frame Representation Extension

Extended the Frame Representation Hypothesis to the visual domain, enabling systematic interpretability of VLM outputs.

Strong Vulnerability Detection

Developed a gradient-free approach combining interpretability with attack methods, achieving a 97.7% average attack success rate (ASR) on state-of-the-art VLMs.
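As a rough picture of what "gradient-free" means here, the sketch below shows a query-only random search over concept-steered prompts. The generate and is_harmful callables, the [steer:...] tag, and the concept list are hypothetical stand-ins for illustration, not the paper's actual interface.

import random

CONCEPTS = ["obedience", "urgency", "authority"]  # illustrative concept words

def attack(generate, is_harmful, base_prompt, image, budget=50):
    """Query-only random search over concept-steered prompts; no gradients used."""
    last = None
    for _ in range(budget):
        concept = random.choice(CONCEPTS)
        prompt = f"[steer:{concept}] {base_prompt}"  # tag consumed by the decoder
        reply = generate(prompt, image)
        if is_harmful(reply):
            return prompt, reply                     # first successful attempt
        last = (prompt, reply)
    return last

# Dummy stand-ins so the sketch runs end to end.
print(attack(lambda p, img: f"echo: {p}",
             lambda t: "urgency" in t,
             "describe this image", image=None))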

Multilingual Dataset

Released a vulnerability analysis dataset, manually verified by native or fluent speakers, covering 8 languages.

How It Works

1. Extract Concepts: extract semantic concepts from VLMs using Frame Representations.
2. Guide Generation: use Concept Guided Decoding (CGD) to steer model outputs (sketched below).
3. Reveal Biases: expose vulnerabilities and biases in model behavior.
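A minimal, self-contained sketch of steps 1 and 2. The toy vocabulary, the random stand-in unembedding matrix, and the additive logit-bias form of steering are illustrative assumptions, not the paper's exact formulation: a concept direction is built by averaging the normalized token vectors of words that express the concept, and decoding is biased toward tokens aligned with that direction.

import torch

torch.manual_seed(0)

vocab = ["<bos>", "cat", "dog", "pet", "car", "road", "runs", "fast"]
d_model = 16
# Stand-in for a model's token (un)embedding matrix, |V| x d.
unembed = torch.randn(len(vocab), d_model)

def concept_vector(words):
    """Average the unit-normalized token vectors of words sharing a concept."""
    idx = [vocab.index(w) for w in words]
    vecs = torch.nn.functional.normalize(unembed[idx], dim=-1)
    return torch.nn.functional.normalize(vecs.mean(dim=0), dim=-1)

animal = concept_vector(["cat", "dog", "pet"])  # hypothetical "animal" concept

def concept_guided_logits(hidden, concept, alpha=2.0):
    """Bias next-token logits toward tokens aligned with the concept direction."""
    logits = hidden @ unembed.T                                        # standard LM head
    align = torch.nn.functional.normalize(unembed, dim=-1) @ concept   # per-token similarity
    return logits + alpha * align                                      # steer the distribution

hidden = torch.randn(d_model)  # stand-in decoder state
probs = torch.softmax(concept_guided_logits(hidden, animal), dim=-1)
print({w: round(p.item(), 3) for w, p in zip(vocab, probs)})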

Examples

Example dialogues demonstrating how different concepts guide the model to produce specific outputs, revealing inherent biases in vision-language models.

Attack Success Rate Comparison

Model | Text Only | CGD | Image Attack | Combined
Qwen2-VL 2B | 26.8% | 81.8% | 97.6% | 99.8%
Qwen2-VL 7B | 19.8% | 85.6% | 91.6% | 100.0%
Qwen2-VL 72B | 23.4% | 94.6% | 96.0% | 99.6%
Llama 3.2 11B | 54.2% | 90.4% | 83.4% | 94.8%
Llama 3.2 90B | 66.4% | 94.0% | 77.4% | 94.4%

Average Combined Attack Success Rate: 97.7%
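The headline number is consistent with the table above, assuming it is the mean of the Combined column:

combined = {"Qwen2-VL 2B": 99.8, "Qwen2-VL 7B": 100.0, "Qwen2-VL 72B": 99.6,
            "Llama 3.2 11B": 94.8, "Llama 3.2 90B": 94.4}
print(f"{sum(combined.values()) / len(combined):.1f}%")  # -> 97.7%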

@inproceedings{valois2025vision,
  title={Vision Language Model Interpretability with Concept Guided Decoding},
  author={Valois, Pedro H. V. and Satav, Dipesh and de Campos, Rodrigo A. P. and 
          Pratamasunu, Gulpi Q. O. and Fukui, Kazuhiro},
  booktitle={2025 IEEE International Conference on Image Processing (ICIP)},
  pages={397--402},
  year={2025},
  organization={IEEE},
  doi={10.1109/ICIP55913.2025.11084299}
}