¹Graduate School of Science and Technology, University of Tsukuba
²Tsukuba Institute for Advanced Research
³Center for Artificial Intelligence Research, Department of Computer Science, University of Tsukuba
Vision Language Models (VLMs) pose a challenge for deep learning interpretability given their billions of parameters and autoregressive reasoning processes. Building on the Frame Representation Hypothesis from language models, we introduce a novel interpretability approach for the visual domain, enabling systematic extraction and analysis of learned concepts. Our framework reveals inherent biases and vulnerabilities in VLMs caused by a lack of proper safety alignment between their visual and textual components. By combining our interpretability techniques with attack methods, we demonstrate these vulnerabilities, achieving an average 97.7% attack success rate on the Qwen2-VL and Llama 3.2 Vision models. This combination of interpretability analysis and security testing provides crucial insights for developing safer, more transparent models, offering a path toward more trustworthy visual AI systems.
Extended the Frame Representation Hypothesis to the visual domain, enabling systematic interpretability of VLM outputs.
Developed a gradient-free approach combining interpretability with attack methods, achieving a 97.7% average attack success rate (ASR) on state-of-the-art VLMs.
Released a vulnerability analysis dataset, manually verified by native or fluent speakers across 8 languages.
Extract semantic concepts from VLMs using Frame Representations (see the sketch after this list)
Use Concept Guided Decoding (CGD) to steer model outputs
Expose vulnerabilities and biases in model behavior
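To make the pipeline concrete, here is a minimal, self-contained sketch of the first two steps under simplifying assumptions: a toy vocabulary and a random unembedding matrix stand in for a real VLM, and `word_frame`, `concept_vector`, `cgd_step`, and the bias strength `alpha` are illustrative names and parameters, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a real VLM's tokenizer and unembedding (decoder) matrix.
vocab = ["<bos>", "cat", "dog", "pet", "animal", "car", "road", "drive"]
tok = {w: i for i, w in enumerate(vocab)}
d_model = 16
U = rng.normal(size=(len(vocab), d_model))  # one row per token
U /= np.linalg.norm(U, axis=1, keepdims=True)

def word_frame(word_tokens):
    """Frame of a (possibly multi-token) word: the ordered stack of its
    token vectors, following the Frame Representation Hypothesis."""
    return np.stack([U[tok[t]] for t in word_tokens])

def concept_vector(words):
    """A concept as the centroid of the frames of words that share it
    (token vectors averaged within each frame, then across words)."""
    centroid = np.mean([word_frame(w).mean(axis=0) for w in words], axis=0)
    return centroid / np.linalg.norm(centroid)

def cgd_step(logits, concept, alpha=4.0):
    """One Concept Guided Decoding step: bias next-token logits toward
    tokens whose vectors align with the concept direction. No gradients
    are required, which is what makes the approach gradient-free."""
    return logits + alpha * (U @ concept)

# Build a hypothetical "animal" concept and steer one decoding step.
c_animal = concept_vector([["cat"], ["dog"], ["pet"], ["animal"]])
base_logits = rng.normal(size=len(vocab))  # pretend model output
guided_logits = cgd_step(base_logits, c_animal)
print("unguided pick:", vocab[int(base_logits.argmax())])
print("guided pick:  ", vocab[int(guided_logits.argmax())])
```

Applying the guided step at every decoding position nudges generation toward the chosen concept, which is how steering can surface biases in model behavior or, with a harmful concept, probe the gap in safety alignment between the visual and textual components.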
Examples demonstrating how different concepts guide the model to produce specific outputs, revealing inherent biases in vision-language models.
Attack success rates (ASR) by model and attack configuration:

| Model | Text Only | CGD | Image Attack | Combined |
|---|---|---|---|---|
| Qwen2-VL 2B | 26.8% | 81.8% | 97.6% | 99.8% |
| Qwen2-VL 7B | 19.8% | 85.6% | 91.6% | 100% |
| Qwen2-VL 72B | 23.4% | 94.6% | 96.0% | 99.6% |
| Llama 3.2 11B | 54.2% | 90.4% | 83.4% | 94.8% |
| Llama 3.2 90B | 66.4% | 94.0% | 77.4% | 94.4% |
Average attack success rate (mean of the Combined column): 97.7%
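As a quick arithmetic check, the headline number is the mean of the Combined column:

```python
combined = [99.8, 100.0, 99.6, 94.8, 94.4]  # Combined column, top to bottom
print(f"{sum(combined) / len(combined):.1f}%")  # -> 97.7%
```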
@inproceedings{valois2025vision,
  title={Vision Language Model Interpretability with Concept Guided Decoding},
  author={Valois, Pedro H. V. and Satav, Dipesh and de Campos, Rodrigo A. P. and Pratamasunu, Gulpi Q. O. and Fukui, Kazuhiro},
  booktitle={2025 IEEE International Conference on Image Processing (ICIP)},
  pages={397--402},
  year={2025},
  organization={IEEE},
  doi={10.1109/ICIP55913.2025.11084299}
}