A new hope!

(I will just ignore all promises and cringe posts from before)

This now becomes my research journal.

This week I mainly worked on my research progress presentation, which I also intend to use in the research seminar class. The topic was the basics of machine learning interpretability, so I used occlusion (actually RISE) as the baseline example, and from there I went on to explain ROSA and ROSA-GDS.

Although I was a bit nervous about the presentation, I received many valuable comments, which I took note of.

Rework the augmentation sets

  1. Sensei commented that building GDS from a set of augmentations of the same image does not make sense the way I am doing it. GDS is supposed to capture the difference among a set of different subspaces, and the fact that the GDS dimensionality was becoming so small is a sign that the augmentation subspaces were too similar.

  2. Lima-san also pointed out that my technique could be used for out-of-distribution interpretability. Given that the ROSA family is class-agnostic, I could just as well input anything and try to analyze what the model sees. I am not sure that counts as an out-of-distribution interpreter exactly, but maybe something of the sort.

These comments reminded me of Maki-sensei’s presentation, in which he talked about semi-supervised learning and its concept of weak and strong augmentations. My idea now is to build GDS from 4 subspaces:

  • weak augmentation: similar to the original image
  • strong augmentation: distant from the original image
  • weak aug on noise: baseline noise
  • strong aug on noise: far from original noise

I still wish to keep using the original image, but I think that building GDS from these 4 sets could lead to a better grasp of the difference and of what the representation vector is actually spanning. In other words, I hypothesize that the important parts of the original image move the representation vector far from all 4 sets, which would justify using GDS and define the “difference” we are looking for more carefully.
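To make the plan concrete for myself, here is a minimal sketch of building a GDS from several feature sets. It assumes the usual construction (sum the projection matrices of the subspaces, then keep the eigenvectors with the smallest non-zero eigenvalues). The names `subspace_basis` and `generalized_difference_subspace`, the subspace dimension, and `n_drop` are placeholders, and random vectors stand in for the features of the four sets.

```python
import numpy as np

def subspace_basis(features, dim):
    """Orthonormal basis (d x dim) for the rows of `features` (n x d).

    Uses an SVD without mean centering, a common convention in subspace methods.
    """
    u, _, _ = np.linalg.svd(features.T, full_matrices=False)
    return u[:, :dim]

def generalized_difference_subspace(bases, n_drop, tol=1e-10):
    """Basis of the GDS of several subspaces, each given as a d x k orthonormal basis.

    n_drop: how many of the largest-eigenvalue (shared) directions to discard.
    """
    d = bases[0].shape[0]
    g = np.zeros((d, d))
    for b in bases:
        g += b @ b.T                       # sum of projection matrices
    eigvals, eigvecs = np.linalg.eigh(g)   # eigenvalues in ascending order
    eigvecs = eigvecs[:, eigvals > tol]    # stay inside the sum subspace
    n_keep = max(eigvecs.shape[1] - n_drop, 0)
    return eigvecs[:, :n_keep]             # smallest non-zero eigenvalues

# Placeholder features: 10 vectors of dimension 16 per set (assumed sizes).
rng = np.random.default_rng(0)
set_names = ["weak_image", "strong_image", "weak_noise", "strong_noise"]
feature_sets = {name: rng.normal(size=(10, 16)) for name in set_names}

bases = [subspace_basis(feats, dim=3) for feats in feature_sets.values()]
gds = generalized_difference_subspace(bases, n_drop=3)

# Fraction of a representation vector's energy that falls inside the GDS.
v = rng.normal(size=16)
gds_energy = np.linalg.norm(gds.T @ v) ** 2 / np.linalg.norm(v) ** 2
```

If the four subspaces are too similar, the eigenvalues left after dropping the shared directions should get close to zero, which would be one way to monitor the problem Sensei pointed out.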


In that sense, I already have some ideas on what could become the weak/strong augmentations:


  • patches / random crop
  • diffusion model (+ control net): photo without x
  • mask and inpaint


  • diffusion model (canny edge control net): photo of x

Compare the methods

  1. Many people asked how to measure the efficacy of each method, and I believe comparing simple occlusion with ROSA is unfair. For metrics such as deletion and insertion, the fact that the interpreter has access to the class makes it hard for class-agnostic methods to compete.

However, after some thinking, maybe the consistency metric is the solution.

What I didn’t notice before is that consistency follows directly from the definition of explanation:

Definition 1: Explanation

Let \( X \in \mathbb{R}^{C \times H \times W} \) be an image and \(f: \mathbb{R}^{C \times H \times W} \rightarrow \mathbb{R}^{n}\) a classification model. If \(M \in \mathbb{Z}_2^{H \times W}\) is a binary mask and \(S = X \odot M\) is a masked subset of the input, with \(|S|\) the number of non-masked entries, then the explanation \(E(f \vert X)\) is the minimal non-trivial subset which maintains the output

\[E(f \vert X) = \underset{S}{\arg\min} \; |S| \quad \text{s.t.} \quad f(S) = f(X), \; |S| > 0\]

Idea: Consistency metric

Given Definition 1, it follows that the best explanation is the one whose output is closest to the original output and which has the minimal number of non-masked entries.

\[C(E) = \dfrac{1}{|f(E) - f(X)| \, |E|}, \qquad E = E(f \vert X)\]

Of course some things have to be adjusted in case the function outputs a class (use an == comparison instead). In terms of the representation vector (for ROSA) or the probability vector (for occlusion), though, I think this might be useful for comparing the two.
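A rough sketch of how I could compute this score, assuming the model is a plain function on arrays and the distance between outputs is the Euclidean norm. The `eps` term is my own addition, since the score diverges when the masked output matches the original exactly. The toy model and the three masks are made up for illustration only.

```python
import numpy as np

def consistency(f, x, mask, eps=1e-8):
    """Score = 1 / (output gap * number of non-masked entries).

    f:    callable mapping a (C, H, W) array to an output vector
    x:    image of shape (C, H, W)
    mask: binary array of shape (H, W); 1 keeps a pixel, 0 removes it
    eps:  assumed stabiliser so an exact output match does not divide by zero
    """
    size = int(mask.sum())
    if size == 0:
        raise ValueError("the explanation must keep at least one entry")
    explanation = x * mask                          # mask broadcasts over channels
    gap = np.linalg.norm(f(explanation) - f(x))     # distance between outputs
    return 1.0 / ((gap + eps) * size)

# Toy model (hypothetical): only the top-left 2x2 patch influences the output.
def toy_model(img):
    return img[:, :2, :2].sum(axis=(1, 2))

rng = np.random.default_rng(0)
image = rng.uniform(0.1, 1.0, size=(3, 4, 4))

tight = np.zeros((4, 4)); tight[:2, :2] = 1         # exactly the relevant patch
full = np.ones((4, 4))                              # trivial "keep everything" mask
wrong = np.zeros((4, 4)); wrong[2:, 2:] = 1         # same size, irrelevant patch

score_tight = consistency(toy_model, image, tight)
score_full = consistency(toy_model, image, full)
score_wrong = consistency(toy_model, image, wrong)
```

In this toy case the tight mask scores highest, the full image comes second because it is larger, and the irrelevant patch scores lowest because it changes the output.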


This is an extra set of ideas that comes from things I have already been thinking about and from Sensei’s recent e-mail on Uchiyama-san’s approach.

I believe there is room for optimization in occlusion techniques, starting with:

Masking from instance segmentation

Just create a masking class that uses Mask R-CNN to generate a limited number of masks from instance segmentation. The model is available in PyTorch (through torchvision), and I could compare against masks generated from bounding boxes as well.
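A minimal sketch of the mask selection step, assuming detections in the layout torchvision's Mask R-CNN returns (`scores` of shape N and soft `masks` of shape N x 1 x H x W). The function names and thresholds are placeholders. `run_mask_rcnn` only shows how the detections could be obtained and is not executed here, since it needs torch, torchvision, and a download of pretrained weights.

```python
import numpy as np

def select_instance_masks(detections, max_masks=8, score_thresh=0.5, mask_thresh=0.5):
    """Return at most `max_masks` boolean (H, W) masks, highest score first."""
    scores = np.asarray(detections["scores"])
    soft = np.asarray(detections["masks"])[:, 0]          # (N, 1, H, W) -> (N, H, W)
    order = np.argsort(-scores)                           # descending score
    order = order[scores[order] >= score_thresh][:max_masks]
    return soft[order] >= mask_thresh                     # binarise soft masks

def boxes_to_masks(boxes, height, width):
    """Rectangular boolean masks from (x1, y1, x2, y2) boxes, for the bbox comparison."""
    boxes = np.asarray(boxes, dtype=float)
    masks = np.zeros((len(boxes), height, width), dtype=bool)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        c0, c1 = int(np.floor(max(x1, 0))), int(np.ceil(min(x2, width)))
        r0, r1 = int(np.floor(max(y1, 0))), int(np.ceil(min(y2, height)))
        masks[i, r0:r1, c0:c1] = True
    return masks

def run_mask_rcnn(image_chw):
    """Detections for one float image in [0, 1] with shape (3, H, W). Not run in this sketch."""
    import torch
    from torchvision.models.detection import maskrcnn_resnet50_fpn
    model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
    with torch.no_grad():
        out = model([torch.as_tensor(image_chw, dtype=torch.float32)])[0]
    return {key: value.cpu().numpy() for key, value in out.items()}

# Fake detections (made up) so the selection logic can be checked without a model.
rng = np.random.default_rng(0)
fake = {
    "scores": np.array([0.9, 0.3, 0.7]),
    "masks": rng.uniform(size=(3, 1, 5, 6)),
}
instance_masks = select_instance_masks(fake)
box_masks = boxes_to_masks([[1, 1, 3, 4]], height=5, width=6)
```

Keeping the selection separate from the model call should make it easy to swap the segmentation masks for box masks in the same occlusion loop.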


It happens that Uchiyama-san’s approximation technique gave good results on video transformers, which led me to wonder whether the same would hold for image interpretability.

I also think that replacing the original gradient with something like SmoothGrad or Integrated Gradients could lead to further improvements.
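For reference, a small sketch of the two gradient variants in their textbook form, assuming the gradient is available as a plain function `grad_fn` (a stand-in for whatever gradient the approximation technique actually uses). The sample counts, noise level, and the quadratic toy function are placeholders.

```python
import numpy as np

def smoothgrad(grad_fn, x, n_samples=32, sigma=0.1, seed=0):
    """Average the gradient over Gaussian-perturbed copies of the input."""
    rng = np.random.default_rng(seed)
    grads = [grad_fn(x + rng.normal(0.0, sigma, size=x.shape)) for _ in range(n_samples)]
    return np.mean(grads, axis=0)

def integrated_gradients(grad_fn, x, baseline=None, n_steps=50):
    """Riemann (midpoint) approximation of the path integral from baseline to x."""
    if baseline is None:
        baseline = np.zeros_like(x)                  # assumed all-zero baseline
    alphas = (np.arange(n_steps) + 0.5) / n_steps    # midpoints in (0, 1)
    grads = [grad_fn(baseline + a * (x - baseline)) for a in alphas]
    return (x - baseline) * np.mean(grads, axis=0)

# Toy function (hypothetical): f(x) = sum(x^2), so the gradient is 2x.
def toy_f(x):
    return float(np.sum(x ** 2))

def toy_grad(x):
    return 2.0 * x

point = np.array([1.0, -2.0, 0.5])
sg = smoothgrad(toy_grad, point)
ig = integrated_gradients(toy_grad, point)

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(ig.sum() - (toy_f(point) - toy_f(np.zeros_like(point))))
```

Either function could be dropped in wherever the plain gradient is used, which would make the comparison a one-line change.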


I should definitely use this for explaining self-supervised methods, but checking the literature on interpreting them is also important. Possibly there is something on NLP or even Stable Diffusion interpretation.

Also, Lima-san suggested doing feature watching during training with these interpretability maps. This is already in my plans, but I still need to finish implementing it and add an option to watch only a sample of the dataset.