Good results, bad results


Yesterday I left a script running to evaluate multiple interpreters at the same time, but the results were not what I expected: ROSA beat ROSA-GDS on all metrics, and the Center interpreter was among the best on everything except the ‘overall’ metric.

I was checking ‘overall’, ‘minimal size’, ‘representation consistency’, and ‘class consistency’, and the Center interpreter proved very good on all of them, followed by ROSA.

I believe a problem shared by all of these metrics is the domain shift introduced by masking, as discussed in the ROAR paper: since the Center interpreter produces no curved masking, it is very easy for the model to keep the right prediction (beyond the fact that most ImageNet images have their subject in the center anyway). These metrics could thus be reworked with a ‘crop’ approach, but how do you define a crop from a 2D probability distribution?
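To make the masking concern concrete, here is a rough sketch of the kind of saliency-ranked deletion these metrics perform; the function name and step count are illustrative, not taken from any specific benchmark implementation. Pixels are progressively removed in order of importance, which is exactly the operation that pushes the masked images off the training distribution:

```python
import numpy as np

def deletion_path(saliency, steps=10):
    """Rank pixels by saliency and return boolean keep-masks that
    progressively delete the most salient fraction of pixels.

    `saliency` is a 2D importance map; the returned masks can be
    applied as `image * mask[..., None]`, producing exactly the
    kind of irregular holes that cause the domain shift.
    """
    order = np.argsort(saliency.ravel())[::-1]  # most salient first
    n = saliency.size
    masks = []
    for k in range(1, steps + 1):
        keep = np.ones(n, dtype=bool)
        keep[order[: k * n // steps]] = False  # delete top fraction
        masks.append(keep.reshape(saliency.shape))
    return masks
```

A centered blob of saliency produces a clean rectangular-ish hole under this scheme, while a discontinuous map produces scattered speckle, which may explain part of the gap between the Center interpreter and the gradient-based methods.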

For now I might look into how those ‘best crop’ models work, but simply finding an anchor and cropping around it could be enough. Consistency, Size, Retention, Deletion, Insertion, etc. all use the probability map to design a pixel insertion/deletion path, and that could be the main problem. Again, I don’t think a metric that ranks the Center interpreter as the best can be trusted, even if the other methods really are that bad.
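The anchor-and-crop idea could be sketched like this: take the center of mass of the saliency map as the anchor and cut a fixed-size window around it, clamped to the image bounds. Function names and the window size are placeholders, and the sketch assumes a non-negative saliency map with non-zero mass:

```python
import numpy as np

def crop_from_saliency(image, saliency, crop_h=96, crop_w=96):
    """Crop a fixed-size window around the saliency center of mass.

    `image` is (H, W, C); `saliency` is a non-negative (H, W) map
    with non-zero total mass (assumed, not checked).
    """
    p = saliency / saliency.sum()
    ys, xs = np.mgrid[0:saliency.shape[0], 0:saliency.shape[1]]
    cy = int((p * ys).sum())  # center of mass as the anchor
    cx = int((p * xs).sum())
    # Clamp the window so it stays fully inside the image.
    y0 = min(max(cy - crop_h // 2, 0), image.shape[0] - crop_h)
    x0 = min(max(cx - crop_w // 2, 0), image.shape[1] - crop_w)
    return image[y0:y0 + crop_h, x0:x0 + crop_w]
```

Using the argmax instead of the center of mass would be an obvious variant; the center of mass is more robust to a few spurious high-saliency pixels, which matters for the discontinuous maps mentioned below.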

Moreover, this ‘crop’ approach could also help techniques like IntegratedGradients and SHAP, which got terrible scores on all metrics because they produce very discontinuous maps.

ViT Baseline

The ViT baseline got 55% validation accuracy on ImageNet on my first try, so I left it sweeping for a better result, which I think is achievable. I also had to switch it from token halting to PatchDropout, as it kept crashing due to memory problems.
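The memory advantage of PatchDropout over token halting is that the number of kept tokens is fixed up front, so activation shapes are static. A minimal numpy sketch of the idea (the function name and keep ratio are illustrative, not the actual training code):

```python
import numpy as np

def patch_dropout(tokens, keep_ratio=0.5, rng=None):
    """Randomly keep a fixed fraction of patch tokens.

    `tokens` has shape (batch, n_patches, dim). Unlike token
    halting, the kept count is constant across samples and steps,
    so peak memory is predictable.
    """
    rng = rng or np.random.default_rng()
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    out = np.empty((b, n_keep, d), dtype=tokens.dtype)
    for i in range(b):
        idx = rng.choice(n, size=n_keep, replace=False)
        out[i] = tokens[i, np.sort(idx)]  # keep original patch order
    return out
```

In a real ViT this would be applied after patch embedding (the class token, if any, would be kept aside and re-concatenated), and disabled at evaluation time.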

Beyond that, I discovered that SwinV2 works with progressive resizing out of the box, making it quite fast even without patch dropout, but it diverges completely from the first epoch (validation accuracy ~0.1%), so I still have to figure out the cause of this weak performance.
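For reference, the progressive-resizing schedule I have in mind is just a linear ramp of the input resolution over training, rounded to a multiple of the patch/window size so the model's spatial shapes stay valid. This is a hypothetical sketch, not the schedule actually used in the run above:

```python
def resize_schedule(epoch, total_epochs, min_size=128, max_size=256, multiple=32):
    """Linearly grow input resolution from min_size to max_size,
    rounded to a multiple of the patch/window size (assumed to be
    a requirement here; check the specific model's constraints)."""
    frac = epoch / max(1, total_epochs - 1)
    size = min_size + frac * (max_size - min_size)
    return int(round(size / multiple)) * multiple
```

One thing worth checking for the divergence is whether the learning rate and augmentation strength were tuned for the final resolution rather than the small early-epoch inputs.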