Evan Anders

Category: Toy Problem

Sparse autoencoders find composed features in small toy models

We study toy models trained on synthetic data consisting of composed feature pairs, and we train sparse autoencoders on the activations of these toy models. We find that, for very small toy models, sparse autoencoders find composed features instead of the true underlying features, and we discuss future experiments to test these results.

March 21, 2024 · Mechanistic Interpretability, Polysemanticity, Sparse Autoencoders, Toy Problem

2 digit subtraction — difference prediction

I (mostly) figure out how a 2-digit subtraction transformer determines the difference between two numbers.

December 22, 2023 · Language Model, Mechanistic Interpretability, Toy Problem, Transformers

2 digit subtraction — How does it predict +/-?

I examine a toy transformer trained to perform two-digit subtraction and find that it learns a simple linear classification algorithm to predict whether the output is positive or negative.

December 15, 2023 · Language Model, Mechanistic Interpretability, Toy Problem, Transformers

2-digit subtraction – first steps

I train a 1-layer transformer to do 2-digit subtraction and find some interesting patterns in weights and activations.

December 6, 2023 · Language Model, Mechanistic Interpretability, Toy Problem, Transformers