Category: Toy Problem
-
We study toy models trained on synthetic data consisting of composed feature pairs, and train sparse autoencoders on the activations of these models. We find that, for very small toy models, the sparse autoencoders recover composed features rather than the true underlying features, and we discuss future experiments to test these results.
-
I (mostly) figure out how a 2-digit subtraction transformer determines the difference between two numbers.
-
I examine a toy transformer trained to perform two-digit subtraction and find that it learns a simple linear classification algorithm to predict whether the output is positive or negative.
-
I train a 1-layer transformer to do 2-digit subtraction and find some interesting patterns in its weights and activations.