-
Sparse autoencoders find composed features in small toy models
We study toy models trained on synthetic data consisting of composed feature pairs, and train sparse autoencoders on their activations. We find that, for very small toy models, the sparse autoencoders learn composed features instead of the true underlying features, and we discuss future experiments to test these results.
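To make the setup concrete, here is a minimal sketch of this kind of experiment. The dimensions, the pair structure, and the SAE details are my own illustrative choices, not necessarily the post's: ground-truth features that only ever co-occur in fixed pairs, and a small ReLU autoencoder with an L1 penalty trained on the resulting activations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical data: 6 ground-truth features in R^10 that only ever
# co-occur in 3 fixed pairs, so the "composed" directions f_i + f_j are
# what actually appear in the activations.
d_model, n_feats = 10, 6
feature_dirs = torch.randn(n_feats, d_model)
feature_dirs /= feature_dirs.norm(dim=-1, keepdim=True)
pairs = [(0, 1), (2, 3), (4, 5)]

def sample_batch(batch_size=256):
    idx = torch.randint(len(pairs), (batch_size,)).tolist()
    i = torch.tensor([pairs[k][0] for k in idx])
    j = torch.tensor([pairs[k][1] for k in idx])
    return feature_dirs[i] + feature_dirs[j]

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        return self.dec(h), h

sae = SparseAutoencoder(d_model, d_hidden=12)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(5000):
    x = sample_batch()
    recon, h = sae(x)
    # Reconstruction loss plus an L1 sparsity penalty on the hidden code.
    loss = (recon - x).pow(2).mean() + 1e-2 * h.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Compare learned decoder directions with the ground-truth features: if
# the SAE has learned the composed pairs, each decoder row aligns with a
# sum f_i + f_j rather than with any single underlying feature.
dec = sae.dec.weight.T  # (d_hidden, d_model)
dec = dec / dec.norm(dim=-1, keepdim=True)
print((dec @ feature_dirs.T).round(decimals=2))
```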
-
Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders
I stress-test sparse autoencoders trained on the residual stream of gpt2-small using tokens from OpenWebText and the LAMBADA benchmark. I find good on-distribution performance but poor off-distribution performance, especially in contexts longer than the training context.
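A rough sketch of how such a stress test can be run (my reconstruction of the method, not the post's code): run the model normally, then again with a forward hook that replaces one residual-stream activation with the SAE's reconstruction, and compare the losses. The layer choice and the `IdentitySAE` stub below are placeholders for a real trained autoencoder.

```python
import torch
from transformer_lens import HookedTransformer

class IdentitySAE(torch.nn.Module):
    # Placeholder: a real run would load a trained sparse autoencoder and
    # return its reconstruction of the input activations.
    def forward(self, x):
        return x

model = HookedTransformer.from_pretrained("gpt2")  # gpt2-small
sae = IdentitySAE()

def splice_in_reconstruction(resid, hook):
    # Replace the residual stream at this point with the SAE output.
    return sae(resid)

tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
clean_loss = model(tokens, return_type="loss")
spliced_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[("blocks.6.hook_resid_pre", splice_in_reconstruction)],
)
print(f"clean: {clean_loss.item():.3f}, spliced: {spliced_loss.item():.3f}")
```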
-
How polysemantic can one neuron be? Investigating features in TinyStories
I examine features from a pre-trained sparse autoencoder on an MLP layer of a TinyStories model, both through a statistical lens and by inspecting a few of them by hand.
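For the statistical lens, one natural pair of summary statistics is how often each feature fires and how strongly it activates when it does. A rough sketch (with random stand-in activations; a real analysis would run TinyStories text through the model and the SAE encoder):

```python
import torch

# Stand-in for post-ReLU SAE feature activations over a token corpus,
# shape (n_tokens, n_features); shifted noise gives sparse-ish firing.
feature_acts = torch.relu(torch.randn(10_000, 512) - 2.0)

firing_freq = (feature_acts > 0).float().mean(dim=0)
mean_when_active = feature_acts.sum(0) / (feature_acts > 0).sum(0).clamp(min=1)

# Features that fire on a large fraction of tokens are candidates for
# polysemanticity; rare, high-magnitude features tend to be cleaner.
for idx in firing_freq.topk(5).indices.tolist():
    print(f"feature {idx}: fires on {firing_freq[idx]:.1%} of tokens, "
          f"mean activation when active {mean_when_active[idx]:.2f}")
```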
-
2-digit subtraction: difference prediction
I (mostly) figure out how a 2-digit subtraction transformer determines the difference between two numbers.
-
2-digit subtraction: how does it predict +/-?
I examine a toy transformer trained to perform two-digit subtraction and find that it learns a simple linear classification algorithm to predict whether the output is positive or negative.
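That finding is plausible on its face: the sign of a 2-digit difference is linearly separable from one-hot encodings of the operands, so a linear classifier is all the model needs. A quick illustrative check (mine, not the post's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Every 2-digit problem a - b with a, b in [0, 99], operands one-hot encoded.
a = np.repeat(np.arange(100), 100)
b = np.tile(np.arange(100), 100)
X = np.concatenate([np.eye(100)[a], np.eye(100)[b]], axis=1)
y = (a - b >= 0).astype(int)  # 1 if the result is non-negative

# A linear model solves this near-perfectly: weights proportional to the
# operand values (+a, -b) give a decision boundary at a - b >= 0.
probe = LogisticRegression(max_iter=2000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```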
-
2-digit subtraction: first steps
I train a 1-layer transformer to do 2-digit subtraction and find some interesting patterns in its weights and activations.
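A minimal training setup along these lines, using TransformerLens (the tokenization and hyperparameters are my guesses, not the post's): feed the two operands as a two-token sequence and train the final position to predict the shifted difference.

```python
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer, HookedTransformerConfig

# Tokenization guess: input is [a, b]; the target is a - b shifted by +99
# so every difference in [-99, 99] maps to a token id in [0, 198].
cfg = HookedTransformerConfig(
    n_layers=1, n_heads=4, d_model=128, d_head=32,
    n_ctx=2, d_vocab=199, act_fn="relu",
)
model = HookedTransformer(cfg)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

a = torch.randint(0, 100, (100_000,))
b = torch.randint(0, 100, (100_000,))
tokens = torch.stack([a, b], dim=1)
targets = a - b + 99

for start in range(0, len(tokens), 512):
    batch, tgt = tokens[start:start + 512], targets[start:start + 512]
    logits = model(batch)                      # (batch, 2, d_vocab)
    loss = F.cross_entropy(logits[:, -1], tgt)  # predict at final position
    opt.zero_grad(); loss.backward(); opt.step()
print("final batch loss:", loss.item())
```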