Category: Language Model
-
I stress-test sparse autoencoders trained on the residual stream of gpt2-small, using tokens from OpenWebText and the LAMBADA benchmark. I find good on-distribution performance but poor off-distribution performance, especially at positions beyond the training context length.
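A minimal sketch of this kind of stress test, assuming a standard ReLU autoencoder; the hook point (`blocks.6.hook_resid_pre`), expansion factor, and randomly initialized weights below are placeholders for a trained SAE checkpoint:

```python
import torch
from transformer_lens import HookedTransformer

class SparseAutoencoder(torch.nn.Module):
    """Standard ReLU SAE; in practice the weights come from a trained checkpoint."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = torch.nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        feats = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        return feats @ self.W_dec + self.b_dec

model = HookedTransformer.from_pretrained("gpt2")        # gpt2-small
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 32)  # stand-in for a trained SAE

def per_position_mse(text: str, hook: str = "blocks.6.hook_resid_pre") -> torch.Tensor:
    # Reconstruction error at each token position; off-distribution
    # degradation shows up as this curve rising past the SAE's
    # training context length.
    tokens = model.to_tokens(text)
    with torch.no_grad():
        _, cache = model.run_with_cache(tokens)
        acts = cache[hook][0]              # [seq_len, d_model]
        recon = sae(acts)
    return ((recon - acts) ** 2).mean(dim=-1)  # [seq_len]
```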
-
I look at some features in a pre-trained sparse autoencoder trained on an MLP layer of a TinyStories model, examining them both through a statistical lens and individually by hand.
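As a sketch of the statistical lens, one common summary is each feature's firing rate over a large batch of tokens (the activations below are random stand-ins; in practice they come from running the SAE's encoder over cached MLP activations):

```python
import torch

def feature_firing_rates(feature_acts: torch.Tensor) -> torch.Tensor:
    # Fraction of tokens on which each feature is active.
    # feature_acts: [n_tokens, n_features], post-ReLU SAE activations.
    return (feature_acts > 0).float().mean(dim=0)

acts = torch.relu(torch.randn(10_000, 4096))  # stand-in for real feature activations
rates = feature_firing_rates(acts)
# A histogram of log10 firing rates separates dead, rare, and dense features.
log_rates = torch.log10(rates.clamp(min=1e-8))
```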
-
I (mostly) figure out how a two-digit subtraction transformer determines the difference between two numbers.
-
I examine a toy transformer trained to perform two-digit subtraction and find that it learns a simple linear classification algorithm to predict whether the difference is positive or negative.
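A sketch of how one might check such a claim with a linear probe; the activations below are a synthetic linear stand-in for residual-stream activations cached at the position where the model predicts the sign:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pairs = rng.integers(0, 100, size=(5000, 2))           # (a, b) operand pairs
labels = (pairs[:, 0] - pairs[:, 1] >= 0).astype(int)  # sign of a - b (zero lumped with positive)

# Stand-in activations: a fixed linear map of the operands plus noise.
# In practice, cache real activations from the model instead.
acts = pairs @ rng.normal(size=(2, 64)) + 0.1 * rng.normal(size=(5000, 64))

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print(f"probe accuracy: {probe.score(acts, labels):.3f}")
# High probe accuracy is consistent with (though not proof of) a
# linearly represented sign.
```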
-
I train a one-layer transformer to do two-digit subtraction and find some interesting patterns in its weights and activations.
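A sketch of a plausible dataset for this setup; the token format below ("a1 a2 b1 b2 = s d1 d2", with a sign token s) is an assumption, not necessarily the post's exact encoding:

```python
import torch

VOCAB = [str(d) for d in range(10)] + ["=", "+", "-"]
TOK = {t: i for i, t in enumerate(VOCAB)}

def make_example(a: int, b: int) -> torch.Tensor:
    # Encode "a1 a2 b1 b2 = s d1 d2": operand digits, then the
    # sign and digits of the difference.
    d = abs(a - b)
    sign = "+" if a - b >= 0 else "-"
    digits = [a // 10, a % 10, b // 10, b % 10]
    toks = ([TOK[str(x)] for x in digits]
            + [TOK["="], TOK[sign], TOK[str(d // 10)], TOK[str(d % 10)]])
    return torch.tensor(toks)

# All 10,000 ordered pairs of two-digit operands (leading zeros allowed).
data = torch.stack([make_example(a, b) for a in range(100) for b in range(100)])
```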