Category: Language Model
-
I stress-test sparse autoencoders trained on the residual stream of gpt2-small, using tokens from OpenWebText and the LAMBADA benchmark. I find good on-distribution performance but poor off-distribution performance, especially at positions beyond the training context length.
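A minimal sketch of this kind of stress test, assuming a standard ReLU autoencoder; the hook point (`blocks.6.hook_resid_pre`), expansion factor, and randomly initialized weights below are placeholders for a trained SAE checkpoint:

```python
import torch
from transformer_lens import HookedTransformer

class SparseAutoencoder(torch.nn.Module):
    """Standard ReLU SAE; in practice the weights come from a trained checkpoint."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = torch.nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        feats = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        return feats @ self.W_dec + self.b_dec

model = HookedTransformer.from_pretrained("gpt2")        # gpt2-small
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 32)  # stand-in for a trained SAE

def per_position_mse(text: str, hook: str = "blocks.6.hook_resid_pre") -> torch.Tensor:
    # Reconstruction error at each token position; off-distribution
    # degradation shows up as this curve rising past the SAE's
    # training context length.
    tokens = model.to_tokens(text)
    with torch.no_grad():
        _, cache = model.run_with_cache(tokens)
        acts = cache[hook][0]              # [seq_len, d_model]
        recon = sae(acts)
    return ((recon - acts) ** 2).mean(dim=-1)  # [seq_len]
```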
-
I look at some features in a pre-trained sparse autoencoder trained on an MLP layer of a TinyStories model, examining them both through a statistical lens and individually by hand.
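As a sketch of the statistical lens, one common summary is each feature's firing rate over a large batch of tokens (the activations below are random stand-ins; in practice they come from running the SAE's encoder over cached MLP activations):

```python
import torch

def feature_firing_rates(feature_acts: torch.Tensor) -> torch.Tensor:
    # Fraction of tokens on which each feature is active.
    # feature_acts: [n_tokens, n_features], post-ReLU SAE activations.
    return (feature_acts > 0).float().mean(dim=0)

acts = torch.relu(torch.randn(10_000, 4096))  # stand-in for real feature activations
rates = feature_firing_rates(acts)
# A histogram of log10 firing rates separates dead, rare, and dense features.
log_rates = torch.log10(rates.clamp(min=1e-8))
```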
-
I (mostly) figure out how a two-digit subtraction transformer determines the difference between two numbers.
-
I examine a toy transformer trained to perform two-digit subtraction and find that it learns a simple linear classification algorithm to predict whether the difference is positive or negative.
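A sketch of how one might check such a claim with a linear probe; the activations below are a synthetic linear stand-in for residual-stream activations cached at the position where the model predicts the sign:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pairs = rng.integers(0, 100, size=(5000, 2))           # (a, b) operand pairs
labels = (pairs[:, 0] - pairs[:, 1] >= 0).astype(int)  # sign of a - b (zero lumped with positive)

# Stand-in activations: a fixed linear map of the operands plus noise.
# In practice, cache real activations from the model instead.
acts = pairs @ rng.normal(size=(2, 64)) + 0.1 * rng.normal(size=(5000, 64))

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print(f"probe accuracy: {probe.score(acts, labels):.3f}")
# High probe accuracy is consistent with (though not proof of) a
# linearly represented sign.
```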
-
I train a one-layer transformer to do two-digit subtraction and find some interesting patterns in its weights and activations.
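A sketch of a plausible dataset for this setup; the token format below ("a1 a2 b1 b2 = s d1 d2", with a sign token s) is an assumption, not necessarily the post's exact encoding:

```python
import torch

VOCAB = [str(d) for d in range(10)] + ["=", "+", "-"]
TOK = {t: i for i, t in enumerate(VOCAB)}

def make_example(a: int, b: int) -> torch.Tensor:
    # Encode "a1 a2 b1 b2 = s d1 d2": operand digits, then the
    # sign and digits of the difference.
    d = abs(a - b)
    sign = "+" if a - b >= 0 else "-"
    digits = [a // 10, a % 10, b // 10, b % 10]
    toks = ([TOK[str(x)] for x in digits]
            + [TOK["="], TOK[sign], TOK[str(d // 10)], TOK[str(d % 10)]])
    return torch.tensor(toks)

# All 10,000 ordered pairs of two-digit operands (leading zeros allowed).
data = torch.stack([make_example(a, b) for a in range(100) for b in range(100)])
```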