Evan Anders — Researcher @ Anthropic. All views my own.

Category: Transformers

MATS Project: Crafting Polysemantic Transformer Benchmarks with Known Circuits

A link to my MATS final report (interpretability research)

August 23, 2024 · Language Model, Mechanistic Interpretability, Polysemanticity, Transformers

How polysemantic can one neuron be? Investigating features in TinyStories.

I look at some features in a pre-trained sparse autoencoder trained on an MLP layer of a TinyStories model. I examine the features through a statistical lens and also inspect a few of them by hand.

January 16, 2024 · Language Model, Mechanistic Interpretability, Polysemanticity, Sparse Autoencoders, Transformers

2 digit subtraction — difference prediction

I (mostly) figure out how a 2-digit subtraction transformer determines the difference between two numbers.

December 22, 2023 · Language Model, Mechanistic Interpretability, Toy Problem, Transformers

2 digit subtraction — How does it predict +/-?

I examine a toy transformer trained to perform two-digit subtraction and find that it learns a simple linear classification algorithm to predict whether the output is positive or negative.

December 15, 2023 · Language Model, Mechanistic Interpretability, Toy Problem, Transformers

2-digit subtraction – first steps

I train a 1-layer transformer to do 2-digit subtraction and find some interesting patterns in weights and activations.

December 6, 2023 · Language Model, Mechanistic Interpretability, Toy Problem, Transformers
