Evan Anders — Researcher @ Anthropic. All views my own.

Category: Polysemanticity

MATS Project: Crafting Polysemantic Transformer Benchmarks with Known Circuits

A link to my MATS final report (interpretability research)

August 23, 2024

·

Language Model, Mechanistic Interpretability, Polysemanticity, Transformers

Sparse autoencoders find composed features in small toy models

We study toy models trained on synthetic data consisting of composed feature pairs, and we train sparse autoencoders on the activations of these toy models. We find that, for very small toy models, sparse autoencoders find composed features instead of the true underlying features, and we discuss future experiments that should test these results.

March 21, 2024

·

Mechanistic Interpretability, Polysemanticity, Sparse Autoencoders, Toy Problem

How polysemantic can one neuron be? Investigating features in TinyStories.

I look at some features in a pre-trained sparse autoencoder trained on an MLP layer of a TinyStories model. I examine the features through a statistical lens and also inspect a few of them by hand.

January 16, 2024

·

Language Model, Mechanistic Interpretability, Polysemanticity, Sparse Autoencoders, Transformers
