Notes on "Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla"
This is a summary of an important research paper, offering roughly a 43:1 time savings over reading the original. It was crafted by humans working with several AIs. The goal is to save time and curate good ideas.

Link to paper: https://arxiv.org/abs/2307.09458
Paper published on: 2023-07-19
Paper's authors: Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik
GPT3 API Cost: $0.07
GPT4 API Cost: $0.13
Total Cost To Write This: $0.21
Time Savings: 43:1
The ELI5 TLDR:
A recent research paper looked at how big language models work on the inside. The researchers focused on a model called Chinchilla and tested it at several sizes. Using a test called MMLU, which asks multiple-choice questions, they measured how well each size could answer. With a technique called activation patching, they found that a small set of parts inside the model was responsible for picking the right answer. Some of these parts affect the output directly, while others act indirectly through later parts. The researchers also sorted the model's attention patterns into categories: some focus on the correct answer's letter, some focus on one specific letter, and some gather information from the last few words. When they changed the prompts in various ways, the model was able to adapt. The researchers suggest automating this kind of analysis in the future. This research helps us understand how big language models work and can help make better AI models.
The Deeper Dive:
Unraveling the Complexity of AI: A Dive into Circuit Analysis of Large Language Models
In the realm of artificial intelligence, understanding the inner workings of large language models is a complex task. A recent research paper tackles this by applying circuit analysis to a large language model, specifically the 70B Chinchilla model. The paper's main novelty lies in investigating whether circuit analysis scales, using multiple-choice question answering as the test case. It applies existing techniques, logit attribution, attention pattern visualization, and activation patching, to this model.
The research identifies and categorizes a small set of 'output nodes' in the model, focusing on the 'correct letter' category of attention heads and their features. It also investigates the query and key subspaces of the attention heads and their representation of an 'Nth item in an enumeration' feature.
The Chinchilla Model and the MMLU Benchmark
Chinchilla is a family of large language models trained at 1B, 7B, and 70B parameters. The research paper tests these sizes on the Massive Multitask Language Understanding (MMLU) benchmark, a standard multiple-choice question-answering evaluation. Only the 70B model performs well on the standard 5-shot version of MMLU, indicating a higher level of capability in this model.
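To make the 5-shot setup concrete, here is a minimal sketch of how a few-shot MMLU-style prompt can be assembled. The labels, delimiters, and layout below are illustrative assumptions, not the paper's exact prompt template.

```python
# Assemble a k-shot multiple-choice prompt in the MMLU style.
# The formatting (labels "A"-"D", "Answer:" suffix) is an assumed convention.

def format_question(question, choices, answer=None):
    lines = [question]
    for label, choice in zip("ABCD", choices):
        lines.append(f"{label}. {choice}")
    # Worked examples end with the answer; the test question leaves it blank.
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(few_shot_examples, test_question, test_choices):
    parts = [format_question(q, c, a) for q, c, a in few_shot_examples]
    parts.append(format_question(test_question, test_choices))
    return "\n\n".join(parts)

shots = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
prompt = build_prompt(shots, "The capital of France is?",
                      ["Berlin", "Paris", "Rome", "Madrid"])
```

The model is then scored on which label token (A, B, C, or D) it predicts after the final "Answer:".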
Activation Patching and Circuit Analysis
Activation patching is a technique for identifying the circuit nodes that matter for a behavior: an activation from one forward pass is spliced into another, and the change in output measures that node's causal role. In this study, it was used alongside logit attribution and attention pattern visualization to identify the final nodes in the circuit. A set of 45 nodes was found to be causally responsible, through direct effects, for recovering almost all of the model's performance.
Direct and Total Effects of Nodes
The research paper also examines the direct and total effects of individual nodes. The direct effect of a node is its immediate contribution to the output, while the total effect also counts the indirect pathways through which the node influences the output via downstream nodes. Interestingly, the nodes' total effects do not necessarily align with their direct effects.
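The distinction can be made concrete with a toy graph in which node `a` feeds the output both directly and through a downstream node `b`. The numbers are illustrative, not from the paper.

```python
# Node a contributes to the output directly and via downstream node b.

def node_b(a):
    return 3 * a

def output(a_direct, b):
    return a_direct + b

a_clean, a_ablated = 2, 0

# Total effect: ablate a everywhere, so both pathways change.
total = output(a_clean, node_b(a_clean)) - output(a_ablated, node_b(a_ablated))

# Direct effect: ablate only a's direct contribution,
# holding b fixed at its clean value (the indirect path is frozen).
direct = output(a_clean, node_b(a_clean)) - output(a_ablated, node_b(a_clean))
```

Here most of `a`'s influence flows through `b`, so its total effect is much larger than its direct effect, which is exactly the kind of mismatch the paper observes.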
Attention Heads and Their Categories
The attention patterns of the identified heads can be grouped into four categories: 'Correct Letter' heads, 'Uniform' heads, 'Single letter' heads, and 'Amplification' heads. The 'Correct Letter' heads attend primarily to the label of the correct answer, the 'Uniform' heads attend roughly uniformly across the labels, the 'Single letter' heads mostly attend to a single fixed letter, and the 'Amplification' heads aggregate information from the last few tokens into the last token.
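A rough way to see how such categories could be assigned is a heuristic over a head's attention weights on the four label positions. The thresholds and the rule below are illustrative assumptions, and 'Amplification' heads (which attend to the final tokens rather than the labels) are outside this sketch.

```python
# Heuristic classification of an attention pattern over label positions A-D.
# attn is a list of four weights (summing to ~1); correct_idx marks the
# position of the correct answer's label. Thresholds are assumptions.

def classify(attn, correct_idx):
    peak = max(range(len(attn)), key=lambda i: attn[i])
    if max(attn) - min(attn) < 0.1:
        return "uniform"
    if peak == correct_idx and attn[peak] > 0.5:
        return "correct letter"
    return "single letter"

h1 = classify([0.05, 0.8, 0.1, 0.05], correct_idx=1)   # mass on the correct label
h2 = classify([0.25, 0.25, 0.25, 0.25], correct_idx=1) # flat pattern
h3 = classify([0.7, 0.1, 0.1, 0.1], correct_idx=2)     # fixed letter, not the correct one
```

In practice such a rule would be checked across many prompts, since a 'Single letter' head only reveals itself by peaking on the same letter regardless of which answer is correct.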
Correct Letter Heads and Their Features
Two kinds of heads identify the correct answer and its corresponding label: "content gatherer" heads and "correct letter" heads. The correct letter heads use a general "Nth item in an enumeration" feature as well as a more ad hoc feature based on label identity. The query and key spaces of the correct letter heads can be compressed into a 3D subspace without harming their performance, an interesting finding in terms of model compression and efficiency.
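Compressing into a low-dimensional subspace amounts to projecting the query (or key) vectors onto a small orthonormal basis and discarding the rest. The sketch below does this in the spirit of the paper's 3D finding; the basis and vectors are toy stand-ins, and a real analysis would fit the subspace from data (e.g. via SVD).

```python
# Project a vector onto the span of an orthonormal basis (here 3 directions
# inside a 5-dimensional space). Basis and query vector are illustrative.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(v, basis):
    # Sum of (v . u_i) * u_i over an orthonormal basis {u_i}.
    out = [0.0] * len(v)
    for u in basis:
        c = dot(v, u)
        out = [o + c * ui for o, ui in zip(out, u)]
    return out

basis = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
q = [0.4, -0.2, 0.9, 0.3, -0.5]
q_low = project(q, basis)  # components outside the subspace are dropped
```

If attention scores computed from `q_low` match those from `q`, the head's behavior genuinely lives in the low-dimensional subspace, which is what the paper reports for the correct letter heads.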
Testing Hypotheses with Mutated Prompts
The research paper uses mutated prompts to test hypotheses about the semantic meaning of these subspaces. The mutations include changes to delimiters and label types. The researchers analyze how each prompt mutation affects the model's performance and loss, providing insight into the model's robustness and adaptability.
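A prompt mutation of this kind is just a systematic rewrite of the labels and delimiters. The specific substitution below (letters to digits, periods to parentheses) is illustrative, not the paper's exact mutation list.

```python
# Mutate a multiple-choice prompt: swap the label set and the delimiter.
# Label sets and delimiters here are assumed examples of such mutations.

def mutate(prompt, labels=("A", "B", "C", "D"),
           new_labels=("1", "2", "3", "4"),
           delim=".", new_delim=")"):
    for old, new in zip(labels, new_labels):
        # Match "A. " etc. so that other uses of the letters are untouched.
        prompt = prompt.replace(f"{old}{delim} ", f"{new}{new_delim} ")
    return prompt

base = "Q: 2+2=?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"
mutated = mutate(base)
```

Comparing the model's loss on the original and mutated prompts shows whether the correct letter heads track an abstract "Nth item" feature (which survives relabeling) or the literal letter identities (which do not).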
The Future of Circuit Analysis
The research paper concludes with the suggestion of automating the analysis of circuits and their nodes to reduce labor intensity, highlighting the need for more research into improved tools and methods for interpretability. This indicates a promising direction for future work in the field of AI interpretability.
Through this detailed analysis of the Chinchilla model, the research paper provides a deeper understanding of the inner workings of large language models. This knowledge is invaluable for AI developers and researchers, as it can guide the development of more efficient, robust, and interpretable AI models.
