Notes on Teaching Arithmetic to Small Transformers
This is a summary of an important research paper, crafted by humans working with several AIs, that provides a 46:1 time savings over reading the full paper. The goal is to save time and curate good ideas.

Link to paper: https://arxiv.org/abs/2307.03381
Paper published on: 2023-07-07
Paper's authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos
GPT3 API Cost: $0.10
GPT4 API Cost: $0.17
Total Cost To Write This: $0.27
Time Savings: 46:1
TLDR:
- Large language models can learn arithmetic operations even though these operations are not explicitly taught to them.
- Researchers used chain-of-thought (CoT) training data, which includes intermediate steps, to guide the models through arithmetic problems step by step.
- Changing the way data is formatted can improve the models' accuracy in arithmetic tasks.
- Balanced data sampling, like giving the models a balanced diet of different types of arithmetic problems, improves accuracy.
- Pretraining can help models perform basic arithmetic tasks, but non-standard formatting can degrade performance.
- Fine-tuning, which hones a model's skills on a specific task, works best when the pretrained model and fine-tuning format are consistent.
- Larger models like GPT-2 and GPT-3 are more effective in learning arithmetic tasks than smaller models.
- Different arithmetic operations like subtraction and multiplication present unique challenges.
- Mixing arithmetic and text data during training can improve the models' performance in arithmetic tasks.
- Models trained on arithmetic operations struggle to add numbers longer than those they were trained on.
DEEPER DIVE:
Understanding Arithmetic Learning in Large Language Models
Let's dive into a fascinating aspect of large language models, specifically their ability to learn arithmetic operations. This capability is not explicitly encoded in the training objective, yet it emerges, illustrating the power of these models. The research we're discussing today focuses on GPT-like models, namely NanoGPT, GPT-2, and GPT-3, and their proficiency in arithmetic tasks such as addition, subtraction, and multiplication.
To put it in layman's terms, imagine trying to teach a child arithmetic. You wouldn't just throw a bunch of equations at them and expect them to figure it out. You would guide them, showing them step by step how to solve the problem. This is similar to what the researchers have done with these models, using a method they call the chain-of-thought (CoT) data, which includes intermediate step results. This technique significantly improves the model's learning in terms of sample complexity and accuracy.
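As an illustration, a CoT-style training example for addition can spell out each digit-by-digit step, including carries. The template below is a hypothetical rendering for this summary; the paper's actual scratchpad format uses its own wording and tokens.

```python
def detailed_scratchpad(a: int, b: int) -> str:
    """Render an addition problem as a chain-of-thought style scratchpad.

    Hypothetical template for illustration; the paper's exact
    scratchpad wording and tokens differ.
    """
    da, db = str(a)[::-1], str(b)[::-1]  # least significant digit first
    lines = [f"Input: {a}+{b}"]
    carry, out_digits = 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        lines.append(f"Step {i}: {x}+{y}+{carry}={s}, write {s % 10}, carry {s // 10}")
        out_digits.append(str(s % 10))
        carry = s // 10
    if carry:
        out_digits.append(str(carry))
    lines.append(f"Answer: {''.join(reversed(out_digits))}")
    return "\n".join(lines)

print(detailed_scratchpad(128, 367))
```

Training on strings like this exposes the intermediate results directly, which is what drives the sample-complexity and accuracy gains the authors report.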
Importance of Data Formatting
An interesting finding from this research is that conventional training data for arithmetic learning is not the most effective. Simple formatting changes can significantly improve accuracy. In the addition task, the researchers used different data formatting methods, including plain, reverse, simplified scratchpad, and detailed scratchpad formats.
The reverse format, which flips the digit order of the output so the model generates the least significant digit first, leads to substantial performance improvements and better sample efficiency compared to the plain format. Generating digits in the same direction that carries propagate means each output digit depends only on what the model has already produced. This finding is crucial, as it suggests that the way we present data to these models can greatly impact their learning efficiency.
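To make the two simplest formats concrete, here is a minimal sketch of plain versus reverse formatting for addition examples. The `+`/`=` delimiters are assumptions for illustration; the paper's exact templates may differ.

```python
def format_addition(a: int, b: int, style: str = "plain") -> str:
    """Format one addition example as a training string.

    The '+'/'=' delimiters are illustrative assumptions, not
    necessarily the paper's exact templates.
    """
    result = a + b
    if style == "plain":
        return f"{a}+{b}={result}"
    if style == "reverse":
        # Emit the answer least-significant digit first, matching the
        # direction in which carries propagate during addition.
        return f"{a}+{b}={str(result)[::-1]}"
    raise ValueError(f"unknown style: {style}")

print(format_addition(128, 367, "plain"))    # 128+367=495
print(format_addition(128, 367, "reverse"))  # 128+367=594
```

The only difference is the digit order of the answer, yet the paper finds it changes how easily the model learns the task.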
Structured Data Sampling
To balance the dataset, the researchers used structured data sampling. This method assigns higher weights to lower-digit numbers and ensures an equal distribution of examples with different numbers of carry operations. Think of it as ensuring that the model gets a balanced diet of different types of arithmetic problems. The experiments showed that balanced data sampling improves the accuracy of the model on addition tasks compared to random sampling.
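The balancing idea can be sketched with rejection sampling over the number of carry operations. This is an assumption-laden simplification: the paper's actual procedure also weights by digit length and uses its own sampling scheme.

```python
import random
from collections import defaultdict

def count_carries(a: int, b: int) -> int:
    """Count the carry operations when adding a and b digit by digit."""
    carries, carry = 0, 0
    while a or b or carry:
        carry = 1 if (a % 10 + b % 10 + carry) >= 10 else 0
        carries += carry
        a, b = a // 10, b // 10
    return carries

def balanced_sample(per_bucket: int, max_digits: int = 3, seed: int = 0):
    """Draw addition pairs so each carry count 0..max_digits is equally
    represented. Rejection sampling; a simplified stand-in for the
    paper's structured sampling."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    hi = 10 ** max_digits - 1
    while any(len(buckets[c]) < per_bucket for c in range(max_digits + 1)):
        a, b = rng.randint(0, hi), rng.randint(0, hi)
        c = count_carries(a, b)
        if len(buckets[c]) < per_bucket:
            buckets[c].append((a, b))
    return buckets
```

Under uniform random sampling, zero-carry and one-carry problems dominate; forcing equal bucket sizes ensures the model sees enough of the rarer, harder carry-heavy cases.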
The Role of Pretraining and Fine-Tuning
Pretraining can facilitate reasonable performance on basic arithmetic tasks. However, non-standard formatting can interfere with performance. The researchers extended their work to GPT-2 and GPT-3 models, investigating teaching arithmetic from scratch as well as fine-tuning using pretrained models.
The process of fine-tuning is akin to taking a model that already has a basic understanding of a task (like arithmetic) and then honing its skills on that task. The research shows that fine-tuning yields the best performance when the pretrained model and the fine-tuning format are consistent.
Limitations of Small Models and the Power of Larger Models
The research highlights the limitations of small models in learning addition tasks and suggests that larger models like GPT-2 and GPT-3 may be more effective. This is similar to how a more powerful computer can handle more complex tasks than a less powerful one. However, the researchers also found that model scale helps but is not strictly necessary: smaller models can still learn these tasks given the right data and formatting.
Dealing with Complexity: From Addition to Multiplication and Beyond
The research expands to include other arithmetic operations such as subtraction, multiplication, sine, and square root. Each operation presents its unique challenges and intricacies. For instance, the detailed scratchpad format significantly improves performance for subtraction and multiplication tasks, but the reverse format is not particularly effective for multiplication.
The Role of Text Data
Interestingly, the interplay between arithmetic and text data during training can improve the performance of arithmetic tasks. This finding suggests that mixing different types of data during training can enhance the model's ability to perform specific tasks.
The Challenge of Length Generalization
One of the key challenges that the researchers encountered was length generalization. Models trained on arithmetic operations have difficulty with length generalization beyond the trained digit lengths. This means that if a model is trained on adding two 2-digit numbers, it might struggle to add two 3-digit numbers.
Concluding Thoughts
This research contributes to our understanding of how transformers acquire arithmetic operations and highlights the importance of high-quality, instructive data for the emergence of arithmetic capabilities in transformers. It also underscores the potential of fine-tuning and the role of data-centric AI in improving model performance.
While the research has its limitations, such as the difficulty of length generalization and the uncertainty of generalizing the findings to larger language models used in practice, it provides valuable insights that can be leveraged to improve the arithmetic learning capabilities of large language models.
In conclusion, the way we present data to these models, the balance of the dataset, and the use of pretraining and fine-tuning can all significantly impact the model's ability to learn arithmetic operations. This research opens up new possibilities for improving the performance of large language models and provides valuable insights for those looking to leverage these models in their work.




