Notes on PolyLM: An Open Source Polyglot Large Language Model
These notes summarize an important research paper. They were made interactively by a human and several AIs, with the goal of curating the good ideas and providing at least a 10:1 time savings.

Link to paper: https://arxiv.org/abs/2307.06018
Paper published on: 2023-07-12
Paper's authors: Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, Jun Xie
GPT3 API Cost: $0.064944
GPT4 API Cost: $0.134340
Total Cost To Write This: $0.199284
Time Savings: 36:1
Understanding and developing language models that can proficiently handle multiple languages is a complex task. The research we're discussing today combines different ingredients (languages), like a culinary artist crafting a fusion dish, to create a novel multilingual large language model (LLM) named POLYLM. The distinctive aspects here are the integration of bilingual data and the use of a curriculum learning strategy during training. The model is also fine-tuned using a multilingual self-instruct method, which is akin to adding a secret sauce to the dish to enhance its flavor. The researchers additionally created a benchmark to assess the model's performance on various multilingual tasks.
Let's delve into the specifics. The POLYLM is a multilingual LLM trained on a colossal corpus of 640 billion tokens. It's available in two sizes, 1.7B and 13B (akin to small and large portions of our fusion dish). The pre-training dataset for POLYLM is a mix of various languages, with English data accounting for about 68%. The dataset is prepared through a comprehensive data pre-processing pipeline that includes language identification, rule-based filtering, ML-based quality filtering, and deduplication.
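To make that pipeline concrete, here is a minimal sketch of the four stages (language identification, rule-based filtering, quality filtering, deduplication). Everything in it is illustrative: the function names, thresholds, and the toy CJK-range language check are assumptions, not the paper's actual rules, and a real pipeline would use a trained language identifier and near-duplicate detection.

```python
import re

def detect_language(text):
    # Stand-in for a real language identifier; this toy rule flags texts
    # containing CJK characters as "zh" and everything else as "en".
    return "zh" if re.search(r"[\u4e00-\u9fff]", text) else "en"

def rule_based_filter(text, min_words=5):
    # Illustrative rule: drop very short documents.
    return len(text.split()) >= min_words

def deduplicate(docs):
    # Exact-match deduplication; the paper's pipeline is more sophisticated
    # (e.g. it also targets near-duplicates).
    seen, unique = set(), []
    for d in docs:
        key = d.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique

def preprocess(docs, keep_langs=("en", "zh")):
    docs = [d for d in docs if rule_based_filter(d)]
    docs = [d for d in docs if detect_language(d) in keep_langs]
    return deduplicate(docs)
```

A quality-filtering stage (ML-based scoring of documents) would slot in between the language filter and deduplication; it is omitted here because it requires a trained classifier.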
Now, let's talk about the architecture of POLYLM. It's based on a decoder-only autoregressive Transformer with Pre-LN and GeLU activation. It uses Byte-Pair Encoding (BPE) with a vocabulary of 256K token entries. The models are trained using a 2048 token context window and the Adam optimizer with warm-up and cosine decay learning rate schedules. The training process also includes weight decay and gradient clipping.
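The warm-up-plus-cosine-decay schedule mentioned above can be sketched in a few lines. The shape (linear warm-up, then cosine decay to a floor) is standard; the specific hyperparameter values in the example are illustrative, not the paper's settings.

```python
import math

def lr_schedule(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from ~0 up to max_lr over the warm-up period.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The cosine term starts at 1 (so the rate begins at max_lr right after warm-up) and ends at -1 (so the rate lands exactly on min_lr at the final step).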
The researchers used the Megatron-LM framework and trained the models on a cluster of 32 A100 GPU servers, applying tensor model parallelism within each node to train the 13B-parameter model over the 640B-token dataset. The full run took approximately two months, partly because of unforeseen spikes and deviations in the loss: an overly large learning rate caused gradient explosions, which they addressed by lowering the learning rate. They also used mixed-precision training in the bfloat16 numerical format to reduce memory usage and improve training efficiency.
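Gradient clipping, one of the stabilization measures mentioned above, is simple enough to sketch directly. This is the standard global-norm formulation (as in Megatron-LM and similar frameworks), written in plain Python over nested lists rather than tensors; the epsilon and the list-of-lists layout are illustrative choices.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Global-norm clipping: if the combined L2 norm of all gradients exceeds
    max_norm, scale every gradient down by the same factor."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)  # epsilon avoids division by zero
        grads = [[g * scale for g in grad] for grad in grads]
    return grads, total_norm
```

Because every gradient is scaled by the same factor, the update direction is preserved; only its magnitude is capped, which is what damps the gradient-explosion spikes described above.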
The researchers divided the training process into two stages, using a curriculum learning approach. The first stage was like the initial cooking process, where all the ingredients were mixed together. The second stage was more like refining the dish, focusing on a subset of the pre-training dataset with higher quality and a greater proportion of multilingual content. This is where the bilingual data was integrated into the training.
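One way to picture the two-stage curriculum is as a change in per-language sampling weights: stage 1 samples roughly in proportion to raw corpus size, while stage 2 up-weights non-English data to raise the multilingual share. The sketch below is a hypothetical illustration; the boost factor and the idea of a single scalar multiplier are assumptions, not the paper's exact recipe (which also restricts stage 2 to higher-quality data).

```python
def mixing_weights(sizes, stage, multilingual_boost=2.0):
    """Per-language sampling probabilities for a two-stage curriculum.
    `sizes` maps language codes to relative corpus sizes."""
    weights = {}
    for lang, size in sizes.items():
        w = size
        if stage == 2 and lang != "en":
            # Stage 2: up-weight non-English data (boost factor is illustrative).
            w *= multilingual_boost
        weights[lang] = w
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}
```

With the ~68% English mix from the pre-training data, stage 2 under this sketch shrinks the English share while keeping every language represented.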
For fine-tuning, the researchers created a multilingual self-instruction dataset called MULTIALPACA. The construction of the MULTIALPACA dataset involved iterative progress, including prompt construction, response collection, format checking, similarity checking, and task pool updating.
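The iterative loop can be sketched as follows. The structure (prompt construction, response collection, format check, similarity check, task pool update) follows the description above, but the helper names, the sentence-ending format rule, the `difflib` similarity measure, and the 0.7 threshold are all illustrative assumptions; `generate` is a stand-in for an actual LLM call.

```python
import difflib

def too_similar(candidate, task_pool, threshold=0.7):
    # Similarity check: reject candidates that closely match an existing task.
    # difflib is a cheap stand-in for whatever similarity metric is used in practice.
    return any(
        difflib.SequenceMatcher(None, candidate, t).ratio() > threshold
        for t in task_pool
    )

def self_instruct_round(seed_tasks, generate, task_pool):
    """One iteration of a MULTIALPACA-style self-instruct loop."""
    for seed in seed_tasks:
        # Prompt construction + response collection via the model.
        candidate = generate(f"Write a new instruction similar to: {seed}")
        # Format check (toy rule): require a complete sentence.
        if not candidate or not candidate.strip().endswith((".", "?", "!")):
            continue
        if too_similar(candidate, task_pool):
            continue
        # Task pool update: accepted tasks seed the next iteration.
        task_pool.append(candidate)
    return task_pool
```

Running the loop repeatedly grows the pool while the similarity check keeps it from filling up with near-duplicates.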
To test the model's performance, the researchers constructed a multilingual benchmark that included tasks in natural language understanding, knowledge, natural language generation, and machine translation. The results showed that the 13B model of POLYLM outperformed other open-source models on multilingual tasks while maintaining comparable performance in English.
Across this benchmark, the larger 13B model performed significantly better on multilingual tasks than its 1.7B counterpart. The researchers also acknowledged a potential bias towards prominent languages and stressed the importance of supporting low-resource languages.
The researchers also discussed potential deficiencies of POLYLM, such as hallucination and toxicity, and the ethical concerns these raise. They recommend that the POLYLM and MULTIALPACA materials be used only for research purposes, and they encourage users to flag any deficiencies they find in the content.
In conclusion, this research presents a novel approach to improving the multilingual capabilities of LLMs. By integrating bilingual data and adopting a curriculum learning strategy, along with the creation of a multilingual self-instruction dataset and benchmark, the researchers have made significant strides in the field of multilingual LLMs. However, the potential biases and ethical concerns raised also highlight the need for further research and improvements in this area.




