
Notes on Instruction Mining: High-Quality Instruction Data Selection for Large Language Models

This is a summary of an important research paper. It was made interactively by a human and several AIs. The goal is to curate good ideas and provide a 10:1 time savings.

Published
4 min read

Link to paper: https://arxiv.org/abs/2307.06290

Paper published on: 2023-07-12

Paper's authors: Yihan Cao, Yanbin Kang, Lichao Sun

GPT3 API Cost: $0.39

GPT4 API Cost: $0.10

Total Cost To Write This: $0.49

Time Savings: 11:1

Our discussion today revolves around a novel approach to evaluate the quality of instruction-following data for language models called INSTRUCTMINING. Think of this as a quality control officer who has a set of criteria to evaluate the quality of products on a production line. Similarly, INSTRUCTMINING uses specific natural language indicators to evaluate the quality of instruction-following data.

Large language models, like a well-oiled machine, typically undergo two stages: pretraining and finetuning. Imagine pretraining as the assembly of the machine and finetuning as the calibration to ensure it works optimally. Despite large-scale pretraining, these models can sometimes fail to understand human instructions, like a machine that's assembled but not calibrated correctly. To improve this, instruction finetuning, akin to fine-tuning the machine, is employed.

The researchers found that these models can be fine-tuned to perform well even with a small amount of high-quality instruction-following data. This is like having a machine that can perform optimally even when calibrated with a few, but high-quality, parameters. However, the challenge lies in selecting these high-quality datasets, as there are no clear guidelines.

The researchers conducted extensive finetuning experiments to investigate the relationship between data quality and natural language indicators. These results were then applied to estimate parameters in INSTRUCTMINING, similar to testing a machine with different parameters to find the optimal settings. The results showed that INSTRUCTMINING can help select relatively high-quality samples from various instruction-following datasets, like a quality control officer selecting the best products off the line.
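As a rough illustration of this parameter estimation step, the rule can be fit as an ordinary least-squares regression of observed evaluation loss on indicator values. Everything below (indicator values, losses, and the candidate dataset) is made-up example data, not figures from the paper:

```python
import numpy as np

# Hypothetical example: each row holds indicator values measured on one
# finetuning dataset (say PPL, MTLD, and reward score), and y holds the
# evaluation loss of a model finetuned on that dataset. All numbers are
# invented for illustration.
X = np.array([
    [3.1, 72.0, 0.8],
    [2.7, 65.0, 1.4],
    [3.5, 80.0, 0.5],
    [2.9, 70.0, 1.1],
    [3.3, 76.0, 0.7],
])
y = np.array([1.42, 1.31, 1.50, 1.36, 1.45])

# Add an intercept column and fit the linear rule by ordinary least squares.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Estimate the expected evaluation loss for a new candidate dataset.
new = np.array([1.0, 3.0, 68.0, 1.2])  # intercept, then indicator values
expected_loss = float(new @ beta)
```

Once fit, the rule scores any candidate dataset from its indicators alone, with no finetuning run required.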

The researchers used a multivariate and univariate evaluation on randomly sampled subdatasets from several candidate datasets of different quality levels. This is similar to testing a machine's performance with different combinations of parameters. The multivariate evaluation estimates the correlation between evaluation loss and a bag of indicators, while the univariate evaluation studies the individual correlation between each indicator and instruction data quality.
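The univariate side of this analysis can be sketched as a per-indicator Pearson correlation with evaluation loss. The numbers below are invented for illustration; in the paper, each row would come from finetuning on one sampled subdataset:

```python
import numpy as np

# Hypothetical indicator measurements across sampled subdatasets, and the
# evaluation loss observed after finetuning on each subdataset.
indicators = {
    "ppl": np.array([3.1, 2.7, 3.5, 2.9, 3.3]),
    "rew": np.array([0.8, 1.4, 0.5, 1.1, 0.7]),
}
loss = np.array([1.42, 1.31, 1.50, 1.36, 1.45])

# Univariate view: correlation of each indicator with loss, in isolation.
for name, values in indicators.items():
    r = np.corrcoef(values, loss)[0, 1]
    print(f"{name}: r = {r:+.2f}")
```

In this toy data the perplexity-like indicator correlates positively with loss and the reward-like indicator negatively, mirroring the sign pattern the paper reports; the multivariate analysis instead fits all indicators jointly, as in the regression sketch above.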

Several datasets were used for training: ALPACA, OPEN-ASSISTANT, STACKEXCHANGE, and WIKIHOW. These datasets can be thought of as different sets of instructions to calibrate the machine, each with different formats, sizes, and distributions.

For evaluation, the study combines test data from different datasets and uses OpenAI's gpt-3.5-turbo to generate five unique outputs for each instruction. This is like combining different sets of parameters and testing the machine's output.

The researchers conducted the instruction tuning on the base model LLAMA-7B, with all finetuning datasets kept to the same size of 2,000 examples each. This is akin to calibrating a specific machine model with the same number of parameters for each test.
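A minimal sketch of this equal-size sampling step, with placeholder strings standing in for the real instruction examples, so that comparisons reflect data quality rather than data quantity:

```python
import random

# Placeholder candidate datasets; in practice these would hold real
# instruction-response examples from sources such as ALPACA or WIKIHOW.
random.seed(0)
datasets = {
    "alpaca": [f"alpaca_example_{i}" for i in range(10_000)],
    "wikihow": [f"wikihow_example_{i}" for i in range(5_000)],
}

# Draw the same number of examples from every candidate dataset.
SAMPLE_SIZE = 2_000
subsets = {
    name: random.sample(data, SAMPLE_SIZE)
    for name, data in datasets.items()
}
```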

The empirical results show that the reward score and the nearest-neighbour score are the most significant indicators of general instruction data quality. This is like finding that the machine's efficiency and precision are the most significant indicators of its performance. The research also found that PPL, MTLD, Nat, and Und correlate positively with the expected evaluation loss, while Rew and Coh correlate negatively with it.

The study concludes that selecting datasets directly according to PPL, MTLD, and Rew is preferable, and that these indicators can exhibit multicollinearity. This is like concluding that calibrating the machine according to efficiency, precision, and speed parameters is preferable, while noting that these parameters can overlap in what they measure.
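One common way to check for such multicollinearity is the variance inflation factor (VIF), computed by regressing each indicator on the others. The paper does not specify this exact diagnostic, and the indicator values below are invented:

```python
import numpy as np

# Hypothetical indicator measurements (columns: PPL-, MTLD-, Rew-like values),
# one row per finetuning dataset. All numbers are invented for illustration.
X = np.array([
    [3.1, 72.0, 0.8],
    [2.7, 65.0, 1.4],
    [3.5, 80.0, 0.5],
    [2.9, 70.0, 1.1],
    [3.3, 76.0, 0.7],
])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.hstack([np.ones((X.shape[0], 1)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
```

A VIF well above roughly 5 to 10 suggests an indicator carries mostly the same signal as the others, which is why the study can prefer a small subset of indicators rather than all of them.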

The research proposes a quality evaluation rule for instruction fine-tuning. This rule estimates the expected evaluation loss using quality indicators from previous works. The experiments conducted show that higher expected evaluation loss corresponds to higher actual loss, indicating lower quality instruction data. This is like proposing a quality control rule that estimates the expected inefficiency of a machine using previous data, where higher expected inefficiency corresponds to higher actual inefficiency, indicating lower quality parameters.
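Putting the rule to use might look like the following: score each candidate dataset's indicators with a fitted linear rule and rank by expected loss, lower being better. The coefficients here are made up, with only their signs matching the correlations described in the text:

```python
# Sketch of the selection step: given a fitted rule mapping indicator values
# to an expected evaluation loss, rank candidate datasets and keep the ones
# with the lowest expected loss. All coefficients and values are illustrative.
def expected_loss(ppl, mtld, rew, coh):
    # Signs follow the reported correlations: PPL and MTLD raise the expected
    # loss, while Rew and Coh lower it. The magnitudes are made up.
    return 1.0 + 0.05 * ppl + 0.002 * mtld - 0.10 * rew - 0.08 * coh

candidates = {
    "set_a": dict(ppl=3.1, mtld=72.0, rew=0.8, coh=0.6),
    "set_b": dict(ppl=2.7, mtld=65.0, rew=1.4, coh=0.9),
    "set_c": dict(ppl=3.5, mtld=80.0, rew=0.5, coh=0.4),
}

# Rank from lowest (best) to highest expected loss.
ranked = sorted(candidates, key=lambda name: expected_loss(**candidates[name]))
```

The dataset with the lowest expected loss is the one the rule estimates to be highest quality, which is exactly the relationship the paper's experiments verify against actual finetuning loss.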

The performance of instruction fine-tuning on LLAMA-7B models is sensitive to very high-quality data. This is like a machine that performs optimally when calibrated with high-quality parameters. The selected datasets performed better than randomly sampled datasets, but the difference was not significant according to gpt-4. This is like the machine performing better when calibrated with selected parameters than with randomly chosen ones, but the difference is not significant according to a specific evaluation metric.

The method also uses gpt-3.5-turbo to generate responses for the golden evaluation set. This is like using a specific tool to evaluate the machine's performance.

The study acknowledges limitations in the research and plans to expand the analysis to larger models and include more evaluation sets and instruction datasets for further analysis. This is like acknowledging limitations in the machine tests and planning to test larger machines and include more evaluation metrics and parameter sets for further analysis.

Finally, the researchers aim to present an instruction quality evaluation method for measuring the quality of instruction datasets. This is like presenting a new method to evaluate the quality of parameters used to calibrate a machine.