Notes on DNAGPT: A Generalized Pretrained Tool for Multiple DNA Sequence Analysis Tasks
This is a summary of an important research paper. It was made interactively by a human and several AIs. The goal is to curate good ideas and provide a 10:1 time savings.

Link to paper: https://arxiv.org/abs/2307.05628
Paper published on: 2023-07-11
Paper's authors: Daoan Zhang, Weitong Zhang, Bing He, Jianguo Zhang, Chenchen Qin, Jianhua Yao
GPT3 API Cost: $0.52
GPT4 API Cost: $0.11
Total Cost To Write This: $0.63
Time Savings: 15:1
We begin our exploration with the DNAGPT model, a generalized pre-trained tool for DNA sequence analysis. Think of DNAGPT as a Swiss army knife, pre-trained on over 10 billion base pairs from nine different species, and capable of being fine-tuned for a wide range of DNA sequence analysis tasks. It's not just a simple tool; it's a comprehensive toolbox. It can process and output DNA sequences and numbers simultaneously, and it uses a unique token design that allows users to design prompts according to their task requirements.
Imagine you're a chef in a kitchen, and DNAGPT is your recipe book. It has been evaluated on a variety of dishes (tasks) - from classification and regression to generation. The recipe book uses ingredients (reference genomes) from 9 species for pre-training. It also uses a symbolic language to encode various recipes into sequences, unifying the paradigmatic differences between different recipes (task formats).
Let's delve into the technicalities now. The model was trained using cross-entropy loss for the next token prediction and sequence order prediction tasks, and mean squared error (MSE) loss for the GC ratio prediction task. It's like using different cooking techniques for different types of dishes. The model can recognize genomic signals and regions (GSR) from any species and was evaluated on the recognition of polyadenylation signals (PAS) and translation initiation sites (TIS) of different organisms.
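The multi-task objective above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: function names, the single-sample formulation, and the equal weighting of the three terms are all assumptions.

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy of a softmax distribution against an integer label."""
    logits = logits - logits.max()  # subtract max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]

def pretraining_loss(token_logits, token_target,
                     order_logits, order_target,
                     gc_pred, gc_true):
    """Combined DNAGPT-style pre-training loss (sketch, equal weights assumed)."""
    next_token = cross_entropy(token_logits, token_target)   # next-token prediction
    seq_order = cross_entropy(order_logits, order_target)    # sequence-order prediction
    gc_ratio = (gc_pred - gc_true) ** 2                      # MSE for GC-ratio prediction
    return next_token + seq_order + gc_ratio
```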
The model was also used to generate artificial human genomes (AGs), akin to creating a new dish from scratch. The AGs were evaluated using principal component analysis (PCA), allele frequency (AF) analysis, and linkage disequilibrium (LD) analysis. The results were promising - the model was found to fit the original data distribution more accurately than other models, perform stably with a correlation of 0.99 in allele frequency analysis, and generate slightly weaker LD than real genomes.
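The allele frequency check above can be sketched as follows: compute per-site alternate-allele frequencies for real and generated genotype matrices and correlate them. This is a hedged illustration with toy 0/1 genotype encodings; the paper's actual pipeline on 1000 Genomes data is more involved, and it is there that the ~0.99 correlation is reported.

```python
import numpy as np

def allele_frequencies(genotypes):
    """Fraction of individuals carrying the alternate allele at each site.

    `genotypes` is an (individuals x sites) array with 0/1 entries.
    """
    return genotypes.mean(axis=0)

def af_correlation(real, generated):
    """Pearson correlation between real and generated allele frequencies."""
    return np.corrcoef(allele_frequencies(real),
                       allele_frequencies(generated))[0, 1]
```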
The model consists of 12 layers of transformer blocks based on unidirectional attention, with each layer containing 12 attention heads and a hidden layer size of 768. It's like having a 12-course meal, with each course having 12 dishes and each dish having 768 ingredients. The model was pre-trained for 15 epochs and took approximately one day on 8 Nvidia V100 32GB GPUs.
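The architecture numbers above pin down a GPT-2-small-scale decoder. A configuration sketch (field names are illustrative, and the parameter estimate uses the standard ~12·d² per transformer block, excluding embeddings):

```python
from dataclasses import dataclass

@dataclass
class DNAGPTConfig:
    """Sketch of the architecture described in the paper."""
    n_layers: int = 12   # transformer blocks with unidirectional attention
    n_heads: int = 12    # attention heads per layer
    d_model: int = 768   # hidden size

    @property
    def d_head(self) -> int:
        return self.d_model // self.n_heads  # 768 / 12 = 64 dims per head

    @property
    def approx_params(self) -> int:
        # ~4*d^2 for attention + ~8*d^2 for the MLP per block; embeddings excluded
        return self.n_layers * 12 * self.d_model ** 2
```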
The research involves incorporating numerical heads and regression heads for joint encoding in composite tasks. Pre-trained weights are loaded into the model, and unused weights are discarded. It's like using pre-cooked ingredients and discarding the ones that are not required for the dish.
In genomic signals and regions recognition, sequential and classification heads are used. Metrics used include Accuracy (ACC), F1 score (F1), Matthews Correlation Coefficient (MCC), Precision, and Recall. In mRNA expression levels prediction, sequential and numerical heads are used for input, and a regression head for output. In artificial human genomes generation, only the sequential and classification heads are used.
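Of the metrics listed, MCC is the least familiar; it is the standard formula over binary confusion-matrix counts, ranging from -1 to 1. A quick sketch (not the paper's evaluation code):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventionally defined as 0 when any marginal is empty
    return (tp * tn - fp * fn) / denom
```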
A stop symbol is added at the last position of the input sequence to determine the model's output length. Test sequences that did not generate a stop symbol, or whose stop symbol appeared in the wrong position, are removed in the post-processing step. It's like setting a timer for cooking and discarding dishes that are not cooked properly.
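The filtering rule above can be sketched as a validity check over generated token lists. The stop token `<eos>` and the fixed expected length are assumptions for illustration; the paper defines its own symbols.

```python
STOP = "<eos>"        # assumed stop symbol
EXPECTED_LEN = 8      # assumed output length, including the stop symbol

def is_valid(tokens):
    """Reject sequences with no stop symbol or a misplaced one."""
    return (tokens.count(STOP) == 1
            and tokens[-1] == STOP
            and len(tokens) == EXPECTED_LEN)

def postprocess(batch):
    """Keep only well-terminated generated sequences."""
    return [seq for seq in batch if is_valid(seq)]
```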
The experiment demonstrated that by adjusting the generation temperature, a fixed model can generate more diverse sequences without requiring additional training. It's like adjusting the temperature of the oven to bake a variety of dishes without needing different ovens.
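Temperature-scaled sampling is the standard mechanism behind this: dividing the logits by a temperature T before the softmax flattens the distribution when T > 1 (more diverse samples) and sharpens it when T < 1 (more deterministic samples), all from the same fixed model. A minimal sketch:

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Sample a token index from temperature-scaled logits."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()  # softmax
    return rng.choice(len(probs), p=probs)
```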
Six categories of tokens were used in building up the DNAGPT structure: Discrete tokens, Continuous tokens, True/False tokens, Instruction tokens, Connection tokens, and Reserved tokens. It's like having different types of ingredients for different types of dishes.
For artificial human genomes generation, the 1000 Genomes data was utilized. mRNA expression levels prediction used human protein-coding gene sequences located upstream and downstream of the transcription start site (TSS). It's like using different types of ingredients from different sources.
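Extracting the sequence around a TSS is a simple windowing operation; a sketch, with the window size an assumption rather than the paper's exact choice:

```python
def tss_window(chromosome_seq, tss_pos, upstream=1000, downstream=1000):
    """Slice the bases around a transcription start site (window size assumed)."""
    start = max(0, tss_pos - upstream)  # clamp at the chromosome start
    return chromosome_seq[start : tss_pos + downstream]
```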
Comparisons between DNAGPT and DNABERT on the human PAS dataset showed that DNAGPT can better capture the relative information of GSR in the sequence. Attention maps of DNAGPT-M with TIS input showed that in the deep networks, it can pinpoint the locations of important tokens more precisely. It's like comparing two chefs and finding that one is more precise in their measurements and cooking techniques.
In conclusion, the DNAGPT model is a powerful tool for DNA sequence analysis. It's like a master chef, capable of cooking a variety of dishes, using a variety of techniques, and always delivering a delicious result.




