Genomics + LLMs: A Case Study on Adding Variant Annotations to LLMs through RAG and Fine-Tuning
This blog was co-authored by Shuangjia Lu, Research Intern at Microsoft Research and Ph.D. student at Yale University, Department of Genetics.
Abstract:
In this blog, we show how we added genomics domain knowledge to large language models (LLMs) such as GPT-4 and GPT-4o through retrieval-augmented generation (RAG) and fine-tuning, using the Azure OpenAI and Azure AI Search platforms. The specific genomics knowledge we added is variant annotation data, which is crucial for genetic report interpretation and disease diagnosis. Users can now query GPT-4o about specific variants and receive accurate variant annotations and interpretations, supported by the advanced reasoning and summarizing capabilities of LLMs.
Introduction:
What is RAG:
An LLM's knowledge might not be sufficient for the domain you are querying, or you may want your questions answered using your own private data. This is where RAG can help. RAG is an approach that improves the accuracy and relevance of generated answers by searching your own data and retrieving relevant information [1].
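To make the pattern concrete, here is a minimal, self-contained sketch of the RAG loop in Python. The toy keyword retriever and placeholder records are ours for illustration only; the actual pipeline described below uses Azure AI Search as the retriever.

```python
# Minimal RAG sketch: retrieve relevant records, then ground the prompt in them.
# Records and values are placeholders; a real deployment retrieves from a search service.
RECORDS = [
    "chr16:14555693 | rsid=rs0000001 | gene=GENE_A | condition=CONDITION_A",
    "chr7:117559590 | rsid=rs0000002 | gene=GENE_B | condition=CONDITION_B",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank records by naive keyword overlap with the query."""
    tokens = [t.strip("?.,").lower() for t in query.split()]
    scored = [(sum(t in r.lower() for t in tokens), r) for r in RECORDS]
    return [r for score, r in sorted(scored, reverse=True)[:k] if score > 0]

def build_prompt(query: str) -> str:
    """Prepend the retrieved context so the model answers from our data."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("Which gene is affected by the variant at chr16:14555693?"))
```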
What is fine-tuning:
Here we refer to supervised fine-tuning (SFT). SFT continues training from a pre-trained base model, adapting it to specific domains, tones, or formats. By fine-tuning the model on a new dataset, the model weights are adjusted to generate the desired responses [2].
Genetic variant annotation data:
Variant annotation datasets provide detailed information about genetic variants and their biological and clinical implications. These datasets are crucial for interpreting genetic variants and prioritizing candidate disease variants. A variant annotation covers several aspects, including the variant's description and identifiers, the affected gene and protein, molecular consequences, population frequencies, and associated clinical conditions.
Methods:
We incorporated 189 million variant annotations from several datasets into the GPT-4o and GPT-4 models through either RAG or fine-tuning. To assess each model's performance, we evaluated its output accuracy.
Build RAG for GPT-4o using Azure AI Search and Azure OpenAI:
R1) Downloading data and converting to CSV format: We downloaded variant annotation data (in VCF format [3]) from public databases. To ensure compatibility with the search index creation process in the next step, we converted the VCF files to CSV format using bcftools [4] and awk commands. Here is an example illustrating the process, including three variants in VCF format, the conversion step, and the resulting CSV output.
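In the original post this example appeared as an image. As a stand-in, here is a pure-Python sketch of the same flattening step (the post used bcftools and awk); the three sample variants, the ClinVar-style INFO keys GENEINFO and CLNDN, and the output columns are illustrative assumptions:

```python
import csv
import io

# Three illustrative variants in VCF format; all values are placeholders.
SAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
16\t14555693\trs0000001\tC\tT\t.\t.\tGENEINFO=GENE_A:111;CLNDN=CONDITION_A
7\t117559590\trs0000002\tCTT\tC\t.\t.\tGENEINFO=GENE_B:222;CLNDN=CONDITION_B
1\t55051215\trs0000003\tG\tGA\t.\t.\tGENEINFO=GENE_C:333;CLNDN=CONDITION_C
"""

def vcf_to_csv(vcf_text: str, out_path: str) -> None:
    """Flatten VCF records into the CSV columns used by our search index."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["chr:pos", "rsid", "gene", "condition", "content"])
        for line in io.StringIO(vcf_text):
            if line.startswith("#"):  # skip meta lines and the column header
                continue
            chrom, pos, rsid, ref, alt, _, _, info = line.rstrip("\n").split("\t")
            kv = dict(f.split("=", 1) for f in info.split(";") if "=" in f)
            gene = kv.get("GENEINFO", "").split(":")[0]
            condition = kv.get("CLNDN", "")
            content = f"{chrom}:{pos} {ref}>{alt} in {gene}, associated with {condition}"
            writer.writerow([f"{chrom}:{pos}", rsid, gene, condition, content])

vcf_to_csv(SAMPLE_VCF, "variants.csv")
```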
R2) Splitting the CSV files into smaller files, each containing fewer than 4 million characters, to accommodate the limits of the Azure AI Search S1 pricing tier.
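A short sketch of the splitting step, assuming the CSV produced in R1 (the 4-million-character budget reflects the tier limit mentioned above; the chunk naming scheme is ours):

```python
def split_csv(in_path: str, max_chars: int = 4_000_000) -> None:
    """Split a large CSV into chunks under max_chars, repeating the header in each."""
    with open(in_path) as src:
        header = src.readline()
        part = 0
        out = open(f"{in_path}.part{part}.csv", "w")
        out.write(header)
        size = len(header)
        for row in src:
            if size + len(row) > max_chars:  # start a new chunk before overflowing
                out.close()
                part += 1
                out = open(f"{in_path}.part{part}.csv", "w")
                out.write(header)
                size = len(header)
            out.write(row)
            size += len(row)
        out.close()

split_csv("variants.csv")
```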
R3) Transferring the data to Azure Blob Storage to ensure accessibility for the Azure AI Search Service.
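Uploading the chunks with the Azure Blob Storage Python SDK might look like the following; the connection string, container name, and file pattern are placeholders:

```python
from pathlib import Path
from azure.storage.blob import BlobServiceClient

# Placeholder connection string and container name.
service = BlobServiceClient.from_connection_string("<STORAGE_CONNECTION_STRING>")
container = service.get_container_client("variant-annotations")

for chunk in sorted(Path(".").glob("variants.csv.part*.csv")):
    with chunk.open("rb") as data:
        container.upload_blob(name=chunk.name, data=data, overwrite=True)
    print(f"uploaded {chunk.name}")
```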
R4) Creating an Azure AI Search service and defining the data source as Azure Blob Storage: Azure blob indexer – Azure AI Search | Microsoft Learn.
R5) Defining the search index schema based on our data structure (Create an index – Azure AI Search), with five searchable and retrievable fields: chr:pos, rsid, gene, condition, and content.
R6) Loading data from the data source into the search index through the blob indexer: Load an index – Azure AI Search.
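Steps R4-R6 can also be scripted with the azure-search-documents Python SDK. The sketch below uses placeholder endpoints, keys, and resource names; note that Azure AI Search field names cannot contain ':', so we assume the chr:pos column is mapped to a chr_pos field, and we use rsid as the document key purely for illustration:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    IndexingParameters, IndexingParametersConfiguration, SearchIndex,
    SearchIndexer, SearchIndexerDataContainer, SearchIndexerDataSourceConnection,
    SearchableField, SearchFieldDataType,
)

endpoint = "https://<search-service>.search.windows.net"
cred = AzureKeyCredential("<SEARCH_ADMIN_KEY>")

# R5) Index schema: five searchable, retrievable string fields.
# "chr:pos" is renamed chr_pos (':' is not allowed in field names); rsid is
# used as the document key here for illustration only.
fields = [SearchableField(name=name, type=SearchFieldDataType.String, key=(name == "rsid"))
          for name in ["chr_pos", "rsid", "gene", "condition", "content"]]
SearchIndexClient(endpoint, cred).create_index(
    SearchIndex(name="variant-annotations-index", fields=fields))

indexer_client = SearchIndexerClient(endpoint, cred)

# R4) Data source: the blob container that holds the CSV chunks.
indexer_client.create_data_source_connection(SearchIndexerDataSourceConnection(
    name="variant-blob-datasource", type="azureblob",
    connection_string="<STORAGE_CONNECTION_STRING>",
    container=SearchIndexerDataContainer(name="variant-annotations")))

# R6) Indexer: parse the blobs as delimited text and load them into the index.
indexer_client.create_indexer(SearchIndexer(
    name="variant-indexer",
    data_source_name="variant-blob-datasource",
    target_index_name="variant-annotations-index",
    parameters=IndexingParameters(configuration=IndexingParametersConfiguration(
        parsing_mode="delimitedText", first_line_contains_headers=True))))
```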
R7) Adding the search index to GPT-4o in Azure OpenAI through the 'add your own data' feature: Using your data with Azure OpenAI Service – Azure OpenAI | Microsoft Learn.
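Programmatically, 'add your own data' corresponds to attaching the search index as a data source at inference time. A sketch with the openai Python package follows; the deployment name, endpoint, keys, and API version are placeholders:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<AOAI_KEY>",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4o",  # the Azure OpenAI deployment name
    messages=[{"role": "user",
               "content": "What gene is affected by the variant at chr16:14555693?"}],
    extra_body={
        "data_sources": [{  # ground the answer in the variant annotation index
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://<search-service>.search.windows.net",
                "index_name": "variant-annotations-index",
                "authentication": {"type": "api_key", "key": "<SEARCH_QUERY_KEY>"},
            },
        }]
    },
)
print(response.choices[0].message.content)
```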
Fine-tune GPT-4 at Azure OpenAI:
F1) Converting the annotation VCF file to JSONL prompts in the format required for GPT-4 fine-tuning. Here's an example of three input variants, along with the conversion step and the resulting JSONL output.
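As with R1, the original example was shown as an image. Here is a hedged Python equivalent that emits chat-format JSONL, reading the CSV produced in R1; the system prompt, the choice of fields, and the JSON answer format are our assumptions:

```python
import csv
import json

# One chat-format training example per variant: the user supplies a chromosome
# position and the assistant returns the annotation fields.
with open("variants.csv") as src, open("train.jsonl", "w") as out:
    for row in csv.DictReader(src):
        example = {"messages": [
            {"role": "system",
             "content": "You are a genomics assistant that annotates variants."},
            {"role": "user", "content": row["chr:pos"]},
            {"role": "assistant", "content": json.dumps({
                "rsid": row["rsid"], "gene": row["gene"],
                "condition": row["condition"]})},
        ]}
        out.write(json.dumps(example) + "\n")
```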
F2) Uploading 3,000 training variants and 1,000 validation variants to Azure OpenAI.
F3) Fine-tuning the GPT-4 model in Azure OpenAI.
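Steps F2 and F3 with the openai Python package against Azure OpenAI might look like this (file names and API version are placeholders, and 'gpt-4' stands for whichever base model supports fine-tuning in your region):

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<AOAI_KEY>",
    api_version="2024-02-01",
)

# F2) Upload the training and validation files.
train = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid = client.files.create(file=open("valid.jsonl", "rb"), purpose="fine-tune")

# F3) Launch the fine-tuning job on the base model.
job = client.fine_tuning.jobs.create(
    model="gpt-4",
    training_file=train.id,
    validation_file=valid.id,
)
print(job.id, job.status)
```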
Evaluate model performance:
We assessed the performance of our models using output accuracy. For the fine-tuned models, to measure the model's ability to memorize the training information and accurately reproduce the desired output, we randomly selected 100 variants from the training set and counted exact matches between the model outputs and the ground truth. For the RAG models, we likewise randomly selected 100 variants from the input datasets and counted exact matches.
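A sketch of the exact-match computation follows; query_fn stands in for a call to either the RAG-grounded or the fine-tuned deployment, and the per-field JSON answer format is the assumption from F1:

```python
import random
from typing import Callable

def exact_match_accuracy(query_fn: Callable[[str], dict],
                         truth: dict[str, dict], n: int = 100) -> dict[str, float]:
    """Per-field fraction of n sampled variants where the model output exactly
    matches the ground-truth annotation."""
    sample = random.sample(list(truth), n)
    fields = ["rsid", "gene", "condition"]
    hits = {f: 0 for f in fields}
    for chr_pos in sample:
        predicted = query_fn(chr_pos)  # e.g., parsed JSON from the deployment
        for f in fields:
            hits[f] += predicted.get(f) == truth[chr_pos][f]
    return {f: hits[f] / n for f in fields}

# Smoke test with a perfect oracle: every field should score 1.0.
truth = {"16:14555693": {"rsid": "rs0000001", "gene": "GENE_A", "condition": "CONDITION_A"}}
print(exact_match_accuracy(lambda cp: truth[cp], truth, n=1))
```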
Results:
Our initial testing of the base GPT-4o model revealed limited knowledge in genomics: when queried about 100 variants, it returned correct gene information for none of them.
We made a significant improvement by leveraging the power of RAG. Through RAG, we successfully incorporated 189 million variant annotations into the GPT-4o model and achieved 100% accuracy across all annotation fields on our test set of 100 variants. Now, when users query about variants, they not only receive accurate annotation information but also benefit from interpretations supported by GPT-4o's capabilities. The example below demonstrates the improvement in model performance for the same user query once external data is incorporated through RAG. Before using RAG, the model often provided general or incorrect information; after implementing RAG, the model delivers accurate information and can offer additional interpretation.
[Figure: example responses to the same variant query, before RAG vs. after RAG]
Fine-tuning GPT-4 on variant annotation data also improved performance in some annotation fields, although accuracy across the full set of fields remained suboptimal. Initially, we fine-tuned GPT-4 to predict 13 annotation fields (such as ID, gene, disease name, etc.) from chromosome position information (e.g., chr16:14555693) provided by users. Despite testing multiple input formats and repetition strategies, the average output accuracy for each field was still around 0.2. To better understand how to improve fine-tuned model performance, we adjusted our approach and fine-tuned for a single field. Specifically, fine-tuning the model only on the gene field resulted in 95% accuracy. However, when we expanded to predicting multiple fields simultaneously, accuracy decreased significantly. This led us to conclude that the more information we added, and the less frequently each piece of information occurred, the more challenging it was for the model to learn through fine-tuning.
Discussion:
In our exploration, we found that RAG outperforms supervised fine-tuning for adding factual information to LLMs in terms of data volume, accuracy, and cost-effectiveness. Moving forward, to expand the application scenarios and incorporate more genomics and clinical data with complex relationships, we aim to explore embedding strategies and GraphRAG. To further improve the model's performance and usability in genomics, we are considering enhancing the LLM's basic genomics knowledge through unsupervised learning or pretraining. Additionally, we will explore the use of small language models, such as Phi-3, to enable secure use of private data with the model.
As the first model to add genomics information to LLMs, our model paves the way for developing a more comprehensive and helpful genomics AI to support clinical diagnosis and research, and it demonstrates the potential of LLMs in specialized domains.
Disclaimer:
While we strive to ensure the accuracy and reliability of the information generated by this large language model, we cannot guarantee its completeness or correctness. Users should independently verify any information and consult experts in the respective field to obtain accurate and up-to-date advice. We used fully public datasets from the Microsoft Genomics Data Lake [5] and did not use any personally identifiable data.
Resources:
[1] Learning more about RAG: Retrieval augmented generation in Azure AI Studio – Azure AI Studio | Microsoft Learn
[2] Fine-tuning at Azure OpenAI: Customize a model with Azure OpenAI Service – Azure OpenAI | Microsoft Learn
[3] Introduction to the VCF file format: VCF – Variant Call Format – GATK (broadinstitute.org)
[4] Using bcftools to extract VCF information: bcftools(1) (samtools.github.io)
[5] Microsoft Genomics Data Lake: Genomics Data Lake – Azure Open Datasets | Microsoft Learn