Update: Cost-effective genomics analysis with Sentieon on Azure
This Blog was Co-Authored by Don Freed – Sr. Bioinformatics Scientist, Brendan Gallagher – Head of Business Development at Sentieon, Inc.
In our previous blog, we discussed benchmarking the performance of Sentieon’s DNAseq and DNAscope pipelines on Azure instances using v202112.05 of the software. Since the publication of those results, there have been significant updates to the Sentieon software. As a result, we have updated the benchmarking to use Sentieon version 202308.01. We break down the runtime and cost of the pipelines on a wide range of currently available instances. These benchmarks use publicly available datasets, and the pipeline is available on GitHub.
Additionally, we have worked with Sentieon to develop a Terraform template for deployment of the license server.
Running Sentieon on Azure
The pipelines and scripts needed for setup used in this benchmarking are provided on GitHub.
Instance Setup
The script at misc/instance_setup.sh performs initial setup of the instance and download/installation of software packages used in the benchmark.
Input datasets
As in the previous benchmarks, we use the GIAB HG002 sample sequenced on multiple sequencing platforms. Input datasets for the benchmark are recorded in config/config.yaml, with the exception of the Element dataset, which you will have to download on your own.
We recommend downloading all the files and placing them in Azure Blob Storage. You can use AzCopy to transfer the required files to your own Storage account using a shared access signature with “Write” access. We then recommend updating the configs to use a shared access signature for each file. The pipeline will automatically download input files.
Input FASTQ files were obtained as previously outlined. We have added the new ONT dataset below:
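The upload step can be sketched as follows. This is a minimal illustration, not part of the benchmark repository: the storage account, container, file name, and SAS token are placeholders, and the helper function names are hypothetical. It only builds and prints an `azcopy copy <source> <destination>` command for a local file rather than running it.

```python
import shlex


def blob_url_with_sas(account, container, blob_name, sas_token):
    """Build a Blob Storage URL that carries a shared access signature."""
    token = sas_token.lstrip("?")  # accept tokens with or without a leading '?'
    return f"https://{account}.blob.core.windows.net/{container}/{blob_name}?{token}"


def azcopy_copy_command(source, destination_url):
    """Return the argument list for `azcopy copy <source> <destination>`."""
    return ["azcopy", "copy", source, destination_url]


if __name__ == "__main__":
    # All values below are placeholders (assumptions), not real resources.
    destination = blob_url_with_sas(
        account="mystorageaccount",
        container="benchmark-inputs",
        blob_name="HG002.fastq.gz",
        sas_token="<SAS-token-with-write-access>",
    )
    command = azcopy_copy_command("HG002.fastq.gz", destination)
    # Print the command for review instead of executing it.
    print(shlex.join(command))
```

The same URL form (blob URL plus SAS token) is what would then go into the configs so that the pipeline can download each input file.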
ONT HPRC
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE–UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_1_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE–UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_2_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE–UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_3_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE–UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_4_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE–UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_5_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE–UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_6_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE–UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_7_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE–UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_8_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE–UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_9_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE–UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_10_Dordo_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE–UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_11_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
The input files vary in their coverage, so the datasets with FASTQ input were down-sampled to approximately 93 billion bases (~30x coverage) prior to processing with the Sentieon secondary analysis pipelines. The Ultima CRAM file was not down-sampled and is at 40x coverage as recommended by Ultima Genomics. The ONT duplex sample was not down-sampled and is at approximately 30x coverage.
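The relationship between total bases and coverage can be sketched as below. This is an illustration only: the genome size of 3.1 billion bases is an assumed round figure, the 124 billion base input is a made-up example, and the function names are hypothetical rather than taken from the benchmark workflow.

```python
# Assumed approximate human genome size in bases (an assumption, not from the post).
GENOME_SIZE_BASES = 3.1e9

# Down-sampling target stated in the text: approximately 93 billion bases.
TARGET_BASES = 93e9


def coverage(total_bases, genome_size=GENOME_SIZE_BASES):
    """Average depth of coverage implied by a number of sequenced bases."""
    return total_bases / genome_size


def downsample_fraction(input_bases, target_bases=TARGET_BASES):
    """Fraction of reads to keep so the dataset has about target_bases bases."""
    return min(1.0, target_bases / input_bases)


if __name__ == "__main__":
    print(f"Target coverage: ~{coverage(TARGET_BASES):.0f}x")
    # Hypothetical input with 124 billion bases (about 40x under the assumed genome size).
    print(f"Keep fraction: {downsample_fraction(124e9):.2f}")
```

Datasets that are already at or below the target keep all of their reads, because the fraction is capped at 1.0.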
The data were processed using the hg38 reference genome. The reference genome at https://giab.s3.amazonaws.com/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz was used for files with input in the FASTQ format. The reference genome at https://broad-references.s3.amazonaws.com/hg38/v0/Homo_sapiens_assembly38.fasta was used with the Ultima data in CRAM format, as this dataset was already aligned to this reference genome.
Running benchmarks on Azure
The script at misc/run_benchmarks.sh was used to run the benchmarks. It orchestrates the localization of the input datasets, references, and model files, and the execution of the Snakemake workflows on the machine. The workflow down-samples the input data so that the inputs to the Sentieon analysis workflows are consistent, and calculates variant calling accuracy against the Genome in a Bottle (GIAB) v4.2.1 truth set. For the ARM benchmarking we did not run the ONT and PacBio data, as minimap2 is not supported by Sentieon on that architecture in version 202308.01. Support for minimap2 on ARM was added in version 202308.03 of the Sentieon software.
Improved Benchmarking with HBv3
To test the improvement of the software, we retested on the HBv3 series of machines that we previously recommended. These machines are optimized for applications that are driven by memory bandwidth, such as fluid dynamics, finite element analysis, and reservoir simulation, and are a good fit for Sentieon’s analysis pipelines. Figure 1 presents the runtime and Spot compute cost of running Sentieon’s analysis pipelines for germline variant calling across multiple sequencing technologies on a Standard_HB120rs_v3 instance in US East at the time of publication.
Figure 1: Runtime and Spot compute cost of Sentieon DNAseq and DNAscope pipelines on Standard_HB120rs_v3.
Using the Standard_HB120rs_v3, we analyzed 30x Illumina NovaSeq and HiSeqX samples from FASTQ to VCF using the DNAseq and DNAscope pipelines. The DNAseq pipeline took around 28 minutes with a Spot cost of $0.17. Sentieon’s DNAscope pipeline has been sped up and is about 10 minutes faster, at around 18 minutes with a Spot cost of $0.11, about 6 cents less (see Table 1).
The Ultima UG100 dataset is already aligned to the reference genome, so the pipeline performed variant calling without alignment. The DNAscope pipeline finished in 18 minutes for a Spot cost of $0.11.
Sentieon’s DNAscope LongRead pipeline for PacBio HiFi data is more computationally intensive, as it includes multiple passes of variant calling along with read-backed phasing. The DNAscope LongRead pipeline finished in 41 minutes with a Spot cost of $0.25. We added ONT data in this round of tests. As with the PacBio data, the ONT pipeline is more computationally involved: the DNAscope LongRead pipeline finished in 88 minutes with a Spot cost of $0.53 on the ONT long reads.
The Element Biosciences AVITI system is supported by a customized Sentieon DNAscope pipeline. Sentieon’s DNAscope pipeline for Element Biosciences finished in 21 minutes with a Spot cost of $0.12.
All run times and costs can be found in Table 1.
| Sample | Pipeline | Alignment (min) | Preprocessing (min) | Variant Calling (min) | Total Runtime (min) | On Demand ($)¹ | Spot ($)¹ |
|---|---|---|---|---|---|---|---|
| Element Aviti | DNAscope | 11.05 | 2.30 | 7.39 | 20.74 | 1.24 | 0.12 |
| Illumina HiSeq X | DNAseq | 21.09 | 2.97 | 4.11 | 28.18 | 1.69 | 0.17 |
| Illumina HiSeq X | DNAscope | 9.47 | 1.40 | 7.71 | 18.57 | 1.11 | 0.11 |
| Illumina NovaSeq | DNAseq | 21.53 | 2.63 | 4.43 | 28.59 | 1.72 | 0.17 |
| Illumina NovaSeq | DNAscope | 9.74 | 1.39 | 7.78 | 18.92 | 1.14 | 0.11 |
| ONT Duplex | DNAscope | 32.91 | N/A | 55.37 | 88.28 | 5.30 | 0.53 |
| PacBio HiFi | DNAscope | 11.49 | N/A | 29.75 | 41.24 | 2.47 | 0.25 |
| Ultima UG100 | DNAscope | N/A | N/A | 17.87 | 17.87 | 1.07 | 0.11 |
Table 1: Runtime and On Demand and Spot compute cost of Sentieon DNAseq and DNAscope pipelines on Standard_HB120rs_v3. Alignment includes alignment with Sentieon BWA-MEM for short-read data and alignment with Sentieon minimap2 for PacBio HiFi and ONT Duplex data. Preprocessing includes duplicate marking, base-quality score recalibration, and merging of multiple aligned files into a single file. Variant calling includes variant calling or variant candidate identification along with variant genotyping and filtering. Variant calling for PacBio HiFi data is implemented as a multi-stage pipeline. All runs were in the eastus region.
¹ Pricing is accurate at the time of publication.
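The cost columns in Table 1 follow from the runtime multiplied by an hourly VM price. The sketch below shows that arithmetic. The hourly rates of $3.60 (On Demand) and $0.36 (Spot) are illustrative values back-calculated from Table 1, not prices quoted in this post, and the function name is hypothetical.

```python
# Illustrative hourly rates in USD (assumptions back-calculated from Table 1).
ON_DEMAND_RATE_PER_HOUR = 3.60
SPOT_RATE_PER_HOUR = 0.36


def run_cost(runtime_min, hourly_rate_usd):
    """Compute cost of a single run billed in proportion to its runtime."""
    return runtime_min / 60.0 * hourly_rate_usd


if __name__ == "__main__":
    # Total runtimes in minutes taken from Table 1.
    runs = {
        "Illumina HiSeq X / DNAseq": 28.18,
        "ONT Duplex / DNAscope": 88.28,
    }
    for name, minutes in runs.items():
        on_demand = run_cost(minutes, ON_DEMAND_RATE_PER_HOUR)
        spot = run_cost(minutes, SPOT_RATE_PER_HOUR)
        print(f"{name}: On Demand ${on_demand:.2f}, Spot ${spot:.2f}")
```

Because hardware cost scales linearly with runtime at a fixed rate, any runtime improvement in the software translates directly into a proportional cost reduction on the same machine.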
Comparing these results with those from v202112.05 of the software:
1. DNAseq Pipeline Performance:
– v202112.05: Took around 30 minutes with a Spot cost of $0.18.
– v202308.01: Took around 28 minutes with a Spot cost of $0.17.
– Improvement: In v202308.01, the runtime decreased by 2 minutes and the cost decreased by $0.01.
2. DNAscope Pipeline Performance:
– v202112.05: Took around 32 minutes with a Spot cost of $0.19.
– v202308.01: Improved to 19 minutes with a Spot cost of $0.11.
– Improvement: In v202308.01, the runtime decreased significantly, by 13 minutes, and the cost decreased by $0.08.
3. DNAscope LongRead Pipeline Performance (PacBio HiFi Data):
– v202112.05: Finished in 72 minutes with a Spot cost of $0.42.
– v202308.01: Improved to 41 minutes with a Spot cost of $0.25.
– Improvement: In v202308.01, the runtime decreased significantly, by 31 minutes, and the cost decreased by $0.17.
4. Element Biosciences AVITI System Performance:
– v202112.05: Finished in 31 minutes with a Spot cost of $0.18.
– v202308.01: Improved to 20 minutes with a Spot cost of $0.12.
– Improvement: In v202308.01, the runtime decreased by 11 minutes and the cost decreased by $0.06.
Overall, v202308.01 shows significant improvements in the runtime and cost efficiency of the DNAscope pipelines, including DNAscope LongRead, while the DNAseq pipeline changed only slightly. It’s also important to note that v202308.01 introduced support for ONT data in the DNAscope LongRead pipeline.
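The differences in this comparison can be recomputed from the rounded runtimes and Spot costs quoted for each version. The sketch below does so; the dictionary layout and function name are illustrative only.

```python
# (runtime in minutes, Spot cost in USD) for each version, as quoted in the comparison.
RESULTS = {
    "DNAseq": {"v202112.05": (30, 0.18), "v202308.01": (28, 0.17)},
    "DNAscope": {"v202112.05": (32, 0.19), "v202308.01": (19, 0.11)},
    "DNAscope LongRead (PacBio HiFi)": {"v202112.05": (72, 0.42), "v202308.01": (41, 0.25)},
    "DNAscope (Element AVITI)": {"v202112.05": (31, 0.18), "v202308.01": (20, 0.12)},
}


def improvement(old, new):
    """Minutes saved, dollars saved, and relative runtime reduction between versions."""
    old_min, old_cost = old
    new_min, new_cost = new
    return {
        "minutes_saved": old_min - new_min,
        "cost_saved": round(old_cost - new_cost, 2),
        "runtime_reduction_pct": round(100 * (old_min - new_min) / old_min, 1),
    }


if __name__ == "__main__":
    for pipeline, versions in RESULTS.items():
        delta = improvement(versions["v202112.05"], versions["v202308.01"])
        print(pipeline, delta)
```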
Sentieon benchmark across multiple instance families and architectures
The Sentieon pipelines and software can scale to smaller or larger instances depending on the data as well as instance availability. To provide an accurate representation of performance across various architectures, we again benchmarked the Sentieon DNAseq and DNAscope pipelines with the Illumina NovaSeq dataset on ARM and x86 architectures. The runtime and the On Demand and Spot compute costs are shown in Figure 2 for DNAseq and Figure 3 for DNAscope. With On Demand VMs you pay for compute capacity by the second, with no commitments or upfront payments. With Spot VMs you pay for unused compute capacity at a discount.
Figure 2: Runtime and Dedicated and Spot compute cost of Sentieon DNAseq pipeline across various Azure machine types using Illumina NovaSeq dataset sorted by overall runtime. Larger instances provide lower runtime, while cost is generally consistent within a family but does differ between architectures.
Figure 3: Runtime and Dedicated and Spot compute cost of Sentieon DNAscope pipeline across various Azure machine types using Illumina NovaSeq dataset sorted by overall runtime. Larger instances provide lower runtime, while cost is generally consistent within a family but does differ between architectures.
For the fastest turnaround, the Sentieon DNAseq pipeline can process the Illumina 30x NovaSeq dataset in 28 minutes on a Standard_HB120rs_v3, with a Dedicated cost of $1.72 or a Spot cost of $0.17 (see Figure 2). As another cost-effective option, DNAseq can be run on the Standard_D96ads_v5 instance with an On-Demand cost of $3.38, a Spot cost of $0.34, and a turnaround time of under 40 minutes (see Figure 2). The DNAscope pipeline runs on the Standard_D96ads_v5 instance with an On-Demand cost of $2.55, a Spot cost of $0.26, and a turnaround time of 31 minutes (see Figure 3). Note that for the Standard_F48s_v2, an additional external disk was used to accommodate all the test data for the analysis, and its cost is not included in the overall cost.
Comparing the performance and cost efficiency of v202308.01 and v202112.05:
1. DNAseq Pipeline Performance:
– v202112.05: Processed the Illumina 30x NovaSeq dataset in 30 minutes on a Standard_HB120rs_v3 with a Spot cost of $0.18.
– v202308.01: Processes the dataset in 28 minutes on a Standard_HB120rs_v3 with a Spot cost of $0.17. Alternatively, it can be processed on a Standard_D96ads_v5 instance in under 40 minutes with a Spot cost of $0.34.
– Improvement: The turnaround time on the Standard_HB120rs_v3 decreased slightly to 28 minutes, and the Spot cost decreased by $0.01. Additionally, the Standard_D96ads_v5 instance is available as an option, with a longer turnaround time of under 40 minutes and a higher Spot cost of $0.34 compared to $0.17.
2. DNAscope Pipeline Performance:
– v202112.05: Turnaround time of under 50 minutes with a Spot cost of $0.39.
– v202308.01: Turnaround time of 31 minutes on a Standard_D96ads_v5 instance with an On-Demand cost of $2.55 and a Spot cost of $0.26.
– Improvement: In v202308.01, the turnaround time decreased to 31 minutes and the Spot cost decreased to $0.26, offering improved performance and cost efficiency compared to the previous version.
3. Comparison Against ARM CPUs:
– v202112.05: ARM runtime was within 10-20 minutes of the x86 equivalent for Intel and AMD. Spot price of $0.33 for the DNAscope pipeline and $0.30 for the DNAseq pipeline.
– v202308.01: ARM runtime was within 10-20 minutes of the x86 equivalent for Intel and AMD. No significant difference in cost between architectures.
– Improvement: No significant difference in cost between the architectures is noted in v202308.01, whereas in v202112.05 there was a significant difference in cost for the AMD architecture compared to Intel.
Overall, in v202308.01, the DNAseq pipeline on the Standard_HB120rs_v3 shows a slight decrease in turnaround time and cost, while the DNAscope pipeline on the Standard_D96ads_v5 instance demonstrates improved performance and cost efficiency compared to the previous version. Additionally, there is no significant difference in cost between ARM and x86 architectures in v202308.01, unlike in v202112.05. The ordering of the machine types also differs slightly from the previous benchmark, but without significant changes.
We also ran a comparison against ARM CPUs. For a direct comparison we used the equivalent 32 vCPU machines, since the largest ARM machine available has 64 vCPUs compared to 96 vCPUs for x86 (Figures 2 and 3). Table 2 shows that ARM runtime was within 10-20 minutes of the x86 equivalent for Intel and AMD. Dedicated (On Demand) cost was also comparable across the board for both the DNAscope and DNAseq pipelines. Unlike in the previous benchmark, there was no significant difference in cost between the architectures.
| VM Size | Architecture | Pipeline | Total Runtime (min) | On Demand ($)¹ | Spot ($)¹ |
|---|---|---|---|---|---|
| D32ds_v5 | x86 (Intel) | DNAscope | 64.51 | 1.94 | 0.19 |
| D32ads_v5 | x86 (AMD) | DNAscope | 76.95 | 2.11 | 0.21 |
| D32pds_v5 | ARM | DNAscope | 82.00 | 1.98 | 0.20 |
| D32ds_v5 | x86 (Intel) | DNAseq | 121.51 | 3.66 | 0.37 |
| D32ads_v5 | x86 (AMD) | DNAseq | 115.12 | 3.16 | 0.32 |
| D32pds_v5 | ARM | DNAseq | 123.72 | 2.98 | 0.30 |
Table 2: Runtime, Dedicated and Spot compute cost of Sentieon DNAseq and DNAscope pipelines across 32 vCPU machines of different architectures. All runs were in the eastus region.
¹ Pricing is accurate at the time of publication.
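The runtime gap between ARM and the x86 machines can be read directly off Table 2. The sketch below computes it from the total runtimes in the table; the data layout and function name are illustrative only.

```python
# Total runtimes in minutes from Table 2 (32 vCPU machines).
RUNTIME_MIN = {
    "DNAscope": {"x86 (Intel)": 64.51, "x86 (AMD)": 76.95, "ARM": 82.00},
    "DNAseq": {"x86 (Intel)": 121.51, "x86 (AMD)": 115.12, "ARM": 123.72},
}


def arm_gap(pipeline):
    """Minutes by which the ARM run is slower than each x86 run for a pipeline."""
    runtimes = RUNTIME_MIN[pipeline]
    return {
        arch: round(runtimes["ARM"] - minutes, 2)
        for arch, minutes in runtimes.items()
        if arch != "ARM"
    }


if __name__ == "__main__":
    for pipeline in RUNTIME_MIN:
        print(pipeline, arm_gap(pipeline))
```

In every case the ARM run is slower than the x86 runs by less than 20 minutes.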
These results highlight the ability of the Sentieon software to scale up to large instances for faster turnaround and down to smaller instances as needed. We only included a subset of potential compute options, chosen for their compute-to-price ratios. However, the Sentieon tools can also be used with other machine families, based on availability in a given region.
Conclusion
Sentieon’s updated DNAseq and DNAscope pipelines are highly scalable and can be used on a variety of machine types. The software can scale up to the 120 vCPU Standard_HB120rs_v3 instances for turnaround times of 28 minutes or down to Standard_D32pds_v5 instances for a Spot cost of $0.30.
If you can get Standard_HB120rs_v3 in your preferred region, it is the cheapest per run. If it is not available, all other Spot pricing options are good choices, with Standard_D32ds_v5 and Standard_D96ds_v5 offering the best cost advantage. If turnaround time is the priority, we recommend any of the 96 vCPU options. Sentieon’s FASTQ to VCF pipelines can process Illumina 30x whole genomes for less than $3.60 on On Demand machines or $0.33 on Spot machines in under 120 minutes. The Standard_D32ds_v5 runs the DNAseq pipeline for $3.66 on On Demand machines or $0.37 on Spot machines in about 121 minutes. On Spot machines, Sentieon DNAseq can process 30x genomes from FASTQ to VCF for less than $1.50 on a variety of the machine types that we tested.
Overall, the new version of the software has decreased cost and, in some cases, decreased turnaround time, with increased performance and range of datasets it can analyze.
Readers should note that all costs are hardware costs and do not include software licensing costs.
To get started with the Sentieon software on Azure, please reach out to info@sentieon.com or visit the Sentieon website at www.sentieon.com.