vivaldi (Viral Variant Location and Diversity) is an R package for analyzing intrahost viral variation from Illumina sequencing data. The package is built around variant call format (VCF) files and provides tools to:
- import and tidy viral variant calls from one or many VCF files
- merge technical replicates
- filter variants by coverage and allele frequency thresholds
- parse SnpEff annotations
- summarize shared variants and mutation spectra
- quantify diversity using metrics such as Shannon entropy and dN/dS
- generate publication-ready visualizations of variant positions, frequencies, and genome-wide patterns
The package was developed for viral diversity analyses such as those described in the final published paper:
Roder AE, Johnson KEE, Knoll M, Khalfan M, Wang B, Schultz-Cherry S, Banakis S, Kreitman A, Mederos C, Wang W, Ruchnewitz D, Samanovic MI, Mulligan MJ, Lassig M, Łuksza M, Das S, Gresham D, Ghedin E. Optimized Quantification of Intrahost Viral Diversity in SARS-CoV-2 and Influenza Virus Sequence Data. mBio. 2023. https://doi.org/10.1128/mbio.01046-23
install.packages("vivaldi")
# install.packages("remotes")
remotes::install_github("GreshamLab/vivaldi")
vivaldi depends on widely used R packages for variant parsing, data wrangling, and visualization:
- vcfR for reading VCF files
- dplyr, tidyr, and magrittr for data manipulation
- ggplot2 and plotly for plotting
- seqinr for reading reference FASTA files
- glue for string handling
The main input is a directory of .vcf files. Depending on the workflow, you may also provide:
- a reference FASTA file used for alignment
- a table of segment or genome sizes
- a replicate mapping table for technical replicate merging
- SnpEff-annotated VCF files for annotation-aware downstream analyses
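As a rough picture of how a segment size table relates to the variant calls, the base R sketch below joins a toy size table to a toy variant table by segment name. It only illustrates the idea of a metadata join. The rows are made up, `size_nt` is a hypothetical column name, and the package's own metadata function may behave differently.

```r
# Toy variant table; CHROM holds the segment name, as in a VCF.
variants <- data.frame(
  sample = c("s1", "s1", "s2"),
  CHROM  = c("H1N1_HA", "H1N1_NA", "H1N1_HA"),
  POS    = c(100L, 250L, 100L),
  stringsAsFactors = FALSE
)

# Toy size table; `segment` matches CHROM. `size_nt` is a hypothetical
# column name and the sizes are illustrative values only.
sizes <- data.frame(
  segment = c("H1N1_HA", "H1N1_NA"),
  size_nt = c(1701L, 1410L),
  stringsAsFactors = FALSE
)

# Left join: keep every variant and attach the size of its segment.
with_sizes <- merge(variants, sizes, by.x = "CHROM", by.y = "segment", all.x = TRUE)
print(with_sizes)
```

A left join is used so that variants on a segment missing from the size table are kept (with `NA` size) rather than silently dropped.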
The package ships with example data accessible via system.file("extdata", package = "vivaldi"), including:
- H1N1.fa - example influenza reference FASTA
- SegmentSize.csv - segment size metadata
- reps.csv - replicate metadata
- vcfs/ - example annotated VCF files
An example processed dataset is also included as example_filtered_SNV_df.
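To see what is actually installed on your machine, a short sketch like the following lists the bundled files. It uses only base R. The `.vcf` file extension is an assumption based on the description of the input above.

```r
# Locate the example data directory; system.file() returns "" if the
# package (or the path) is not found.
extdata_dir <- system.file("extdata", package = "vivaldi")

# List everything that ships under extdata/, including the vcfs/ folder.
example_files <- list.files(extdata_dir, recursive = TRUE)
print(example_files)

# Keep only the VCF files (assumed to end in .vcf).
vcf_files <- grep("\\.vcf$", example_files, value = TRUE)
print(length(vcf_files))
```

If the printed list is empty, the package is probably not installed in the active library.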
A common vivaldi analysis looks like this:
- Load VCF files with arrange_data()
- Merge technical replicates with merge_replicates() when replicate sequencing is available
- Filter low-confidence variants with filter_variants()
- Expand annotation fields with prepare_annotations()
- Add metadata such as segment sizes with add_metadata()
- Summarize diversity with functions such as tstv_ratio(), shannon_entropy(), and dNdS_segment()
- Visualize results with functions such as af_distribution(), snv_location(), snv_genome(), snv_segment(), plot_shannon(), and shared_snv_plot()
library(vivaldi)

# Paths to the example data bundled with the package
vardir <- system.file("extdata", "vcfs", package = "vivaldi")
reference_fasta <- system.file("extdata", "H1N1.fa", package = "vivaldi")
seg_sizes <- system.file("extdata", "SegmentSize.csv", package = "vivaldi")
rep_info <- system.file("extdata", "reps.csv", package = "vivaldi")

# Read all VCF files into one dataframe (the example VCFs are SnpEff-annotated)
vcf_df <- arrange_data(vardir, reference_fasta, annotated = "yes")

# Replicate mapping and segment size tables
replicates <- read.csv(rep_info)
sizes <- read.csv(seg_sizes)

# Keep variants shared by both replicates, matching on the listed columns
merged_df <- merge_replicates(
  vcf_df,
  replicates,
  "rep1",
  "rep2",
  c("sample", "CHROM", "POS", "REF", "ALT", "ANN", "ALT_TYPE", "major", "minor")
)

# Apply coverage and allele frequency thresholds
filtered_df <- filter_variants(
  merged_df,
  coverage_cutoff = 0,
  frequency_cutoff = 0.01
)

# Split SnpEff annotations into columns, then join segment sizes (CHROM = segment)
annotated_df <- prepare_annotations(filtered_df)
annotated_df <- add_metadata(annotated_df, sizes, c("CHROM"), c("segment"))

# Calculate Shannon entropy
annotated_df <- shannon_entropy(annotated_df, genome_size = 13133)
plot_shannon(annotated_df)

- arrange_data() - read VCF files and combine them into a single dataframe
- read_reference_fasta_dna() - extract chromosome or segment sizes from a FASTA file
- prepare_annotations() - split SnpEff annotations into separate columns
- snpeff_info() - parse annotation information from VCF INFO fields
- add_metadata() - join external metadata to the variant dataframe

- filter_variants() - apply coverage and allele frequency thresholds
- merge_replicates() - keep shared variants across sequencing replicates and compute summary frequencies

- tally_it() - count variants over user-defined groups
- tstv_ratio() - calculate transition/transversion ratios
- shannon_entropy() - calculate per-position, per-segment, and genome-wide diversity metrics
- dNdS_segment() - estimate dN/dS summaries by coding feature or segment

- af_distribution() - plot minor allele frequency distributions
- position_allele_freq() - inspect the allele frequencies of a single site across samples
- shared_snv_plot() and shared_snv_table() - identify and summarize shared variants
- snv_location() - visualize SNV positions across samples and segments
- snv_genome() - summarize SNVs across genomes
- snv_segment() - summarize SNVs by genome segment
- plot_shannon() - visualize Shannon entropy summaries
- tstv_plot() - visualize transition/transversion summaries
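To give a feel for what a per-position Shannon entropy measures, here is a minimal base R sketch for biallelic sites. It is not the package's implementation. It assumes the major frequency is one minus the minor frequency and uses log base 2, and `site_entropy()` is a hypothetical helper written for this illustration.

```r
# Shannon entropy of one biallelic site from its minor allele frequency.
# Assumptions: major = 1 - minor, and log base 2 (entropy in bits).
site_entropy <- function(minor_freq) {
  p <- c(minor_freq, 1 - minor_freq)
  p <- p[p > 0]            # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

minor_freqs <- c(0.01, 0.10, 0.50)   # made-up frequencies
per_site <- vapply(minor_freqs, site_entropy, numeric(1))
print(round(per_site, 3))

# One simple genome-level summary: total entropy divided by genome length.
genome_size <- 13133                 # same value as in the example workflow
print(sum(per_site) / genome_size)
```

Entropy is largest when both alleles are equally frequent and falls toward zero as the minor allele becomes rare, which is why low-frequency variants contribute little to the genome-wide value.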
For a longer worked example, see the package vignette:
browseVignettes("vivaldi")
Source documentation and examples are also available in the package help pages:
help(package = "vivaldi")
The repository uses standard R package tooling, including testthat tests under tests/testthat and GitHub Actions R-CMD-check workflows.
Marissa Knoll, Katherine Johnson, Megan Hockman, Eric Borenstein, Mohammed Khalfan, Elodie Ghedin, and David Gresham
Maintainer: David Gresham dg107@nyu.edu
If you use vivaldi in published research, please cite the published article rather than the preprint:
Roder AE, Johnson KEE, Knoll M, Khalfan M, Wang B, Schultz-Cherry S, Banakis S, Kreitman A, Mederos C, Wang W, Ruchnewitz D, Samanovic MI, Mulligan MJ, Lassig M, Łuksza M, Das S, Gresham D, Ghedin E. Optimized Quantification of Intrahost Viral Diversity in SARS-CoV-2 and Influenza Virus Sequence Data. mBio. 2023. https://doi.org/10.1128/mbio.01046-23