Data Sources

Software and data sources for variant annotation.

VEP

Current software version is 101.

Annotation uses Variant Effect Predictor (VEP) software.

Source files for Software and Plugins.

Annotation Sources

VEP

This is the main annotation source for VEP.

Source file v101 for homo_sapiens on hg38/GRCh38.

$ wget ftp://ftp.ensembl.org/pub/release-101/variation/vep/homo_sapiens_vep_101_GRCh38.tar.gz

MaxEnt

Current version is v20040421.

This is the data source used by the MaxEntScan plugin.

Source file fordownload.

ClinVar

Current version is v20201101. ClinVar is updated weekly.

This is the ClinVar data source, supplied to VEP with --custom.

# Compressed VCF file
$ curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
# Index file
$ curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi

SpliceAI

Current version is v1.3.

This is the data source used by the SpliceAI plugin.

Downloading requires an Illumina account and the BaseSpace Sequence Hub CLI (bs).

# Authenticate
$ bs auth
# Get id for dataset genome_scores
$ bs list dataset
# Download
$ bs dataset download --id <datasetid> -o .

For annotation we are using the raw hg38/GRCh38 files and their index:

spliceai_scores.raw.snv.hg38.vcf.gz

spliceai_scores.raw.snv.hg38.vcf.gz.tbi

spliceai_scores.raw.indel.hg38.vcf.gz

spliceai_scores.raw.indel.hg38.vcf.gz.tbi

dbNSFP

Current version is v4.1a.

This is the data source used by the dbNSFP plugin.

A small modification was made to the source code for the dbNSFP plugin to allow for annotation of non-missense variants. The change is shown below with the original code commented out.

#my %INCLUDE_SO = map {$_ => 1} qw(missense_variant stop_lost stop_gained start_lost);
my %INCLUDE_SO = map {$_ => 1} qw(missense_variant stop_lost stop_gained start_lost splice_donor_variant splice_acceptor_variant splice_region_variant frameshift inframe_insertion inframe_deletion);

Source file dbNSFP.

To create the data source:

# Download and unpack
$ wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP4.1a.zip
$ unzip dbNSFP4.1a.zip
# Get header
$ zcat dbNSFP4.1a_variant.chr1.gz | head -n1 > h
# Extract information and compress to bgzip
$ zgrep -h -v '^#chr' dbNSFP4.1a_variant.chr* | sort -T /path/to/tmp_folder -k1,1 -k2,2n - | cat h - | bgzip -c > dbNSFP4.1a.gz
# Create tabix index
$ tabix -s 1 -b 2 -e 2 dbNSFP4.1a.gz

gnomAD Genomes

Current genome version is v3.1.

Files are available for download at https://gnomad.broadinstitute.org/downloads.

Files have been preprocessed to reduce the number of annotations using the filter_gnomAD.py script inside the scripts folder. The annotations that are used and maintained are listed in the gnomAD_3.1_fields.tsv file inside the variants folder.

The gnomAD files were filtered while being split by chromosome. The filtered VCF files were then concatenated, compressed with bgzip, and indexed using tabix.
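
The kind of INFO trimming described above can be sketched in Python as follows. This is an illustrative sketch only, not the contents of filter_gnomAD.py: the function name trim_info and the example line are hypothetical, and it handles data lines only (header lines, including the ##INFO definitions, would need separate handling).

```python
def trim_info(vcf_line, keep):
    """Return a VCF data line whose INFO column keeps only the tags in `keep`.

    Hypothetical helper for illustration; not taken from filter_gnomAD.py.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    kept = []
    # INFO is the 8th column; entries are 'TAG=value' or a bare flag 'TAG'.
    for entry in fields[7].split(";"):
        tag = entry.split("=", 1)[0]
        if tag in keep:
            kept.append(entry)
    # An empty INFO column is written as '.' in VCF.
    fields[7] = ";".join(kept) if kept else "."
    return "\t".join(fields)


# Made-up example line; real gnomAD records carry many more INFO tags.
line = "chr1\t10031\t.\tT\tC\t.\tPASS\tAC=2;AN=100;AF=0.02;vep=xyz"
trimmed = trim_info(line, {"AC", "AF"})
print(trimmed)
```

In practice the set passed as keep would be read from a fields list such as gnomAD_3.1_fields.tsv.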

gnomAD Exomes

Current exome version is v2.1.1 (hg38/GRCh38 lift-over).

The all-chromosomes VCF was downloaded from https://gnomad.broadinstitute.org/downloads.

This file was preprocessed to reduce the number of annotations using the gnomAD_exome_v2_filter.py script inside the scripts folder. The annotations that are used and maintained are listed in the gnomAD_2.1_fields.tsv file inside the variants folder.

The filtered vcf was compressed with bgzip and indexed using tabix.

gnomAD Structural Variants

Current SV version is nstd166 (hg38/GRCh38 lift-over).

The file was originally downloaded from https://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/vcf/nstd166.GRCh38.variant_call.vcf.gz, but that link now points to a newer, incorrect file.

See nstd166_GRCh38_readme.txt in s3://cgap-annotations/gnomAD/SV/ for an in-depth explanation. Copies of both the original file (currently used) and the newer file are kept in the bucket.

CADD

Current version is v1.6.

CADD SNV and INDEL files were downloaded from https://cadd-staging.kircherlab.bihealth.org/download.

$ wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz
$ wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/gnomad.genomes.r3.0.indel.tsv.gz

This is the data source used by the CADD plugin.

Conservation Scores

Current version is UCSC hg38/GRCh38 for phyloP30way, phyloP100way, and phastCons100way.

$ wget http://hgdownload.cse.ucsc.edu/goldenpath/hg38/phyloP30way/hg38.phyloP30way.bw
$ wget http://hgdownload.cse.ucsc.edu/goldenpath/hg38/phyloP100way/hg38.phyloP100way.bw
$ wget http://hgdownload.cse.ucsc.edu/goldenpath/hg38/phastCons100way/hg38.phastCons100way.bw

These files are supplied to VEP as custom annotations with --custom.

Run VEP

# Base command
vep \
-i input.vcf \
-o output.vep.vcf \
--hgvs \
--fasta <PATH/reference.fa> \
--assembly GRCh38 \
--use_given_ref \
--offline \
--cache_version 101 \
--dir_cache . \
--everything \
--force_overwrite \
--vcf \
--dir_plugins <PATH/VEP_plugins>

# Additional plugins
--plugin SpliceRegion,Extended
--plugin MaxEntScan,<PATH/fordownload>
--plugin TSSDistance
--plugin dbNSFP,<PATH/dbNSFP.gz>,phyloP100way_vertebrate_rankscore,GERP++_RS,GERP++_RS_rankscore,SiPhy_29way_logOdds,SiPhy_29way_pi,PrimateAI_score,PrimateAI_pred,PrimateAI_rankscore,CADD_raw_rankscore,Polyphen2_HVAR_pred,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_score,SIFT_pred,SIFT_converted_rankscore,SIFT_score,REVEL_rankscore,REVEL_score,Ensembl_geneid,Ensembl_proteinid,Ensembl_transcriptid
--plugin SpliceAI,snv=<PATH/spliceai_scores.raw.snv.hg38.vcf.gz>,indel=<PATH/spliceai_scores.raw.indel.hg38.vcf.gz>
--plugin CADD,<PATH/whole_genome_SNVs.tsv.gz>,<PATH/gnomad.genomes.r3.0.indel.tsv.gz>

# Custom annotations
--custom <PATH/clinvar.vcf.gz>,ClinVar,vcf,exact,0,ALLELEID,CLNSIG,CLNREVSTAT,CLNDN,CLNDISDB,CLNDNINCL,CLNDISDBINCL,CLNHGVS,CLNSIGCONF,CLNSIGINCL,CLNVC,CLNVCSO,CLNVI,DBVARID,GENEINFO,MC,ORIGIN,RS,SSR
--custom <PATH/gnomAD.vcf.gz>,gnomADg,vcf,exact,0,AC,AC-XX,AC-XY,AC-afr,AC-ami,AC-amr,AC-asj,AC-eas,AC-fin,AC-mid,AC-nfe,AC-oth,AC-sas,AF,AF-XX,AF-XY,AF-afr,AF-ami,AF-amr,AF-asj,AF-eas,AF-fin,AF-mid,AF-nfe,AF-oth,AF-sas,AF_popmax,AN,AN-XX,AN-XY,AN-afr,AN-ami,AN-amr,AN-asj,AN-eas,AN-fin,AN-mid,AN-nfe,AN-oth,AN-sas,nhomalt,nhomalt-XX,nhomalt-XY,nhomalt-afr,nhomalt-ami,nhomalt-amr,nhomalt-asj,nhomalt-eas,nhomalt-fin,nhomalt-mid,nhomalt-nfe,nhomalt-oth,nhomalt-sas
--custom <PATH/trimmed_gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.gz>,gnomADe2,vcf,exact,0,AC,AN,AF,nhomalt,AC_oth,AN_oth,AF_oth,nhomalt_oth,AC_sas,AN_sas,AF_sas,nhomalt_sas,AC_fin,AN_fin,AF_fin,nhomalt_fin,AC_eas,AN_eas,AF_eas,nhomalt_eas,AC_amr,AN_amr,AF_amr,nhomalt_amr,AC_afr,AN_afr,AF_afr,nhomalt_afr,AC_asj,AN_asj,AF_asj,nhomalt_asj,AC_nfe,AN_nfe,AF_nfe,nhomalt_nfe,AC_female,AN_female,AF_female,nhomalt_female,AC_male,AN_male,AF_male,nhomalt_male,AF_popmax
--custom <PATH/hg38.phyloP100way.bw>,phylop100verts,bigwig,exact,0
--custom <PATH/hg38.phyloP30way.bw>,phylop30mams,bigwig,exact,0
--custom <PATH/hg38.phastCons100way.bw>,phastcons100verts,bigwig,exact,0
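
The base command, plugin options, and custom annotations above are fragments of a single vep invocation. One way to assemble them is sketched below in Python. The function build_vep_command and the example paths are hypothetical; only flags that appear in the listing above are used.

```python
import shlex


def build_vep_command(input_vcf, output_vcf, fasta, plugins_dir,
                      plugins=(), customs=()):
    """Assemble the vep argument list from the fragments listed above.

    Hypothetical helper for illustration; the flag set mirrors the base command.
    """
    cmd = [
        "vep",
        "-i", input_vcf,
        "-o", output_vcf,
        "--hgvs",
        "--fasta", fasta,
        "--assembly", "GRCh38",
        "--use_given_ref",
        "--offline",
        "--cache_version", "101",
        "--dir_cache", ".",
        "--everything",
        "--force_overwrite",
        "--vcf",
        "--dir_plugins", plugins_dir,
    ]
    # Each plugin and each custom annotation gets its own flag.
    for plugin in plugins:
        cmd += ["--plugin", plugin]
    for custom in customs:
        cmd += ["--custom", custom]
    return cmd


# Placeholder paths; substitute real locations before running.
cmd = build_vep_command(
    "input.vcf", "output.vep.vcf", "PATH/reference.fa", "PATH/VEP_plugins",
    plugins=["SpliceRegion,Extended", "TSSDistance"],
    customs=["PATH/hg38.phyloP100way.bw,phylop100verts,bigwig,exact,0"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, which avoids shell quoting problems with the long comma-separated field lists.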

dbSNP

Current database version is v151.

# Download all variants file from the GATK folder
$ wget https://ftp.ncbi.nlm.nih.gov/snp/pre_build152/organisms/human_9606_b151_GRCh38p7/VCF/GATK/00-All.vcf.gz
# Parse to reduce size
$ python vcf_parse_keep5.py 00-All.vcf.gz 00-All_keep5.vcf
# Compress and index
$ bgzip 00-All_keep5.vcf
$ bcftools index 00-All_keep5.vcf.gz
$ tabix 00-All_keep5.vcf.gz

Cytoband

The hg38/GRCh38 cytoband reference file is from UCSC: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cytoBand.txt.gz.

HGVSg

Current version is 20.05.

The Human Genome Variation Society (HGVS) has strict guidelines and best practices for describing human genomic variants based on the reference genome, chromosomal position, and variant type. HGVSg can be used to describe all genomic variants, not just those within coding regions. The script used to generate HGVSg information in our pipeline implements the HGVS recommendations for DNA variants (http://varnomen.hgvs.org/recommendations/DNA/). Within this field we describe substitutions, deletions, insertions, and deletion-insertions for all variants on the 23 nuclear chromosomes and the mitochondrial genome.
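
As an illustration of the four variant types named above, the following Python sketch derives a g. description from VCF-style POS/REF/ALT values. It is not the pipeline's script: the function name hgvsg is hypothetical, the chromosome label is passed through as given (HGVS proper uses a RefSeq accession), and the sketch skips 3' shifting, duplications, and the m. prefix used for mitochondrial variants.

```python
def hgvsg(chrom, pos, ref, alt):
    """Build a simplified HGVS genomic description from VCF-style fields.

    Hypothetical helper for illustration; no 3' shifting or dup detection.
    """
    # Drop bases shared at the start (VCF indels carry an anchor base).
    while ref and alt and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    # Drop bases shared at the end.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    if not ref and not alt:
        raise ValueError("REF and ALT are identical")

    # Substitution: exactly one base replaced by one base.
    if len(ref) == 1 and len(alt) == 1:
        return f"{chrom}:g.{pos}{ref}>{alt}"

    # Insertion: nothing removed; described by the two flanking positions.
    if not ref:
        return f"{chrom}:g.{pos - 1}_{pos}ins{alt}"

    end = pos + len(ref) - 1
    span = str(pos) if end == pos else f"{pos}_{end}"
    # Deletion: bases removed, nothing added.
    if not alt:
        return f"{chrom}:g.{span}del"
    # Deletion-insertion: bases removed and different bases added.
    return f"{chrom}:g.{span}delins{alt}"


print(hgvsg("chr1", 100, "A", "G"))
print(hgvsg("chr1", 100, "ATCG", "A"))
print(hgvsg("chr1", 100, "A", "ATT"))
print(hgvsg("chr1", 100, "AT", "GC"))
```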

Version

Current version accessed 2021-04-20.

VEP: v101

MaxEnt: v20040421

ClinVar: v20201101

SpliceAI: v1.3

dbNSFP: v4.1a

gnomAD: v3.1

gnomAD_exomes: v2.1.1

CADD: v1.6

phyloP30way: hg38/GRCh38

phyloP100way: hg38/GRCh38

phastCons100way: hg38/GRCh38

dbSNP: v151

HGVSg: 20.05

Cytoband: hg38/GRCh38