BICseq2 Mappability File
BICseq2-norm makes use of a unique mappability file to aid in the process of normalizing the raw coverage data presented in the seq files. This mappability file must be generated for each library size (e.g., 150 bp) given that unique mappability will vary with read length. A 100 bp read might not map uniquely at a given position, but a 150 bp read starting from the same position might map uniquely given 50 additional bases at the end.
The current mappability file was generated for 150 bp reads using a custom workflow, as follows:
The file
chromosomes.txtwas created with only the 23 chromosomes from hg38/GRCh38 (e.g., chr1, chr2 … chr22, chrX, chrY; each on their own line). These regions were extracted from our hg38/GRCh38 reference genomeGAPFIXRDPDK5.fato generatehg38_main_chrs.faand afastaindex file was generated for this output.
for file in $(cat chromosomes.txt); do samtools faidx GAPFIXRDPDK5.fa $file >> hg38_main_chrs.fa; done
samtools faidx hg38_main_chrs.fa
Using an archived version of GEMTools (v 1.7.1-i3) distributed in the github repo below, the initial mappability file was generated and converted to
wigformat:
git clone https://github.com/LinjieWu/GenerateMappability
cd GenerateMappability
python setup.py
cd ..
SoftwareDir="<path_to_folder>/GenerateMappability"
export PATH=${SoftwareDir}/gemtools-1.7.1-i3/bin/:$PATH
gem-indexer -T 16 -c dna -i hg38_main_chrs.fa -o hg38_main_chr_index
gem-mappability -T 16 -I hg38_main_chr_index.gem -l 150 -o hg38_full_mappability_150
gem-2-wig -I hg38_main_chr_index.gem -i hg38_full_mappability_150.mappability -o hg38_full_mappability_150
This
wigmappability file must next be converted to abedfile through a series of conversion steps using tools available from UCSC:
./wigToBigWig hg38_full_mappability_150.wig hg38_full_mappability_150.sizes hg38_full_mappability_150.bw
./bigWigToBedGraph hg38_full_mappability_150.bw hg38_full_mappability_150.bedGraph
./bedGraphTobed hg38_full_mappability_150.bedGraph hg38_full_mappability_150.bed 1
After testing this mappability file, we determined that repetitive regions at the centromeres were causing large numbers of artefactual CNVs. BICseq2 had been optimized previously for hg19/GRCh37 with mappability files that excluded the centromeres, so we decided to also exclude the centromeric regions from our hg38/GRCh38 mappability file. The centromeres for hg38/GRCh38 were pulled from UCSC as follows:
Navigate to http://genome.ucsc.edu/cgi-bin/hgTables
Under
assembly, selectDec. 2013 (GRCh38/hg38)Under
group, selectMapping and SequencingUnder
track, selectChromosome Band (Ideogram)Under
filter, selectcreateUnder
gieStain, selectdoesmatch, and typeacenin the text box, then selectsubmitUnder
output format, selectBED - browser extensible dataSelect
get outputSelect
get BED
This
bedfile was saved ascentromeres.bedand subtracted from the existing mappability file:
bedtools subtract -a hg38_full_mappability_150.bed -b centromeres.bed > hg38_full_mappability_150_no_centromeres.bed
Finally, the
bedfile was parsed to generate a single mappability file for each chromosome in the format required byBICseq2-norm:
for file in $(cat chromosomes.txt); do echo $file; grep -P ${file}'\t' hg38_full_mappability_150_no_centromeres.bed | awk -v OFS='\t' '{print $2, $3}' > full_mappability_hg38_150_no_centromeres/${file}_mappability; done