BICseq2 Mappability File
BICseq2-norm
makes use of a unique mappability file to aid in the process of normalizing the raw coverage data presented in the seq
files. This mappability file must be generated for each library size (e.g., 150 bp) given that unique mappability will vary with read length. A 100 bp read might not map uniquely at a given position, but a 150 bp read starting from the same position might map uniquely given 50 additional bases at the end.
The current mappability file was generated for 150 bp reads using a custom workflow, as follows:
The file
chromosomes.txt
was created with only the 23 chromosomes from hg38/GRCh38 (e.g., chr1, chr2 … chr22, chrX, chrY; each on their own line). These regions were extracted from our hg38/GRCh38 reference genomeGAPFIXRDPDK5.fa
to generatehg38_main_chrs.fa
and afasta
index file was generated for this output.
for file in $(cat chromosomes.txt); do samtools faidx GAPFIXRDPDK5.fa $file >> hg38_main_chrs.fa; done
samtools faidx hg38_main_chrs.fa
Using an archived version of GEMTools (v 1.7.1-i3) distributed in the github repo below, the initial mappability file was generated and converted to
wig
format:
git clone https://github.com/LinjieWu/GenerateMappability
cd GenerateMappability
python setup.py
cd ..
SoftwareDir="<path_to_folder>/GenerateMappability"
export PATH=${SoftwareDir}/gemtools-1.7.1-i3/bin/:$PATH
gem-indexer -T 16 -c dna -i hg38_main_chrs.fa -o hg38_main_chr_index
gem-mappability -T 16 -I hg38_main_chr_index.gem -l 150 -o hg38_full_mappability_150
gem-2-wig -I hg38_main_chr_index.gem -i hg38_full_mappability_150.mappability -o hg38_full_mappability_150
This
wig
mappability file must next be converted to abed
file through a series of conversion steps using tools available from UCSC:
./wigToBigWig hg38_full_mappability_150.wig hg38_full_mappability_150.sizes hg38_full_mappability_150.bw
./bigWigToBedGraph hg38_full_mappability_150.bw hg38_full_mappability_150.bedGraph
./bedGraphTobed hg38_full_mappability_150.bedGraph hg38_full_mappability_150.bed 1
After testing this mappability file, we determined that repetitive regions at the centromeres were causing large numbers of artefactual CNVs. BICseq2 had been optimized previously for hg19/GRCh37 with mappability files that excluded the centromeres, so we decided to also exclude the centromeric regions from our hg38/GRCh38 mappability file. The centromeres for hg38/GRCh38 were pulled from UCSC as follows:
Navigate to http://genome.ucsc.edu/cgi-bin/hgTables
Under
assembly
, selectDec. 2013 (GRCh38/hg38)
Under
group
, selectMapping and Sequencing
Under
track
, selectChromosome Band (Ideogram)
Under
filter
, selectcreate
Under
gieStain
, selectdoes
match, and typeacen
in the text box, then selectsubmit
Under
output format
, selectBED - browser extensible data
Select
get output
Select
get BED
This
bed
file was saved ascentromeres.bed
and subtracted from the existing mappability file:
bedtools subtract -a hg38_full_mappability_150.bed -b centromeres.bed > hg38_full_mappability_150_no_centromeres.bed
Finally, the
bed
file was parsed to generate a single mappability file for each chromosome in the format required byBICseq2-norm
:
for file in $(cat chromosomes.txt); do echo $file; grep -P ${file}'\t' hg38_full_mappability_150_no_centromeres.bed | awk -v OFS='\t' '{print $2, $3}' > full_mappability_hg38_150_no_centromeres/${file}_mappability; done