SNP calling for the Illumina Infinium Omni5-4 SNP BeadChip kit using the butterfly method
Research output: Working paper › Preprint
We introduce the “butterfly method” for SNP calling with the Illumina Infinium Omni5-4 BeadChip kit without the use of Illumina GenomeStudio software. The method is a within-sample method: it uses neither other samples nor population frequencies to call SNPs. The butterfly method is based on a three-component mixture of normal distributions whose parameters are easily estimated using the open-source statistical software R. This makes the method transparent, makes it straightforward to change parameters according to the user’s needs, and makes it easy to analyse the data within R after the SNPs have been called. We contribute two open-source R packages that make SNP calling easy by helping with bookkeeping and by giving easy access to meta-information about the SNPs on the Illumina Infinium Omni5-4 BeadChip kit (including chromosome, probe type, and SNP bases). We test our method on more than 4 million SNPs and compare the results with those obtained with the GenTrain method used by Illumina GenomeStudio as well as with SNPs obtained by PCR-free whole genome sequencing (WGS). We demonstrate two variants of our method: one that accounts for potential probe type bias by estimating a separate model for each probe type (type I and type II), and another that uses a general model whose parameter estimates do not depend on the sample being analysed. We vary the no-call rate and show how this changes the concordance with the WGS data. This is done by applying a threshold to the a posteriori probability of belonging to a SNP cluster and by using the number of beads to adjust the stringency of the no-call mechanism. With the butterfly method, we achieve a SNP call rate of around 99% and a SNP concordance of around 99% with the WGS data.
By lowering the a posteriori probability threshold for no-calls, we can obtain a higher call rate than GenomeStudio, and by raising the threshold, we can achieve a higher concordance with the WGS data than GenomeStudio.
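
The calling idea described in the abstract (a three-component normal mixture, with a no-call whenever the largest a posteriori probability falls below a threshold) can be illustrated with a minimal sketch. This is not the authors' R implementation. It is written in plain Python and makes several assumptions: each SNP is summarised by a single hypothetical one-dimensional signal, the three clusters are labelled AA, AB and BB from low to high signal, the data are simulated, and the bead-count adjustment mentioned in the abstract is left out. All function names are illustrative.

```python
import math
import random

GENOTYPES = ("AA", "AB", "BB")  # assumed cluster order, from low to high signal


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posteriors(x, weights, means, sds):
    """A posteriori probability of each of the three clusters given signal x."""
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(dens)
    if total == 0.0:  # numerically far from every cluster: no information
        return [1.0 / 3.0] * 3
    return [d / total for d in dens]


def fit_mixture(values, n_iter=100):
    """Plain EM for a 1-D mixture of three normals; returns (weights, means, sds)."""
    n = len(values)
    ordered = sorted(values)
    # Crude start (assumption): means at the 10%, 50% and 90% quantiles.
    means = [ordered[int(q * (n - 1))] for q in (0.1, 0.5, 0.9)]
    centre = sum(values) / n
    spread = math.sqrt(sum((v - centre) ** 2 for v in values) / n)
    sds = [max(spread / 3.0, 1e-6)] * 3
    weights = [1.0 / 3.0] * 3
    for _ in range(n_iter):
        # E-step: cluster responsibilities under the current parameters.
        resp = [posteriors(v, weights, means, sds) for v in values]
        # M-step: weighted updates of mixing weight, mean and standard deviation.
        for k in range(3):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds


def call_genotype(x, params, threshold=0.99):
    """Return the most probable genotype, or "NC" (no-call) below the threshold."""
    post = posteriors(x, *params)
    best = max(range(3), key=lambda k: post[k])
    return GENOTYPES[best] if post[best] >= threshold else "NC"


def call_rate_and_concordance(calls, reference):
    """Call rate over all SNPs; concordance over the called SNPs only."""
    called = [(c, r) for c, r in zip(calls, reference) if c != "NC"]
    call_rate = len(called) / len(calls)
    concordance = (
        sum(c == r for c, r in called) / len(called) if called else float("nan")
    )
    return call_rate, concordance


# Simulated stand-in for one sample (not real array or WGS data).
random.seed(1)
truth, signal = [], []
for genotype, centre in zip(GENOTYPES, (-2.0, 0.0, 2.0)):
    for _ in range(200):
        truth.append(genotype)
        signal.append(random.gauss(centre, 0.3))

params = fit_mixture(signal)
strict = [call_genotype(x, params, threshold=0.999) for x in signal]
lenient = [call_genotype(x, params, threshold=0.6) for x in signal]
print("strict :", call_rate_and_concordance(strict, truth))
print("lenient:", call_rate_and_concordance(lenient, truth))
```

In this sketch, any SNP called at the strict threshold is also called at the lenient one, so lowering the threshold can only keep or raise the call rate. Whether concordance falls in exchange depends on how much the clusters overlap in the real data.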
Original language | English |
---|---|
Publisher | bioRxiv |
Number of pages | 15 |
DOIs | |
Publication status | Published - 20 Jan 2022 |
Links
- https://www.biorxiv.org/content/10.1101/2022.01.17.476594v1.full.pdf
Final published version
ID: 302456466