In pursuit of gene variation of consequence to human health and disease

Research output: Book/Report › Ph.D. thesis › Research

From the invention of Sanger sequencing to the advent of today's high-throughput and long-read methodologies, sequencing technology has become a vital tool for scientific research. The first version of the human genome was released in 2001 and refined over the following years, until the complete genome sequence was published in 2022. In parallel, the 1000 Genomes Project has revealed the extent of human genetic variation and polymorphism, filling a gap in our knowledge of the diversity of the human mutational landscape. Transcriptome sequencing provides a means to study changes in gene expression patterns, and the signaling pathways they involve, in disease and other biological processes. With the advancement of computer science, machine learning (ML) has been introduced into biological and medical research; using ML approaches, scientists hope to find the biological signals and patterns hidden within massive datasets. The first chapter of this thesis provides an overview of the human genome, transcriptome research and different machine learning algorithms, including their applications in biological and medical research. The last chapter centers on two projects I worked on during my Ph.D.

In the first project, simply called DNA prediction, we employed a central model, a Markov model and a bi-directional Markov model to estimate the probability of each of the four nucleotides occurring at a site given its context sequence; the input for these models was the human reference genome (a simplified sketch of context-based base prediction follows the abstract). The results show that base prediction accuracy across the human genome was above 50% on average, compared with 25% for random guessing. We applied the predictions to SNP databases and found that, for somatic SNPs, the alternative alleles had higher predicted probabilities than the reference bases. In addition, we developed a substitution model to calculate base mutability. Here, we found that the α matrix depends on a much smaller context sequence, and in the predictions of the model with one base of context on each side, cytosine (C) showed a higher mutability to thymine (T) at CpG sites. Our substitution model also fits the somatic mutations very well.

In the second project, we developed a generative neural network consisting of a decoder and a Gaussian mixture model; hence, we called it a deep generative decoder model (sketched below). We applied the model to gene expression data, training it on bulk RNA sequencing samples from normal individuals in the GTEx database, and constructed a matrix showing how well the samples cluster by tissue type and how they are distributed across the different Gaussian components. We found that, except for three tissues with small sample sizes, the majority of tissue types each independently dominated a Gaussian component. Cancer samples from the TCGA database were then used to evaluate whether our trained model could generate new data points and match them to the correct Gaussian component of the corresponding tissue. The model can also be used to predict the probability of a gene being differentially expressed, via the negative binomial distribution it is built on, which makes it applicable to N-of-1 research (see the last sketch below). Compared with DESeq2, a commonly used method for identifying differentially expressed genes (DEGs), our model reports far fewer DEGs. However, in the expected-fraction enrichment analysis of driver genes and in the analysis of genes related to breast cancer subtypes, our model performs well.
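The abstract describes estimating the probability of each nucleotide at a site from its surrounding context. As a rough, count-based illustration of that idea only, and not the thesis's actual central, Markov or bi-directional Markov models, the Python sketch below tabulates P(base | k bases of left context, k bases of right context) from a sequence; the function name, the choice k=2 and the toy sequence are assumptions made purely for illustration.

# Minimal count-based sketch of bi-directional context prediction
# (illustrative only; not the models used in the thesis).
from collections import Counter, defaultdict

def context_probabilities(sequence, k=2):
    """Estimate P(base | left k bases, right k bases) by counting contexts."""
    counts = defaultdict(Counter)
    for i in range(k, len(sequence) - k):
        left = sequence[i - k:i]
        right = sequence[i + 1:i + 1 + k]
        counts[(left, right)][sequence[i]] += 1
    probs = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values())
        probs[context] = {b: c / total for b, c in base_counts.items()}
    return probs

# Toy example; in practice the input would be the human reference genome.
toy = "ACGTGCGTACGTTGCACGTGCGTACGTA"
probs = context_probabilities(toy, k=2)
print(probs[("CG", "AC")])   # -> {'T': 1.0} for this toy sequence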
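The second project combines a decoder with a Gaussian mixture model and a negative binomial distribution over counts. The PyTorch sketch below shows one plausible reading of that setup, in which per-sample representations are learned directly under a Gaussian mixture prior and a decoder maps them to negative binomial means; the layer sizes, number of mixture components, toy count matrix and training loop are illustrative assumptions, not the thesis's actual architecture or training procedure.

# Hedged sketch of a "decoder + Gaussian mixture" generative model for counts.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_samples, n_genes, latent_dim, n_components = 64, 200, 8, 2

decoder = nn.Sequential(                 # maps a representation to NB means
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, n_genes), nn.Softplus())

# Per-sample representations are learned directly (no encoder), together with
# a Gaussian mixture prior over the representation space.
z = nn.Parameter(torch.randn(n_samples, latent_dim) * 0.1)
mix_logits = nn.Parameter(torch.zeros(n_components))
means = nn.Parameter(torch.randn(n_components, latent_dim))
log_std = nn.Parameter(torch.zeros(n_components, latent_dim))
log_size = nn.Parameter(torch.zeros(n_genes))    # per-gene NB size (inverse dispersion)

counts = torch.poisson(torch.rand(n_samples, n_genes) * 20)  # toy count matrix

opt = torch.optim.Adam([z, mix_logits, means, log_std, log_size]
                       + list(decoder.parameters()), lr=1e-2)

def gmm_log_prob(z):
    # log p(z) under the Gaussian mixture: logsumexp over components
    comp = torch.distributions.Normal(means, log_std.exp())
    log_p = comp.log_prob(z.unsqueeze(1)).sum(-1)            # (samples, components)
    return torch.logsumexp(F.log_softmax(mix_logits, 0) + log_p, dim=1)

for step in range(200):
    opt.zero_grad()
    mu = decoder(z)                                          # NB mean per gene
    r = log_size.exp()
    # Parameterize so the NB mean equals mu: logits = log(mu) - log(r).
    nb = torch.distributions.NegativeBinomial(
        total_count=r, logits=(mu + 1e-8).log() - r.log())
    loss = -(nb.log_prob(counts).sum(1) + gmm_log_prob(z)).mean()
    loss.backward()
    opt.step()

# After training, each sample's posterior responsibility over the Gaussian
# components indicates which component (e.g. a tissue-dominated cluster) it falls in.

Learning the representations by direct optimization, rather than through an encoder, is one way to read "deep generative decoder"; other design choices are equally compatible with the abstract.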
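For the N-of-1 use mentioned in the abstract, one way to turn a fitted negative binomial into a differential-expression score for a single new sample is to ask how extreme its observed count is under that distribution. The SciPy sketch below illustrates this with a two-sided tail probability; the parameterization helper and all numbers are made up for illustration and are not taken from the thesis.

# Illustrative N-of-1 differential-expression score from a fitted NB.
from scipy.stats import nbinom

def de_tail_probability(observed_count, mu, alpha):
    """Two-sided tail probability under NB with mean mu and dispersion alpha."""
    # scipy parameterizes NB by (n, p); with n = 1/alpha and p = n / (n + mu),
    # the distribution has mean n * (1 - p) / p = mu.
    n = 1.0 / alpha
    p = n / (n + mu)
    upper = nbinom.sf(observed_count - 1, n, p)   # P(X >= observed)
    lower = nbinom.cdf(observed_count, n, p)      # P(X <= observed)
    return min(1.0, 2.0 * min(upper, lower))

# Made-up example: a count of 300 where the fitted mean is 100.
print(de_tail_probability(observed_count=300, mu=100, alpha=0.1))  # small => likely DE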
Original language: English
Publisher: Department of Computer Science, Faculty of Science, University of Copenhagen
Number of pages: 120
Publication status: Published - 2023

ID: 370743031