In pursuit of gene variation of consequence to human health and disease

Research output: Book/Report, Ph.D. thesis, Research

From the invention of Sanger sequencing to the birth of current high-throughput and long-read methodologies, sequencing technology has become a vital tool for scientific research. Biologists released the first version of the human genome in 2001 and continued to refine it over the following years, until the complete and final genome sequence was published in 2022. In parallel, the 1000 Genomes Project has revealed the extent of human genetic variation and polymorphisms, filling a gap in our knowledge about the diversity of the human mutational landscape. Transcriptome sequencing provides a means to study the changes in gene expression patterns and related signaling pathways affected by diseases and other biological processes. With the advancement of computer science, machine learning (ML) has been introduced into the field of biological and medical research. Using ML approaches, scientists hope to find the biological signals and patterns hidden within massive datasets.

The first chapter of this thesis provides an overview of the human genome, transcriptome research and different machine learning algorithms, including their applications in biological and medical research. The last chapter centers on two projects I worked on during my Ph.D.

In the first project, simply called DNA prediction, we employed a Central model, a Markov model and a bi-directional Markov model to estimate the probability of each of the four nucleotide types occurring at a site based on its context sequence; the input for these models was the human reference genome. The results show that the bases of the human genome were predicted correctly more than 50% of the time on average, compared with 25% for random guessing. We applied the predicted results to SNP databases and found that the alternative alleles showed higher probabilities than reference bases for somatic SNPs.
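As a rough illustration of context-based base prediction, the sketch below fits a simple k-th order Markov model that estimates the probability of each nucleotide from the k bases preceding it. It is a minimal toy under stated assumptions, not the thesis implementation: the function names (`train_markov`, `predict`, `accuracy`), the additive pseudocount, and the one-sided context are all hypothetical choices made here for illustration, and the Central and bi-directional models are not shown.

```python
from collections import Counter, defaultdict

BASES = "ACGT"

def train_markov(seq, k):
    """Count how often each base follows each k-base context (hypothetical helper)."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq)):
        context, base = seq[i - k:i], seq[i]
        # Skip sites whose context or target contains ambiguous symbols such as N.
        if base in BASES and all(c in BASES for c in context):
            counts[context][base] += 1
    return counts

def predict(counts, context, pseudocount=1.0):
    """Return P(base | context) for the four bases, with additive smoothing (an assumption)."""
    c = counts.get(context, Counter())
    total = sum(c[b] for b in BASES) + pseudocount * len(BASES)
    return {b: (c[b] + pseudocount) / total for b in BASES}

def accuracy(counts, seq, k):
    """Fraction of sites where the most probable base equals the observed base."""
    hits = sites = 0
    for i in range(k, len(seq)):
        probs = predict(counts, seq[i - k:i])
        hits += max(probs, key=probs.get) == seq[i]
        sites += 1
    return hits / sites if sites else 0.0

if __name__ == "__main__":
    toy = "ACGT" * 50              # toy sequence, not real genomic data
    model = train_markov(toy, k=2)
    print(predict(model, "AC"))    # probability of each base after the context "AC"
    print(accuracy(model, toy, k=2))  # evaluated on the training sequence itself
```

An unseen context falls back to the uniform distribution (0.25 per base), which corresponds to the random-guessing baseline mentioned above.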
In addition, we developed a substitution model to calculate the base mutability. Here, we found that the α matrix relies on much smaller context sequences, and in the prediction results of the model with one base to each side, we found that cytosine (C) has a higher mutability to thymine (T) in CpG sites. Additionally, our substitution model fits the somatic mutations very well.

In the second project, we developed a generative neural network consisting of a decoder and a Gaussian mixture model; hence, we called it a deep generative decoder model. We applied the decoder model to the study of gene expression data. We used bulk RNA sequencing samples from normal individuals in the GTEx database to train our model, and made a matrix to show how well the samples can be clustered together by tissue type and how they are distributed across the different Gaussian components. We found that, except for three tissues with a small sample size, the majority of tissue types independently dominated a Gaussian component. Then, the cancer samples from the TCGA database were used to evaluate whether our trained model could generate new data points and match them to the correct Gaussian component of the corresponding tissue. Additionally, our model can be used to predict the probability of genes being differentially expressed by using the negative binomial distribution in our model, which can be used for N-of-1 research. Compared to DESeq2, a commonly used method to obtain differentially expressed genes (DEGs), the number of DEGs provided by our model is much smaller. However, in the enrichment expected fraction analysis of driver genes and the analysis of subtype-specific related genes of breast cancer, our model performs well.
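To make the negative binomial idea concrete, the sketch below scores how unusual a single observed read count is under a negative binomial distribution with a given expected mean and dispersion. This is only an assumed illustration of an N-of-1 style test: the mean/dispersion parameterization, the two-sided tail rule, and the function names `nb_log_pmf` and `nb_two_sided_p` are choices made here and are not taken from the thesis, where the expected expression would come from the trained decoder.

```python
import math

def nb_log_pmf(k, mu, r):
    """Log-probability of count k under a negative binomial with mean mu > 0
    and dispersion (size) r > 0, so that the variance is mu + mu**2 / r."""
    p = r / (r + mu)
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(p) + k * math.log1p(-p))

def nb_two_sided_p(k, mu, r):
    """Two-sided tail probability of observing a count as extreme as k
    (doubling the smaller tail is an assumed convention)."""
    lower = sum(math.exp(nb_log_pmf(j, mu, r)) for j in range(k + 1))       # P(X <= k)
    upper = max(0.0, 1.0 - lower + math.exp(nb_log_pmf(k, mu, r)))          # P(X >= k)
    return min(1.0, 2.0 * min(lower, upper))

if __name__ == "__main__":
    # Hypothetical gene: expected count 100, dispersion 10.
    print(nb_two_sided_p(100, mu=100.0, r=10.0))  # count near the expectation
    print(nb_two_sided_p(400, mu=100.0, r=10.0))  # count far above the expectation
```

A small value flags a gene whose count in one sample departs strongly from the expected expression; any threshold applied to it would be a separate modelling decision.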
Original language: English
Publisher: Department of Computer Science, Faculty of Science, University of Copenhagen
Number of pages: 120
Publication status: Published - 2023

ID: 347874602