My seminar will discuss various data-science issues related to
neurogenomics. First, I will focus on classic disorders of the brain,
which affect nearly a fifth of the world’s population. Robust
phenotype-genotype associations have been established for several
psychiatric diseases (e.g., schizophrenia, bipolar disorder). However,
understanding their molecular causes is still a challenge. To address
this, the PsychENCODE consortium generated thousands of transcriptome
(bulk and single-cell) datasets from 1,866 individuals. Using these
data, we have developed interpretable machine learning approaches for
deciphering functional genomic elements and linkages in the brain and
in psychiatric disorders. Specifically, we developed a deep-learning
model embedding the physical regulatory network to predict phenotype
from genotype. Our model uses a conditional Deep Boltzmann Machine
architecture and introduces lateral connectivity at the visible layer
to embed the biological structure learned from the regulatory network
and QTL linkages. Our model improves disease prediction (6-fold
compared with additive polygenic risk scores), highlights key genes for
disorders, and imputes missing transcriptome information from genotype
data alone. Next, I will look at the “data exhaust” from this
activity, that is, how genomic analyses can reveal more than was
originally intended. I will focus on genomic privacy,
which is a main stumbling block in tackling problems in large-scale
neurogenomics. In particular, I will look at how the quantifications
of expression levels can reveal something about the subjects studied
and how one can take steps to sanitize the data and protect patient
anonymity. Finally, another stumbling block in neurogenomics is
phenotyping individuals accurately and precisely. I will discuss
some preliminary work we’ve done in digital phenotyping.