A complex disease such as cancer can be characterised in many different ways, including genotyping, gene expression profiling, histology, and clinical sub-typing. These characterisations often lead to different, heterogeneous stratifications of patients, with varying prognostic and predictive power for effective clinical treatment. This issue has become particularly prominent given the availability of several data repositories, including The Cancer Genome Atlas (TCGA), in which multiple types of molecular, imaging, and clinical data are available for analysis. There is therefore a need to develop innovative computational algorithms for integrating multiple types of genomic and phenotypic data.
We have recently developed novel learning and visualisation algorithms to complete workflows for integrative genomics. One particular challenge is the use of mixed data types, namely categorical data (e.g., clinical traits) and numerical data (e.g., expression). I will describe our response in the form of a regularised consensus algorithm, whereby clustering in the molecular expression space modulates the distributions in the categorical clinical space. Another challenge is the extraction of suitable features from large histology images for summarisation and correlative studies. I will describe a workflow that identifies tissue compartments (epithelium, stroma), extracts features, and then correlates them with genetic signatures. Finally, I will briefly describe our recent efforts to correlate proteomic and transcriptomic data captured in the NCI-60 panel through the use of co-expression networks. This work has been conducted in collaboration with Prof. Kun Huang and is supported by the National Science Foundation and the National Institutes of Health.