Network-driven integrative analysis of multi-omics data

We are developing tools for the statistical analysis of multi-omics cancer genomics data from a fresh angle: implementing multivariate statistical models over genome-scale biological networks. The project will ultimately help us evaluate the molecular complexity and heterogeneity of all major cancers in TCGA and generate further hypotheses towards improved therapeutic target discovery. The multi-omics data from TCGA, including DNA mutation and copy number, DNA methylation, mRNA transcript expression, microRNA expression, and protein expression data, provide a unique opportunity to assess the impact of pre-translational (pre-protein) molecular variations on the proteome, which ultimately mediates the cellular characteristics of distinct cancer subtypes.

Existing approaches for integrating multiple -omics data sets to identify markers of tumorigenesis or to distinguish cancer subtypes have ignored both the intragenic hierarchy from DNA to protein (the central dogma of molecular biology) and the intergenic relationships (molecular interaction networks). Many statistical approaches treat the task as a variable selection problem in which every molecular type is considered an equal feature.

To make better-informed inferences about the markers of cancer subtypes and to mitigate the "small n, large p" problem (few samples, many features), we first compiled various types of biological networks, including protein-protein interactions, cancer-specific curated biochemical pathways, transcription factor (TF) regulatory networks, and genetic interaction networks. Using these network data, we derive interaction scores for the networks and construct a nearest shrunken centroid classifier for phenotypic groups (e.g. subtypes), adapted to network-level data. We are currently developing a method for phenotype prediction that combines the network signatures with other clinical and demographic information, to be tested in independent multi-omics data sets.
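
To illustrate the nearest shrunken centroid idea mentioned above, the following is a minimal sketch of the standard method applied to a generic feature matrix. It is not our network-adapted classifier: the function names, the toy "interaction score" values, and the threshold `delta` are all assumptions made for illustration. Each class centroid is pulled towards the overall centroid by soft-thresholding its standardized difference, so features that do not separate the classes stop contributing to the prediction.

```python
import math
from collections import defaultdict
from statistics import median

def soft_threshold(value, delta):
    """Shrink value towards zero by delta; anything within delta of zero becomes 0."""
    return math.copysign(max(abs(value) - delta, 0.0), value)

def fit_shrunken_centroids(X, y, delta):
    """Fit shrunken class centroids. Assumes >= 2 classes and non-constant features."""
    n, p = len(X), len(X[0])
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    overall = [sum(row[j] for row in X) / n for j in range(p)]
    centroids = {k: [sum(r[j] for r in rows) / len(rows) for j in range(p)]
                 for k, rows in by_class.items()}
    # Pooled within-class standard deviation for each feature.
    s = []
    for j in range(p):
        ss = sum((row[j] - centroids[label][j]) ** 2 for row, label in zip(X, y))
        s.append(math.sqrt(ss / (n - len(by_class))))
    s0 = median(s)  # offset that guards against features with tiny variance
    shrunken = {}
    for k, rows in by_class.items():
        m_k = math.sqrt(1.0 / len(rows) - 1.0 / n)
        shrunken[k] = []
        for j in range(p):
            scale = m_k * (s[j] + s0)
            d = (centroids[k][j] - overall[j]) / scale  # standardized difference
            shrunken[k].append(overall[j] + scale * soft_threshold(d, delta))
    priors = {k: len(rows) / n for k, rows in by_class.items()}
    return {"centroids": shrunken, "scale": [sj + s0 for sj in s], "priors": priors}

def predict(model, x):
    """Assign x to the class with the smallest standardized distance, adjusted by prior."""
    def score(k):
        dist = sum(((xj - cj) / sc) ** 2
                   for xj, cj, sc in zip(x, model["centroids"][k], model["scale"]))
        return dist - 2.0 * math.log(model["priors"][k])
    return min(model["centroids"], key=score)

# Toy data (hypothetical): column 0 separates the subtypes, column 1 is mostly noise.
X = [[2.0, 0.2], [2.2, 0.0], [1.8, 0.1],
     [-2.0, 0.0], [-2.1, 0.1], [-1.9, -0.1]]
y = ["A", "A", "A", "B", "B", "B"]
model = fit_shrunken_centroids(X, y, delta=1.0)
label_1 = predict(model, [1.5, 0.3])
label_2 = predict(model, [-1.0, 0.0])
```

With this toy input, the noisy second column is shrunk to the same value in both class centroids, so only the first column drives the classification. In practice `delta` would be chosen by cross-validation.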

Cancer therapeutic agents normally inhibit key regulators in particular signalling pathways, such as the phosphoinositide 3-kinase (PI3K)-AKT-mammalian target of rapamycin (mTOR) pathway. Clinical trials of first-generation therapeutics, such as rapamycin and pan-PI3K inhibitors, taught a valuable lesson about the difficulty of target inhibition, which arises from subtle allosteric regulation in signalling pathways and from alternative receptor tyrosine kinase activation through feedback or feed-forward loops in response to treatment. Network motif analysis is a valuable tool for identifying putative allosteric regulation and feedback loops around specific genes (e.g. drug targets). By augmenting network motifs with quantitative signatures at multiple molecular levels (e.g. RNA and protein), we will introduce the concept of quantitative network motifs (qn-motifs), defined by an enzyme-to-enzyme ratio quantitation approach. We will perform this analysis in signalling pathways and networks of key therapeutic targets in all subtypes of all three cancers, and reference our findings against publicly available cancer genomics data sets. This work is in collaboration with Kwokpui Choi in the Department of Statistics and Applied Probability at NUS.
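
As a rough illustration of what motif analysis around a drug target involves, the sketch below enumerates directed feedback loops passing through a chosen node and attaches a log2 abundance ratio to each edge in a loop. The qn-motif definition itself is not reproduced here: the edge list is a simplified toy network, the abundance values are invented, and reading "enzyme-to-enzyme ratio" as a per-edge log2 ratio is an assumption made only for this example.

```python
import math

def feedback_loops(edges, target, max_len=4):
    """Return simple directed cycles through `target` that use at most max_len edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    loops = []

    def walk(node, path):
        for nxt in graph.get(node, []):
            if nxt == target:
                loops.append(path + [target])   # cycle closed back at the target
            elif nxt not in path and len(path) < max_len:
                walk(nxt, path + [nxt])         # extend the path without revisiting nodes

    walk(target, [target])
    return loops

def edge_log_ratios(loop, abundance):
    """Hypothetical quantitative annotation: log2(downstream / upstream) for each edge."""
    return {(a, b): math.log2(abundance[b] / abundance[a])
            for a, b in zip(loop, loop[1:])}

# Toy directed network (illustrative only, not a curated pathway).
edges = [("PI3K", "AKT"), ("AKT", "MTOR"), ("MTOR", "S6K"),
         ("S6K", "IRS1"), ("IRS1", "PI3K")]
# Invented protein abundances in arbitrary units.
abundance = {"PI3K": 4.0, "AKT": 4.0, "MTOR": 2.0, "S6K": 8.0, "IRS1": 1.0}

loops = feedback_loops(edges, "MTOR", max_len=5)
ratios = edge_log_ratios(loops[0], abundance)
```

Capping the loop length keeps the search tractable on genome-scale networks, where the number of simple cycles grows very quickly with length.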

Interactive visualization of high-dimensional molecular data sets

We are developing an interactive tool for visualizing high-dimensional data, with various user-friendly utilities, to facilitate pathway-level interpretation of time-series -omics data sets. While a number of visualization software packages have been described in the literature, almost none accept user input for feature selection or for re-extraction and regrouping of target features. We are working on a tool that allows users to modify the data "on the fly" and annotate the gene list in an interactive environment, built, most importantly, on a web API framework. We aim to apply this tool in a large-scale meta-analysis of microarray data sets generated in studies of common respiratory infections (viral and bacterial).

Automated workflows for data extraction in untargeted and targeted mass spectrometry data

Untargeted workflows: In collaboration with SCIEX, we are actively developing a new bioinformatics tool, called MetaboKit, that supports robust metabolomics analysis of data-dependent acquisition (DDA) and data-independent acquisition (DIA) mass spectrometry data. MetaboKit builds a spectral library from the user’s own precursor and product ion spectra from DDA analysis, annotated with metabolite identities from external spectral libraries, adducts, dimers, in-source fragments (ISF), and retention time information specific to the chromatography system used in the experiment. Using the customized library, the software performs targeted ion chromatogram extraction for precursor ions in DDA analysis and for precursor and product ions in DIA analysis.
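
The targeted ion chromatogram extraction step can be pictured with the following minimal sketch. It does not use MetaboKit's code or file formats: the in-memory scan layout, the function name, the target m/z, and the 10 ppm tolerance are all assumptions chosen for illustration. For each scan, the intensities of peaks whose m/z falls within the tolerance window around the library value are summed.

```python
def extract_ion_chromatogram(scans, target_mz, ppm=10.0):
    """Sum, per scan, the intensity of peaks within +/- ppm of target_mz.

    `scans` is a list of (retention_time, [(mz, intensity), ...]) tuples,
    a hypothetical in-memory layout rather than a real file format.
    """
    tolerance = target_mz * ppm * 1e-6
    chromatogram = []
    for retention_time, peaks in scans:
        total = sum((intensity for mz, intensity in peaks
                     if abs(mz - target_mz) <= tolerance), 0.0)
        chromatogram.append((retention_time, total))
    return chromatogram

# Invented scans: retention time in minutes, peaks as (m/z, intensity).
scans = [
    (1.00, [(180.0640, 100.0), (180.0700, 50.0)]),
    (1.05, [(180.0630, 400.0), (90.0500, 30.0)]),
    (1.10, [(181.0000, 20.0)]),
]
xic = extract_ion_chromatogram(scans, target_mz=180.0634, ppm=10.0)
```

A relative (ppm) tolerance is used rather than a fixed m/z window because mass accuracy on most instruments scales with m/z.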

Targeted workflows: We are also working on a new tool, MRMkit, that performs automated peak integration for a large number of samples. MRMkit scans thousands of MRM assay files with hundreds of transitions and provides first-pass extraction results with quality control measures within several hours. Users are then prompted to inspect the ion chromatograms and provide further instructions for transitions with interference or multi-modal peak shapes, which are re-integrated to correct imperfect peak integration.
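
To make the ideas of first-pass integration and multi-modal flagging concrete, here is a minimal sketch. It is not MRMkit's algorithm: trapezoidal integration over a fixed retention time window and a simple local-maximum count are stand-in techniques, and the function names, window bounds, and height threshold are assumptions for illustration.

```python
def integrate_peak(times, intensities, start, end):
    """Trapezoidal area of the chromatogram between retention times start and end."""
    points = [(t, y) for t, y in zip(times, intensities) if start <= t <= end]
    area = 0.0
    for (t0, y0), (t1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (t1 - t0)
    return area

def count_apexes(intensities, min_height):
    """Count local maxima at or above min_height; more than one suggests a multi-modal peak."""
    return sum(1 for left, mid, right in zip(intensities, intensities[1:], intensities[2:])
               if mid > left and mid >= right and mid >= min_height)

# Invented chromatograms for two transitions (retention time in minutes).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
clean = [0.0, 10.0, 20.0, 10.0, 0.0]
bimodal = [0.0, 10.0, 2.0, 10.0, 0.0]

area = integrate_peak(times, clean, start=0.0, end=4.0)
needs_review = count_apexes(bimodal, min_height=5.0) > 1
```

A transition flagged this way would be a candidate for the manual inspection and re-integration step described above; real data would also need smoothing and baseline correction before either function is meaningful.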