Network-driven integrative analysis of multi-omics data

We are developing tools for the statistical analysis of multi-omics cancer genomics data from a fresh angle: implementing multivariate statistical models over genome-scale biological networks. The project will ultimately help us evaluate the molecular complexity and heterogeneity of all major cancers in TCGA and generate further hypotheses for improved therapeutic target discovery. The multi-omics data from TCGA, including DNA mutations and copy number, DNA methylation, mRNA transcript expression, microRNA expression, and protein expression data, provide a unique opportunity to assess the impact of pre-translational (pre-protein) molecular variations on the proteomic data, which ultimately mediate the cellular characteristics of distinct cancer subtypes. Existing approaches for integrating multiple -omics data sets to identify markers of tumorigenesis or to distinguish cancer subtypes have ignored both the intragenic hierarchy from DNA to protein (the central dogma of molecular biology) and the intergenic relationships (molecular interaction networks). Many statistical approaches treat this as a variable selection problem in which every molecular type is considered an equal feature. To make better-informed inferences on the markers of cancer subtypes and to mitigate the small-n, large-p problem, we first compiled various types of biological networks, including protein-protein interactions, cancer-specific curated biochemical pathways, transcription factor (TF) regulatory networks, and genetic interaction networks. Using these network data, we derive interaction scores for the networks and construct a nearest shrunken centroid classifier for phenotypic groups (e.g. subtypes), adapted to network-level data. We are currently developing a method for phenotype prediction that combines the network signatures with clinical and demographic information, to be tested in independent multi-omics data sets.
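
As a rough illustration of the classifier mentioned above, the following Python sketch implements a simplified nearest shrunken centroid model on generic feature vectors. It is not the project's implementation: the function names, the toy data, the choice of offset s0, and the standardization constant are assumptions made for this example, and the features stand in for hypothetical network-level interaction scores. It assumes at least two classes and non-zero within-class variance.

```python
import math
from collections import defaultdict


def fit_shrunken_centroids(X, y, delta):
    """Fit a simplified nearest shrunken centroid model.

    X: list of samples, each a list of feature values (here standing in
       for hypothetical network-level interaction scores).
    y: class labels, one per sample.
    delta: shrinkage threshold applied to standardized centroid offsets.
    """
    n, p = len(X), len(X[0])
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    n_classes = len(groups)

    overall = [sum(row[j] for row in X) / n for j in range(p)]
    means = {k: [sum(r[j] for r in rows) / len(rows) for j in range(p)]
             for k, rows in groups.items()}

    # Pooled within-class standard deviation for each feature.
    s = []
    for j in range(p):
        ss = sum((r[j] - means[k][j]) ** 2
                 for k, rows in groups.items() for r in rows)
        s.append(math.sqrt(ss / (n - n_classes)))
    # Offset guarding against tiny variances (a middle value of s; an assumption).
    s0 = sorted(s)[p // 2]

    centroids = {}
    for k, rows in groups.items():
        m_k = math.sqrt(1.0 / len(rows) - 1.0 / n)
        shrunk = []
        for j in range(p):
            scale = m_k * (s[j] + s0)
            d = (means[k][j] - overall[j]) / scale
            # Soft-thresholding: small offsets are set to zero.
            d_shrunk = math.copysign(max(abs(d) - delta, 0.0), d)
            shrunk.append(overall[j] + scale * d_shrunk)
        centroids[k] = shrunk

    priors = {k: len(rows) / n for k, rows in groups.items()}
    return {"centroids": centroids,
            "scale": [sj + s0 for sj in s],
            "priors": priors}


def predict(model, x):
    """Assign x to the class with the smallest standardized distance score."""
    def score(k):
        dist = sum(((xj - cj) / sc) ** 2
                   for xj, cj, sc in zip(x, model["centroids"][k], model["scale"]))
        return dist - 2.0 * math.log(model["priors"][k])
    return min(model["centroids"], key=score)


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 0.1], [1.2, -0.1], [0.8, 0.0],
     [3.0, 0.05], [3.2, -0.05], [2.8, 0.1]]
y = ["A", "A", "A", "B", "B", "B"]
model = fit_shrunken_centroids(X, y, delta=0.5)
```

In this toy example the noise feature is shrunk to the overall mean in both classes, so only the informative feature contributes to the class assignment. This is the property that makes the approach attractive when features far outnumber samples.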

Cancer therapeutic agents normally inhibit key regulators in particular signalling pathways, such as the phosphoinositide 3-kinase (PI3K) - AKT - mammalian target of rapamycin (mTOR) pathway. Clinical trials of first-generation therapeutics such as rapamycin or pan-PI3K inhibitors taught us a valuable lesson about the difficulty of target inhibition, owing to subtle allosteric regulation in signalling pathways and alternative receptor tyrosine kinase activation through feedback or feed-forward loops in response to treatment. Network motif analysis can identify putative allosteric regulation and feedback loops around specific genes (e.g. drug targets). By augmenting network motifs with quantitative signatures at multiple molecular levels (e.g. RNA and protein), we will introduce the concept of quantitative network motifs (qn-motifs), defined by an enzyme-to-enzyme ratio quantitation approach. We will perform this analysis in signalling pathways and networks of key therapeutic targets in all subtypes of all three cancers, and reference our findings against publicly available cancer genomics data sets. This work is in collaboration with Kwokpui Choi in the Department of Statistics and Applied Probability at NUS.
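
The structural part of motif analysis, before any quantitative signature is attached, can be sketched in a few lines of Python. The example below enumerates feedback loops (directed cycles) and feed-forward loops that involve a chosen target node. The function names and the edge list are assumptions made for illustration: the edges form a toy graph that reuses pathway member names, not a curated pathway, and the quantitative enzyme-to-enzyme ratio step is not shown.

```python
def feedback_loops(graph, target, max_len=5):
    """Return directed cycles through `target` with at most `max_len` edges."""
    loops = []

    def walk(node, path):
        for nxt in sorted(graph.get(node, ())):
            if nxt == target:
                loops.append(path + [target])
            elif nxt not in path and len(path) < max_len:
                walk(nxt, path + [nxt])

    walk(target, [target])
    return loops


def feed_forward_loops(graph, target):
    """Return (a, b, c) with edges a->b, b->c and a->c that include `target`."""
    found = []
    for a in sorted(graph):
        for b in sorted(graph[a]):
            for c in sorted(graph.get(b, ())):
                if c in graph[a] and len({a, b, c}) == 3 and target in (a, b, c):
                    found.append((a, b, c))
    return found


# Toy directed graph for illustration only (not a curated pathway).
toy_graph = {
    "RTK": {"PI3K"},
    "PI3K": {"AKT", "MTOR"},   # the PI3K->MTOR shortcut creates a feed-forward loop
    "AKT": {"MTOR"},
    "MTOR": {"S6K"},
    "S6K": {"RTK"},            # toy feedback edge closing the cycle
}

loops = feedback_loops(toy_graph, "MTOR")
ffls = feed_forward_loops(toy_graph, "MTOR")
```

On this toy graph, two feedback loops pass through the target, one via AKT and one via the direct shortcut, and a single feed-forward loop contains it. Bounding the cycle length keeps the search tractable on genome-scale networks, where the number of cycles grows quickly.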

Proteogenomics analysis of whole exome sequencing data in cancer

The Cancer Genome Atlas (TCGA) provides multi-platform molecular profiling data on about 30 types of human cancer, making it an ideal resource for studying the downstream impact of genomic variants on clinical outcomes alongside other molecular data types. Genomic profiling in TCGA is based on whole exome sequencing (WXS), and many coding variants appear sporadically in previously unmapped positions. Hence there is a need to summarize the detected variants at a location or unit that is shared across thousands of samples. One way to define such a unifying filter for WXS data is proteogenomic analysis. Regardless of whether the coding variants directly alter the amino acid sequence, it is intuitively appealing to hypothesize that variants located near well-conserved protein domains or key signaling positions, such as post-translational modification (PTM) sites, have more functional impact than other variants, at least in the context of primary tumor development or cancer progression. In this project, we are first evaluating the current mapping coverage of key PTMs, including serine / threonine / tyrosine phosphorylation and lysine ubiquitination, in preparation for the proteogenomic analysis. We recently developed an algorithm called PTMtopographer to filter existing PTM sites based on the physical and chemical properties of local amino acid sequences and to predict the same PTMs on >43,000 protein variants, yielding proteome-scale PTM maps for the two types of modifications. We mapped >16,000 protein domains onto the protein variants with PTM predictions and studied the positions of PTMs relative to the sequence domains. We will expand the PTM mapping to other important covalent modifications (e.g. acetylation, methylation, and cysteine modification) and use the expanded set of PTM sites and the protein domain collection as anchors to summarize coding variant data (e.g. the number of variants near a fixed amino acid position).
The aggregate count of coding variants at each amino acid position will represent the fitness of each protein, and the amino acid level data will be correlated with complex clinical outcomes such as cancer subtype, tumor stage, and overall survival. Our approach, if successful, will establish a biologically sensible summarization method for WXS data that is predictive of clinical outcomes in cancer and largely based on protein function.
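
The anchoring idea, counting variants near a fixed amino acid position, can be illustrated with a short Python sketch. The function name, the window size, and the example positions are hypothetical. The sketch assumes variant positions have already been mapped to residue coordinates on one protein and pooled across samples.

```python
from bisect import bisect_left, bisect_right


def count_variants_near_sites(variant_positions, site_positions, window=5):
    """Count variants within `window` residues of each anchor site.

    variant_positions: residue positions of coding variants on one protein,
        pooled across samples (a recurrent position appears more than once).
    site_positions: anchor positions such as PTM sites.
    """
    ordered = sorted(variant_positions)
    counts = {}
    for site in site_positions:
        lo = bisect_left(ordered, site - window)
        hi = bisect_right(ordered, site + window)
        counts[site] = hi - lo
    return counts


# Hypothetical positions for illustration.
variants = [10, 12, 12, 40, 101]
ptm_sites = [11, 100]
counts = count_variants_near_sites(variants, ptm_sites, window=2)
```

Sorting once and using binary search keeps the cost low when thousands of samples contribute variants to the same protein. The resulting per-site counts are the kind of summary that could then be related to clinical outcomes.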

Interactive visualization of high-dimensional molecular data sets

We are developing an interactive visualization tool for high-dimensional data that facilitates pathway-level interpretation of time-series -omics data sets and offers various user-friendly utilities. While a number of visualization software tools have been described in the literature, almost none can take user input for feature selection or for re-extraction and regrouping of target features. We are working on a tool that allows users to modify the data "on the fly" and annotate the gene list in an interactive environment, most importantly within a web API framework. We aim to apply this tool in a large-scale meta-analysis of microarray data sets generated in studies of common respiratory infections (viral and bacterial).

Spectral library building for data independent acquisition mass spectrometry in metabolomics

Data independent acquisition mass spectrometry (DIA-MS) is a promising approach for automatic sampling of MS/MS data for small molecules. DIA analysis with wide isolation windows generates mixed fragmentation spectra, so spectral libraries are crucial for clear extraction of DIA data. However, compound identification by matching against reference MS/MS libraries often yields a limited number of compounds because fragmentation patterns vary across instrument setups. Here we are developing a bioinformatics workflow called MetaboDIA to build customized MS/MS spectral libraries from users' own data and to perform MS/MS-based quantification, complementing MS1-based quantification. Using this pipeline, one can also build a spectral library based entirely on DIA data in the context of large sample analysis. Using a marine algae data set with replicate experiments, we show that fragment ion intensities obtained with a library built from DDA runs provide a good alternative to conventional MS1-based quantification for compounds with MS/MS data. To test its applicability in complex samples, we applied MetaboDIA to a clinical serum metabolomics data set, where we built a DDA-based spectral library containing consensus records for 1829 compounds and performed MS/MS-based quantification in DIA data, yielding quantitative data for sensitive differential expression analysis.

Gene expression regulation analysis in dynamic systems

Mass spectrometry-based proteomics is arguably the most widely used method to study gene regulatory processes beyond transcription. In particular, time series experiments provide immensely rich information on dynamic regulation over different phases of a response. Several such time-resolved proteomics datasets have already been reported, providing intriguing insights into different stages of a response and revealing strong differences from transcript regulation. However, while high-resolution mass spectrometry can generate quantitative expression data for thousands of proteins, data analysis is still enormously challenging because the data often suffer from a poor signal-to-noise ratio and missing measurements for a large number of proteins. For this reason, time course protein expression profiles are often incomplete, which prevents robust statistical inference for many proteins. In addition, time course experiments usually include fewer than 10 time points, far fewer than typical statistical tools require.

To address the need for statistical methods for regulation analysis, we have recently developed a statistical framework, PECA. Using equations governing the synthesis and degradation of RNA and proteins, PECA provides statistical summaries, in the form of rate ratios, that describe the extent of gene regulation from the concentration data. PECA tests whether the rate ratio has changed at each time point. One of the novel features of PECA is that varying amounts of mRNA are incorporated into the model, which eliminates the contributions of transcription and RNA degradation and focuses the analysis on post-transcriptional regulation. The outcome of the analysis is the classification of each time point as a regulation change point (or not) at the post-transcriptional level. PECA therefore provides time-resolved information on post-transcriptional gene expression regulation and summaries at different phases of the response, another feature that distinguishes it from existing approaches.
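
To make the idea of a rate ratio concrete, one generic way to write first-order protein kinetics is given below. This is an illustrative sketch, not the exact PECA parameterization, which is not specified in this description: the symbols R(t) and P(t) for mRNA and protein concentration, k_s and k_d for synthesis and degradation rates, and the ratio rho are all assumed notation.

```latex
% Assumed notation: R(t) mRNA concentration, P(t) protein concentration,
% k_s(t) synthesis rate, k_d(t) degradation rate, \rho(t) rate ratio.
\begin{align}
\frac{dP(t)}{dt} &= k_s(t)\,R(t) - k_d(t)\,P(t), \\
\rho(t) &= \frac{k_s(t)}{k_d(t)}, \\
H_0 &:\ \rho(t_j) = \rho(t_{j-1})
\quad \text{vs.} \quad
H_1:\ \rho(t_j) \neq \rho(t_{j-1}).
\end{align}
```

Under this sketch, with constant rates the steady state satisfies P/R = k_s/k_d, so a change in protein level that merely tracks a change in mRNA leaves the ratio unchanged. This is the sense in which conditioning on mRNA isolates post-transcriptional regulation, and a change point corresponds to rejecting H_0 at a given time point.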

We are currently developing new modules within the PECA toolbox. First, we are creating a tool (PECA-N) to increase the coverage and sensitivity of the core PECA model. We introduce a network-oriented Markov random field model into PECA to allow simultaneous inference of the change point status of interacting proteins, using physical protein-protein interaction (PPI) data (e.g. iRefIndex) and functional network data (e.g. HumanNet). Second, we are refining the statistical inference procedure of PECA-N to directly estimate the absolute rates of synthesis and degradation for RNAs and proteins, and to quantify these gene regulatory processes and their interactions (PECA-R). Third, we will soon be inferring time-resolved ubiquitination (UB) states from data on both global UB and UB linked to proteasomal degradation using a new statistical model (PECA-M). We will then integrate this information with the degradation rate estimates from PECA-R to provide detailed annotation of UB-linked regulatory events (UBICON).

Through these aims, we are developing the corresponding statistical models and experimentally validating the reported estimates of the key parameters in the system-wide oxidative stress response, specifically with respect to post-transcriptional regulation. We will thus use PECA’s new modules to generate novel and biologically relevant hypotheses about gene expression regulation from quantitative large-scale data.