Big Data in Transcriptomics & Molecular Biology


Recent technological advances allow for high throughput profiling of biological systems at the molecular level in a cost-efficient manner. The now relatively low cost of data generation is leading us into the "Biological Big Data Era". The availability of such big data sets provides unprecedented opportunities but also raises new challenges for data mining, deep analysis, and integrative analysis across various biological "-omics" layers.
On this webpage, we present various publications and key concepts in the analysis of "big data", with a focus on gene expression profiling and high throughput "-omics". This includes bioinformatic approaches based on "machine learning" algorithms, with both "unsupervised" and "supervised" examples.
Further, we point out the weaknesses of pure big data approaches, with particular focus on biology and medicine, as such approaches fail to provide conceptual accounts for the processes to which they are applied (big data need big theory too!).

The final goal is to link all the molecular information and translate it back from "big data" into meaningful conclusions for precision medicine, systems biology, molecular physiology and pathophysiology.

Reviews and Editorials:
Scientific Papers:
Data Visualisation and Software Tools:

Reviews and Editorials:

Big biological datasets map life's networks -- Multi-omics offers a new way of doing biology.

by Laurel Hamers
Science News Magazine issue: Vol. 190, No. 9, October 29, 2016, p. 24

Michael Snyder’s genes were telling him that he might be at increased risk for type 2 diabetes. The Stanford University geneticist wasn’t worried: He felt healthy and didn’t have a family history of the disease. But as he monitored other aspects of his own biological data over months and years, he saw that diabetes was indeed emerging, even though he showed no symptoms.
Snyder’s story illustrates the power of looking beyond the genome, the complete catalog of an organism’s genetic information. His tale turns the genome’s one-dimensional view into a multidimensional one. In many ways, a genome is like a paper map of the world. That map shows where the cities are. But it doesn’t say anything about which nations trade with each other, which towns have fierce football rivalries or which states will swing for a particular political candidate.
Open one of today’s digital maps, though, and numerous superimposed data sources give a whole lot of detailed, real-time information. With a few taps, Google Maps can show how to get across Boston at rush hour, offer alternate routes around traffic snarls and tell you where to pick up a pizza on the way.
Now, scientists like Snyder are developing these same sorts of tools for biology, with far-reaching consequences. To figure out what’s really happening within an organism — or within a particular organ or cell — researchers are linking the genome with large-scale data about the output of those genes at specific times, in specific places, in response to specific environmental pressures.
While the genome remains mostly stable over time, other “omes” change based on what genes are turned on and off at particular moments in particular places in the body. The proteome (all an organism’s proteins) and the metabolome (all the metabolites, or small molecules that are the outputs of biological processes) are two of several powerful datasets that become more informative when used together in a multi-omic approach. They show how that genomic instruction manual is actually being applied.
“The genome tells you what can happen,” says Oliver Fiehn, a biochemist at the University of California, Davis. The proteome and the metabolome can show what’s actually going on. And just as city planners use data about traffic patterns to figure out where to widen roads and how to time stoplights, biologists can use those entwined networks to predict at a molecular level how individual organisms will respond under specific conditions.
By linking these layers and others to expand from genomics to multi-omics, scientists might be able to meet the goals of personalized medicine: to figure out, for example, what treatment a particular cancer patient will best respond to, based on the network dynamics responsible for a tumor. Or predict whether an experimental vaccine will work before moving into expensive clinical tests. Or help crops grow better during a drought. And while many of those applications are still in the future, researchers are laying the groundwork right now. “Biology is being done in a way that’s never been done before,” says Nitin Baliga, director of the Institute for Systems Biology in Seattle.
Big Data -- Astronomical or Genomical?
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE
PLoS Biol. 2015 Jul 7;13(7): e1002195 -- eCollection 2015

Genomics is a Big Data science and is going to get much bigger, very soon, but it is not known whether the needs of genomics will exceed other Big Data domains. Projecting to the year 2025, we compared genomics with three other major generators of Big Data: astronomy, YouTube, and Twitter. Our estimates show that genomics is a "four-headed beast"--it is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis. We discuss aspects of new technologies that will need to be developed to rise up and meet the computational challenges that genomics poses for the near future. Now is the time for concerted, community-wide planning for the "genomical" challenges of the next decade.

Big Data in Biology and Medicine.
O. P. Trifonova, V. A. Ilin, E. V. Kolker, A. V. Lisitsa
Based on material from a joint workshop with representatives of the international
Data-Enabled Life Science Alliance, July 4, 2013, Moscow, Russia
Acta Naturae 2013 5(3): 13

The task of extracting new knowledge from large data sets is designated by the term “Big Data.” To put it simply, the Big Data phenomenon is when the results of your experiments cannot be imported into an Excel file. By some estimates, the volume of Twitter chats throughout a year is several orders of magnitude larger than the volume of a person’s memory accumulated during his/her entire life. As compared to Twitter, all the data on human genomes constitute a negligibly small amount [1]. The problem of converting data sets into knowledge brought up by the U.S. National Institutes of Health in 2013 is the primary area of interest of the Data-Enabled Life Science Alliance (DELSA) [2]. Why have the issues of computer-aided collection of Big Data created incentives for the formation of the DELSA community, which includes over 80 world-leading researchers focused on the areas of medicine, health care, and applied information science? This new trend was discussed by the participants of the workshop “Convergent Technologies: Big Data in Biology and Medicine.”

Trans-Omics -- How To Reconstruct Biochemical Networks Across Multiple ‘Omic’ Layers
Katsuyuki Yugi, Hiroyuki Kubota, Atsushi Hatano, Shinya Kuroda
Trends in Biotechnology 2016 34(4): 276-290

We propose 'trans-omic' analysis for reconstructing global biochemical networks across multiple omic layers by use of both multi-omic measurements and computational data integration. We introduce technologies for connecting multi-omic data based on prior knowledge of biochemical interactions and characterize a biochemical trans-omic network by concepts of a static and dynamic nature. We introduce case studies of metabolism-centric trans-omic studies to show how to reconstruct a biochemical trans-omic network by connecting multi-omic data and how to analyze it in terms of the static and dynamic nature. We propose a trans-ome-wide association study (trans-OWAS) connecting phenotypes with trans-omic networks that reflect both genetic and environmental factors, which can characterize several complex lifestyle diseases as breakdowns in the trans-omic system.

Big Data Bioinformatics.
Greene CS, Tan J, Ung M, Moore JH, Cheng C
J Cell Physiol. 2014 Dec;229(12): 1896-1900

Recent technological advances allow for high throughput profiling of biological systems in a cost-efficient manner. The low cost of data generation is leading us to the "big data" era. The availability of big data provides unprecedented opportunities but also raises new challenges for data mining and analysis. In this review, we introduce key concepts in the analysis of big data, including both "machine learning" algorithms as well as "unsupervised" and "supervised" examples of each. We note packages for the R programming language that are available to perform machine learning analyses. In addition to programming based solutions, we review webservers that allow users with limited or no programming background to perform these analyses on large data compendia.
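
The supervised/unsupervised distinction drawn in this review can be illustrated with a small sketch (scikit-learn assumed; the sample sizes, group structure and effect size below are invented for illustration):

```python
# Toy "expression matrix" of samples x genes with two hidden sample groups.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 20)            # 40 samples, two conditions
X = rng.normal(size=(40, 50))            # 50 genes of background noise
X[group == 1, :10] += 3.0                # first 10 genes differ between groups

# Unsupervised: discover sample structure without using the labels
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: learn the known labels, then assess on held-out samples
X_tr, X_te, y_tr, y_te = train_test_split(X, group, test_size=0.25,
                                          random_state=0, stratify=group)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

The key difference: the clustering step never sees `group`, while the classifier is trained on it and must be evaluated on samples it has not seen.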

Inferring gene expression regulatory networks from high-throughput measurements.
Zavolan M
Methods. 2015 Sep 1;85: 1-2

While molecular biology has meticulously and successfully built the catalog of components for a large number of cell types, recent technological developments have broadened the spectrum and resolution of measurement techniques. These have led to the flourishing of a number of subfields, including mathematical biology, computational biology, systems biology, synthetic biology, etc. Although the precise definitions and boundaries of these partially overlapping subfields can be debated, it is clear that the general availability of high-throughput approaches of increasing quantitative accuracy has shifted the focus away from single components toward quantitative modeling of whole-cell behaviors. The vision behind this volume was to illustrate some of these approaches and the insights that they have brought to the field. We focused on gene expression, which in eukaryotic cells is a very complex process of many steps, all of which are subject to regulation. We hope that readers find this perspective motivating. I am grateful to the contributing authors who participated in this endeavor, to Dr. Adolf for the invitation to edit such an issue, and to Tiffany Hicks and Liz Weishaar for their great help in seeing the project to completion.

Single-cell Transcriptome Study as Big Data.
Yu P and Lin W
Genomics Proteomics Bioinformatics. 2016 Feb;14(1): 21-30

The rapid growth of single-cell RNA-seq (scRNA-seq) studies demands efficient data storage, processing, and analysis. Big-data technology provides a framework that facilitates the comprehensive discovery of biological signals from inter-institutional scRNA-seq datasets. Strategies to handle the stochastic and heterogeneous nature of single-cell transcriptome signals are discussed in this article. After extensively reviewing the available big-data applications of next-generation sequencing (NGS)-based studies, we propose a workflow that accounts for the unique characteristics of scRNA-seq data and the primary objectives of single-cell studies.
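
One routine early step in most scRNA-seq workflows of this kind is per-cell library-size normalization followed by a log transform. A minimal, generic sketch (not from this paper; the count matrix is invented):

```python
# Per-cell library-size normalization: counts-per-10k, then log1p.
import numpy as np

counts = np.array([[10, 0, 90],     # cell 1: raw UMI counts per gene
                   [ 5, 5, 40]],    # cell 2
                  dtype=float)      # cells x genes

lib_size = counts.sum(axis=1, keepdims=True)   # total counts per cell
norm = counts / lib_size * 1e4                 # scale each cell to 10,000
lognorm = np.log1p(norm)                       # variance-stabilizing log
```

After this step every cell contributes the same total signal, so differences between cells reflect composition rather than sequencing depth.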

Next generation informatics for big data in precision medicine era.
Zhang Y, Zhu Q, Liu H
BioData Min. 2015 8: 34 -- eCollection 2015

The rise of data-intensive biology, advances in informatics technology, and changes in the way health care is delivered have created a compelling opportunity to investigate biomedical questions in the context of "big data" and develop knowledge systems to support precision medicine. To promote such data mining and informatics technology development in precision medicine, we hosted two international informatics workshops in 2014: 1) the first workshop on Data Mining in Biomedical informatics and Healthcare, in conjunction with the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2014), and 2) the first workshop on Translational biomedical and clinical informatics, in conjunction with the 8th International Conference on Systems Biology and the 4th Translational Bioinformatics Conference (ISB/TBC 2014). This thematic issue of BioData Mining presents a series of selected papers from these two international workshops, aiming to address the data mining needs in the informatics field due to the deluge of "big data" generated by next generation biotechnologies such as next generation sequencing, metabolomics, and proteomics, as well as the structured and unstructured biomedical and healthcare data from electronic health records. We are grateful for BioData Mining's willingness to produce this forward-looking thematic issue.

Integrating transcriptome and proteome profiling: Strategies and applications.
Kumar D, Bansal G, Narang A, Basak T, Abbas T, Dash D
Proteomics. 2016 16(19): 2533-2544

Discovering the gene expression signature associated with a cellular state is one of the basic quests in the majority of biological studies. For most of the clinical and cellular manifestations, these molecular differences may be exhibited across multiple layers of gene regulation like genomic variations, gene expression, protein translation and post-translational modifications. These system wide variations are dynamic in nature and their crosstalk is overwhelmingly complex, thus analyzing them separately may not be very informative. This necessitates the integrative analysis of such multiple layers of information to understand the interplay of the individual components of the biological system. Recent developments in high throughput RNA sequencing and mass spectrometry (MS) technologies to probe transcripts and proteins have made these the preferred methods for understanding global gene regulation. Subsequently, improvements in "big-data" analysis techniques enable novel conclusions to be drawn from integrative transcriptomic-proteomic analysis. The unified analyses of both these data types have been rewarding for several biological objectives, such as improving genome annotation, predicting RNA-protein quantities, deciphering gene regulations, discovering disease markers and drug targets. There are different ways in which transcriptomics and proteomics data can be integrated, each aiming for different research objectives. Here, we review various studies, approaches and computational tools targeted for integrative analysis of these two high-throughput omics methods.
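
A simple entry point to the transcriptome-proteome integration reviewed here is the correlation of (log) mRNA and protein abundances across genes. A hedged sketch with invented data (SciPy assumed; the lognormal model and noise level are illustrative only):

```python
# Correlate toy mRNA levels with noisy, partially coupled protein levels.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
mrna = rng.lognormal(mean=2.0, sigma=1.0, size=200)          # toy mRNA levels
protein = mrna ** 0.8 * rng.lognormal(sigma=0.3, size=200)   # coupled + noise

# Rank correlation is robust to the skewed abundance distributions
rho, pval = spearmanr(np.log(mrna), np.log(protein))
```

In real datasets this per-gene correlation is typically far from perfect, which is exactly why the layers carry complementary information.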

Precision Medicine, Personalized Medicine, Omics and Big Data -- Concepts and Relationships.
Xiaohua Douglas Zhang
J Pharmacogenomics Pharmacoproteomics 2015, 6:2

On January 20, 2015, US President Obama announced at his 2015 State of the Union Address that he was launching a new precision medicine initiative [1]. On January 30, the Obama administration unveiled details about the Precision Medicine Initiative. Launched with a $215 million investment in the US President’s 2016 budget, the Precision Medicine Initiative will pioneer a new model of patient-powered research that will ultimately help deliver the right treatment to the right patient at the right time [2]. On March 11, 2015, it was reported that China is planning to invest 60 billion Yuan (nearly $10 billion) in precision medicine (20 billion from the Central Government and the remaining 40 billion from local governments and companies) before 2030 [3]. So, what is precision medicine? How is it related to other terms such as personalized medicine and omics (especially Pharmacogenomics and Pharmacoproteomics)? In this article, I elaborate on the concepts and their relationships.

Methods of integrating data to uncover genotype-phenotype interactions.
Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D
Nat Rev Genet. 2015 16(2): 85-97

Recent technological advances have expanded the breadth of available omic data, from whole-genome sequencing data, to extensive transcriptomic, methylomic and metabolomic data. A key goal of analyses of these data is the identification of effective models that predict phenotypic traits and outcomes, elucidating important biomarkers and generating important insights into the genetic underpinnings of the heritability of complex traits. There is still a need for powerful and advanced analysis strategies to fully harness the utility of these comprehensive high-throughput data, identifying true associations and reducing the number of false associations. In this Review, we explore the emerging approaches for data integration - including meta-dimensional and multi-staged analyses - which aim to deepen our understanding of the role of genetics and genomics in complex outcomes. With the use and further development of these approaches, an improved understanding of the relationship between genomic variation and human phenotypes may be revealed.
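
One of the meta-dimensional strategies this Review covers, concatenation-based integration, can be sketched as stacking feature matrices from two omics layers before fitting a single predictive model (scikit-learn assumed; the layers, sample sizes and effect sizes are invented):

```python
# Concatenation-based multi-omics integration on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 60
y = np.repeat([0, 1], 30)                        # phenotype labels

transcriptome = rng.normal(size=(n, 30))         # layer 1: expression
transcriptome[y == 1, :5] += 1.5
methylome = rng.normal(size=(n, 20))             # layer 2: methylation
methylome[y == 1, :5] -= 1.5

X = np.hstack([transcriptome, methylome])        # stack layers side by side
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
```

Concatenation is the simplest meta-dimensional variant; the Review also discusses transformation- and model-based alternatives that combine the layers less naively.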

Big data need big theory too.
Coveney PV, Dougherty ER, Highfield RR
Philos Trans A Math Phys Eng Sci. 2016 374(2080): 20160153.

The current interest in big data, machine learning and data analytics has generated the widespread impression that such methods are capable of solving most problems without the need for conventional scientific methods of inquiry. Interest in these methods is intensifying, accelerated by the ease with which digitized data can be acquired in virtually all fields of endeavour, from science, healthcare and cybersecurity to economics, social sciences and the humanities. In multiscale modelling, machine learning appears to provide a shortcut to reveal correlations of arbitrary complexity between processes at the atomic, molecular, meso- and macroscales. Here, we point out the weaknesses of pure big data approaches with particular focus on biology and medicine, which fail to provide conceptual accounts for the processes to which they are applied. No matter their 'depth' and the sophistication of data-driven methods, such as artificial neural nets, in the end they merely fit curves to existing data. Not only do these methods invariably require far larger quantities of data than anticipated by big data aficionados in order to produce statistically reliable results, but they can also fail in circumstances beyond the range of the data used to train them because they are not designed to model the structural characteristics of the underlying system. We argue that it is vital to use theory as a guide to experimental design for maximal efficiency of data collection and to produce reliable predictive models and conceptual knowledge. Rather than continuing to fund, pursue and promote 'blind' big data projects with massive budgets, we call for more funding to be allocated to the elucidation of the multiscale and stochastic processes controlling the behaviour of complex systems, including those of life, medicine and healthcare. This article is part of the themed issue 'Multiscale modelling at the physics-chemistry-biology interface'.

Scientific Papers:

Expression Data - A public resource of high quality curated datasets representing gene expression across anatomy, development and experimental conditions.
Zimmermann P, Bleuler S, Laule O, Martin F, Ivanov NV, Campanoni P, Oishi K, Lugon-Moulin N, Wyss M, Hruz T, Gruissem W.
BioData Min. 2014 7: 18 -- eCollection 2014.

Reference datasets are often used to compare, interpret or validate experimental data and analytical methods. In the field of gene expression, several reference datasets have been published. Typically, they consist of individual baseline or spike-in experiments carried out in a single laboratory and representing a particular set of conditions. Here, we describe a new type of standardized datasets representative for the spatial and temporal dimensions of gene expression. They result from integrating expression data from a large number of globally normalized and quality controlled public experiments. Expression data is aggregated by anatomical part or stage of development to yield a representative transcriptome for each category. For example, we created a genome-wide expression dataset representing the FDA tissue panel across 35 tissue types. The proposed datasets were created for human and several model organisms and are publicly available at

ARMADA -- Using motif activity dynamics to infer gene regulatory networks from gene expression data.
Pemberton-Ross PJ, Pachkov M, van Nimwegen E
Methods. 2015 Sep 1;85: 62-74

Analysis of gene expression data remains one of the most promising avenues toward reconstructing genome-wide gene regulatory networks. However, the large dimensionality of the problem prohibits the fitting of explicit dynamical models of gene regulatory networks, whereas machine learning methods for dimensionality reduction such as clustering or principal component analysis typically fail to provide mechanistic interpretations of the reduced descriptions. To address this, we recently developed a general methodology called motif activity response analysis (MARA) that, by modeling gene expression patterns in terms of the activities of concrete regulators, accomplishes dramatic dimensionality reduction while retaining mechanistic biological interpretations of its predictions (Balwierz, 2014). Here we extend MARA by presenting ARMADA, which models the activity dynamics of regulators across a time course, and infers the causal interactions between the regulators that drive the dynamics of their activities across time. We have implemented ARMADA as part of our ISMARA webserver, allowing any researcher to automatically apply it to any gene expression time course. To illustrate the method, we apply ARMADA to a time course of human umbilical vein endothelial cells treated with TNF. Remarkably, ARMADA is able to reproduce the complex observed motif activity dynamics using a relatively small set of interactions between the key regulators in this system. In addition, we show that ARMADA successfully infers many of the key regulatory interactions known to drive this inflammatory response and discuss several novel interactions that ARMADA predicts. In combination with ISMARA, ARMADA provides a powerful approach to generating plausible hypotheses for the key interactions between regulators that control gene expression in any system for which time course measurements are available.
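
The linear model underlying MARA-style analyses, expression as site counts times motif activities (E ≈ N · A), can be sketched with ridge regression on synthetic matrices. This illustrates the model class only, not the ISMARA/ARMADA implementation; all matrices and the noise level are invented:

```python
# Recover per-sample motif activities A from expression E and site counts N.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
genes, motifs, samples = 300, 5, 4
N = rng.poisson(1.0, size=(genes, motifs)).astype(float)  # promoter site counts
A_true = rng.normal(size=(motifs, samples))               # hidden activities
E = N @ A_true + rng.normal(scale=0.1, size=(genes, samples))

# Infer activities sample by sample with a small ridge penalty
A_hat = np.column_stack([
    Ridge(alpha=0.1, fit_intercept=False).fit(N, E[:, s]).coef_
    for s in range(samples)
])
```

Because only `motifs` activities per sample are fitted instead of thousands of gene parameters, this is the dimensionality reduction the abstract refers to, and each fitted coefficient keeps a mechanistic meaning (the activity of one regulator).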

From big data analysis to personalized medicine for all: challenges and opportunities.
Alyass A, Turcotte M, Meyre D
BMC Med Genomics. 2015 8:33

Recent advances in high-throughput technologies have led to the emergence of systems biology as a holistic science to achieve more precise modeling of complex diseases. Many predict the emergence of personalized medicine in the near future. We are, however, moving from two-tiered health systems to a two-tiered personalized medicine. Omics facilities are restricted to affluent regions, and personalized medicine is likely to widen the growing gap in health systems between high and low-income countries. This is mirrored by an increasing lag between our ability to generate and analyze big data. Several bottlenecks slow-down the transition from conventional to personalized medicine: generation of cost-effective high-throughput data; hybrid education and multidisciplinary teams; data storage and processing; data integration and interpretation; and individual and global economic relevance. This review provides an update of important developments in the analysis of big data and forward strategies to accelerate the global transition to personalized medicine.

Long non-coding RNA expression profiling in the NCI60 cancer cell line panel using high-throughput RT-qPCR.
Mestdagh P, Lefever S, Volders PJ, Derveaux S, Hellemans J, Vandesompele J
Sci Data. 2016 3: 160052

Long non-coding RNAs (lncRNAs) form a new class of RNA molecules implicated in various aspects of protein coding gene expression regulation. To study lncRNAs in cancer, we generated expression profiles for 1707 human lncRNAs in the NCI60 cancer cell line panel using a high-throughput nanowell RT-qPCR platform. We describe how qPCR assays were designed and validated and provide processed and normalized expression data for further analysis. Data quality is demonstrated by matching the lncRNA expression profiles with phenotypic and genomic characteristics of the cancer cell lines. This data set can be integrated with publicly available omics and pharmacological data sets to uncover novel associations between lncRNA expression and mRNA expression, miRNA expression, DNA copy number, protein coding gene mutation status or drug response.

Transcriptome marker diagnostics using big data.
Han H and Liu Y
IET Syst Biol. 2016 10(1): 41-48

Big omics data are challenging translational bioinformatics in an unprecedented way with their complexity and volume. How to employ big omics data to achieve a rivalling-clinical, reproducible disease diagnosis from a systems approach is an urgent problem to be solved in translational bioinformatics and machine learning. In this study, the authors propose a novel transcriptome marker diagnosis to tackle this problem using big RNA-seq data by viewing the whole transcriptome as a profile marker systematically. The systems diagnosis not only avoids the reproducibility issue of the existing gene-/network-marker-based diagnostic methods, but also achieves rivalling-clinical diagnostic results by extracting true signals from big RNA-seq data. Their method demonstrates a better fit for personalised diagnostics by attaining exceptional diagnostic performance via using systems information than its competitive methods and prepares itself as a good candidate for clinical usage. To the best of their knowledge, it is the first study on this topic and will inspire more investigations in big omics data diagnostics.

Global profiling of alternative RNA splicing events provides insights into molecular differences between various types of hepatocellular carcinoma.
Tremblay MP, Armero VE, Allaire A, Boudreault S, Martenon-Brodeur C, Durand M, Lapointe E, Thibault P, Tremblay-Létourneau M, Perreault JP, Scott MS, Bisaillon M
BMC Genomics. 2016 17: 683

BACKGROUND: Dysregulations in alternative splicing (AS) patterns have been associated with many human diseases including cancer. In the present study, alterations to the global RNA splicing landscape of cellular genes were investigated in a large-scale screen from 377 liver tissue samples using high-throughput RNA sequencing data.
RESULTS: Our study identifies modifications in the AS patterns of transcripts encoded by more than 2500 genes such as tumor suppressor genes, transcription factors, and kinases. These findings provide insights into the molecular differences between various types of hepatocellular carcinoma (HCC). Our analysis allowed the identification of 761 unique transcripts for which AS is misregulated in HBV-associated HCC, while 68 are unique to HCV-associated HCC, 54 to HBV&HCV-associated HCC, and 299 to virus-free HCC. Moreover, we demonstrate that the expression pattern of the RNA splicing factor hnRNPC in HCC tissues significantly correlates with patient survival. We also show that the expression of the HBx protein from HBV leads to modifications in the AS profiles of cellular genes. Finally, using RNA interference and a reverse transcription-PCR screening platform, we examined the implications of cellular proteins involved in the splicing of transcripts involved in apoptosis and demonstrate the potential contribution of these proteins in AS control.
CONCLUSIONS: This study provides the first comprehensive portrait of global changes in the RNA splicing signatures that occur in hepatocellular carcinoma. Moreover, these data allowed us to identify unique signatures of genes for which AS is misregulated in the different types of HCC.
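
Differential alternative splicing of the kind screened here is commonly quantified by "percent spliced-in" (PSI), the fraction of reads supporting exon inclusion. PSI is a standard field metric rather than something specific to this study, and the read counts below are invented:

```python
# PSI = inclusion / (inclusion + exclusion) for one cassette exon.
def psi(inclusion_reads, exclusion_reads):
    """Return the fraction of reads supporting inclusion, or None if no coverage."""
    total = inclusion_reads + exclusion_reads
    return None if total == 0 else inclusion_reads / total

tumor_psi = psi(inclusion_reads=20, exclusion_reads=80)    # 20/100 = 0.2
normal_psi = psi(inclusion_reads=70, exclusion_reads=30)   # 70/100 = 0.7
delta_psi = tumor_psi - normal_psi                         # splicing shift
```

A large |delta PSI| between conditions, here a strong loss of inclusion in the tumor sample, is the kind of event such genome-wide screens flag for follow-up.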

miRNA-miRNA crosstalk -- from genomics to phenomics.
Xu J, Shao T, Ding N, Li Y, Li X
Brief Bioinform. 2016 Aug 21

The discovery of microRNA (miRNA)-miRNA crosstalk has greatly improved our understanding of complex gene regulatory networks in normal and disease-specific physiological conditions. Numerous approaches have been proposed for modeling miRNA-miRNA networks based on genomic sequences, miRNA-mRNA regulation, functional information and phenomics alone, or by integrating heterogeneous data. In addition, it is expected that miRNA-miRNA crosstalk can be reprogrammed in different tissues or specific diseases. Thus, transcriptome data have also been integrated to construct context-specific miRNA-miRNA networks. In this review, we summarize the state-of-the-art miRNA-miRNA network modeling methods, which range from genomics to phenomics, focusing on the need to integrate heterogeneous types of omics data. Finally, we suggest future directions for studies of crosstalk of noncoding RNAs. The comprehensive summarization and discussion in this work provide constructive insights into miRNA-miRNA crosstalk.

Posttranscriptional Regulatory Networks: From Expression Profiling to Integrative Analysis of mRNA and MicroRNA Data.
Swanhild U. Meyer, Katharina Stoecker, Steffen Sass, Fabian J. Theis and Michael W. Pfaffl
Chapter 15 in Quantitative Real-Time PCR: Methods and Protocols (Methods in Molecular Biology)
edited by Roberto Biassoni and Alessandro Raso

Protein coding RNAs are posttranscriptionally regulated by microRNAs, a class of small noncoding RNAs. Insights into messenger RNA (mRNA) and microRNA (miRNA) regulatory interactions facilitate the understanding of fine-tuning of gene expression and might allow better estimation of protein synthesis. However, in silico predictions of mRNA–microRNA interactions do not take into account the specific transcriptomic status of the biological system and are biased by false positives. One possible solution to predict rather reliable mRNA-miRNA relations in the specific biological context is to integrate real mRNA and miRNA transcriptomic data as well as in silico target predictions. This chapter addresses the workflow and methods one can apply for expression profiling and the integrative analysis of mRNA and miRNA data, as well as how to analyze and interpret results, and how to build up models of posttranscriptional regulatory networks.
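
The integration idea of this chapter, keeping only mRNA-miRNA pairs that are both predicted targets in silico and negatively correlated in matched expression data, can be sketched as follows (all miRNA/gene names, predictions and expression values are invented):

```python
# Filter predicted miRNA-target pairs by measured anticorrelation.
import numpy as np

predicted_targets = {("miR-1", "GeneA"), ("miR-1", "GeneB"), ("miR-2", "GeneC")}

# Matched expression across 6 samples (toy values)
expr = {
    "miR-1": np.array([1, 2, 3, 4, 5, 6], float),
    "miR-2": np.array([6, 5, 4, 3, 2, 1], float),
    "GeneA": np.array([6, 5, 4, 3, 2, 1], float),  # anticorrelated with miR-1
    "GeneB": np.array([1, 2, 3, 4, 5, 6], float),  # positively correlated
    "GeneC": np.array([1, 3, 2, 5, 4, 6], float),  # anticorrelated with miR-2
}

def corr(x, y):
    return float(np.corrcoef(x, y)[0, 1])

# Keep pairs supported by both prediction and expression data
supported = {(m, g) for (m, g) in predicted_targets
             if corr(expr[m], expr[g]) < -0.5}
```

The pair ("miR-1", "GeneB") is predicted in silico but dropped because its expression moves with, not against, the miRNA, which is the false-positive filtering the chapter motivates.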

Toward understanding the evolution of vertebrate gene regulatory networks: comparative genomics and epigenomic approaches.
Martinez-Morales JR
Brief Funct Genomics. 2015 Aug 20.

Vertebrates, as most animal phyla, originated >500 million years ago during the Cambrian explosion, and progressively radiated into the extant classes. Inferring the evolutionary history of the group requires understanding the architecture of the developmental programs that constrain the vertebrate anatomy. Here, I review recent comparative genomic and epigenomic studies, based on ChIP-seq and chromatin accessibility, which focus on the identification of functionally equivalent cis-regulatory modules among species. This pioneering work, primarily centered in the mammalian lineage, has set the groundwork for further studies in representative vertebrate and chordate species. Mapping of active regulatory regions across lineages will shed new light on the evolutionary forces stabilizing ancestral developmental programs, as well as allowing their variation to sustain morphological adaptations on the inherited vertebrate body plan.

Laser capture microdissection: Big data from small samples.
Datta S, Malhotra L, Dickerson R, Chaffee S, Sen CK, Roy S
Histol Histopathol. 2015 30(11): 1255-1269.

Any tissue is made up of a heterogeneous mix of spatially distributed cell types. In response to any (patho)physiological cue, responses of each cell type in any given tissue may be unique and cannot be homogenized across cell-types and spatial co-ordinates. For example, in response to myocardial infarction, on one hand myocytes and fibroblasts of the heart tissue respond differently. On the other hand, myocytes in the infarct core respond differently compared to those in the peri-infarct zone. Therefore, isolation of pure targeted cells is an important and essential step for the molecular analysis of cells involved in the progression of disease. Laser capture microdissection (LCM) is a powerful technique to obtain a pure targeted cell subgroup, or even a single cell, quickly and precisely under the microscope, successfully tackling the problem of tissue heterogeneity in molecular analysis. This review presents an overview of LCM technology, the principles, advantages and limitations and its down-stream applications in the fields of proteomics, genomics and transcriptomics. With powerful technologies and appropriate applications, this technique provides unprecedented insights into cell biology from cells grown in their natural tissue habitat as opposed to those cultured in artificial petri dish conditions.

Global regulatory architecture of human, mouse and rat tissue transcriptomes.
Prasad A, Kumar SS, Dessimoz C, Bleuler S, Laule O, Hruz T, Gruissem W, Zimmermann P.
BMC Genomics. 2013 14: 716

BACKGROUND: Predicting molecular responses in human by extrapolating results from model organisms requires a precise understanding of the architecture and regulation of biological mechanisms across species.
RESULTS: Here, we present a large-scale comparative analysis of organ and tissue transcriptomes involving the three mammalian species human, mouse and rat. To this end, we created a unique, highly standardized compendium of tissue expression. Representative tissue specific datasets were aggregated from more than 33,900 Affymetrix expression microarrays. For each organism, we created two expression datasets covering over 55 distinct tissue types with curated data from two independent microarray platforms. Principal component analysis (PCA) revealed that the tissue-specific architecture of transcriptomes is highly conserved between human, mouse and rat. Moreover, tissues with related biological function clustered tightly together, even if the underlying data originated from different labs and experimental settings. Overall, the expression variance caused by tissue type was approximately 10 times higher than the variance caused by perturbations or diseases, except for a subset of cancers and chemicals. Pairs of gene orthologs exhibited higher expression correlation between mouse and rat than with human. Finally, we show evidence that tissue expression profiles, if combined with sequence similarity, can improve the correct assignment of functionally related homologs across species.
CONCLUSION: The results demonstrate that tissue-specific regulation is the main determinant of transcriptome composition and is highly conserved across mammalian species.
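The PCA step described in this abstract can be sketched on simulated data (a toy stand-in, not the authors' compendium): when tissue identity dominates the variance, samples separate along the first principal components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated compendium: 3 "tissues" x 10 arrays each, 200 genes; each
# tissue has its own mean expression profile.
tissue_means = rng.normal(scale=3.0, size=(3, 200))
X = np.vstack([m + rng.normal(size=(10, 200)) for m in tissue_means])

Xc = X - X.mean(axis=0)                    # center each gene
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                     # sample coordinates on PC1/PC2
explained = S ** 2 / np.sum(S ** 2)
print(f"PC1+PC2 explain {100 * explained[:2].sum():.1f}% of the variance")
```

Because the between-tissue differences are much larger than the array-level noise, the first two components absorb most of the variance, mirroring the tissue-driven clustering the study reports.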

Data Visualisation and Software Tools:

GenEx offers advanced methods to analyze real-time qPCR data with simple clicks of the mouse

GenEx is a popular software package for qPCR data processing and analysis. Built in a modular fashion, GenEx provides a multitude of functionalities for the qPCR community, ranging from basic data editing and management to advanced, cutting-edge data analysis.

Basic data editing and management
Arguably the most important part of a qPCR experiment is pre-processing the raw data into shape for subsequent statistical analyses. The pre-processing steps need to be performed consistently, in the correct order and with confidence. GenEx Standard's streamlined and user-friendly interface ensures mistake-free data handling. Intuitive and powerful presentation tools allow professional illustration of even the most complex experimental designs.

Advanced cutting-edge data analysis
When you need more advanced analyses, GenEx 6 is the product for you. Powerful enough to demonstrate feasibility, it often proves sufficient for most users' demands. Current features include parametric and non-parametric statistical tests, Principal Component Analysis, and Artificial Neural Networks. New features are continuously added to GenEx with close attention to customers' needs.

New features
Sample handling and the samples' individual biology often contribute confounding experimental variability. Using the new nested ANOVA feature in GenEx, a user can evaluate the variance contribution of each step in the experimental procedure. With good knowledge of the variance contributions, an appropriate distribution of experimental replicates can be selected to minimize confounding variance and maximize the power of the experimental design. For experiments with complex features, such as multifactorial diseases, analytical relationships and classifications may not be readily available. The support vector machine feature in the new version of GenEx is so easy to use that it makes this advanced supervised classification method accessible to novice users, while providing access to advanced parameters for experts.
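As a toy illustration of the nested ANOVA idea (simulated Cq values, not GenEx output), variance components for a two-level qPCR design can be estimated from the replicate and sample mean squares:

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated two-level nested qPCR design: 8 biological samples with
# 4 technical replicates each; true sd 1.0 between samples and 0.3
# between replicates.
n_samples, n_reps = 8, 4
sample_effect = rng.normal(scale=1.0, size=n_samples)
cq = sample_effect[:, None] + rng.normal(scale=0.3, size=(n_samples, n_reps))

ms_within = cq.var(axis=1, ddof=1).mean()           # replicate mean square
ms_between = n_reps * cq.mean(axis=1).var(ddof=1)   # sample mean square
var_within = ms_within
var_between = max((ms_between - ms_within) / n_reps, 0.0)
print(f"variance between samples ~ {var_between:.2f}, "
      f"between replicates ~ {var_within:.2f}")
```

Here the biological level dominates, so adding more biological samples (rather than more technical replicates) would be the better use of an experimental budget.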

The methods are suitable to select and validate reference genes, classify samples, group genes, monitor time dependent processes and much more.
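The kind of supervised classification performed by such a support vector machine feature can be sketched with a minimal linear SVM trained by stochastic sub-gradient descent on the hinge loss (toy expression data; none of these names or settings come from GenEx itself):

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy two-class expression data: 40 samples x 5 genes, labels -1 / +1,
# with the class means shifted apart so the data are linearly separable.
y = np.repeat([-1.0, 1.0], 20)
X = rng.normal(size=(40, 5)) + 1.5 * y[:, None]

# Minimal linear SVM: stochastic sub-gradient descent on the
# L2-regularized hinge loss.
w, b = np.zeros(5), 0.0
eta, lam = 0.01, 0.001
for epoch in range(200):
    for i in rng.permutation(len(y)):
        if y[i] * (X[i] @ w + b) < 1:        # inside margin / misclassified
            w += eta * (y[i] * X[i] - lam * w)
            b += eta * y[i]
        else:
            w -= eta * lam * w

accuracy = np.mean(np.sign(X @ w + b) == y)
print(f"training accuracy: {accuracy:.2f}")
```
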
Please see the GenEx web page or Online Tutorials.

3Omics -- a web-based systems biology tool for analysis, integration and visualization of human transcriptomic, proteomic and metabolomic data.
Kuo TC, Tian TF, Tseng YJ.
BMC Syst Biol. 2013 Jul 23;7:64

BACKGROUND: Integrative and comparative analyses of multiple transcriptomics, proteomics and metabolomics datasets require an intensive knowledge of tools and background concepts. Thus, it is challenging for users to perform such analyses, highlighting the need for a single tool for such purposes. The 3Omics one-click web tool was developed to visualize and rapidly integrate multiple human inter- or intra-transcriptomic, proteomic, and metabolomic data by combining five commonly used analyses: correlation networking, coexpression, phenotyping, pathway enrichment, and GO (Gene Ontology) enrichment.
RESULTS: 3Omics generates inter-omic correlation networks to visualize relationships in data with respect to time or experimental conditions for all transcripts, proteins and metabolites. If only two of three omics datasets are input, then 3Omics supplements the missing transcript, protein or metabolite information related to the input data by text-mining the PubMed database. 3Omics' coexpression analysis assists in revealing functions shared among different omics datasets. 3Omics' phenotype analysis integrates Online Mendelian Inheritance in Man with available transcript or protein data. Pathway enrichment analysis on metabolomics data by 3Omics reveals enriched pathways in the KEGG/HumanCyc database. 3Omics performs statistical Gene Ontology-based functional enrichment analyses to display significantly overrepresented GO terms in transcriptomic experiments. Although the principal application of 3Omics is the integration of multiple omics datasets, it is also capable of analyzing individual omics datasets. The information obtained from the analyses of 3Omics in Case Studies 1 and 2 are also in accordance with comprehensive findings in the literature.
CONCLUSIONS: 3Omics incorporates the advantages and functionality of existing software into a single platform, thereby simplifying data analysis and enabling the user to perform a one-click integrated analysis. Visualization and analysis results are downloadable for further user customization and analysis.
The 3Omics software can be freely accessed at
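The correlation networking step that 3Omics performs can be illustrated in miniature: compute pairwise Pearson correlations across conditions and keep only the edges above a chosen cutoff (simulated data; the feature names and the 0.7 threshold are illustrative, not 3Omics defaults).

```python
import numpy as np

rng = np.random.default_rng(5)
# Four hypothetical features measured over 100 conditions: a transcript,
# its protein, a downstream metabolite, and an unrelated metabolite.
transcript = rng.normal(size=100)
protein = 0.9 * transcript + 0.2 * rng.normal(size=100)
metabolite = -0.8 * protein + 0.3 * rng.normal(size=100)
unrelated = rng.normal(size=100)
names = ["transcript", "protein", "metabolite", "unrelated"]
R = np.corrcoef([transcript, protein, metabolite, unrelated])

# Keep an edge wherever the absolute correlation exceeds the cutoff.
edges = [(names[i], names[j], round(R[i, j], 2))
         for i in range(len(names)) for j in range(i + 1, len(names))
         if abs(R[i, j]) >= 0.7]
for edge in edges:
    print(edge)
```

The unrelated metabolite ends up isolated, while the transcript-protein-metabolite chain forms a connected component, which is the kind of inter-omic relationship the tool visualizes.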

A multilevel gamma-clustering layout algorithm for visualization of biological networks.
Hruz T, Wyss M, Lucas C, Laule O, von Rohr P, Zimmermann P, Bleuler S.
Adv Bioinformatics. 2013: 920325

Visualization of large complex networks has become an indispensable part of systems biology, where organisms need to be considered as one complex system. The visualization of the corresponding network is challenging due to the size and density of edges. In many cases, the use of standard visualization algorithms leads to high running times and poorly readable visualizations with many edge crossings. We suggest an approach that first analyzes the structure of the graph and then generates a new graph containing specific semantic symbols for regular substructures such as dense clusters. We propose a multilevel gamma-clustering layout visualization algorithm (MLGA) which proceeds in three subsequent steps: (i) a multilevel γ-clustering is used to identify the structure of the underlying network, (ii) the network is transformed into a tree, and (iii) the resulting tree, which shows the network structure, is drawn using a variation of a force-directed algorithm. The algorithm has the potential to visualize very large networks because it uses modern clustering heuristics optimized for large graphs. Moreover, most of the edges are removed from the visual representation, which makes it possible to keep an overview of complex graphs with dense subgraphs.
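Steps (i) and (ii) can be sketched in much-reduced form: here, connected components over edges with weight >= gamma stand in for the paper's γ-clustering, and each cluster is collapsed into one child of an artificial root node (toy graph, not the published algorithm).

```python
# Toy weighted graph; gamma plays the role of the clustering threshold.
gamma = 0.5
edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "a"): 0.7,  # dense cluster
         ("d", "e"): 0.9,                                    # second cluster
         ("a", "d"): 0.2, ("e", "f"): 0.1}                   # weak links
nodes = {n for pair in edges for n in pair}

# Step (i): connected components over strong edges (a stand-in for the
# paper's gamma-clustering), found with union-find.
parent = {n: n for n in nodes}
def find(n):
    while parent[n] != n:
        parent[n] = parent[parent[n]]   # path compression
        n = parent[n]
    return n
for (u, v), w in edges.items():
    if w >= gamma:
        parent[find(u)] = find(v)

# Step (ii): collapse each cluster into one child of an artificial root.
clusters = {}
for n in sorted(nodes):
    clusters.setdefault(find(n), []).append(n)
tree = {"root": sorted(clusters.values())}
print(tree)
```

The weak links vanish from the tree, which is exactly how MLGA keeps an overview: dense substructures become single symbols and only the coarse structure is drawn.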

Gene expression inference with deep learning.
Chen Y, Li Y, Narayan R, Subramanian A, Xie X
Bioinformatics. 2016 Jun 15;32(12): 1832-1839

MOTIVATION: Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ∼1000 carefully selected landmark genes and relying on computational methods to infer the expression of the remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression (LR), limiting its accuracy since it does not capture the complex nonlinear relationships between the expression of genes.
RESULTS: We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based Gene Expression Omnibus dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms LR with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than LR in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2921 expression profiles. Deep learning still outperforms LR with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes.
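The idea of replacing linear regression with a learned nonlinear mapping can be sketched with a single-hidden-layer network trained by gradient descent on simulated landmark/target profiles (a toy stand-in, far smaller than the published D-GEX architecture and data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated compendium: 200 profiles, 10 "landmark" genes and 3 "target"
# genes whose expression depends nonlinearly on the landmarks.
X = rng.normal(size=(200, 10))
Y = np.tanh(X @ rng.normal(size=(10, 3))) + 0.05 * rng.normal(size=(200, 3))

# One-hidden-layer network (10 -> 32 -> 3), full-batch gradient descent
# on the per-sample squared error.
W1, b1 = 0.1 * rng.normal(size=(10, 32)), np.zeros(32)
W2, b2 = 0.1 * rng.normal(size=(32, 3)), np.zeros(3)
lr = 0.02

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

loss = lambda P: np.mean(np.sum((P - Y) ** 2, axis=1))
loss_before = loss(forward(X)[1])

for _ in range(1000):
    H, P = forward(X)
    G = 2 * (P - Y) / len(X)            # dLoss/dP
    GH = (G @ W2.T) * (1 - H ** 2)      # backprop through tanh
    W2 -= lr * (H.T @ G);  b2 -= lr * G.sum(0)
    W1 -= lr * (X.T @ GH); b1 -= lr * GH.sum(0)

loss_after = loss(forward(X)[1])
print(f"training error before: {loss_before:.3f}  after: {loss_after:.3f}")
```
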

On Efficient Feature Ranking Methods for High-Throughput Data Analysis.
Liao B, Jiang Y, Liang W, Peng L, Peng L, Hanyurwimfura D, Li Z, Chen M.
IEEE/ACM Trans Comput Biol Bioinform. 2015 12(6): 1374-1384

Efficient mining of high-throughput data has become one of the popular themes in the big data era. Existing biology-related feature ranking methods mainly focus on statistical and annotation information. In this study, two efficient feature ranking methods are presented. Multi-target regression and graph embedding are incorporated into an optimization framework, and feature ranking is achieved by introducing a structured sparsity norm. Unlike existing methods, the presented methods have two advantages: (1) the selected feature subset simultaneously accounts for global margin information as well as local manifold information, so both global and local structure are considered; (2) features are selected in batches rather than individually within the algorithm framework, so interactions between features are taken into account and the optimal feature subset can be guaranteed. In addition, this study presents a theoretical justification. Empirical experiments demonstrate the effectiveness and efficiency of the two algorithms in comparison with several state-of-the-art feature ranking methods on a set of real-world gene expression data sets.
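For contrast with the batch, structured-sparsity approach described above, here is the kind of simple univariate gene ranking (a Fisher-score baseline on simulated data) that such methods are designed to improve upon: each gene is scored independently, so feature interactions are ignored.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated study: 100 samples x 50 genes, two classes; only genes 0-4
# carry class signal (their mean is shifted in class 1).
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 50))
X[y == 1, :5] += 2.0

def fisher_score(X, y):
    """Between-class separation over within-class spread, per gene."""
    a, b = X[y == 0], X[y == 1]
    return (a.mean(0) - b.mean(0)) ** 2 / (a.var(0) + b.var(0))

ranking = np.argsort(fisher_score(X, y))[::-1]   # best genes first
print("top 5 genes:", sorted(ranking[:5].tolist()))
```
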

KnowEnG -- a knowledge engine for genomics.
Sinha S, Song J, Weinshilboum R, Jongeneel V, Han J
J Am Med Inform Assoc. 2015 Nov;22(6): 1115-1119

We describe here the vision, motivations, and research plans of the National Institutes of Health Center for Excellence in Big Data Computing at the University of Illinois, Urbana-Champaign. The Center is organized around the construction of "Knowledge Engine for Genomics" (KnowEnG), an E-science framework for genomics where biomedical scientists will have access to powerful methods of data mining, network mining, and machine learning to extract knowledge out of genomics data. The scientist will come to KnowEnG with their own data sets in the form of spreadsheets and ask KnowEnG to analyze those data sets in the light of a massive knowledge base of community data sets called the "Knowledge Network" that will be at the heart of the system. The Center is undertaking discovery projects aimed at testing the utility of KnowEnG for transforming big data to knowledge. These projects span a broad range of biological enquiry, from pharmacogenomics (in collaboration with Mayo Clinic) to transcriptomics of human behavior.

Data Mining Methods for Omics and Knowledge of Crude Medicinal Plants toward Big Data Biology.
Afendi FM, Ono N, Nakamura Y, Nakamura K, Darusman LK, Kibinge N, Morita AH, Tanaka K, Horai H, Altaf-Ul-Amin M, Kanaya S
Comput Struct Biotechnol J. 2013 4: e201301010 -- eCollection 2013

Molecular biological data has increased rapidly with the recent progress of the omics fields, e.g., genomics, transcriptomics, proteomics and metabolomics, which necessitates the development of databases and methods for efficient storage, retrieval, integration and analysis of massive data. The present study reviews the usage of the KNApSAcK Family DB in metabolomics and related areas, discusses several statistical methods for handling multivariate data, and shows their application to Indonesian blended herbal medicines (Jamu) as a case study. Exploration using a biplot reveals that many plants are rarely utilized, while some plants are highly utilized toward specific efficacies. Furthermore, the ingredients of Jamu formulas are modeled using Partial Least Squares Discriminant Analysis (PLS-DA) in order to predict their efficacy. The plants used in each Jamu medicine served as the predictors, whereas the efficacy of each Jamu provided the responses. This model produces 71.6% correct classification in predicting efficacy. A permutation test is then used to determine the plants that serve as main ingredients in a Jamu formula by evaluating the significance of the PLS-DA coefficients. Next, in order to explain the role of plants that serve as main ingredients in Jamu medicines, information on the pharmacological activity of the plants is added to the predictor block. An N-PLS-DA model, the multiway version of PLS-DA, is then utilized to handle the three-dimensional array of the predictor block. The resulting N-PLS-DA model reveals that the effects of some pharmacological activities are specific to certain efficacies, while other activities are diverse across many efficacies. The mathematical modeling introduced in the present study can be utilized in the global analysis of big data aiming to reveal the underlying biology.
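The PLS-DA step can be sketched with a one-component PLS discriminant on a simulated plant-by-formula matrix (toy data; the actual study used multi-component PLS-DA with permutation testing, and these variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical Jamu data: 60 formulas x 20 plants (binary ingredient
# matrix), two efficacy classes; plants 0-2 are always used in class 1.
y = np.repeat([0, 1], 30)
X = (rng.random((60, 20)) < 0.3).astype(float)
X[y == 1, :3] = 1.0
Y = np.column_stack([y == 0, y == 1]).astype(float)   # one-hot efficacy

# One-component PLS-DA: the first PLS weight vector is the dominant
# left singular vector of X'Y (after centering both blocks).
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
w = np.linalg.svd(Xc.T @ Yc)[0][:, 0]
t = Xc @ w                                            # sample scores
# Classify each formula by the nearer class-mean score.
m0, m1 = t[y == 0].mean(), t[y == 1].mean()
pred = (np.abs(t - m1) < np.abs(t - m0)).astype(int)
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The loadings in `w` concentrate on the informative plants, which is the same logic the study exploits when it tests PLS-DA coefficients to identify main ingredients.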

Visualization of -omics data for systems biology
Gehlenborg N, O'Donoghue SI, Baliga NS, Goesmann A, Hibbs MA, Kitano H, Kohlbacher O, Neuweger H, Schneider R, Tenenbaum D, Gavin AC.
Nat Methods. 2010 7(3): S56-68

High-throughput studies of biological systems are rapidly accumulating a wealth of 'omics'-scale data. Visualization is a key aspect of both the analysis and understanding of these data, and users now have many visualization methods and tools to choose from. The challenge is to create clear, meaningful and integrated visualizations that give biological insight, without being overwhelmed by the intrinsic complexity of the data. In this review, we discuss how visualization tools are being used to help interpret protein interaction, gene expression and metabolic profile data, and we highlight emerging new directions.

Images made with R

Bioinformatics Made Easy
Search bioinformatics tools and run genomic analysis in the cloud

We are excited to invite you to beta test the InsideDNA platform, which provides:
  • over 600 of the most used bioinformatics tools, including TopHat, Bowtie2, OrthoMCL, samtools, bamtools, BEAST, phyml, abyss and SOAPdenovo
  • powerful compute nodes with up to 208 Gb RAM and 32 cores
  • an unlimited number of compute nodes for each user
  • an effortless way to launch any bioinformatics tool
How does it work?
Currently, our service is free, and we are thrilled to provide 10 Gb of storage space and 10 compute credits to each new user. These 10 credits roughly equal 260 hours of computational work on different compute nodes*. We hope that you will be pleasantly surprised by how much analysis you can do during these hours. In addition, if you fill in our entry survey, we will give you an extra 10 compute credits. The survey aims to make the InsideDNA application better and more user-friendly.
How will it work in the future?
While we are trying to make this service as affordable as possible for researchers, the compute nodes are provided by a third party, and we can only keep the current service free of charge for several months and for a limited number of users. After that, we will have to charge for computing at a price of $10 USD per 10 compute credits (~260 hours of work). We only deduct credits when you actually run an analysis, not when you are idle.
Bugs, errors and problems
Although we have been testing InsideDNA internally for several months, it is still likely to have bugs. We therefore kindly ask you to report any issues or problems you may experience with InsideDNA. Please send any feedback to this email:
Next releases and forthcoming features
We are currently working on more exciting features, including much bigger storage space for each user. Vote for features in our application to get them done quicker, or talk to us and suggest other features you think may be useful!
Enjoy happy sequence crunching with InsideDNA!