Research Overview

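We develop new methods for computational inference, machine learning, and probabilistic modeling. Our current research focus is on learning from multiple data sources, in particular multi-view learning, multi-task learning, and multi-way learning, and their combination. Furthermore, we devise novel approaches to information visualization and retrieval of relevant data.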

The methods we develop are generic and meant to be usable in various application fields. We currently specialize in the following areas (for more information, please follow the links):



Introduction to Learning from Multiple Data Sources

We develop statistical machine learning methods for extracting useful regularities from large, high-dimensional data sets. In many practical data analysis tasks, the most challenging problem is the small sample size combined with the high dimensionality of data points. The idea of multi-source machine learning is to exploit additional data sets (e.g., earlier measurement experiments in molecular biology or neuroscience) even if they are only partially relevant for the data set of interest. Our methods extend and generalize the paradigms of multi-view, multi-way and multi-task learning. Moreover, we develop new principles and methods for data visualization and retrieval to facilitate human-computer interaction in the knowledge discovery process.

Multi-View Learning

Multi-view learning analyzes how several data sources (views) describing the same data objects can be combined to extract more relevant information. We focus on unsupervised settings, where the relevance comes from statistical dependencies between the different views. For example, a collection of images with captions can be represented with two views, one describing the contents of the image and the other one describing the caption. Dependencies between these representations reveal more information on the intended semantics of the images than either view alone. We have developed new theory for decomposing variation in multiple views into source-specific and shared components, building on Bayesian latent-variable models. The dependencies between the views are captured by assigning flexible source-specific models for describing the structured "noise" in each view. Moreover, as an extension of traditional multi-view learning where a known one-to-one sample pairing between the views is assumed, we have developed solutions for learning the matching itself.

Related keywords: canonical correlation analysis, data integration, data fusion, co-occurring data
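
As a rough illustration of how statistical dependencies between two paired views can be extracted, the sketch below computes the first pair of canonical directions with classical canonical correlation analysis (one of the related keywords above). It is not the Bayesian latent-variable model described in this section; the function name `cca_first_pair` and the small ridge term `reg` are assumptions made for the example.

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-6):
    """First canonical correlation and weight vectors for two paired views.

    X: (n_samples, dx) array, Y: (n_samples, dy) array, rows paired one-to-one.
    reg: small ridge term (an assumption of this sketch) to keep covariances invertible.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    # Whiten each view using the Cholesky factors C = L L^T.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    K = np.linalg.solve(Lx, Cxy)        # Lx^{-1} Cxy
    K = np.linalg.solve(Ly, K.T).T      # Lx^{-1} Cxy Ly^{-T}

    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(K, full_matrices=False)

    # Map the leading singular vectors back to the original variable spaces.
    wx = np.linalg.solve(Lx.T, U[:, 0])
    wy = np.linalg.solve(Ly.T, Vt[0])
    return s[0], wx, wy
```

Projecting the views onto `wx` and `wy` gives one shared component; whatever variation remains in each view is, in this simplified picture, source-specific.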

Representative Publications

Multi-Task Learning

We have recently introduced a learning problem called relevant subtask learning, a variant of multi-task learning, which aims to solve the small-data problem by intelligently making use of other, potentially related background data sets. In contrast to typical multi-task learning, our problem formulation is fundamentally asymmetric: test data is known to fit one task, the task of interest, and other tasks may contain subtasks relevant for the task of interest. No other task needs to be wholly relevant, and it is not known which parts of it are relevant. We introduce probabilistic modeling approaches to exploit partially relevant background data sets to improve on the task of interest. Focused multi-task approaches are useful in many biological applications, for instance in choosing control samples in differential gene expression experiments.

Representative Publications

Multi-Way Learning

Finding the effects of one or more known covariates from data is one of the most common statistical problems, typically solved by Analysis of Variance (ANOVA), its multivariate generalization (MANOVA), or, more generally, by linear models. Traditional ANOVA-type methods are not applicable to molecular biology data, where the dimensionality of the problem is very large and the number of observations is (relatively) small. We have recently introduced a Bayesian method for solving this problem of multi-way analysis of small-sample-size, high-dimensional data sets. The multi-way data analysis problem becomes even more complicated when heterogeneous data with multiple covariates are integrated from multiple sources. Different data sources usually have distinct, unmatched variable spaces with different dimensionalities. We have generalized ANOVA-type analysis to the case of multiple data sources by considering the source (view) as an additional covariate and utilizing dependencies between the sources. We introduced a model that is able to find the multi-way covariate effects and to partition them into shared and source-specific effects.
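
For readers unfamiliar with the classical setting that this work generalizes, the following sketch decomposes a single measured variable into a grand mean, two main effects, and an interaction, as in a two-way ANOVA. It illustrates only the traditional low-dimensional case, not the Bayesian multi-way model described above; the function name `two_way_effects` is hypothetical, and the sketch assumes every combination of covariate levels is observed at least once.

```python
import numpy as np

def two_way_effects(y, a, b):
    """Classical two-way effect decomposition for one response variable.

    y: (n,) responses; a, b: (n,) integer labels of two covariates.
    Assumes every (a, b) combination occurs at least once.
    Returns grand mean, main effects of a and b, and the interaction table.
    """
    levels_a, levels_b = np.unique(a), np.unique(b)
    # Mean response in each cell of the design.
    cell = np.array([[y[(a == i) & (b == j)].mean() for j in levels_b]
                     for i in levels_a])
    mu = cell.mean()                       # unweighted grand mean
    alpha = cell.mean(axis=1) - mu         # main effect of covariate a
    beta = cell.mean(axis=0) - mu          # main effect of covariate b
    interaction = cell - mu - alpha[:, None] - beta[None, :]
    return mu, alpha, beta, interaction
```

With thousands of variables and only a handful of samples per cell, estimating such effects independently for every variable becomes unreliable, which is the problem the paragraph above addresses.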

Representative Publications

Information Visualization

Visualization of mutual similarities of entities in high-dimensional data sets is a central problem in exploratory data analysis and knowledge discovery. It is generally not possible to show all the similarity relationships within a high-dimensional data set perfectly on a low-dimensional display; some properties are necessarily lost or misrepresented. To approach this problem explicitly, we formulate visualization as a visual information retrieval task and quantify the necessary trade-off in terms of the standard information retrieval measures, precision and recall. The method has been extended to network visualization, supervised visualization, linear visualization, visualization given an annotation ontology, and to a generative modeling formulation.
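
The retrieval view of visualization can be made concrete with a small sketch: treat each point's nearest neighbors in the original space as the relevant items and its nearest neighbors on the display as the retrieved items, then compute precision and recall. The hard k-nearest-neighbor neighborhoods and the function names below are assumptions of this example and may differ from the formulation used in the publications listed below.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each row (Euclidean, excluding self)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def neighborhood_precision_recall(high, low, k_relevant, k_retrieved):
    """Mean precision and recall of display neighborhoods.

    high: (n, D) original data; low: (n, d) display coordinates of the same points.
    Relevant items: k_relevant nearest neighbors in the original space.
    Retrieved items: k_retrieved nearest neighbors on the display.
    """
    relevant = knn_indices(high, k_relevant)
    retrieved = knn_indices(low, k_retrieved)
    hits = np.array([len(set(r) & set(t)) for r, t in zip(relevant, retrieved)])
    precision = hits / k_retrieved   # fraction of displayed neighbors that are true neighbors
    recall = hits / k_relevant       # fraction of true neighbors that appear on the display
    return precision.mean(), recall.mean()
```

Low precision means the display shows false neighbors; low recall means true neighbors are missed. A layout usually cannot maximize both at once, which is the trade-off referred to above.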

Representative Publications

  • Jaakko Peltonen and Samuel Kaski. Generative modeling for maximizing precision and recall in information visualization. In Geoffrey Gordon, David Dunson, and Miroslav Dudik, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR W&CP, pages 579–587. JMLR, 2011. (abstract, pdf)
  • Samuel Kaski and Jaakko Peltonen. Dimensionality reduction for data visualization. IEEE Signal Processing Magazine, 28(2):100–104, 2011. (DOI)
  • Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, and Samuel Kaski. Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization. Journal of Machine Learning Research, 11:451–490, 2010. (abstract, preprint pdf, final pdf at JMLR)
  • Jarkko Venna and Samuel Kaski. Nonlinear Dimensionality Reduction as Information Retrieval. In Marina Meila and Xiaotong Shen, editors, Proceedings of AISTATS 2007, the 11th International Conference on Artificial Intelligence and Statistics. Omnipress, 2007. JMLR Workshop and Conference Proceedings, Volume 2: AISTATS 2007. (abstract, pdf)

Retrieval of Relevant Data

Large repositories of genome-wide measurement data pose the research question of how to systematically relate different data sets. Reusing data sets increases the statistical power of new studies and makes it possible to put biological results in the context of previous studies. To complement the keyword search functionality that most repositories provide for retrieving similarly annotated studies, we developed machine learning methods that relate studies through their actual measurement data, along with visualization tools for exploring and interpreting the results. In the REx project (Retrieval of Relevant Experiments), relevance is defined by a model of biology that is both data- and knowledge-driven.

Representative Publications

  • José Caldas, Nils Gehlenborg, Eeva Kettunen, Ali Faisal, Mikko Rönty, Andrew G Nicholson, Sakari Knuutila, Alvis Brazma, and Samuel Kaski. Data-Driven Information Retrieval in Heterogeneous Collections of Transcriptomics Data Links SIM2s to Malignant Pleural Mesothelioma. Bioinformatics, 28(2):246–253, 2012. (html). See also: Supplementary website and Software.
  • José Caldas, Nils Gehlenborg, Ali Faisal, Alvis Brazma, and Samuel Kaski. Probabilistic retrieval and visualization of biologically relevant microarray experiments. Bioinformatics, 25(12):i145–i153, 2009. (html). See also: Software, Poster (best poster award at the 5th ISCB Student Council Symposium).

Applications

We do research in the following application areas (for more information, please follow the links):

Publication list of the research group