Phenome-wide Mantis-ML

About

Overview

Background

Mantis-ML¹ is a framework for identifying novel gene-disease associations, leveraging multiple datasets within a semi-supervised machine learning environment. For a given disease, the method seeks to provide, for each gene in the exome, a probability of association to that disease. The workflow is summarised in the following schematic:

Mantis-ML overview for a given disease

In particular, the methodology relies on two inputs:

Genes that are known to be associated to a given disease.
Gene features - these are signalling pathways, biological processes, tissues etc., known to be implicated in the disease of interest.

The problem can be cast within a binary classification framework, with each exome gene classified as either 'associated' or 'not associated' to a given (user-specified) disease. Semi-supervised learning is required as only 'positive' examples of genes known to be associated to a disease are provided. The remaining genes are effectively unlabelled, so the learning challenge is then to classify the unlabelled data as either 'associated' or 'not associated' to a given disease.

¹Vitsios, Dimitrios, and Slavé Petrovski. "Mantis-ml: Disease-Agnostic Gene Prioritization from High-Throughput Genomic Screens by Stochastic Semi-supervised Learning." The American Journal of Human Genetics 106.5 (2020): 659-678.

Phenome-wide Mantis-ML

Phenome-wide Mantis-ML systematically constructs these two inputs (the genes known to be associated to a disease, and the gene features implicated in it) for a vast array of diseases taken from three resources: the Human Phenotype Ontology (HPO), Open Targets (OT) and Genomics England (GEL).

With these two inputs in place, Phenome-wide Mantis-ML then predicts the probability of association for all other genes in the exome. More details regarding the semi-supervised learning are provided below (see also the original paper).

Data

Data sources

Diseases from three resources were examined. Each of these diseases can be queried in the main search bar.

Disease resources

Human Phenotype Ontology: 2575 diseases
Open Targets: 2500 diseases
Genomics England: 145 diseases

To collaborate on accessing/generating Mantis-ML results related to other phenotypes please get in touch.

In order to predict whether a gene is associated with a given disease, Mantis-ML leverages data from multiple sources. Gene measurements fall into two categories: generic or disease-specific. Generic measurements are those that are independent of the disease under consideration, for example genic-intolerance scores or measures of genetic variation. Some resources are used as a source of both generic features and disease/tissue-specific features.

Generic resources

ExAC: Exome Aggregation Consortium (ExAC) summarises genetic variation in approximately 60k samples.
Essential mouse genes: Capture genetic variation with the objective of identifying essential genes.
Genic-intolerance scores: Two types of genic-intolerance scores are integrated: Residual Variation Intolerance Score (RVIS) and Missense Tolerance Ratio (MTR).
GnomAD: The Genome Aggregation Database (gnomAD) summarises genetic variation across the whole genome.
InWeb_IM: Human protein-protein interaction network data. Protein-protein interactions are characterised as ‘inferred’ or ‘experimental’ based on the validation degree recorded for each interaction in the original analysis.
GWAS: Genome-wide Association (GWAS) data captures genetic associations of a gene across any disease.
MGI: Mouse Genome Informatics (MGI) summarises knockout effects in mice across any phenotype.

Disease/tissue specific resources

GTEx: Quantifies expression levels in different tissues.
GWAS: Genome-wide Association (GWAS) data captures genetic associations for a given disease.
MGI: Mouse Genome Informatics (MGI) summarises disease-specific knockout effects in mice.
MSigDB: Molecular Signatures Database (MSigDB)

The reference gene set was given as all genes annotated by the ExAC and gnomAD (v2) consortia (n=18,626), such that all genes have sufficient evidence and annotation.

Methods

Semi-supervised learning

The objective of the learning exercise is to classify each gene in the exome as either 'associated' or 'not associated' to a given disease (i.e. a binary classification problem). Semi-supervised learning provides a means to work with data that are only partially labelled, as is the case here. Given a user-specified disease, positive labels are given to those of the 18,626 exome genes that are known to be associated to that disease. The remaining genes are treated as unlabelled, so the objective of the learning exercise is to make predictions on these unlabelled data.

Identify positives: Assign positive labels to genes known to be associated to the given disease.
Repeat for L iterations: Perform the following on L random iterations.

Randomly partition all unlabelled data points into M subsets
Iterate on M balanced datasets:

Construct negative class: Assign all unlabelled data-points in the partition to the negative class.
Subset positive class: Randomly subsample 80% of all positive labelled data points.
Perform k-fold cross validation: Make predictions on each element of the balanced dataset through k-fold cross validation and retain for each gene the out-of-fold predictions only.

Construct the Mantis-ML score: The final score is then the average over all iterations in which the gene was classified out-of-fold. At most, this can be L x M x k.

The classifier used throughout was scikit-learn's Random Forest.
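The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Mantis-ML implementation: the function names, the toy data, and the rule used to choose M (roughly one partition per positive-sized chunk of unlabelled genes) are all hypothetical, and a simple nearest-centroid scorer stands in for scikit-learn's Random Forest so that the example needs only the standard library.

```python
import random
from collections import defaultdict
from statistics import mean


def centroid_classifier(train_X, train_y, test_X):
    """Toy stand-in (assumption): the source uses scikit-learn's Random Forest."""
    def centroid(rows):
        return [mean(col) for col in zip(*rows)]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    pos = centroid([x for x, y in zip(train_X, train_y) if y == 1])
    neg = centroid([x for x, y in zip(train_X, train_y) if y == 0])
    scores = []
    for x in test_X:
        d_pos, d_neg = dist(x, pos), dist(x, neg)
        # Closer to the positive centroid -> score nearer 1.
        scores.append(0.5 if d_pos + d_neg == 0 else d_neg / (d_pos + d_neg))
    return scores


def mantis_like_scores(features, positives, n_iter=3, k=3, seed=0,
                       fit_predict=centroid_classifier):
    """features: dict gene -> feature list; positives: set of known genes."""
    rng = random.Random(seed)
    unlabelled = [g for g in features if g not in positives]
    pos_list = sorted(positives)
    n_pos = max(1, int(0.8 * len(pos_list)))     # 80% of the positives
    n_parts = max(1, len(unlabelled) // n_pos)   # assumed choice of M
    oof = defaultdict(list)                      # gene -> out-of-fold scores

    for _ in range(n_iter):                      # L random iterations
        rng.shuffle(unlabelled)
        partitions = [unlabelled[i::n_parts] for i in range(n_parts)]
        for part in partitions:                  # M balanced datasets
            sampled_pos = rng.sample(pos_list, n_pos)
            genes = sampled_pos + part           # partition = negative class
            labels = [1] * len(sampled_pos) + [0] * len(part)
            order = list(range(len(genes)))
            rng.shuffle(order)
            for fold in range(k):                # k-fold cross validation
                test_idx = order[fold::k]
                held_out = set(test_idx)
                train_idx = [i for i in order if i not in held_out]
                preds = fit_predict(
                    [features[genes[i]] for i in train_idx],
                    [labels[i] for i in train_idx],
                    [features[genes[i]] for i in test_idx],
                )
                for i, p in zip(test_idx, preds):
                    oof[genes[i]].append(p)      # keep out-of-fold only
    # Final score: average of each gene's out-of-fold predictions.
    return {g: mean(s) for g, s in oof.items()}


# Hypothetical toy data: 10 known genes near (1, 1), 40 unlabelled near (0, 0).
features = {f"P{i}": [1.0 + 0.01 * i, 1.0] for i in range(10)}
features.update({f"U{i}": [0.01 * i, 0.0] for i in range(40)})
scores = mantis_like_scores(features, positives={f"P{i}" for i in range(10)})
```

The classifier is passed in as a callable so that a Random Forest (or any other model returning a probability per test gene) could be substituted without changing the partitioning and averaging logic.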

Constructing input data

Phenome-wide Mantis-ML provides a principled and scalable means to apply Mantis-ML to a vast array of diseases in the phenome. A prerequisite for this is automating the construction of the input data which has previously required laborious and lengthy curation. In particular, Phenome-wide Mantis-ML successfully exploits natural language processing (NLP) methods to provide this level of automation. Details around this automation procedure are provided in the paper.

Network visualisations and clustering

Based on the results of scaling Mantis-ML phenome-wide, we construct both gene and disease networks. Network construction relies on projecting data-points from a high-dimensional space down to the 2D plane. This can be achieved in the simplest case by principal component analysis, though this may miss some of the complex structure present in the original data. As a result, we instead opt for t-distributed stochastic neighbor embedding (t-SNE) to perform this projection. We provide two types of disease networks: the first uses association scores for each disease as the starting data, the second uses the feature importance obtained during the learning phase of Mantis-ML.

Gene networks: For gene networks, each gene is assigned a signature comprising the association scores between that gene and each disease in the phenome. As such, genes that are close in the network have similar phenotypic signatures.
Disease network (Mantis-ML scores): For this disease network, each disease is assigned a genetic signature comprising association scores between all genes and the given disease. Diseases that are close in this network therefore share a similar set of associations.
Disease network (Feature importance): For this disease network, each disease is characterised by the feature importance observed during the learning phase of Mantis-ML. Because Mantis-ML is at its core a classifier predicting whether or not a gene is associated to a disease, a secondary output is a measure of which features are useful for making that prediction. Feature importance is obtained after selecting a set of features to use based on semantic similarity.

It is also possible to identify gene clusters based on these signatures. Exploiting again the phenotypic signatures of each gene, k-means clustering is performed for a variety of values of k (the number of clusters). The results of this exercise are shown in the Gene Network page.
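As an illustration of the clustering step, the sketch below runs plain k-means (Lloyd's algorithm) on a handful of gene signatures for several values of k. It is a minimal standard-library version written for this page, not the code behind the Gene Network page; the signatures and gene groupings are hypothetical toy values, and the choice of initial centroids (a random sample of the points) is an assumption.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels


# Hypothetical signatures: each row holds one gene's scores across 2 diseases.
signatures = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
clusterings = {k: kmeans(signatures, k) for k in (2, 3)}
```

Each entry of `clusterings` maps a value of k to the cluster index assigned to every gene, which is the kind of result that could then be overlaid on the 2D network.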

Results

Disease results

Framed as a semi-supervised classification problem, it is possible to evaluate the classification area-under-curve (AUC) for each disease in each resource. We summarise these results in the accompanying boxplot. Furthermore, querying the resource for a specific disease returns a selection of results relating to Mantis-ML scores and subsequent analyses based on the predicted gene-disease associations. Each result on the disease-specific page is described as follows.

Mantis-ML scores: the predicted probability score for a given disease (range: 0 to 1). We also provide the normalised score (Z-score), which shows how a gene's association with a given disease compares against its association with all other diseases. Higher (and positive) Z-score values reflect stronger gene-disease associations that are not present in the rest of the diseases. For example, a Z-score of 1.5 means that a gene-disease association is 1.5 standard deviations higher than the average Mantis-ML score of that gene across all diseases.
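The normalisation can be illustrated with a short sketch. It assumes the Z-score is computed per gene across all diseases using the population standard deviation (the text does not say whether the population or sample form is used); the function name, disease names, and score values are hypothetical.

```python
from statistics import mean, pstdev


def normalised_scores(scores_by_disease):
    """Z-score of one gene's Mantis-ML score for each disease,
    relative to that same gene's scores across all diseases."""
    values = list(scores_by_disease.values())
    mu, sigma = mean(values), pstdev(values)  # population SD is an assumption
    if sigma == 0:
        # Identical scores everywhere: no disease stands out.
        return {d: 0.0 for d in scores_by_disease}
    return {d: (s - mu) / sigma for d, s in scores_by_disease.items()}


# Hypothetical scores for a single gene across three diseases.
gene_scores = {"disease_a": 0.2, "disease_b": 0.4, "disease_c": 0.9}
z = normalised_scores(gene_scores)
```

Here `disease_c` receives the largest positive Z-score because the gene's score for it sits furthest above the gene's own average.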

Gene Ontology (GO) enrichment: GO enrichment takes the top 5% most associated genes to the given disease and performs a GO enrichment for biological processes. The GO enrichment provides a statistical measure of overlap between this gene set and an alternative gene set (e.g. those annotated for a certain biological process), accounting for overlap occurring by chance alone. The result of this is to provide a p-value quantifying the significance of the overlap between the two gene sets, which is then converted into a PHRED score (-10log₁₀(p)) with larger values indicating greater significance.
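To make the overlap test and the PHRED conversion concrete, the sketch below uses a one-sided hypergeometric test. This is an assumption: the text does not name the test used, and the hypergeometric (Fisher-style) tail is simply a common choice for gene-set overlap. The function names and the numbers in the example are hypothetical.

```python
from math import comb, log10


def hypergeom_pvalue(overlap, top_size, set_size, universe):
    """P(X >= overlap) when top_size genes are drawn at random from a
    universe that contains set_size annotated genes."""
    upper = min(top_size, set_size)
    tail = sum(
        comb(set_size, x) * comb(universe - set_size, top_size - x)
        for x in range(overlap, upper + 1)
    )
    return tail / comb(universe, top_size)


def phred(p):
    """PHRED score, -10 * log10(p); p is floored to avoid log10(0)."""
    return -10 * log10(max(p, 1e-300))


# Hypothetical numbers: 5 of the 10 top-ranked genes fall in a 10-gene
# GO term, out of a universe of 100 genes.
p_value = hypergeom_pvalue(overlap=5, top_size=10, set_size=10, universe=100)
score = phred(p_value)
```

A smaller p-value gives a larger PHRED score, so a p-value of 0.001 corresponds to a PHRED score of 30.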

Enrichment with PheWAS: For the purposes of validating the Mantis-ML scores, a comparison is provided between these associations and those given by a PheWAS collapsing analysis based on UK Biobank data. The collapsing analysis provides gene-disease associations for a collection of different collapsing models or 'qualifying variants' (QV). In addition, these associations are available only for diseases in the ICD10 ontology, which in general does not correspond one-to-one with the diseases in each resource.

As a result, for the user-specified disease, it is necessary to isolate the n most similar diseases in the ICD10 ontology before comparing associations. For each of these n similar diseases, lists of associated genes are constructed based on the PheWAS collapsing analysis. This is repeated for each QV model in the collapsing analysis.

Enrichment between Mantis-ML scores and PheWAS results

Mantis-ML gene-disease associations are then compared to each of the n lists corresponding to the similar diseases. This is repeated for each collapsing model, resulting in n x m enrichment tests, where m is the number of collapsing models. To compare Mantis-ML's association scores with any given list, a step-wise enrichment analysis is performed, where the N, N+1, N+2... most associated genes according to Mantis-ML are compared to one of the n x m lists constructed above. The curves in the disease page show the significance of this overlap in terms of the PHRED score (-10log₁₀(p)) as a function of N for a given ICD10 disease, where the dropdown enables the user to choose between the most similar ICD10 diseases.
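A step-wise enrichment curve of this kind might be computed as in the sketch below. It assumes a one-sided hypergeometric test for each overlap (the test itself is not specified in the text), and the function name, gene identifiers, and reference list are hypothetical.

```python
from math import comb, log10


def enrichment_curve(ranked_genes, reference_list, start=1):
    """Return (N, PHRED) pairs comparing the top-N ranked genes with a
    reference list, for N = start, start + 1, ..., len(ranked_genes)."""
    universe = len(ranked_genes)
    reference = set(reference_list) & set(ranked_genes)
    n_ref = len(reference)
    # Overlap already accumulated before the first evaluated N.
    overlap = sum(g in reference for g in ranked_genes[:start - 1])
    curve = []
    for n in range(start, universe + 1):
        overlap += ranked_genes[n - 1] in reference
        # Hypergeometric tail: P(X >= overlap) for a random top-n set.
        tail = sum(
            comb(n_ref, x) * comb(universe - n_ref, n - x)
            for x in range(overlap, min(n, n_ref) + 1)
        )
        p = tail / comb(universe, n)
        curve.append((n, -10 * log10(max(p, 1e-300))))
    return curve


# Hypothetical example: 50 genes ranked by Mantis-ML score, with a PheWAS
# list that happens to contain exactly the five top-ranked genes.
ranked = [f"g{i}" for i in range(50)]
curve = enrichment_curve(ranked, reference_list=ranked[:5])
best_n, best_phred = max(curve, key=lambda point: point[1])
```

In this toy case the curve peaks at N = 5, where the top-ranked set coincides with the reference list, and falls back to zero once every gene is included.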

For a given ICD10 disease the best performing QV model is plotted alongside two negative controls. The first negative control is the same ICD10 disease but using the synonymous collapsing model as a biological negative control. The second negative control is the best performing QV model but for the most different ICD10 disease.

Genetic relevance box-plots: In addition to the enrichment curves that summarise the overlap between Mantis-ML associations and PheWAS associations for a particular ICD10 disease, it is possible to estimate the 'genetic relevance' of Mantis-ML's association scores, using PheWAS associations as the gold standard. This is performed by comparing a group of ICD10 codes that are similar to the input disease with a group that is randomly sampled from all ICD10 codes. More specifically, taking the enrichment curves generated above for the n most similar ICD10 diseases and the n least similar ICD10 diseases, summarising each curve with a single number (e.g. AUC) and aggregating over the n similar ICD10 codes, we are able to estimate where this aggregate ranks among codes chosen at random.

Comparison between most and least similar ICD10 codes

Aggregating the AUCs of enrichment curves between Mantis-ML genes and those significant under PheWAS results in a single 'true' statistic. We therefore establish the rank of this true statistic within a collection of null statistics, generated by randomly sampling ICD10 codes and repeating this aggregation process. The higher the rank among these codes, the stronger the overlap between Mantis-ML and semantically similar ICD10 codes.

Low ranks are not necessarily suggestive of low genetic support for the derived associations. Instead, they could merely reflect that there are not many semantically similar ICD10 codes that share a strong overlap with Mantis-ML. After aggregating over multiple ICD10 codes, any substantial overlap would therefore be diluted. Equally, calculation of the null statistics necessarily relies on random sampling which can itself introduce a degree of noise into the rank estimation. Finally, while the PheWAS rare-variant collapsing analysis has provided demonstrably robust associations in previous works, there is inevitably a degree to which the available sample size may be unable to capture the underlying genetic causality.
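The ranking of the true statistic against the null statistics can be sketched as follows. Several details are assumptions made for illustration: AUCs are aggregated by taking their mean, the rank is reported as the fraction of null statistics strictly below the true statistic, and the function name, ICD10-style code labels, and AUC values are hypothetical.

```python
import random
from statistics import mean


def genetic_relevance_rank(auc_by_code, similar_codes, n_null=1000, seed=0):
    """Rank the aggregated AUC of the similar ICD10 codes within null
    statistics built from randomly sampled codes of the same group size."""
    rng = random.Random(seed)
    true_stat = mean(auc_by_code[c] for c in similar_codes)
    all_codes = sorted(auc_by_code)
    null_stats = [
        mean(auc_by_code[c] for c in rng.sample(all_codes, len(similar_codes)))
        for _ in range(n_null)
    ]
    # Fraction of null draws that the true statistic beats (0 to 1).
    rank = sum(s < true_stat for s in null_stats) / n_null
    return true_stat, rank


# Hypothetical enrichment-curve AUCs for 50 ICD10 codes.
auc_by_code = {f"C{i:02d}": float(i) for i in range(50)}
true_stat, rank = genetic_relevance_rank(auc_by_code, ["C47", "C48", "C49"])
```

Because each null statistic is itself a random draw, repeating the calculation with a different seed will shift the rank slightly, which is the sampling noise referred to above.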

Feature importance: The feature importance provides a measure of which of the input features contributed most to the predictions.

Gene results

Querying a specific gene returns a set of association scores for that gene and each disease in the phenome (categorised by resource). Scores provide a measure of association between each disease in the phenome and the queried gene. Both Mantis-ML and normalised scores are provided.

Mantis-ML scores: the predicted probability score for a given disease (range: 0 to 1).
Mantis-ML normalised scores: the normalised score (Z-score), which shows how a gene's association with a given disease compares against its association with all other diseases. Higher (and positive) Z-score values reflect stronger gene-disease associations that are not present in the rest of the diseases. For example, a Z-score of 1.5 means that a gene-disease association is 1.5 standard deviations higher than the average Mantis-ML score of that gene across all diseases.