disgenet2r: An R package to explore the molecular underpinnings of human diseases

IBI group

2020-06-15


Introduction

The disgenet2r package contains a set of functions to retrieve, visualize and expand DisGeNET data. DisGeNET is a discovery platform that contains information about the genetic basis of human diseases (Piñero et al. 2015, 2017, 2019). DisGeNET integrates data from several expert-curated databases and from text mining of the biomedical literature.

The current version of DisGeNET (v7.0) contains 1134942 gene-disease associations (GDAs), between 21671 genes and 30170 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 369554 variant-disease associations (VDAs), between 194515 variants and 14155 diseases, traits, and phenotypes.

The information in DisGeNET is organized according to the original data source (Table 1). Diseases are identified using the UMLS concept unique identifier (CUI), but mappings to commonly employed biomedical vocabularies such as MeSH, OMIM, DO, HPO, and ICD-9 are also provided. The genes are identified using the NCBI Entrez Identifier, but annotations to the official gene symbol, the UniProt identifier, and the Panther Protein Class are also supplied. Finally, the GDAs and VDAs can be ranked using the DisGeNET score. The DisGeNET score ranges from 0 to 1 and takes into account the evidence supporting the association (see more information at http://disgenet.org/dbinfo/).

DisGeNET data is also represented as Resource Description Framework (RDF), which provides new opportunities for data integration, making it possible to link DisGeNET data to other external RDF datasets (Queralt-Rosinach et al. 2016).

Table 1: Sources of DisGeNET data

Source_Name       Type_of_data  Description
CTD_human         GDAs          The Comparative Toxicogenomics Database, human data
CGI               GDAs          The Cancer Genome Interpreter
CLINGEN           GDAs          The Clinical Genome Resource
GENOMICS_ENGLAND  GDAs          The Genomics England PanelApp
ORPHANET          GDAs          The portal for rare diseases and orphan drugs
PSYGENET          GDAs          Psychiatric disorders Gene association NETwork
HPO               GDAs          Human Phenotype Ontology
UNIPROT           GDAs/VDAs     The Universal Protein Resource
CLINVAR           GDAs/VDAs     ClinVar, public archive of relationships among sequence variation and human phenotype
GWASCAT           GDAs/VDAs     The NHGRI-EBI GWAS Catalog
GWASDB            GDAs/VDAs     The GWAS Database
CTD_mouse         GDAs          The Comparative Toxicogenomics Database, Mus musculus data
MGD               GDAs          The Mouse Genome Database
CTD_rat           GDAs          The Comparative Toxicogenomics Database, Rattus norvegicus data
RGD               GDAs          The Rat Genome Database
BEFREE            GDAs/VDAs     Data from text mining MEDLINE abstracts using the BeFree System (Bravo et al. 2015)
LHGDN             GDAs          Literature-derived human gene-disease network generated by text mining NCBI GeneRIFs (Bundschus et al. 2008)
CURATED           GDAs/VDAs     Human curated sources: CTD, ClinGen, CGI, UniProt, Orphanet, PsyGeNET, Genomics England PanelApp
INFERRED          GDAs          Inferred data from: HPO, ClinVar, GWASCat, GwasDB
ANIMAL_MODELS     GDAs          Data from animal models: CTD_rat, RGD, CTD_mouse, MGD
ALL               GDAs/VDAs     All data sources

Contact

For questions regarding disgenet2r, contact our support account at .

Installation and first run

The package disgenet2r is available through Bitbucket. The package requires an R version > 3.5. Additionally, the following packages are needed: VennDiagram, stringr, tidyr, SPARQL, RCurl, igraph, ggplot2, and reshape2.

Install disgenet2r by typing in R:
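A minimal sketch of the installation step. It assumes that the package is hosted in the Bitbucket repository ibi_group/disgenet2r and that devtools is used as the installer; neither detail is stated in the text, so adjust both to your setup. The call is guarded so that nothing is installed in a non-interactive session.

```r
# Assumption: the repository path and the use of devtools are illustrative.
repo <- "ibi_group/disgenet2r"

# Install only in an interactive session and only if the package is missing.
if (interactive() && !requireNamespace("disgenet2r", quietly = TRUE)) {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_bitbucket(repo)
}
```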

To load the package:
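A short sketch of loading the package. The availability check is an addition for robustness and not part of the package itself; a plain library(disgenet2r) call is enough once the package is installed.

```r
# Attach the package if it is installed; otherwise report that it is missing.
pkg_available <- requireNamespace("disgenet2r", quietly = TRUE)
if (pkg_available) {
  library(disgenet2r)
} else {
  message("disgenet2r is not installed.")
}
```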

In the following document, we illustrate how to use the disgenet2r package through a series of examples.

Retrieving Gene-Disease Associations from DisGeNET

Searching by gene

The gene2disease function retrieves the GDAs in DisGeNET for a given gene, or for a list of genes. The gene(s) can be identified by either the NCBI gene identifier or the official Gene Symbol, and the type of identifier used must be specified using the parameter vocabulary. By default, vocabulary = "HGNC"; to switch to Entrez Gene identifiers, set vocabulary = "ENTREZ".

The function also requires the user to specify the source database using the argument database. By default, all the functions in the disgenet2r package use CURATED as the source database, which includes GDAs from CTD (human data), PsyGeNET, Genomics England PanelApp, ClinGen, CGI, UniProt, and Orphanet.

The results can be filtered using the DisGeNET score. The argument score takes a range of scores as a two-element vector: the first element is the lower bound and the second element is the upper bound. Both bounds are included in the range. By default, score = c(0, 1).
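The inclusive behaviour of the score range can be illustrated on a toy data frame. This is only a sketch: the scores are invented, not DisGeNET data, and the filter mimics what the score argument is described to do rather than reproducing the package's implementation.

```r
# Toy associations with invented scores (not real DisGeNET values).
toy <- data.frame(
  disease = c("A", "B", "C", "D"),
  score   = c(0.1, 0.3, 0.7, 1.0),
  stringsAsFactors = FALSE
)

# A range is a two-element vector: lower bound first, upper bound second.
score_range <- c(0.3, 1)

# Keep rows whose score lies within the range, both ends included.
kept <- toy[toy$score >= score_range[1] & toy$score <= score_range[2], ]
kept$disease
```

Diseases B and D sit exactly on the bounds and are retained, which is the behaviour expected from an inclusive range.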

In the example, the query for the Leptin Receptor (Gene Symbol LEPR, and Entrez Identifier 3953) is performed in all databases in DisGeNET (database = "ALL").

The function gene2disease produces an object DataGeNET.DGN that contains the results of the query.
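The query described above might look as follows. This is a sketch: the argument names gene, vocabulary and database follow the text, the object name data1 is arbitrary, and the call needs the package installed and access to DisGeNET, so it is skipped when the package is missing.

```r
lepr_entrez <- 3953  # Entrez identifier of the Leptin Receptor gene

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  # Query all sources in DisGeNET for a single gene given by its Entrez identifier.
  data1 <- gene2disease(gene = lepr_entrez, vocabulary = "ENTREZ", database = "ALL")
  # Inspect the class of the returned object.
  print(class(data1))
}
```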

## [1] "DataGeNET.DGN"
## attr(,"package")
## [1] "disgenet2r"

Type the name of the object to display its attributes: the input parameters, such as whether a single entity or a list was searched (single or list), the type of entity (gene-disease), the selected database (ALL), the score range used in the search (0-1), and the NCBI gene identifier (3953).

## Object of class 'DataGeNET.DGN'
##  . Search:      single 
##  . Type:        gene-disease 
##  . Database:     ALL 
##  . Score:        0-1 
##  . Term:        3953 
##  . Results:  416

To obtain the data frame with the results of the query, apply the extract function:
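A sketch of this step, repeating the query so that the block is self-contained. The object names data1 and results, and the choice to show the first three rows, are illustrative.

```r
n_rows <- 3  # number of rows to display (illustrative choice)

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  data1 <- gene2disease(gene = 3953, vocabulary = "ENTREZ", database = "ALL")
  # extract() returns the results of the query as a data frame.
  results <- extract(data1)
  print(head(results, n_rows))
}
```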

##   disease_class score uniprotid                             disease_name gene_dpi diseaseid year_final
## 1                0.72    P48357               LEPTIN RECEPTOR DEFICIENCY    0.808  C3554225       2019
## 2       C23;C18  0.70    P48357                                  Obesity    0.808  C0028754       2019
## 3       C18;C19  0.60    P48357 Diabetes Mellitus, Non-Insulin-Dependent    0.808  C0011860       2019
##                                                                      disease_class_name geneid gene_symbol protein_class
## 1                                                                                         3953        LEPR  DTO_05007599
## 2    Pathological Conditions, Signs and Symptoms;    Nutritional and Metabolic Diseases   3953        LEPR  DTO_05007599
## 3                      Nutritional and Metabolic Diseases;    Endocrine System Diseases   3953        LEPR  DTO_05007599
##      ei     el source gene_dsi disease_type disease_semantic_type year_initial protein_class_name gene_pli
## 1 1.000 strong    ALL    0.433      disease   Disease or Syndrome         1998          Signaling  0.99475
## 2 0.930           ALL    0.433      disease   Disease or Syndrome         1966          Signaling  0.99475
## 3 0.975           ALL    0.433      disease   Disease or Syndrome         1966          Signaling  0.99475

The same query can be performed using the Gene Symbol (LEPR). Additionally, a minimum threshold for the score can be defined. In the example, a cutoff of score = c(0.3, 1) is imposed. Notice how the number of diseases associated to the Leptin Receptor drops from 416 to 79 when the score is restricted.
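A sketch of the query by Gene Symbol, using the score range 0.3 to 1 reported in the output that follows. The object name data1 is arbitrary, and the call is skipped when the package is not installed.

```r
score_range <- c(0.3, 1)  # lower and upper bound, both included

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  # Same gene as before, now identified by its official Gene Symbol.
  data1 <- gene2disease(gene = "LEPR", vocabulary = "HGNC",
                        database = "ALL", score = score_range)
  print(data1)
}
```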

## Object of class 'DataGeNET.DGN'
##  . Search:      single 
##  . Type:        gene-disease 
##  . Database:     ALL 
##  . Score:        0.3-1 
##  . Term:        LEPR 
##  . Results:  79

Visualizing the diseases associated to a single gene

The disgenet2r package offers two options to visualize the results of querying DisGeNET for a single gene: a network showing the diseases associated to the gene of interest (Gene-Disease Network), and a network showing the MeSH Disease Classes of the diseases associated to the gene (Gene-Disease Class Network). These graphics can be obtained by changing the class argument in the plot function.

By default, the plot function produces a Gene-Disease Network on a DataGeNET.DGN object (Figure 1). In the Gene-Disease Network the blue nodes are diseases, the pink nodes are genes, and the width of the edges is proportional to the score of the association. The prop parameter adjusts the width of the edges while keeping the proportionality to the score.
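A sketch of the default plot. The value passed to prop is an illustrative choice, and the query is repeated so that the block runs on its own; it is skipped when the package is not installed.

```r
edge_scale <- 20  # illustrative scaling factor for the edge width

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  data1 <- gene2disease(gene = "LEPR", vocabulary = "HGNC",
                        database = "ALL", score = c(0.3, 1))
  # With no class argument, plot() draws the Gene-Disease Network.
  plot(data1, prop = edge_scale)
}
```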

Figure 1: The Gene-Disease Network for the Leptin Receptor gene

The results can also be visualized in a network in which diseases are grouped by the MeSH Disease Class if the class argument is set to "DiseaseClass" (Gene-Disease Class Network, Figure 2). In the Gene-Disease Class Network, the node size is proportional to the fraction of diseases in the disease class, with respect to the total number of diseases with disease classes associated to the gene. In the example, the Leptin Receptor is associated mainly to Nutritional and Metabolic Diseases. There is 1 disease in the example that does not have annotations to a MeSH disease class (shown as a warning).
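A sketch of the Gene-Disease Class Network plot. Only the class argument named in the text is set; the query is repeated so that the block is self-contained.

```r
plot_class <- "DiseaseClass"  # value of the class argument described in the text

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  data1 <- gene2disease(gene = "LEPR", vocabulary = "HGNC",
                        database = "ALL", score = c(0.3, 1))
  # Group the associated diseases by MeSH Disease Class.
  plot(data1, class = plot_class)
}
```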

## [1] "warning: 1 disease(s) not shown in the plot"
Figure 2: The Disease Class Network for the Leptin Receptor Gene

Searching multiple genes

The gene2disease function can also receive a list of genes as input, either as Entrez Gene Identifiers or Gene Symbols. In the example, we show how to create a vector with the Gene Symbols of several genes belonging to the family of voltage-gated potassium channels (Table 2) and then, we apply the function gene2disease.

Table 2: Example of voltage-gated potassium channel family members

Name   Description
KCNE1  potassium channel, voltage gated subfamily E regulatory beta subunit 1
KCNE2  potassium channel, voltage gated subfamily E regulatory beta subunit 2
KCNH1  potassium channel, voltage gated eag related subfamily H, member 1
KCNH2  potassium channel, voltage gated eag related subfamily H, member 2
KCNG1  potassium voltage-gated channel modifier subfamily G member 1

Creating the vector with the list of genes belonging to the voltage-gated potassium channel family.
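A sketch of this step, using the Gene Symbols from Table 2. The variable name myListOfGenes matches the one that appears in the warning message shown further below.

```r
# Gene Symbols of the voltage-gated potassium channel family members in Table 2.
myListOfGenes <- c("KCNE1", "KCNE2", "KCNH1", "KCNH2", "KCNG1")
length(myListOfGenes)
```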

The gene2disease function also requires the user to specify the source database using the argument database, and optionally, the DisGeNET score can also be applied to filter the results.
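The call below mirrors the one echoed in the warning message that follows. It is a sketch: the database argument is left at its default (CURATED), the object name data2 is arbitrary, and the call is skipped when the package is not installed.

```r
myListOfGenes <- c("KCNE1", "KCNE2", "KCNH1", "KCNH2", "KCNG1")

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  # Query the default database for all genes, keeping scores from 0.2 to 1.
  data2 <- gene2disease(gene = myListOfGenes, score = c(0.2, 1), verbose = TRUE)
  print(data2)
}
```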

## Warning in gene2disease(gene = myListOfGenes, score = c(0.2, 1), verbose = TRUE): 
##  One or more of the genes in the list is not in DisGeNET ( 'CURATED' ):
##    - KCNG1
## Object of class 'DataGeNET.DGN'
##  . Search:      list 
##  . Type:        gene-disease 
##  . Database:     CURATED 
##  . Score:        0.2-1 
##  . Term:       KCNE1 ... KCNH2 
##  . Results:  51

Visualizing the diseases associated to multiple genes

By default, plotting a DataGeNET.DGN object resulting from a query with a list of genes produces a Gene-Disease Network where the blue nodes are diseases, the pink nodes are genes, and the width of the edges is proportional to the score of the association (Figure 3).

Figure 3: The Gene-Disease Network for a list of genes belonging to the voltage-gated potassium channel family

Setting the argument class to "Heatmap" produces a Gene-Disease Heatmap (Figure 4), where the scale of colors is proportional to the score of the GDA. The argument limit can be used to limit the number of rows to the top scoring GDAs. By default, the plot shows the 50 highest scoring GDAs.
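A sketch of the heatmap plot. The value given to limit is an illustrative choice, and the query is repeated so that the block is self-contained.

```r
max_rows <- 20  # illustrative cap on the number of top scoring GDAs shown

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  genes <- c("KCNE1", "KCNE2", "KCNH1", "KCNH2", "KCNG1")
  data2 <- gene2disease(gene = genes, score = c(0.2, 1))
  # Draw the Gene-Disease Heatmap restricted to the top scoring associations.
  plot(data2, class = "Heatmap", limit = max_rows)
}
```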

Figure 4: The Gene-Disease Heatmap for a list of genes belonging to the voltage-gated potassium channel family

These results can also be visualized as a Gene-Disease Class Heatmap by setting the argument class to "DiseaseClass" (Figure 5). In this case, diseases are grouped by their MeSH disease classes, and the colour scale is proportional to the percentage of diseases in each MeSH disease class. In the example, genes are associated mainly to Cardiovascular Diseases, and to Congenital, Hereditary, and Neonatal Diseases and Abnormalities.
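A sketch of the Gene-Disease Class Heatmap for the same list of genes. The query is repeated so that the block runs on its own; it is skipped when the package is not installed.

```r
genes <- c("KCNE1", "KCNE2", "KCNH1", "KCNH2", "KCNG1")

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  data2 <- gene2disease(gene = genes, score = c(0.2, 1))
  # For a list of genes, class = "DiseaseClass" groups diseases by MeSH class.
  plot(data2, class = "DiseaseClass")
}
```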

## [1] "warning: 3 disease(s) not shown in the plot"
Figure 5: The Gene-Disease Class Heatmap for a list of genes belonging to the voltage-gated potassium channel family

Searching by disease

The disease2gene function retrieves the genes associated to a disease, or to a list of diseases. The function takes as input the disease, or list of diseases, of interest (as UMLS CUI, MeSH, OMIM, Disease Ontology, ICD9CM, NCIt, EFO, or Orphanet identifiers), the disease vocabulary employed (OMIM (OMIM), MESH (MeSH), ICD9CM (ICD9-CM), DO (Disease Ontology), NCI (NCI thesaurus), ORDO (Orphanet), or EFO (EFO)), and the database (by default, CURATED). A threshold value for the score can be set, as in the gene2disease function.

In the example, we will use the disease2gene function to retrieve the genes associated to the UMLS CUI C0036341 (Schizophrenia). This function also receives as input the database (in the example, CURATED) and a score range (in the example, from 0.4 to 1).
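A sketch of this query. The argument name disease is an assumption (the text does not spell it out), the object name data3 is arbitrary, and the vocabulary is left at its default on the assumption that UMLS CUIs are accepted without further settings.

```r
schizophrenia_cui <- "C0036341"  # UMLS CUI used in the example

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  # Assumption: the first argument of disease2gene() is named 'disease'.
  data3 <- disease2gene(disease = schizophrenia_cui,
                        database = "CURATED", score = c(0.4, 1))
  print(data3)
}
```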

## Object of class 'DataGeNET.DGN'
##  . Search:      single 
##  . Type:        disease-gene 
##  . Database:     CURATED 
##  . Score:        0.4-1 
##  . Term:        C0036341 
##  . Results:  227

The same results are obtained when querying DisGeNET with the MeSH identifier for Schizophrenia (D012559).
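A sketch of the same query with a MeSH identifier. The vocabulary value MESH is taken from the list of vocabularies given earlier; the argument name disease is an assumption. Queries with other vocabularies would presumably differ only in the identifier and the vocabulary value.

```r
mesh_id <- "D012559"  # MeSH identifier for Schizophrenia

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  # Assumption: the first argument of disease2gene() is named 'disease'.
  data3 <- disease2gene(disease = mesh_id, vocabulary = "MESH",
                        database = "CURATED", score = c(0.4, 1))
  print(data3)
}
```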

## Object of class 'DataGeNET.DGN'
##  . Search:      single 
##  . Type:        disease-gene 
##  . Database:     CURATED 
##  . Score:        0.4-1 
##  . Term:        D012559 
##  . Results:  227

The same results are obtained when querying DisGeNET with the OMIM identifier for Schizophrenia (181500).

## Object of class 'DataGeNET.DGN'
##  . Search:      single 
##  . Type:        disease-gene 
##  . Database:     CURATED 
##  . Score:        0.4-1 
##  . Term:        181500 
##  . Results:  227

The same results are obtained when querying DisGeNET with the ICD9-CM identifier for Schizophrenia (295).

## Object of class 'DataGeNET.DGN'
##  . Search:      single 
##  . Type:        disease-gene 
##  . Database:     CURATED 
##  . Score:        0.4-1 
##  . Term:        295 
##  . Results:  227

The same results are obtained when querying DisGeNET with the NCI identifier for Schizophrenia (C3362).

## Object of class 'DataGeNET.DGN'
##  . Search:      single 
##  . Type:        disease-gene 
##  . Database:     CURATED 
##  . Score:        0.4-1 
##  . Term:        C3362 
##  . Results:  227

The same results are obtained when querying DisGeNET with the DO identifier for Schizophrenia (5419).

## Object of class 'DataGeNET.DGN'
##  . Search:      single 
##  . Type:        disease-gene 
##  . Database:     CURATED 
##  . Score:        0.4-1 
##  . Term:        HP:0100753 
##  . Results:  227

Visualizing the genes associated to a single disease

There are two options to visualize the results from searching a single disease: a Gene-Disease Network showing the genes related to the disease of interest (Figure 6), and a Disease-Protein Class Network with the genes grouped by Panther Protein Class (Figure 7).

Figure 6 shows the default Gene-Disease Network for Schizophrenia. As in the case of the gene2disease function, the blue node is the disease, the pink nodes are genes, and the width of the edges is proportional to the score of the association.
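A sketch of the default plot for a single disease. The argument name disease and the value given to prop are assumptions made for illustration; the query is repeated so that the block is self-contained.

```r
edge_scale <- 10  # illustrative scaling factor for the edge width

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  data3 <- disease2gene(disease = "C0036341",
                        database = "CURATED", score = c(0.4, 1))
  # With no class argument, plot() draws the Gene-Disease Network.
  plot(data3, prop = edge_scale)
}
```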

Figure 6: The Gene-Disease Network for genes associated to Schizophrenia

Alternatively, in the Disease-Protein Class Network, genes are grouped by the Panther Protein Class (Figure 7). This is a better choice when there is a large number of genes associated to the disease. This plot uses "ProteinClass" as the class argument. The resulting network shows the disease in blue and the Protein Classes of the genes associated to the disease in green. The node size is proportional to the number of genes in the Panther Protein Class. In the example, the largest proportion of the genes associated to Schizophrenia are receptors. Notice again that not all genes have annotations to Panther Protein classes (89 genes in Figure 7).
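A sketch of the Disease-Protein Class Network plot. Only the class argument named in the text is set; the argument name disease is an assumption.

```r
plot_class <- "ProteinClass"  # value of the class argument described in the text

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  data3 <- disease2gene(disease = "C0036341",
                        database = "CURATED", score = c(0.4, 1))
  # Group the associated genes by Panther Protein Class.
  plot(data3, class = plot_class)
}
```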

## [1] "warning: 89 gene(s) not shown in the plot"
Figure 7: The Protein Class-Disease Network for genes associated to Schizophrenia

Searching multiple diseases

The disease2gene function also accepts as input a list of diseases (as UMLS CUI, MeSH, OMIM, Disease Ontology, Orphanet, Decipher, or ICD9CM identifiers), the database (by default, CURATED), and optionally, a value range for the score. In the example, we have selected a list of four diseases. Table 3 shows the UMLS CUIs and the corresponding disease names.

Table 3: Disease list selected for illustrating the disease2gene multiple search

UMLS_CUI  Disease_Name
C0036341  Schizophrenia
C0002395  Alzheimer’s Disease
C0030567  Parkinson Disease
C0005586  Bipolar Disorder

Creating the vector with the list of diseases.
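A sketch of this step. The variable name diseasesOfInterest is arbitrary, and the CUI used for Alzheimer's Disease is C0002395, its UMLS concept identifier.

```r
# UMLS CUIs for the diseases of interest.
diseasesOfInterest <- c(
  "C0036341",  # Schizophrenia
  "C0002395",  # Alzheimer's Disease
  "C0030567",  # Parkinson Disease
  "C0005586"   # Bipolar Disorder
)
length(diseasesOfInterest)
```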

In the example, we will search in CURATED data, using a score range of 0.4-1.
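A sketch of the multiple-disease query. The argument name disease and the object name data4 are assumptions; the call is skipped when the package is not installed.

```r
diseasesOfInterest <- c("C0036341", "C0002395", "C0030567", "C0005586")
score_range <- c(0.4, 1)  # lower and upper bound, both included

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  # Assumption: the first argument of disease2gene() is named 'disease'.
  data4 <- disease2gene(disease = diseasesOfInterest,
                        database = "CURATED", score = score_range)
  print(data4)
}
```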

Visualizing the genes associated to multiple diseases

The default plot of the results of querying DisGeNET with a list of diseases produces a Gene-Disease Network where the blue nodes are diseases, the pink nodes are genes, and the width of the edges is proportional to the score of the association (Figure 8).
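A sketch of the default plot for a list of diseases. The argument name disease and the value given to prop are assumptions made for illustration; the query is repeated so that the block is self-contained.

```r
edge_scale <- 10  # illustrative scaling factor for the edge width

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  diseases <- c("C0036341", "C0002395", "C0030567", "C0005586")
  data4 <- disease2gene(disease = diseases,
                        database = "CURATED", score = c(0.4, 1))
  # With no class argument, plot() draws the Gene-Disease Network.
  plot(data4, prop = edge_scale)
}
```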