The DisGeNET-RDF Linked Dataset is an alternative way to access the DisGeNET data and provides new opportunities for data integration, querying and integrating DisGeNET data to other external RDF datasets.
The DisGeNET RDF API has been selected among the 10 ELIXIR Recommended Interoperability Resources ELIXIR announced its first portfolio of Recommended Interoperability Resources (RIRs) to facilitate interoperability and reusability of life science data and support the principles of FAIR data management on December 2018. The full list of ELIXIR Recommended Interoperability Resources is available here.
The RDF version of DisGeNET has been developed in the context of the Open PHACTS project to provide disease relevant information to the knowledge base on pharmacological data. DisGeNET-RDF has been integrated in the Open PHACTS Discovery Platform among other resources such as ChEMBL, WikiPathways and neXtProt. Aimed at exploring and querying DisGeNET data across the linked data in the platform, APIs are currently available in the Open PHACTS API v1.5 (see the OPS API Web site for up to date information).
To perform faceted and precise searches the DisGeNET-RDF linked data is accessible via a Faceted browser.
In addition, DisGeNET-RDF linked data can be accessed for question-answering via a SPARQL endpoint . An alternative SPARQL interaction with the DisGeNET-RDF data is via a LODEStar interface here, which is a SPARQL endpoint and linked data browser for querying and browsing RDF datasets developed in the EBI. Furthermore, some DisGeNET queries are available at Bioqueries. See the SPARQL Endpoint Example Queries section for more details and query examples.
The RDF Linked Dataset is accompanied with a full dataset description, which is compliant with the W3C HCLS specification. For more information on the dataset description of the RDF Dataset go to Metadata Description section.
To download the dump files please, go to the Data Downloads section.
The RDF distribution of DisGeNET includes new annotation and new linksets:
- All linksets updated, and all ontologies updated.
- Changed the data model for the variant attributes
Linked Dataset Description
There are four main components in the RDF dataset: GDA content, VDA content, metadata description of the RDF dataset (VoID description), and linkouts to other Linked Datasets. The current RDF representation of DisGeNET (v6.0.0) has 62,359,775 triples serialized in Turtle syntax. The triples annotate 628,685 gene-disease associations (GDAs), 17,549 genes, and 24,166 diseases, disorders, traits, and clinical or abnormal human phenotypes and 210,498 variant-disease associations (VDAs), between 117,337 variants and 10,358 diseases, traits, and phenotypes. The RDF graph model is centered around two concepts the GDA concept and the VDA concept and their attributes. The genes are identified by the NCBI Entrez identifiers and the diseases are identified by the UMLS CUI and the variants are identified by the dbSNP id. Entities and properties are semantically defined using standard ontologies such as the National Cancer Institute thesaurus (NCIt), and resources identified by using de-referenceable IRIs. GDAs are integrated using the DisGeNET Association Type Ontology and they are semantically harmonized using SIO classes (see the DisGeNET ontology section below).
A full dataset description of the RDF Linked Dataset is provided using among others the Vocabulary of Interlinked Datasets (VoID), an RDF Schema W3C recommended vocabulary for expressing metadata about RDF datasets. This dataset description, which is compliant with the W3C HCLS specification and the Open PHACTS specification, includes the provenance of the DisGeNET relational database, the primary databases, and the BeFree text mining tool (see the DisGeNET VoID file description). The type of curation and level of evidence of each original database are also tracked and annotated. Each data instance in DisGeNET is explicitly referenced to this dataset description in order to granulate and trace back the provenance to the instance level.
In addition, linkouts to the LOD are set in order to both enrich DisGeNET GDAs annotations with external Semantic Web resources, and to extend the current GDAs content of the Web of knowledge. Specifically, a total number of 4,962,315 linksets to the LOD through Bio2RDF, linked life data network projects among others exists in the current version. All entities linked are related using the same SKOS predicate skos:exactMatch. Other linkset statistics between entities can be found at the DisGeNET DataHub site in the DataHub registry. Consequently, DisGeNET appears in the last update of the LOD cloud diagram (2014-08-30 update). This diagram shows datasets published in Linked Data format and it is built based on their metadata description on the DataHub as well as on metadata extracted from a crawl of the Linked Data Web.
The data model of the RDF representation of DisGeNET is shown below. Click on the picture to zoom in.
In this new release, GDAs are now identified by "303 URIs" following the W3C recommendation to build URIs for the Semantic Web. Each GDA is defined by a unique combination of a gene (NCBI GeneID), a disease (UMLS CUI), an association type defined by our ontology (see section below), a data source of provenance, and a PubMed article (PMID) giving evidence to the gene-disease association. A unique identifier based on Universally Unique Identifiers (UUID) generated by a cryptographic hash function, is established for each GDA. The DisGeNET GDA ID is composed by: 'DGN' + UUID, e.g. DGN7ab3d8cae0c9f1150cb65a985aa8c0a1. The new namespace is 'http://rdf.disgenet.org/resource/gda/'. The new GDA IRI pattern is: namespace + DisGeNET ID,
For an example of triples related to a single gene-disease association in DisGeNET, see here.
The DisGeNET Association Type Ontology
The DisGeNET Association Type Ontology was developed in our group to fill the gap in formal semantics for the definition of types of associations described between a gene and a disease in biological databases. This ontology was generated using all terms provided by the GDAs original databases. It is an OWL ontology that can be accessed at GeneDiseaseAssociation.owl. The DisGeNET ontology is integrated into the Sematicscience Integrated Ontology (SIO), which is an OWL ontology that provides essential types and relations for the rich description of objects, processes and their attributes [PDF]. You can check SIO gene-disease association classes from this URL or download the entire SIO OWL-DL ontology file . The SIO ontology can be also accessed at the NCBO Bioportal. DisGeNET GDAs in RDF are semantically harmonized using SIO classes.
Access to the RDF Linked Dataset
The DisGeNET-RDF data dump and the VoID description file are accessible to download.
DisGeNET-RDF linked data can be navigated via a Faceted browser.
DisGeNET-RDF data are accessible using the query language SPARQL via our public SPARQL endpoint. The dataset is stored in a Virtuoso's QUAD Store in which the name of the graph is 'http://rdf.disgenet.org'. It is powered by Virtuoso open-source v7.1.0.
An alternative SPARQL interaction with the DisGeNET-RDF data is via a LODEStar interface at the DisGeNET LODEStar Endpoint, which is a SPARQL endpoint and linked data browser for querying and browsing RDF datasets developed in the EBI.
The DisGeNET-RDF dataset is deployed in the graph: 'http://rdf.disgenet.org'.
The namespaces required to query DisGeNET are:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX ncit: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dctypes: <http://purl.org/dc/dcmitype/>
PREFIX wi: <http://http://purl.org/ontology/wi/core#>
PREFIX eco: <http://http://purl.obolibrary.org/obo/eco.owl#>
PREFIX prov: <http://http://http://www.w3.org/ns/prov#>
PREFIX pav: <http://http://http://purl.org/pav/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
RDF Entity Examples
In order to help the user to query DisGeNET RDF data, for each type of entity represented in DisGeNET we provide an example of its RDF annotation serialized in Turtle syntax, see here.
Access DisGeNET via ontology
To facilitate the retrieval of data, several ontologies are deployed in the quad store in order to perform question/answering walking the ontologies. The deployed ontologies are:
- The Semanticscience Integrated Ontology (SIO),
- the Human Disease Ontology (DO),
- the Orphanet Rare Disease Ontology (ORDO),
- the NCI thesaurus (NCIt),
- the Human Phenotype Ontology (HPO),
- the Experiment Factor Ontology (EFO).
- the Evidence Code Ontology (ECO).
Please, note the coverage of DisGeNET with other disease terminologies summarized in the disease table in downloads.
SPARQL Endpoint Example Queries
The purpose of DisGeNET linked dataset is to enable richer queries over the data. Below we provide examples of how to explore DisGeNET data.
#Retrieve the reference allele the possible alternatives and the the alternative frequency if available for the variant rs5035.
SPARQL Endpoint Example Federated Queries
The purpose of the Federated queries is to integrate DisGeNET data with other Linked Datasets in the LOD cloud. The DisGeNET SPARQL endpoint supports federated queries over other linked datasets such as UniProt, Gene Expression Atlas (GXA), and WikiPathways. The DisGeNET SPARQL endpoint supports the syntax and semantics of SPARQL 1.1 for executing queries distributed over different SPARQL endpoints. Below we provide examples of how to explore perform federated queries.DisGeNET + WikiPathways DisGeNET + other SPARQL endpoints (ChEMBL, Ensembl, Uniprot,..)
FED1: DisGeNET + WikiPathways (queries made in collaboration with the WikiPathways RDF team. Thanks!!!)
Query 2.3: Retrieve the pathways associated with 'Schizophrenia', and show the number Schizophrenia genes in each pathway# Give me all pathways in WikiPathways and the total number of disease genes in each pathway for 'Schizophrenia' (MeSH:D012559). We will consider associations from CURATED sources with DisGeNET score greater than 0.35. Output the disease name, WikiPathways pathway ID, pathway name, and the number of disease genes.
Query 2.5: Retrieve the number of genes for 'Bardet-Biedl Syndrome' disease and indicate the number of genes present in pathways# For 'Bardet-Biedl Syndrome' disease (MeSH:D020788), give me the total number of associated genes in DisGeNET and the total number of these genes in WikiPathways. Output the disease name, the total number of disease genes in Wikipathways and the total number of disease genes in DisGeNET.
Query 2.6: Retrieve the genes and the pathways associated with 'Bardet-Biedl Syndrome' disease. In addition, list all the genes involved in each of the pathways found# For 'Bardet-Biedl Syndrome' disease (MeSH:D020788), retrieve from DisGeNET the genes, the pathway(s) in which the gene is involved from WikiPathways, and all the genes present in each of these pathways. Output the disease name, gene in DisGeNET, pathway ID, and gene in Wikipathways.
Query 2.7: Retrieve the total number of both disease genes and all genes involved in each pathway for the 'Bardet-Biedl Syndrome'# For 'Bardet-Biedl Syndrome' disease (MeSH:D020788), retrieve from DisGeNET the number of disease genes in a pathway in WikiPathways, the pathway ID, and the number of all genes in each of these pathway. Output the disease name, DisGeNET genes in the pathway, pathway ID, and all genes in the Wikipathways pathway.
Query 2.8: Retrieve the total number of both disease genes and all genes involved in each disease pathway and secondary pathways for 'Bardet-Biedl Syndrome'# For 'Bardet-Biedl Syndrome' disease (MeSH:D020788), retrieve from DisGeNET the number of disease genes in a pathway in WikiPathways, the pathway ID or let's call it disease pathway, the number of all genes shared between the disease pathway and a secondary pathway, and the secondary pathway ID. Output the disease name, DisGeNET genes in the pathway, disease pathway ID, and all genes in the disease pathway shared with another pathway, and the secondary pathway ID.
Query 2.9: Retrieve the number of disease genes, the number of all genes, and the number of secondary pathways in each disease pathway for 'Bardet-Biedl Syndrome'# For 'Bardet-Biedl Syndrome' disease (MeSH:D020788), retrieve from DisGeNET the number of disease genes in a pathway in WikiPathways, the pathway ID or let's call it disease pathway, the number of all genes in the disease pathway, and the number of secondary pathways annotated to genes in each disease pathway. Output the disease name, DisGeNET genes in the pathway, disease pathway ID, the number of all genes in the disease pathway, and the number of secondary pathways.
EBI RDF Source
Attention this example must be executed from the EBI-Ensembl sparql endpoint.
# Give me all diseases in DisGeNET associated with a gene in Ensembl with EnsemblGeneId ENSG00000169057 (MECP2). Output the Ensemble gene ID, the disease, the gene-disease association and the score for this association.
About RDF, Linked Data, Semantic Web technologies
Good introductions to the field are:
- Wikipedia, always a good place to start!
- W3C, look at whom develops the standards. The World Wide Web Consortium (W3C) is an international community where Member organizations, a full-time staff, and the public work together to develop Web standards.
- EBI RDF Platform documentation, the EBI is one of the major Linked Data providers of Life Sciences. There is a comprehensive documentation in its website that is worth reading.
NanopublicationsThe Integrative Biomedical Informatics Group is pleased to announce the second publication of the DisGeNET Nanopublications that is a Linked Dataset implemented in combination of the nanopublication approach [nanopub.org] and the Trusty URIs technique [PDF]. It is an alternative way to mine statements about gene-disease associations contained in DisGeNET. Nanopublications are a new way of publishing structured data that allows the tracking of provenance along with the scientific statement. The Trusty URIs is a novel technique to make resources in the Web immutable and verifiable, and to ensure the unambiguity of the data linking in the (semantic) Web. This new Linked Dataset provides nanopublications about scientific statements of human GDAs. These GDAs published as Trusty URI nanopublications are machine-interpretable, immutable, permanent, and verifiable. Each GDA statement has its provenance description providing evidence, attribution, creation time, and further context of its creation. Each GDA is classified as “CURATED”, “PREDICTED”, or “LITERATURE” in the DisGeNET context to categorize the evidence of the statement based on the type of assertion and curation made in the original databases. DisGeNET nanopublications include metadata annotations about the general topic of the nanopublications, i.e. ‘Gene-Disease Association’, semantically described by SIO to facilitate its discoverability in the Semantic Web (see PDF).
Linked Dataset DescriptionThe third release of DisGeNET published as nanopublications is a distribution of DisGeNET v4.0 (Nanopublications version v18.104.22.168). The dataset consists of 1,414,902 nanopublications, representing the same number of scientific statements for 429,036 different GDAs with their detailed provenance, levels of evidence and publication information descriptions, all annotated as RDF statements and encapsulated into the nanopublication RDF graphs (5,659,608 graphs in total). Specifically, the dataset is composed of 48,106,668 N-Quads, i.e. RDF triples with their graph (or “context”) added as the fourth member in the tuple (Subject, Predicate, Object, Context), everything being serialized in TriG syntax.
DisGeNET Nanopublication Schema
The official guidelines to create nanopublications were used. A DisGeNET nanopublication is modeled by 4 named graphs: head, assertion, provenance and publication information. The head graph defines the structure of the nanopublication by linking to the other graph URIs. The assertion graph contains the description for a specific single GDA assertion. The provenance graph includes provenance, evidence and attribution statements that were directly mapped from the VoID description of the RDF dataset. Finally, the publication information graph includes all the metadata information regarding the nanopublication itself, see figure below (Click on the image to zoom in). The source of data for the DisGeNET nanopublications set is the RDF Linked Dataset version of DisGeNET. To implement Trusty URIs, the GitHub Java implementation was used.
Access to the Nanopublications Linked Dataset
DisGeNET nanopublications can be accessed in two ways: they can be downloaded as a file in TriG format from the download section, and they are deployed in a new decentralized nanopublication server network, which is a distributed server network with a REST API to provide and propagate nanopublications identified by trusty URIs [ref( Kuhn et. al. 2015 )]. DisGeNET nanopublications are registered in datahub with other datasets formatted as nanopubublications. New: For performance reasons DisGeNET nanopublications are not accessible anymore via our SPARQL endpoint. To download the current dataset, which is the nanopublication distribution of the DisGeNET v4.0: nanopubs-v22.214.171.124.
For performance reasons DisGeNET nanopublications are not accessible anymore via our SPARQL endpoint.
To download the current dataset, which is the nanopublication distribution of the DisGeNET v4.0: nanopubs-v126.96.36.199.
SPARQL Example queriesDisGeNET nanopublications can be explored using the query language SPARQL via a SPARQL endpoint. With illustrative queries we show how to explore GDAs with DisGeNET nanopublications and how to integrate them with relationships published in other LOD sources. As example we can query DisGeNET nanopubs to answer the following question:
What are the proteins (and their protein interactions) associated to Alzheimer Disease with curated evidence?
Query 1.1: Retrieving Gene-Disease Associations# First, we query DisGeNET for all the genes associated to Alzheimer Disease (umls:C0002395). This query only involves the assertion graph.
Query 1.2: Filtering By Evidence# Second, we filter the prior results with those assertions annotated as CURATED DisGeNET evidence. This query involves the provenance graph.
Query 1.3: Linking with Other LOD Resources# Finally, we cross DisGeNET prior results with the Interaction Reference Index database data, which contains protein-protein interactions (PPI) annotations, through Bio2RDF::irefindex SPARQL endpoint, federating the query. Since in DisGeNET-RDF is also represented the relation between gene and the protein/s that encodes, we are able to cross DisGeNET with Bio2RDF::irefindex by Protein resources through the corresponding linkset to 'http://bio2rdf.org/uniprot:UniProtID'.
The RDF distribution of DisGeNET includes new annotation and new linksets:
- All linksets updated, i.e. all ontologies updated.
- Disease-phenotype annotation data have been integrated from 3 different sources:
- Annotations from the Human Phenotype Ontology
- Text-mined annotations from The Human Phenotype Ontology: Semantic Unification of Common and Rare Disease. Groza et al., 2015
- Text-mined annotations from Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases. Hoehndorf et al., 2015.
- RDF enhancement and data model changes
DisGeNET v4.0 RDF Release Information
The RDF distribution of DisGeNET includes all DisGeNET v4.0 new content, besides new annotation and new linksets:
- More GDAs comprising more than 17,000 genes and 15,000 diseases as linked data in the Semantic Web.
- All linksets updated, i.e. all ontologies updated.
- Disease-phenotype annotation data from the Human Phenotype Ontology.
- New linksets to the Experimental Factor Ontology (EFO).
- New annotation: all diseases annotated to the original term(s) of provenance.
- EFO is also deployed in our SPARQL endpoint such as the Human Disease Ontology, the Human Phenotype Ontology and ORDO, in order to perform queries walking the ontology hierarchy. See examples in the SPARQL section.
- RDF enhancement, data model changes, and fixed bugs:
- Updated RDF Schema to encompass new annotations.
- Changed formal description of the property linking diseases to their phenotypic profile: sio:'is manifested as' (sio:SIO_000341) replaced by sio:'has phenotype' (sio:SIO_001279).
- Fixed data typing language: language tags correctly added on RDF Literal data types.
DisGeNET v3.0 RDF Release InformationThe RDF distribution of DisGeNET includes new annotation and new linksets:
- More GDAs comprising 17 000 genes and more than 14 000 diseases as linked data in the Semantic Web.
- All linksets updated, i.e. all ontologies updated.
- New disease-phenotype annotation data from the Human Phenotype Ontology.
- New linksets to NCI Thesaurus, Orphanet Rare Disease Ontologies (ORDO), and DECIPHER.
- New taxonomic annotation: all GDAs annotated to the Homo sapiens (Human) taxon.
- New full metadata description of the dataset compliant with the W3C HCLS and the Open PHACTS specifications (the Open PHACTS specifications specially used for linkset descriptions).
- More mappings to the Linked Open Data cloud.
- New and alternative LODEStar SPARQL access.
- New types of searches: six ontologies are deployed in our SPARQL endpoint such as the Human Disease Ontology, the Human Phenotype Ontology and ORDO, in order to perform queries walking the ontology hierarchy. See an example in the SPARQL section.
- RDF enhancement, data model changes, and fixed bugs:
- New "303 URIs" for DisGeNET GDAs and PANTHER class entities.
- New labels.
- Primary source evidence better described with the Evidence Code Ontology and new properties.
- New name descriptions: foaf:name predicate replaced by dcterms:title.
- Fixed formal description of the DisGeNET Score: Score described as an object property, and not as a datatype property.
- Fixed formal description of gene-disease association type 'label' from original source attribute: now described as a datatype property by a new predicate: sio:SIO_000255 replaced by sio:SIO_000300.
DisGeNET Nanopublication v5.0
The nanopublication distribution of DisGeNET includes all DisGeNET v5.0 gene-disease association statements along with its provenance, evidence, and attribution structured as nanopublications. Please, refer to the release notes and the RDF section in this page for more details.
DisGeNET v4.0 Nanopublication release:
The nanopublication distribution of DisGeNET includes all DisGeNET v4.0 gene-disease association statements along with its provenance, evidence, and attribution structured as nanopublications. Please, refer to the release notes and the RDF section in this page for more details.
DisGeNET v3.0 Nanopublication release:
The nanopublication distribution of DisGeNET includes all DisGeNET v3.0 gene-disease association statements along with its provenance, evidence, and attribution structured as nanopublications. Please, refer to the release notes and the RDF section in this page for more details.