Note to users
The description provided in this page refers to the RDF release 5.0, corresponding to DisGeNET version 5.0. The RDF release 6.0 is not yet available
The DisGeNET-RDF Linked Dataset is an alternative way to access the DisGeNET data and provides new opportunities for data integration, querying and integrating DisGeNET data to other external RDF datasets.
The DisGeNET RDF API has been selected among the 10 ELIXIR Recommended Interoperability Resources ELIXIR announced its first portfolio of Recommended Interoperability Resources (RIRs) to facilitate interoperability and reusability of life science data and support the principles of FAIR data management on December 2018. The full list of ELIXIR Recommended Interoperability Resources is available here.
The RDF version of DisGeNET has been developed in the context of the Open PHACTS project to provide disease relevant information to the knowledge base on pharmacological data. DisGeNET-RDF has been integrated in the Open PHACTS Discovery Platform among other resources such as ChEMBL, WikiPathways and neXtProt. Aimed at exploring and querying DisGeNET data across the linked data in the platform, APIs are currently available in the Open PHACTS API v1.5 (see the OPS API Web site for up to date information).
To perform faceted and precise searches the DisGeNET-RDF linked data is accessible via a Faceted browser.
In addition, DisGeNET-RDF linked data can be accessed for question-answering via a SPARQL endpoint. An alternative SPARQL interaction with the DisGeNET-RDF data is via a LODEStar interface here, which is a SPARQL endpoint and linked data browser for querying and browsing RDF datasets developed in the EBI. Furthermore, some DisGeNET queries are available at Bioqueries. See the 4.3 SPARQL Endpoint: Example Queries section for more details and query examples.
The RDF Linked Dataset is accompanied with a full dataset description, which is compliant with the W3C HCLS specification. For more information on the dataset description of the RDF Dataset go to Metadata Description section.
To download the dump files please, go to the Data Downloads section.
The RDF distribution of DisGeNET includes new annotation and new linksets:
- All linksets updated, i.e. all ontologies updated.
- Disease-phenotype annotation data have been integrated from 3 different sources:
- Annotations from the Human Phenotype Ontology
- Text-mined annotations from The Human Phenotype Ontology: Semantic Unification of Common and Rare Disease. Groza et al., 2015
- Text-mined annotations from Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases. Hoehndorf et al., 2015.
- RDF enhancement and data model changes
DisGeNET Nanopublication v5.0
The nanopublication distribution of DisGeNET includes all DisGeNET v5.0 gene-disease association statements along with its provenance, evidence, and attribution structured as nanopublications. Please, refer to the release notes and the RDF section in this page for more details.
Linked Dataset Description
There are three main components in the RDF dataset: GDA content, metadata description of the RDF dataset (VoID description), and linkouts to other Linked Datasets. The current RDF representation of DisGeNET (v4.0.0) has 30,506,021 triples serialized in Turtle syntax that annotate 429,036 gene-disease associations (GDAs), 17,381 genes, and 15,093 diseases involved in these associations. The RDF graph model is centered on the GDA concept, and different information around GDA, such as the gene and disease involved, and the type of association is represented. Also, the gene identified by the Entrez or NCBI GeneID and the disease identified by the UMLS CUI have different attributes annotated (see the Schema below). Entities and properties are semantically defined using standard ontologies such as the National Cancer Institute thesaurus (NCIt), and resources identified by using de-referenceable IRIs. GDAs are integrated using the DisGeNET Association Type Ontology and they are semantically harmonized using SIO classes (see the DisGeNET ontology section below).
A full dataset description of the RDF Linked Dataset is provided using among others the Vocabulary of Interlinked Datasets (VoID), an RDF Schema W3C recommended vocabulary for expressing metadata about RDF datasets. This dataset description, which is compliant with the W3C HCLS specification and the Open PHACTS specification, includes the provenance of the DisGeNET relational database, the primary databases, and the BeFree text mining tool (see the DisGeNET VoID file description). The type of curation and level of evidence of each original database are also tracked and annotated. Each data instance in DisGeNET is explicitly referenced to this dataset description in order to granulate and trace back the provenance to the instance level.
In addition, linkouts to the LOD are set in order to both enrich DisGeNET GDAs annotations with external Semantic Web resources, and to extend the current GDAs content of the Web of knowledge. Specifically, a total number of 4,962,315 linksets to the LOD through Bio2RDF, linked life data network projects among others exists in the current version. All entities linked are related using the same SKOS predicate skos:exactMatch. Other linkset statistics between entities can be found at the DisGeNET DataHub site in the DataHub registry. Consequently, DisGeNET appears in the last update of the LOD cloud diagram (2014-08-30 update). This diagram shows datasets published in Linked Data format and it is built based on their metadata description on the DataHub as well as on metadata extracted from a crawl of the Linked Data Web.
The data model of the RDF representation of DisGeNET is shown below. Click on the picture to zoom in.
In this new release, GDAs are now identified by "303 URIs" following the W3C recommendation to build URIs for the Semantic Web. Each GDA is defined by a unique combination of a gene (NCBI GeneID), a disease (UMLS CUI), an association type defined by our ontology (see section below), a data source of provenance, and a PubMed article (PMID) giving evidence to the gene-disease association. A unique identifier based on Universally Unique Identifiers (UUID) generated by a cryptographic hash function, is established for each GDA. The DisGeNET GDA ID is composed by: 'DGN' + UUID, e.g. DGN7ab3d8cae0c9f1150cb65a985aa8c0a1. The new namespace is 'http://rdf.disgenet.org/resource/gda/'. The new GDA IRI pattern is: namespace + DisGeNET ID,
For an example of triples related to a single gene-disease association in DisGeNET, see here.
The DisGeNET Association Type Ontology
The DisGeNET Association Type Ontology was developed in our group to fill the gap in formal semantics for the definition of types of associations described between a gene and a disease in biological databases. This ontology was generated using all terms provided by the GDAs original databases. It is an OWL ontology that can be accessed at GeneDiseaseAssociation.owl. The DisGeNET ontology is integrated into the Sematicscience Integrated Ontology (SIO), which is an OWL ontology that provides essential types and relations for the rich description of objects, processes and their attributes [PDF]. You can check SIO gene-disease association classes from this URL or download the entire SIO OWL-DL ontology file . The SIO ontology can be also accessed at the NCBO Bioportal. DisGeNET GDAs in RDF are semantically harmonized using SIO classes.
Access to the RDF Linked Dataset
The DisGeNET-RDF data dump and the VoID description file are accessible to download.
DisGeNET-RDF linked data can be navigated via a Faceted browser.
DisGeNET-RDF data are accessible using the query language SPARQL via our public SPARQL endpoint. The dataset is stored in a Virtuoso's QUAD Store in which the name of the graph is 'http://rdf.disgenet.org'. It is powered by Virtuoso open-source v7.1.0.
An alternative SPARQL interaction with the DisGeNET-RDF data is via a LODEStar interface at the DisGeNET LODEStar Endpoint, which is a SPARQL endpoint and linked data browser for querying and browsing RDF datasets developed in the EBI.
The DisGeNET-RDF dataset is deployed in the graph: 'http://rdf.disgenet.org'.
The namespaces required to query DisGeNET are:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX ncit: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dctypes: <http://purl.org/dc/dcmitype/>
PREFIX wi: <http://http://purl.org/ontology/wi/core#>
PREFIX eco: <http://http://purl.obolibrary.org/obo/eco.owl#>
PREFIX prov: <http://http://http://www.w3.org/ns/prov#>
PREFIX pav: <http://http://http://purl.org/pav/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
RDF Entity Examples
In order to help the user to query DisGeNET RDF data, for each type of entity represented in DisGeNET we provide an example of its RDF annotation serialized in Turtle syntax, see here.
Access DisGeNET via ontology
To facilitate the retrieval of data, several ontologies are deployed in the quad store in order to perform question/answering walking the ontologies. The deployed ontologies are:
- The Semanticscience Integrated Ontology (SIO),
- the Human Disease Ontology (DO),
- the Orphanet Rare Disease Ontology (ORDO),
- the NCI thesaurus (NCIt),
- the Human Phenotype Ontology (HPO),
- the Experiment Factor Ontology (EFO).
- the Evidence Code Ontology (ECO).
Please, note the coverage of DisGeNET with other disease terminologies summarized in the disease table in downloads.
SPARQL Endpoint Example Queries
The purpose of DisGeNET linked dataset is to enable richer queries over the data. Below we provide examples of how to explore DisGeNET data.
Query 1.3: Retrieve all the supporting evidences for the association between Rett Syndrome and the MECP2 gene# Give me all the supporting evidences in DisGeNET, for the association between the "Rett Syndrome" disease (umls:C0035372) and the MECP2 gene (ncbigene:4204).
Query 1.7: Retrieve all the GDAs classified according to the association types of the DisGeNET ontology# Give me all GDAs in DisGeNET and the type of relationship between genes and diseases.
SPARQL Endpoint Example Federated Queries
The purpose of the Federated queries is to integrate DisGeNET data with other Linked Datasets in the LOD cloud. The DisGeNET SPARQL endpoint supports federated queries over other linked datasets such as UniProt, Gene Expression Atlas (GXA), and WikiPathways. The DisGeNET SPARQL endpoint supports the syntax and semantics of SPARQL 1.1 for executing queries distributed over different SPARQL endpoints. Below we provide examples of how to explore perform federated queries.
FED1: DisGeNET + WikiPathways (queries made in collaboration with the WikiPathways RDF team. Thanks!!!)
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
Query 2.3: Retrieve the pathways associated with 'Schizophrenia', and show the number Schizophrenia genes in each pathway# Give me all pathways in WikiPathways and the total number of disease genes in each pathway for 'Schizophrenia' (MeSH:D012559). We will consider associations from CURATED sources with DisGeNET score greater than 0.35. Output the disease name, WikiPathways pathway ID, pathway name, and the number of disease genes.
Query 2.5: Retrieve the number of genes for 'Bardet-Biedl Syndrome' disease and indicate the number of genes present in pathways# For 'Bardet-Biedl Syndrome' disease (MeSH:D020788), give me the total number of associated genes in DisGeNET and the total number of these genes in WikiPathways. Output the disease name, the total number of disease genes in Wikipathways and the total number of disease genes in DisGeNET.
Query 2.6: Retrieve the genes and the pathways associated with 'Bardet-Biedl Syndrome' disease. In addition, list all the genes involved in each of the pathways found# For 'Bardet-Biedl Syndrome' disease (MeSH:D020788), retrieve from DisGeNET the genes, the pathway(s) in which the gene is involved from WikiPathways, and all the genes present in each of these pathways. Output the disease name, gene in DisGeNET, pathway ID, and gene in Wikipathways.
Query 2.7: Retrieve the total number of both disease genes and all genes involved in each pathway for the 'Bardet-Biedl Syndrome'# For 'Bardet-Biedl Syndrome' disease (MeSH:D020788), retrieve from DisGeNET the number of disease genes in a pathway in WikiPathways, the pathway ID, and the number of all genes in each of these pathway. Output the disease name, DisGeNET genes in the pathway, pathway ID, and all genes in the Wikipathways pathway.
Query 2.8: Retrieve the total number of both disease genes and all genes involved in each disease pathway and secondary pathways for 'Bardet-Biedl Syndrome'# For 'Bardet-Biedl Syndrome' disease (MeSH:D020788), retrieve from DisGeNET the number of disease genes in a pathway in WikiPathways, the pathway ID or let's call it disease pathway, the number of all genes shared between the disease pathway and a secondary pathway, and the secondary pathway ID. Output the disease name, DisGeNET genes in the pathway, disease pathway ID, and all genes in the disease pathway shared with another pathway, and the secondary pathway ID.
Query 2.9: Retrieve the number of disease genes, the number of all genes, and the number of secondary pathways in each disease pathway for 'Bardet-Biedl Syndrome'# For 'Bardet-Biedl Syndrome' disease (MeSH:D020788), retrieve from DisGeNET the number of disease genes in a pathway in WikiPathways, the pathway ID or let's call it disease pathway, the number of all genes in the disease pathway, and the number of secondary pathways annotated to genes in each disease pathway. Output the disease name, DisGeNET genes in the pathway, disease pathway ID, the number of all genes in the disease pathway, and the number of secondary pathways.
EBI RDF Source
PREFIX sbmlrdf: <http://identifiers.org/biomodels.vocabulary#>
About RDF, Linked Data, Semantic Web technologies
Good introductions to the field are:
- Wikipedia, always a good place to start!
- W3C, look at whom develops the standards. The World Wide Web Consortium (W3C) is an international community where Member organizations, a full-time staff, and the public work together to develop Web standards.
- EBI RDF Platform documentation, the EBI is one of the major Linked Data providers of Life Sciences. There is a comprehensive documentation in its website that is worth reading.
DisGeNET-RDF Getting Started
Guidelines for Linked Data
Do you want to provide Linked Data? The best place to start is reading the guidelines!
In this subsection we provide links to documents that had been very useful to develop DisGeNET-RDF.
- Provide data in RDF: Open PHACTS RDFization Guidelines
- Provide metadata in RDF: W3C HCLS Dataset Metadata Guidelines
- Provide nanopublications: Nanopub.org Nanopublication Guidelines
- Provide Trusty URIs: The Trusty URI approach
- Provide FAIR data: Be FAIR: Findable, Accessible, Interoperable and Reusable Principles. The FAIR DATA movement: In the era of digital science, the Data FAIRport is an international open initiative to promote the publication of data and information in research in a valuable way. The FAIR Data approach is an enabler. An important step in the FAIR Data approach is to publish existing and new datasets in a semantically interoperable format that can be understood by computer systems. By semantically annotating data items and metadata, we can use computer systems to (semi) automatically combine different data sources, resulting in richer knowledge discovery activities. ELIXIR and the NIH commons support FAIR.
NanopublicationsThe Integrative Biomedical Informatics Group is pleased to announce the second publication of the DisGeNET Nanopublications that is a Linked Dataset implemented in combination of the nanopublication approach [nanopub.org] and the Trusty URIs technique [PDF]. It is an alternative way to mine statements about gene-disease associations contained in DisGeNET. Nanopublications are a new way of publishing structured data that allows the tracking of provenance along with the scientific statement. The Trusty URIs is a novel technique to make resources in the Web immutable and verifiable, and to ensure the unambiguity of the data linking in the (semantic) Web. This new Linked Dataset provides nanopublications about scientific statements of human GDAs. These GDAs published as Trusty URI nanopublications are machine-interpretable, immutable, permanent, and verifiable. Each GDA statement has its provenance description providing evidence, attribution, creation time, and further context of its creation. Each GDA is classified as “CURATED”, “PREDICTED”, or “LITERATURE” in the DisGeNET context to categorize the evidence of the statement based on the type of assertion and curation made in the original databases. DisGeNET nanopublications include metadata annotations about the general topic of the nanopublications, i.e. ‘Gene-Disease Association’, semantically described by SIO to facilitate its discoverability in the Semantic Web (see PDF).
Linked Dataset DescriptionThe third release of DisGeNET published as nanopublications is a distribution of DisGeNET v4.0 (Nanopublications version v22.214.171.124). The dataset consists of 1,414,902 nanopublications, representing the same number of scientific statements for 429,036 different GDAs with their detailed provenance, levels of evidence and publication information descriptions, all annotated as RDF statements and encapsulated into the nanopublication RDF graphs (5,659,608 graphs in total). Specifically, the dataset is composed of 48,106,668 N-Quads, i.e. RDF triples with their graph (or “context”) added as the fourth member in the tuple (Subject, Predicate, Object, Context), everything being serialized in TriG syntax.
DisGeNET Nanopublication Schema
The official guidelines to create nanopublications were used. A DisGeNET nanopublication is modeled by 4 named graphs: head, assertion, provenance and publication information. The head graph defines the structure of the nanopublication by linking to the other graph URIs. The assertion graph contains the description for a specific single GDA assertion. The provenance graph includes provenance, evidence and attribution statements that were directly mapped from the VoID description of the RDF dataset. Finally, the publication information graph includes all the metadata information regarding the nanopublication itself, see figure below (Click on the image to zoom in). The source of data for the DisGeNET nanopublications set is the RDF Linked Dataset version of DisGeNET. To implement Trusty URIs, the GitHub Java implementation was used.
Access to the Nanopublications Linked Dataset
DisGeNET nanopublications can be accessed in two ways: they can be downloaded as a file in TriG format from the download section, and they are deployed in a new decentralized nanopublication server network, which is a distributed server network with a REST API to provide and propagate nanopublications identified by trusty URIs [ref]. DisGeNET nanopublications are registered in datahub with other datasets formatted as nanopubublications. New: For performance reasons DisGeNET nanopublications are not accessible anymore via our SPARQL endpoint. To download the current dataset, which is the nanopublication distribution of the DisGeNET v4.0: nanopubs-v126.96.36.199.
For performance reasons DisGeNET nanopublications are not accessible anymore via our SPARQL endpoint.
To download the current dataset, which is the nanopublication distribution of the DisGeNET v4.0: nanopubs-v188.8.131.52.
SPARQL Example queriesDisGeNET nanopublications can be explored using the query language SPARQL via a SPARQL endpoint. With illustrative queries we show how to explore GDAs with DisGeNET nanopublications and how to integrate them with relationships published in other LOD sources. As example we can query DisGeNET nanopubs to answer the following question:
What are the proteins (and their protein interactions) associated to Alzheimer Disease with curated evidence?
Query 1.1: Retrieving Gene-Disease Associations# First, we query DisGeNET for all the genes associated to Alzheimer Disease (umls:C0002395). This query only involves the assertion graph.
Query 1.2: Filtering By Evidence# Second, we filter the prior results with those assertions annotated as CURATED DisGeNET evidence. This query involves the provenance graph.
Query 1.3: Linking with Other LOD Resources# Finally, we cross DisGeNET prior results with the Interaction Reference Index database data, which contains protein-protein interactions (PPI) annotations, through Bio2RDF::irefindex SPARQL endpoint, federating the query. Since in DisGeNET-RDF is also represented the relation between gene and the protein/s that encodes, we are able to cross DisGeNET with Bio2RDF::irefindex by Protein resources through the corresponding linkset to 'http://bio2rdf.org/uniprot:UniProtID'.
DisGeNET v4.0 RDF Release Information
The RDF distribution of DisGeNET includes all DisGeNET v4.0 new content, besides new annotation and new linksets:
- More GDAs comprising more than 17,000 genes and 15,000 diseases as linked data in the Semantic Web.
- All linksets updated, i.e. all ontologies updated.
- Disease-phenotype annotation data from the Human Phenotype Ontology.
- New linksets to the Experimental Factor Ontology (EFO).
- New annotation: all diseases annotated to the original term(s) of provenance.
- EFO is also deployed in our SPARQL endpoint such as the Human Disease Ontology, the Human Phenotype Ontology and ORDO, in order to perform queries walking the ontology hierarchy. See examples in the SPARQL section.
- RDF enhancement, data model changes, and fixed bugs:
- Updated RDF Schema to encompass new annotations.
- Changed formal description of the property linking diseases to their phenotypic profile: sio:'is manifested as' (sio:SIO_000341) replaced by sio:'has phenotype' (sio:SIO_001279).
- Fixed data typing language: language tags correctly added on RDF Literal data types.
DisGeNET v3.0 RDF Release InformationThe RDF distribution of DisGeNET includes new annotation and new linksets:
- More GDAs comprising 17 000 genes and more than 14 000 diseases as linked data in the Semantic Web.
- All linksets updated, i.e. all ontologies updated.
- New disease-phenotype annotation data from the Human Phenotype Ontology.
- New linksets to NCI Thesaurus, Orphanet Rare Disease Ontologies (ORDO), and DECIPHER.
- New taxonomic annotation: all GDAs annotated to the Homo sapiens (Human) taxon.
- New full metadata description of the dataset compliant with the W3C HCLS and the Open PHACTS specifications (the Open PHACTS specifications specially used for linkset descriptions).
- More mappings to the Linked Open Data cloud.
- New and alternative LODEStar SPARQL access.
- New types of searches: six ontologies are deployed in our SPARQL endpoint such as the Human Disease Ontology, the Human Phenotype Ontology and ORDO, in order to perform queries walking the ontology hierarchy. See an example in the SPARQL section.
- RDF enhancement, data model changes, and fixed bugs:
- New "303 URIs" for DisGeNET GDAs and PANTHER class entities.
- New labels.
- Primary source evidence better described with the Evidence Code Ontology and new properties.
- New name descriptions: foaf:name predicate replaced by dcterms:title.
- Fixed formal description of the DisGeNET Score: Score described as an object property, and not as a datatype property.
- Fixed formal description of gene-disease association type 'label' from original source attribute: now described as a datatype property by a new predicate: sio:SIO_000255 replaced by sio:SIO_000300.
DisGeNET v3.0 Nanopublication release:
The nanopublication distribution of DisGeNET includes all DisGeNET v3.0 gene-disease association statements along with its provenance, evidence, and attribution structured as nanopublications. Please, refer to the release notes and the RDF section in this page for more details.
DisGeNET v4.0 Nanopublication release:
The nanopublication distribution of DisGeNET includes all DisGeNET v4.0 gene-disease association statements along with its provenance, evidence, and attribution structured as nanopublications. Please, refer to the release notes and the RDF section in this page for more details.