DisGeNET RDF

The DisGeNET-RDF Linked Dataset is an alternative way to access DisGeNET data. It opens new opportunities for querying DisGeNET and for integrating its data with external RDF datasets.

The DisGeNET RDF API has been selected as one of the 10 ELIXIR Recommended Interoperability Resources (RIRs). In December 2018, ELIXIR announced its first portfolio of RIRs to facilitate the interoperability and reusability of life science data and to support the principles of FAIR data management. The full list of ELIXIR Recommended Interoperability Resources is available here.

The RDF version of DisGeNET was developed in the context of the Open PHACTS project to provide disease-relevant information to its knowledge base of pharmacological data. DisGeNET-RDF has been integrated into the Open PHACTS Discovery Platform alongside resources such as ChEMBL, WikiPathways and neXtProt. To explore and query DisGeNET data across the linked data in the platform, API calls are available in the Open PHACTS API v1.5 (see the OPS API Web site for up-to-date information).

For faceted and precise searches, the DisGeNET-RDF linked data is accessible via a Faceted browser.

In addition, DisGeNET-RDF linked data can be queried for question answering via a SPARQL endpoint. An alternative way to run SPARQL queries against the DisGeNET-RDF data is the LODEStar interface here, a SPARQL endpoint and linked data browser for querying and browsing RDF datasets, developed at the EBI. Furthermore, some DisGeNET queries are available at Bioqueries. See the SPARQL Endpoint Example Queries section for more details and query examples.

The RDF Linked Dataset is accompanied by a full dataset description, which is compliant with the W3C HCLS specification. For more information on the dataset description, see the Metadata Description section.

Release Information

DisGeNET-RDF v7.0

The RDF distribution of DisGeNET includes new annotations and new linksets:

All linksets and all ontologies have been updated.
The risk allele of the disease variant is now available for ClinVar, the GWAS Catalog and GWASdb.
The protein class is now modeled using the categories in the Drug Target Ontology.

Linked Dataset Description

The RDF dataset has four main components: GDA content, VDA content, a metadata description of the RDF dataset (VoID description), and linkouts to other Linked Datasets. The current RDF representation of DisGeNET (v7.0.0) has 99,057,987 triples serialized in Turtle syntax. The triples annotate 1,134,942 gene-disease associations (GDAs) between 21,671 genes and 30,170 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 369,554 variant-disease associations (VDAs) between 194,515 variants and 14,155 diseases, traits, and phenotypes. The RDF graph model is centered on two concepts, the GDA and the VDA, and their attributes. Genes are identified by NCBI Entrez identifiers, diseases by UMLS CUIs, and variants by dbSNP identifiers. Entities and properties are semantically defined using standard ontologies such as the National Cancer Institute thesaurus (NCIt), and resources are identified by dereferenceable IRIs. GDAs are integrated using the DisGeNET Association Type Ontology and are semantically harmonized using SIO classes (see the DisGeNET Association Type Ontology section below).

A full description of the RDF Linked Dataset is provided using, among others, the Vocabulary of Interlinked Datasets (VoID), an RDF Schema vocabulary recommended by the W3C for expressing metadata about RDF datasets. This dataset description, which is compliant with the W3C HCLS specification and the Open PHACTS specification, includes the provenance of the DisGeNET relational database, the primary databases, and the BeFree text mining tool (see the DisGeNET VoID file description). The type of curation and the level of evidence of each original database are also tracked and annotated. Each data instance in DisGeNET explicitly references this dataset description, so that provenance can be traced back at the level of individual instances.

In addition, linkouts to the Linked Open Data (LOD) cloud are provided, both to enrich DisGeNET GDA annotations with external Semantic Web resources and to extend the GDA content available in the Web of knowledge. Specifically, the current version contains a total of 3,308,936 linksets to the LOD through Bio2RDF and the Linked Life Data network projects, among others. All linked entities are related using the same SKOS predicate, skos:exactMatch. Other linkset statistics between entities can be found at the DisGeNET DataHub site in the DataHub registry. Consequently, DisGeNET appears in the latest update of the LOD cloud diagram (May 2020 update). This diagram shows datasets published in Linked Data format; it is built from their metadata descriptions on the DataHub and from metadata extracted from a crawl of the Linked Data Web.

Metadata Description

The RDF Linked Dataset is accompanied by a full dataset description, which is compliant with the W3C HCLS specification. The full VoID description is available at DisGeNET_VoID.ttl.gz.

DisGeNET-RDF Schema

The data model of the RDF representation of DisGeNET is shown below.

In this new release, GDAs are identified by "303 URIs", following the W3C recommendation for building URIs for the Semantic Web. Each GDA is defined by a unique combination of a gene (NCBI GeneID), a disease (UMLS CUI), an association type defined by our ontology (see the section below), a source of provenance, and a PubMed article (PMID) that provides evidence for the gene-disease association. Each GDA receives a unique identifier based on a Universally Unique Identifier (UUID) generated by a cryptographic hash function. The DisGeNET GDA ID is composed of 'DGN' + UUID, e.g. DGN7ab3d8cae0c9f1150cb65a985aa8c0a1. The new namespace is 'http://rdf.disgenet.org/resource/gda/'. The new GDA IRI pattern is namespace + DisGeNET ID, e.g. 'http://rdf.disgenet.org/resource/gda/DGN7ab3d8cae0c9f1150cb65a985aa8c0a1'.
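The IRI pattern described above (namespace + DisGeNET ID) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the identifier check ('DGN' followed by 32 lowercase hexadecimal characters) is an assumption inferred from the single example ID shown here, not a documented rule.

```python
import re

# Namespace taken from the text above.
GDA_NAMESPACE = "http://rdf.disgenet.org/resource/gda/"

# Assumption: 'DGN' + 32 lowercase hex characters, inferred from the example ID.
GDA_ID_PATTERN = re.compile(r"DGN[0-9a-f]{32}")


def gda_iri(gda_id: str) -> str:
    """Build a GDA IRI as namespace + DisGeNET ID."""
    if not GDA_ID_PATTERN.fullmatch(gda_id):
        raise ValueError(f"unexpected GDA identifier: {gda_id!r}")
    return GDA_NAMESPACE + gda_id


def gda_id_from_iri(iri: str) -> str:
    """Recover the DisGeNET ID from a GDA IRI."""
    if not iri.startswith(GDA_NAMESPACE):
        raise ValueError(f"not a GDA IRI: {iri!r}")
    gda_id = iri[len(GDA_NAMESPACE):]
    gda_iri(gda_id)  # reuse the identifier check; raises ValueError if malformed
    return gda_id


if __name__ == "__main__":
    print(gda_iri("DGN7ab3d8cae0c9f1150cb65a985aa8c0a1"))
```

Because the IRI is just the namespace concatenated with the ID, the two helpers are inverses of each other for any identifier that passes the check.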

For an example of triples related to a single gene-disease association in DisGeNET, see here.

The DisGeNET Association Type Ontology

The DisGeNET Association Type Ontology was developed in our group to fill the gap in formal semantics for defining the types of association between a gene and a disease described in biological databases. The ontology was generated using all the terms provided by the original GDA databases. It is an OWL ontology that can be accessed at GeneDiseaseAssociation.owl. The DisGeNET ontology is integrated into the Semanticscience Integrated Ontology (SIO), an OWL ontology that provides essential types and relations for the rich description of objects, processes and their attributes [PDF]. You can check the SIO gene-disease association classes from this URL or download the entire SIO OWL-DL ontology file. The SIO ontology can also be accessed at the NCBO BioPortal. DisGeNET GDAs in RDF are semantically harmonized using SIO classes.

Access to the RDF Linked Dataset

Faceted Browser

DisGeNET-RDF linked data can be navigated via a Faceted browser.

SPARQL Endpoint

DisGeNET-RDF data are accessible using the SPARQL query language via our public SPARQL endpoint. The dataset is stored in a Virtuoso quad store, in the graph named 'http://rdf.disgenet.org'. The endpoint is powered by Virtuoso Open-Source v7.1.0.

An alternative way to run SPARQL queries against the DisGeNET-RDF data is the LODEStar interface at the DisGeNET LODEStar Endpoint, a SPARQL endpoint and linked data browser for querying and browsing RDF datasets, developed at the EBI.

DisGeNET GRAPH

The DisGeNET-RDF dataset is deployed in the graph: 'http://rdf.disgenet.org'.
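As a rough illustration of how the graph name is used when querying programmatically, the sketch below builds a SPARQL 1.1 Protocol GET request URL with the standard `query` and `default-graph-uri` parameters. The endpoint address is an assumption (it is not stated in this text) and should be replaced with the address of the public endpoint; `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint address; substitute the real public endpoint.
ENDPOINT = "http://rdf.disgenet.org/sparql/"

# Graph name taken from the text above.
GRAPH = "http://rdf.disgenet.org"


def build_query_url(query: str, endpoint: str = ENDPOINT, graph: str = GRAPH) -> str:
    """Return a GET URL that sends `query` and restricts it to `graph`."""
    params = {
        "query": query,              # the SPARQL query text
        "default-graph-uri": graph,  # default graph, as defined by the SPARQL Protocol
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url("SELECT * WHERE { ?s ?p ?o } LIMIT 1"))
```

The resulting URL can be fetched with any HTTP client. Requesting the `application/sparql-results+json` media type in the `Accept` header is the standard way to ask for JSON results.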

DisGeNET NAMESPACES*

The namespaces required to query DisGeNET are:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX so: <http://purl.obolibrary.org/obo/SO_>
PREFIX ncit: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dctypes: <http://purl.org/dc/dcmitype/>
PREFIX wi: <http://purl.org/ontology/wi/core#>
PREFIX eco: <http://purl.obolibrary.org/obo/eco.owl#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX pav: <http://purl.org/pav/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX dto: <http://diseasetargetontology.org/dto/>

*Our SPARQL endpoint is configured with these prefixes, thus their definition is not required when executing queries from our endpoint.
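When queries are sent to a different endpoint, or when results need to be compared against full IRIs, the prefix list above can be handled in code. The following minimal sketch uses a subset of the prefixes listed above; the helper names `expand` and `prefix_header` are hypothetical.

```python
# A subset of the namespaces listed above; extend as needed.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dcterms": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "void": "http://rdfs.org/ns/void#",
    "sio": "http://semanticscience.org/resource/",
}


def expand(curie: str) -> str:
    """Expand a prefixed name such as 'sio:SIO_001120' into a full IRI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {curie!r}")
    return PREFIXES[prefix] + local


def prefix_header(names) -> str:
    """Render PREFIX declarations for the given prefix names, one per line."""
    return "\n".join(f"PREFIX {name}: <{PREFIXES[name]}>" for name in names)


if __name__ == "__main__":
    print(prefix_header(["rdf", "sio"]))
    print(expand("sio:SIO_001120"))
```

Prepending the output of `prefix_header` to a query makes it self-contained, so it does not rely on prefixes preconfigured on the server.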

RDF Entity Examples

In order to help the user to query DisGeNET RDF data, for each type of entity represented in DisGeNET we provide an example of its RDF annotation serialized in Turtle syntax, see here.

Access DisGeNET via ontology

To facilitate data retrieval, several ontologies are deployed in the quad store so that questions can be answered by traversing the ontology hierarchies. The deployed ontologies are:

The Semanticscience Integrated Ontology (SIO),
the Human Disease Ontology (DO),
the Orphanet Rare Disease Ontology (ORDO),
the NCI thesaurus (NCIt),
the Human Phenotype Ontology (HPO),
the Experimental Factor Ontology (EFO),
the Evidence Code Ontology (ECO).

Please note that the coverage of other disease terminologies in DisGeNET is summarized in the disease table in downloads.
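One possible way to walk a deployed ontology is a SPARQL 1.1 property path over the class hierarchy. The sketch below only generates the query text. It assumes that the hierarchy is exposed through rdfs:subClassOf and that the endpoint supports property paths; neither is confirmed by this text. The function name `gdas_under_class` is hypothetical, and the example class sio:SIO_001120 is taken from the example query in this document.

```python
import re

# Accept only simple prefixed names such as 'sio:SIO_001120'.
CURIE_PATTERN = re.compile(r"[A-Za-z][\w-]*:\w+")


def gdas_under_class(class_curie: str, limit: int = 20) -> str:
    """Return a query for associations typed with a class or any of its subclasses."""
    if not CURIE_PATTERN.fullmatch(class_curie):
        raise ValueError(f"not a prefixed name: {class_curie!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT DISTINCT ?gda ?type\n"
        "WHERE {\n"
        "  ?gda rdf:type ?type .\n"
        # 'rdfs:subClassOf*' matches the class itself and all of its descendants.
        f"  ?type rdfs:subClassOf* {class_curie} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(gdas_under_class("sio:SIO_001120", limit=10))
```

The generated query relies on the rdf, rdfs and sio prefixes, which the endpoint preconfigures according to the namespaces section; on other endpoints they would need to be declared explicitly.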

SPARQL Endpoint Example Queries

The purpose of the DisGeNET linked dataset is to enable richer queries over the data. Below we provide examples of how to explore DisGeNET data.

Examples

Query 1.1: Retrieve all the gene-disease associations (GDAs) and their general description

# Retrieve all the GDAs of type 'Therapeutic' (sio:SIO_001120) and their general description.
SELECT ?gda ?sio_type ?label ?comment ?title ?id ?voidSubset
WHERE {
  ?gda rdf:type ?type ;
       rdfs:label ?label ;
       rdfs:comment ?comment ;
       dcterms:title ?title ;
       dcterms:identifier ?id ;
       void:inDataset ?voidSubset .
  ?type rdfs:label ?sio_type .
  FILTER(?type = sio:SIO_001120)
}
LIMIT 20
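When a query like Query 1.1 is run programmatically and JSON output is requested, the response follows the W3C SPARQL Query Results JSON Format. The sketch below flattens such a response into plain rows. The sample payload is a made-up placeholder that only mimics the format (it reuses the example GDA identifier from the schema section and is not real query output), and `bindings_to_rows` is a hypothetical helper name.

```python
import json


def bindings_to_rows(doc: dict) -> list:
    """Turn a SPARQL JSON results document into a list of {variable: value} dicts."""
    variables = doc["head"]["vars"]
    return [
        # Unbound variables are simply absent from a binding; map them to None.
        {var: (binding[var]["value"] if var in binding else None) for var in variables}
        for binding in doc["results"]["bindings"]
    ]


# Placeholder payload for illustration only, NOT actual endpoint output.
SAMPLE = json.loads("""
{
  "head": {"vars": ["gda", "id", "comment"]},
  "results": {"bindings": [
    {
      "gda": {"type": "uri",
              "value": "http://rdf.disgenet.org/resource/gda/DGN7ab3d8cae0c9f1150cb65a985aa8c0a1"},
      "id": {"type": "literal", "value": "DGN7ab3d8cae0c9f1150cb65a985aa8c0a1"}
    }
  ]}
}
""")

if __name__ == "__main__":
    for row in bindings_to_rows(SAMPLE):
        print(row)
```

Each row keeps only the `value` field of a binding; the `type` field (and any datatype or language tag) is dropped, which is usually acceptable for quick inspection but loses information needed for round-tripping.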
