Instance Matcher

Instance-based matching: probe SPARQL endpoints for bioregistry URI patterns.

Given a bioregistry resource prefix (e.g. "ensembl"), this module queries every rdfsolve data source for the RDF classes whose instances match the resource’s known URI prefixes. When two classes both have instances of the same resource, whether in different datasets or in the same one, a mapping edge is emitted between them.
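The pairing idea can be illustrated with a minimal sketch. This is not rdfsolve's implementation: the function name pair_matched_classes, the input shape (dataset name mapped to class URIs), and the example URIs are all hypothetical. It treats each (dataset, class) combination as one node, which is an assumption; how the real module handles the same class URI seen in two datasets is not stated here.

```python
from itertools import combinations


def pair_matched_classes(matches):
    """Return every unordered pair of (dataset, class) nodes that matched.

    `matches` maps a dataset name to the class URIs whose instances
    matched the resource. This input shape is an assumption made for
    the sketch, not the structure rdfsolve uses internally.
    """
    # Flatten into unique (dataset, class) nodes; sort for a stable order.
    nodes = sorted({(ds, cls) for ds, classes in matches.items() for cls in classes})
    # combinations() yields each unordered pair once, never a node with itself.
    return list(combinations(nodes, 2))


# Hypothetical probe results for one resource.
matches = {
    "dataset_a": ["http://example.org/Gene", "http://example.org/GeneAnnotation"],
    "dataset_b": ["http://example.org/Transcript"],
}
pairs = pair_matched_classes(matches)
```

Three matched classes yield three candidate edges here: one within dataset_a and two between dataset_a and dataset_b.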

The result is an InstanceMapping that can be serialised to JSON-LD and imported into the rdfsolve database alongside mined schemas. The JSON-LD format is identical to that of a mined schema, so the frontend parseJSONLD pipeline works unchanged: skos:narrowMatch edges become walkable graph edges in the UI.

Typical usage:

from rdfsolve.sources import load_sources_dataframe
from rdfsolve.instance_matcher import probe_resource

datasources = load_sources_dataframe()
mapping = probe_resource("ensembl", datasources)
jsonld = mapping.to_jsonld()

probe_resource(prefix: str, datasources: DataFrame, predicate: str = 'http://www.w3.org/2004/02/skos/core#narrowMatch', dataset_names: list[str] | None = None, timeout: float = 60.0) -> InstanceMapping

Probe SPARQL endpoints for a bioregistry resource.

Steps:

  1. Resolve URI format prefixes for prefix via bioregistry.

  2. Optionally filter datasources to dataset_names.

  3. For each dataset, query its endpoint once per URI prefix using a STRSTARTS-based SELECT DISTINCT ?c query.

  4. Build pairwise MappingEdge instances between any two distinct classes that both matched the resource. This includes two classes within the same dataset: for example, Gene and GeneAnnotation in the same endpoint, both having Ensembl instance URIs, are linked just like cross-dataset classes.

  5. Return an InstanceMapping ready for .to_jsonld().
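The query in step 3 might look roughly like the one built below. This is a sketch under assumptions: the exact query text rdfsolve sends is not shown in this documentation, the helper name build_class_probe_query is hypothetical, and the identifiers.org URI prefix is only an illustrative value rather than output from bioregistry.

```python
def build_class_probe_query(uri_prefix: str) -> str:
    """Build a SPARQL query listing classes whose instances start with uri_prefix.

    Assumed query shape: type triples filtered with STRSTARTS on the
    subject URI. The real module may use a different pattern.
    """
    # Escape backslashes and double quotes so the prefix is a valid SPARQL string literal.
    escaped = uri_prefix.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT DISTINCT ?c WHERE {\n"
        "  ?s a ?c .\n"
        f'  FILTER(STRSTARTS(STR(?s), "{escaped}"))\n'
        "}"
    )


# Illustrative URI prefix; real prefixes come from bioregistry in step 1.
query = build_class_probe_query("http://identifiers.org/ensembl/")
```

Running one such query per URI prefix and per endpoint gives the set of matching classes that step 4 then pairs up.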

Parameters:
  • prefix – Bioregistry prefix, e.g. "ensembl".

  • datasources – DataFrame with at least columns dataset_name and endpoint_url.

  • predicate – Mapping predicate URI. Defaults to skos:narrowMatch. Override to skos:exactMatch, owl:sameAs, etc. as appropriate.

  • dataset_names – If given, only probe these datasets.

  • timeout – SPARQL HTTP timeout per request in seconds.
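The effect of dataset_names can be sketched without pandas by using plain dictionaries as stand-ins for DataFrame rows. The helper select_datasets, the dataset names, and the endpoint URLs are hypothetical; only the column names dataset_name and endpoint_url come from the parameter description above.

```python
def select_datasets(rows, dataset_names=None):
    """Keep only rows whose dataset_name is listed; keep all rows if no filter is given."""
    if dataset_names is None:
        return list(rows)
    wanted = set(dataset_names)
    return [row for row in rows if row["dataset_name"] in wanted]


# Stand-in rows with the two required columns (values are made up).
rows = [
    {"dataset_name": "dataset_a", "endpoint_url": "https://example.org/a/sparql"},
    {"dataset_name": "dataset_b", "endpoint_url": "https://example.org/b/sparql"},
]
selected = select_datasets(rows, dataset_names=["dataset_b"])
```

With a real pandas DataFrame, a comparable filter would be datasources[datasources["dataset_name"].isin(dataset_names)].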

Returns:

InstanceMapping with edges, match_results, and provenance metadata (about).

Raises:

ValueError – If prefix is unknown to bioregistry.