Instance Matcher
Instance-based matching: probe SPARQL endpoints for bioregistry URI patterns.
Given a bioregistry resource prefix (e.g. "ensembl"), this module
queries every rdfsolve data source for the RDF classes whose instances
match the resource’s known URI prefixes. When two datasets both contain
instances of the same resource, a mapping edge is emitted between their
respective classes.
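The edge-emission rule described above can be sketched in a few lines. This is a minimal illustration and not the rdfsolve implementation: the `Edge` tuple, the `pairwise_edges` function, the dataset names, and the `example.org` class URIs are all hypothetical. It also assumes that a class is identified by its (dataset, class URI) pair.

```python
from itertools import combinations
from typing import NamedTuple

# Default predicate taken from the probe_resource signature.
SKOS_NARROW_MATCH = "http://www.w3.org/2004/02/skos/core#narrowMatch"


class Edge(NamedTuple):
    """Hypothetical stand-in for a mapping edge between two classes."""
    source_dataset: str
    source_class: str
    target_dataset: str
    target_class: str
    predicate: str


def pairwise_edges(matches: dict[str, list[str]],
                   predicate: str = SKOS_NARROW_MATCH) -> list[Edge]:
    """Link every pair of distinct (dataset, class) nodes that matched."""
    # Deduplicate and sort so the output order is deterministic.
    nodes = sorted({(ds, cls) for ds, classes in matches.items() for cls in classes})
    return [Edge(a[0], a[1], b[0], b[1], predicate)
            for a, b in combinations(nodes, 2)]


# Illustrative input: classes whose instances matched one resource.
matches = {
    "dataset_a": ["http://example.org/Gene"],
    "dataset_b": ["http://example.org/Transcript",
                  "http://example.org/GeneAnnotation"],
}
edges = pairwise_edges(matches)
```

With three matching classes the sketch yields three edges, one of which connects two classes inside `dataset_b`.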
The result is an InstanceMapping that can be
serialised to JSON-LD and imported into the rdfsolve database alongside
mined schemas. The JSON-LD format is identical to a mined schema’s, so
the frontend parseJSONLD pipeline works without any changes:
skos:narrowMatch edges become walkable graph edges in the UI.
Typical usage:

    from rdfsolve.sources import load_sources_dataframe
    from rdfsolve.instance_matcher import probe_resource

    datasources = load_sources_dataframe()
    mapping = probe_resource("ensembl", datasources)
    jsonld = mapping.to_jsonld()

- probe_resource(prefix: str, datasources: DataFrame, predicate: str = 'http://www.w3.org/2004/02/skos/core#narrowMatch', dataset_names: list[str] | None = None, timeout: float = 60.0) -> InstanceMapping
Probe SPARQL endpoints for a bioregistry resource.
Steps:
1. Resolve URI format prefixes for prefix via bioregistry.
2. Optionally filter datasources to dataset_names.
3. For each dataset, query its endpoint with each URI prefix using a STRSTARTS-based SELECT DISTINCT ?c.
4. Build pairwise MappingEdge instances between any two distinct classes that both matched the resource, including two classes within the same dataset (e.g. Gene and GeneAnnotation in the same endpoint, both having Ensembl instance URIs, are linked just like cross-dataset classes).
5. Return an InstanceMapping ready for .to_jsonld().
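The per-prefix query in the steps above might look roughly like the following. The source only states that the query is a STRSTARTS-based SELECT DISTINCT ?c, so the `?s a ?c` triple pattern, the `build_class_query` helper name, and the example URI prefix are assumptions made for illustration. The real URI prefixes come from bioregistry.

```python
def build_class_query(uri_prefix: str) -> str:
    """Build a SPARQL query listing classes whose instance URIs start with uri_prefix.

    Hypothetical helper: the exact query text used by rdfsolve is not shown
    in the documentation.
    """
    # Escape backslashes and double quotes so the prefix is a valid SPARQL string literal.
    escaped = uri_prefix.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT DISTINCT ?c WHERE {\n"
        "  ?s a ?c .\n"
        f'  FILTER(STRSTARTS(STR(?s), "{escaped}"))\n'
        "}"
    )


# Illustrative prefix only; not necessarily one that bioregistry returns for "ensembl".
query = build_class_query("http://identifiers.org/ensembl/")
print(query)
```

One query of this shape would be sent per (dataset, URI prefix) combination, and the returned `?c` bindings are the classes that feed the pairwise edge step.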
- Parameters:
prefix – Bioregistry prefix, e.g. "ensembl".
datasources – DataFrame with at least columns dataset_name and endpoint_url.
predicate – Mapping predicate URI. Defaults to skos:narrowMatch. Override to skos:exactMatch, owl:sameAs, etc. as appropriate.
dataset_names – If given, only probe these datasets.
timeout – SPARQL HTTP timeout per request in seconds.
- Returns:
InstanceMapping with edges, match_results, and provenance about.
- Raises:
ValueError – If prefix is unknown to bioregistry.
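Because the predicate parameter takes a full URI while the description names predicates in prefixed form, a small expansion helper can be handy. The sketch below is not part of rdfsolve: `PREFIXES` and `expand_curie` are hypothetical names, and only the standard SKOS and OWL namespace URIs are assumed.

```python
# Standard namespace URIs for the two vocabularies named in the parameter description.
PREFIXES = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "owl": "http://www.w3.org/2002/07/owl#",
}


def expand_curie(curie: str) -> str:
    """Expand a prefixed name such as 'skos:exactMatch' to a full URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"Unknown or malformed prefixed name: {curie!r}")
    return PREFIXES[prefix] + local


exact_match = expand_curie("skos:exactMatch")
same_as = expand_curie("owl:sameAs")

# Possible use with the documented signature (not executed here):
# mapping = probe_resource("ensembl", datasources, predicate=exact_match)
```

Passing `dataset_names` alongside the predicate would restrict the probe to the listed datasets, as described in the parameter list.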