Miner
Schema Miner - extract RDF schema patterns via simple SELECT queries.
Instead of building VoID on the endpoint with heavy CONSTRUCT + BIND queries, this module runs three lightweight SELECT DISTINCT queries and assembles the schema in Python:
Typed-object patterns:
SELECT DISTINCT ?sc ?p ?oc WHERE { ?s ?p ?o . ?s a ?sc . ?o a ?oc . }
Literal patterns (datatype properties):
SELECT DISTINCT ?sc ?p (DATATYPE(?o) AS ?dt) WHERE { ?s ?p ?o . ?s a ?sc . FILTER(isLiteral(?o)) }
Untyped-URI patterns (URI objects without rdf:type):
SELECT DISTINCT ?sc ?p WHERE { ?s ?p ?o . ?s a ?sc . FILTER(isURI(?o)) FILTER NOT EXISTS { ?o a ?any } }
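The Python-side assembly of these three result sets can be pictured roughly as follows. This is a minimal sketch, not the module's actual code: `Pattern`, `assemble_patterns`, and the row dictionaries are hypothetical stand-ins for parsed SELECT results. Only the variable names (`sc`, `p`, `oc`, `dt`) come from the queries above, and the `rdfs:Resource` placeholder for untyped objects follows the sentinel described for `untyped_as_classes`.

```python
from typing import NamedTuple

RDFS_RESOURCE = "http://www.w3.org/2000/01/rdf-schema#Resource"


class Pattern(NamedTuple):
    """One schema pattern: subject class, property, and object class or datatype."""
    subject_class: str
    prop: str
    obj: str
    kind: str  # "typed", "literal", or "untyped"


def assemble_patterns(typed_rows, literal_rows, untyped_rows):
    """Merge rows from the three SELECT queries into one de-duplicated set."""
    patterns = set()
    for row in typed_rows:
        patterns.add(Pattern(row["sc"], row["p"], row["oc"], "typed"))
    for row in literal_rows:
        patterns.add(Pattern(row["sc"], row["p"], row["dt"], "literal"))
    for row in untyped_rows:
        # Untyped URI objects have no class, so a generic placeholder is used.
        patterns.add(Pattern(row["sc"], row["p"], RDFS_RESOURCE, "untyped"))
    return patterns


# Hypothetical rows, one per query type.
typed = [{"sc": "ex:Person", "p": "ex:knows", "oc": "ex:Person"}]
literal = [{"sc": "ex:Person", "p": "ex:name", "dt": "xsd:string"}]
untyped = [{"sc": "ex:Person", "p": "ex:homepage"}]
schema = assemble_patterns(typed, literal, untyped)
print(len(schema))  # three distinct patterns
```

Using a set keeps the result free of duplicates even if the same pattern is returned on more than one page.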
All queries use OFFSET / LIMIT pagination via SparqlHelper.select_chunked().
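OFFSET / LIMIT pagination generally follows the loop below. This is an illustrative sketch and not the implementation of SparqlHelper.select_chunked(), whose signature is not shown here. The names `paginate` and `run_query`, and the stop condition (a page shorter than the chunk size), are assumptions.

```python
import time
from typing import Callable, Iterator


def paginate(
    base_query: str,
    run_query: Callable[[str], list[dict]],
    chunk_size: int = 10000,
    delay: float = 0.0,
) -> Iterator[dict]:
    """Yield rows page by page, appending LIMIT/OFFSET to base_query."""
    offset = 0
    while True:
        page = run_query(f"{base_query} LIMIT {chunk_size} OFFSET {offset}")
        yield from page
        # A short (or empty) page means the result set is exhausted.
        if len(page) < chunk_size:
            break
        offset += chunk_size
        time.sleep(delay)  # pause between requests to spare the endpoint
```

Note that SPARQL only guarantees a predictable split of results across pages when the query also fixes the order with ORDER BY, so a loop like this relies on the endpoint returning rows in a stable order.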
The primary export is MinedSchema, which serializes to JSON-LD. It can also
be converted to a VoID graph for downstream LinkML / SHACL / RDF-config
export via VoidParser.
- class SchemaMiner(endpoint_url: str, graph_uris: str | list[str] | None = None, chunk_size: int = 10000, class_chunk_size: int | None = None, class_batch_size: int = 15, delay: float = 0.5, timeout: float = 120.0, counts: bool = True, two_phase: bool = True, unsafe_paging: bool = False, report_path: str | Path | None = None, filter_service_namespaces: bool = True, untyped_as_classes: bool = False, authors: list[dict[str, str]] | None = None, qlever_version: dict[str, str] | None = None, one_shot: bool = False)[source]
Bases: object
Mine RDF schema patterns from a SPARQL endpoint.
- Parameters:
endpoint_url – SPARQL endpoint URL.
graph_uris – Optional named-graph URI(s) to restrict queries to.
chunk_size – Number of rows per paginated request.
class_chunk_size – Page size for Phase-1 class discovery in two-phase mode. None disables pagination (single query).
class_batch_size – Number of classes grouped into one VALUES query in Phase-2 of two-phase mining. Default 15. Higher values send fewer queries but each query is heavier.
delay – Seconds to sleep between pagination requests.
timeout – HTTP timeout per request (seconds).
counts – Whether to also run COUNT queries for triple counts.
two_phase – Use two-phase mining (default). Phase 1 discovers all rdf:type classes; phase 2 queries properties per class. Much gentler on heavyweight endpoints like QLever/PubChem/UniProt. Pass False for the legacy single-pass strategy.
filter_service_namespaces – When True (the default), remove patterns whose subject, property, or object URI belongs to a service/system namespace (Virtuoso, OpenLink, etc.) from the final result.
untyped_as_classes – When True, treat untyped URI objects (those without an explicit rdf:type) as owl:Class references instead of the generic rdfs:Resource sentinel. Default False.
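The trade-off behind class_batch_size can be seen in a small sketch of how discovered classes might be grouped into VALUES queries. The helper names `batched` and `phase2_query` are hypothetical, and the query shape is an assumption modeled on the typed-object query. The module's real Phase-2 queries may differ.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Split items into consecutive slices of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def phase2_query(class_uris: Sequence[str]) -> str:
    """Build a typed-object pattern query restricted to the given classes."""
    values = " ".join(f"<{uri}>" for uri in class_uris)
    return (
        "SELECT DISTINCT ?sc ?p ?oc WHERE { "
        f"VALUES ?sc {{ {values} }} "
        "?s a ?sc . ?s ?p ?o . ?o a ?oc . }"
    )


# Placeholder class URIs standing in for the Phase-1 result.
classes = [f"http://example.org/C{i}" for i in range(40)]
queries = [phase2_query(batch) for batch in batched(classes, 15)]
print(len(queries))  # 40 classes at 15 per batch -> 3 queries
```

A larger batch size lowers the number of round trips, while each query has to match more subject classes at once.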
Initialize a SchemaMiner.
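The effect of filter_service_namespaces can be approximated with a simple prefix check. This is a sketch under stated assumptions: the prefix tuple below contains a single example (the OpenLink namespace root), the module's actual namespace list is not shown in this reference, and `is_service_uri` and `filter_service_patterns` are hypothetical names.

```python
# Assumed example prefix; the real list of service namespaces is not documented here.
SERVICE_PREFIXES = ("http://www.openlinksw.com/",)


def is_service_uri(uri: str) -> bool:
    """Return True if the URI starts with any known service-namespace prefix."""
    return uri.startswith(SERVICE_PREFIXES)


def filter_service_patterns(patterns):
    """Drop (subject_class, property, object_class) tuples that touch a service namespace."""
    return [p for p in patterns if not any(is_service_uri(u) for u in p)]


patterns = [
    ("http://example.org/Person", "http://example.org/knows", "http://example.org/Person"),
    (
        "http://www.openlinksw.com/schemas/virtrdf#QuadMap",
        "http://example.org/p",
        "http://example.org/Thing",
    ),
]
kept = filter_service_patterns(patterns)
print(len(kept))  # only the example.org pattern remains
```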
- mine(dataset_name: str | None = None) MinedSchema[source]
Run all queries and return a MinedSchema.
- Parameters:
dataset_name – Optional human-readable name attached to the metadata.
Notes
The method also populates a MiningReport with per-phase timing, query counts, and failure stats. If a report_path was given at construction time, the JSON is flushed to disk after each phase completes.
- mine_schema(endpoint_url: str, graph_uris: str | list[str] | None = None, dataset_name: str | None = None, chunk_size: int = 10000, class_chunk_size: int | None = None, class_batch_size: int = 15, delay: float = 0.5, timeout: float = 120.0, counts: bool = True, two_phase: bool = True, report_path: str | Path | None = None, filter_service_namespaces: bool = True, untyped_as_classes: bool = False, authors: list[dict[str, str]] | None = None, qlever_version: dict[str, str] | None = None, one_shot: bool = False) MinedSchema[source]
One-shot helper: mine a schema and return a MinedSchema.
- Parameters:
endpoint_url – SPARQL endpoint URL.
graph_uris – Named-graph URI(s) to restrict queries to.
dataset_name – Human-readable name for the dataset.
chunk_size – Pagination page size for pattern queries (single-pass and count queries).
class_chunk_size – Page size for the Phase-1 class-discovery query in two-phase mode. None (default) disables pagination: the class list is fetched in a single query. Set to a positive integer when the endpoint has too many classes for one response.
class_batch_size – Number of classes to group into a single VALUES query in Phase-2 of two-phase mining. Default 15. Higher values send fewer queries but each query is heavier.
delay – Delay between pages (seconds).
timeout – HTTP timeout per request.
counts – Fetch triple counts per pattern.
two_phase – Use two-phase mining (default True). Pass False for the legacy single-pass strategy.
one_shot – Run each pattern query as a single unbounded SELECT with no LIMIT/OFFSET and no fallback chain. Intended for local QLever endpoints. When True, two_phase is ignored.
report_path – If given, write an analytics JSON report to this path. The file is updated incrementally after each mining phase.
filter_service_namespaces – Strip patterns whose URIs belong to service / system namespaces (Virtuoso, OpenLink, etc.) from the result. Default True.
untyped_as_classes – Treat untyped URI objects as owl:Class references instead of the generic rdfs:Resource sentinel. Default False.
- Returns:
Contains patterns and provenance metadata.
- Return type: