Miner

Schema Miner - extract RDF schema patterns via simple SELECT queries.

Instead of building VoID on the endpoint with heavy CONSTRUCT + BIND queries, this module runs three lightweight SELECT DISTINCT queries and assembles the schema in Python:

Typed-object patterns:

SELECT DISTINCT ?sc ?p ?oc WHERE {
  ?s ?p ?o . ?s a ?sc . ?o a ?oc .
}

Literal patterns (datatype properties):

SELECT DISTINCT ?sc ?p (DATATYPE(?o) AS ?dt) WHERE {
  ?s ?p ?o . ?s a ?sc . FILTER(isLiteral(?o))
}

Untyped-URI patterns (URI objects without rdf:type):

SELECT DISTINCT ?sc ?p WHERE {
  ?s ?p ?o . ?s a ?sc .
  FILTER(isURI(?o))
  FILTER NOT EXISTS { ?o a ?any }
}
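
As a rough illustration of the "assembles the schema in Python" step, the sketch below merges rows from the three queries into one nested mapping. The function name `assemble_patterns` and the output shape are assumptions made for this example; they are not the module's actual internals. The `rdfs:Resource` sentinel for untyped URI objects follows the `untyped_as_classes` description further down.

```python
from collections import defaultdict

# Sentinel used here for URI objects that carry no rdf:type (assumed representation).
RDFS_RESOURCE = "http://www.w3.org/2000/01/rdf-schema#Resource"


def assemble_patterns(typed_rows, literal_rows, untyped_rows):
    """Merge rows from the three SELECT queries.

    Returns {subject_class: {property: [object class or datatype, ...]}}.
    This is a hypothetical helper, not part of the documented API.
    """
    schema = defaultdict(lambda: defaultdict(set))
    for sc, p, oc in typed_rows:      # typed-object patterns: ?sc ?p ?oc
        schema[sc][p].add(oc)
    for sc, p, dt in literal_rows:    # literal patterns: ?sc ?p ?dt
        schema[sc][p].add(dt)
    for sc, p in untyped_rows:        # untyped-URI patterns: ?sc ?p
        schema[sc][p].add(RDFS_RESOURCE)
    # Plain dicts with sorted lists are easy to serialize as JSON.
    return {
        sc: {p: sorted(targets) for p, targets in props.items()}
        for sc, props in schema.items()
    }
```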

Unless one_shot is enabled, all queries use OFFSET / LIMIT pagination via SparqlHelper.select_chunked().
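
The signature of SparqlHelper.select_chunked() is not documented here, so the following is only a generic sketch of OFFSET / LIMIT pagination. `select_pages` and the `run_query` callable are hypothetical names; `run_query` stands in for whatever sends a query string to the endpoint and returns a list of rows.

```python
import time
from typing import Callable, Iterator


def select_pages(
    run_query: Callable[[str], list[dict]],
    query: str,
    chunk_size: int = 10000,
    delay: float = 0.0,
) -> Iterator[dict]:
    """Yield rows page by page until a page is shorter than chunk_size."""
    offset = 0
    while True:
        page = run_query(f"{query}\nLIMIT {chunk_size} OFFSET {offset}")
        yield from page
        if len(page) < chunk_size:
            break  # a short (or empty) page means the result set is exhausted
        offset += chunk_size
        time.sleep(delay)  # pause between pagination requests
```

SPARQL does not guarantee a stable row order without ORDER BY, so a real implementation may need an ordering or deduplication step on top of this loop.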

The primary export is MinedSchema (-> JSON-LD). It can also be converted to a VoID graph for downstream LinkML / SHACL / RDF-config export via VoidParser.

class SchemaMiner(endpoint_url: str, graph_uris: str | list[str] | None = None, chunk_size: int = 10000, class_chunk_size: int | None = None, class_batch_size: int = 15, delay: float = 0.5, timeout: float = 120.0, counts: bool = True, two_phase: bool = True, unsafe_paging: bool = False, report_path: str | Path | None = None, filter_service_namespaces: bool = True, untyped_as_classes: bool = False, authors: list[dict[str, str]] | None = None, qlever_version: dict[str, str] | None = None, one_shot: bool = False)

Bases: object

Mine RDF schema patterns from a SPARQL endpoint.

Parameters:

endpoint_url – SPARQL endpoint URL.
graph_uris – Optional named-graph URI(s) to restrict queries to.
chunk_size – Number of rows per paginated request.
class_chunk_size – Page size for Phase-1 class discovery in two-phase mode. None disables pagination (single query).
class_batch_size – Number of classes grouped into one VALUES query in Phase-2 of two-phase mining. Default 15. Higher values send fewer queries but each query is heavier.
delay – Seconds to sleep between pagination requests.
timeout – HTTP timeout per request (seconds).
counts – Whether to also run COUNT queries for triple counts.
two_phase – Use two-phase mining (default). Phase 1 discovers all rdf:type classes; phase 2 queries properties per class. Much gentler on heavyweight endpoints like QLever/PubChem/UniProt. Pass False for the legacy single-pass strategy.
filter_service_namespaces – When True (the default), remove patterns whose subject, property, or object URI belongs to a service/system namespace (Virtuoso, OpenLink, etc.) from the final result.
untyped_as_classes – When True, treat untyped URI objects (those without an explicit rdf:type) as owl:Class references instead of the generic rdfs:Resource sentinel. Default False.

Initialize a SchemaMiner.
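
To make the class_batch_size trade-off concrete, here is a minimal sketch of how Phase-2 batching could work: the classes found in Phase 1 are split into groups, and each group becomes one VALUES-restricted query. The helper names and the exact query text are assumptions for illustration; the queries the miner actually sends are not shown in this reference.

```python
def batch_classes(classes: list[str], class_batch_size: int = 15) -> list[list[str]]:
    """Split the Phase-1 class list into groups of at most class_batch_size."""
    return [
        classes[i:i + class_batch_size]
        for i in range(0, len(classes), class_batch_size)
    ]


def typed_object_query(batch: list[str]) -> str:
    """Build one typed-object pattern query restricted to a batch of classes."""
    values = " ".join(f"<{iri}>" for iri in batch)
    return (
        "SELECT DISTINCT ?sc ?p ?oc WHERE {\n"
        f"  VALUES ?sc {{ {values} }}\n"
        "  ?s a ?sc . ?s ?p ?o . ?o a ?oc .\n"
        "}"
    )
```

With 32 classes and the default batch size of 15, this yields three queries instead of 32 per-class queries or one unrestricted query.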

mine(dataset_name: str | None = None) → MinedSchema

Run all queries and return a MinedSchema.

Parameters: dataset_name – Optional human-readable name attached to the metadata.

Notes

The method also populates a MiningReport with per-phase timing, query counts, and failure stats. If a report_path was given at construction time, the JSON is flushed to disk after each phase completes.
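
The incremental flush described above can be pictured with the sketch below. It does not use the real MiningReport class, whose fields are not documented here; the dictionary layout, the `run_phases` helper, and the phase names in the test data are all hypothetical.

```python
import json
import time
from pathlib import Path


def run_phases(phases, report_path=None):
    """Run (name, callable) phases, rewriting the JSON report after each one."""
    report = {"phases": []}
    for name, fn in phases:
        start = time.perf_counter()
        failed = False
        try:
            fn()
        except Exception:
            failed = True  # record the failure and continue with the next phase
        report["phases"].append(
            {
                "name": name,
                "seconds": round(time.perf_counter() - start, 3),
                "failed": failed,
            }
        )
        if report_path is not None:
            # Rewrite the whole file so it always reflects the phases finished so far.
            Path(report_path).write_text(json.dumps(report, indent=2))
    return report
```

Writing after every phase means a partial report survives even if a later phase hangs or the process is interrupted.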

mine_schema(endpoint_url: str, graph_uris: str | list[str] | None = None, dataset_name: str | None = None, chunk_size: int = 10000, class_chunk_size: int | None = None, class_batch_size: int = 15, delay: float = 0.5, timeout: float = 120.0, counts: bool = True, two_phase: bool = True, report_path: str | Path | None = None, filter_service_namespaces: bool = True, untyped_as_classes: bool = False, authors: list[dict[str, str]] | None = None, qlever_version: dict[str, str] | None = None, one_shot: bool = False) → MinedSchema

One-shot helper: mine a schema and return MinedSchema.

Parameters:

endpoint_url – SPARQL endpoint URL.
graph_uris – Named-graph URI(s) to restrict queries to.
dataset_name – Human-readable name for the dataset.
chunk_size – Pagination page size for pattern queries (single-pass and count queries).
class_chunk_size – Page size for the Phase-1 class-discovery query in two-phase mode. None (default) disables pagination - the class list is fetched in a single query. Set to a positive integer when the endpoint has too many classes for one response.
class_batch_size – Number of classes to group into a single VALUES query in Phase-2 of two-phase mining. Default 15. Higher values send fewer queries but each query is heavier.
delay – Delay between pages (seconds).
timeout – HTTP timeout per request.
counts – Fetch triple counts per pattern.
two_phase – Use two-phase mining (default True). Pass False for the legacy single-pass strategy.
one_shot – Run each pattern query as a single unbounded SELECT with no LIMIT/OFFSET and no fallback chain. Intended for local QLever endpoints. When True, two_phase is ignored.
report_path – If given, write an analytics JSON report to this path. The file is updated incrementally after each mining phase.
filter_service_namespaces – Strip patterns whose URIs belong to service / system namespaces (Virtuoso, OpenLink, etc.) from the result. Default True.
untyped_as_classes – Treat untyped URI objects as owl:Class references instead of the generic rdfs:Resource sentinel. Default False.

Returns:

The mined schema, containing patterns and provenance metadata.

Return type:

MinedSchema
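
The effect of filter_service_namespaces can be sketched as a simple prefix check over each pattern's URIs. The namespace list below is illustrative only (the library's actual list is not given in this reference), and `is_service_uri` and `filter_patterns` are hypothetical helper names.

```python
# Illustrative prefixes; the real list used by the library may differ.
SERVICE_NAMESPACES = (
    "http://www.openlinksw.com/schemas/",
    "http://www.w3.org/ns/sparql-service-description#",
)


def is_service_uri(uri: str) -> bool:
    """True if the URI starts with one of the service/system prefixes."""
    return uri.startswith(SERVICE_NAMESPACES)


def filter_patterns(patterns: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    """Drop patterns whose subject class, property, or object is a service URI."""
    return [
        pattern
        for pattern in patterns
        if not any(is_service_uri(uri) for uri in pattern)
    ]
```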