Sources

Load data-source definitions from data/sources.yaml.

The canonical source registry is a YAML file containing a flat list of mappings, one per SPARQL data source. Each mapping carries:

  • name - unique human-readable identifier.

  • endpoint - SPARQL endpoint URL.

  • graph_uris - named graphs to query.

  • use_graph - whether to wrap queries in a GRAPH clause.

  • two_phase - whether to use two-phase mining (default: True).

  • Optional tuning knobs: chunk_size, class_batch_size, class_chunk_size, timeout, delay, counts, unsafe_paging.

Legacy CSV files (data/sources.csv) and JSON-LD files are still accepted: the reader auto-detects the format by extension.
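
As a rough illustration of extension-based detection, the sketch below maps a file suffix to a format label. The function name detect_format, the suffix table, and the error behaviour are all hypothetical; they are not taken from the rdfsolve implementation.

```python
from pathlib import Path

# Hypothetical suffix-to-format table; the suffixes the real reader
# recognises are an assumption made for this sketch.
_FORMATS = {
    ".yaml": "yaml",
    ".yml": "yaml",
    ".jsonld": "jsonld",
    ".csv": "csv",
}


def detect_format(path):
    """Return a format label chosen from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in _FORMATS:
        raise ValueError(f"unsupported sources file extension: {suffix!r}")
    return _FORMATS[suffix]


fmt = detect_format("data/sources.yaml")
```

Dispatching on the suffix alone keeps detection cheap, at the cost of trusting the file name rather than inspecting the content.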

Typical usage:

from rdfsolve.sources import load_sources

for src in load_sources("data/sources.yaml"):
    print(src["name"], src["endpoint"])

class SourceEntry

Bases: TypedDict

Typed dictionary for a single data-source definition.

name: str
endpoint: str
void_iri: str
graph_uris: list[str]
use_graph: bool
two_phase: bool
chunk_size: int
class_batch_size: int
class_chunk_size: int | None
timeout: float
delay: float
counts: bool
unsafe_paging: bool
notes: str
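
To show what such an entry looks like at runtime, here is a minimal sketch using a stand-in TypedDict that mirrors a few of the fields above. ExampleEntry, the example.org URLs, total=False, and the fallback timeout value are assumptions made for illustration; the real SourceEntry may declare required keys differently.

```python
from typing import TypedDict


class ExampleEntry(TypedDict, total=False):
    """Stand-in for SourceEntry with a subset of its fields (illustration only)."""

    name: str
    endpoint: str
    graph_uris: list[str]
    use_graph: bool
    two_phase: bool
    timeout: float


entry: ExampleEntry = {
    "name": "example-source",
    "endpoint": "https://example.org/sparql",
    "graph_uris": ["https://example.org/graph/main"],
    "use_graph": True,
    "two_phase": True,
}

# A TypedDict is a plain dict at runtime, so ordinary dict access works.
# 60.0 is an arbitrary fallback chosen for this sketch, not a library default.
timeout = entry.get("timeout", 60.0)
```

The type annotations are checked by static type checkers only; nothing is validated when the dict is built.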

load_sources(path: str | Path | None = None) -> list[SourceEntry]

Load data-source definitions from a YAML, JSON-LD, or CSV file.

Parameters:

path – Path to the sources file. When None, the default data/sources.yaml is used, with a .jsonld or .csv fallback.

Returns:

One dict per data source, keys normalised to snake_case. Sources without an endpoint are included (callers may skip them).

Return type:

list[SourceEntry]
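
Because entries without an endpoint are returned, callers that only want queryable sources need to filter them out. The sketch below does this on hand-written sample dicts rather than on real load_sources() output; the helper name with_endpoint is hypothetical, and how a missing endpoint is represented (absent key, None, or empty string) is an assumption, so the helper treats all three as missing.

```python
def with_endpoint(sources):
    """Keep only entries whose endpoint is a non-empty, non-blank string."""
    return [src for src in sources if (src.get("endpoint") or "").strip()]


# Sample data standing in for the list returned by load_sources().
sample = [
    {"name": "a", "endpoint": "https://example.org/sparql"},
    {"name": "b", "endpoint": ""},
    {"name": "c", "endpoint": None},
    {"name": "d"},
]

queryable = with_endpoint(sample)
names = [src["name"] for src in queryable]
```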

load_sources_dataframe(path: str | Path | None = None) -> DataFrame

Load sources and return a DataFrame.

The DataFrame has columns compatible with probe_resource(): dataset_name, endpoint_url, graph_uri, use_graph, void_iri.

Parameters:

path – Path to the sources file. When None, the default file is auto-detected.