Sources

Load data-source definitions from data/sources.yaml.

The canonical source registry is a YAML file containing a flat list of mappings, one per SPARQL data source. Each mapping carries:

name - unique human-readable identifier.
endpoint - SPARQL endpoint URL.
graph_uris - named graphs to query.
use_graph - whether to wrap queries in a GRAPH clause.
two_phase - whether to use two-phase mining (default True).
Optional tuning knobs: chunk_size, class_batch_size, class_chunk_size, timeout, delay, counts, unsafe_paging.

Legacy CSV files (data/sources.csv) and JSON-LD files are still accepted; the reader auto-detects the format from the file extension.
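Extension-based detection of this kind can be sketched in a few lines. This is only an illustration, not the reader's actual code: the detect_format helper, the _FORMATS table, and the acceptance of .yml are assumptions made for the example.

```python
from pathlib import Path

# Hypothetical mapping from file extension to format label. The real
# reader may recognise a different set of extensions (".yml" is assumed).
_FORMATS = {
    ".yaml": "yaml",
    ".yml": "yaml",
    ".jsonld": "jsonld",
    ".csv": "csv",
}


def detect_format(path):
    """Return a format label based on the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in _FORMATS:
        raise ValueError(f"Unsupported sources file extension: {suffix!r}")
    return _FORMATS[suffix]


print(detect_format("data/sources.yaml"))
print(detect_format("data/sources.csv"))
```

Raising on an unknown extension, rather than guessing a format, keeps a mistyped path from being parsed with the wrong reader.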

Typical usage:

from rdfsolve.sources import load_sources

for src in load_sources("data/sources.yaml"):
    print(src["name"], src["endpoint"])

class SourceEntry

Typed dictionary for a single data-source definition.
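As a rough picture of what such a typed dictionary could look like, the sketch below declares the documented keys with typing.TypedDict. The class name SourceEntrySketch, the field types, and the example URL are assumptions; the actual SourceEntry definition may differ, and only two of the tuning knobs are shown.

```python
from typing import TypedDict


class SourceEntrySketch(TypedDict, total=False):
    """Illustrative stand-in for SourceEntry; field types are assumed."""

    name: str              # unique human-readable identifier
    endpoint: str          # SPARQL endpoint URL
    graph_uris: list[str]  # named graphs to query
    use_graph: bool        # wrap queries in a GRAPH clause
    two_phase: bool        # two-phase mining, documented default True
    chunk_size: int        # optional tuning knob (type assumed)
    timeout: float         # optional tuning knob (type assumed)


# Hypothetical entry; the URLs are placeholders, not real sources.
entry: SourceEntrySketch = {
    "name": "example",
    "endpoint": "https://example.org/sparql",
    "graph_uris": ["https://example.org/graph"],
    "use_graph": True,
}

# A missing two_phase key falls back to the documented default.
two_phase = entry.get("two_phase", True)
print(entry["name"], two_phase)
```

Declaring the class with total=False makes every key optional for type checkers, which matches a registry where most knobs may be omitted.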

load_sources(path: str | Path | None = None) → list[SourceEntry]

Load data-source definitions from a YAML, JSON-LD, or CSV file.

Parameters: path – Path to the sources file. When None, the default data/sources.yaml is used, falling back to the .jsonld or .csv file.
Returns: One dict per data source, with keys normalised to snake_case. Sources without an endpoint are included; callers may skip them.
Return type: list[SourceEntry]
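Because entries without an endpoint are returned as well, a caller that only wants queryable sources has to filter them out. A minimal sketch follows, using a hand-written sample list in place of the load_sources() result; the with_endpoint helper is hypothetical, and whether a missing endpoint appears as an absent key, None, or an empty string is an assumption, so the check covers all three.

```python
def with_endpoint(sources):
    """Keep only entries whose endpoint is present and non-empty."""
    return [src for src in sources if src.get("endpoint")]


# Hand-written stand-in for the list returned by load_sources().
sample = [
    {"name": "alpha", "endpoint": "https://example.org/alpha/sparql"},
    {"name": "beta", "endpoint": ""},
    {"name": "gamma", "endpoint": None},
    {"name": "delta"},
]

queryable = with_endpoint(sample)
print([src["name"] for src in queryable])
```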

load_sources_dataframe(path: str | Path | None = None) → DataFrame

Load sources and return a DataFrame.

The DataFrame has columns compatible with probe_resource(): dataset_name, endpoint_url, graph_uri, use_graph, void_iri.

Parameters: path – Path to the sources file. When None, the default file is auto-detected.
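Before handing such a frame to probe_resource(), a caller might want to confirm that the documented columns are present. The helper below is a hypothetical sketch and not part of the library; it accepts any iterable of column names, so it should work with df.columns from a DataFrame as well as with a plain list.

```python
# Column names as listed in the description above.
EXPECTED_COLUMNS = [
    "dataset_name",
    "endpoint_url",
    "graph_uri",
    "use_graph",
    "void_iri",
]


def missing_columns(columns):
    """Return the expected column names absent from `columns`, in order."""
    present = set(columns)
    return [col for col in EXPECTED_COLUMNS if col not in present]


# Plain lists stand in for df.columns here.
print(missing_columns(EXPECTED_COLUMNS))
print(missing_columns(["dataset_name", "endpoint_url"]))
```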