Tools
Utility tools for RDFSolve - download helpers, decompression, etc.
Benchmark
Benchmarking and metrics collection for RDFSolve mining runs.
Captures system-level performance data suitable for computer-science papers and reproducibility analysis:
Machine specifications (CPU, RAM, OS, kernel, hostname)
Per-run resource usage (wall time, peak RSS, CPU time, disk I/O)
QLever index/server metrics (index size, indexing time)
Environment info (Python version, rdfsolve version, QLever version)
Per-dataset statistics (triple count, file sizes, errors)
All data is written to benchmarks.jsonl (one JSON object per line)
for easy aggregation with pandas / polars.
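Because each line is a standalone JSON object, the file can be aggregated with very little code. The sketch below uses only the standard library and assumes that each record exposes the RunMetrics fields dataset, success, and wall_time_s at the top level; the layout actually written by BenchmarkCollector may differ (machine info could be nested, for example). The helper names load_jsonl and summarize_runs are illustrative and not part of rdfsolve.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize_runs(records: list[dict]) -> dict[str, dict]:
    """Group records by dataset; total the wall time of successful runs."""
    summary: dict[str, dict] = defaultdict(
        lambda: {"runs": 0, "ok": 0, "wall_time_s": 0.0}
    )
    for rec in records:
        # Field names are assumed to match RunMetrics (see lead-in).
        entry = summary[rec.get("dataset", "")]
        entry["runs"] += 1
        if rec.get("success"):
            entry["ok"] += 1
            entry["wall_time_s"] += float(rec.get("wall_time_s", 0.0))
    return dict(summary)
```

With pandas installed, pandas.read_json(path, lines=True) loads the same file into a DataFrame.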
Usage from mine_local.py:
from pathlib import Path
from rdfsolve.tools.benchmark import BenchmarkCollector

bench = BenchmarkCollector(output_dir=Path("mined_schemas"))
with bench.track("affymetrix", method="local-mine") as run:
    _mine_single_local(endpoint, name, out, args)
    run.add_extra("triples_indexed", 123456)
# run data auto-flushed to benchmarks.jsonl
- class MachineInfo(hostname: str = '', os_name: str = '', os_release: str = '', os_version: str = '', architecture: str = '', cpu_model: str = '', cpu_count_logical: int = 0, cpu_count_physical: int = 0, ram_total_gb: float = 0.0, python_version: str = '', rdfsolve_version: str = '', qlever_version: str = '')[source]
Bases: object
Static information about the host machine.
- collect_machine_info() MachineInfo[source]
Gather static machine specifications.
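As a rough illustration of where such values can come from, the sketch below fills a subset of the MachineInfo fields using only the standard library. It is not the rdfsolve implementation: the function name sketch_machine_info is hypothetical, and fields such as cpu_model, cpu_count_physical, and ram_total_gb are omitted because they usually require platform-specific code or a third-party package.

```python
import os
import platform
import socket


def sketch_machine_info() -> dict[str, object]:
    """Collect a subset of MachineInfo-like fields (illustrative only)."""
    return {
        "hostname": socket.gethostname(),
        "os_name": platform.system(),        # e.g. "Linux"
        "os_release": platform.release(),    # kernel / OS release string
        "os_version": platform.version(),
        "architecture": platform.machine(),  # e.g. "x86_64"
        "cpu_count_logical": os.cpu_count() or 0,
        "python_version": platform.python_version(),
    }
```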
- class RunMetrics(dataset: str = '', method: str = '', endpoint: str = '', started_at: str = '', finished_at: str = '', wall_time_s: float = 0.0, cpu_user_s: float = 0.0, cpu_system_s: float = 0.0, peak_rss_mb: float = 0.0, read_bytes: int = 0, write_bytes: int = 0, success: bool = False, error: str = '', classes_found: int = 0, properties_found: int = 0, triples_count: int = 0, output_files: dict[str, str]=<factory>, output_sizes_mb: dict[str, float]=<factory>, extra: dict[str, ~typing.Any]=<factory>)[source]
Bases: object
Resource usage captured for a single mining run.
- class BenchmarkCollector(output_dir: Path)[source]
Bases: object
Collects and persists benchmark data for mining runs.
Usage:
collector = BenchmarkCollector(output_dir)
with collector.track("drugbank", method="local-mine") as run:
    # … do the actual mining …
    run.add_extra("index_size_mb", 420)
# run metrics auto-saved to benchmarks.jsonl
Initialize a BenchmarkCollector with an output directory Path.
- Parameters:
output_dir – Path to output directory
- property machine_info: MachineInfo
Return the static machine info.
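The tracking pattern shown in the usage snippet can be approximated with a context manager that times the enclosed block and records the outcome on exit. The sketch below is a simplified stand-in, not the BenchmarkCollector code: SketchRun and the free function track are hypothetical names, records are appended to an in-memory list instead of benchmarks.jsonl, peak RSS and disk I/O are left out, and re-raising the exception after recording it is an assumption about the desired behaviour.

```python
import os
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Iterator


@dataclass
class SketchRun:
    """Hypothetical stand-in carrying a few RunMetrics-style fields."""
    dataset: str = ""
    method: str = ""
    started_at: str = ""
    finished_at: str = ""
    wall_time_s: float = 0.0
    cpu_user_s: float = 0.0
    cpu_system_s: float = 0.0
    success: bool = False
    error: str = ""
    extra: dict[str, Any] = field(default_factory=dict)

    def add_extra(self, key: str, value: Any) -> None:
        self.extra[key] = value


@contextmanager
def track(dataset: str, method: str, sink: list[SketchRun]) -> Iterator[SketchRun]:
    """Time the enclosed block and append the finished record to `sink`."""
    run = SketchRun(dataset=dataset, method=method,
                    started_at=datetime.now(timezone.utc).isoformat())
    wall0 = time.perf_counter()
    cpu0 = os.times()
    try:
        yield run
        run.success = True
    except Exception as exc:
        run.error = repr(exc)  # record the failure, then propagate it
        raise
    finally:
        cpu1 = os.times()
        run.wall_time_s = time.perf_counter() - wall0
        run.cpu_user_s = cpu1.user - cpu0.user
        run.cpu_system_s = cpu1.system - cpu0.system
        run.finished_at = datetime.now(timezone.utc).isoformat()
        sink.append(run)  # the record is saved whether or not the block failed
```

Saving in the finally clause means a failed run still produces a record, which matches the success and error fields on RunMetrics.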
Decompress
Decompress RDF data files (.gz, .xz) in bulk.
Supports the two compression formats found across RDF data repositories:
.gz used by e.g. RDFPortal (*.ttl.gz), Bio2RDF (*.nq.gz)
.xz used by e.g. UniProt FTP (*.rdf.xz, *.owl.xz)
Usage
# Decompress a single file (auto-detects format)
python -m rdfsolve.tools.decompress data/medgen/MGCONSO.ttl.gz

# Decompress everything under a directory
python -m rdfsolve.tools.decompress data/uniprot_local/ --recursive

# Dry-run: just show what would be decompressed
python -m rdfsolve.tools.decompress data/ --recursive --dry-run

# Keep the original compressed files
python -m rdfsolve.tools.decompress data/ --recursive --keep
- decompress_directory(directory: Path, *, recursive: bool = False, keep: bool = False, extensions: set[str] | None = None) list[Path][source]
Decompress all supported compressed files in a directory.
- Parameters:
directory – Root directory to scan.
recursive – If True, walk subdirectories as well.
keep – Passed to decompress_file().
extensions – Restrict to a subset of extensions (e.g. {".gz"}). Defaults to all supported extensions.
- Returns:
Paths to successfully decompressed files.
- Return type:
list[Path]
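The interaction between recursive and extensions can be illustrated with a small file-selection helper. This is a sketch under stated assumptions, not the library's code: find_compressed and SUPPORTED are hypothetical names, matching is done on the final suffix only, and the result is sorted for determinism.

```python
from pathlib import Path
from typing import Optional

SUPPORTED = {".gz", ".xz"}  # the two formats named in this module's docs


def find_compressed(directory: Path, *, recursive: bool = False,
                    extensions: Optional[set] = None) -> list:
    """List files under `directory` whose final suffix is a selected extension."""
    # None means "all supported extensions"; otherwise restrict to the subset.
    wanted = SUPPORTED if extensions is None else (extensions & SUPPORTED)
    # rglob walks subdirectories; glob stays at the top level.
    candidates = directory.rglob("*") if recursive else directory.glob("*")
    return sorted(p for p in candidates
                  if p.is_file() and p.suffix.lower() in wanted)
```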
- decompress_file(src: Path, *, dest: Path | None = None, keep: bool = False, chunk_size: int = 65536) Path | None[source]
Decompress a single .gz or .xz file.
- Parameters:
src – Path to the compressed file.
dest – Explicit output path. When None the suffix is stripped (e.g. foo.ttl.gz -> foo.ttl).
keep – If True the original compressed file is kept; otherwise it is removed after successful decompression.
chunk_size – Read/write buffer size in bytes (default 64 KiB).
- Returns:
Path to the decompressed file, or None on error.
- Return type:
Path | None
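The documented behaviour (suffix-based format detection, suffix stripping, chunked copying, optional removal of the source, None on error) can be sketched with the standard gzip, lzma, and shutil modules. This is an illustrative approximation, not the rdfsolve implementation: decompress_sketch and OPENERS are hypothetical names, and deleting a partially written output on failure is an assumption.

```python
import gzip
import lzma
import shutil
from pathlib import Path
from typing import Optional

# Map each supported suffix to the stdlib function that opens it.
OPENERS = {".gz": gzip.open, ".xz": lzma.open}


def decompress_sketch(src: Path, *, dest: Optional[Path] = None,
                      keep: bool = False,
                      chunk_size: int = 65536) -> Optional[Path]:
    """Decompress one .gz or .xz file; return the output path or None."""
    opener = OPENERS.get(src.suffix.lower())
    if opener is None:
        return None  # unsupported extension
    # foo.ttl.gz -> foo.ttl when no explicit destination is given
    out = dest if dest is not None else src.with_suffix("")
    try:
        with opener(src, "rb") as fin, out.open("wb") as fout:
            # Stream in chunk_size blocks so large dumps never sit in memory.
            shutil.copyfileobj(fin, fout, chunk_size)
    except (OSError, EOFError, lzma.LZMAError):
        out.unlink(missing_ok=True)  # assumed: discard partial output
        return None
    if not keep:
        src.unlink()  # remove the original only after success
    return out
```

Removing the source only after the copy has finished means a corrupt or truncated archive never costs the original file.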