# Benchmarking
Archi provides benchmarking functionality via the `archi evaluate` CLI command to measure retrieval and response quality.
## Evaluation Modes
Two modes are supported (can be used together):
### SOURCES Mode

Checks whether the retrieved documents match the expected sources by comparing metadata fields against the source identifiers listed in the queries file.
- Default match field: `file_name` (configurable per query)
- Override with `source_match_field` in the queries file
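Conceptually, the check walks over the metadata of the retrieved documents and looks for each expected source identifier in the configured field(s). A minimal sketch of that idea (hypothetical helper, not Archi's actual implementation):

```python
# Hypothetical sketch of the SOURCES check, not Archi's actual code.
# Each retrieved document is assumed to expose a metadata dict; expected
# sources are matched against the configured metadata field(s).
def sources_hit(retrieved_metadata: list[dict],
                expected_sources: list[str],
                match_fields: tuple[str, ...] = ("file_name",)) -> bool:
    """Return True if every expected source shows up in some retrieved doc."""
    found = {
        str(meta[field])
        for meta in retrieved_metadata
        for field in match_fields
        if field in meta
    }
    return all(src in found for src in expected_sources)

# Example: two retrieved chunks, one expected source.
docs = [{"file_name": "runbook.md"}, {"file_name": "faq.md"}]
print(sources_hit(docs, ["runbook.md"]))  # True
```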
### RAGAS Mode

Uses the Ragas evaluator to compute four metrics:
- Answer relevancy: How relevant the answer is to the question
- Faithfulness: Whether the answer is grounded in the retrieved context
- Context precision: How relevant the retrieved documents are
- Context relevancy: How much of the retrieved context is useful
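Archi drives this evaluation through its own config, but the underlying metrics come from the standalone `ragas` package and can be spot-checked on a single QA pair. A rough sketch, assuming a 0.1.x-era `ragas` API (metric names and required dataset columns shift between releases) and an OpenAI key in the environment:

```python
# Rough sketch using the standalone ragas package (0.1.x-era API);
# metric names and required columns vary between ragas releases, and
# context_relevancy only exists in older ones.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness, context_precision

data = Dataset.from_dict({
    "question": ["Does Jorian Benke work with the PPC?"],
    "answer": ["Yes, Jorian works with the PPC."],  # generated answer
    "contexts": [["Jorian Benke joined the PPC as its first in-house master's student."]],
    "ground_truth": ["Yes, Jorian works with the PPC and her topic is Lorentz invariance."],
})

result = evaluate(data, metrics=[answer_relevancy, faithfulness, context_precision])
print(result)  # per-metric scores between 0 and 1
```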
## Preparing the Queries File
Provide questions, expected answers, and correct sources in JSON format:
```json
[
  {
    "question": "Does Jorian Benke work with the PPC?",
    "sources": [
      "https://ppc.mit.edu/blog/2025/07/14/welcome-our-first-ever-in-house-masters-student/",
      "CMSPROD-42"
    ],
    "answer": "Yes, Jorian works with the PPC and her topic is Lorentz invariance.",
    "source_match_field": ["url", "ticket_id"]
  }
]
```
| Field | Required | Description |
|---|---|---|
| `question` | Yes | The question to ask |
| `sources` | Yes | List of source identifiers (URLs, ticket IDs, etc.) |
| `answer` | Yes | Expected answer (used for RAGAS evaluation) |
| `source_match_field` | No | Metadata fields to match sources against (defaults to the configured value) |
See `examples/benchmarking/queries.json` for a complete example.
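If the queries file is generated programmatically, a quick pre-flight check for the required fields can save a failed run. A minimal sketch (hypothetical helper, not part of Archi):

```python
# Hypothetical pre-flight check for a queries file; not part of Archi.
import json

REQUIRED = ("question", "sources", "answer")

with open("examples/benchmarking/queries.json") as f:
    queries = json.load(f)

for i, query in enumerate(queries):
    missing = [key for key in REQUIRED if key not in query]
    if missing:
        raise ValueError(f"query {i} is missing required field(s): {missing}")

print(f"{len(queries)} queries look well-formed")
```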
## Configuration
```yaml
services:
  benchmarking:
    agent_class: CMSCompOpsAgent
    agent_md_file: examples/agents/cms-comp-ops.md
    provider: local
    model: qwen3:32b
    ollama_url: http://host.containers.internal:7870
    queries_path: examples/benchmarking/queries.json
    out_dir: bench_out
    modes:
      - "RAGAS"
      - "SOURCES"
    mode_settings:
      sources:
        default_match_field: ["file_name"]
      ragas_settings:
        embedding_model: OpenAI
```
| Key | Default | Description |
|---|---|---|
| `agent_class` | — | Pipeline/agent class to run for benchmark questions |
| `agent_md_file` | — | Path to a single agent markdown file |
| `provider` | — | Provider used for benchmark question answering |
| `model` | — | Model used for benchmark question answering |
| `ollama_url` | — | Ollama base URL when `provider: local` |
| `queries_path` | — | Path to the queries JSON file |
| `out_dir` | — | Output directory for results (must exist) |
| `modes` | — | List of evaluation modes (`RAGAS`, `SOURCES`) |
| `mode_settings.ragas_settings.timeout` | 180 | Max seconds per QA pair for RAGAS evaluation |
| `mode_settings.ragas_settings.batch_size` | Ragas default | Number of QA pairs to evaluate at once |
`archi evaluate` now requires benchmark runtime fields under `services.benchmarking`. `services.chat_app` fields are not used for benchmark runtime configuration.
### RAGAS Settings
| Key | Description |
|---|---|
| `embedding_model` | `OpenAI` or `HuggingFace` |
## Running
Evaluate one or more configurations:
```bash
# Single config file
archi evaluate -n benchmark -c config.yaml -e .secrets.env

# Directory of configs (for comparing hyperparameters)
archi evaluate -n benchmark -cd configs/ -e .secrets.env

# With GPU support
archi evaluate -n benchmark -c config.yaml -e .secrets.env --gpu-ids all
```
Make sure the `out_dir` directory exists before running.
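The directory form (`-cd configs/`) expects one config file per variant you want to compare. One way to generate those variants, sketched here with PyYAML against the config layout shown above (the swept models are only illustrative):

```python
# Hypothetical helper for producing config variants to use with `-cd`;
# assumes PyYAML and the services.benchmarking layout shown above.
import copy
from pathlib import Path

import yaml

base = yaml.safe_load(Path("config.yaml").read_text())
configs_dir = Path("configs")
configs_dir.mkdir(exist_ok=True)

# Sweep the benchmark model; any other key under services.benchmarking
# can be varied the same way.
for model in ["qwen3:32b", "qwen3:14b"]:
    variant = copy.deepcopy(base)
    variant["services"]["benchmarking"]["model"] = model
    name = model.replace(":", "_").replace("/", "_")
    (configs_dir / f"config_{name}.yaml").write_text(yaml.safe_dump(variant))
```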
## Results
Results are saved in a timestamped subdirectory of `out_dir` (e.g., `bench_out/2042-10-01_12-00-00/`).
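Since each run gets its own timestamped directory, the most recent one can be picked up programmatically, e.g. (assuming the naming shown above):

```python
# Pick the most recent benchmark run under out_dir; assumes only
# timestamped run directories live there, as in the example above.
from pathlib import Path

runs = sorted(d for d in Path("bench_out").iterdir() if d.is_dir())
latest = runs[-1]  # timestamp-named directories sort chronologically
print(f"Latest run: {latest}")
```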
To analyze results, see `scripts/benchmarking/`, which contains:
- Plotting functions
- An IPython notebook with usage examples (`benchmark_handler.ipynb`)