Benchmarking

Archi provides benchmarking through the archi evaluate CLI command, which measures retrieval and response quality.

Evaluation Modes

Two modes are supported, and they can be used together:

SOURCES Mode

Checks if retrieved documents contain the correct sources by comparing metadata fields.

  • Default match field: file_name (configurable per-query)
  • Override with source_match_field in the queries file

RAGAS Mode

Uses the Ragas evaluator for four metrics:

  • Answer relevancy: How relevant the answer is to the question
  • Faithfulness: Whether the answer is grounded in the retrieved context
  • Context precision: How relevant the retrieved documents are
  • Context relevancy: How much of the retrieved context is useful

Preparing the Queries File

Provide questions, expected answers, and correct sources in JSON format:

[
  {
    "question": "Does Jorian Benke work with the PPC?",
    "sources": [
      "https://ppc.mit.edu/blog/2025/07/14/welcome-our-first-ever-in-house-masters-student/",
      "CMSPROD-42"
    ],
    "answer": "Yes, Jorian works with the PPC and her topic is Lorentz invariance.",
    "source_match_field": ["url", "ticket_id"]
  }
]
Each query object supports the following fields:

  • question (required): The question to ask
  • sources (required): List of source identifiers (URLs, ticket IDs, etc.)
  • answer (required): Expected answer (used for RAGAS evaluation)
  • source_match_field (optional): Metadata fields to match sources against (defaults to the configured value)
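
For comparison, a minimal entry that omits source_match_field falls back to the configured default (file_name unless changed); the question, answer, and file name below are purely illustrative:

[
  {
    "question": "Where are nightly backups stored?",
    "sources": ["backup-policy.md"],
    "answer": "Nightly backups are stored on the shared storage cluster."
  }
]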

See examples/benchmarking/queries.json for a complete example.


Configuration

services:
  benchmarking:
    agent_class: CMSCompOpsAgent
    agent_md_file: examples/agents/cms-comp-ops.md
    provider: local
    model: qwen3:32b
    ollama_url: http://host.containers.internal:7870
    queries_path: examples/benchmarking/queries.json
    out_dir: bench_out
    modes:
      - "RAGAS"
      - "SOURCES"
    mode_settings:
      sources:
        default_match_field: ["file_name"]
      ragas_settings:
        embedding_model: OpenAI

  • agent_class: Pipeline/agent class to run for benchmark questions
  • agent_md_file: Path to a single agent markdown file
  • provider: Provider used to answer benchmark questions
  • model: Model used to answer benchmark questions
  • ollama_url: Ollama base URL when provider: local
  • queries_path: Path to the queries JSON file
  • out_dir: Output directory for results (must exist)
  • modes: List of evaluation modes (RAGAS, SOURCES)
  • mode_settings.ragas_settings.timeout: Maximum seconds per QA pair for RAGAS evaluation (default: 180)
  • mode_settings.ragas_settings.batch_size: Number of QA pairs evaluated at once (default: Ragas default)

archi evaluate now requires benchmark runtime fields under services.benchmarking. services.chat_app fields are not used for benchmark runtime configuration.

RAGAS Settings

  • embedding_model: OpenAI or HuggingFace
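
All RAGAS-related keys go under mode_settings.ragas_settings. A block setting the three documented keys might look like this (the values are illustrative, not recommendations):

services:
  benchmarking:
    mode_settings:
      ragas_settings:
        embedding_model: HuggingFace
        timeout: 300
        batch_size: 4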

Running

Evaluate one or more configurations:

# Single config file
archi evaluate -n benchmark -c config.yaml -e .secrets.env

# Directory of configs (for comparing hyperparameters)
archi evaluate -n benchmark -cd configs/ -e .secrets.env

# With GPU support
archi evaluate -n benchmark -c config.yaml -e .secrets.env --gpu-ids all

Make sure the out_dir exists before running.
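
For example, a comparison run over a directory of configs might look like this (the config file names are hypothetical):

# out_dir from the config must exist before the run
mkdir -p bench_out

# each config in configs/ is evaluated against the same queries
ls configs/        # e.g. chunk_small.yaml, chunk_large.yaml (illustrative)
archi evaluate -n benchmark -cd configs/ -e .secrets.env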


Results

Results are saved in a timestamped subdirectory of out_dir (e.g., bench_out/2042-10-01_12-00-00/).

To analyze results, see scripts/benchmarking/ which contains:

  • Plotting functions
  • An IPython notebook with usage examples (benchmark_handler.ipynb)
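
For instance, assuming a standard Jupyter installation, the example notebook can be opened with:

jupyter notebook scripts/benchmarking/benchmark_handler.ipynb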