Measure Retrieval Before You Change Defaults

Benchmark retrieval models and code embedding candidates against fixed corpora, real GNO code, and pinned public OSS slices. Use bounded autonomous search to find better models without turning research into chaos.

Key Benefits

  • Regression-first retrieval quality checks
  • Hybrid benchmark snapshots and deltas
  • Code embedding benchmark across canonical, repo, and OSS slices
  • Bounded autonomous search loops
  • Per-collection model recommendations backed by benchmark results

Example Commands

bun run eval:hybrid
bun run bench:code-embeddings --candidate bge-m3-incumbent --write
bun run research:embeddings:autonomous:list-search-candidates
bun run research:embeddings:autonomous:search --dry-run

Why This Matters

Changing a retrieval model because it feels better is a good way to regress the product slowly and not notice until users do.

GNO treats retrieval changes as something to measure:

  1. establish the incumbent baseline
  2. run challengers on fixed benchmark corpora
  3. compare on real GNO code
  4. compare on pinned public OSS slices
  5. only then document or promote a winner
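The steps above amount to a promotion gate. A minimal sketch of that gate follows, assuming each fixture yields one quality score per model; the `FixtureScore` type, the `evaluateChallenger` function, and the `tolerance` value are hypothetical names chosen for illustration, not part of GNO. The sample scores are the nDCG@10 figures reported under Current Results.

```typescript
// Hypothetical shape: one quality score (e.g. nDCG@10) per fixture and model.
type FixtureScore = { fixture: string; incumbent: number; challenger: number };

type Verdict = "promote-candidate" | "keep-incumbent";

// Assumption: a challenger is only worth documenting when it regresses on no
// fixture and clearly wins on at least one. The tolerance is illustrative.
function evaluateChallenger(
  scores: FixtureScore[],
  tolerance = 0.01,
): { verdict: Verdict; wins: string[]; regressions: string[] } {
  const regressions = scores
    .filter((s) => s.challenger < s.incumbent - tolerance)
    .map((s) => s.fixture);
  const wins = scores
    .filter((s) => s.challenger > s.incumbent + tolerance)
    .map((s) => s.fixture);
  const verdict: Verdict =
    regressions.length === 0 && wins.length > 0
      ? "promote-candidate"
      : "keep-incumbent";
  return { verdict, wins, regressions };
}

// Scores taken from the Current Results section of this page.
const gateResult = evaluateChallenger([
  { fixture: "repo-serve", incumbent: 0.1003, challenger: 0.6872 },
  { fixture: "oss-slices", incumbent: 0.6116, challenger: 1.0 },
]);
console.log(gateResult.verdict, gateResult.wins);
```

The gate only returns a verdict; whether to change a default stays a human decision.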

Two Benchmark Lanes

Hybrid Retrieval

For the full retrieval stack:

bun run eval:hybrid
bun run eval:hybrid:baseline
bun run eval:hybrid:delta

These runs answer:

  • does retrieval quality improve?
  • where does latency move?
  • did we accidentally regress the full hybrid path?
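As a rough illustration of what a delta run compares, the sketch below diffs two metric snapshots. The flat `Snapshot` shape and the metric names `ndcgAt10` and `p95LatencyMs` are assumptions made for this example; the real `eval:hybrid` output format may look different.

```typescript
// Hypothetical snapshot shape: metric name -> value.
type Snapshot = Record<string, number>;

// Returns current minus baseline for every metric present in both snapshots.
function diffSnapshots(baseline: Snapshot, current: Snapshot): Snapshot {
  const delta: Snapshot = {};
  for (const metric of Object.keys(baseline)) {
    const before = baseline[metric];
    const after = current[metric];
    if (before !== undefined && after !== undefined) {
      delta[metric] = after - before;
    }
  }
  return delta;
}

// Illustrative values only, not measured results.
const snapshotDelta = diffSnapshots(
  { ndcgAt10: 0.5, p95LatencyMs: 120 },
  { ndcgAt10: 0.75, p95LatencyMs: 100 },
);
console.log(snapshotDelta);
```

A positive quality delta with a negative latency delta is the easy case; mixed signs are where a human has to weigh the trade-off.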

Code Embeddings

For code-focused embedding models:

# Incumbent baseline
bun run bench:code-embeddings --candidate bge-m3-incumbent --write

# Benchmark a real challenger
bun run research:embeddings:autonomous:run-candidate qwen3-embedding-0.6b

Current fixtures:

  • canonical — fixed multi-language code corpus
  • repo-serve — real GNO src/serve slice
  • oss-slices — pinned public OSS repo slices

Current Results

Current code-embedding numbers, as documented in the benchmark artifacts:

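| Fixture    | bge-m3 vector nDCG@10 | Qwen3-Embedding-0.6B-GGUF vector nDCG@10 | Notes                     |
|------------|-----------------------|------------------------------------------|---------------------------|
| repo-serve | 0.1003                | 0.6872                                   | Real GNO src/serve slice  |
| oss-slices | 0.6116                | 1.0                                      | Pinned public OSS slices  |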
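For readers unfamiliar with the metric, here is a minimal sketch of nDCG@10. It uses linear gain with a log2 position discount, which is one common variant; whether the GNO benchmark uses exactly this formulation is an assumption. The function names `dcgAtK` and `ndcgAtK` are hypothetical.

```typescript
// relevances: graded relevance of each retrieved item, in ranked order
// (index 0 is the top hit). Discounted cumulative gain over the top k.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// ranked: relevance of the items the model actually returned, in order.
// allRelevant: relevance grades of every relevant item in the corpus,
// used to build the ideal ordering that the score is normalized against.
function ndcgAtK(ranked: number[], allRelevant: number[], k = 10): number {
  const ideal = [...allRelevant].sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(ranked, k) / idealDcg;
}

// One relevant document, returned at rank 1 versus rank 2.
const perfectRanking = ndcgAtK([1, 0, 0], [1]);
const secondPlace = ndcgAtK([0, 1, 0], [1]);
console.log(perfectRanking, secondPlace);
```

Under this definition a score of 1.0 means every query had its relevant items ranked in the ideal order within the top 10.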

Interpretation:

  • Qwen ties the incumbent on the tiny canonical corpus (not shown in the table), then wins hard on both the real GNO code slice (0.6872 vs 0.1003) and the public OSS slice (1.0 vs 0.6116)
  • this is why the current recommendation is collection-scoped, not a blanket default change
  • jina-code-embeddings-0.5b-GGUF is not listed as a current winner because it hit native-runtime embedding-id failures and collapsed on repo-serve

Full benchmark artifacts live in the repository under:

  • evals/fixtures/code-embedding-benchmark/
  • research/embeddings/

Current Code Winner

Current best result:

  • Qwen3-Embedding-0.6B-GGUF

Practical recommendation:

  • keep bge-m3 as the global default for mixed notes/docs collections
  • use Qwen3-Embedding-0.6B-GGUF as a per-collection models.embed override for code-heavy collections

Example:

collections:
  - name: gno-code
    path: /Users/you/work/gno/src
    pattern: "**/*.{ts,tsx,js,jsx,go,rs,py,swift,c}"
    models:
      embed: "hf:Qwen/Qwen3-Embedding-0.6B-GGUF/Qwen3-Embedding-0.6B-Q8_0.gguf"

That gives you a code-specialist embedder without forcing every prose-heavy collection onto the slower model.

Autonomous Search, Bounded

GNO’s model search loops are intentionally constrained:

  • fixed candidate list
  • fixed corpora
  • fixed scoring policy
  • human decision before changing defaults

bun run research:embeddings:autonomous:list-search-candidates
bun run research:embeddings:autonomous:leaderboard
bun run research:embeddings:autonomous:search --dry-run

This is not an uncontrolled “let the agent mutate the whole repo” setup. It is a bounded model-comparison loop designed to be trustworthy enough for product decisions.
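To make "bounded" concrete, the sketch below shows the general shape of such a loop: a fixed candidate list, fixed corpora, one scoring function, and a leaderboard as the only output. It is an illustration under assumptions, not GNO's implementation; `boundedSearch`, the `Scorer` type, the mean-score ranking policy, and the `challenger-x` candidate name are all hypothetical.

```typescript
// Hypothetical scorer: returns a quality score for one candidate on one corpus.
type Scorer = (candidate: string, corpus: string) => number;

type LeaderboardRow = { candidate: string; meanScore: number };

// Scores every candidate on every corpus and ranks by mean score.
// Nothing here writes config or changes defaults; it only returns a ranking.
function boundedSearch(
  candidates: readonly string[],
  corpora: readonly string[],
  score: Scorer,
): LeaderboardRow[] {
  if (corpora.length === 0) {
    throw new Error("at least one corpus is required");
  }
  const rows = candidates.map((candidate) => {
    const total = corpora.reduce((sum, c) => sum + score(candidate, c), 0);
    return { candidate, meanScore: total / corpora.length };
  });
  return rows.sort((a, b) => b.meanScore - a.meanScore);
}

// Stub scorer with made-up numbers, standing in for a real benchmark run.
const stubScores: Record<string, number> = {
  "bge-m3-incumbent/canonical": 0.25,
  "bge-m3-incumbent/repo-serve": 0.75,
  "challenger-x/canonical": 1,
  "challenger-x/repo-serve": 1,
};
const leaderboard = boundedSearch(
  ["bge-m3-incumbent", "challenger-x"],
  ["canonical", "repo-serve"],
  (candidate, corpus) => stubScores[`${candidate}/${corpus}`] ?? 0,
);
console.log(leaderboard);
```

Because the inputs are closed lists and the output is a plain ranking, the loop cannot wander outside what was agreed up front.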

Learn More