Status: accepted
Date: 2026-04-06
Author: Gordon Mickel

Context

GNO’s original chunker was prose-oriented: character windows, semantic break heuristics, stable line tracking, and overlap. That works well for markdown and converted documents, but it can split code files in the middle of functions or class/type blocks.

This hurts:

  • code search snippet quality
  • retrieval chunk alignment for source files
  • user trust in line-range-based navigation

The task for this ADR was to improve code-file chunking without:

  • adding schema churn
  • forcing every file type through a parser
  • introducing a second chunking configuration system
  • regressing markdown/prose ingestion

Decision

Adopt an automatic first-pass code-aware chunking mode for a narrow set of source file types:

  • .ts
  • .tsx
  • .js
  • .jsx
  • .py
  • .go
  • .rs

For those extensions, GNO prefers structural breakpoints such as:

  • imports
  • functions
  • classes
  • interfaces/types/enums where relevant
  • other top-level code-definition boundaries

If no useful structural breakpoint is available near the target chunk boundary, GNO falls back to the existing markdown/prose chunker.
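As a rough illustration of "a structural breakpoint near the target chunk boundary", the sketch below scans a window of lines around the target and returns the closest top-level definition line. The function name, the window parameter, and the regular expression are all hypothetical; the ADR does not specify the actual heuristics, and a real rule set would likely be per-language.

```typescript
// Hypothetical pattern: matches unindented lines that look like the start of
// an import or a top-level definition in TS/JS, Python, Go, or Rust.
const STRUCTURAL_LINE =
  /^(?:export\s+)?(?:default\s+)?(?:pub\s+)?(?:async\s+)?(?:import|from|function|class|interface|type|enum|def|func|fn|struct|impl|trait)\b/;

// Returns the 0-based index of the line a new chunk could start on, or null
// when no structural line lies within `window` lines of `target`.
// A null result is the signal to use the existing prose chunker's break.
function findStructuralBreak(
  lines: string[],
  target: number,
  window: number,
): number | null {
  let best: number | null = null;
  const lo = Math.max(1, target - window); // breaking at line 0 would yield an empty chunk
  const hi = Math.min(lines.length - 1, target + window);
  for (let i = lo; i <= hi; i++) {
    if (!STRUCTURAL_LINE.test(lines[i] ?? "")) continue;
    // Keep the candidate closest to the target; on a tie the earlier line wins.
    if (best === null || Math.abs(i - target) < Math.abs(best - target)) {
      best = i;
    }
  }
  return best;
}
```

Anchoring the pattern at column zero is a deliberate simplification: it only treats unindented lines as boundaries, so nested methods are never chosen as break points in this sketch.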

If the file extension is unsupported, GNO falls back to the existing markdown/prose chunker.
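The extension gate can be pictured as a simple set lookup. This is a minimal sketch, not GNO's implementation: only the extension list comes from this ADR, while the names `CODE_AWARE_EXTENSIONS`, `extensionOf`, and `selectChunker` are hypothetical, and case-insensitive matching is an assumption.

```typescript
// Extension list taken from this ADR; everything else here is illustrative.
const CODE_AWARE_EXTENSIONS: ReadonlySet<string> = new Set([
  ".ts", ".tsx", ".js", ".jsx", ".py", ".go", ".rs",
]);

type ChunkerKind = "code-aware" | "prose";

// Returns the lowercased extension including the dot, or "" when there is none.
// Lowercasing is an assumption; the ADR does not say whether matching is case-sensitive.
function extensionOf(path: string): string {
  const lastSep = Math.max(path.lastIndexOf("/"), path.lastIndexOf("\\"));
  const base = path.slice(lastSep + 1);
  const dot = base.lastIndexOf(".");
  // A leading dot (".gitignore") marks a hidden file, not an extension.
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

// Unsupported extensions take the prose path; this is a normal outcome, not an error.
function selectChunker(path: string): ChunkerKind {
  return CODE_AWARE_EXTENSIONS.has(extensionOf(path)) ? "code-aware" : "prose";
}
```

For example, `selectChunker("src/index.ts")` yields `"code-aware"`, while `selectChunker("README.md")` and `selectChunker("lib/main.rb")` both yield `"prose"`.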

This first pass is automatic-only. No user-selectable chunking mode is added in this change.

Rationale

Why automatic-only

  • keeps the user model simple
  • avoids adding a new config/control surface before the behavior proves itself
  • matches the product goal: better code retrieval by default, not expert tuning

Why narrow language support

  • highest practical payoff for GNO’s developer-heavy audience
  • easier to validate safely
  • avoids pretending “all code is supported” when only some languages can be chunked well

Why fallback instead of failure

Fallback is the production path, not an error path.

That preserves:

  • existing indexing behavior for unsupported code files
  • stable line-range semantics
  • no hard dependency on parser availability for unrelated content

Why no schema change

The existing chunk contract already carries what downstream retrieval needs:

  • seq
  • pos
  • text
  • startLine
  • endLine

Changing chunk generation without changing storage shape keeps the blast radius low.
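To make the "same storage shape" point concrete, the sketch below builds chunks from a list of break lines, which could come from either the code-aware or the prose path. The five field names are from this ADR; the field types, the reading of `pos` as a character offset, the 1-based inclusive line numbers, and the `buildChunks` helper are assumptions. Overlap, which the existing chunker applies, is omitted for brevity.

```typescript
interface Chunk {
  seq: number;       // assumed: 0-based order of the chunk within its document
  pos: number;       // assumed: character offset of the chunk's first character
  text: string;
  startLine: number; // assumed: 1-based, inclusive
  endLine: number;   // assumed: 1-based, inclusive
}

// Hypothetical helper: splits `source` at the given 0-based line indexes.
// The output shape does not depend on how the break lines were chosen.
function buildChunks(source: string, breaks: number[]): Chunk[] {
  const lines = source.split("\n");
  // Chunk starts: line 0 plus every in-range break, deduplicated and sorted.
  const valid = breaks.filter((b) => b > 0 && b < lines.length);
  const starts = Array.from(new Set([0, ...valid])).sort((a, b) => a - b);

  const chunks: Chunk[] = [];
  let pos = 0;
  starts.forEach((start, seq) => {
    const end = starts[seq + 1] ?? lines.length; // exclusive 0-based end line
    const text = lines.slice(start, end).join("\n");
    chunks.push({ seq, pos, text, startLine: start + 1, endLine: end });
    pos += text.length + 1; // +1 for the newline separating this chunk from the next
  });
  return chunks;
}
```

Because only the choice of `breaks` differs between the two paths, downstream consumers that read `startLine` and `endLine` would not need to know which chunker produced a given chunk.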

Consequences

Positive

  • better code-file chunk boundaries
  • better snippet alignment for code search
  • safer retrieval-visible behavior for source files
  • no impact on markdown, PDF, or Office mirrors, unless a mirrored file already carries one of the supported code file extensions

Tradeoffs

  • heuristic structural chunking is simpler than full AST-aware parsing, so it can miss or misplace boundaries that a parser would find
  • some very large functions will still split internally when needed to stay within chunk-size constraints
  • automatic-only means less operator control in the first pass

Operator Visibility

gno doctor reports the code-chunking mode and supported extensions.

Docs must clearly state:

  • supported extensions
  • automatic-only mode
  • fallback behavior for unsupported files or files with no useful structural boundaries

Non-goals

  • full symbol indexing
  • full AST dependency on every supported platform/file type
  • per-language deep semantic analysis
  • collection-specific chunking configuration