// Methodology v1.1

The specification behind every Archonics audit.

Every Free Scan, Dimension Zero Audit, Instant Audit, and Full Audit applies the same five-dimension framework at different depths. This is the public version of the spec — no secret sauce, no black box. If the findings hold up, the framework holds up.

VERSION · 1.1
PUBLISHED · April 27, 2026
AUDIENCE · Public

Core thesis

Most production agent failures are not model failures. They are context engineering failures. The model received ambiguous instructions, poorly described tools, bloated context, or no feedback loop that would have caught the problem before production. The audit methodology is designed to surface these failures systematically rather than anecdotally.

An audit examines five dimensions. The first — Dimension Zero — asks whether a calling agent can find this service at all. The remaining four examine the agent's internal quality. All five are scored independently and synthesized into a prioritized fix list.

Dimension 0 · Discovery & indexability

Before any of the other four dimensions matter, a service-providing agent has to be discoverable to the agents that would invoke it. An agent that isn't indexed in the catalogs and registries other agents query is invisible — its prompt quality, tool design, context strategy, and eval coverage all become moot. Dimension Zero is the prerequisite the rest of the methodology rides on.

What we examine

Discovery-surface coverage. Is the service present in every index a calling agent might check? We check MCP Registry, x402scan, CDP Bazaar, agentic.market, PulseMCP, Glama.ai, and the GitHub topic graph. Missing surfaces are the leading cause of zero traffic.
Discovery-doc correctness. Does the service serve both /.well-known/x402 (no extension) and /.well-known/x402.json? Does each discovery JSON match the schema version (v1 or v2) that the target indexer actually parses? Indexers use subtly different field names, and these schema differences are the leading cause of "processing forever" failures.
Settlement-time indexing signals. Is the service using a facilitator that auto-catalogs (Coinbase CDP) or a self-hosted facilitator that doesn't? Is the v2 SDK Bazaar extension wired up, or are v1 outputSchema fields shaped to be extractable? Has a successful settlement actually fired since the most recent schema fix? Without a successful settlement, the catalog never picks up the service.
Brand & metadata legibility. Is the service name searchable in each index? Is the description keyword-dense enough for semantic search? Has a brand squatter registered a lookalike domain that's outranking the legitimate service in the catalog? (We've observed this happen within hours of an npm release.)
Agent-readable surfaces. Does the service expose llms.txt, a structured /agents.json manifest, an OpenAPI document, a sitemap, and proper robots directives? These are what an agent fetches when it lands on the URL trying to understand what's available.
Discovery telemetry. Does the service log enough to distinguish indexer probes from real agent invocations? Without this, the team can't tell whether they have a discovery problem, a conversion problem, or both.

Output. Per-directory verdict (indexed / not indexed / partial), specific failure mode for each missing surface with reproduction steps, and a worker-side or config-side fix for each.
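The discovery-doc check above can be sketched roughly as follows. This is a minimal illustration, not the audit's actual implementation: the `Fetcher` callable, the verdict labels, and the function name are all assumptions introduced here, and the sketch only confirms that each path is served and parses as JSON. It does not validate any indexer-specific schema.

```python
import json
from typing import Callable, Optional

# The two paths named in the checklist above.
WELL_KNOWN_PATHS = ["/.well-known/x402", "/.well-known/x402.json"]

# Hypothetical fetcher: returns the response body for a path,
# or None when the path is not served (for example, a 404).
Fetcher = Callable[[str], Optional[str]]


def check_discovery_docs(fetch: Fetcher) -> dict[str, str]:
    """Return a verdict per path: 'ok', 'missing', or 'invalid-json'."""
    verdicts: dict[str, str] = {}
    for path in WELL_KNOWN_PATHS:
        body = fetch(path)
        if body is None:
            verdicts[path] = "missing"
            continue
        try:
            json.loads(body)
            verdicts[path] = "ok"
        except json.JSONDecodeError:
            verdicts[path] = "invalid-json"
    return verdicts


# Usage with an in-memory stand-in for a real HTTP client.
# The document body is a placeholder, not a real x402 schema.
fake_site = {"/.well-known/x402": '{"placeholder": true}'}
print(check_discovery_docs(fake_site.get))
# {'/.well-known/x402': 'ok', '/.well-known/x402.json': 'missing'}
```

Passing the fetcher in as a parameter keeps the check testable offline; a real run would wrap whatever HTTP client the team already uses.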

Dimension 1 · System prompt analysis

A system prompt is a specification document written in natural language. Like any specification, it can be evaluated for clarity, consistency, completeness, and fitness for purpose.

What we examine

Role clarity. Does the prompt establish a clear operating identity, or does it hedge across multiple roles that create behavioral ambiguity?
Instruction conflicts. Are there directives that contradict each other? (e.g., "be concise" and "always explain your reasoning in detail.") Agents resolve conflicts unpredictably, producing inconsistent behavior.
Negative space. What does the prompt not say? Missing guidance on error handling, edge cases, refusals, and tool-use priority is a frequent failure source.
Priority structure. When instructions conflict at runtime, which wins? Well-engineered prompts establish explicit priority; most don't.
Token efficiency. What fraction of the prompt is load-bearing? Dead weight in the system prompt increases cost on every turn and can dilute attention to the instructions that matter.
Format specification. Is output structure specified with enough precision that downstream parsing is reliable?
Failure-mode coverage. Does the prompt specify what to do when the agent cannot complete the task, lacks information, or encounters ambiguous input?

Output. Findings list with severity (critical / high / medium / low) and evidence (specific quoted passages or identified gaps).
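As a rough illustration of what a finding with severity and quoted evidence might look like, the sketch below flags one kind of instruction conflict. The phrase pairs, the `Finding` shape, and the default severity of "high" are assumptions made for this example; a fixed phrase list is far cruder than the review an audit would perform.

```python
from dataclasses import dataclass

# Hypothetical pairs of directives that tend to pull in opposite directions.
# The first pair is the example used in the checklist above.
CONFLICT_PAIRS = [
    ("be concise", "explain your reasoning in detail"),
    ("never ask clarifying questions", "ask the user when unsure"),
]


@dataclass
class Finding:
    severity: str        # one of: critical, high, medium, low
    title: str
    evidence: list[str]  # passages quoted from the prompt


def find_instruction_conflicts(prompt: str) -> list[Finding]:
    """Flag prompts that contain both halves of a known conflicting pair."""
    lowered = prompt.lower()
    findings = []
    for first, second in CONFLICT_PAIRS:
        if first in lowered and second in lowered:
            # Severity is an assumed default; a real audit would set it
            # from expected user-visible impact.
            findings.append(Finding("high", "Instruction conflict", [first, second]))
    return findings


prompt = "Be concise. Always explain your reasoning in detail."
for finding in find_instruction_conflicts(prompt):
    print(finding.severity, finding.title, finding.evidence)
```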

Dimension 2 · Tool definition review

Tool descriptions are prompts in disguise. The model reads each tool's description and parameter schema to decide when to call it and with what arguments. Weak tool definitions cause tool-call hallucinations, parameter errors, and missed opportunities.

What we examine

Description quality. Does the tool description start with a clear action verb and communicate when this tool should (and should not) be used? Descriptions that say what the tool does without saying when to use it reliably produce underuse or misuse.
Parameter schema precision. Are parameter types tight? (e.g., enum vs. free string, specific format vs. "any text.") Loose schemas invite invalid calls.
Parameter description coverage. Every parameter should have a description that communicates intent, acceptable values, and edge cases.
Error response design. What does the tool return when it fails? Models handle structured errors with actionable guidance far better than raw stack traces or generic "Error occurred."
Tool set coherence. Do multiple tools have overlapping purposes? Models split calls unpredictably when two tools could plausibly handle the same request.
Tool set minimalism. Is every tool earning its place? Each additional tool increases context cost and decision complexity.
Discoverability. If a tool should be used in a specific scenario, does its description explicitly name that scenario?

Output. Per-tool findings plus a tool-set-level assessment.
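A few of the per-tool checks above lend themselves to a mechanical first pass. The sketch below assumes a tool definition shaped like a JSON-Schema-style dict with `description` and `parameters.properties` keys; that shape, the `lint_tool` name, and the "contains the word when" heuristic are assumptions for illustration only.

```python
import re


def lint_tool(tool: dict) -> list[str]:
    """Return human-readable issues for one tool definition."""
    issues = []
    description = tool.get("description", "").strip()
    if not description:
        issues.append("tool has no description")
    elif not re.search(r"\bwhen\b", description, re.IGNORECASE):
        # Crude heuristic: a description with no "when" probably
        # says what the tool does but not when to use it.
        issues.append("description does not say when to use the tool")

    properties = tool.get("parameters", {}).get("properties", {})
    for name, schema in properties.items():
        if not schema.get("description"):
            issues.append(f"parameter '{name}' has no description")
        is_loose_string = schema.get("type") == "string" and not (
            "enum" in schema or "format" in schema or "pattern" in schema
        )
        if is_loose_string:
            issues.append(f"parameter '{name}' is an unconstrained string")
    return issues


# Hypothetical tool definition used only to exercise the checks.
weather_tool = {
    "name": "get_weather",
    "description": "Fetch the forecast.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "units": {
                "type": "string",
                "enum": ["c", "f"],
                "description": "Temperature units.",
            },
        },
    },
}
for issue in lint_tool(weather_tool):
    print(issue)
```

Checks like tool-set coherence and minimalism depend on judgment about overlapping purposes, so they are left out of this sketch.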

Dimension 3 · Context packing analysis

Context is the most expensive resource an agent has. Waste is ubiquitous. The audit examines what goes into the context window, when, and why.

What we examine

Content audit. What is actually in context on a typical turn? System prompt, tool definitions, conversation history, retrieved documents, memory, reminders. We quantify each.
Redundancy detection. Is information repeated across system prompt, tool descriptions, and retrieved context? Redundancy produces attention dilution and wasted tokens.
Freshness logic. For retrieved or injected context (memory, RAG results, prior turns), what determines inclusion? Is inclusion logic tuned, or does it default to "include everything relevant"?
Ordering. Models weight recent and salient context more heavily. Is high-priority information positioned to survive attention competition?
Truncation risk. What happens as conversations grow long? Does the agent have a strategy for context overflow, or does it silently drop content?
Cost per turn. Dollar cost of a representative interaction, broken down by context category. Surfaces the highest-ROI reduction targets.
Cache utilization. For providers with prompt caching, is the static portion of the prompt positioned to maximize cache hits?

Output. Context inventory with cost breakdown, redundancy map, and a prioritized reduction plan.
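The cost-per-turn breakdown is simple arithmetic once token counts per category are known. The sketch below assumes a single flat input price per million tokens and ignores output tokens and cache discounts; the token counts and the price in the usage example are made-up numbers, not measurements.

```python
def cost_breakdown(
    tokens_by_category: dict[str, int],
    usd_per_million_tokens: float,
) -> list[tuple[str, float, float]]:
    """Return (category, cost_usd, share_of_context) rows, most expensive first."""
    total = sum(tokens_by_category.values())
    rows = []
    for category, tokens in tokens_by_category.items():
        cost = tokens / 1_000_000 * usd_per_million_tokens
        share = tokens / total if total else 0.0
        rows.append((category, cost, share))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical context inventory for one representative turn.
inventory = {
    "system_prompt": 2_000,
    "tool_definitions": 3_500,
    "conversation_history": 6_000,
    "retrieved_documents": 8_500,
}
for category, cost, share in cost_breakdown(inventory, usd_per_million_tokens=3.0):
    print(f"{category:22s} ${cost:.4f}  {share:.0%}")
```

Sorting by cost puts the largest reduction targets at the top, which is the ordering a prioritized reduction plan would start from.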

Dimension 4 · Evaluation gap analysis

The final dimension examines what the team knows about their agent's behavior. An agent without evals is an agent whose quality is a rumor.

What we examine

Eval coverage. What behaviors are tested? What behaviors are shipped but not tested?
Regression protection. When a prompt changes, what catches the downstream breakage? Most teams we audit have zero automated regression coverage on prompt changes.
Tool-call accuracy. Is there a test that the agent calls the right tool with the right arguments for a given scenario?
Behavioral guardrails. Are refusals, safety behaviors, and edge-case handling tested, or are they assumed to work?
Production observability. What is logged? Can the team reconstruct why a specific production call produced a specific output?
Failure-case library. Does the team collect the specific failures users have reported, and are those cases codified into tests?
Eval-development feedback loop. When a new failure is observed in production, how long until there's a test preventing its recurrence? For most teams, the answer is "never."

Output. Gap analysis mapping shipped behaviors against test coverage, with recommended high-ROI eval additions.
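At its simplest, mapping shipped behaviors against test coverage is a set comparison. The sketch below assumes behaviors can be named with short string labels, which is an assumption of this example; the "stale_tests" bucket is also an addition here rather than something the methodology names.

```python
def eval_gaps(shipped: set[str], tested: set[str]) -> dict[str, list[str]]:
    """Compare shipped behaviors with tested behaviors."""
    return {
        "untested": sorted(shipped - tested),     # shipped but not tested
        "covered": sorted(shipped & tested),      # shipped and tested
        "stale_tests": sorted(tested - shipped),  # tested but no longer shipped
    }


# Hypothetical behavior labels for an imaginary support agent.
shipped = {"refund_lookup", "order_status", "refuse_legal_advice"}
tested = {"order_status", "legacy_coupon_flow"}

gaps = eval_gaps(shipped, tested)
print(gaps["untested"])
# ['refund_lookup', 'refuse_legal_advice']
```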

Severity scoring

Every finding is assigned one of four severity levels. Severity is based on expected user-visible impact, not on how intellectually interesting the issue is. A sloppy system prompt that nevertheless produces reliable outputs gets a lower severity than a clean prompt with a subtle instruction conflict that fires on 2% of real traffic.

Critical

Active cause of production failures, or a failure mode one bad input away from firing. Fix immediately.

High

Reliably produces degraded quality under normal operation. Fix this sprint.

Medium

Measurable quality impact but not user-visible on typical traffic. Fix this quarter.

Low

Efficiency or polish issue. Fix when convenient.

Prioritization framework

Every audit concludes with a ranked fix list. Ranking is a function of severity (above), effort to fix (trivial / modest / significant / major), blast radius (does the fix improve one behavior or propagate across the system), and reversibility (can we ship it and roll back cleanly if it regresses).

The top items on the fix list are always high-severity, low-effort, high-blast-radius, reversible changes. Teams get disproportionate value from shipping these first.
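One way to turn the four ranking inputs into an ordering is a lexicographic sort. The text above names the inputs but not the exact function, so the ordering below (severity first, then effort, then blast radius, then reversibility) and the `Fix` fields are assumptions for illustration.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"critical": 4, "high": 3, "medium": 2, "low": 1}
EFFORT_RANK = {"trivial": 1, "modest": 2, "significant": 3, "major": 4}


@dataclass
class Fix:
    title: str
    severity: str            # critical / high / medium / low
    effort: str              # trivial / modest / significant / major
    wide_blast_radius: bool  # True if the fix propagates across the system
    reversible: bool         # True if it can be rolled back cleanly


def priority_key(fix: Fix) -> tuple:
    """Smaller tuples sort first: severe, cheap, wide-reaching, reversible."""
    return (
        -SEVERITY_RANK[fix.severity],
        EFFORT_RANK[fix.effort],
        not fix.wide_blast_radius,
        not fix.reversible,
    )


def rank_fixes(fixes: list[Fix], top_n: int = 10) -> list[Fix]:
    """Return the top_n fixes in priority order (the report lists the top 10)."""
    return sorted(fixes, key=priority_key)[:top_n]


# Hypothetical fixes used only to show the ordering.
fixes = [
    Fix("Tighten enum on status parameter", "medium", "trivial", False, True),
    Fix("Resolve concise-vs-detailed conflict", "high", "trivial", True, True),
    Fix("Rebuild retrieval pipeline", "high", "major", True, False),
]
for fix in rank_fixes(fixes):
    print(fix.title)
```

A weighted score would also work; the lexicographic form is used here only because it needs no invented weights.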

Deliverable structure

Every audit produces a written report with this structure:

Executive summary — 1 page, non-technical leadership can read this.
Context — what system we audited, what we had access to, what we didn't.
Findings by dimension — discovery / prompt / tools / context / eval.
Prioritized fix list — top 10, ordered.
Recommended eval additions — tests that would have caught the findings.
Open questions — what we need to know to deepen the analysis.

Report length scales with tier:

Tier	Scope	Output
Free Scan	3 findings on system prompt, tool definitions, or context packing (D1, D2, or D3)	Single page, no exec summary
Dimension Zero Audit · $19	D0 only, full depth, programmatic via x402 (POST /dimension-zero-audit)	Structured JSON: per-directory verdicts, prioritized fixes, ready-to-paste code
Instant Audit · $49	D0–D4, programmatic	~5–10 page PDF
Full Audit · $750	D0–D4, human-reviewed, tuned to team context	15–25 page PDF

Privacy posture

Archonics audits process prompts, tool definitions, and sample interactions. This content may contain proprietary intellectual property, customer data, or security-sensitive information.

No prospect content is retained after audit delivery unless the customer explicitly requests retention for a follow-up engagement.
No prospect content is used to train Archonics models or improve the methodology against that specific customer's systems.
Anonymized patterns (e.g., "18/20 audited systems lack tool-call regression tests") may inform methodology evolution; specific content never does.
Customers requesting higher assurance (NDAs, DPAs) are accommodated at the Full Audit tier.

Detailed handling is documented on the privacy page.

Versioning

This methodology is versioned. Every audit report references the methodology version it was produced under. Improvements based on audit experience are tracked in a changelog appended to the internal specification.
