Internals, Agent Orchestration

The deepest dive: what runTurn actually does. The full reference is src/orchestrator/index.ts; this page walks through every step, every routing pattern, every validation gate, and the Fact Sheet contract that pins synthesis to deterministic numbers. Today this code runs in-process inside the CLI; the same module is what a future TurnWorkflow on Cloudflare will wrap step-by-step.

The pipeline implements §F.2 of the paper The Anatomy of a Personal Health Agent (Heydari et al., 2025), plus Amy's two extensions: a Hypothesis Investigator for vague queries, and a CoDaS-style validation phase (Kim et al., 2026) that gates every quantitative claim before it can be synthesized into the answer.



The 9-step pipeline

runTurn(userMessage, ctx) runs these in order:

| # | Step | Implementation | Model | Typical wall time | Failure mode | Retry behavior |
|---|------|----------------|-------|-------------------|--------------|----------------|
| 0 | Vagueness classify | classifyVagueness | fastModel (sonnet) | ~1-3s | Returns "low" on any parse error | None, defaults to low |
| 1a | Investigator (high vagueness only) | runInvestigator | model (opus) | ~30-60s | Empty hypotheses list | None, emits empty briefing |
| 1b | Routing (low/medium vagueness) | classifyJson w/ TASK_ASSIGNMENT prompt | fastModel | ~2-5s | LLM emits garbage agent name | sanitiseRouting canonicalises or returns "" (→ fallback reply) |
| 2 | Question rephrase per agent | classifyJson w/ QUESTION_REPHRASE | fastModel | ~2-4s | JSON parse error | try/catch → falls back to the original userMessage |
| 3 | Supporting agents (sequential) | runDsAgent, runDeAgent | per-agent (see below) | ~30-120s each | Sandbox/tool failure surfaced in their answer | Internal retries (DS: up to 3 sandbox attempts) |
| 4 | Main agent | runDsAgent / runDeAgent / runHcAgent | per-agent | ~30-120s | Same as above; empty mainText if dispatch fell through | None at this layer |
| 5 | Reflection (only if supporting list was non-empty) | classifyJson w/ REFLECTION prompt | fastModel | ~3-5s, + reflection sub-calls if YES | Returns nothing useful → caught and skipped | Single shot. If decision="YES", runs the named follow-up agents (DS w/ maxRetries=0, DE full) |
| 6 | Validation | validateFindings over (ds.findings ∪ investigator.findings ∪ reflectionFindings) | gates: python sandbox; Critic + Assessment: validatorModel (opus) | ~5-30s per finding | Gate runner crash → verdict="rejected", hard_rejection set | None, every finding either passes through or is flagged |
| 7 | Synthesis (streams) | ask w/ FINAL_SYNTHESIS prompt | model (opus) | ~10-30s | LLM call failure bubbles up | None |
| 7b | Fact-check pass | factCheckReply (regex-based) | none | ~10ms | Always returns, issues array may be empty | N/A |
| 8 | Memory extraction | extractMemories | fastModel | ~3-5s | JSON parse → empty list | try/catch → returns [] |

The pipeline emits OrchestratorEvents throughout (src/orchestrator/events.ts) so the CLI (and any future SSE stream) can render every transition live.


Sequence diagram

sequenceDiagram
  autonumber
  participant U as User
  participant O as Orchestrator
  participant V as Vagueness Classifier
  participant INV as Investigator
  participant R as Router
  participant DS as Data Science Agent
  participant DE as Domain Expert Agent
  participant HC as Health Coach Agent
  participant VAL as Validator
  participant S as Synthesis (Opus, streams)
  participant M as Memory Extractor

  U->>O: user message
  O->>V: classifyVagueness(message)
  V-->>O: low | medium | high

  alt vagueness = high
    O->>INV: runInvestigator(store, profile, biomarkers, testedHypotheses)
    INV-->>O: digest, hypotheses, findings (top-K), briefing
    O->>VAL: validateFindings(inv.findings)
    VAL-->>O: ValidatedFinding[] + FactSheet
    O->>O: augmentBriefingWithVerdicts (+ factCheckReply)
    O->>M: extractMemories + validatedToMemories
    M-->>O: MemoryEntry[]
    O-->>U: briefing
  else vagueness = low | medium
    O->>R: classifyJson(TASK_ASSIGNMENT)
    R-->>O: { main_agent, supporting_agents, workflow }
    O->>O: sanitiseRouting(...)
    alt routing.main_agent = ""
      O->>O: fallback ask()
      O->>M: extractMemories
      O-->>U: fallback reply
    else
      O->>O: classifyJson(QUESTION_REPHRASE)
      loop each agent in supportingList (sequential)
        alt agent = Data Science
          O->>DS: runDsAgent(subQ_DS)
          DS-->>O: DsTrace (answer, findings)
        else agent = Domain Expert
          O->>DE: runDeAgent(subQ_DE)
          DE-->>O: DeTrace (answer)
        end
      end
      alt main_agent = Data Science
        O->>DS: runDsAgent(main_q)
      else main_agent = Domain Expert
        O->>DE: runDeAgent(main_q, supportingInsights)
      else main_agent = Health Coach
        opt DS supporting findings present
          O->>VAL: pre-validate for the coach
          VAL-->>O: validated + FactSheet
        end
        O->>HC: runHcAgent(main_q, history, validated)
      end
      DS-->>O: mainText (when main = DS)
      DE-->>O: mainText (when main = DE)
      HC-->>O: mainText (when main = HC)
      opt supportingList non-empty
        O->>O: classifyJson(REFLECTION)
        alt decision = YES
          O->>DS: reflection DS follow-up (maxRetries=0)
          O->>DE: reflection DE follow-up
        end
      end
      opt not pre-validated by HC path
        O->>VAL: validateFindings(allFindings)
        VAL-->>O: ValidatedFinding[] + FactSheet
      end
      O->>S: ask(FINAL_SYNTHESIS) — streams tokens
      S-->>U: synthesis_delta events
      S-->>O: final text
      O->>O: factCheckReply(reply, factSheet, userMessage, deProse)
      O->>M: extractMemories + validatedToMemories
      M-->>O: MemoryEntry[]
      O-->>U: done(total_cost_usd, response)
    end
  end

Step-by-step reference

Step 0, Vagueness classifier (Amy extension)

src/orchestrator/index.ts:124–142.

Input: the raw user message. Output: "low" | "medium" | "high".

A one-shot classify call (no tools, no JSON) whose prompt is embedded inline in classifyVagueness (lines 896-935). It is biased toward low: it returns "high" only when the query has no anchor at all ("Is anything interesting in my data?"), and coaching queries with a stated action ("I want to set a SMART goal") are explicitly classified "low".
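A minimal sketch of the default-to-low behaviour, assuming the classifier replies with a bare label as plain text. parseVagueness is a hypothetical helper name, not the one used in index.ts:

```typescript
type Vagueness = "low" | "medium" | "high";

// Hypothetical helper: map the classifier's raw text to a label.
// Anything that is not exactly "medium" or "high" falls back to "low",
// which mirrors the documented bias toward low.
function parseVagueness(raw: string): Vagueness {
  // Trim whitespace, lowercase, and drop one wrapping quote or a trailing period.
  const label = raw.trim().toLowerCase().replace(/^["']|["'.]$/g, "");
  if (label === "medium" || label === "high") return label;
  return "low";
}
```

Because every unrecognised reply maps to "low", a flaky classifier can only send a query down the ordinary routing path, never into the more expensive Investigator path.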

Step 1a, Investigator (vagueness = high only)

If vague === "high", the orchestrator jumps directly into the Investigator path and returns without ever hitting the router. The Investigator is never assigned as a regular routing agent (see the explicit note in ORCHESTRATOR_SYSTEM).

Step 1b, Routing

src/orchestrator/index.ts:215–236.

Prompt: ORCHESTRATOR_SYSTEM + TASK_ASSIGNMENT (src/orchestrator/prompts.ts lines 11-86). Input includes the full conversation history rendered as role: content\n plus the current [TOPIC]. Output is a JSON object:

{
  "main_agent": "Data Science Agent" | "Domain Expert Agent" | "Health Coach Agent" | "",
  "supporting_agents": "Data Science Agent; Domain Expert Agent" | "",
  "collaboration_workflow": "..."
}

The router classifies into one of the 6 collaboration patterns plus corner cases (general health info, device help, etc.) that resolve to main_agent="" → fallback reply.

sanitiseRouting (index.ts:870–894): the LLM occasionally emits aliases or misspellings. The sanitiser canonicalises:

  • "DS Agent" / "ds" / "data scientist" → "Data Science Agent"
  • "domain expert" / "de" → "Domain Expert Agent"
  • "coach" / "hc" → "Health Coach Agent"
  • Anything else → null (dropped). If main_agent becomes "", the orchestrator runs a fallback ask() (index.ts:289–322) rather than dispatching to a phantom agent.
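The canonicalisation above can be sketched as a lookup table. This is an illustration only: the alias list is limited to the spellings named on this page, and canonicalAgent / sanitiseRoutingSketch are hypothetical names, not the real sanitiseRouting signature.

```typescript
type AgentName = "Data Science Agent" | "Domain Expert Agent" | "Health Coach Agent";

// Aliases taken from the list above; the real sanitiser may accept more.
const ALIASES = new Map<string, AgentName>([
  ["data science agent", "Data Science Agent"],
  ["ds agent", "Data Science Agent"],
  ["ds", "Data Science Agent"],
  ["data scientist", "Data Science Agent"],
  ["domain expert agent", "Domain Expert Agent"],
  ["domain expert", "Domain Expert Agent"],
  ["de", "Domain Expert Agent"],
  ["health coach agent", "Health Coach Agent"],
  ["coach", "Health Coach Agent"],
  ["hc", "Health Coach Agent"],
]);

// Unknown names become null so the caller can drop them.
function canonicalAgent(raw: string): AgentName | null {
  return ALIASES.get(raw.trim().toLowerCase()) ?? null;
}

interface Routing {
  main_agent: AgentName | "";
  supporting_agents: AgentName[];
}

// `supporting` is the semicolon-separated string from the router's JSON.
function sanitiseRoutingSketch(main: string, supporting: string): Routing {
  const supportingList = supporting
    .split(";")
    .map((name) => canonicalAgent(name))
    .filter((a): a is AgentName => a !== null);
  // An unrecognised main agent collapses to "", which triggers the fallback reply.
  return { main_agent: canonicalAgent(main) ?? "", supporting_agents: supportingList };
}
```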

Step 2, Question rephrase

src/orchestrator/index.ts:238–286.

Prompt: QUESTION_REPHRASE (prompts.ts:88–109). The LLM is told to decompose the user's question into per-agent sub-questions. The prompt has a load-bearing constraint for the DS Agent: "frame its question as ONE narrow, concrete computation, the single core metric/relationship needed. Do NOT enumerate multi-part specs." Empirically, multi-part DS asks produced 100+ line brittle pandas that failed to run.

Output:

{
  "main_agent_question": "...",
  "supporting_agent_questions": { "Data Science Agent": "...", "...": "..." }
}

On parse failure, falls back to using the original userMessage for every agent (lines 264-271).
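A hedged sketch of that fallback, assuming the raw LLM reply arrives as a string. parseRephrase is a hypothetical helper, and filling a missing per-agent key with the original message is an assumption of this sketch rather than documented behaviour:

```typescript
interface Rephrased {
  main_agent_question: string;
  supporting_agent_questions: Record<string, string>;
}

function parseRephrase(raw: string, userMessage: string, supporting: string[]): Rephrased {
  const questions: Record<string, string> = {};
  try {
    const parsed = JSON.parse(raw) as Partial<Rephrased>;
    if (typeof parsed.main_agent_question !== "string") {
      throw new Error("missing main_agent_question");
    }
    const given = parsed.supporting_agent_questions ?? {};
    for (const agent of supporting) {
      // Assumption: an agent without a sub-question gets the original message.
      questions[agent] = typeof given[agent] === "string" ? given[agent] : userMessage;
    }
    return { main_agent_question: parsed.main_agent_question, supporting_agent_questions: questions };
  } catch {
    // Parse failure: every agent receives the original userMessage.
    for (const agent of supporting) questions[agent] = userMessage;
    return { main_agent_question: userMessage, supporting_agent_questions: questions };
  }
}
```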

Step 3, Supporting agents (sequential)

src/orchestrator/index.ts:324–369.

Each supporting agent in supportingList is invoked in order. This is NOT parallel; sequential execution is intentional so the main agent can see all supporting outputs in one block. Only Data Science and Domain Expert can be supporting; Health Coach is always main.

Each call returns its own trace object (DsTrace, DeTrace) and an answer string that gets concatenated into supportingInsights for the main agent and synthesis.

Data Science Agent (src/agents/data-science/)

A 3-stage internal loop (plan, code-gen, sandbox), followed by a summarize-and-extract step:

  1. Plan (PLAN_PROMPT, dsModel): produces an == Approach == text describing what to compute.
  2. Code-gen (CODE_PROMPT, dsModel): produces the body of a Python function analysis(summary_df, activities_df, profile_df, population_df).
  3. Sandbox (runDsCode, sandbox.ts):
    • Pre-flight ast.parse via python3 -c (~50ms) before burning sandbox time.
    • Auto-fixes the #1 LLM bug: block opener (if/for/...) followed by un-indented body (autoFixPythonIndent).
    • Wraps the body in PY_WRAPPER that loads SQLite + JSON, attaches deterministic composite features (cardio_fitness_index, hrv_rhr_ratio, rolling _sd_30d / _cv_30d / _mean_30d), and emits a JSON-safe result.
    • Up to maxRetries=2 debug iterations (DEBUG_PROMPT with the previous stderr, with extra "indentation help" inlined when an IndentationError is detected).
  4. Summarize + extract findings: extractFindings returns structured Finding[] for the validation pipeline. Short-circuits on looksDescriptive(query) (e.g. "what is my average X?") to skip extraction entirely.

Reflection-mode DS uses maxRetries: 0 (single shot): if it can't answer in one attempt, the reflection ask was too ambitious. See the comment at index.ts:502–508.
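The relationship between maxRetries and total sandbox attempts can be sketched as a small loop. This is a simplified, synchronous illustration: execute and debug are injected stand-ins for the sandbox run and the DEBUG_PROMPT call (both asynchronous in practice), and runWithDebug is a hypothetical name.

```typescript
interface RunResult {
  ok: boolean;
  stdout: string;
  stderr: string;
}

function runWithDebug(
  code: string,
  execute: (code: string) => RunResult,
  debug: (code: string, stderr: string) => string,
  maxRetries = 2,
): { result: RunResult; attempts: number } {
  let current = code;
  let result = execute(current); // attempt 1 is always made
  let attempts = 1;
  while (!result.ok && attempts <= maxRetries) {
    // Feed the previous stderr back to the model and rerun the patched code.
    current = debug(current, result.stderr);
    result = execute(current);
    attempts += 1;
  }
  return { result, attempts };
}
```

With maxRetries=2 the loop makes at most three sandbox attempts, matching the "up to 3 sandbox attempts" figure in the pipeline table; with maxRetries=0 it makes exactly one.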

Domain Expert Agent (src/agents/domain-expert/)

A ReAct loop with maxTurns: 10. Tools:

  • WebSearch, WebFetch (built-in)
  • mcp__amy-de-tools__ncbi_search (PubMed)
  • mcp__amy-de-tools__range_compare (clinical reference ranges)
  • mcp__amy-de-tools__datacommons (Google Data Commons)

System prompt includes the user's Profile, latest Biomarkers, and the full literature priors blurb from data/reference/biomarker_priors.json. The agent is instructed never to invent a URL.

Health Coach Agent (src/agents/health-coach/)

A 3-module flow (paper §6.2; splitting prevents the failure mode of giving premature recommendations from a single fat prompt):

  1. HC_RECOMMEND_GATE (classify): emits [VERDICT]: YESREC | NOREC.
  2. HC_SYSTEM (main coaching response): the system prompt is parameterised by verdict (NOREC → keep gathering context; YESREC → deliver SMART recommendation NOW, don't re-ask).
  3. HC_FINISH_GATE (classify): FINISH | CONTINUE. On FINISH, a closing summary is generated.

Two hard rules in the HC path:

  • HC never sees rejected findings (index.ts:421–424). Filtering keeps the coach from grounding recommendations on data the Critic disowned.
  • HC is required to reference at least one personal anchor from the deterministic computeAnchors(store) blurb, which forces specific-to-user advice instead of textbook generics.
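A small sketch of the recommend gate and the rejected-findings filter. Both function names are hypothetical, and defaulting to NOREC when the gate output cannot be parsed is an assumption of this sketch (the page does not say what the real code does in that case):

```typescript
type RecVerdict = "YESREC" | "NOREC";
type Verdict = "validated" | "conditional" | "rejected";

// Parse "[VERDICT]: YESREC" or "[VERDICT]: NOREC" from the gate's reply.
// Assumption: unparseable output means "keep gathering context" (NOREC).
function parseRecommendGate(raw: string): RecVerdict {
  const match = /\[VERDICT\]:\s*(YESREC|NOREC)/.exec(raw);
  return match !== null && match[1] === "YESREC" ? "YESREC" : "NOREC";
}

// The coach only ever receives findings the validator did not reject.
function findingsForCoach<T extends { verdict: Verdict }>(findings: T[]): T[] {
  return findings.filter((f) => f.verdict !== "rejected");
}
```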

Step 4, Main agent

src/orchestrator/index.ts:371–453. Same agent code as Step 3; the main is just whichever routing.main_agent resolved to. Differences:

  • DE as main receives supportingInsights so it doesn't redo computation.
  • HC as main triggers a pre-validation hop: if DS already ran as supporting and produced findings, validate them BEFORE the HC speaks (index.ts:408–420). This is the only path where validation runs before the main agent; it is necessary because HC is the only agent that turns numbers into actions.

Step 5, Reflection

src/orchestrator/index.ts:455–560.

Only runs if supportingList.length > 0. Prompt: REFLECTION (prompts.ts:111–152). Output:

{ "decision": "YES" | "NO", "reflection_questions": { "<agent>": "<q>" } }

NO is the common case. YES triggers up to one follow-up per agent; prompt explicitly caps at "Maximum 1 question per agent (one DS, one DE) and prefer just 1 total." Reflection DS findings flow into reflectionFindings[] and are merged into validation alongside the main DS findings (index.ts:563–593).

Step 6, Validation

See The validation phase below.

Step 7, Synthesis (streams)

src/orchestrator/index.ts:596–637.

Prompt: FINAL_SYNTHESIS (prompts.ts:154–213). The system prompt is famously short:

You are Amy, a unified personal health agent. Speak as a single coherent voice. Do not mention specific sub-agents. Honour the FACT SHEET and validated findings, never invent numbers or contradict the validation verdicts.

The user message is a structured block containing main agent draft, supporting insights, reflection insights, validated findings blurb, the Fact Sheet, and (when DS failed) an explicit hard-warning block telling synthesis NOT to invent numbers.

Synthesis uses onSdkEvent to stream text_delta events through to the orchestrator's emitter, which the CLI renders character-by-character. The final text is what the user sees.

Step 7b, Fact-check (regex)

src/orchestrator/index.ts:759–838, the factCheckReply function.

After synthesis, every numeric token in the reply is checked against:

  • The Fact Sheet values (with 2% relative tolerance + 0.05 absolute floor).
  • Pairwise ratios of Fact Sheet values (synthesis often derives e.g. effect / noise_sd; the math is correct but the ratio isn't literally in the sheet).
  • Numbers from the user's original message (echo: "your 7.3% drop").
  • Numbers from the Domain Expert's prose (literature reference values like PSQI MCID = 4.4).

Anything else gets flagged as {value, severity: "warn"}. Issues are emitted via the fact_check event so the CLI can render them with a yellow tint.
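One way to picture the four sources is as a single pool of allowed values that each numeric token is compared against. The sketch below only builds that pool; allowedValues is a hypothetical helper, and computing ratios in both directions is an assumption of this sketch.

```typescript
type FactSheet = Record<string, number>;

function allowedValues(sheet: FactSheet, userNumbers: number[], deNumbers: number[]): number[] {
  const base = Object.values(sheet);
  const ratios: number[] = [];
  for (let i = 0; i < base.length; i++) {
    for (let j = 0; j < base.length; j++) {
      // Pairwise ratios cover derived values such as effect / noise_sd.
      if (i !== j && base[j] !== 0) ratios.push(base[i] / base[j]);
    }
  }
  return [...base, ...ratios, ...userNumbers, ...deNumbers];
}

// Example: a reply citing "~0.39" can be traced back to 7.38 / 18.74.
const pool = allowedValues({ "ds-001.effect": 7.38, "ds-001.noise_sd": 18.74 }, [7.3], [4.4]);
```

Note that the pool grows quadratically with the size of the Fact Sheet, which is acceptable for the handful of findings a turn typically validates.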

This is intentionally regex-based (deterministic, no LLM cost); see CoDaS §2.6 on numeric verification.

Step 8, Memory extraction

See Memory extraction below.


Routing, the 6 patterns

Source: TASK_ASSIGNMENT prompt (prompts.ts:27–86). The router must match one of:

| # | When | Main agent | Supporting | Why |
|---|------|------------|------------|-----|
| 1 | "Understand health topic / facts / news" | Domain Expert | none | Pure knowledge lookup. |
| 2 | "Understand my time-series data, single source" | Data Science | none | DS computation suffices; no external interpretation. |
| 3 | "Understand my data AND need external knowledge" | Domain Expert | Data Science | DS computes, DE interprets. |
| 4 | "Wellness advice / goal-setting (no data)" | Health Coach | none | Pure coaching. |
| 5 | "Wellness advice based on my data" | Health Coach | Data Science | DS computes, HC guides. |
| 6 | "Wellness advice + data + medical context" | Health Coach | Data Science and Domain Expert | DS computes, DE adds clinical context, HC guides. |

Plus a forcing rule ([STEP 3] in the prompt): "If the user asks about something potentially related to their personal data, even just a bit, and the main agent is not the data science agent, add the data science agent as a supporting agent." This is what catches "is my LDL of 124 something to worry about" (pattern 1 → pattern 3, because the user named a specific number that ought to be cross-checked against their actual data).

Corner-case bucket → main_agent="" → fallback reply. Garbage agent name → same.
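The six patterns and the forcing rule can be written down as data plus one function. In Amy the rule lives in the prompt and is applied by the router LLM; the code below is only an illustrative rendering, and PATTERNS / applyForcingRule are hypothetical names.

```typescript
type Agent = "Data Science" | "Domain Expert" | "Health Coach";

interface Plan {
  main: Agent;
  supporting: Agent[];
}

const PATTERNS: Record<1 | 2 | 3 | 4 | 5 | 6, Plan> = {
  1: { main: "Domain Expert", supporting: [] },
  2: { main: "Data Science", supporting: [] },
  3: { main: "Domain Expert", supporting: ["Data Science"] },
  4: { main: "Health Coach", supporting: [] },
  5: { main: "Health Coach", supporting: ["Data Science"] },
  6: { main: "Health Coach", supporting: ["Data Science", "Domain Expert"] },
};

// Forcing rule: if the question touches personal data and DS is not already
// involved, add DS as a supporting agent.
function applyForcingRule(plan: Plan, touchesPersonalData: boolean): Plan {
  const dsInvolved = plan.main === "Data Science" || plan.supporting.includes("Data Science");
  if (!touchesPersonalData || dsInvolved) return plan;
  return { main: plan.main, supporting: ["Data Science", ...plan.supporting] };
}
```

Applied to pattern 1 with touchesPersonalData = true, the rule yields pattern 3, which is the "LDL of 124" case described above.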


The Investigator path

src/agents/investigator/index.ts.

Triggered when vagueness === "high". Bypasses routing entirely.

1. computeDigest(store)
   └── Per-metric (steps, sleep_minutes, deep_sleep_minutes, rem_sleep_minutes,
       resting_heart_rate, heart_rate_variability, stress_management_score,
       active_zone_minutes, sleep_score):
       avg(last 30d) vs avg(prior 30d), overall avg ± SD, delta in SD units
   └── Day-of-week sleep breakdown
   └── Missingness (HRV / steps presence ratio)
   └── Top 10 workout types
   └── Output: ~500–1500 tokens

2. classifyJson(INVESTIGATOR_SYSTEM) →
   Hypothesis[] = [{ id, title, rationale, test_plan, data_required,
                     missing_data_flag, expected_impact, confidence,
                     next_action, kind, feature_to_test, target_to_test }]
   Already-tested hypotheses (from memory.testedHypotheses()) are passed in
   to prevent re-proposal.

3. Top-K (default K=3, `AUTO_TEST_TOP_K`) hypotheses → Findings via
   `hypothesisToFinding(h, store)`:
     - kind=association → computeSpearman(store, feature, target)
     - kind=trend       → computeRecentVsPrior(store, feature)
     - kind=scalar      → computeMean(store, feature)
   All computed in TypeScript against SQLite directly — no LLM call, no
   Python sandbox. Numbers are real before the finding enters validation.

4. Validate the top-K Findings (same pipeline as Step 6 main flow).

5. Generate a user-facing briefing (INVESTIGATOR_BRIEFING, `model=opus`).

6. augmentBriefingWithVerdicts: appends "_Auto-tested findings:_" with ✓ /
   ~ markers and the effect sizes. If ALL surviving = 0, appends a "none
   survived validation" note so the user knows the noticed patterns might
   be noise.

The Investigator path is synthesis-free: the briefing IS the final response. Numeric verification still runs against the briefing text.
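The per-metric digest line from step 1 (recent 30 days vs prior 30 days, expressed in SD units) can be sketched as follows. This assumes a gap-free array of daily values ordered oldest first and uses the sample standard deviation over the whole series; the real computeDigest reads SQLite and handles missing days, and recentVsPrior is a hypothetical name.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

function sampleSd(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1));
}

interface TrendDigest {
  recent: number;  // mean of the last `window` days
  prior: number;   // mean of the `window` days before that
  overall: number; // mean of the whole series
  sd: number;      // sample SD of the whole series
  deltaSd: number; // (recent - prior) / sd
}

function recentVsPrior(daily: number[], window = 30): TrendDigest | null {
  if (daily.length < 2 * window) return null; // not enough history
  const recent = mean(daily.slice(-window));
  const prior = mean(daily.slice(-2 * window, -window));
  const sd = sampleSd(daily);
  return { recent, prior, overall: mean(daily), sd, deltaSd: sd === 0 ? 0 : (recent - prior) / sd };
}
```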


The validation phase

src/agents/validator/index.ts.

For every Finding (from DS, Investigator, or reflection DS):

1. Deterministic gates (Python sandbox)
   └── runGates({ finding }) — see below
   └── If gate runner crashes → verdict=rejected, hard_rejection=error
   └── If hard gate fails → verdict=rejected
   └── Otherwise → preliminary verdict from gate ratio

2. Critic (LLM, validatorModel=opus by default)
   └── runCritic({ finding, gates, priors, memory })
   └── Output: { decision: accept|downgrade|reject, concerns, rationale }
   └── decision=reject → final verdict=rejected (short-circuit)
   └── decision=downgrade && verdict=validated → conditional

3. Assessment (LLM, validatorModel)
   └── Only runs on non-rejected findings
   └── Output: { mechanism, novelty, strategy, citations }

The output ValidatedFinding carries the original Finding + verdict + gates + critic + assessment.
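How the gate verdict and the Critic decision combine can be summarised in a few lines. This sketch assumes a gate rejection is final (the Critic cannot upgrade it); finalVerdict is a hypothetical helper name.

```typescript
type Verdict = "validated" | "conditional" | "rejected";
type CriticDecision = "accept" | "downgrade" | "reject";

function finalVerdict(gateVerdict: Verdict, critic: CriticDecision): Verdict {
  if (gateVerdict === "rejected") return "rejected"; // assumption: gates have the last word
  if (critic === "reject") return "rejected";        // Critic short-circuit
  if (critic === "downgrade" && gateVerdict === "validated") return "conditional";
  return gateVerdict; // accept, or downgrade of an already-conditional finding
}
```

Under these rules the Critic can only lower a verdict, never raise it, so a finding is "validated" only when both the deterministic gates and the LLM review agree.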

The 7 deterministic gates

Source: src/agents/validator/gates.ts. Implemented as a single Python script run via spawn("python3", ...) against the same SQLite that the DS Agent reads. Per-finding the script loads summary, computes each gate, and emits a JSON result wrapped in __GATES_BEGIN__ / __GATES_END__ markers.

| # | Gate | Applies to | Logic | Pass condition | Hard? |
|---|------|------------|-------|----------------|-------|
| 1 | sample_size | all | Count non-null observations of feature (and target if present). | n ≥ 20 for associations, n ≥ 10 otherwise. | Yes, n < min is an automatic reject. |
| 2 | effect_vs_noise | trend, scalar only | \|effect\| / metric_sd | ratio ≥ 0.5 (effect is at least half a metric SD), or sd=0/inf. | No, soft. |
| 3 | construct_validity | association only | Spearman ρ between feature and target on all valid pairs. | \|ρ\| ≤ 0.85 | Yes, \|ρ\| > 0.85 means feature is almost certainly the target re-expressed (tautology). |
| 4 | bootstrap | all (assoc / trend / scalar) | 1000-resample with seeded RNG (42). For associations: bootstrap Spearman ρ; passes if 95% CI [q.025, q.975] does not cross zero. For trend/scalar: bootstrap the mean. | CI not crossing zero (assoc). For trend/scalar this always passes (informational only). | No, soft. |
| 5 | subgroup_consistency | association only | Split window in time-ordered halves; compute Spearman ρ in each. | Same sign in both halves (ρ₁ · ρ₂ > 0). | No, soft. |
| 6 | method_triangulation | association only | Spearman ρ vs Kendall τ-b on all valid pairs. | Same sign. | No, soft. |
| 7 | discriminative_power | association only | \|effect\| against personal-data noise floor. | \|ρ\| ≥ 0.10 | No, soft. (Failure alone won't reject, but bootstrap + discriminative_power both fail → reject; see logic below.) |

Verdict aggregation logic

# gates.ts → PY → def run() (lines 304–350)
if hard_rejection:                         # gate 1 or 3 hard-failed
    verdict = "rejected"
elif bootstrap_fail and discrim_fail:      # core "is-there-signal" gates BOTH failed
    verdict = "rejected"
elif applicable_count == 0:                # no gate applied (degenerate)
    verdict = "conditional"
else:
    ratio = passes / applicable_count
    if ratio >= 0.85:  verdict = "validated"
    elif ratio >= 0.5: verdict = "conditional"
    else:              verdict = "rejected"

Gates marked with detail.applicable = False (e.g., effect_vs_noise on an association finding) are excluded from applicable_count so they don't drag the ratio down.
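A worked example makes the thresholds concrete. The sketch below re-expresses the aggregation rules in TypeScript purely for illustration (the real logic is the Python in gates.ts); aggregate and associationGates are hypothetical names. For an association finding, six of the seven gates apply, so a single soft failure gives 5/6 ≈ 0.83, which is below 0.85 and therefore "conditional".

```typescript
type Verdict = "validated" | "conditional" | "rejected";

interface GateResult {
  name: string;
  applicable: boolean;
  passed: boolean;
  hard: boolean;
}

function aggregate(gates: GateResult[]): Verdict {
  const applicable = gates.filter((g) => g.applicable);
  const failed = (name: string): boolean => applicable.some((g) => g.name === name && !g.passed);
  if (applicable.some((g) => g.hard && !g.passed)) return "rejected";
  if (failed("bootstrap") && failed("discriminative_power")) return "rejected";
  if (applicable.length === 0) return "conditional";
  const ratio = applicable.filter((g) => g.passed).length / applicable.length;
  if (ratio >= 0.85) return "validated";
  return ratio >= 0.5 ? "conditional" : "rejected";
}

// Gate set for an association finding; effect_vs_noise is not applicable.
function associationGates(failedNames: string[]): GateResult[] {
  const names = ["sample_size", "construct_validity", "bootstrap",
    "subgroup_consistency", "method_triangulation", "discriminative_power"];
  const gates = names.map((name) => ({
    name,
    applicable: true,
    passed: !failedNames.includes(name),
    hard: name === "sample_size" || name === "construct_validity",
  }));
  return [...gates, { name: "effect_vs_noise", applicable: false, passed: false, hard: false }];
}
```

In other words, under the stated thresholds an association finding must pass all six applicable gates to come out of the gate stage as "validated".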

The Critic (with literature priors)

src/agents/validator/critic.ts.

The Critic gets:

  • The Finding (claim, numbers, feature/target, mechanism).
  • Per-gate results (pass / fail + reason).
  • Relevant literature priors filtered from data/reference/biomarker_priors.json (only those whose feature or target substring-matches this finding).
  • The user's memory (filtered to barrier | preference | decision | value | insight, max 12 entries).

Output schema:

{
  "decision": "accept" | "downgrade" | "reject",
  "concerns": [{
    "category": "confounder" | "reverse_causation" | "selection_bias" |
                "literature_contradiction" | "tautology" | "small_n" | "noise",
    "detail": "<one sentence>",
    "severity": "low" | "medium" | "high"
  }],
  "rationale": "<two-sentence summary>"
}

Hard rules embedded in the system prompt:

  • ANY severity=high concern → must be reject or downgrade.
  • 2+ severity=medium → should be downgrade.
  • 0-1 low and a plausible mechanism → accept.

On malformed output or call failure, the Critic defaults to downgrade (conservative); it never silently accepts.
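The conservative default can be sketched as a guarded parse. parseCritic and the fallback rationale text are hypothetical; the sketch checks only the top-level shape, not each concern entry.

```typescript
type Decision = "accept" | "downgrade" | "reject";
type Severity = "low" | "medium" | "high";

interface CriticOutput {
  decision: Decision;
  concerns: { category: string; detail: string; severity: Severity }[];
  rationale: string;
}

// Used whenever the Critic's reply cannot be trusted.
const CRITIC_FALLBACK: CriticOutput = {
  decision: "downgrade",
  concerns: [],
  rationale: "Critic output was malformed; downgraded conservatively.",
};

function parseCritic(raw: string): CriticOutput {
  try {
    const o = JSON.parse(raw);
    const decisionOk = o.decision === "accept" || o.decision === "downgrade" || o.decision === "reject";
    if (!decisionOk || !Array.isArray(o.concerns) || typeof o.rationale !== "string") {
      return CRITIC_FALLBACK;
    }
    return o as CriticOutput;
  } catch {
    // Invalid JSON (or JSON null) lands here: never silently accept.
    return CRITIC_FALLBACK;
  }
}
```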

Assessment (mechanism / novelty / strategy)

src/agents/validator/assessment.ts.

Only runs for non-rejected findings. Single LLM call. Output:

{
  "mechanism": "<one sentence, grounded in priors or honest 'no established mechanism known'>",
  "novelty": "established" | "supported" | "emerging" | "user_specific",
  "strategy": "<one concrete next step achievable this week>" | null,
  "citations": ["..."]
}

The "strategy quality bar" in the prompt is explicit: must be specific to the finding, not generic ("add a 25-min walk on the 3 lowest-step weekdays", not "exercise more"). If no clear lever exists, strategy is null.


The Fact Sheet contract

The Fact Sheet is the immutable, deterministic dictionary of every number synthesis is allowed to cite. Built by buildFactSheet(validated) in src/agents/validator/types.ts:137–159:

export type FactSheet = Record<string, number>;
// Keys are `<finding_id>.<numbers_key>`:
//   "ds-001.effect"   = -0.374
//   "ds-001.n"        = 87
//   "ds-001.ci_low"   = -0.42
//   "ds-001.ci_high"  = -0.31

What gets in

  • Every numbers key of every non-rejected ValidatedFinding (verdict ∈ {validated, conditional}).
  • Tuples (CIs) are split into <key>_low / <key>_high.
  • Only finite, real-valued numbers (silently drops NaN, Inf, arrays that aren't 2-tuples).
  • Duplicate finding IDs are suffixed -2, -3, etc., so no number is silently dropped.
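The inclusion rules above can be sketched as follows. This is an illustration, not the code in types.ts: ValidatedLite is a cut-down stand-in for ValidatedFinding, and counting only non-rejected findings toward the duplicate suffix is an assumption.

```typescript
type FactSheet = Record<string, number>;
type Verdict = "validated" | "conditional" | "rejected";

interface ValidatedLite {
  id: string;
  verdict: Verdict;
  numbers: Record<string, unknown>;
}

function buildFactSheetSketch(validated: ValidatedLite[]): FactSheet {
  const sheet: FactSheet = {};
  const seen = new Map<string, number>();
  for (const f of validated) {
    if (f.verdict === "rejected") continue; // rejected findings never reach synthesis
    const count = (seen.get(f.id) ?? 0) + 1;
    seen.set(f.id, count);
    const id = count === 1 ? f.id : `${f.id}-${count}`; // duplicates get -2, -3, ...
    for (const [key, value] of Object.entries(f.numbers)) {
      if (typeof value === "number" && Number.isFinite(value)) {
        sheet[`${id}.${key}`] = value;
      } else if (
        Array.isArray(value) && value.length === 2 &&
        value.every((v) => typeof v === "number" && Number.isFinite(v))
      ) {
        // 2-tuples (confidence intervals) are split into _low / _high.
        sheet[`${id}.${key}_low`] = value[0];
        sheet[`${id}.${key}_high`] = value[1];
      }
      // NaN, Infinity, and other shapes are dropped silently.
    }
  }
  return sheet;
}
```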

What's blocked

  • Anything from a verdict=rejected Finding.
  • Any number generated mid-synthesis by the LLM (the fact-check pass flags these as warn).
  • Numbers from the DS Agent's free-text answer that were not captured into findings (extraction is mandatory for synthesis-eligible numbers).

Tolerance window (in factCheckReply)

const tolerance = Math.max(0.02 * Math.abs(sv), 0.05);
// v = number found in the reply, sv = Fact Sheet value
const matches = Math.abs(v - sv) <= tolerance || Math.abs(v - Math.abs(sv)) <= tolerance;

A 2% relative tolerance with a 0.05 absolute floor. Tight enough to catch a 78.3 vs 75.5 fabrication; loose enough to allow rounding (372 vs 371.83) and ratio derivations (e.g. ~0.4 from 7.38 / 18.74).
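The three cases in the paragraph above can be checked by hand. The sketch wraps the tolerance rule in a hypothetical withinTolerance helper and works through each example, plus one sign-insensitive match:

```typescript
// v: number cited in the reply; sv: value from the Fact Sheet.
function withinTolerance(v: number, sv: number): boolean {
  const tolerance = Math.max(0.02 * Math.abs(sv), 0.05);
  return Math.abs(v - sv) <= tolerance || Math.abs(v - Math.abs(sv)) <= tolerance;
}

// Fabrication: tolerance = 0.02 * 75.5 = 1.51, difference = 2.8, so no match.
const fabricated = withinTolerance(78.3, 75.5);

// Rounding: tolerance = 0.02 * 371.83 ≈ 7.44, difference = 0.17, so it matches.
const rounded = withinTolerance(372, 371.83);

// Ratio: 7.38 / 18.74 ≈ 0.394; the 0.05 floor applies, difference ≈ 0.006.
const derivedRatio = withinTolerance(0.4, 7.38 / 18.74);

// Sign dropped in prose ("a correlation of 0.37" for -0.374): matches via |sv|.
const unsigned = withinTolerance(0.37, -0.374);
```

The absolute floor matters mostly for small values such as correlations, where 2% of the value alone would reject ordinary two-decimal rounding.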

Numeric tokens that DON'T trip the fact-check:

  • Bare integers under 100 with no decimal/percent (LLM often uses them as list indices or sentence numbers).
  • Year-looking integers [1900, 2100] (publication years in citations).
  • Numbers in stripped patterns (URLs, markdown links, arXiv:1234.56789, N=1,234, ISO dates).
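The first two exemptions can be sketched as a token-level predicate. This assumes tokens arrive as raw strings with any percent sign still attached and that pattern stripping (URLs, dates, and so on) has already happened upstream; isExemptToken is a hypothetical name.

```typescript
function isExemptToken(token: string): boolean {
  // Only bare integers qualify: a decimal point, percent sign, or minus sign
  // means the token is treated as a quantitative claim.
  if (!/^\d+$/.test(token)) return false;
  const n = Number(token);
  const smallInteger = n < 100;                // list indices, sentence numbers
  const yearLike = n >= 1900 && n <= 2100;     // publication years in citations
  return smallInteger || yearLike;
}
```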

Synthesis constraints

FINAL_SYNTHESIS prompt lines 154-213. The hard rules:

  1. Numbers MUST come from the Fact Sheet. No invention, no inconsistent rounding, no interpolation. If a value isn't there, synthesis must say "we'd need to look more closely" instead of making one up.
  2. Honour the verdicts.
    • VALIDATED → state with confidence.
    • CONDITIONAL → hedge ("preliminary signal", "worth tracking").
    • REJECTED → do NOT mention as findings (the Critic flagged them as confounded / tautological / under-powered).
  3. Lead with the strongest validated finding. No throat-clearing.
  4. Weave in Mechanism + Strategy from the Assessment when available.
  5. One voice. No references to "the data scientist" or "the domain expert", that's an Amy-internal abstraction.
  6. Coach mode ends with the coach's forward question. Other modes end cleanly.
  7. Focused. Answer the asked thing first.

If DS was invoked but failed (dsStatus = "failed"), the prompt prepends an explicit hard-warning block:

⚠ DATA SCIENCE STATUS: FAILED. The data analysis sandbox did NOT complete successfully this turn. There is no validated quantitative result. You MUST be honest, NOT invent numbers, NOT cite a "result", and offer a narrower follow-up.

This is what prevents the "silent degradation" failure mode where synthesis confidently states a number even though the DS sandbox crashed.


Memory extraction

src/orchestrator/index.ts:937–966, extractMemories + validatedToMemories.

After every turn, two memory writes happen in series:

extractMemories(userMessage, assistantMessage)

Prompt: MEMORY_UPDATE (prompts.ts:215–240). Pulls long-term-worth-keeping facts out of the conversation:

[
  {
    "agent": "user" | "orchestrator",
    "kind": "goal" | "barrier" | "preference" | "insight" |
            "hypothesis" | "decision" | "value",
    "text": "...",
    "confidence": 0.0–1.0
  }
]

The prompt has explicit "only extract things relevant 2 weeks from now" guidance. On JSON parse failure → [].

validatedToMemories(validated)

src/orchestrator/index.ts:684–702. For every ValidatedFinding, emits a tested_hypothesis MemoryEntry:

{
  ts, agent: "validator",
  kind: "tested_hypothesis",
  text: validated.claim,
  confidence: { validated: 0.9, conditional: 0.6, rejected: 0.4 }[verdict],
  meta: { finding_id, feature, target, verdict, effect }
}

Memory.appendMany dedupes by meta.finding_id so the same hypothesis re-tested across turns doesn't bloat the JSONL.
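A sketch of the verdict-to-confidence mapping and the dedupe step. The types are cut-down stand-ins, the function names are hypothetical, and keeping the first entry for a given finding_id (rather than replacing it) is an assumption of this sketch:

```typescript
type Verdict = "validated" | "conditional" | "rejected";

interface ValidatedLite {
  id: string;
  claim: string;
  verdict: Verdict;
  feature: string;
  target: string | null;
  effect: number | null;
}

interface MemoryEntry {
  ts: string;
  agent: string;
  kind: string;
  text: string;
  confidence: number;
  meta: { finding_id: string; feature: string; target: string | null; verdict: Verdict; effect: number | null };
}

const CONFIDENCE: Record<Verdict, number> = { validated: 0.9, conditional: 0.6, rejected: 0.4 };

function toTestedHypothesis(f: ValidatedLite, ts: string): MemoryEntry {
  return {
    ts,
    agent: "validator",
    kind: "tested_hypothesis",
    text: f.claim,
    confidence: CONFIDENCE[f.verdict],
    meta: { finding_id: f.id, feature: f.feature, target: f.target, verdict: f.verdict, effect: f.effect },
  };
}

// Assumption: the first entry for a finding_id wins; later duplicates are skipped.
function appendDeduped(existing: MemoryEntry[], incoming: MemoryEntry[]): MemoryEntry[] {
  const seen = new Set(existing.map((e) => e.meta.finding_id));
  const merged = [...existing];
  for (const entry of incoming) {
    if (seen.has(entry.meta.finding_id)) continue;
    seen.add(entry.meta.finding_id);
    merged.push(entry);
  }
  return merged;
}
```

Rejected findings are stored too (at confidence 0.4), which is what lets the Investigator avoid re-proposing a hypothesis that has already failed validation.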

The Investigator reads back memory.testedHypotheses() on its next run and includes them in its prompt to prevent re-proposing the same hypotheses (investigator/index.ts:77–86).


Cost breakdown

Costs come from result.total_cost_usd reported by the Claude Agent SDK (src/llm.ts:150–151). They're accumulated in trace.total_cost_usd and emitted via cost and cost_warning events.

A cost_warning event fires once per turn when cumulative cost crosses config.costWarnUsd (default $3, override via AMY_COST_WARN_USD).
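The once-per-turn warning can be sketched with a small accumulator. CostTracker is a hypothetical class; treating "crosses" as "reaches or exceeds" is an assumption, and the threshold is passed in rather than read from the environment.

```typescript
class CostTracker {
  private total = 0;
  private warned = false;
  private readonly warnUsd: number;

  constructor(warnUsd = 3) {
    this.warnUsd = warnUsd;
  }

  // Add one LLM call's cost; `warn` is true only on the call that first
  // takes the cumulative total to the threshold.
  add(costUsd: number): { total: number; warn: boolean } {
    this.total += costUsd;
    const warn = !this.warned && this.total >= this.warnUsd;
    if (warn) this.warned = true;
    return { total: this.total, warn };
  }
}
```

Creating a fresh tracker per turn gives the "once per turn" behaviour without any explicit reset logic.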

Typical per-step costs (from real transcripts in the README / production traces; vary with prompt size and model):

| Step | Model | Typical cost |
|------|-------|--------------|
| Vagueness classify | fastModel (sonnet) | $0.001-0.003 |
| Routing | fastModel | $0.005-0.015 |
| Question rephrase | fastModel | $0.005-0.015 |
| Investigator (digest + hypotheses + briefing) | model (opus) | $0.05-0.20 |
| DS plan + code-gen | dsModel (sonnet-4-6) | $0.03-0.10 |
| DS debug iteration (each) | dsModel | $0.05-0.15 |
| DS summary + extractFindings | fastModel + validatorModel | $0.01-0.05 |
| DE ReAct (per call) | model | $0.10-0.40 (depends on tool turns) |
| HC recommend gate + main + finish gate | fastModel + model + fastModel | $0.05-0.15 |
| Validator gates | none (Python only) | $0 |
| Critic | validatorModel | $0.02-0.10 per finding |
| Assessment | validatorModel | $0.01-0.05 per finding |
| Reflection | fastModel | $0.003-0.008 |
| Reflection DS follow-up | dsModel | $0.05-0.15 |
| Synthesis | model | $0.05-0.20 |
| Memory extraction | fastModel | $0.005-0.015 |

Total turn cost typically lands at $0.10-0.30 for descriptive queries, $0.30-1.00 for analytical queries, and $1.00-3.00 for vague exploratory queries that fire the Investigator and validate multiple top hypotheses.

The two model knobs that move costs the most:

  • AMY_DS_MODEL, defaults to sonnet-4-6 (faster + cheaper + empirically better than opus at pandas codegen, per the comment in config.ts:53–67).
  • AMY_VALIDATOR_MODEL, defaults to opus because it's the trust-load-bearing step. Setting to sonnet cuts validation cost ~70% but raises the risk of accepting a confounded finding.

Model resolution:

| Alias | Resolves to (direct Anthropic) | OpenRouter env var |
|-------|--------------------------------|--------------------|
| opus | claude-opus-4-7 (whatever Claude Code default is) | ANTHROPIC_DEFAULT_OPUS_MODEL |
| sonnet | claude-sonnet-4-6 | ANTHROPIC_DEFAULT_SONNET_MODEL |
| haiku | claude-haiku-4-5 | ANTHROPIC_DEFAULT_HAIKU_MODEL |
| claude-sonnet-4-6 | exact pin | exact pin |
| claude-opus-4-7 | exact pin | exact pin |

dsModel is pinned to claude-sonnet-4-6 (not the alias) because the comment explicitly notes Sonnet 4.6 outperforms Opus 4.7 on pandas codegen.


Where to next

  • The runtime that hosts the API (Worker, Queues, Cron) is in runtime.md. The orchestrator currently runs in the CLI; the architecture target moves it into a TurnWorkflow so each step above survives restarts and is observable in the Cloudflare dashboard.
  • The ingest pipeline that fills the SQLite the DS Agent reads is in data-pipeline.md.
  • The schema of every column the agents touch is in storage.md.
  • The biomarker priors the Critic uses live in data/reference/biomarker_priors.json.
  • For the user-facing event taxonomy and the "calm" CLI rendering, see src/orchestrator/events.ts.
