Claude Code Supervising Codex: A Practical Framework for Reproducible Acceptance and Drift Prevention

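## Keeping an AI running without drifting off course is really hard

After using tools like Claude Code and Codex for a while, I sometimes just want to let them keep running. But I worry that they will drift off course while writing code on their own, and a long run can also overflow the model's context window. To address this, I designed a workflow in which Claude Code supervises Codex.

Today I'm sharing that approach. It is less a fixed recipe than a way of thinking, so feel free to adapt it into a version that fits you.

> **Note**: this approach suits **1-to-n** iterative development. For a **0-to-1** greenfield project, I still recommend doing the work yourself, or personally watching over the model.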

---

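## Pick the right tools: save money and hassle

My own setup: I have `ChatGPT Plus` and access to `codex`, plus the `glm` `coding plan lite` (which can be configured into `Claude Code`). I also have `Gemini`, but I personally find the `Gemini cli` experience mediocre, so this post uses `Claude Code + codex` for the demo.

In short:

- `glm` coding plan: **generous quota**; I have basically never hit the limit
- `Claude Code`: sometimes *declares a task finished too early*
- `codex`: somewhat more stable, but the model costs more, so use it sparingly

So my strategy is to let `Claude Code` act as the supervisor and let `codex` do the work.

For the `codex` model, I recommend **`ChatGPT-5.2-medium`** (that is, `gpt-5.2` with `model_reasoning_effort=medium`, as in the `CODEX_CMD` used later in this post). The models with the `codex` suffix are officially described as `optimized specifically for coding and agentic tasks`[^2], but in my own testing their results were underwhelming. `medium` behaves roughly like an "Auto" setting. You can also pick `high`, but don't pick `Xhigh`: I tried it, and the results really are good, but it burned through a week's quota in one day, which my wallet cannot take.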

---

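## Two layers of insurance against drift

In this workflow, what I care about most is preventing drift and preventing cheating.

So I use two artifacts as a double safeguard: `tasks.md` and `feature_list.json`. The main differences are as follows:

### 1. Comparison table

| **Aspect**                 | **tasks.md**                                                 | **feature_list.json**                                      |
| -------------------------- | ------------------------------------------------------------ | ---------------------------------------------------------- |
| **Core role**              | **Execution layer**: concrete implementation steps and the verification process | **Management layer**: the final state of product feature requirements |
| **Granularity**            | **Fine-grained**: one feature may be split into several tasks (1.1, 1.2, 1.3) | **Coarse-grained**: one Ref ID maps to one complete feature (R1) |
| **Worker permissions**     | **Partial write**: may only add the `BUNDLE` line (path to the delivered bundle) | **Fully forbidden**: may not modify anything (no unilateral changes to requirements or status) |
| **Supervisor permissions** | **Manages execution**: ticks checkboxes and writes `EVIDENCE` (PASS/FAIL conclusion) | **Updates status**: sets the `passes` field to `true` only after validation passes |
| **Format**                 | **Markdown**: human-readable instructions, test criteria, and run-log paths | **JSON**: structured data with Ref ID, description, and a boolean status |
| **Lifecycle**              | **Dynamic**: logs, errors, and retry records are appended on every run | **Relatively static**: the status flips only when a feature is truly "done and verified" |
| **Audience**               | Humans + AI agents                                           | Mainly AI agents                                           |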

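### 2. Roles and how they connect

#### **What each one does**

- **`tasks.md` (process)**:

It is the **process record**. It captures the full pipeline from implementation to final verification. The Worker can make mistakes and retry here (Attempt #1, #2...), and the Supervisor records the exact validation commands and screenshot paths here. It is the shared **workspace** for human-AI collaboration: it holds the details of trial and error and keeps the process traceable.

- **`feature_list.json` (result)**:

It is the **acceptance baseline**. It does not record the twists and turns of development; it only reflects the final **delivery state**. It answers *which end-to-end capabilities have actually been verified and passed*. It uses stable refs as a long-lived checklist, defaults every entry to `passes=false`, and allows an entry to be marked as passing only once a PASS evidence chain exists for that ref.

#### **What ties them together?**

The two are rigidly bound through **Ref tags (such as `[#R1]`)**:

1. **Mapping**: each task line in `tasks.md` carries a tag (for example `- [ ] 1.1 Implement the login endpoint [#R1]`), which maps directly to the `"ref": "R1"` entry in `feature_list.json`.
2. **State flow (one-way)**:
   - **Verify in `tasks.md` first**: the Supervisor must first run the bundle delivered by the Worker, confirm that the tests pass, and write `EVIDENCE ... RESULT: PASS` into `tasks.md`.
   - **Then record in `feature_list.json`**: only once the evidence chain in `tasks.md` is conclusive (PASS) may the Supervisor set the `passes` field of the matching `R1` entry in `feature_list.json` to `true`.

Why so rigid? With only a task list, the model can "look finished" without actually being finished, whereas a structure like `feature_list.json` makes it easier to spot whether it is bluffing. In a sense, it is the gate that blocks "looks done but does not work"[^1].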
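
The Ref binding described above can be checked mechanically. The following is a minimal Python sketch, assuming `feature_list.json` is a JSON array of objects with `ref` and `passes` keys; the exact schema is not shown in this post, so that layout and the helper name `check_ref_binding` are illustrative assumptions. It reports refs that appear in only one of the two files, and refs marked as passing without a `RESULT: PASS` evidence line under the matching task.

```python
import json
import re

# A checkbox task line that carries a ref tag, e.g. "- [ ] 1.1 Do X [#R1]".
TASK_LINE = re.compile(r"^\s*- \[[ xX]\] .*\[#(R\d+)\]")
# An EVIDENCE line whose result is PASS (the "PASS|FAIL" template does not count).
PASS_EVIDENCE = re.compile(r"EVIDENCE \(RUN #\d+\):.*RESULT:\s*PASS(?!\s*\|)")


def check_ref_binding(tasks_md: str, feature_list_json: str) -> list[str]:
    """Return human-readable problems; an empty list means the two files agree."""
    task_refs = set()
    refs_with_pass = set()
    current_ref = None
    for line in tasks_md.splitlines():
        match = TASK_LINE.match(line)
        if match:
            current_ref = match.group(1)
            task_refs.add(current_ref)
        elif current_ref and PASS_EVIDENCE.search(line):
            refs_with_pass.add(current_ref)

    # Assumed layout: [{"ref": "R1", "passes": false, ...}, ...]
    features = json.loads(feature_list_json)
    feature_refs = {feature["ref"] for feature in features}

    problems = []
    for ref in sorted(task_refs - feature_refs):
        problems.append(f"{ref}: tagged in tasks.md but missing from feature_list.json")
    for ref in sorted(feature_refs - task_refs):
        problems.append(f"{ref}: listed in feature_list.json but not tagged in tasks.md")
    for feature in features:
        if feature.get("passes") and feature["ref"] not in refs_with_pass:
            problems.append(f"{feature['ref']}: passes=true without a PASS evidence line")
    return problems
```

A check like this could run before each checkpoint commit, so that a `passes` flag flipped without evidence is caught early rather than at handoff.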

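In addition, to minimize "starting work before the requirements are aligned", I added a skill that lets the AI ask us questions back and confirm the requirements once more.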

---

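## Overall approach

**[Roles]** Claude Code acts as the **Supervisor**, and Codex is the **Worker**.

[collapse title="Why split it this way?" status="false"]

Because the real fear is not that it cannot write code, but that:
1. it thinks it is "done" when it has only gone through the motions
2. it gets lazy and bypasses validation, or the validation is not reproducible
3. it drifts off course while staying fully confident, and we inherit a mess when we take over

So two agents are used here to guard against cheating as much as possible: one only writes, and the other only accepts.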

[/collapse]

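**[Kickoff]** The whole process starts from an OpenSpec change proposal that I generate with Codex (Worker); the proposal is turned into a concrete to-do list in `tasks.md`. Whenever a new task needs to run, Claude Code (Supervisor) starts a subagent that invokes Codex (Worker) via `codex exec`, and OpenSpec is then driven in natural language. OpenSpec 0.19.0 is the best fit: newer versions reworked the OpenSpec workflow, and although they still support natural-language invocation, it is triggered through `skills`[^3].

**[Execution and delivery]** After writing the code, Codex (Worker) must produce and deliver a reproducible **test plan** as proof of completion and place it under the `auto_test_openspec` directory:

- CLI tasks: the bundle must contain an automated test script (`run.sh`).
- GUI tasks: the bundle must contain an MCP runbook with no executable code (Markdown format), plus a script used only to start the service.

**[Acceptance and sign-off]** Claude Code (Supervisor) runs the scripts itself for acceptance. For GUI tasks, it strictly follows the runbook, calls the `playwright-mcp` service to drive the browser, and captures screenshots as hard evidence, ensuring the feature is not only coded but actually usable[^1].

Only when Claude Code (Supervisor) has personally confirmed that the **test plan** runs and passes, and that the evidence chain in hand is complete, does it perform a series of *sign-off* actions:
1. Tick the task in tasks.md.
2. Update the `passes` state in feature_list.json.
3. Create a Git commit as a checkpoint.
4. Write a handoff log containing evidence pointers to progress.txt.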
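
The gate in front of these sign-off actions can be sketched in Python. The field names (`SCOPE`, `VALIDATION_BUNDLE`, `WORKER_STARTUP_LOG`, `VALIDATED_CLI`, `EXIT_CODE`, `VALIDATED_GUI`, `RUNBOOK`, `SCREENSHOTS`, `RESULT`) come from the pipe-delimited `EVIDENCE` line format in the CLAUDE.md section of this post; the helper names `parse_evidence` and `missing_for_sign_off` are hypothetical, and the parser assumes that no field value contains a `|` character.

```python
def parse_evidence(line: str) -> dict[str, str]:
    """Split a pipe-delimited EVIDENCE line into KEY -> value pairs."""
    fields = {}
    for part in line.split("|"):
        key, sep, value = part.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def missing_for_sign_off(line: str) -> list[str]:
    """List what an EVIDENCE line still lacks; an empty list means sign-off may proceed."""
    fields = parse_evidence(line)
    missing = []
    if not any(key.startswith("EVIDENCE (RUN #") for key in fields):
        missing.append("EVIDENCE (RUN #n)")
    scope = fields.get("SCOPE", "")
    if scope not in {"CLI", "GUI", "MIXED"}:
        missing.append("SCOPE")
    for key in ("VALIDATION_BUNDLE", "WORKER_STARTUP_LOG"):
        if not fields.get(key):
            missing.append(key)
    if scope in {"CLI", "MIXED"}:
        if not fields.get("VALIDATED_CLI"):
            missing.append("VALIDATED_CLI")
        if fields.get("EXIT_CODE") != "0":
            missing.append("EXIT_CODE: 0")
    if scope in {"GUI", "MIXED"}:
        if "playwright-mcp" not in fields.get("VALIDATED_GUI", ""):
            missing.append("VALIDATED_GUI: MCP(playwright-mcp)")
        for key in ("RUNBOOK", "SCREENSHOTS"):
            if not fields.get(key):
                missing.append(key)
    # Sign-off (checkbox, passes flag, commit) happens on PASS only.
    if fields.get("RESULT") != "PASS":
        missing.append("RESULT: PASS")
    return missing
```

Returning the list of missing fields, rather than a bare boolean, gives the Supervisor something concrete to write into `progress.txt` when a run is rejected.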

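**[Exception handling]** If it hits a technical blocker along the way, Claude Code (Supervisor) uses Context7 or a browser search tool to find a solution on its own and guides the Worker through a retry.

**Directory layout**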

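```txt
.
├── auto_test_openspec/                     # [repo-root artifact] immutable evidence store
│   ├── run-0001__task-1.1__ref-R1.../      # validation bundle for one task attempt (run folder)
│   │   ├── run.sh                          # automated reproduction script
│   │   ├── task.md                         # validation runbook
│   │   └── ...                             # (logs, screenshots, inputs/outputs, etc.)
│   └── ...
│
├── git_openspec_history/                   # [repo-root artifact] Git commit index
│   └── runs.log                            # index log: maps Run ID <-> Git commit SHA
│
└── openspec/
    └── changes/
        └── <change-id>/                    # [artifacts inside the OpenSpec change]
            ├── feature_list.json           # feature checklist and pass state (second ledger)
            ├── progress.txt                # handoff log (dialogue and validation results)
            └── tasks.md                    # (task list source file)
```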

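[collapse title="### How is memory preserved?" status="false"]

Each task gets its own subagent, which keeps the context from growing too long or getting polluted. Memory, however, is not preserved that way, so my solution is as follows.

**1. Core mechanism: the "Startup Ritual"**

1. Codex (Worker) is required to **read** the historical records before doing any work:
    - It must read `openspec/changes/<change-id>/progress.txt` and `feature_list.json`.
    - It must run `git log --oneline -20` to get the recent code change history.
    - It must write what it read into `auto_test_openspec/$ARGUMENTS/<run-folder>/logs/worker_startup.txt`, as proof that "I have seen what happened before".

**2. Three memory files**

1. `tasks.md` is the project's "task memory" and single source of truth; it maintains the execution status of every task. Claude Code (Supervisor) reads it to decide what to dispatch next, while Codex (Worker) relies on it to know its concrete implementation target, so both sides share the same understanding of *which tasks are done and which are pending*.

2. `progress.txt` is an append-only "process memory" log used to pass handoff information between sessions. Whenever a task ends, Claude Code (Supervisor) records the dialogue summary, validation results, and error messages here; a newly started Codex (Worker) must consult this history (especially the reasons for failures or blocks) to learn from earlier mistakes and avoid repeating them.

3. `feature_list.json` records the project's completion state, specifically the validation pass state of each feature. Under this mechanism, Codex (Worker) has read-only access, which it uses to confirm the state of dependencies; the file is updated only after Claude Code (Supervisor) has completed strict validation, which keeps the memory of the project's overall usability both continuous and authoritative[^1].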

[/collapse]
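
The Startup Ritual described above can be sketched in Python. The required contents (UTC timestamp, `CODEX_CMD`, `GIT_BASE`, the `git log --oneline -20` excerpt, and a short summary of what was observed) follow the CLAUDE.md rules later in this post; the label names inside the snapshot, the assumed `feature_list.json` layout (a list of objects with `ref` and `passes`), and both function names are illustrative assumptions, not the format Codex actually produces.

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def build_startup_snapshot(codex_cmd, git_base, git_log, progress_text, features, now=None):
    """Render the text of worker_startup.txt from already-collected facts."""
    now = now or datetime.now(timezone.utc)
    pending = [f["ref"] for f in features if not f.get("passes")]
    stripped = progress_text.strip()
    last_progress = stripped.splitlines()[-1] if stripped else "(empty)"
    lines = [
        f"UTC: {now.strftime('%Y%m%dT%H%M%SZ')}",
        f"CODEX_CMD: {codex_cmd}",
        f"GIT_BASE: {git_base}",
        "GIT_LOG:",
        git_log.rstrip(),
        f"OBSERVED: {len(pending)} of {len(features)} features not yet passing "
        f"({', '.join(pending) or 'none'}); last progress line: {last_progress}",
    ]
    return "\n".join(lines) + "\n"


def write_startup_snapshot(change_dir: Path, run_folder: Path, codex_cmd: str) -> Path:
    """Read the memory files, query git, and write logs/worker_startup.txt."""

    def git(*args: str) -> str:
        result = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
        return result.stdout.strip()

    features = json.loads((change_dir / "feature_list.json").read_text(encoding="utf-8"))
    progress = (change_dir / "progress.txt").read_text(encoding="utf-8")
    snapshot = build_startup_snapshot(
        codex_cmd,
        git("rev-parse", "--short", "HEAD"),
        git("log", "--oneline", "-20"),
        progress,
        features,
    )
    target = run_folder / "logs" / "worker_startup.txt"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(snapshot, encoding="utf-8")
    return target
```

Keeping the rendering step separate from the git and file access makes the snapshot format easy to test without a repository.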

---

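## Skills and MCP configuration

### 1. Configure MCP

If your tasks involve GUI (or MIXED) work, I strongly recommend adding playwright-mcp. The goal is that the Supervisor neither clicks through pages by hand nor runs Playwright from scripts, but drives the browser through MCP and collects evidence (screenshots, logs, and so on).

playwright-mcp: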

```cmd
claude mcp add --transport stdio --scope user playwright-mcp -- npx -y @playwright/mcp@latest
```

Also add context7 (to look things up and fill in context when blocked):

```cmd
claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
```

For the browser search MCP I use Zhipu's (you can swap in another provider, as long as the names match):

```cmd
claude mcp add -s user -t http web-search-prime https://open.bigmodel.cn/api/mcp/web_search_prime/mcp --header "Authorization: Bearer your_api_key"
```

```cmd
claude mcp add -s user -t http web-reader https://open.bigmodel.cn/api/mcp/web_reader/mcp --header "Authorization: Bearer your_api_key"
```

[Configuration example](#自定义skills-mcp)

### 2. Skills

I maintain these skills directly in a repository; download whichever you need:

For `codex`:

- [Download openspec-change-interviewer from GitHub](https://github.com/Rosetears520/aili-notes/tree/main/skills/openspec-change-interviewer) (aligns requirements through **interview-style questions**)
- [Download openspec-feature-list from GitHub](https://github.com/Rosetears520/aili-notes/tree/main/skills/openspec-feature-list) (generates `feature_list.json`)

For `Claude Code`:

- This one handles the Supervisor's research when it is blocked: [download openspec-unblock-research from GitHub](https://github.com/Rosetears520/aili-notes/tree/main/skills/openspec-unblock-research)

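[collapse title="Customizing the MCP server for openspec-unblock-research" status="false"]
**1. Configure the MCP server**

Run `claude mcp list`. You must see that `mcp__<new-search-name>__*` and `mcp__github__*` (or other helper tools) are both loaded.

**2. Edit the core file (`SKILL.md`)**

Make two key changes to the `SKILL.md` of `openspec-unblock-research`:

**1. Update the Description in the file header**
    Keep the description consistent with the actual tools.

- Change `mcp__web-search-prime__*`
    - to `mcp__<new-search-name>__*`

**2. Update Default Provider Ordering**
    In the list at the bottom of the file, **insert the new tool** and **replace the old search**.

**Example:**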

```markdown
## Default provider ordering (if caller omits toolchain_config)

1. `mcp__context7__*` (authority source)
   ...

2. `mcp__github__*` (new: internal authority)
   - Use for: checking existing issues/bugs in repo or upstream.
   - Trigger when: `error_excerpt` looks like a library bug.
   - Stop when: found a closed issue matching symptoms.

3. `mcp__<new-search-name>__*` (replaces the original search-prime)
   - Use for: recent regressions, common pitfalls.
   - Trigger when: `error_excerpt` includes searchable strings.
   - Stop when: have candidate links to verify.

4. `mcp__web-reader__*` (evidence fetcher)
   ...
```
[/collapse]

---

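## Files to change

### Optional: code style rules

Edit `AGENTS.md`. The main purpose is to keep the generated code reasonably clean and concise. This is a matter of personal preference; you can also add other rules, such as requiring a uv virtual environment. Skip it if you don't find it necessary.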

```md
## Code hygiene guardrails (always-on)

- Prioritize correctness and maintainability over cosmetic changes.
    
- Keep scope tight: don’t refactor unrelated areas; avoid “while I’m here” edits.
    
- Write for the next reader: choose clear names, straightforward control flow, and readable structure.
    
- Avoid clever compactness (dense one-liners, nested ternaries). Prefer if/else or switch when branching grows.
```

### Key file changes

To get this workflow running, we need to overwrite or create a few configuration files.

#### Additions to openspec-proposal.md

**Location:**

- Windows: `%USERPROFILE%\.codex\prompts\openspec-proposal.md`
- macOS/Linux: `~/.codex/prompts/openspec-proposal.md`

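Purpose: make the `tasks.md` that OpenSpec generates match our requirements.

> Note: this file must be edited after running `openspec init`; otherwise it gets reset to the default.

[collapse title="Add after step 6 of **Steps**" status="false"]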

```md
- When drafting `openspec/changes/<id>/tasks.md`, you MUST follow:
  - `openspec/project.md` → `## tasks.md Checklist Format` (canonical; do not invent a parallel format).

- Hard gate reminders (do not expand here; see canonical spec above):
  - Every task MUST include `ACCEPT:` and `TEST:`.
  - Every checkbox task line MUST include EXACTLY ONE `[#R<n>]` token, unique across the file.
  - `TEST:` MUST include `SCOPE: CLI|GUI|MIXED` and MUST enable a human-reproducible validation bundle
    (all bundle rules + role split + evidence rules live ONLY in `openspec/project.md`).

- Role split (mandatory; see `openspec/project.md` → “Validation bundle requirements”):
    - Worker produces bundle assets only; Supervisor executes and records PASS/FAIL evidence.

- GUI/MIXED constraint (mandatory; see `openspec/project.md` → “CLI/GUI/MIXED validation requirements”):
    - GUI verification must be driven via MCP service `playwright-mcp` and evidence must be archived; do NOT use any browser automation scripts (Python/Node/Playwright test runner).
```

[/collapse]
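
The hard-gate reminders in that prompt addition lend themselves to a small linter. Below is a minimal Python sketch that checks a `tasks.md` text for the rules stated there: a dot-numbered task id, exactly one `[#R<n>]` tag per checkbox line, tags unique across the file, and `ACCEPT:` / `TEST:` blocks with a declared `SCOPE`. The function name `lint_tasks_md` and the message wording are illustrative assumptions; OpenSpec itself is not known to ship such a check.

```python
import re

CHECKBOX = re.compile(r"^\s*- \[[ xX]\] (\S+)\s")   # "- [ ] 1.1 summary ..."
REF = re.compile(r"\[#R\d+\]")                       # "[#R1]"
TASK_ID = re.compile(r"^\d+(\.\d+)+$")               # "1.1", "2.3"
SCOPE = re.compile(r"SCOPE:\s*(CLI|GUI|MIXED)\b")


def lint_tasks_md(text: str) -> list[str]:
    """Return format problems found in tasks.md; an empty list means it looks valid."""
    # Group each checkbox line with the lines that follow it.
    blocks = []
    for line in text.splitlines():
        match = CHECKBOX.match(line)
        if match:
            blocks.append((match.group(1), [line]))
        elif blocks:
            blocks[-1][1].append(line)

    problems = []
    ref_owner = {}  # ref tag -> task id that first used it
    for task_id, lines in blocks:
        body = "\n".join(lines[1:])
        if not TASK_ID.match(task_id):
            problems.append(f"{task_id}: task id is not dot-numbered")
        refs = REF.findall(lines[0])
        if len(refs) != 1:
            problems.append(f"{task_id}: expected exactly one [#R<n>] tag, found {len(refs)}")
        for ref in refs:
            if ref in ref_owner:
                problems.append(f"{task_id}: {ref} already used by task {ref_owner[ref]}")
            else:
                ref_owner[ref] = task_id
        for keyword in ("ACCEPT:", "TEST:"):
            if keyword not in body:
                problems.append(f"{task_id}: missing {keyword}")
        if not SCOPE.search(body):
            problems.append(f"{task_id}: no SCOPE: CLI|GUI|MIXED declared")
    return problems
```

Running a check like this right after the proposal is drafted would catch format slips before any Worker run is spent on them.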

#### Project directory: openspec\project.md

Purpose: make the `tasks.md` that OpenSpec generates match our requirements.

[collapse title="Append to the end of `project.md`" status="false"]

```md
## tasks.md Checklist Format

This section is the SINGLE canonical spec for tasks.md format and validation bundles.
Do not duplicate this spec elsewhere; other docs must link here.

### Task Line Format (required)

Each checkbox task line MUST follow:
- `- [ ] <task-id> <task summary> [#R<n>]`
- `<task-id>` MUST be dot-numbered (e.g. `1.1`, `2.3`).
- Each checkbox line MUST include EXACTLY ONE `[#R<n>]` token (e.g. `[#R1]`).
  - `[#R<n>]` MUST be unique across the entire tasks.md (never reuse).
- Every task MUST include both `ACCEPT:` and `TEST:` blocks.
- `TEST:` MUST include `SCOPE: CLI|GUI|MIXED` and MUST be implementable into a validation bundle
  per `### Validation bundle requirements (mandatory)` below.

### Example (copy/paste)

- [ ] 1.1 Do X and produce Y [#R1]
  - ACCEPT: ...
  - TEST: SCOPE: CLI
    - When done, generate validation bundle under:
      auto_test_openspec/<change-id>/<run-folder>/
    - run-folder MUST be:
      run-<RUN4>__task-<task-id>__ref-<ref-id>__<YYYYMMDDThhmmssZ>/
    - Run: auto_test_openspec/<change-id>/<run-folder>/run.sh (macOS/Linux) or run.bat (Windows)
    - Inputs: inputs/sample.json
      Outputs: outputs/result.json
    - Verify: compare against expected/result.json (or rule-based assertions)

### Validation bundle requirements (mandatory)

For every task, `TEST:` MUST be written so:
- the Worker can produce a **human one-click reproducible** validation bundle (assets + scripts for CLI checks; GUI checks are MCP-driven and MUST NOT use any browser automation scripts),
- AND the Supervisor can execute it and record the final PASS/FAIL evidence chain
  (each run-folder is immutable; evidence pointers are written after execution).

0) Roles & responsibilities (mandatory)
- Worker (produces artifacts; not the final verifier):
  - Implement product code + write tests (CLI). For GUI/MIXED, produce an MCP runbook only (no executable browser automation scripts).
  - Produce the validation bundle assets under the run-folder:
    `task.md`, `run.sh`, `run.bat`, `tests/` (CLI tests and/or GUI MCP runbook; no executable browser scripts), and (when applicable) `inputs/`, `expected/`.
  - MUST NOT declare PASS/FAIL.
  - MUST NOT overwrite/edit prior run-folders (append-only history).

- Supervisor (executes validation; forms the evidence chain):
  - MUST create a brand-new run-folder for every validation attempt (never overwrite).
  - Executes `run.sh` / `run.bat`, captures `outputs/` + `logs/` + GUI evidence when applicable.
  - MUST write the final PASS/FAIL result + evidence pointers (this is the DONE hard gate).

1) Canonical on-disk location (repo root; append-only)
- Root folder (fixed):
  - `auto_test_openspec/<change-id>/`
- Each validation attempt MUST create a brand-new run folder (never overwrite; keep ALL history forever):
  - `auto_test_openspec/<change-id>/<run-folder>/`
- Once created, a run folder MUST be treated as immutable evidence:
  - do not edit prior runs; create a new run folder instead.

2) Run folder naming (required; MUST include run#, task-id, ref-id; timestamp recommended)
- `<run-folder>` MUST follow this exact pattern:
  - `run-<RUN4>__task-<task-id>__ref-<ref-id>__<YYYYMMDDThhmmssZ>/`
- Example:
  - `run-0007__task-1.1__ref-R1__20260111T031500Z/`
- Rules:
  - `<RUN4>`: zero-padded, monotonic run counter (e.g. 0001, 0002, ...).
    - MUST match the Supervisor workflow RUN_COUNTER / `EVIDENCE (RUN #n)` numbering for audit alignment.
    - Mapping rule: `RUN #7` => `run-0007`, `RUN #12` => `run-0012`.
  - `<task-id>`: dot-numbered task id from the checkbox line (e.g. `1.1`).
  - `<ref-id>`: stable ref id derived from the task tag (e.g. `[#R1]` → `R1`).
  - `<YYYYMMDDThhmmssZ>`: UTC timestamp to guarantee uniqueness and ease auditing.

3) Minimum required contents inside EVERY run folder
Each run folder MUST contain at least:

A) `task.md` (this run’s readme; MUST be self-sufficient)
task.md MUST include:
- change-id, run#, task-id, ref-id
- SCOPE covered (CLI / GUI / MIXED)
- How to run (Windows + macOS/Linux)
  - CLI: run.sh/run.bat executes CLI checks.
  - GUI/MIXED: run.sh/run.bat starts the service only; GUI steps are executed via the MCP runbook under tests/.
- Test inputs (if any): input file paths, params, sample data
- Test outputs (if any): what files/stdout/stderr/screenshots/logs will be produced and where
- Expected results (machine-decidable): pass/fail criteria
  - exit code checks
  - stdout/stderr assertions (required when relevant)
  - file existence/content assertions (required when outputs exist)
  - GUI assertion points (when GUI/MIXED): which screenshots/states prove correctness
- Hard rules (GUI/MIXED):
  - task.md MUST NOT contain manual browser steps (no “open Chrome/click buttons” prose).
  - task.md MUST point to the MCP-only runbook under tests/ (e.g., tests/gui_runbook_<topic>.md).
  - Any required “copy/seed/prepare input/state” steps MUST be written as exact commands/steps here (and referenced by the runbook). run.sh/run.bat MUST NOT perform them.
- Provenance of expected/assumptions:
  - If inputs/expected are not provided by a human, the Worker MUST generate them and document where they came from
    (e.g., derived from ACCEPT, or an explicit reasonable assumption).

B) One-click scripts (both required; GUI/MIXED = start-server only)
- run.sh (macOS/Linux)
- run.bat (Windows)

Script requirements (all bundles):
- Must assume the default dev machine environment is ready.
- Non-destructive:
  - MUST NOT modify global environment
  - MUST NOT globally install dependencies
  - MUST NOT write to system directories
- Must be runnable from ANY working directory:
  - the script MUST cd/pushd to its own directory first, then resolve paths from there.

Hard rule (when SCOPE includes GUI):
- run.sh/run.bat MUST be start-server only:
  - MUST: start the local service and print the access URL/port (e.g., http://127.0.0.1:<PORT>/)
  - MUST NOT: copy/overwrite data files, mutate state/inputs, generate exports/outputs, run tests, run exports, probe/install dependencies, or perform environment probes (python/uv version checks do NOT belong in GUI start scripts)
  - Any required “copy/seed/prepare input/state” steps MUST be documented as exact commands/steps in task.md (and referenced by tests/gui_runbook_*.md) for the Supervisor to execute and record in EVIDENCE.

For CLI bundles (or the CLI portion of MIXED):
- run.sh/run.bat SHOULD print key results to console and SHOULD write logs to logs/.
- Environment provenance SHOULD be documented as optional preflight commands in task.md (not forced into GUI start scripts), e.g.:
  - interpreter path + version (Python/Node if used)
  - uv --version when Python/uv is involved
- When provenance is executed, it SHOULD be recorded to logs/.

C) Test asset folders (create the ones that apply)

- `logs/` MUST exist (always):
  - run logs, env/version info, command transcript, GUI screenshot index, etc.
- `tests/` MUST exist when:
  - SCOPE includes GUI (MCP-driven via `playwright-mcp`), OR
  - validation is not fully expressible as simple CLI assertions.
- `inputs/` MUST exist when the task involves file input (see I/O hard rule below).
- `outputs/` MUST exist when the validation produces file outputs (see I/O hard rule below).
- `expected/` SHOULD exist when golden-file comparison is used; otherwise rule-based assertions are acceptable.

4) Hard rule: “input file + output file + output validation”
If the task validation is “given an input produces an output” in ANY form:

- `inputs/` MUST contain at least one reproducible input sample.
- `run.*` MUST write the real produced outputs into `outputs/` (never into random temp/system dirs).
- The bundle MUST include at least one machine-decidable verification method (pass/fail), typically:
  - (A) golden file compare against `expected/` (exact match OR documented allowed-diff rules), and/or
  - (B) rule-based assertions (e.g. JSON schema, key fields, row counts, regex match, exit code, forbidden strings).

`task.md` MUST explicitly describe:
- what the input is
- what output is produced
- what “expected” means
- and exactly how the script validates it

5) CLI / GUI / MIXED validation requirements
- If SCOPE includes CLI:
  - MUST run the real CLI command(s) in `run.*`
  - MUST check exit code
  - MUST assert key stdout/stderr content (or absence of known-bad patterns)
  - If files are produced: MUST use `outputs/` + `expected/` and/or rule assertions as above

- If SCOPE includes GUI:
  - The validation bundle MUST provide an MCP-only GUI verification runbook
    (stored under tests/ and executed by the Supervisor via playwright-mcp; do NOT use any scripts to drive the browser).
  - Hard rule: run.sh/run.bat MUST be start-server only for GUI/MIXED bundles:
    - MUST: only start the service and print URL/port
    - MUST NOT: copy/seed/prepare input/state, generate exports/outputs, run tests, or perform environment probes
    - Any required data prep steps MUST be written as exact commands/steps in task.md (and referenced by the runbook).
  - Supervisor execution constraint (mandatory):
    - GUI verification MUST be driven via MCP service playwright-mcp
      - no manual browser interaction
      - no Python/Node/Playwright scripts to drive the browser
  - Must archive auditable evidence artifacts (append-only; never overwrite):
    - at minimum: screenshots (e.g., outputs/screenshots/ plus a screenshots index file in logs/)
    - recommended: trace/video and a console log index when available from MCP (paths recorded in logs/)

- If SCOPE is MIXED:
  - The bundle MUST cover both CLI and GUI checks (either in one test file or split; see “two test files” rule below).

6) Allowing two test files (when needed; organization rule)
Default: one test file should cover key acceptance points.

Two test files are allowed / recommended when:
- CLI + GUI are both involved:
  - one test focuses on CLI
  - one runbook focuses on GUI (MCP steps + assertions; no executable browser scripts)
- Same entrypoint but two distinct paths must be covered:
  - happy path + error/edge path (e.g., valid vs invalid args)
- GUI needs both “functional flow” and “render/state”:
  - split into two smaller, more stable tests

Suggested naming under the run folder:
- `tests/test_cli_<topic>.*`
- `tests/gui_runbook_<topic>.md` (MCP-only steps + assertion points; no executable browser scripts)

Note:
- “two test files” refers to validation assets under `tests/` (CLI test scripts and/or GUI MCP runbook).
- The “input/output two files + validation” rule refers to runtime data under `inputs/outputs/expected` and is additive, not conflicting.

7) Environment isolation (uv venv rule; mandatory when env problems occur)
- Under no circumstances may the Worker “pollute global Python env” to make validation pass (e.g., global `pip install`).
- If the Worker encounters environment problems (missing deps, conflicts, cannot run):
  - MUST create an isolated venv using `uv`
  - Recommended location: inside THIS run folder (e.g. `<run-folder>/.venv/` or `<run-folder>/venv/`)
  - All installs/runs must occur inside that venv
- `run.*` and/or `logs/` MUST clearly record:
  - which interpreter is used
  - uv version
  - where dependencies came from (lockfile / pyproject / etc.)
- Note:
  - Creating a venv is conditional (only when env problems occur),
    but running the full validation bundle is unconditional (always required).

8) tasks.md bookkeeping lines (mandatory; role split; no duplicated rules elsewhere)
- Under the task entry in `openspec/changes/<change-id>/tasks.md`, TWO lines are mandatory:
  - Worker-written (bundle-ready; NO PASS/FAIL):
    - `BUNDLE (RUN #n): ... | VALIDATION_BUNDLE: auto_test_openspec/<change-id>/<run-folder> | HOW_TO_RUN: run.sh/run.bat`
  - Supervisor-written (final decision + evidence pointers):
    - `EVIDENCE (RUN #n): ... | VALIDATED: <exact commands + exit code> | RESULT: PASS|FAIL | GUI_EVIDENCE: <paths when applicable>`
- Worker MUST NOT claim PASS/FAIL anywhere; Supervisor is the only role that records PASS/FAIL after running the bundle.
```

[/collapse]
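
The run-folder naming rule in that spec (`run-<RUN4>__task-<task-id>__ref-<ref-id>__<YYYYMMDDThhmmssZ>/`) is easy to get subtly wrong by hand, so here is a minimal Python sketch that builds and parses such names. The function names `run_folder_name` and `parse_run_folder` are hypothetical helpers for illustration; they are not part of OpenSpec or Codex.

```python
import re
from datetime import datetime, timezone

RUN_FOLDER = re.compile(
    r"^run-(?P<run>\d{4})"
    r"__task-(?P<task>\d+(?:\.\d+)+)"
    r"__ref-(?P<ref>R\d+)"
    r"__(?P<ts>\d{8}T\d{6}Z)$"
)


def run_folder_name(run_number: int, task_id: str, ref_tag: str, now=None) -> str:
    """Build a run-folder name; RUN #7 maps to run-0007 and [#R1] maps to R1."""
    now = now or datetime.now(timezone.utc)
    ref_id = ref_tag.strip("[]#")
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"run-{run_number:04d}__task-{task_id}__ref-{ref_id}__{stamp}"


def parse_run_folder(name: str) -> dict:
    """Split a run-folder name back into its parts, or raise ValueError."""
    match = RUN_FOLDER.match(name.rstrip("/"))
    if not match:
        raise ValueError(f"not a valid run-folder name: {name!r}")
    return {
        "run": int(match["run"]),
        "task": match["task"],
        "ref": match["ref"],
        "timestamp": match["ts"],
    }
```

Parsing the name back is what lets the Supervisor cross-check the folder against the `RUN #n` number in the `BUNDLE` and `EVIDENCE` lines. Note that this sketch accepts only four-digit run counters, matching the zero-padded examples in the spec.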

#### Project directory: .\claude.md

Purpose: define Claude Code's role and workflow.

[collapse title="Fully overwrite `claude.md`" status="false"]

```md
# CLAUDE.md (OpenSpec + Codex Supervisor)

You are the SUPERVISOR (Claude Code). Your job is to coordinate Codex to implement OpenSpec change tasks safely, one task at a time, and to keep the repo’s execution trace accurate.

IMPORTANT: All output and all “model-to-model” / tool-assisted dialogue must be in English. Do not produce Chinese text.

## Source of truth
- `openspec/changes/<change-id>/tasks.md` is the single source of truth for implementation progress.
- Do not use `TODO.md` for this workflow. Do not invent tasks outside `tasks.md`.

## Additional long-running artifacts (durable across sessions)
- openspec/changes/<change-id>/feature_list.json is the durable end-to-end feature checklist.
  - One entry per stable ref tag (e.g., [#R1] in tasks.md maps to "ref": "R1" in JSON).
  - Default all features to failing (passes=false) until validated.
  - Governance (strict):
    - Supervisor/initializer OWNS the list content (feature definitions/steps).
    - Worker is FORBIDDEN to add/remove/rewrite feature entries.
    - Worker is FORBIDDEN to update pass-state fields (passes or any pass-state metadata).
    - Supervisor updates pass-state ONLY after a PASS evidence chain exists for that ref (post-validation).
    - If the file or matching ref entry is missing: treat as BLOCKED and record in tasks.md; do NOT scaffold or invent entries.
- openspec/changes/<change-id>/progress.txt is the Supervisor-written handoff log.
  - Append-only. One RUN entry per task attempt (one subagent / one Codex run).
    - A single /monitor-openspec-codex ... invocation MUST append at most ONE RUN entry (no batch loop by default).
    - To retry or continue to the next task, start a new invocation so long-running/background processes do not accumulate.
  - Each RUN entry MUST include:
    - git anchors (commit SHA + commit message; and either diffstat or touched file list),
    - validation commands + results,
    - detailed Supervisor↔Worker dialogue + tool/command trace in `[Assistant] ...` / `[Tool Use] ...` style for replay/audit.
  - Must reflect only verified facts (no aspirational claims).
- `git_openspec_history/<change-id>/runs.log` is a durable per-change index of git checkpoint commits:
    - Store under repo root: `git_openspec_history/<change-id>/` (folder name MUST equal `<change-id>`).
    - Append-only log: `git_openspec_history/<change-id>/runs.log` (one line per successful RUN linking run# → commit → diffstat/files).
- `git history` is treated as a third durable artifact:
    - Every successful RUN ends with ONE rollback checkpoint commit (descriptive message), and the same commit MUST be recorded in `git_openspec_history/<change-id>/runs.log`.

## Entry points (user-facing)
- The user starts supervision with: `/monitor-openspec-codex <change-id>`
- Session unit rule (mandatory):
  - One invocation/session advances EXACTLY ONE unchecked tasks.md checkbox item.
  - State restoration across sessions relies on: progress.txt + feature_list.json + git history
    + git_openspec_history/<change-id>/runs.log.

## Worker invocation (Codex CLI)
# Single Codex command constant (maintain ONLY ONE copy)
CODEX_CMD = codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium

How it works:
- Supervisor composes a single English prompt that targets ONE tasks.md checkbox item.
- Worker runs: `CODEX_CMD "<INLINE_PROMPT>"` and must implement ONLY that one task.
- Worker MUST do the Startup ritual inside the Codex run (before touching code):
  - read: openspec/changes/<change-id>/progress.txt + feature_list.json (+ tasks.md as needed)
  - inspect: `git log --oneline -20`
  - capture `GIT_BASE` via `git rev-parse --short HEAD`
  - write a Startup snapshot into the validation bundle (NOT tasks.md), at:
    - `auto_test_openspec/<change-id>/<run-folder>/logs/worker_startup.txt`
    - MUST include (at minimum): UTC timestamp, CODEX_CMD, GIT_BASE, the `git log --oneline -20` excerpt, and a short “what I observed” summary.
  - NOTE: Do NOT write STARTUP/GIT_BASE fields into tasks.md. Supervisor may cite this file path later in EVIDENCE.
- Worker MUST NOT toggle any tasks.md checkbox. Supervisor owns checkboxes.
- Worker MUST NOT edit feature_list.json (neither entries nor pass-state).
- Worker MUST NOT create git commits.
- Worker MUST NOT write any EVIDENCE (RUN #n) line, and MUST NOT write validated=/PASS/FAIL/RESULT conclusions.
- Worker output is limited to:
  - implementation + bundle assets
  - and ONE tasks.md bookkeeping line:
    - BUNDLE (RUN #n): ... | VALIDATION_BUNDLE: auto_test_openspec/<change-id>/<run-folder> | HOW_TO_RUN: run.sh/run.bat | (if GUI) RUNBOOK: tests/gui_runbook_<topic>.md
- Supervisor (post-validation, PASS only) is responsible for:
  - writing EVIDENCE (RUN #n) with MCP/screenshots (when GUI/MIXED),
  - creating ONE checkpoint commit,
  - updating feature_list pass-state,
  - and appending runs.log (if applicable).

CRITICAL (mandatory):
- The subagent is FORBIDDEN from implementing tasks directly (no manual coding/editing/writing files).
- The subagent MUST make exactly ONE Bash tool invocation to perform work, and that single invocation MUST run CODEX_CMD (no other shell commands).
- Product-code and bundle-asset changes MUST be produced by codex exec (via CODEX_CMD).
- Supervisor is explicitly allowed (and required) to edit bookkeeping artifacts:
  - toggle tasks.md checkboxes, write EVIDENCE (RUN #n) lines, append progress.txt, and create ONE checkpoint commit on PASS.
- Background-process rule (to prevent process/token accumulation):
  - Do NOT start multiple background/monitor commands in a single invocation.
  - If any long-running process was started (e.g., a server), terminate it before starting a new attempt.

Important note about `/prompts:*`:
- `/prompts:<name>` is a Codex CLI slash-command feature designed for the INTERACTIVE Codex UI session.
- Do NOT rely on `/prompts:*` in automated non-interactive runs (`codex exec`). Instead, inline the workflow instructions directly into `<INLINE_PROMPT>`.
 
## Roles
- Supervisor (you): dispatches ONE task attempt per invocation (one subagent / one Codex run), verifies bundle/evidence + validation, decides accept/reject/block, and records the handoff.
  - Within a single /monitor-openspec-codex ... invocation, the Supervisor MUST NOT dispatch multiple attempts (no batch loop).
  - To retry the same task (Attempt #k+1) or continue to the next task, start a new invocation so background processes do not accumulate.
  - Supervisor is the ONLY role allowed to toggle checkboxes in `tasks.md`.
  - Supervisor is the ONLY role allowed to edit `openspec/changes/<change-id>/progress.txt` (append-only).
  - Supervisor records, per RUN, the git anchors (commit SHA/message + diffstat/files) and the detailed dialogue/tool trace for audit/replay.

- Worker (Codex via CODEX_CMD): coding agent for ONE task only.
  - MUST perform Startup ritual at the beginning of EVERY run (progress.txt + feature_list.json + `git log --oneline -20` + `git rev-parse --short HEAD`)
    and write what was observed into the validation bundle log:
    - `auto_test_openspec/<change-id>/<run-folder>/logs/worker_startup.txt` (mandatory)
  - MUST implement + write tests (CLI) + produce the validation bundle assets (task.md/run.sh/run.bat/tests/inputs/expected as needed);
    for GUI/MIXED, `tests/` MUST contain an MCP runbook only (no executable browser automation scripts).
  - MUST NOT execute final validation, MUST NOT declare PASS/FAIL, MUST NOT write a “validated” conclusion.

- Supervisor: executes validation and forms the final evidence chain.
  - Runs `auto_test_openspec/<change-id>/<run-folder>/run.sh|run.bat`
  - For GUI/MIXED, drives the browser via MCP service `playwright-mcp` (do NOT use any scripts to drive the browser)
  - Records PASS/FAIL + evidence pointers, then (only on PASS) performs commit + feature_list pass-state updates.

- Worker restrictions:
  - MUST NOT toggle any checkbox in `tasks.md`.
  - MUST NOT edit `openspec/changes/<change-id>/progress.txt`.
  - MUST NOT add/remove/rewrite feature_list entries or update pass-state fields.

- Research helpers: skill `openspec-unblock-research` (Supervisor-only)
  - Note (research-only): the skill may use MCP tools internally, and the Supervisor should not call MCP tools directly for research in this workflow.

- Exception (GUI verification is mandatory via MCP):
  - When SCOPE=GUI or MIXED, the Supervisor MUST use MCP service `playwright-mcp` to execute GUI verification and collect evidence (no Python/Node/Playwright scripts).

## Task selection rules (tasks.md)
- Pick the FIRST ELIGIBLE unchecked checkbox item (`- [ ] ...`) in `openspec/changes/<change-id>/tasks.md` (top-to-bottom).
  - ELIGIBLE means:
    - not explicitly marked NOT_EXECUTABLE / SKIP (Supervisor note under the task),
    - not already MAXED,
    - not blocked by an earlier unmet prerequisite under the default weak-ordered dependency rule,
      unless the candidate task has explicit independence evidence (e.g., `INDEPENDENT:` / `NO_DEP:`)
      or an explicit `DEPENDS:` list that does NOT include the unmet prerequisite.
- Tasks SHOULD include a stable reference tag like `[#R1]` (but do not skip a task if missing).
- One task = one subagent = one worker run. Never do multiple tasks in a single run.

## Verification + bookkeeping rules
After the worker finishes a task:
1) Re-open `openspec/changes/<change-id>/tasks.md`.
2) Supervisor is the ONLY role allowed to change any checkbox (`- [ ]` → `- [x]`).
  - Worker/Codex MUST NOT toggle checkboxes.
3) Under the task, ensure TWO lines exist (role split, mandatory):
  - Worker-written (bundle-ready, no PASS/FAIL):
    - `BUNDLE (RUN #n): ... | VALIDATION_BUNDLE: auto_test_openspec/<change-id>/<run-folder> | HOW_TO_RUN: run.sh/run.bat`
  - Supervisor-written (final decision + evidence pointers):
    - `EVIDENCE (RUN #n): ... | VALIDATED: <exact commands + exit code> | RESULT: PASS|FAIL | GUI_EVIDENCE: <screenshots/trace/video/console index paths>`
	- Prefer this format (SINGLE LINE, THIS TASK ONLY):
	EVIDENCE (RUN #n): CODEX_CMD=codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium
	| SCOPE: <CLI|GUI|MIXED>
	| VALIDATION_BUNDLE: auto_test_openspec/<change-id>/<run-folder>
	| WORKER_STARTUP_LOG: auto_test_openspec/<change-id>/<run-folder>/logs/worker_startup.txt
	| VALIDATED_CLI: <exact command(s)> | EXIT_CODE: <n>              (omit if no CLI)
	| VALIDATED_GUI: MCP(playwright-mcp) | RUNBOOK: tests/<...> | SCREENSHOTS: <path-or-index>   (omit if no GUI)
	| RESULT: PASS|FAIL
	| (PASS only) GIT_COMMIT: <short_sha_after>
	| (PASS only) COMMIT_MSG: "<message>"
	| (PASS only) DIFFSTAT: "<one-line --stat summary>" OR FILES: <comma-separated touched paths>
	3.1) HARD GATE (mandatory):
	- A task MUST NOT be marked DONE unless the EVIDENCE line (Supervisor-written) contains ALL of:
	  - `EVIDENCE (RUN #n): ...`   # identifies which run this is
	  - `SCOPE: CLI|GUI|MIXED`
	  - `VALIDATION_BUNDLE: auto_test_openspec/<change-id>/<run-folder>/`
	  - `WORKER_STARTUP_LOG: auto_test_openspec/<change-id>/<run-folder>/logs/worker_startup.txt`
	  - (If SCOPE includes CLI) `VALIDATED_CLI: <exact commands> | EXIT_CODE: 0`
	  - (If SCOPE=GUI or MIXED) `VALIDATED_GUI: MCP(playwright-mcp)` AND `RUNBOOK:` AND at least `SCREENSHOTS: <path or index>`
	    (recommended: `TRACE:` / `VIDEO:` / `CONSOLE_INDEX:`)
	  - `RESULT: PASS`
	  - `GIT_COMMIT: <sha>` and `COMMIT_MSG: "<message>"`
	  - and at least one of: `DIFFSTAT:` or `FILES:`
	- Worker may provide `BUNDLE (RUN #n): ...` but it is NOT sufficient for DONE.
 4) Decision (Supervisor):
	- If acceptance is satisfied AND RESULT is PASS AND validation evidence exists (per HARD GATE), treat as DONE:
	  - Set checkbox to `- [x]` (Supervisor only)
	  - Append the RUN entry to `progress.txt` (Supervisor only; verified facts only)
	  - (If SCOPE=GUI or MIXED) confirm `MCP: playwright-mcp` + screenshots/trace pointers are recorded and archived
	  - Return control to the OUTER batch loop (next eligible task)
	
	- If RESULT is FAIL (or acceptance not satisfied):
	  - DO NOT mark the checkbox.
	  - Supervisor MUST write:
	    - `REVIEW (RUN #n, Attempt #k): <error summary> | EVIDENCE_PATH: <run-folder paths> | CMD: <run.* + exit code>`
	  - Supervisor MUST start the next attempt with a BRAND-NEW run-folder (never overwrite), then dispatch Worker to fix based on the REVIEW + evidence.
	  - Do NOT “one-off stop” or “only retry once” here.
	    Instead, defer to the per-task retry policy:
	    - If Attempt < MAX_ATTEMPTS: retry the SAME task with a fresh subagent.
	    - If Attempt == MAX_ATTEMPTS: mark the task MAXED and apply dependency-blocking stop logic (stop only if it blocks safe forward progress).
 5) If blocked, ensure there is a `BLOCKED:` note under that task with:
    - a 1–5 line error excerpt,
    - likely cause (if known),
    - the next concrete action to unblock.
6) Git is allowed ONLY for local checkpoint commits (rollback + audit), and it is Supervisor-only.
Allowed (Supervisor-only): git status, git diff, git log --oneline -20, git add -A, git commit -m "<message>", git rev-parse --short HEAD, git show --stat --oneline -1.
Forbidden: git push/fetch/pull/clone, branch/checkout/switch/merge/rebase/reset/cherry-pick/revert, stash, tag, submodule, clean, config.
Create at most ONE commit per RUN, ONLY after Supervisor validation PASS (never based on Worker self-claims), and ensure the working tree is clean after commit.

## progress.txt format (Supervisor, append-only)

File: openspec/changes/<change-id>/progress.txt
Rule: Append-only. Never rewrite or reorder existing entries.

Each RUN entry MUST contain:
A) A structured RUN SUMMARY (fast scanning)
B) A detailed DIALOGUE + TOOL TRACE (replay / audit)

================================================================================
RUN ENTRY

[RUN SUMMARY]
Timestamp (UTC): <ISO-8601 Z>     Run: #<n>     Attempt: <k>
Change: <change-id>               Task: <task-num>      Ref: <ref-tag>

Status: DONE | FAIL | BLOCKED | ROLE_VIOLATION | NO_PROGRESS

Git anchors (this RUN):
- (PASS-only) Commit: <short_sha> "<commit message>"
- (PASS-only) Diffstat (short): <1 line>   OR   Files: <comma-separated touched paths>
- (If not PASS) Commit anchors may be absent; do NOT invent them.

Evidence pointers:
- tasks.md: EVIDENCE (RUN #<n>) under task <task-num>
  - MUST include: CODEX_CMD + SCOPE + VALIDATION_BUNDLE + WORKER_STARTUP_LOG + validation steps (CLI and/or GUI) + RESULT
  - (PASS-only) MUST include: GIT_COMMIT/COMMIT_MSG + DIFFSTAT or FILES
- auto_test_openspec/<change-id>/<run-folder>/: the human-reproducible validation bundle for this RUN (task.md + run scripts + assets + outputs/logs, including logs/worker_startup.txt)
- feature_list.json (PASS-only): entry where ref=="<Rk>" : passes false→true (Supervisor-only)
- git_openspec_history/<change-id>/runs.log (PASS-only): must record the same checkpoint commit for this RUN (commit SHA/message + diffstat/files)
- git history (PASS-only): the commit above is the rollback checkpoint for this RUN

--------------------------------------------------------------------------------
Optional (recommended) SESSION STARTUP ENTRY (once per session)

[SESSION STARTUP]
[Assistant] I'll start by getting my bearings and understanding the current state of the project.
[Tool Use] <read - openspec/changes/<id>/progress.txt>
[Tool Use] <read - openspec/changes/<id>/feature_list.json>
[Tool Use] <read - openspec/changes/<id>/tasks.md>
[Assistant] Let me check the git log to see recent work.
[Tool Use] <bash - CODEX_CMD "...">  (Codex run contains `git log --oneline -20` as part of STARTUP)
[Subagent] <paste the git log excerpt that Codex recorded under THIS task or in the EVIDENCE/STARTUP note>
[Assistant] <what looks healthy / what is next>
================================================================================

## Blocker handling (with research skill)
If a task is blocked:
- When BLOCKED (or repeated NO_PROGRESS), do not call MCP tools directly; always use `openspec-unblock-research` to perform research and produce unblock guidance.
  - The skill may use MCP tools (e.g. `web-search-prime`, `context7`, etc.) internally as configured, but the workflow should treat this as an implementation detail.
- Under the SAME task in `tasks.md`, add/refresh:
  `UNBLOCK GUIDANCE (RUN #n, Attempt #k): ...`
  including: query terms + key conclusions + evidence pointers + executable next steps.
- Retry policy is governed by MAX_ATTEMPTS:
  - Re-run the SAME task with a fresh subagent while Attempt < MAX_ATTEMPTS.
  - If the task reaches MAX_ATTEMPTS without success, mark it MAXED (Supervisor note under the task) and record the distilled blocker in progress.txt.
  - Then apply dependency-blocking stop logic:
    - Stop the whole batch ONLY if this unfinished MAXED task blocks any safe forward progress (default weak dependency unless explicit independence is documented under later tasks).
    - Otherwise, later tasks explicitly marked independent may proceed.

[/collapse]
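The HARD GATE in the rules above is a purely textual check on the Supervisor's EVIDENCE line, so it can be sketched as a small validator. This is only a minimal illustration under assumptions: the function name `missing_gate_fields` and the regular expressions are hypothetical and not part of the workflow files, and the field names are taken from the HARD GATE list as written.

```python
import re


def missing_gate_fields(evidence_line: str) -> list[str]:
    """Return the HARD GATE fields that are absent from one EVIDENCE line.

    Hypothetical helper: an empty list means the line satisfies the gate.
    """
    missing: list[str] = []

    def need(label: str, pattern: str) -> None:
        if not re.search(pattern, evidence_line):
            missing.append(label)

    need("EVIDENCE (RUN #n)", r"^\s*EVIDENCE \(RUN #\d+\):")

    scope_match = re.search(r"\bSCOPE: (CLI|GUI|MIXED)\b", evidence_line)
    scope = scope_match.group(1) if scope_match else None
    if scope is None:
        missing.append("SCOPE")

    need("VALIDATION_BUNDLE", r"VALIDATION_BUNDLE: auto_test_openspec/\S+")
    need("WORKER_STARTUP_LOG", r"WORKER_STARTUP_LOG: \S+/logs/worker_startup\.txt")

    # CLI checks apply when the scope includes CLI (CLI or MIXED).
    if scope in ("CLI", "MIXED"):
        need("VALIDATED_CLI", r"VALIDATED_CLI: \S")
        need("EXIT_CODE: 0", r"EXIT_CODE: 0\b")

    # GUI checks apply when the scope is GUI or MIXED.
    if scope in ("GUI", "MIXED"):
        need("VALIDATED_GUI", r"VALIDATED_GUI: MCP\(playwright-mcp\)")
        need("RUNBOOK", r"RUNBOOK: \S")
        need("SCREENSHOTS", r"SCREENSHOTS: \S")

    # Reject the unfilled template text "RESULT: PASS|FAIL" as well.
    need("RESULT: PASS", r"RESULT: PASS(?![|\w])")
    need("GIT_COMMIT", r"GIT_COMMIT: \S")
    need("COMMIT_MSG", r'COMMIT_MSG: "')
    if not re.search(r"\b(DIFFSTAT|FILES): ?\S", evidence_line):
        missing.append("DIFFSTAT or FILES")
    return missing
```

A task would only be eligible for `- [x]` when this returns an empty list. The check confirms that the fields are present; it does not confirm that the paths or the commit they point to actually exist.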

#### `.claude/commands/monitor-openspec-codex.md` (automation core)

Create `monitor-openspec-codex.md` under:
- Windows: `%USERPROFILE%\.claude\commands`
- macOS/Linux: `~/.claude/commands`

This is our "foreman script": it defines how Claude Code calls Codex in an automated loop.
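The control flow that this command file describes (pick the first eligible task, retry up to `MAX_ATTEMPTS`, then either move on or stop) can be sketched in a few lines. This is a simplified model under assumptions, not the actual implementation: the `Task` fields, `first_eligible`, `run_batch`, and the `attempt_once` callback are hypothetical names. Explicit `DEPENDS:` lists and attempt counters resumed from `progress.txt` are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 2  # default named in the command file


@dataclass
class Task:
    num: str                   # e.g. "1.1"
    done: bool = False         # checkbox is `- [x]`
    maxed: bool = False        # a MAXED note exists under the task
    skip: bool = False         # a NOT_EXECUTABLE / SKIP note exists
    independent: bool = False  # stands in for an INDEPENDENT: / NO_DEP: marker


def first_eligible(tasks: list[Task]) -> Optional[Task]:
    """Top-to-bottom scan under the weak-ordering rule.

    An earlier unchecked task that cannot run is presumed to be a
    prerequisite, so later tasks need an independence marker to proceed.
    """
    blocked = False
    for task in tasks:
        if task.done:
            continue
        if not task.skip and not task.maxed and (task.independent or not blocked):
            return task
        blocked = True
    return None


def run_batch(tasks: list[Task], attempt_once: Callable[[Task, int], bool]) -> str:
    """Return "A" (no eligible tasks remain) or "B" (a maxed task blocks progress).

    attempt_once(task, attempt) models one Worker run plus Supervisor
    validation and returns True only on a validated PASS.
    """
    while True:
        task = first_eligible(tasks)
        if task is None:
            return "A"
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if attempt_once(task, attempt):
                task.done = True
                break
        else:
            # Every attempt failed: mark MAXED, then stop only if nothing can proceed.
            task.maxed = True
            if first_eligible(tasks) is None:
                return "B"
```

In this model a task that is marked independent can still run after an earlier task has maxed out, which mirrors the exception described in the stop conditions.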

[collapse title="Create `monitor-openspec-codex.md`" status="false"]

```md
---
description: Supervise an OpenSpec change in BATCH MODE. Iterates through unchecked tasks.md items sequentially via Codex CLI (codex exec). Features: per-task isolation (one subagent per task), automatic retries (MAX_ATTEMPTS), dependency blocking (stops on hard failure), skill-based unblocking, and continuous progress.txt logging.
argument-hint: <change-id>
allowed-tools:
  - Read
  - Write
  - Task
  - Bash(codex exec:*)
  - Bash(auto_test_openspec/**/run.sh)
  - Bash(auto_test_openspec/**/run.bat)

# Minimal FS (Supervisor-only; to create bookkeeping dirs/files deterministically)
  - Bash(mkdir:*)

# Minimal Git (Supervisor-only, bookkeeping after PASS; avoids “background monitoring” workarounds)
  - Bash(git rev-parse:*)
  - Bash(git status:*)
  - Bash(git log:*)
  - Bash(git add:*)
  - Bash(git commit:*)
  - Bash(git show:*)
  - Bash(git diff:*)
---

You are the SUPERVISOR. Follow this procedure in English only.

# Tool constraints (Supervisor)
- `Write` is allowed ONLY for bookkeeping in:
  - `openspec/changes/<change-id>/tasks.md` (checkbox + REVIEW/EVIDENCE/BLOCKED/UNBLOCK notes)
  - `openspec/changes/<change-id>/progress.txt` (append-only handoff log)
  - `openspec/changes/<change-id>/feature_list.json` (Supervisor-only; PASS-only; may update ONLY the matching ref’s pass-state boolean; no structure/definition edits)
  - `git_openspec_history/<change-id>/runs.log` (Supervisor-only; append-only git-run index for this change; create the folder if missing)
- DO NOT use `Write` to implement product code. All implementation MUST come from the Worker’s single `CODEX_CMD` run.

# Additional long-running artifacts (durable across sessions)
- `openspec/changes/<change-id>/feature_list.json` is the end-to-end feature checklist (pass/fail per stable ref tag).
  - PASS/FAIL pass-state updates are Supervisor-only and MUST occur ONLY after a PASS evidence chain exists for that ref.
- `openspec/changes/<change-id>/progress.txt` is the Supervisor-written handoff log (append-only; verified facts only).

# Single Codex command constant (maintain ONLY ONE copy)
CODEX_CMD = codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium

Inputs:
- change-id: $ARGUMENTS

Goal:
- Execute a BATCH LOOP over `openspec/changes/<change-id>/tasks.md`.
- Process tasks sequentially (top-to-bottom).
- For each unchecked task:
  1. Isolate execution (One Task = One Subagent = One Codex Run).
  2. Retry on failure up to MAX_ATTEMPTS (default: 2).
  3. Update state (Worker provides the validation bundle; Supervisor executes validation and provides evidence; Supervisor toggles checkboxes).

- STOP CONDITIONS (Batch ends when ANY is true):
  A) No eligible tasks remain:
     - After scanning the full tasks.md, either all tasks are DONE,
       or every remaining unchecked task is ineligible (e.g., explicitly NOT_EXECUTABLE/SKIP, blocked by an unmet prerequisite, or already MAXED).

  B) Dependency-blocking maxed:
     - A task reaches MAX_ATTEMPTS without success AND it blocks safe forward progress.
     - Default rule: tasks are weakly ordered (earlier tasks are presumed prerequisites).
       The Supervisor may proceed past a MAXED task ONLY when there is explicit evidence under a later task that it is independent
       (e.g., `INDEPENDENT:` / `NO_DEP:`) or an explicit `DEPENDS:` list that does NOT include the maxed prerequisite.
     - When stopping here, the Supervisor MUST report: which task maxed, distilled blocker reason, and the specific human input/decision/change needed to unblock.

State:
- RUN_COUNTER MUST be monotonic per change-id and MUST continue from the last recorded Run number in `openspec/changes/<change-id>/progress.txt` (do not reset to 1 across sessions).

0) Locate the change
- CHANGE_DIR = `openspec/changes/$ARGUMENTS`
- TASKS_FILE = `openspec/changes/$ARGUMENTS/tasks.md`
- FEATURE_FILE = `openspec/changes/$ARGUMENTS/feature_list.json`
- PROGRESS_FILE = `openspec/changes/$ARGUMENTS/progress.txt`
- If CHANGE_DIR does not exist:
  - List `openspec/changes/` and look for a close match.
  - If ambiguous, STOP and ask the user for the exact change-id.
- If TASKS_FILE does not exist:
  - STOP and ask the user to scaffold it.
- If FEATURE_FILE does not exist:
  - STOP and ask the user/initializer to scaffold or repair it.
  - NOTE: Worker/Codex is NOT allowed to create or rewrite feature_list.json.
- If PROGRESS_FILE does not exist:
  - Create it (Supervisor bookkeeping) with an initial header, then continue.
  - NOTE: Only do this when the file is missing (first run). Never overwrite or reset an existing progress.txt.

0.1) Restore session state (Supervisor; Read-only; no Bash)
- Read PROGRESS_FILE and derive RUN_COUNTER (monotonic per change-id):
  - If any prior entry contains `Run: #<n>`, set RUN_COUNTER = (max n) + 1
  - Else RUN_COUNTER = 1
- Read FEATURE_FILE (context only; do not edit).
- Proceed to task selection.

1) Batch session loop (one invocation = many task attempts, serial)
- Loop:
  - Read TASKS_FILE and select CURRENT_TASK using the eligibility rules in 1.1 (top-to-bottom).
  - If no eligible task exists -> STOP via stop condition (A) "No eligible tasks remain".

  - For CURRENT_TASK, run a per-task retry loop up to MAX_ATTEMPTS:
    - Let MAX_ATTEMPTS = 2 (or the configured constant in this command).
    - Let ATTEMPT be derived from PROGRESS_FILE (resumable across sessions; see 1.1).
    - While ATTEMPT <= MAX_ATTEMPTS:
      - Spawn EXACTLY ONE new subagent for this ONE task attempt (never bundle).
      - Supervisor verifies + books (explicit control flow; keep auto-retries):
        - Determine post-subagent status UNDER THIS task only:
          - READY_TO_VALIDATE if:
            - tasks.md contains exactly ONE `BUNDLE (RUN #<RUN_COUNTER>): ...` line for this attempt, and
            - the referenced run-folder exists and contains the required bundle assets (task.md + run.sh + run.bat + logs/; and if GUI/MIXED, tests/ with MCP runbook).
          - BLOCKED if tasks.md contains `BLOCKED:` + `NEEDS:` under this task.
          - ROLE_VIOLATION if the Worker wrote any `EVIDENCE (RUN #...)` / PASS/FAIL/RESULT/validated= conclusion, toggled any checkbox, or modified feature_list.json.
          - NO_PROGRESS otherwise.

        - If READY_TO_VALIDATE:
          - Execute validation as Supervisor:
            - CLI scope: run `auto_test_openspec/**/run.sh|run.bat` and capture logs/outputs (append-only in the run-folder).
            - GUI/MIXED scope:
              - run.* is start-server only (start the service and print URL/port),
              - execute `tests/gui_runbook_*.md` via MCP service `playwright-mcp` (no manual browser; no scripts),
              - capture evidence (at minimum screenshots + screenshots index under logs/; trace/video/console index optional).
          - Record result under THIS task (Supervisor-only):
            - Write ONE `EVIDENCE (RUN #<RUN_COUNTER>): ... | RESULT: PASS|FAIL | ...` line with evidence pointers.
          - If RESULT is PASS:
            - Toggle checkbox to `- [x]` (Supervisor only).
            - Append progress.txt entry (Status=DONE, Attempt=<k>, bundle + evidence pointers).
            - Continue the outer batch loop (pick next eligible task).   # explicit continue
          - If RESULT is FAIL:
            - Append progress.txt entry (Status=FAIL, Attempt=<k>, distilled blocker + evidence pointers).
            - If ATTEMPT < MAX_ATTEMPTS:
              - Add/refresh `UNBLOCK GUIDANCE (RUN #<RUN_COUNTER>): ...` under the SAME task in tasks.md (Supervisor only).
              - ATTEMPT += 1 and retry the SAME task with a fresh subagent.  # explicit retry
            - Else (ATTEMPT == MAX_ATTEMPTS):
              - Mark the task as MAXED (Supervisor note under task; do NOT check it):
                - `MAXED (RUN #<RUN_COUNTER>): <short reason>`
              - Enforce dependency-blocking stop logic:
                - If the Supervisor cannot safely proceed to any later unchecked task:
                  - STOP via stop condition (B) and report the required human unblock input.  # explicit stop
                - Else:
                  - Continue the outer batch loop.  # explicit continue
        - If BLOCKED / ROLE_VIOLATION / NO_PROGRESS:
          - Append progress.txt entry (Status=BLOCKED/ROLE_VIOLATION/NO_PROGRESS, Attempt=<k>, distilled blocker + next-step suggestion).
          - If ATTEMPT < MAX_ATTEMPTS:
            - Add/refresh `UNBLOCK GUIDANCE (RUN #<RUN_COUNTER>): ...` under the SAME task in tasks.md (Supervisor only).
            - ATTEMPT += 1 and retry the SAME task with a fresh subagent.   # explicit retry
          - Else (ATTEMPT == MAX_ATTEMPTS):
            - Mark the task as MAXED (Supervisor note under task; do NOT check it):
              - `MAXED (RUN #<RUN_COUNTER>): <short reason>`
            - Enforce dependency-blocking stop logic:
              - If the Supervisor cannot safely proceed to any later unchecked task:
                - STOP via stop condition (B) and report the required human unblock input.  # explicit stop
              - Else:
                - Continue the outer batch loop.  # explicit continue

- Terminate ONLY via stop conditions (A) or (B) (and "All tasks done" as a subset of A).
- Do NOT stop after a single task by default.

1.1) Determine CURRENT_TASK (eligible + resumable attempts)
	- Read TASKS_FILE.
	- Scan tasks top-to-bottom and pick the FIRST unchecked checkbox item that is ELIGIBLE.
	  - ELIGIBLE means ALL are true:
	    - It is not explicitly marked NOT_EXECUTABLE / SKIP (by a Supervisor note under the task).
	    - It is not already MAXED (i.e., previously reached MAX_ATTEMPTS without success).
	    - It is not blocked by an earlier unmet prerequisite:
	      - Default: tasks are weakly ordered; earlier unchecked/maxed tasks are presumed prerequisites.
	      - Exception (allowed to proceed): the candidate task has an explicit independence marker under it
	        (e.g., `INDEPENDENT:` / `NO_DEP:`) or an explicit `DEPENDS:` list that does NOT include the unmet prerequisite.
	- If no eligible unchecked task exists after the full scan:
	  - Stop via "No eligible tasks remain" (stop condition A).

	- Capture:
	  - TASK_LINE = the full checkbox line
	  - TASK_NUM = e.g., `1.1` if present, else `?`
	  - REF_TAG = e.g., `[#R1]` if present, else `[]`

	- Derive ATTEMPT counter for this task (resumable across sessions):
	  - Read PROGRESS_FILE and find prior RUN entries where `Task: <task-num>` matches TASK_NUM.
	  - Let ATTEMPT = (max recorded Attempt for this TASK_NUM) + 1, else 1 if none exist.
	  - Note: Attempt is per-task (not per-session). RUN_COUNTER remains global monotonic.

	- Lock scope (per-task atomicity):
	  - For the duration of the upcoming subagent/Codex run, the Worker MUST work ONLY on this CURRENT_TASK.
	  - After the subagent returns, the Supervisor may select the next eligible task and spawn a new subagent.

1.2) Print RUN banner (START)
  Output exactly:
  `[MONITOR] RUN #<RUN_COUNTER> START | change=$ARGUMENTS | task=<TASK_NUM> | ref=<REF_TAG> | text="<TASK_LINE>"`

1.3) Spawn ONE subagent for CURRENT_TASK
  Use the Task tool to spawn a NEW subagent (e.g., name it "codex-worker").
- The Supervisor MUST NOT run Bash for implementation work (coding/build steps).
- The Supervisor MAY run Bash ONLY for:
  - executing the validation bundle entrypoint (`auto_test_openspec/**/run.sh|run.bat`) to capture auditable outputs/logs
  - minimal Git bookkeeping after PASS (commit + show/diffstat), as explicitly allowed in `allowed-tools`
- Any GUI steps MUST be executed ONLY via MCP service `playwright-mcp` (no manual browser; no Python/Node/Playwright scripts).
  
  IMPORTANT: Explicitly instruct the subagent that manual file editing is banned. 
  Tell the subagent: "I will reject any work that does not produce a `codex exec` execution log. Do not try to edit files directly."

Subagent instructions (copy verbatim):
---
You are the CODEX CLI OPERATOR. Your ONLY job is to run Codex CLI exactly once and report results. You are NOT a software engineer.

MISSION: You must force the `codex` CLI tool to perform the work.
NON-NEGOTIABLE RULE: You are FORBIDDEN from using `Write`, `Edit`, or `Replace` tools on project files. You have NO permission to edit code manually.
TOOLS:
- You MAY use the Read tool to inspect files (tasks.md / progress.txt / feature_list.json).
- You MUST invoke the Bash tool exactly once, and that single invocation MUST be CODEX_CMD.
- You are FORBIDDEN from using Write/Edit/Replace on project files.

Execution Steps (Do exactly this):
1. Read (Read tool, not Bash):
   - `openspec/changes/$ARGUMENTS/tasks.md`
   - `openspec/changes/$ARGUMENTS/progress.txt`
   - `openspec/changes/$ARGUMENTS/feature_list.json`
2. Construct a prompt for the CLI using the template below.
3. Run exactly ONE Bash command:
   codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium "$(cat <<'PROMPT'
<INLINE_PROMPT>
PROMPT
)"
4. Verify the CLI updated `tasks.md` under THIS task ONLY (no checkbox toggles).
   Verify the Worker output is BUNDLE-ready (and ONLY bundle-ready):
   - Under THIS task, there is EXACTLY ONE single-line `BUNDLE (RUN #<RUN_COUNTER>): ...` pointer that targets a concrete run-folder:
     - includes `CODEX_CMD=codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium`
     - includes `SCOPE: <CLI|GUI|MIXED>`
     - includes `VALIDATION_BUNDLE: auto_test_openspec/$ARGUMENTS/run-.../`
     - includes `HOW_TO_RUN: run.sh/run.bat`
     - if SCOPE includes GUI: includes `RUNBOOK: tests/gui_runbook_*.md`
   - The referenced run-folder exists and contains at minimum:
     - `task.md`, `run.sh`, `run.bat`,
     - `logs/worker_startup.txt` (mandatory startup snapshot),
     - and (when GUI/MIXED) `tests/` containing an MCP-only runbook (no scripts).
   - The Worker did NOT:
     - write any `EVIDENCE (RUN #...)` line
     - write PASS/FAIL/RESULT/validated= conclusions
     - toggle any checkbox
   Also verify governance constraints:
   - `feature_list.json` MUST NOT be modified by the Worker (neither entries nor pass-state).
   - No git commit is expected/allowed from the Worker.
   - If the CLI violated any of the above, report failure.

<INLINE_PROMPT> Template (fill variables):

(Shared setup)
- change-id: $ARGUMENTS
- include the exact TASK_LINE text (verbatim)
- state explicitly: "Implement ONLY this task (no other tasks, no refactors outside scope)."
- require full validation per the task’s `TEST:` and the canonical spec:
  - Follow `openspec/project.md` → `## tasks.md Checklist Format` → `### Validation bundle requirements (mandatory)`
  - Produce a human-reproducible validation bundle under:
    `auto_test_openspec/$ARGUMENTS/<run-folder>/`
  - Worker MAY run quick local checks to ensure the bundle is runnable,
    but MUST NOT claim PASS/FAIL/validated (Supervisor is the final verifier).

A) Worker deliverables (validation bundle assets)
- Create a NEW run-folder (append-only; never overwrite prior runs):
  `auto_test_openspec/$ARGUMENTS/run-<RUN4>__task-<TASK_NUM>__ref-<REF>__<YYYYMMDDThhmmssZ>/`
- Minimum required files inside the run-folder:
  - `task.md` (self-sufficient README; includes How-to-run + machine-decidable pass/fail criteria)
  - `run.sh` and `run.bat`
  - `logs/worker_startup.txt` (MANDATORY; see Startup ritual below)
  - `logs/` (for provenance + transcripts; keep append-only within this run folder)
  - If SCOPE includes GUI/MIXED: `tests/gui_runbook_*.md` (MCP-only runbook; no executable browser scripts)
  - If the task needs inputs/expected: include `inputs/`, `expected/`, and write outputs into `outputs/` (never temp dirs)
- GUI/MIXED server-start contract (MANDATORY):
  - `task.md` MUST include a dedicated section with EXACT, copy/paste-able commands:
    - `SERVER_START:` <exact command to start the server>
    - `SERVER_URL:` <exact URL Supervisor should navigate to, including host + port>
    - `READY_CHECK:` <a concrete readiness check (endpoint or observable signal)>
  - For GUI/MIXED, `run.sh` / `run.bat` MUST implement `SERVER_START`:
    - MUST start the local server and print the `SERVER_URL` to stdout.
    - MUST NOT perform validation (no PASS/FAIL claims); start-server only.

- Environment isolation (mandatory ONLY if env problems occur):
  - DO NOT install Python deps globally.
  - If missing deps / conflicts prevent execution, create an isolated venv via `uv` inside THIS run folder
    (e.g., `<run-folder>/.venv/`) and ensure `run.sh`/`run.bat` uses it.
  - Log provenance into `logs/` (always): python path+version, uv version, dependency source, exact install commands.

A.1) Startup ritual (MANDATORY, before any edits)
- REQUIRE Codex STARTUP RITUAL:
  - read `openspec/changes/$ARGUMENTS/progress.txt`
  - read `openspec/changes/$ARGUMENTS/feature_list.json`
  - run `git log --oneline -20`
  - capture `GIT_BASE` via `git rev-parse --short HEAD`
  - write a Startup snapshot to the validation bundle (NOT tasks.md), at:
    - `auto_test_openspec/$ARGUMENTS/<run-folder>/logs/worker_startup.txt`
  - The snapshot MUST include: UTC timestamp, CODEX_CMD, GIT_BASE, the git-log excerpt, and a short “what I observed” summary.

B) tasks.md bookkeeping (Worker-owned; single-line; NO conclusions)
- require Codex to update `openspec/changes/$ARGUMENTS/tasks.md` under THIS task with exactly ONE Worker bookkeeping line (NOT EVIDENCE):
  - starting with: `BUNDLE (RUN #<RUN_COUNTER>): ...`
  - MUST be a SINGLE LINE
  - MUST NOT write any `EVIDENCE (RUN #...)` line
  - MUST NOT write any PASS/FAIL/RESULT/validated= conclusions
- The single BUNDLE line MUST include ONLY:
  - `CODEX_CMD=codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium`
  - `SCOPE: <CLI|GUI|MIXED>`
  - `VALIDATION_BUNDLE: auto_test_openspec/$ARGUMENTS/run-<RUN4>__task-<TASK_NUM>__ref-<REF>__<YYYYMMDDThhmmssZ>`
  - `HOW_TO_RUN: run.sh/run.bat`
  - (if SCOPE=GUI or MIXED) `RUNBOOK: tests/gui_runbook_*.md`
  - (if SCOPE=GUI or MIXED) `SERVER_URL: <exact url including host+port>`
- forbid Codex from toggling ANY checkbox in `tasks.md`.

C) GUI hard rules (only if SCOPE includes GUI/MIXED)
- GUI verification is Supervisor-only via MCP service `playwright-mcp`.
- Worker deliverable for GUI is ONLY the MCP runbook file:
  - `tests/gui_runbook_*.md` MUST be MCP-only steps + selectors + assertion points + evidence capture points.
  - ABSOLUTELY NO executable browser automation scripts (no Playwright test runner; no Python/Node scripts).
  - ABSOLUTELY NO manual browser steps anywhere (no “open Chrome/click …” prose, anywhere in the bundle).
- For GUI/MIXED bundles, `run.sh` / `run.bat` MUST be start-server only:
  - MUST start the local server and print URL/port.
  - MUST NOT perform state seeding/copying/exporting/testing/validation/probing/installs.

D) Governance boundaries (Worker forbidden; Supervisor-only)
- feature_list governance (MANDATORY; strict):
  - The Worker/Codex is FORBIDDEN to edit `openspec/changes/$ARGUMENTS/feature_list.json` (no entry edits, no pass-state edits, no formatting churn).
  - If `openspec/changes/$ARGUMENTS/feature_list.json` is missing OR the matching ref entry is missing:
    - Under THIS task write:
      BLOCKED: Missing feature_list.json (or missing ref entry for <REF_TAG>)
      NEEDS: Supervisor/initializer must create/repair feature_list.json (structure + ref mapping). Then re-run this task.
    - Then END THIS WORKER RUN immediately (do not proceed with implementation in this run).
  - Pass-state updates (e.g., `passes=true/false`) are Supervisor-only and may occur ONLY after Supervisor validation PASS + EVIDENCE is recorded.
- forbid touching any other tasks (no evidence elsewhere; no changes to other items)
- governance boundary (Worker/Codex; mandatory):
  - The Worker/Codex is FORBIDDEN to create git commits (no checkpoint commits).
  - The Worker/Codex is FORBIDDEN to edit or append `git_openspec_history/<change-id>/runs.log`.
  - The Worker/Codex MUST NOT attempt to produce DIFFSTAT/FILES “final” summaries as evidence.
  - All commit/runs.log bookkeeping (and DIFFSTAT capture) is Supervisor-only and may occur ONLY after Supervisor validation PASS.

3) After Codex finishes, confirm that `openspec/changes/$ARGUMENTS/tasks.md` has either:

BUNDLE-READY (Worker output, under THIS task):
  - EXACTLY ONE `BUNDLE (RUN #<RUN_COUNTER>): ...` line that points to a concrete run-folder:
    - includes `VALIDATION_BUNDLE: auto_test_openspec/$ARGUMENTS/run-.../`
    - includes `HOW_TO_RUN: run.sh/run.bat`
    - if SCOPE includes GUI/MIXED: includes `RUNBOOK: tests/gui_runbook_*.md`
    - if SCOPE includes GUI/MIXED: includes `SERVER_URL: ...`
  - The referenced run-folder exists and contains at minimum:
    - `task.md`, `run.sh`, `run.bat`, `logs/worker_startup.txt`,
    - and (when GUI/MIXED) `tests/` with an MCP-only runbook
  - For GUI/MIXED, `task.md` MUST include `SERVER_START:` + `SERVER_URL:` + `READY_CHECK:` (as defined above).
  - Worker MUST NOT have written any `EVIDENCE (RUN #...)` line.
  - Worker MUST NOT have toggled any checkbox.
  - Worker MUST NOT have edited feature_list.json.
  - Worker MUST NOT have created any git commit.
  - Worker MUST NOT have edited `git_openspec_history/<change-id>/runs.log`.

OR BLOCKED (Worker output, under THIS task):
  - `BLOCKED: ...` (1–5 line error excerpt)
  - `NEEDS: ...` (next concrete unblock step)

OR ROLE_VIOLATION (Worker output, under THIS task):
  - Any `EVIDENCE (RUN #...)` / PASS/FAIL/RESULT/validated= conclusion, checkbox toggle, feature_list.json edit, git commit,
    or any edit/append to `git_openspec_history/<change-id>/runs.log`.

Otherwise treat as NO_PROGRESS (missing BUNDLE line and/or missing run-folder).

1.4) Supervisor verification after subagent returns
- Re-read TASKS_FILE.
- Determine status (under THIS task only):

  - READY_TO_VALIDATE if a compliant BUNDLE (RUN #<RUN_COUNTER>) line exists and the referenced run-folder is present and well-formed.
  - BLOCKED if BLOCKED+NEEDS exists.
  - ROLE_VIOLATION if Worker wrote any EVIDENCE/PASS/FAIL/RESULT/validated= conclusion, toggled checkboxes, edited feature_list.json, created commits,
    or edited/appended `git_openspec_history/<change-id>/runs.log`.
  - NO_PROGRESS otherwise.

- If READY_TO_VALIDATE:
  - Supervisor MUST execute validation.
    - CLI: via `run.sh`/`run.bat` as specified in the bundle.
    - GUI/MIXED:
      1) MUST start the server first by running `run.sh`/`run.bat` (start-server only).
      2) MUST navigate using the `SERVER_URL` provided in the BUNDLE line / task.md.
      3) Then execute the MCP `playwright-mcp` runbook.
      4) If the server cannot be started or `SERVER_URL` is missing/invalid, treat as bundle not ready for validation (NO_PROGRESS or BLOCKED with NEEDS), not as a feature FAIL.
  - Supervisor writes the single EVIDENCE (RUN #<RUN_COUNTER>) line (PASS/FAIL + evidence pointers).
  - Supervisor updates feature_list.json pass-state ONLY after PASS.
  - Supervisor creates ONE checkpoint commit ONLY after PASS.
  - Supervisor appends runs.log ONLY after PASS.
  - Supervisor may then toggle the checkbox to `- [x]` ONLY after PASS.

- DONE is reachable only after Supervisor validation PASS + compliant EVIDENCE exists under THIS task.

If DONE:
- Toggle the checkbox to `- [x]` (Supervisor only).
- Append a FULL RUN ENTRY to PROGRESS_FILE (Supervisor only; verified facts only) including:
  - RUN SUMMARY (timestamp, run #, change-id, task/ref, status)
  - Evidence pointers (tasks.md evidence line pointer + feature_list passes change + GIT_BASE/GIT_COMMIT/COMMIT_MSG)
  - Validation commands/steps + 3–15 lines output excerpt (from Supervisor validation output and/or bundle logs)
  - Changes verified: FILES/DIFFSTAT + key edits summary
  - [DIALOGUE + TOOL TRACE] with bracket markers, including:
    - [Supervisor → Subagent] instruction
    - [Tool Use] <task - spawn subagent>
    - [Tool Use] <bash - CODEX_CMD "..."> (from subagent trace)
    - [Subagent] reported outputs + the exact BUNDLE line + bundle folder pointer(s)
    - [Supervisor] the exact EVIDENCE line + acceptance decision + rationale
- Print RUN banner (END) as before.
- RUN_COUNTER += 1 and continue/stop per your session policy.

If BLOCKED:
- Ensure actionable NEEDS exists (next concrete unblock step).

- Call skill `openspec-unblock-research` (Supervisor-only). Do NOT call MCP tools directly here.
  - Provide the skill the BLOCKED context (task line + ref, error excerpt, NEEDS, what was tried, env/versions if known).
  - Instruct the skill to write its portable research capsule into BOTH bookkeeping artifacts:
    (a) Under THIS task in tasks.md:
        Add `UNBLOCK GUIDANCE (RUN #<RUN_COUNTER>):` containing:
        - Query terms
        - Key conclusions
        - Evidence pointers (source links/locators)
        - Executable next steps + how to verify
    (b) Into progress.txt (inside the current RUN entry):
        Append a short “Unblock Research Capsule” containing:
        - Query terms
        - Key conclusions
        - Evidence pointers
        - Pointer back to the tasks.md UNBLOCK GUIDANCE location

- Append a FULL RUN ENTRY to PROGRESS_FILE capturing blocker + the skill’s capsule + retry decision (verified facts only).
- Retry once as before; if blocked again, STOP and require user/initializer intervention.

If NO_PROGRESS:
- Treat as a FAILED ATTEMPT (not an immediate session stop by default).
- Under THIS task, append/refresh a single diagnostic note:
  `BLOCKED: Missing a compliant BUNDLE pointer and/or the referenced validation bundle folder is missing/incomplete for this RUN (workflow non-compliance).`
  `NEEDS: Re-run SAME task; Worker/Codex must (1) create a fresh run-folder under auto_test_openspec/<change-id>/... containing task.md + run.sh + run.bat + logs/worker_startup.txt (+ tests/runbook if GUI), and (2) append EXACTLY ONE single-line BUNDLE (RUN #<RUN_COUNTER>) pointer under THIS task (CODEX_CMD + SCOPE + VALIDATION_BUNDLE + HOW_TO_RUN [+ RUNBOOK]).`
- Append a FULL RUN ENTRY to PROGRESS_FILE (status=NO_PROGRESS) including:
  - the missing-gate diagnosis,
  - the subagent trace,
  - Attempt #k and the retry/maxed decision.
- Flow control MUST follow the per-task retry policy:
  - If Attempt #k < MAX_ATTEMPTS: continue the retry loop for the SAME task (fresh subagent).
  - Else (Attempt #k == MAX_ATTEMPTS): mark the task MAXED and apply dependency-blocking stop logic (stop only if it blocks safe forward progress).

2) Completion (only at start-of-session, or if CURRENT_TASK selection finds none)
- If no unchecked tasks remain:
  `[MONITOR] DONE | change=$ARGUMENTS | all tasks checked`
  then STOP.
```

[/collapse]
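
The status rules in the command above can be read as a small decision procedure. The following is a minimal Python sketch of that reading, not the command's actual implementation: `TaskSnapshot`, `classify` and `next_action_after_no_progress` are hypothetical names introduced here, and checking the conditions in the order the prompt lists them is an assumption, since the prompt does not rank them explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    READY_TO_VALIDATE = "READY_TO_VALIDATE"
    BLOCKED = "BLOCKED"
    ROLE_VIOLATION = "ROLE_VIOLATION"
    NO_PROGRESS = "NO_PROGRESS"


@dataclass
class TaskSnapshot:
    """Hypothetical summary of what the Supervisor observed under ONE task."""
    has_compliant_bundle: bool    # a BUNDLE (RUN #n) line exists
    run_folder_well_formed: bool  # the referenced run-folder is present and complete
    has_blocked_and_needs: bool   # both BLOCKED and NEEDS lines exist
    worker_broke_role: bool       # Worker wrote EVIDENCE/PASS/FAIL, toggled boxes, committed, ...


def classify(snapshot: TaskSnapshot) -> Status:
    """Map a snapshot to a status, checking conditions in the listed order (assumption)."""
    if snapshot.has_compliant_bundle and snapshot.run_folder_well_formed:
        return Status.READY_TO_VALIDATE
    if snapshot.has_blocked_and_needs:
        return Status.BLOCKED
    if snapshot.worker_broke_role:
        return Status.ROLE_VIOLATION
    return Status.NO_PROGRESS


def next_action_after_no_progress(attempt: int, max_attempts: int) -> str:
    """Per-task retry policy: retry the SAME task until MAX_ATTEMPTS, then mark it MAXED."""
    if attempt < max_attempts:
        return "RETRY_SAME_TASK"  # fresh subagent, same task
    return "MAXED"                # dependency-blocking stop logic applies next


if __name__ == "__main__":
    # A BUNDLE line exists, but the run-folder it points to is incomplete.
    snap = TaskSnapshot(
        has_compliant_bundle=True,
        run_folder_well_formed=False,
        has_blocked_and_needs=False,
        worker_broke_role=False,
    )
    print(classify(snap).value)  # NO_PROGRESS
```

Only `READY_TO_VALIDATE` leads to the Supervisor running the bundle, which is why a BUNDLE line pointing at a missing or incomplete folder counts as a failed attempt, not as something to validate.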

---

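## Workflow

### Initial setup

First, install Claude Code and Codex; I won't walk through that here. OpenSpec deserves a note, though: I recommend pinning version 0.19.0. Later releases restructured the OpenSpec workflow. They still support natural-language invocation, but it is triggered through `skills`[^3]. I plan to adapt this setup to the latest OpenSpec later.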

```shell
npm install -g @fission-ai/openspec@0.19.0
```

Then initialize the project with OpenSpec:

```shell
openspec init
```

```txt
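Next steps - copy these prompts into Codex:
────────────────────────────────────────────────────────────
1. Populate your project context:
"Please read openspec/project.md and help me fill it out
with details about my project, tech stack, and conventions"

2. Create your first change proposal:
"I want to add [YOUR FEATURE HERE]. Please create an
OpenSpec change proposal for this feature"

3. Learn the OpenSpec workflow:
"Please explain the OpenSpec workflow from openspec/AGENTS.md
and how I should work with you on this project"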
```

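### The repeating loop

Open Codex and describe a change proposal in natural language, for example: `Add automatic night-mode switching to this project`

Then run the skill `$openspec-change-interviewer <id>` so the model interviews you and pins down the requirements. `<id>` is the folder name of the current proposal under the `openspec` folder (that is, `openspec/changes/<id>/`).

Next, run `$openspec-feature-list <id>` to have the model generate a `feature_list.json`.

Finally, open Claude Code and enter `/monitor-openspec-codex <id>`.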
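
Before handing over to Claude Code, it can be worth checking that every `[#R<n>]` tag in `tasks.md` has a matching entry in `feature_list.json`, and that no feature is marked as passing while its task is still unchecked. Below is a minimal Python sketch of such a preflight check. It is not part of OpenSpec or of the skills: `check_refs` and `check_change` are hypothetical helpers, and the JSON layout (a top-level list of objects with `ref` and `passes` keys) is an assumption, so adjust it to whatever your `openspec-feature-list` skill actually writes.

```python
import json
import re
from pathlib import Path

# Top-level checkbox line ending in a ref tag, e.g. "- [ ] 1.1 Implement login API [#R1]".
TASK_LINE = re.compile(r"^- \[([ xX])\] .*\[#(R\d+)\]$")


def check_refs(tasks_md: str, features: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means both files agree."""
    problems: list[str] = []
    checked: dict[str, bool] = {}
    for line in tasks_md.splitlines():
        match = TASK_LINE.match(line.rstrip())
        if not match:
            continue  # ACCEPT/TEST/BUNDLE/EVIDENCE lines and prose are ignored
        mark, ref = match.group(1), match.group(2)
        if ref in checked:
            problems.append(f"{ref}: tag reused in tasks.md")
        checked[ref] = mark.lower() == "x"

    # Assumed layout: [{"ref": "R1", "description": "...", "passes": false}, ...]
    passes = {entry["ref"]: bool(entry.get("passes", False)) for entry in features}
    for ref, is_checked in checked.items():
        if ref not in passes:
            problems.append(f"{ref}: no entry in feature_list.json")
        elif passes[ref] and not is_checked:
            problems.append(f"{ref}: passes=true but task is still unchecked")
    return problems


def check_change(change_dir: Path) -> list[str]:
    """Load both files from openspec/changes/<id>/ and compare them."""
    tasks_md = (change_dir / "tasks.md").read_text(encoding="utf-8")
    features = json.loads((change_dir / "feature_list.json").read_text(encoding="utf-8"))
    return check_refs(tasks_md, features)


if __name__ == "__main__":
    tasks = (
        "- [ ] 1.1 Implement login API [#R1]\n"
        "  - ACCEPT: ...\n"
        "- [ ] 1.2 Add logout [#R2]\n"
    )
    features = [{"ref": "R1", "description": "login", "passes": False}]
    print(check_refs(tasks, features))  # ['R2: no entry in feature_list.json']
```

The check only reads the two files, so it is safe to run at any point between supervision sessions.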

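### Step-by-step usage

1. Install `openspec` (I recommend pinning `0.19.0`)

   I strongly recommend `0.19.0`: later versions restructured the workflow, and although they still support natural-language invocation, it goes through `skills`[^3]. I plan to adapt to the latest version later.

   ```shell
   npm install -g @fission-ai/openspec@0.19.0
   ```

2. Initialize the project

   ```shell
   openspec init
   ```

   When initialization finishes, it prints the suggested next steps. Fill in the project context first, then create the first change proposal.

3. Propose a change with `codex` (plain natural language is fine)

   For example: `Add automatic night-mode switching to this project`

4. Use a skill to interview out the requirements

   Aligning on requirements is well worth the time. Have the model ask its questions first and only then start work; it saves a lot of rework later.

   - Run `openspec-change-interviewer`: `$openspec-change-interviewer <id>`
     - `<id>` is the folder name of the current proposal under the `openspec` folder

5. Generate `feature_list.json`

   - Run: `$openspec-feature-list <id>`
     - With this in place, we can later catch work that "looks finished but never actually passed"

6. Start supervised execution: hand over to `Claude Code`

   Finally, open `Claude Code` and enter:

   - `/monitor-openspec-codex <id>`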
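
Each `/monitor-openspec-codex <id>` invocation works on one unchecked checkbox item from `tasks.md`, or reports that everything is done. The sketch below illustrates that selection step in Python under simplifying assumptions: `first_unchecked_task` and `completion_banner` are hypothetical helpers, and the eligibility and dependency rules that the real command applies are left out.

```python
import re
from typing import Optional

# Unchecked top-level task line: "- [ ] <task-id> <summary> [#R<n>]".
UNCHECKED = re.compile(r"^- \[ \] (\d+(?:\.\d+)*) (.+?) \[#(R\d+)\]$")


def first_unchecked_task(tasks_md: str) -> Optional[dict]:
    """Return the first unchecked task (top to bottom), or None if none remain."""
    for line in tasks_md.splitlines():
        match = UNCHECKED.match(line.rstrip())
        if match:
            return {
                "task_id": match.group(1),
                "summary": match.group(2),
                "ref": match.group(3),
            }
    return None


def completion_banner(change_id: str) -> str:
    """Banner printed when no unchecked tasks remain (format taken from the command)."""
    return f"[MONITOR] DONE | change={change_id} | all tasks checked"


if __name__ == "__main__":
    tasks = (
        "- [x] 1.1 Implement login API [#R1]\n"
        "- [ ] 1.2 Add logout [#R2]\n"
        "  - ACCEPT: ...\n"
    )
    print(first_unchecked_task(tasks))
    # {'task_id': '1.2', 'summary': 'Add logout', 'ref': 'R2'}
```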

## References

[^1]: [Effective harnesses for long-running agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents)
[^2]: [gpt-5-codex model documentation](https://platform.openai.com/docs/models/gpt-5-codex)
[^3]: [OpenSpec issue #630 (comment)](https://github.com/Fission-AI/OpenSpec/issues/630#issuecomment-3827031916)

让 AI 一直跑又不跑偏，真的太难了

今天就把这套思路分享给大家。这不仅仅是个方案，更是一种思路，大家完全可以拿去改成适合自己的版本。

特别提醒：这个思路适合从 1 到 n 的迭代开发。如果是 0 到 1 的新项目，我还是建议大家自己动手，或者亲自盯着模型做。

选对工具，省钱又省心

我自己的情况是：有 ChatGPT Plus，有 codex 使用权限，同时还有 glm 的 coding plan lite（可以配置到 Claude Code 里用）。Gemini 我也有，但 Gemini cli 的体验我个人觉得一般，所以这里就用 Claude Code + codex 来演示。

总结一下就是：

glm 的 coding plan：额度多，我基本没碰到过限额
Claude Code：有时会出现 过早完成任务 的情况
codex：相对更稳一点，但模型更贵，要省着用

所以我这里的策略是：让 Claude Code 来充当监督者，让 codex 去干活儿。

关于 codex 模型，我建议用 ChatGPT-5.2-medium。带 codex 后缀的模型官方说的是专门针对编程和代理任务优化¹但我实际测下来干活效果不太理想。medium 类似“Auto”，你也可以选 high，但是不要选 Xhigh，我之前试过，效果是真好，但一天跑完了一周的额度，钱包真的受不住。

两层防跑偏保险

这套 workflow 里，我最在意的是“防跑偏”和“防作弊”。

所以我用了两个东西做双保险，一个是tasks.md一个是feature_list.json，主要对比如下：

1. 对比表格

特性	tasks.md	feature_list.json
核心定位	执行层：具体的实施步骤与验证过程	管理层：产品功能需求的最终状态
颗粒度	细粒度：一个功能可能拆分为多个任务（1.1, 1.2, 1.3）	粗粒度：一个 Ref ID 对应一个完整功能点（R1）
Worker 权限	部分写入：仅允许添加 `BUNDLE` 行（交付代码包路径）	完全禁止：禁止修改任何内容（严禁自作主张改需求或状态）
Supervisor 权限	管理执行：勾选 Checkbox，写入 `EVIDENCE`（通过/失败结论）	更新状态：仅在验证通过后，将 `passes` 字段改为 `true`
内容形态	Markdown：包含人类可读的指令、测试标准、运行日志路径	JSON：结构化数据，包含 Ref ID、描述、布尔值状态
生命周期	动态交互：随着每次运行不断追加日志、报错、重试记录	相对静态：只有在功能真正“做完且验过”时才会翻转状态
	给人类+AI Agent	主要给AI Agent

2. 作用与联系

各自的作用

tasks.md（过程）：
它是过程记录。它记录了从代码实现到最终验证的完整流水线。Worker 可以在这里犯错、重试（Attempt #1, #2...），Supervisor 在这里记录具体的验证命令和截图路径。它是人机协作的作业空间，容纳了试错与迭代的细节，确保过程的可追溯性。
feature_list.json（结果）：
它是验收基准。它不记录具体的开发曲折，只映射最终的交付状态。负责 哪些端到端能力已经真正验过并通过 ，它用稳定 ref 来做长期清单，默认全部 passes=false，只有当某个 ref 的 PASS 证据链已经存在时才允许更新为通过。

靠什么联系起来？

两者通过 Ref 标签（如 [#R1]） 进行刚性绑定：

映射关系：tasks.md 中的具体任务行会携带标签（例如 - [ ] 1.1 实现登录接口 [#R1]），这个标签直接对应 feature_list.json 中的 "ref": "R1" 条目。
状态流转（单向驱动）：
- 先在 tasks.md 验证：Supervisor 必须先在 tasks.md 中运行 Worker 提供的代码包，确认测试通过，并写入 EVIDENCE ... RESULT: PASS。
- 后在 feature_list.json 归档：只有当 tasks.md 里的证据链确凿无疑（PASS）后，Supervisor 才有权限去修改 feature_list.json 中对应 R1 的 passes 字段为 true。

为什么要这么死板？因为只靠一份任务清单，模型是可能“看起来完成了”，但实际没完成；而 feature_list.json 这种能让我们更容易发现它是不是在糊弄。某种意义上，它就是防止“做个样子但不可用”的那道门槛²。

另外，为了最大程度减少“需求没对齐就开干”，我还加了一个 skills，让 AI 能反问我们，把需求再确认一遍。

总体思路

[角色分工] Claude Code 充当监督者（Supervisor），Codex 则是工人（Worker）。

为什么要这么拆？

因为真正怕的不是它不会写代码，而是：

它觉得“自己做完了”，但其实只是做了个样子
它偷懒绕过验证，或者验证不可复现
它跑偏了还自信满满，最后我们接手的时候一地鸡毛

所以这里使用两个 Agent 进行工作，最大程度的防止作弊，一个只负责写、一个只负责验收。

[启动] 整个流程开始于我使用 Codex （工人）生成的一份 OpenSpec 变更提案，这些提案会被转化为 tasks.md 中具体的待办事项列表。每当需要执行一项新任务时，Claude Code （监督者）就会启动一个subagent，使用codex exec调用 Codex （工人）。然后使用自然语言调用 OpenSpec。OpenSpec 最好是0.19.0版本，因为再新的版本 OpenSpec 的工作流重构了，也支持自然语言调用，但使用的是skills触发³。

[执行与交付] Codex （工人）在写完代码后，它必须制作并交付一个可复现的测试方案作为完工凭证并放在auto_test_openspec 目录下:

- CLI 任务： 包内必须包含自动化测试脚本（run.sh）。
- GUI 任务： 包内必须包含一份不含可执行代码的 MCP 操作方式（Markdown 格式），以及仅用于启动服务的脚本。

[验收与确权] Claude Code （监督者）会亲运行脚本进行验收，对于 GUI 任务，它会严格按照剧本调用 playwright-mcp 服务驱动浏览器，并抓取截图作为铁证，确保功能不仅代码写了，而且真实可用²。

只有当 Claude Code （监督者）亲自确认测试方案运行通过，且手中的证据链完整无误时，它才会执行一系列确权操作：

在 tasks.md 中勾选任务。
更新 feature_list.json 的 pass 状态。
执行 Git 提交存档。
将包含证据指针的交接日志写入 progress.txt。

[异常处理] 如果中遇到技术卡点， Claude Code （监督者）会利用 Context7 或浏览器搜索工具自主寻找解决方案并指导执行者重试。

目录结构

.
├── auto_test_openspec/                     # [根目录衍生品] 不可变的证据仓库
│   ├── run-0001__task-1.1__ref-R1.../      # 具体某次任务的“验证包” (Run Folder)
│   │   ├── run.sh                          # 自动化复现脚本
│   │   ├── task.md                         # 验证操作手册
│   │   └── ...                             # (日志、截图、输入输出等)
│   └── ...
│
├── git_openspec_history/                   # [根目录衍生品] Git 提交索引
│   └── runs.log                            # 索引日志：回溯 Run ID <-> Git Commit SHA
│
└── openspec/
    └── changes/
        └── <change-id>/                    # [OpenSpec 变更内产物]
            ├── feature_list.json           # 特性清单与通过状态 (双重账本)
            ├── progress.txt                # 交接日志 (记录对话与验证结果)
            └── tasks.md                    # (任务列表源文件)

### 怎么保存记忆？

每个任务单独的一个subagent，这样做是可以保证上下文不会过长和污染。但记忆则确保不了，我的方案是。

1. 核心机制：“启动仪式” (The Startup Ritual)

要求 Codex（工人）在干活前必须先读取历史档案：
- 必须读取 openspec/changes/<change-id>/progress.txt 和 feature_list.json。
- 必须运行 git log --oneline -20 来获取最近的代码变更历史。
- 必须把读到的这些信息写进 auto_test_openspec/$ARGUMENTS/<run-folder>/logs/worker_startup.txt，证明“我看过以前发生什么了”。

2. 三个记忆文件

tasks.md 作为项目的“任务记忆”与唯一事实来源，它维护着所有任务的执行状态清单。 Claude Code （监督者）通过读取此文件来决定当前的派发逻辑，而 Codex （工人）则依靠它明确具体的实施目标，从而确保双方对 哪些任务已完成、哪些待执行 拥有一致的认知。
progress.txt 这是一个只增不减的“过程记忆”日志，用于在不同会话间传递交接信息。每当任务结束， Claude Code （监督者）会将对话摘要、验证结果及报错信息固化至此；新启动的 Codex （工人）必须通过查阅该文件中的历史记录（特别是失败或阻塞的原因），来汲取前车之鉴，从而避免重蹈覆辙。
feature_list.json 它是项目完成度的状态，专门记录各个功能模块的验证通过状态。在该机制下，Codex （工人）仅拥有读取权限以确认依赖项状态，只有在 Claude Code （监督者）完成严格验证后才会更新此文件，从而保证了关于项目整体可用性的记忆既连续又具备绝对的权威性²。

Skills和mcp配置

1. 配置 MCP

playwright-mcp：

claude mcp add --transport stdio --scope user playwright-mcp -- npx -y @playwright/mcp@latest

再配一个 context7（遇到卡点能查资料、补上下文）：

claude mcp add context7 -- npx -y @upstash/context7-mcp@latest

我这里浏览器搜索 MCP 用的是智普的（你也可以换别家的，只要名字对得上就行）：

claude mcp add -s user -t http web-search-prime https://open.bigmodel.cn/api/mcp/web_search_prime/mcp --header "Authorization: Bearer your_api_key"

claude mcp add -s user -t http web-reader https://open.bigmodel.cn/api/mcp/web_reader/mcp --header "Authorization: Bearer your_api_key"

配置示例

2. skills

这几个 skill 我是直接放在仓库里维护的，大家可以按需下载：

给 codex 用的：

建议大家去 GitHub 下载 openspec-change-interviewer（用 采访式反问 把需求对齐）
再去 GitHub 下载 openspec-feature-list（生成 feature_list.json ）

给 Claude Code 用的：

这个是 Supervisor 卡点用的研究：建议大家去 GitHub 下载 openspec-unblock-research

自定义 openspec-unblock-research 的 mcp server

1. 配置mcp server

在 Claude Code 中运行 mcp list。必须看到 mcp__<new-search-name>__* 和 mcp__github__* (或其他辅助工具) 均已加载。

2. 修改核心文件 (SKILL.md)

对 openspec-unblock-research 的 SKILL.md 进行两处关键修改：

1. 修改文件头部 Description

保持描述与实际工具一致。

- 把 `mcp__web-search-prime__*`
- 改为 `mcp__<new-search-name>__*`

2. 修改 Default Provider Ordering

在文件底部的列表里 **插入新工具** 并 **替换旧搜索**。

修改示例：

## Default provider ordering (if caller omits toolchain_config)

1. `mcp__context7__*` (authority source)
   ...

2. `mcp__github__*` (新增: internal authority)
   - Use for: checking existing issues/bugs in repo or upstream.
   - Trigger when: `error_excerpt` looks like a library bug.
   - Stop when: found a closed issue matching symptoms.

3. `mcp__<new-search-name>__*` (替换原有的 search-prime)
   - Use for: recent regressions, common pitfalls.
   - Trigger when: `error_excerpt` includes searchable strings.
   - Stop when: have candidate links to verify.

4. `mcp__web-reader__*` (evidence fetcher)
   ...

需要更改的文件

可选：规范代码

修改AGENT.md。这个主要目的是为尽量写的代码规范一点精简一点，属于个人喜好，当然你也可以配置一下其他的，比如必须使用uv虚拟环境等等。大家如果觉得没必要的话可以不加

## Code hygiene guardrails (always-on)

- Prioritize correctness and maintainability over cosmetic changes.
    
- Keep scope tight: don’t refactor unrelated areas; avoid “while I’m here” edits.
    
- Write for the next reader: choose clear names, straightforward control flow, and readable structure.
    
- Avoid clever compactness (dense one-liners, nested ternaries). Prefer if/else or switch when branching grows.

关键文件修改

为了让这套流程跑起来，我们需要覆盖或新建几个配置文件。

openspec-proposal.md需要添加的

位置：

Windows: %USERPROFILE%\.codex\prompts\openspec-proposal.md
macOS/Linux: ~/.codex/prompts/openspec-proposal.md

目的：让openspec生成的task.md比较符合我们的需求。

注：该文件必须在输入openspec init后修改，否则会默认重置掉。

Steps6后面添加

- When drafting `openspec/changes/<id>/tasks.md`, you MUST follow:
  - `openspec/project.md` → `## tasks.md Checklist Format` (canonical; do not invent a parallel format).

- Hard gate reminders (do not expand here; see canonical spec above):
  - Every task MUST include `ACCEPT:` and `TEST:`.
  - Every checkbox task line MUST include EXACTLY ONE `[#R<n>]` token, unique across the file.
  - `TEST:` MUST include `SCOPE: CLI|GUI|MIXED` and MUST enable a human-reproducible validation bundle
    (all bundle rules + role split + evidence rules live ONLY in `openspec/project.md`).

  - Role split (mandatory; see `openspec/project.md` → “Validation bundle requirements”):
    - Worker produces bundle assets only; Supervisor executes and records PASS/FAIL evidence.

  - GUI/MIXED constraint (mandatory; see `openspec/project.md` → “CLI/GUI/MIXED validation requirements”):
    - GUI verification must be driven via MCP service `playwright-mcp` and evidence must be archived; do NOT use any browser automation scripts (Python/Node/Playwright test runner).

项目目录：openspec\project.md

目的：让openspec生成的task.md比较符合我们的需求。

在project.md末尾添加

## tasks.md Checklist Format

This section is the SINGLE canonical spec for tasks.md format and validation bundles.
Do not duplicate this spec elsewhere; other docs must link here.

### Task Line Format (required)

Each checkbox task line MUST follow:
- `- [ ] <task-id> <task summary> [#R<n>]`
- `<task-id>` MUST be dot-numbered (e.g. `1.1`, `2.3`).
- Each checkbox line MUST include EXACTLY ONE `[#R<n>]` token (e.g. `[#R1]`).
  - `[#R<n>]` MUST be unique across the entire tasks.md (never reuse).
- Every task MUST include both `ACCEPT:` and `TEST:` blocks.
- `TEST:` MUST include `SCOPE: CLI|GUI|MIXED` and MUST be implementable into a validation bundle
  per `### Validation bundle requirements (mandatory)` below.

### Example (copy/paste)

- [ ] 1.1 Do X and produce Y [#R1]
  - ACCEPT: ...
  - TEST: SCOPE: CLI
    - When done, generate validation bundle under:
      auto_test_openspec/<change-id>/<run-folder>/
    - run-folder MUST be:
      run-<RUN4>__task-<task-id>__ref-<ref-id>__<YYYYMMDDThhmmssZ>/
    - Run: auto_test_openspec/<change-id>/<run-folder>/run.sh (macOS/Linux) or run.bat (Windows)
    - run-folder MUST be:
      run-<RUN4>__task-<task-id>__ref-<ref-id>__<YYYYMMDDThhmmssZ>/
    - Run: auto_test_openspec/<change-id>/<run-folder>/run.sh (macOS/Linux) or run.bat (Windows)
    - Inputs: inputs/sample.json
      Outputs: outputs/result.json
    - Verify: compare against expected/result.json (or rule-based assertions)

### Validation bundle requirements (mandatory)

For every task, `TEST:` MUST be written so:
- the Worker can produce a **human one-click reproducible** validation bundle (assets + scripts for CLI checks; GUI checks are MCP-driven and MUST NOT use any browser automation scripts),
- AND the Supervisor can execute it and record the final PASS/FAIL evidence chain
  (each run-folder is immutable; evidence pointers are written after execution).

0) Roles & responsibilities (mandatory)
- Worker (produces artifacts; not the final verifier):
  - Implement product code + write tests (CLI). For GUI/MIXED, produce an MCP runbook only (no executable browser automation scripts).
  - Produce the validation bundle assets under the run-folder:
    `task.md`, `run.sh`, `run.bat`, `tests/` (CLI tests and/or GUI MCP runbook; no executable browser scripts), and (when applicable) `inputs/`, `expected/`.
  - MUST NOT declare PASS/FAIL.
  - MUST NOT overwrite/edit prior run-folders (append-only history).

- Supervisor (executes validation; forms the evidence chain):
  - MUST create a brand-new run-folder for every validation attempt (never overwrite).
  - Executes `run.sh` / `run.bat`, captures `outputs/` + `logs/` + GUI evidence when applicable.
  - MUST write the final PASS/FAIL result + evidence pointers (this is the DONE hard gate).

1) Canonical on-disk location (repo root; append-only)
- Root folder (fixed):
  - `auto_test_openspec/<change-id>/`
- Each validation attempt MUST create a brand-new run folder (never overwrite; keep ALL history forever):
  - `auto_test_openspec/<change-id>/<run-folder>/`
- Once created, a run folder MUST be treated as immutable evidence:
  - do not edit prior runs; create a new run folder instead.

2) Run folder naming (required; MUST include run#, task-id, ref-id; timestamp recommended)
- `<run-folder>` MUST follow this exact pattern:
  - `run-<RUN4>__task-<task-id>__ref-<ref-id>__<YYYYMMDDThhmmssZ>/`
- Example:
  - `run-0007__task-1.1__ref-R1__20260111T031500Z/`
- Rules:
  - `<RUN4>`: zero-padded, monotonic run counter (e.g. 0001, 0002, ...).
    - MUST match the Supervisor workflow RUN_COUNTER / `EVIDENCE (RUN #n)` numbering for audit alignment.
    - Mapping rule: `RUN #7` => `run-0007`, `RUN #12` => `run-0012`.
  - `<task-id>`: dot-numbered task id from the checkbox line (e.g. `1.1`).
  - `<ref-id>`: stable ref id derived from the task tag (e.g. `[#R1]` → `R1`).
  - `<YYYYMMDDThhmmssZ>`: UTC timestamp to guarantee uniqueness and ease auditing.

3) Minimum required contents inside EVERY run folder
Each run folder MUST contain at least:

A) `task.md` (this run’s readme; MUST be self-sufficient)
task.md MUST include:
- change-id, run#, task-id, ref-id
- SCOPE covered (CLI / GUI / MIXED)
- How to run (Windows + macOS/Linux)
  - CLI: run.sh/run.bat executes CLI checks.
  - GUI/MIXED: run.sh/run.bat starts the service only; GUI steps are executed via the MCP runbook under tests/.
- Test inputs (if any): input file paths, params, sample data
- Test outputs (if any): what files/stdout/stderr/screenshots/logs will be produced and where
- Expected results (machine-decidable): pass/fail criteria
  - exit code checks
  - stdout/stderr assertions (required when relevant)
  - file existence/content assertions (required when outputs exist)
  - GUI assertion points (when GUI/MIXED): which screenshots/states prove correctness
- Hard rules (GUI/MIXED):
  - task.md MUST NOT contain manual browser steps (no “open Chrome/click buttons” prose).
  - task.md MUST point to the MCP-only runbook under tests/ (e.g., tests/gui_runbook_<topic>.md).
  - Any required “copy/seed/prepare input/state” steps MUST be written as exact commands/steps here (and referenced by the runbook). run.sh/run.bat MUST NOT perform them.
- Provenance of expected/assumptions:
  - If inputs/expected are not provided by a human, the Worker MUST generate them and document where they came from
    (e.g., derived from ACCEPT, or an explicit reasonable assumption).


B) One-click scripts (both required; GUI/MIXED = start-server only)
- run.sh (macOS/Linux)
- run.bat (Windows)

Script requirements (all bundles):
- Must assume the default dev machine environment is ready.
- Non-destructive:
  - MUST NOT modify global environment
  - MUST NOT globally install dependencies
  - MUST NOT write to system directories
- Must be runnable from ANY working directory:
  - the script MUST cd/pushd to its own directory first, then resolve paths from there.

Hard rule (when SCOPE includes GUI):
- run.sh/run.bat MUST be start-server only:
  - MUST: start the local service and print the access URL/port (e.g., http://127.0.0.1:<PORT>/)
  - MUST NOT: copy/overwrite data files, mutate state/inputs, generate exports/outputs, run tests, run exports, probe/install dependencies, or perform environment probes (python/uv version checks do NOT belong in GUI start scripts)
  - Any required “copy/seed/prepare input/state” steps MUST be documented as exact commands/steps in task.md (and referenced by tests/gui_runbook_*.md) for the Supervisor to execute and record in EVIDENCE.

For CLI bundles (or the CLI portion of MIXED):
- run.sh/run.bat SHOULD print key results to console and SHOULD write logs to logs/.
- Environment provenance SHOULD be documented as optional preflight commands in task.md (not forced into GUI start scripts), e.g.:
  - interpreter path + version (Python/Node if used)
  - uv --version when Python/uv is involved
- When provenance is executed, it SHOULD be recorded to logs/.

C) Test asset folders (create the ones that apply)

- `logs/` MUST exist (always):
  - run logs, env/version info, command transcript, GUI screenshot index, etc.
- `tests/` MUST exist when:
  - SCOPE includes GUI (MCP-driven via `playwright-mcp`), OR
  - validation is not fully expressible as simple CLI assertions.
- `inputs/` MUST exist when the task involves file input (see I/O hard rule below).
- `outputs/` MUST exist when the validation produces file outputs (see I/O hard rule below).
- `expected/` SHOULD exist when golden-file comparison is used; otherwise rule-based assertions are acceptable.

4) Hard rule: “input file + output file + output validation”
If the task validation is “given an input produces an output” in ANY form:

- `inputs/` MUST contain at least one reproducible input sample.
- `run.*` MUST write the real produced outputs into `outputs/` (never into random temp/system dirs).
- The bundle MUST include at least one machine-decidable verification method (pass/fail), typically:
  - (A) golden file compare against `expected/` (exact match OR documented allowed-diff rules), and/or
  - (B) rule-based assertions (e.g. JSON schema, key fields, row counts, regex match, exit code, forbidden strings).

`task.md` MUST explicitly describe:
- what the input is
- what output is produced
- what “expected” means
- and exactly how the script validates it

5) CLI / GUI / MIXED validation requirements
- If SCOPE includes CLI:
  - MUST run the real CLI command(s) in `run.*`
  - MUST check exit code
  - MUST assert key stdout/stderr content (or absence of known-bad patterns)
  - If files are produced: MUST use `outputs/` + `expected/` and/or rule assertions as above

- If SCOPE includes GUI:
  - The validation bundle MUST provide an MCP-only GUI verification runbook
    (stored under tests/ and executed by the Supervisor via playwright-mcp; do NOT use any scripts to drive the browser).
  - Hard rule: run.sh/run.bat MUST be start-server only for GUI/MIXED bundles:
    - MUST: only start the service and print URL/port
    - MUST NOT: copy/seed/prepare input/state, generate exports/outputs, run tests, or perform environment probes
    - Any required data prep steps MUST be written as exact commands/steps in task.md (and referenced by the runbook).
  - Supervisor execution constraint (mandatory):
    - GUI verification MUST be driven via MCP service playwright-mcp
      - no manual browser interaction
      - no Python/Node/Playwright scripts to drive the browser
  - Must archive auditable evidence artifacts (append-only; never overwrite):
    - at minimum: screenshots (e.g., outputs/screenshots/ plus a screenshots index file in logs/)
    - recommended: trace/video and a console log index when available from MCP (paths recorded in logs/)

- If SCOPE is MIXED:
  - The bundle MUST cover both CLI and GUI checks (either in one test file or split; see “two test files” rule below).

6) Allowing two test files (when needed; organization rule)
Default: one test file should cover key acceptance points.

Two test files are allowed / recommended when:
- CLI + GUI are both involved:
  - one test focuses on CLI
  - one runbook focuses on GUI (MCP steps + assertions; no executable browser scripts)
- Same entrypoint but two distinct paths must be covered:
  - happy path + error/edge path (e.g., valid vs invalid args)
- GUI needs both “functional flow” and “render/state”:
  - split into two smaller, more stable tests

Suggested naming under the run folder:
- `tests/test_cli_<topic>.*`
- `tests/gui_runbook_<topic>.md` (MCP-only steps + assertion points; no executable browser scripts)

Note:
- “two test files” refers to validation assets under `tests/` (CLI test scripts and/or GUI MCP runbook).
- The “input/output two files + validation” rule refers to runtime data under `inputs/outputs/expected` and is additive, not conflicting.

7) Environment isolation (uv venv rule; mandatory when env problems occur)
- Under no circumstances may the Worker “pollute global Python env” to make validation pass (e.g., global `pip install`).
- If the Worker encounters environment problems (missing deps, conflicts, cannot run):
  - MUST create an isolated venv using `uv`
  - Recommended location: inside THIS run folder (e.g. `<run-folder>/.venv/` or `<run-folder>/venv/`)
  - All installs/runs must occur inside that venv
- `run.*` and/or `logs/` MUST clearly record:
  - which interpreter is used
  - uv version
  - where dependencies came from (lockfile / pyproject / etc.)
- Note:
  - Creating a venv is conditional (only when env problems occur),
    but running the full validation bundle is unconditional (always required).
    

8) tasks.md bookkeeping lines (mandatory; role split; no duplicated rules elsewhere)
- Under the task entry in `openspec/changes/<change-id>/tasks.md`, TWO lines are mandatory:
  - Worker-written (bundle-ready; NO PASS/FAIL):
    - `BUNDLE (RUN #n): ... | VALIDATION_BUNDLE: auto_test_openspec/<change-id>/<run-folder> | HOW_TO_RUN: run.sh/run.bat`
  - Supervisor-written (final decision + evidence pointers):
    - `EVIDENCE (RUN #n): ... | VALIDATED: <exact commands + exit code> | RESULT: PASS|FAIL | GUI_EVIDENCE: <paths when applicable>`
- Worker MUST NOT claim PASS/FAIL anywhere; Supervisor is the only role that records PASS/FAIL after running the bundle.

项目目录：.\claude.md

目的：明确Claude code的任务身份、工作流。

完全覆盖claude.md

 # CLAUDE.md (OpenSpec + Codex Supervisor)
 
 You are the SUPERVISOR (Claude Code). Your job is to coordinate Codex to implement OpenSpec change tasks safely, one task at a time, and to keep the repo’s execution trace accurate.
 
 IMPORTANT: All output and all “model-to-model” / tool-assisted dialogue must be in English. Do not produce Chinese text.
 
 ## Source of truth
 - `openspec/changes/<change-id>/tasks.md` is the single source of truth for implementation progress.
 - Do not use `TODO.md` for this workflow. Do not invent tasks outside `tasks.md`.

## Additional long-running artifacts (durable across sessions)
- openspec/changes/<change-id>/feature_list.json is the durable end-to-end feature checklist.
  - One entry per stable ref tag (e.g., [#R1] in tasks.md maps to "ref": "R1" in JSON).
  - Default all features to failing (passes=false) until validated.
  - Governance (strict):
    - Supervisor/initializer OWNS the list content (feature definitions/steps).
    - Worker is FORBIDDEN to add/remove/rewrite feature entries.
    - Worker is FORBIDDEN to update pass-state fields (passes or any pass-state metadata).
    - Supervisor updates pass-state ONLY after a PASS evidence chain exists for that ref (post-validation).
    - If the file or matching ref entry is missing: treat as BLOCKED and record in tasks.md; do NOT scaffold or invent entries.
- openspec/changes/<change-id>/progress.txt is the Supervisor-written handoff log.
  - Append-only. One RUN entry per task attempt (one subagent / one Codex run).
    - A single /monitor-openspec-codex ... invocation MUST append at most ONE RUN entry (no batch loop by default).
    - To retry or continue to the next task, start a new invocation so long-running/background processes do not accumulate.
  - Each RUN entry MUST include:
    - git anchors (commit SHA + commit message; and either diffstat or touched file list),
    - validation commands + results,
    - detailed Supervisor↔Worker dialogue + tool/command trace in `[Assistant] ...` / `[Tool Use] ...` style for replay/audit.
  - Must reflect only verified facts (no aspirational claims).
- `git_openspec_history/<change-id>/runs.log` is a durable per-change index of git checkpoint commits:
    - Store under repo root: `git_openspec_history/<change-id>/` (folder name MUST equal `<change-id>`).
    - Append-only log: `git_openspec_history/<change-id>/runs.log` (one line per successful RUN linking run# → commit → diffstat/files).
- `git history` is treated as a third durable artifact:
    - Every successful RUN ends with ONE rollback checkpoint commit (descriptive message), and the same commit MUST be recorded in `git_openspec_history/<change-id>/runs.log`.

## Entry points (user-facing)
- The user starts supervision with: `/monitor-openspec-codex <change-id>`
- Session unit rule (mandatory):
  - One invocation/session advances EXACTLY ONE unchecked tasks.md checkbox item.
  - State restoration across sessions relies on: progress.txt + feature_list.json + git history
    + git_openspec_history/<change-id>/runs.log.

## Worker invocation (Codex CLI)
# Single Codex command constant (maintain ONLY ONE copy)
CODEX_CMD = codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium

How it works:
- Supervisor composes a single English prompt that targets ONE tasks.md checkbox item.
- Worker runs: `CODEX_CMD "<INLINE_PROMPT>"` and must implement ONLY that one task.
- Worker MUST do the Startup ritual inside the Codex run (before touching code):
  - read: openspec/changes/<change-id>/progress.txt + feature_list.json (+ tasks.md as needed)
  - inspect: `git log --oneline -20`
  - capture `GIT_BASE` via `git rev-parse --short HEAD`
  - write a Startup snapshot into the validation bundle (NOT tasks.md), at:
    - `auto_test_openspec/<change-id>/<run-folder>/logs/worker_startup.txt`
    - MUST include (at minimum): UTC timestamp, CODEX_CMD, GIT_BASE, the `git log --oneline -20` excerpt, and a short “what I observed” summary.
  - NOTE: Do NOT write STARTUP/GIT_BASE fields into tasks.md. Supervisor may cite this file path later in EVIDENCE.
- Worker MUST NOT toggle any tasks.md checkbox. Supervisor owns checkboxes.
- Worker MUST NOT edit feature_list.json (neither entries nor pass-state).
- Worker MUST NOT create git commits.
- Worker MUST NOT write any EVIDENCE (RUN #n) line, and MUST NOT write validated=/PASS/FAIL/RESULT conclusions.
- Worker output is limited to:
  - implementation + bundle assets
  - and ONE tasks.md bookkeeping line:
    - BUNDLE (RUN #n): ... | VALIDATION_BUNDLE: auto_test_openspec/<change-id>/<run-folder> | HOW_TO_RUN: run.sh/run.bat | (if GUI) RUNBOOK: tests/gui_runbook_<topic>.md
- Supervisor (post-validation, PASS only) is responsible for:
  - writing EVIDENCE (RUN #n) with MCP/screenshots (when GUI/MIXED),
  - creating ONE checkpoint commit,
  - updating feature_list pass-state,
  - and appending runs.log (if applicable).

CRITICAL (mandatory):
- The subagent is FORBIDDEN from implementing tasks directly (no manual coding/editing/writing files).
- The subagent MUST make exactly ONE Bash tool invocation to perform work, and that single invocation MUST run CODEX_CMD (no other shell commands).
- Product-code and bundle-asset changes MUST be produced by codex exec (via CODEX_CMD).
- Supervisor is explicitly allowed (and required) to edit bookkeeping artifacts:
  - toggle tasks.md checkboxes, write EVIDENCE (RUN #n) lines, append progress.txt, and create ONE checkpoint commit on PASS.
- Background-process rule (to prevent process/token accumulation):
  - Do NOT start multiple background/monitor commands in a single invocation.
  - If any long-running process was started (e.g., a server), terminate it before starting a new attempt.

Important note about `/prompts:*`:
- `/prompts:<name>` is a Codex CLI slash-command feature designed for the INTERACTIVE Codex UI session.
- Do NOT rely on `/prompts:*` in automated non-interactive runs (`codex exec`). Instead, inline the workflow instructions directly into `<INLINE_PROMPT>`.
 
## Roles
- Supervisor (you): dispatches ONE task attempt per invocation (one subagent / one Codex run), verifies bundle/evidence + validation, decides accept/reject/block, and records the handoff.
  - Within a single /monitor-openspec-codex ... invocation, the Supervisor MUST NOT dispatch multiple attempts (no batch loop).
  - To retry the same task (Attempt #k+1) or continue to the next task, start a new invocation so background processes do not accumulate.
  - Supervisor is the ONLY role allowed to toggle checkboxes in `tasks.md`.
  - Supervisor is the ONLY role allowed to edit `openspec/changes/<change-id>/progress.txt` (append-only).
  - Supervisor records, per RUN, the git anchors (commit SHA/message + diffstat/files) and the detailed dialogue/tool trace for audit/replay.

- Worker (Codex via CODEX_CMD): coding agent for ONE task only.
  - MUST perform Startup ritual at the beginning of EVERY run (progress.txt + feature_list.json + `git log --oneline -20` + `git rev-parse --short HEAD`)
    and write what was observed into the validation bundle log:
    - `auto_test_openspec/<change-id>/<run-folder>/logs/worker_startup.txt` (mandatory)
  - MUST implement + write tests (CLI) + produce the validation bundle assets (task.md/run.sh/run.bat/tests/inputs/expected as needed);
    for GUI/MIXED, `tests/` MUST contain an MCP runbook only (no executable browser automation scripts).
  - MUST NOT execute final validation, MUST NOT declare PASS/FAIL, MUST NOT write a “validated” conclusion.

- Supervisor: executes validation and forms the final evidence chain.
  - Runs `auto_test_openspec/<change-id>/<run-folder>/run.sh|run.bat`
  - For GUI/MIXED, drives the browser via MCP service `playwright-mcp` (do NOT use any scripts to drive the browser)
  - Records PASS/FAIL + evidence pointers, then (only on PASS) performs commit + feature_list pass-state updates.

  - Worker MUST NOT toggle any checkbox in `tasks.md`.
  - Worker MUST NOT edit `openspec/changes/<change-id>/progress.txt`.
  - Supervisor MUST NOT add/remove/rewrite feature_list entries (only pass-state fields; no content edits).

- Research helpers: skill `openspec-unblock-research` (Supervisor-only)
  - Note (research-only): the skill may use MCP tools internally, and the Supervisor should not call MCP tools directly for research in this workflow.

- Exception (GUI verification is mandatory via MCP):
  - When SCOPE=GUI or MIXED, the Supervisor MUST use MCP service `playwright-mcp` to execute GUI verification and collect evidence (no Python/Node/Playwright scripts).

 ## Task selection rules (tasks.md)
 - Pick the FIRST ELIGIBLE unchecked checkbox item (`- [ ] ...`) in `openspec/changes/<change-id>/tasks.md` (top-to-bottom).
   - ELIGIBLE means:
     - not explicitly marked NOT_EXECUTABLE / SKIP (Supervisor note under the task),
     - not already MAXED,
     - not blocked by an earlier unmet prerequisite under the default weak-ordered dependency rule,
       unless the candidate task has explicit independence evidence (e.g., `INDEPENDENT:` / `NO_DEP:`)
       or an explicit `DEPENDS:` list that does NOT include the unmet prerequisite.
 - Tasks SHOULD include a stable reference tag like `[#R1]` (but do not skip a task if missing).
 - One task = one subagent = one worker run. Never do multiple tasks in a single run.
 
 ## Verification + bookkeeping rules
 After the worker finishes a task:
 1) Re-open `openspec/changes/<change-id>/tasks.md`.
 2) Supervisor is the ONLY role allowed to change any checkbox (`- [ ]` → `- [x]`).
   - Worker/Codex MUST NOT toggle checkboxes.
 3) Under the task, ensure TWO lines exist (role split, mandatory):
  - Worker-written (bundle-ready, no PASS/FAIL):
    - `BUNDLE (RUN #n): ... | VALIDATION_BUNDLE: auto_test_openspec/<change-id>/<run-folder> | HOW_TO_RUN: run.sh/run.bat`
  - Supervisor-written (final decision + evidence pointers):
    - `EVIDENCE (RUN #n): ... | VALIDATED: <exact commands + exit code> | RESULT: PASS|FAIL | GUI_EVIDENCE: <screenshots/trace/video/console index paths>`
    - Prefer this format (SINGLE LINE, THIS TASK ONLY):
    EVIDENCE (RUN #n): CODEX_CMD=codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium
    | SCOPE: <CLI|GUI|MIXED>
    | VALIDATION_BUNDLE: auto_test_openspec/<change-id>/<run-folder>
    | WORKER_STARTUP_LOG: auto_test_openspec/<change-id>/<run-folder>/logs/worker_startup.txt
    | VALIDATED_CLI: <exact command(s)> | EXIT_CODE: <n>              (omit if no CLI)
    | VALIDATED_GUI: MCP(playwright-mcp) | RUNBOOK: tests/<...> | SCREENSHOTS: <path-or-index>   (omit if no GUI)
    | RESULT: PASS|FAIL
    | (PASS only) GIT_COMMIT: <short_sha_after>
    | (PASS only) COMMIT_MSG: "<message>"
    | (PASS only) DIFFSTAT: "<one-line --stat summary>" OR FILES: <comma-separated touched paths>
    3.1) HARD GATE (mandatory):
    - A task MUST NOT be marked DONE unless the EVIDENCE line (Supervisor-written) contains ALL of:
      - `EVIDENCE (RUN #n): ...`   # identifies exactly which run this is
      - `SCOPE: CLI|GUI|MIXED`
      - `VALIDATION_BUNDLE: auto_test_openspec/<change-id>/<run-folder>/`
      - `WORKER_STARTUP_LOG: auto_test_openspec/<change-id>/<run-folder>/logs/worker_startup.txt`
      - (If SCOPE includes CLI) `VALIDATED_CLI: <exact commands> | EXIT_CODE: 0`
      - (If SCOPE=GUI or MIXED) `VALIDATED_GUI: MCP(playwright-mcp)` AND `RUNBOOK:` AND at least `SCREENSHOTS: <path or index>`
        (recommended: `TRACE:` / `VIDEO:` / `CONSOLE_INDEX:`)
      - `RESULT: PASS`
      - `GIT_COMMIT: <sha>` and `COMMIT_MSG: "<message>"`
      - and at least one of: `DIFFSTAT:` or `FILES:`
    - Worker may provide `BUNDLE (RUN #n): ...` but it is NOT sufficient for DONE.
 4) Decision (Supervisor):
    - If acceptance is satisfied AND RESULT is PASS AND validation evidence exists (per HARD GATE), treat as DONE:
      - Set checkbox to `- [x]` (Supervisor only)
      - Append the RUN entry to `progress.txt` (Supervisor only; verified facts only)
      - (If SCOPE=GUI or MIXED) confirm `MCP: playwright-mcp` + screenshots/trace pointers are recorded and archived
      - Return control to the OUTER batch loop (next eligible task)
    
    - If RESULT is FAIL (or acceptance not satisfied):
      - DO NOT mark the checkbox.
      - Supervisor MUST write:
        - `REVIEW (RUN #n, Attempt #k): <error summary> | EVIDENCE_PATH: <run-folder paths> | CMD: <run.* + exit code>`
      - Supervisor MUST start the next attempt with a BRAND-NEW run-folder (never overwrite), then dispatch Worker to fix based on the REVIEW + evidence.
      - Do NOT “one-off stop” or “only retry once” here.
        Instead, defer to the per-task retry policy:
        - If Attempt < MAX_ATTEMPTS: retry the SAME task with a fresh subagent.
        - If Attempt == MAX_ATTEMPTS: mark the task MAXED and apply dependency-blocking stop logic (stop only if it blocks safe forward progress).
 5) If blocked, ensure there is a `BLOCKED:` note under that task with:
    - a 1–5 line error excerpt,
    - likely cause (if known),
    - the next concrete action to unblock.
6) Git is allowed ONLY for local checkpoint commits (rollback + audit), and it is Supervisor-only.
Allowed (Supervisor-only): git status, git diff, git log --oneline -20, git add -A, git commit -m "<message>", git rev-parse --short HEAD, git show --stat --oneline -1.
Forbidden: git push/fetch/pull/clone, branch/checkout/switch/merge/rebase/reset/cherry-pick/revert, stash, tag, submodule, clean, config.
Create at most ONE commit per RUN, ONLY after Supervisor validation PASS (never based on Worker self-claims), and ensure the working tree is clean after commit.
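
To make the HARD GATE in step 3.1 easier to picture, here is a minimal Python sketch of a checker for a single-line EVIDENCE entry. It is only an illustration: the field names come from the list above, but the function names, the parsing strategy (split on `|`, then `KEY: value`), and the reading of MIXED as "includes CLI" are my assumptions, not part of the workflow itself.

```python
import re

# Hypothetical checker. Field names follow the HARD GATE list;
# the parsing approach (split on "|", then "KEY: value") is an assumption.
ALWAYS_REQUIRED = ("VALIDATION_BUNDLE", "WORKER_STARTUP_LOG", "GIT_COMMIT", "COMMIT_MSG")


def parse_fields(line: str) -> dict[str, str]:
    """Turn '<head> | KEY: value | KEY: value' into a dict (the head segment is skipped)."""
    fields: dict[str, str] = {}
    for part in line.split("|")[1:]:
        key, sep, value = part.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def hard_gate_gaps(line: str) -> list[str]:
    """Return every HARD GATE item the line fails; an empty list means DONE is allowed."""
    gaps: list[str] = []
    if not re.match(r"EVIDENCE \(RUN #\d+\):", line.strip()):
        gaps.append("EVIDENCE (RUN #n)")
    fields = parse_fields(line)
    scope = fields.get("SCOPE", "")
    if scope not in ("CLI", "GUI", "MIXED"):
        gaps.append("SCOPE")
    gaps += [key for key in ALWAYS_REQUIRED if not fields.get(key)]
    # Assumption: MIXED counts as "includes CLI".
    if scope in ("CLI", "MIXED") and not (
        fields.get("VALIDATED_CLI") and fields.get("EXIT_CODE") == "0"
    ):
        gaps.append("VALIDATED_CLI with EXIT_CODE: 0")
    if scope in ("GUI", "MIXED"):
        if fields.get("VALIDATED_GUI") != "MCP(playwright-mcp)":
            gaps.append("VALIDATED_GUI")
        gaps += [key for key in ("RUNBOOK", "SCREENSHOTS") if not fields.get(key)]
    if fields.get("RESULT") != "PASS":
        gaps.append("RESULT: PASS")
    if not (fields.get("DIFFSTAT") or fields.get("FILES")):
        gaps.append("DIFFSTAT or FILES")
    return gaps
```

One limitation of this sketch: a commit message or diffstat that itself contains `|` would be split incorrectly, so a real checker would need a stricter line format.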

## progress.txt format (Supervisor, append-only)

File: openspec/changes/<change-id>/progress.txt
Rule: Append-only. Never rewrite or reorder existing entries.

Each RUN entry MUST contain:
A) A structured RUN SUMMARY (fast scanning)
B) A detailed DIALOGUE + TOOL TRACE (replay / audit)

================================================================================
RUN ENTRY

[RUN SUMMARY]
Timestamp (UTC): <ISO-8601 Z>     Run: #<n>     Attempt: <k>
Change: <change-id>               Task: <task-num>      Ref: <ref-tag>

Status: DONE | FAIL | BLOCKED | ROLE_VIOLATION | NO_PROGRESS

Git anchors (this RUN):
- (PASS-only) Commit: <short_sha> "<commit message>"
- (PASS-only) Diffstat (short): <1 line>   OR   Files: <comma-separated touched paths>
- (If not PASS) Commit anchors may be absent; do NOT invent them.

Evidence pointers:
- tasks.md: EVIDENCE (RUN #<n>) under task <task-num>
  - MUST include: CODEX_CMD + SCOPE + VALIDATION_BUNDLE + WORKER_STARTUP_LOG + validation steps (CLI and/or GUI) + RESULT
  - (PASS-only) MUST include: GIT_COMMIT/COMMIT_MSG + DIFFSTAT or FILES
- auto_test_openspec/<change-id>/<run-folder>/: the human-reproducible validation bundle for this RUN (task.md + run scripts + assets + outputs/logs, including logs/worker_startup.txt)
- feature_list.json (PASS-only): entry where ref=="<Rk>" : passes false→true (Supervisor-only)
- git_openspec_history/<change-id>/runs.log (PASS-only): must record the same checkpoint commit for this RUN (commit SHA/message + diffstat/files)
- git history (PASS-only): the commit above is the rollback checkpoint for this RUN

--------------------------------------------------------------------------------
Optional (recommended) SESSION STARTUP ENTRY (once per session)

[SESSION STARTUP]
[Assistant] I'll start by getting my bearings and understanding the current state of the project.
[Tool Use] <read - openspec/changes/<id>/progress.txt>
[Tool Use] <read - openspec/changes/<id>/feature_list.json>
[Tool Use] <read - openspec/changes/<id>/tasks.md>
[Assistant] Let me check the git log to see recent work.
[Tool Use] <bash - CODEX_CMD "...">  (Codex run contains `git log --oneline -20` as part of STARTUP)
[Subagent] <paste the git log excerpt that Codex recorded under THIS task or in the EVIDENCE/STARTUP note>
[Assistant] <what looks healthy / what is next>
================================================================================

## Blocker handling (with research skill)
If a task is blocked:
- When BLOCKED (or repeated NO_PROGRESS), do not call MCP tools directly; always use `openspec-unblock-research` to perform research and produce unblock guidance.
  - The skill may use MCP tools (e.g. `web-search-prime`, `context7`, etc.) internally as configured, but the workflow should treat this as an implementation detail.
- Under the SAME task in `tasks.md`, add/refresh:
  `UNBLOCK GUIDANCE (RUN #n, Attempt #k): ...`
  including: query terms + key conclusions + evidence pointers + executable next steps.
- Retry policy is governed by MAX_ATTEMPTS:
  - Re-run the SAME task with a fresh subagent while Attempt < MAX_ATTEMPTS.
  - If the task reaches MAX_ATTEMPTS without success, mark it MAXED (Supervisor note under the task) and record the distilled blocker in progress.txt.
  - Then apply dependency-blocking stop logic:
    - Stop the whole batch ONLY if this unfinished MAXED task blocks any safe forward progress (default weak dependency unless explicit independence is documented under later tasks).
    - Otherwise, later tasks explicitly marked independent may proceed.

 ## Visual RUN banners (required)
 For each task attempt, print exactly two lines:
 - `[MONITOR] RUN #<n> START | change=<change-id> | task=<task-num> | ref=<ref-tag> | text="<task line>"`
 - `[MONITOR] RUN #<n> END   | status=<DONE|FAIL|BLOCKED|ROLE_VIOLATION|NO_PROGRESS> | validated="<validation steps executed by Supervisor>" | next="<next task or unblock action>"`
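
For reference, the two banner lines can be produced by a pair of tiny formatters. This is a sketch with hypothetical function names; only the banner layout and the status values are taken from the spec above.

```python
# Status values copied from the END banner definition above.
ALLOWED_STATUSES = {"DONE", "FAIL", "BLOCKED", "ROLE_VIOLATION", "NO_PROGRESS"}


def start_banner(run: int, change_id: str, task_num: str, ref_tag: str, task_line: str) -> str:
    """Format the START banner for one task attempt."""
    return (
        f"[MONITOR] RUN #{run} START | change={change_id} | task={task_num} "
        f'| ref={ref_tag} | text="{task_line}"'
    )


def end_banner(run: int, status: str, validated: str, next_step: str) -> str:
    """Format the END banner; reject statuses the spec does not list."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status}")
    return (
        f"[MONITOR] RUN #{run} END   | status={status} "
        f'| validated="{validated}" | next="{next_step}"'
    )
```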

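`.claude/commands/monitor-openspec-codex.md` (the automation core)

Create a new file named `monitor-openspec-codex.md` under:

Windows: %USERPROFILE%\.claude\commands
macOS/Linux: ~/.claude/commands

This is our "supervisor script". It defines how Claude Code calls Codex in an automated loop.

Create monitor-openspec-codex.md with the following content: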

---
description: Supervise an OpenSpec change in BATCH MODE. Iterates through unchecked tasks.md items sequentially via Codex CLI (codex exec). Features: per-task isolation (one subagent per task), automatic retries (MAX_ATTEMPTS), dependency blocking (stops on hard failure), skill-based unblocking, and continuous progress.txt logging.
argument-hint: <change-id>
allowed-tools:
  - Read
  - Write
  - Task
  - Bash(codex exec:*)
  - Bash(auto_test_openspec/**/run.sh)
  - Bash(auto_test_openspec/**/run.bat)

  # Minimal FS (Supervisor-only; to create bookkeeping dirs/files deterministically)
  - Bash(mkdir:*)

  # Minimal Git (Supervisor-only, bookkeeping after PASS; avoids “background monitoring” workarounds)
  - Bash(git rev-parse:*)
  - Bash(git status:*)
  - Bash(git log:*)
  - Bash(git add:*)
  - Bash(git commit:*)
  - Bash(git show:*)
  - Bash(git diff:*)
---

You are the SUPERVISOR. Follow this procedure in English only.

# Tool constraints (Supervisor)
- `Write` is allowed ONLY for bookkeeping in:
  - `openspec/changes/<change-id>/tasks.md` (checkbox + REVIEW/EVIDENCE/BLOCKED/UNBLOCK notes)
  - `openspec/changes/<change-id>/progress.txt` (append-only handoff log)
  - `openspec/changes/<change-id>/feature_list.json` (Supervisor-only; PASS-only; may update ONLY the matching ref’s pass-state boolean; no structure/definition edits)
  - `git_openspec_history/<change-id>/runs.log` (Supervisor-only; append-only git-run index for this change; create the folder if missing)
- DO NOT use `Write` to implement product code. All implementation MUST come from the Worker’s single `CODEX_CMD` run.

# Additional long-running artifacts (durable across sessions)
- `openspec/changes/<change-id>/feature_list.json` is the end-to-end feature checklist (pass/fail per stable ref tag).
  - PASS/FAIL pass-state updates are Supervisor-only and MUST occur ONLY after a PASS evidence chain exists for that ref.
- `openspec/changes/<change-id>/progress.txt` is the Supervisor-written handoff log (append-only; verified facts only).
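
The pass-state rule can be illustrated with a short Python sketch that flips `passes` for exactly one ref and leaves everything else alone. Treat the JSON shape as an assumption: I am assuming a top-level list of objects with `ref` and `passes` keys, and `mark_passed` is a hypothetical helper name.

```python
import json


def mark_passed(feature_json: str, ref: str) -> str:
    """Set passes=true for the single entry matching `ref`; touch nothing else.

    Assumes feature_list.json is a list of objects with "ref" and "passes" keys.
    """
    features = json.loads(feature_json)
    matches = [entry for entry in features if entry.get("ref") == ref]
    if len(matches) != 1:
        # Missing or duplicated refs need a human or initializer to repair the file.
        raise ValueError(f"expected exactly one entry for {ref}, found {len(matches)}")
    matches[0]["passes"] = True
    return json.dumps(features, indent=2, ensure_ascii=False)
```

Refusing to proceed on a missing or duplicated ref keeps the Supervisor from silently creating or rewriting entries, which the rules above forbid.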

# Single Codex command constant (maintain ONLY ONE copy)
CODEX_CMD = codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium

Inputs:
- change-id: $ARGUMENTS

Goal:
- Execute a BATCH LOOP over `openspec/changes/<change-id>/tasks.md`.
- Process tasks sequentially (top-to-bottom).
- For each unchecked task:
  1. Isolate execution (One Task = One Subagent = One Codex Run).
  2. Retry on failure up to MAX_ATTEMPTS (default: 2).
  3. Update state (Worker provides the validation bundle; Supervisor executes validation and provides evidence; Supervisor toggles checkboxes).

- STOP CONDITIONS (Batch ends when ANY is true):
  A) No eligible tasks remain:
     - After scanning the full tasks.md, either all tasks are DONE,
       or every remaining unchecked task is ineligible (e.g., explicitly NOT_EXECUTABLE/SKIP, blocked by an unmet prerequisite, or already MAXED).

  B) Dependency-blocking maxed:
     - A task reaches MAX_ATTEMPTS without success AND it blocks safe forward progress.
     - Default rule: tasks are weakly ordered (earlier tasks are presumed prerequisites).
       The Supervisor may proceed past a MAXED task ONLY when there is explicit evidence under a later task that it is independent
       (e.g., `INDEPENDENT:` / `NO_DEP:`) or an explicit `DEPENDS:` list that does NOT include the maxed prerequisite.
     - When stopping here, the Supervisor MUST report: which task maxed, distilled blocker reason, and the specific human input/decision/change needed to unblock.

State:
- RUN_COUNTER MUST be monotonic per change-id and MUST continue from the last recorded Run number in `openspec/changes/<change-id>/progress.txt` (do not reset to 1 across sessions).

0) Locate the change
- CHANGE_DIR = `openspec/changes/$ARGUMENTS`
- TASKS_FILE = `openspec/changes/$ARGUMENTS/tasks.md`
- FEATURE_FILE = `openspec/changes/$ARGUMENTS/feature_list.json`
- PROGRESS_FILE = `openspec/changes/$ARGUMENTS/progress.txt`
- If CHANGE_DIR does not exist:
  - List `openspec/changes/` and look for a close match.
  - If ambiguous, STOP and ask the user for the exact change-id.
- If TASKS_FILE does not exist:
  - STOP and ask the user to scaffold it.
- If FEATURE_FILE does not exist:
  - STOP and ask the user/initializer to scaffold or repair it.
  - NOTE: Worker/Codex is NOT allowed to create or rewrite feature_list.json.
- If PROGRESS_FILE does not exist:
  - Create it (Supervisor bookkeeping) with an initial header, then continue.
  - NOTE: Only do this when the file is missing (first run). Never overwrite or reset an existing progress.txt.


0.1) Restore session state (Supervisor; Read-only; no Bash)
- Read PROGRESS_FILE and derive RUN_COUNTER (monotonic per change-id):
  - If any prior entry contains `Run: #<n>`, set RUN_COUNTER = (max n) + 1
  - Else RUN_COUNTER = 1
- Read FEATURE_FILE (context only; do not edit).
- Proceed to task selection.
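
The RUN_COUNTER derivation in 0.1 boils down to a few lines. Here is a minimal Python sketch; the function name is hypothetical, and it assumes prior entries literally contain `Run: #<n>` as in the progress.txt format.

```python
import re


def next_run_counter(progress_text: str) -> int:
    """Return (max recorded run number) + 1, or 1 when no run has been recorded yet."""
    runs = [int(n) for n in re.findall(r"Run:\s*#(\d+)", progress_text)]
    return max(runs) + 1 if runs else 1
```

Taking the maximum rather than the last match keeps the counter monotonic even if entries were appended out of order.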

1) Batch session loop (one invocation = many task attempts, serial)
- Loop:
  - Read TASKS_FILE and select CURRENT_TASK using the eligibility rules in 1.1 (top-to-bottom).
  - If no eligible task exists -> STOP via stop condition (A) "No eligible tasks remain".

  - For CURRENT_TASK, run a per-task retry loop up to MAX_ATTEMPTS:
    - Let MAX_ATTEMPTS = 2 (or the configured constant in this command).
    - Let ATTEMPT be derived from PROGRESS_FILE (resumable across sessions; see 1.1).
    - While ATTEMPT <= MAX_ATTEMPTS:
      - Spawn EXACTLY ONE new subagent for this ONE task attempt (never bundle).
      - Supervisor verifies + books (explicit control flow; keep auto-retries):
        - Determine post-subagent status UNDER THIS task only:
          - READY_TO_VALIDATE if:
            - tasks.md contains exactly ONE `BUNDLE (RUN #<RUN_COUNTER>): ...` line for this attempt, and
            - the referenced run-folder exists and contains the required bundle assets (task.md + run.sh + run.bat + logs/; and if GUI/MIXED, tests/ with MCP runbook).
          - BLOCKED if tasks.md contains `BLOCKED:` + `NEEDS:` under this task.
          - ROLE_VIOLATION if the Worker wrote any `EVIDENCE (RUN #...)` / PASS/FAIL/RESULT/validated= conclusion, toggled any checkbox, or modified feature_list.json.
          - NO_PROGRESS otherwise.

        - If READY_TO_VALIDATE:
          - Execute validation as Supervisor:
            - CLI scope: run `auto_test_openspec/**/run.sh|run.bat` and capture logs/outputs (append-only in the run-folder).
            - GUI/MIXED scope:
              - run.* is start-server only (start the service and print URL/port),
              - execute `tests/gui_runbook_*.md` via MCP service `playwright-mcp` (no manual browser; no scripts),
              - capture evidence (at minimum screenshots + screenshots index under logs/; trace/video/console index optional).
          - Record result under THIS task (Supervisor-only):
            - Write ONE `EVIDENCE (RUN #<RUN_COUNTER>): ... | RESULT: PASS|FAIL | ...` line with evidence pointers.
          - If RESULT is PASS:
            - Toggle checkbox to `- [x]` (Supervisor only).
            - Append progress.txt entry (Status=DONE, Attempt=<k>, bundle + evidence pointers).
            - Continue the outer batch loop (pick next eligible task).   # explicit continue
          - If RESULT is FAIL:
            - Append progress.txt entry (Status=FAIL, Attempt=<k>, distilled blocker + evidence pointers).
            - If ATTEMPT < MAX_ATTEMPTS:
              - Add/refresh `UNBLOCK GUIDANCE (RUN #<RUN_COUNTER>): ...` under the SAME task in tasks.md (Supervisor only).
              - ATTEMPT += 1 and retry the SAME task with a fresh subagent.  # explicit retry
            - Else (ATTEMPT == MAX_ATTEMPTS):
              - Mark the task as MAXED (Supervisor note under task; do NOT check it):
                - `MAXED (RUN #<RUN_COUNTER>): <short reason>`
              - Enforce dependency-blocking stop logic:
                - If the Supervisor cannot safely proceed to any later unchecked task:
                  - STOP via stop condition (B) and report the required human unblock input.  # explicit stop
                - Else:
                  - Continue the outer batch loop.  # explicit continue
        - If BLOCKED / ROLE_VIOLATION / NO_PROGRESS:
          - Append progress.txt entry (Status=BLOCKED/ROLE_VIOLATION/NO_PROGRESS, Attempt=<k>, distilled blocker + next-step suggestion).
          - If ATTEMPT < MAX_ATTEMPTS:
            - Add/refresh `UNBLOCK GUIDANCE (RUN #<RUN_COUNTER>): ...` under the SAME task in tasks.md (Supervisor only).
            - ATTEMPT += 1 and retry the SAME task with a fresh subagent.   # explicit retry
          - Else (ATTEMPT == MAX_ATTEMPTS):
            - Mark the task as MAXED (Supervisor note under task; do NOT check it):
              - `MAXED (RUN #<RUN_COUNTER>): <short reason>`
            - Enforce dependency-blocking stop logic:
              - If the Supervisor cannot safely proceed to any later unchecked task:
                - STOP via stop condition (B) and report the required human unblock input.  # explicit stop
              - Else:
                - Continue the outer batch loop.  # explicit continue

- Terminate ONLY via stop conditions (A) or (B) (and "All tasks done" as a subset of A).
- Do NOT stop after a single task by default.
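
The control flow of the retry loop above is easier to see as a small decision function. This is a sketch under my own naming (`Action`, `decide`, `can_skip_maxed` are hypothetical); it only mirrors the branches described in the loop, not any real Claude Code API.

```python
from enum import Enum


class Action(Enum):
    NEXT_TASK = "continue the outer batch loop"
    RETRY = "retry the same task with a fresh subagent"
    STOP = "stop via condition (B)"


def decide(status: str, attempt: int, max_attempts: int = 2,
           can_skip_maxed: bool = False) -> tuple[Action, bool]:
    """Return (next action, whether to write a MAXED note under the task).

    `can_skip_maxed` stands for "some later unchecked task is safe to run",
    i.e. it carries explicit independence evidence.
    """
    if status == "DONE":
        return Action.NEXT_TASK, False
    # FAIL, BLOCKED, ROLE_VIOLATION and NO_PROGRESS all share the same retry policy.
    if attempt < max_attempts:
        return Action.RETRY, False
    return (Action.NEXT_TASK if can_skip_maxed else Action.STOP), True
```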

    1.1) Determine CURRENT_TASK (eligible + resumable attempts)
    - Read TASKS_FILE.
    - Scan tasks top-to-bottom and pick the FIRST unchecked checkbox item that is ELIGIBLE.
      - ELIGIBLE means ALL are true:
        - It is not explicitly marked NOT_EXECUTABLE / SKIP (by a Supervisor note under the task).
        - It is not already MAXED (i.e., previously reached MAX_ATTEMPTS without success).
        - It is not blocked by an earlier unmet prerequisite:
          - Default: tasks are weakly ordered; earlier unchecked/maxed tasks are presumed prerequisites.
          - Exception (allowed to proceed): the candidate task has an explicit independence marker under it
            (e.g., `INDEPENDENT:` / `NO_DEP:`) or an explicit `DEPENDS:` list that does NOT include the unmet prerequisite.
    - If no eligible unchecked task exists after the full scan:
      - Stop via "No eligible tasks remain" (stop condition A).

    - Capture:
      - TASK_LINE = the full checkbox line
      - TASK_NUM = e.g., `1.1` if present, else `?`
      - REF_TAG = e.g., `[#R1]` if present, else `[]`

    - Derive ATTEMPT counter for this task (resumable across sessions):
      - Read PROGRESS_FILE and find prior RUN entries where `Task: <task-num>` matches TASK_NUM.
      - Let ATTEMPT = (max recorded Attempt for this TASK_NUM) + 1, else 1 if none exist.
      - Note: Attempt is per-task (not per-session). RUN_COUNTER remains global monotonic.

    - Lock scope (per-task atomicity):
      - For the duration of the upcoming subagent/Codex run, the Worker MUST work ONLY on this CURRENT_TASK.
      - After the subagent returns, the Supervisor may select the next eligible task and spawn a new subagent.
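
The eligibility scan in 1.1 can be sketched in Python as follows. Several details are assumptions on my side: notes are taken to be the lines between one checkbox and the next, markers are matched by prefix, and a `DEPENDS:` list is assumed to be comma-separated task numbers. All names (`Task`, `parse_tasks`, `pick_current_task`) are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

CHECKBOX = re.compile(r"^\s*- \[([ xX])\]\s+(.*)$")


@dataclass
class Task:
    text: str                      # checkbox text without the "- [ ]" prefix
    checked: bool
    notes: list[str] = field(default_factory=list)  # stripped lines under the task

    @property
    def num(self) -> str:
        m = re.match(r"\d+(?:\.\d+)*", self.text)
        return m.group(0) if m else "?"

    @property
    def ref(self) -> str:
        m = re.search(r"\[#R\d+\]", self.text)
        return m.group(0) if m else "[]"

    def has_note(self, *prefixes: str) -> bool:
        return any(note.startswith(prefixes) for note in self.notes)


def parse_tasks(markdown: str) -> list[Task]:
    tasks: list[Task] = []
    for line in markdown.splitlines():
        m = CHECKBOX.match(line)
        if m:
            tasks.append(Task(text=m.group(2), checked=m.group(1) != " "))
        elif tasks:
            tasks[-1].notes.append(line.strip())
    return tasks


def is_blocked(task: Task, unmet: list[str]) -> bool:
    if not unmet or task.has_note("INDEPENDENT:", "NO_DEP:"):
        return False
    for note in task.notes:
        if note.startswith("DEPENDS:"):
            deps = {d.strip() for d in note[len("DEPENDS:"):].split(",")}
            return any(num in deps for num in unmet)
    return True  # default weak ordering: earlier unfinished tasks are prerequisites


def pick_current_task(tasks: list[Task]) -> Optional[Task]:
    unmet: list[str] = []  # task numbers of earlier unchecked tasks
    for task in tasks:
        if task.checked:
            continue
        parked = task.has_note("NOT_EXECUTABLE", "SKIP", "MAXED")
        if not parked and not is_blocked(task, unmet):
            return task
        unmet.append(task.num)
    return None  # stop condition (A): no eligible tasks remain
```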

  1.2) Print RUN banner (START)
  Output exactly:
  `[MONITOR] RUN #<RUN_COUNTER> START | change=$ARGUMENTS | task=<TASK_NUM> | ref=<REF_TAG> | text="<TASK_LINE>"`

  1.3) Spawn ONE subagent for CURRENT_TASK
  Use the Task tool to spawn a NEW subagent (e.g., name it "codex-worker").
- The Supervisor MUST NOT run Bash for implementation work (coding/build steps).
- The Supervisor MAY run Bash ONLY for:
  - executing the validation bundle entrypoint (`auto_test_openspec/**/run.sh|run.bat`) to capture auditable outputs/logs
  - minimal Git bookkeeping after PASS (commit + show/diffstat), as explicitly allowed in `allowed-tools`
- Any GUI steps MUST be executed ONLY via MCP service `playwright-mcp` (no manual browser; no Python/Node/Playwright scripts).
  
  IMPORTANT: Explicitly instruct the subagent that manual file editing is banned. 
  Tell the subagent: "I will reject any work that does not produce a `codex exec` execution log. Do not try to edit files directly."

Subagent instructions (copy verbatim):
---
You are the CODEX CLI OPERATOR. Your ONLY job is to run Codex CLI exactly once and report results. You are NOT a software engineer.

MISSION: You must force the `codex` CLI tool to perform the work.
NON-NEGOTIABLE RULE: You are FORBIDDEN from using `Write`, `Edit`, or `Replace` tools on project files. You have NO permission to edit code manually.
TOOLS:
- You MAY use the Read tool to inspect files (tasks.md / progress.txt / feature_list.json).
- You MUST invoke the Bash tool exactly once, and that single invocation MUST be CODEX_CMD.
- You are FORBIDDEN from using Write/Edit/Replace on project files.

Execution Steps (Do exactly this):
1. Read (Read tool, not Bash):
   - `openspec/changes/$ARGUMENTS/tasks.md`
   - `openspec/changes/$ARGUMENTS/progress.txt`
   - `openspec/changes/$ARGUMENTS/feature_list.json`
2. Construct a prompt for the CLI using the template below.
3. Run exactly ONE Bash command:
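   codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium "$(cat <<'PROMPT'
   <INLINE_PROMPT>
   PROMPT
   )"
   (In the real command, the closing `PROMPT` delimiter must sit alone at the start of its own line.)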
4. Verify the CLI updated `tasks.md` under THIS task ONLY (no checkbox toggles).
   Verify the Worker output is BUNDLE-ready (and ONLY bundle-ready):
   - Under THIS task, there is EXACTLY ONE single-line `BUNDLE (RUN #<RUN_COUNTER>): ...` pointer that targets a concrete run-folder:
     - includes `CODEX_CMD=codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium`
     - includes `SCOPE: <CLI|GUI|MIXED>`
     - includes `VALIDATION_BUNDLE: auto_test_openspec/$ARGUMENTS/run-.../`
     - includes `HOW_TO_RUN: run.sh/run.bat`
     - if SCOPE includes GUI: includes `RUNBOOK: tests/gui_runbook_*.md`
   - The referenced run-folder exists and contains at minimum:
     - `task.md`, `run.sh`, `run.bat`,
     - `logs/worker_startup.txt` (mandatory startup snapshot),
     - and (when GUI/MIXED) `tests/` containing an MCP-only runbook (no scripts).
   - The Worker did NOT:
     - write any `EVIDENCE (RUN #...)` line
     - write PASS/FAIL/RESULT/validated= conclusions
     - toggle any checkbox
   Also verify governance constraints:
   - `feature_list.json` MUST NOT be modified by the Worker (neither entries nor pass-state).
   - No git commit is expected/allowed from the Worker.
   - If the CLI violated any of the above, report failure.

<INLINE_PROMPT> Template (fill variables):

(Shared setup)
- change-id: $ARGUMENTS
- include the exact TASK_LINE text (verbatim)
- state explicitly: "Implement ONLY this task (no other tasks, no refactors outside scope)."
- require full validation per the task’s `TEST:` and the canonical spec:
  - Follow `openspec/project.md` → `## tasks.md Checklist Format` → `### Validation bundle requirements (mandatory)`
  - Produce a human-reproducible validation bundle under:
    `auto_test_openspec/$ARGUMENTS/<run-folder>/`
  - Worker MAY run quick local checks to ensure the bundle is runnable,
    but MUST NOT claim PASS/FAIL/validated (Supervisor is the final verifier).

A) Worker deliverables (validation bundle assets)
- Create a NEW run-folder (append-only; never overwrite prior runs):
  `auto_test_openspec/$ARGUMENTS/run-<RUN4>__task-<TASK_ID>__ref-<REF>__<YYYYMMDDThhmmssZ>/`
- Minimum required files inside the run-folder:
  - `task.md` (self-sufficient README; includes How-to-run + machine-decidable pass/fail criteria)
  - `run.sh` and `run.bat`
  - `logs/worker_startup.txt` (MANDATORY; see Startup ritual below)
  - `logs/` (for provenance + transcripts; keep append-only within this run folder)
  - If SCOPE includes GUI/MIXED: `tests/gui_runbook_*.md` (MCP-only runbook; no executable browser scripts)
  - If the task needs inputs/expected: include `inputs/`, `expected/`, and write outputs into `outputs/` (never temp dirs)
- GUI/MIXED server-start contract (MANDATORY):
  - `task.md` MUST include a dedicated section with EXACT, copy/paste-able commands:
    - `SERVER_START:` <exact command to start the server>
    - `SERVER_URL:` <exact URL Supervisor should navigate to, including host + port>
    - `READY_CHECK:` <a concrete readiness check (endpoint or observable signal)>
  - For GUI/MIXED, `run.sh` / `run.bat` MUST implement `SERVER_START`:
    - MUST start the local server and print the `SERVER_URL` to stdout.
    - MUST NOT perform validation (no PASS/FAIL claims); start-server only.

- Environment isolation (mandatory ONLY if env problems occur):
  - DO NOT install Python deps globally.
  - If missing deps / conflicts prevent execution, create an isolated venv via `uv` inside THIS run folder
    (e.g., `<run-folder>/.venv/`) and ensure `run.sh`/`run.bat` uses it.
  - Log provenance into `logs/` (always): python path+version, uv version, dependency source, exact install commands.
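
As an illustration of the run-folder naming pattern in section A, here is a small Python sketch. Two things are my guesses and not stated in the spec: that `<RUN4>` means the run number zero-padded to four digits, and that `<REF>` is the ref tag without brackets (for example `R1`). The function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def run_folder(change_id: str, run: int, task_id: str, ref: str,
               now: Optional[datetime] = None) -> str:
    """Build the run-folder path following the naming pattern.

    Assumptions: <RUN4> is a 4-digit zero-padded run number; <REF> is e.g. "R1".
    """
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")
    return f"auto_test_openspec/{change_id}/run-{run:04d}__task-{task_id}__ref-{ref}__{stamp}/"
```

Because the UTC timestamp is part of the name, two attempts at the same task land in different folders, which is what the "never overwrite prior runs" rule needs.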

A.1) Startup ritual (MANDATORY, before any edits)
- REQUIRE Codex STARTUP RITUAL:
  - read `openspec/changes/$ARGUMENTS/progress.txt`
  - read `openspec/changes/$ARGUMENTS/feature_list.json`
  - run `git log --oneline -20`
  - capture `GIT_BASE` via `git rev-parse --short HEAD`
  - write a Startup snapshot to the validation bundle (NOT tasks.md), at:
    - `auto_test_openspec/$ARGUMENTS/<run-folder>/logs/worker_startup.txt`
  - The snapshot MUST include: UTC timestamp, CODEX_CMD, GIT_BASE, the git-log excerpt, and a short “what I observed” summary.

B) tasks.md bookkeeping (Worker-owned; single-line; NO conclusions)
- require Codex to update `openspec/changes/$ARGUMENTS/tasks.md` under THIS task with exactly ONE Worker bookkeeping line (NOT EVIDENCE):
  - starting with: `BUNDLE (RUN #<RUN_COUNTER>): ...`
  - MUST be a SINGLE LINE
  - MUST NOT write any `EVIDENCE (RUN #...)` line
  - MUST NOT write any PASS/FAIL/RESULT/validated= conclusions
- The single BUNDLE line MUST include ONLY:
  - `CODEX_CMD=codex exec --full-auto --skip-git-repo-check --model gpt-5.2 -c model_reasoning_effort=medium`
  - `SCOPE: <CLI|GUI|MIXED>`
  - `VALIDATION_BUNDLE: auto_test_openspec/$ARGUMENTS/run-<RUN4>__task-<TASK_NUM>__ref-<REF>__<YYYYMMDDThhmmssZ>`
  - `HOW_TO_RUN: run.sh/run.bat`
  - (if SCOPE=GUI or MIXED) `RUNBOOK: tests/gui_runbook_*.md`
  - (if SCOPE=GUI or MIXED) `SERVER_URL: <exact url including host+port>`
- forbid Codex from toggling ANY checkbox in tasks.md.

C) GUI hard rules (only if SCOPE includes GUI/MIXED)
- GUI verification is Supervisor-only via MCP service `playwright-mcp`.
- Worker deliverable for GUI is ONLY the MCP runbook file:
  - `tests/gui_runbook_*.md` MUST be MCP-only steps + selectors + assertion points + evidence capture points.
  - ABSOLUTELY NO executable browser automation scripts (no Playwright test runner; no Python/Node scripts).
  - ABSOLUTELY NO manual browser steps anywhere (no “open Chrome/click …” prose, anywhere in the bundle).
- For GUI/MIXED bundles, `run.sh` / `run.bat` MUST be start-server only:
  - MUST start the local server and print URL/port.
  - MUST NOT perform state seeding/copying/exporting/testing/validation/probing/installs.

D) Governance boundaries (Worker forbidden; Supervisor-only)
- feature_list governance (MANDATORY; strict):
  - The Worker/Codex is FORBIDDEN to edit `openspec/changes/$ARGUMENTS/feature_list.json` (no entry edits, no pass-state edits, no formatting churn).
  - If `openspec/changes/$ARGUMENTS/feature_list.json` is missing OR the matching ref entry is missing:
    - Under THIS task write:
      BLOCKED: Missing feature_list.json (or missing ref entry for <REF_TAG>)
      NEEDS: Supervisor/initializer must create/repair feature_list.json (structure + ref mapping). Then re-run this task.
    - Then END THIS WORKER RUN immediately (do not proceed with implementation in this run).
  - Pass-state updates (e.g., `passes=true/false`) are Supervisor-only and may occur ONLY after Supervisor validation PASS + EVIDENCE is recorded.
- forbid touching any other tasks (no evidence elsewhere; no changes to other items)
- governance boundary (Worker/Codex; mandatory):
  - The Worker/Codex is FORBIDDEN to create git commits (no checkpoint commits).
  - The Worker/Codex is FORBIDDEN to edit or append `git_openspec_history/<change-id>/runs.log`.
  - The Worker/Codex MUST NOT attempt to produce DIFFSTAT/FILES “final” summaries as evidence.
  - All commit/runs.log bookkeeping (and DIFFSTAT capture) is Supervisor-only and may occur ONLY after Supervisor validation PASS.

3) After Codex finishes, confirm that `openspec/changes/$ARGUMENTS/tasks.md` has either:

BUNDLE-READY (Worker output, under THIS task):
  - EXACTLY ONE `BUNDLE (RUN #<RUN_COUNTER>): ...` line that points to a concrete run-folder:
    - includes `VALIDATION_BUNDLE: auto_test_openspec/$ARGUMENTS/run-.../`
    - includes `HOW_TO_RUN: run.sh/run.bat`
    - if SCOPE includes GUI/MIXED: includes `RUNBOOK: tests/gui_runbook_*.md`
    - if SCOPE includes GUI/MIXED: includes `SERVER_URL: ...`
  - The referenced run-folder exists and contains at minimum:
    - `task.md`, `run.sh`, `run.bat`, `logs/worker_startup.txt`,
    - and (when GUI/MIXED) `tests/` with an MCP-only runbook
  - For GUI/MIXED, `task.md` MUST include `SERVER_START:` + `SERVER_URL:` + `READY_CHECK:` (as defined above).
  - Worker MUST NOT have written any `EVIDENCE (RUN #...)` line.
  - Worker MUST NOT have toggled any checkbox.
  - Worker MUST NOT have edited feature_list.json.
  - Worker MUST NOT have created any git commit.
  - Worker MUST NOT have edited `git_openspec_history/<change-id>/runs.log`.

OR BLOCKED (Worker output, under THIS task):
  - `BLOCKED: ...` (1–5 line error excerpt)
  - `NEEDS: ...` (next concrete unblock step)

OR ROLE_VIOLATION (Worker output, under THIS task):
  - Any `EVIDENCE (RUN #...)` / PASS/FAIL/RESULT/validated= conclusion, checkbox toggle, feature_list.json edit, git commit,
    or any edit/append to `git_openspec_history/<change-id>/runs.log`.

Otherwise treat as NO_PROGRESS (missing BUNDLE line and/or missing run-folder).

1.4) Supervisor verification after subagent returns
- Re-read TASKS_FILE.
- Determine status (under THIS task only):

  - READY_TO_VALIDATE if a compliant BUNDLE (RUN #<RUN_COUNTER>) line exists and the referenced run-folder is present and well-formed.
  - BLOCKED if BLOCKED+NEEDS exists.
  - ROLE_VIOLATION if Worker wrote any EVIDENCE/PASS/FAIL/RESULT/validated= conclusion, toggled checkboxes, edited feature_list.json, created commits,
    or edited/appended `git_openspec_history/<change-id>/runs.log`.
  - NO_PROGRESS otherwise.

- If READY_TO_VALIDATE:
  - Supervisor MUST execute validation.
    - CLI: via `run.sh`/`run.bat` as specified in the bundle.
    - GUI/MIXED:
      1) MUST start the server first by running `run.sh`/`run.bat` (start-server only).
      2) MUST navigate using the `SERVER_URL` provided in the BUNDLE line / task.md.
      3) Then execute the MCP `playwright-mcp` runbook.
      4) If the server cannot be started or `SERVER_URL` is missing/invalid, treat as bundle not ready for validation (NO_PROGRESS or BLOCKED with NEEDS), not as a feature FAIL.
  - Supervisor writes the single EVIDENCE (RUN #<RUN_COUNTER>) line (PASS/FAIL + evidence pointers).
  - Supervisor updates feature_list.json pass-state ONLY after PASS.
  - Supervisor creates ONE checkpoint commit ONLY after PASS.
  - Supervisor appends runs.log ONLY after PASS.
  - Supervisor may then toggle the checkbox to - [x] ONLY after PASS.

- DONE is reachable only after Supervisor validation PASS + compliant EVIDENCE exists under THIS task.

If DONE:
- Toggle the checkbox to `- [x]` (Supervisor only).
- Append a FULL RUN ENTRY to PROGRESS_FILE (Supervisor only; verified facts only) including:
  - RUN SUMMARY (timestamp, run #, change-id, task/ref, status)
  - Evidence pointers (tasks.md evidence line pointer + feature_list passes change + GIT_BASE/GIT_COMMIT/COMMIT_MSG)
  - Validation commands/steps + 3–15 lines output excerpt (from Supervisor validation output and/or bundle logs)
  - Changes verified: FILES/DIFFSTAT + key edits summary
  - [DIALOGUE + TOOL TRACE] with bracket markers, including:
    - [Supervisor → Subagent] instruction
    - [Tool Use] <task - spawn subagent>
    - [Tool Use] <bash - CODEX_CMD "..."> (from subagent trace)
    - [Subagent] reported outputs + the exact BUNDLE line + bundle folder pointer(s)
    - [Supervisor] the exact EVIDENCE line + acceptance decision + rationale
- Print RUN banner (END) as before.
- RUN_COUNTER += 1 and continue/stop per your session policy.

If BLOCKED:
- Ensure actionable NEEDS exists (next concrete unblock step).

- Call skill `openspec-unblock-research` (Supervisor-only). Do NOT call MCP tools directly here.
  - Provide the skill the BLOCKED context (task line + ref, error excerpt, NEEDS, what was tried, env/versions if known).
  - Instruct the skill to write its portable research capsule into BOTH bookkeeping artifacts:
    (a) Under THIS task in tasks.md:
        Add `UNBLOCK GUIDANCE (RUN #<RUN_COUNTER>):` containing:
        - Query terms
        - Key conclusions
        - Evidence pointers (source links/locators)
        - Executable next steps + how to verify
    (b) Into progress.txt (inside the current RUN entry):
        Append a short “Unblock Research Capsule” containing:
        - Query terms
        - Key conclusions
        - Evidence pointers
        - Pointer back to the tasks.md UNBLOCK GUIDANCE location

- Append a FULL RUN ENTRY to PROGRESS_FILE capturing blocker + the skill’s capsule + retry decision (verified facts only).
- Retry once as before; if blocked again, STOP and require user/initializer intervention.

If NO_PROGRESS:
- Treat as a FAILED ATTEMPT (not an immediate session stop by default).
- Under THIS task, append/refresh a single diagnostic note:
  `BLOCKED: Missing a compliant BUNDLE pointer and/or the referenced validation bundle folder is missing/incomplete for this RUN (workflow non-compliance).`
  `NEEDS: Re-run SAME task; Worker/Codex must (1) create a fresh run-folder under auto_test_openspec/<change-id>/... containing task.md + run.sh + run.bat + logs/worker_startup.txt (+ tests/runbook if GUI), and (2) append EXACTLY ONE single-line BUNDLE (RUN #<RUN_COUNTER>) pointer under THIS task (CODEX_CMD + SCOPE + VALIDATION_BUNDLE + HOW_TO_RUN [+ RUNBOOK]).`
- Append a FULL RUN ENTRY to PROGRESS_FILE (status=NO_PROGRESS) including:
  - the missing-gate diagnosis,
  - the subagent trace,
  - Attempt #k and the retry/maxed decision.
- Flow control MUST follow the per-task retry policy:
  - If Attempt #k < MAX_ATTEMPTS: continue the retry loop for the SAME task (fresh subagent).
  - Else (Attempt #k == MAX_ATTEMPTS): mark the task MAXED and apply dependency-blocking stop logic (stop only if it blocks safe forward progress).

2) Completion (only at start-of-session, or if CURRENT_TASK selection finds none)
- If no unchecked tasks remain:
  `[MONITOR] DONE | change=$ARGUMENTS | all tasks checked`
  then STOP.
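
The status rules in step 1.4 can be sketched as a small classifier. This is only an illustration and is not part of the command file. The function names, the regular expressions, and the assumption that `worker_lines` contains only the text the Worker added under the current task in this run are all mine. The required file names come from the BUNDLE-READY checklist above. The command file does not say which status wins when several conditions hold, so checking ROLE_VIOLATION first is also an assumption.

```python
import re
from pathlib import Path

# Minimum run-folder contents, taken from the BUNDLE-READY checklist.
REQUIRED_FILES = ("task.md", "run.sh", "run.bat", "logs/worker_startup.txt")

# Hypothetical line patterns; the real formats are whatever the command file enforces.
BUNDLE_RE = re.compile(r"^\s*BUNDLE \(RUN #(\d+)\):.*VALIDATION_BUNDLE:\s*(\S+)", re.M)
EVIDENCE_RE = re.compile(r"^\s*EVIDENCE \(RUN #\d+\)", re.M)
BLOCKED_RE = re.compile(r"^\s*BLOCKED:", re.M)
NEEDS_RE = re.compile(r"^\s*NEEDS:", re.M)


def bundle_is_well_formed(run_folder: Path, gui: bool = False) -> bool:
    """True if the run-folder has the minimum files (plus tests/ for GUI/MIXED)."""
    if not all((run_folder / name).is_file() for name in REQUIRED_FILES):
        return False
    return (run_folder / "tests").is_dir() if gui else True


def classify_worker_output(worker_lines: str, run_counter: int, repo_root: Path,
                           touched_governed_files: bool = False,
                           gui: bool = False) -> str:
    """Map what the Worker added under one task to a Supervisor status."""
    # Simplified role check: an EVIDENCE line, or a flag the caller sets when the
    # Worker edited feature_list.json, runs.log, checkboxes, or created a commit.
    if touched_governed_files or EVIDENCE_RE.search(worker_lines):
        return "ROLE_VIOLATION"
    bundles = [m for m in BUNDLE_RE.finditer(worker_lines)
               if int(m.group(1)) == run_counter]
    if len(bundles) == 1:
        folder = repo_root / bundles[0].group(2).rstrip(",;|")
        if bundle_is_well_formed(folder, gui):
            return "READY_TO_VALIDATE"
    if BLOCKED_RE.search(worker_lines) and NEEDS_RE.search(worker_lines):
        return "BLOCKED"
    return "NO_PROGRESS"
```

A READY_TO_VALIDATE result only means the bundle is ready for the Supervisor to run. The PASS/FAIL decision still comes from actually executing `run.sh`/`run.bat` (and the runbook for GUI/MIXED scopes).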

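## Workflow

### Initial setup

First, install Claude Code and codex; I won't walk through that here. The openspec install does need a note: it is best to pin version 0.19.0. Newer versions restructured the openspec workflow so that it supports natural-language invocation and is triggered through skills[^3]. I plan to adapt this setup to the latest openspec later.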

`npm install -g @fission-ai/openspec@0.19.0`

Then initialize the project with openspec:

`openspec init`

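Next steps - copy these prompts to codex:
────────────────────────────────────────────────────────────
1. Populate your project context:
"Please read openspec/project.md and help me fill it out
with details about my project, tech stack, and conventions"

2. Create your first change proposal:
"I want to add [YOUR FEATURE HERE]. Please create an
OpenSpec change proposal for this feature"

3. Learn the OpenSpec workflow:
"Please explain the OpenSpec workflow from openspec/AGENTS.md
and how I should work with you on this project"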

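### Repeated workflow

Open codex first and describe a change proposal in natural language, for example: "Add automatic night mode switching to this project."

Then run the skill `$openspec-change-interviewer <id>` so the model interviews you to pin down and align on the requirements. `<id>` is the folder name of the current proposal under the `openspec` folder.

Next, run `$openspec-feature-list <id>` so the model generates a `feature_list.json`.

Finally, open Claude Code and enter `/monitor-openspec-codex <id>`.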

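### Actual usage workflow

**1. Install openspec (I recommend pinning 0.19.0)**

I strongly recommend 0.19.0. Later versions restructured the workflow; they still support natural-language invocation, but it goes through skills[^3]. I plan to adapt to the latest version later.
```
npm install -g @fission-ai/openspec@0.19.0
```

**2. Initialize the project**
```
openspec init
```
Once initialization finishes, it prints the suggested next steps. Fill in the project context first, then create the first change proposal.

**3. Propose a change with codex (natural language is fine)**

For example: "Add automatic night mode switching to this project."

**4. Use a skill to interview the requirements into shape**

Aligning on requirements is well worth the time. Let the model ask its questions before it starts working, and there will be much less rework later.
- Run openspec-change-interviewer: `$openspec-change-interviewer <id>`
- `<id>` is the folder name of the current proposal under the `openspec` folder

**5. Generate feature_list.json**
- Run: `$openspec-feature-list <id>`
- Once this is done, the file is what later guards against work that looks finished but never actually passed validation

**6. Start supervised execution: hand over to Claude Code**

Finally, open Claude Code and enter:
- `/monitor-openspec-codex <id>`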
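
Before or after a monitor session, it can help to see which features are still unverified. Below is a minimal sketch. It assumes `feature_list.json` is a JSON array of objects, each with a `ref` key (a hypothetical name) next to the `passes` boolean described earlier. The real layout is whatever `$openspec-feature-list` generates, so adjust the keys to match your file.

```python
import json
from pathlib import Path


def pending_features(feature_list_path: Path) -> list[str]:
    """Return the refs of features whose `passes` flag is not yet true."""
    data = json.loads(feature_list_path.read_text(encoding="utf-8"))
    # Assumed layout: a top-level array of feature objects.
    # A missing `passes` key is treated as "not passed".
    return [item["ref"] for item in data if item.get("passes") is not True]
```

This script only reads the file. Under the workflow's rules, flipping `passes` remains a Supervisor-only action after a validated PASS.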

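## References

- Author: 她笑中藏泪花
- Title: Claude Code Supervising Codex: A Practical Framework for Reproducible Acceptance and Drift Prevention
- URL: https://rosetears.cn/archives/85/
- Copyright: unless otherwise noted, this article is original to Rosetear's blog; please keep the source link when reposting.

Last modified: February 3, 2026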
