Openclaw Testing

# OpenClaw Testing

Use this skill when deciding what to test, debugging failures, rerunning CI, or validating a change without wasting hours.

## Read First

- `docs/reference/test.md` for local test commands.
- `docs/ci.md` for CI scope, release checks, Docker chunks, and runner behavior.
- Scoped `AGENTS.md` files before editing code under a subtree.

## Default Rule

Prove the touched surface first. Do not reflexively run the whole suite.

1. Inspect the diff and classify the touched surface:
   - normal source checkout, source change: `pnpm changed:lanes --json`, then `pnpm check:changed` (delegates to Crabbox/Testbox)
   - normal source checkout, tests only: `pnpm test:changed`
   - normal source checkout, one failing file: `pnpm test <path-or-filter> -- --reporter=verbose`
   - Codex worktree or linked/sparse checkout, one/few explicit files: `node scripts/run-vitest.mjs <path-or-filter>`
   - Codex worktree or linked/sparse checkout, changed gates or anything broad: use the Crabbox wrapper with the provider that matches the proof surface. For maintainer heavy `pnpm` gates, that is usually delegated Blacksmith Testbox through Crabbox, e.g. `node scripts/crabbox-wrapper.mjs run --provider blacksmith-testbox ... -- env OPENCLAW_CHECK_CHANGED_REMOTE_CHILD=1 OPENCLAW_CHANGED_LANES_RAW_SYNC=1 corepack pnpm check:changed`. For direct AWS Crabbox proof, omit `--provider` and let `.crabbox.yaml` choose AWS.
   - workflow-only: `git diff --check`, workflow syntax/lint (`actionlint` when available)
   - docs-only: `pnpm docs:list`, docs formatter/lint only if docs tooling changed or requested
2. Reproduce narrowly before fixing.
3. Fix root cause.
4. Rerun the same narrow proof.
5. Broaden only when the touched contract demands it.

## Guardrails

- Do not kill unrelated processes or tests. If something is running elsewhere, treat it as owned by the user or another agent.
- Do not run expensive local Docker, full release checks, full `pnpm test`, or full `pnpm check` unless the user asks or the change genuinely requires it.
- Prefer GitHub Actions for release/Docker proof when the workflow already has the prepared image and secrets.
- Use `scripts/committer "<msg>" <paths...>` when committing; stage only your files.
- If deps are missing, run `pnpm install`, retry once, then report the first actionable error.
- In a Codex worktree or linked/sparse checkout, do not run direct local `pnpm test*`, `pnpm check*`, `pnpm crabbox:run`, or `scripts/committer` until you have verified pnpm will not reconcile or reinstall dependencies. Use `node scripts/run-vitest.mjs` for tiny local proof, `node scripts/crabbox-wrapper.mjs` for Testbox, and `git commit --no-verify` only after the relevant remote or node-wrapper proof is already clean.
- For remote proof, use the Crabbox wrapper first, but name the actual backend. Direct AWS Crabbox uses `provider=aws` and `cbx_...` ids. Delegated Blacksmith Testbox through Crabbox uses `provider=blacksmith-testbox`, `syncDelegated=true`, and `tbx_...` ids. Both satisfy "remote proof" when the requested proof surface allows either.
- Do not infer "no Testbox is running" from plain `blacksmith testbox list`. Use `blacksmith testbox list --all` or `blacksmith testbox status <tbx_id>` before reporting cloud state.
- Reuse only an id/slug created in this operator session unless explicitly coordinating with another lane. If Testbox queues, fails capacity, or cannot allocate, report the blocker or switch to direct AWS Crabbox only when that still proves the requested surface.
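
The Default Rule classification and the id conventions in the Guardrails can be sketched as two small lookup functions. This is only an illustration: `pick_proof_command` and `backend_for_id` are hypothetical helper names that do not exist in the repo, the `normal`/`worktree` and `source`/`tests`/`file` labels are shorthand invented for this sketch, and the functions only print a suggestion rather than running anything.

```bash
# Hypothetical helpers (not part of the OpenClaw repo); they only print text.

# Print the first proof command the Default Rule suggests.
# Arguments: <checkout> (normal|worktree)  <surface> (source|tests|file|broad)
pick_proof_command() {
  case "$1:$2" in
    normal:source)  echo 'pnpm changed:lanes --json, then pnpm check:changed' ;;
    normal:tests)   echo 'pnpm test:changed' ;;
    normal:file)    echo 'pnpm test <path-or-filter> -- --reporter=verbose' ;;
    worktree:file)  echo 'node scripts/run-vitest.mjs <path-or-filter>' ;;
    # Anything broader in a worktree goes through the Crabbox wrapper.
    worktree:*)     echo 'node scripts/crabbox-wrapper.mjs (pick the provider for the proof surface)' ;;
    *)              echo "unclassified surface: $1:$2" >&2; return 1 ;;
  esac
}

# Name the remote backend from an id prefix, as the Guardrails describe.
backend_for_id() {
  case "$1" in
    cbx_*) echo 'aws' ;;
    tbx_*) echo 'blacksmith-testbox' ;;
    *)     echo 'unknown'; return 1 ;;
  esac
}

pick_proof_command normal tests   # prints: pnpm test:changed
```

Keeping the mapping as a plain `case` table makes it easy to see that a worktree never falls through to a direct `pnpm` command.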
## Local Test Shortcuts

```bash
pnpm changed:lanes --json
pnpm check:changed # Crabbox/Testbox changed typecheck/lint/guards; no Vitest
pnpm test:changed # cheap smart changed Vitest targets
pnpm verify # full check, then full Vitest
OPENCLAW_TEST_CHANGED_BROAD=1 pnpm test:changed
pnpm test <path-or-filter> -- --reporter=verbose
OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test <path-or-filter>
```

Use targeted file paths whenever possible. Avoid raw `vitest`; use the repo `pnpm test` wrapper so project routing, workers, and setup stay correct. If raw Vitest is unavoidable, use `vitest run ...`; bare `vitest ...` starts local watch mode and will not exit on its own.

When the checkout is a Codex worktree, prefer the direct node harness instead:

```bash
node scripts/run-vitest.mjs <path-or-filter>
```

That keeps the test scoped without giving pnpm a chance to run dependency status checks or install reconciliation in a linked worktree.

## Command Semantics

- `pnpm check` and `pnpm check:changed` do not run Vitest tests. They are for typecheck, lint, and guard proof.
- `pnpm test` and `pnpm test:changed` run Vitest tests.
- `pnpm verify` runs `pnpm check`, then `pnpm test`, with Crabbox phase markers so remote summaries show which half failed.
- `pnpm test:changed` is intentionally cheap by default: direct test edits, sibling tests, explicit source mappings, and import-graph dependents.
- `OPENCLAW_TEST_CHANGED_BROAD=1 pnpm test:changed` is the explicit broad fallback for harness/config/package edits that genuinely need it.
- Do not run extension sweeps just because core changed. If a core edit is for a specific plugin bug, run that plugin's tests explicitly. If a public SDK or contract change needs consumer proof, choose the smallest representative plugin/contract tests first, then broaden only when the risk justifies it.
- The test wrapper prints a short `[test] passed|failed|skipped ... in ...` line. Vitest's own duration is still the per-shard detail.
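
As a quick reminder of which command proves what, the Command Semantics list can be condensed into a lookup. This is a sketch only: `proof_kind` is a hypothetical name, and the strings it prints are summaries of the bullets above, not output from any repo tool.

```bash
# Hypothetical helper (not part of the OpenClaw repo): summarize what a
# command proves, following the Command Semantics list.
proof_kind() {
  case "$1" in
    'pnpm check'|'pnpm check:changed') echo 'typecheck/lint/guards' ;;
    'pnpm test'|'pnpm test:changed')   echo 'vitest' ;;
    'pnpm verify')                     echo 'typecheck/lint/guards, then vitest' ;;
    *)                                 echo 'unknown'; return 1 ;;
  esac
}

proof_kind 'pnpm check:changed'   # prints: typecheck/lint/guards
```

The point of the table is the gap it exposes: a green `pnpm check:changed` says nothing about Vitest, so a test proof still has to be chosen separately.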
## Routing Model

- `pnpm changed:lanes --json` answers "which check lanes does this diff touch?" It is used by `pnpm check:changed` for typecheck/lint/guard selection.
- `pnpm test:changed` answers "which Vitest targets are worth running now?" It uses the same changed path list, but applies a cheaper test-target resolver.
- Direct test edits run themselves. Source edits prefer explicit mappings, sibling `*.test.ts`, then import-graph dependents. Shared harness/config/root edits are skipped by default unless they have precise mapped tests.
- Shared group-room delivery config and source-reply prompt edits are precise mapped tests: they run the core auto-reply regressions plus Discord and Slack delivery tests so cross-channel default changes fail before a PR push.
- Public SDK or contract edits do not automatically run every plugin test. `check:changed` proves extension type contracts; the agent chooses the smallest plugin/contract Vitest proof that matches the actual risk.
- Use `OPENCLAW_TEST_CHANGED_BROAD=1 pnpm test:changed` only when a harness, config, package, or unknown-root edit really needs the broad Vitest fallback.

## CI Debugging

Start with current run state, not logs for everything:

```bash
gh run list --branch main --limit 10
gh run view <run-id> --json status,conclusion,headSha,url,jobs
gh run view <run-id> --job <job-id> --log
```

- Check exact SHA. Ignore newer unrelated `main` unless asked.
- For cancelled same-branch runs, confirm whether a newer run superseded it.
- Fetch full logs only for failed or relevant jobs.
- Prefer `gh run view <run-id> --json jobs` over PR rollup while debugging; rollup can be stale/noisy.
- For `prompt:snapshots:check` failures, treat Linux Node 24 as CI truth. If macOS passes but CI drifts, reproduce in a Linux Node 24 container or Testbox, commit that generated output, then rerun.

## GitHub Release Workflows

Use the smallest workflow that proves the current risk.
The full umbrella is available, but it is usually the last step after narrower proof, not the first rerun after a focused patch.

### Full Release Validation

`Full Release Validation` (`.github/workflows/full-release-validation.yml`) is the manual "everything before release" umbrella. It resolves a target ref, then dispatches:

- manual `CI` for the full normal CI graph, with Android enabled via `include_android=true`
- `Plugin Prerelease` for release-only plugin static checks, extension shards, the release-only `agentic-plugins` shard, and plugin product Docker lanes
- `OpenClaw Release Checks` for install smoke, cross-OS release checks, live and E2E checks, Docker release-path suites, OpenWebUI, QA Lab, fast Matrix, and Telegram release lanes
- optional post-publish Telegram E2E when a package spec is supplied

Run it only when validating an actual release candidate, after broad shared CI or release orchestration changes, or when explicitly asked:

```bash
gh workflow run full-release-validation.yml \
  --repo openclaw/openclaw \
  --ref main \
  -f ref=<branch-or-sha> \
  -f provider=openai \
  -f mode=both \
  -f release_profile=stable
```

Run the workflow itself from the trusted current ref, normally `--ref main`; child workflows are dispatched from that same ref even when `ref` points at an older release branch or tag. Full Release Validation has no separate child workflow ref input; choose the trusted harness by choosing the workflow run ref.

Use `release_profile=minimum|stable|full` to control live/provider breadth: `minimum` keeps the fastest OpenAI/core release-critical set, `stable` adds the stable provider/backend set, and `full` adds the broad advisory provider/media matrix. Do not make `full` faster by silently dropping suites; optimize setup, artifact reuse, and sharding instead.

The parent verifier job appends a child overview plus slowest-job tables for child runs; rerun only that verifier after a child rerun turns green.
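
Because a mistyped `release_profile` costs a full dispatch, it can help to validate the value and preview the command first. The following is a minimal dry-run sketch: `full_validation_cmd` is a hypothetical helper, it prints the command instead of running `gh`, and its fallback to `stable` is an assumption taken from the example above rather than a documented workflow default.

```bash
# Hypothetical dry-run helper (not part of the OpenClaw repo).
# Arguments: <branch-or-sha> [release_profile]
# Prints the dispatch command; it never calls gh itself.
full_validation_cmd() {
  _ref="$1"
  _profile="${2:-stable}"   # assumption: mirrors the example, not a known default
  case "$_profile" in
    minimum|stable|full) ;;
    *) echo "unsupported release_profile: $_profile" >&2; return 1 ;;
  esac
  printf '%s\n' "gh workflow run full-release-validation.yml --repo openclaw/openclaw --ref main -f ref=$_ref -f provider=openai -f mode=both -f release_profile=$_profile"
}

full_validation_cmd '<branch-or-sha>' minimum
```

The workflow ref stays fixed at `main` in the sketch because the trusted harness is chosen by the run ref, while the target under test goes in `-f ref=`.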

Standalone manual `CI` dispatches do not run the plugin prerelease suite, the extension batch sweep, or the release-only `agentic-plugins` Vitest shard. Those lanes are intentionally reserved for the separate `Plugin Prerelease` child so PRs, main pushes, and ad hoc broad CI checks do not spend Docker/package time or all-plugin runtime time on release-only product coverage.

If a full run is already active on a newer `origin/main`, prefer watching that run over dispatching a duplicate. Do not cancel release, release-check, or child workflow runs unless Peter explicitly asks for cancellation.

The child-dispatch jobs record the child run ids. The final `Verify full validation` job re-queries those child runs and is the canonical parent gate. If a child workflow failed but was later rerun successfully, rerun only the failed parent verifier job; do not dispatch a new full umbrella unless the release evidence is stale.

For bounded recovery after a focused fix, pass `-f rerun_group=<group>`. Supported umbrella groups are `all`, `ci`, `plugin-prerelease`, `release-checks`, `install-smoke`, `cross-os`, `live-e2e`, `package`, `qa`, `qa-parity`, `qa-live`, and `npm-telegram`. Use the narrowest group that covers the failed box. After a targeted release-check fix, do not restart the full umbrella by habit: dispatch the matching `rerun_group` and rerun only the parent verifier/evidence step after the child is green unless the release evidence is stale. For a single failed live/E2E shard, use `-f rerun_group=live-e2e -f live_suite_filter=<suite_id>` so the Blacksmith workflow only spends setup and queue time on that suite.

### Release Evidence

After release-candidate validation or before a release decision, record the important run ids in the public `openclaw/releases` evidence ledger. Use the manual `OpenClaw Release Evidence` (`openclaw-release-evidence.yml`) workflow there.
It writes durable summaries under `evidence/<release-id>/` and commits:

- `release-evidence.md`
- `release-evidence.json`
- `index.json`
- `runs/<label>.json`

Use one run per line:

```text
full-release-validation openclaw/openclaw <run-id> blocking
package-acceptance openclaw/openclaw <run-id> blocking
release-checks openclaw/openclaw <run-id> blocking
```

Store summaries, run URLs, artifact metadata, timings, pass/fail state, and short release-manager notes there. Do not store raw logs, provider prompts/responses, channel transcripts, signing material, or secret-bearing config in git; raw logs stay in Actions artifacts.

When `Full Release Validation` completes and `OPENCLAW_RELEASES_DISPATCH_TOKEN` is configured in the source repo, it requests the public `OpenClaw Release Evidence From Full Validation` workflow. That workflow reads the parent full-validation run, extracts the child CI/release-checks/Telegram run ids from the parent logs, and opens the evidence PR automatically. If the token is absent or the run predates this wiring, trigger that workflow manually with the full-validation run id.

### Release Checks

`OpenClaw Release Checks` (`openclaw-release-checks.yml`) is the release child workflow. It is broader than normal CI but narrower than the umbrella because it does not dispatch the separate full normal CI child. It runs Package Acceptance with artifact-native delta lanes and `telegram_mode=mock-openai`, so the release package tarball also goes through offline plugin proof, bundled-channel compat, and Telegram package QA. The Docker release-path chunks cover the overlapping package/update/plugin lanes. Use it when release-path validation is needed without rerunning the entire umbrella.

```bash
gh workflow run openclaw-release-checks.yml \
  --repo openclaw/openclaw \
  --ref main \
  -f ref=<branch-or-sha> \
  -f provider=openai \
  -f mode=both \
  -f release_profile=stable \
  -f rerun_group=all
```

Release-check rerun groups are `all`, `install-smoke`, `cross-os`, `live-e2e`, `package`, `qa`, `qa-parity`, and `qa-live`.

`OpenClaw Release Checks` uses the trusted workflow ref to resolve the selected ref once as `release-package-under-test` and passes that artifact into cross-OS release checks, release-path Docker live/E2E checks, and Package Acceptance. When `Full Release Validation` dispatches release checks, it passes the requested branch/tag plus an `expected_sha` so branch/tag refs resolve through the fast remote-ref path while the package and QA jobs still validate the exact SHA.

The full install-smoke child is split on purpose: one job prepares or reuses the target-SHA GHCR root Dockerfile smoke image, QR package install runs in its own job, root Dockerfile/gateway smokes pull the prepared image, and installer/Bun smokes pull the same image while building only their small installer images. If install-smoke gets slow again, first check whether the root image was reused or rebuilt before adding/removing coverage.

The full-profile native live media shards use the prebuilt `ghcr.io/openclaw/openclaw-live-media-runner:ubuntu-24.04` container so `ffmpeg`/`ffprobe` are already present. If those jobs suddenly spend minutes in dependency setup again, first check the `Live Media Runner Image` workflow and the `Verify preinstalled live media dependencies` step before assuming the media tests themselves slowed down.

The release Docker path intentionally shards the plugin/runtime tail. The workflow uses `plugins-runtime-plugins`, `plugins-runtime-services`, and `plugins-runtime-install-a` through `plugins-runtime-install-d`; aggregate aliases such as `plugins-runtime-core`, `plugins-runtime`, and `plugins-integrations` remain for manual reruns.
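
The umbrella and the release-checks child accept different `rerun_group` values, so a group that works on one can be rejected by the other. A small pre-dispatch check along these lines might catch that; `valid_rerun_group` and the `umbrella`/`release-checks` selector words are hypothetical, while the group names are copied from the lists quoted in this document.

```bash
# Hypothetical helper (not part of the OpenClaw repo).
# Arguments: <workflow> (umbrella|release-checks)  <group>
# Returns 0 if the group is listed for that workflow, 1 if not,
# and 2 if the workflow selector itself is not recognized.
valid_rerun_group() {
  case "$1" in
    umbrella)
      _groups='all ci plugin-prerelease release-checks install-smoke cross-os live-e2e package qa qa-parity qa-live npm-telegram' ;;
    release-checks)
      _groups='all install-smoke cross-os live-e2e package qa qa-parity qa-live' ;;
    *)
      return 2 ;;
  esac
  for _g in $_groups; do
    if [ "$_g" = "$2" ]; then
      return 0
    fi
  done
  return 1
}

if valid_rerun_group release-checks qa-parity; then
  echo 'ok: qa-parity is a release-checks group'
fi
```

For example, `npm-telegram` passes for the umbrella but not for the release-checks child, which matches the two lists.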

The release QA parity box is internally split into candidate and baseline lane jobs, followed by a report job that downloads both artifacts and runs `pnpm openclaw qa parity-report`. For parity failures, inspect the failed lane first; inspect the report job when both lane summaries exist but the comparison fails.

### QA Lab Matrix Profiles

`pnpm openclaw qa matrix` defaults to `--profile all`. Do not assume the CLI default is the fast release path. Use explicit profiles:

- `--profile fast`: release-critical Matrix transport contract; add `--fail-fast` only when the target CLI supports it
- `--profile transport|media|e2ee-smoke|e2ee-deep|e2ee-cli`: sharded full Matrix proof
- `OPENCLAW_QA_MATRIX_NO_REPLY_WINDOW_MS=3000`: CI-friendly no-reply quiet window when paired with fast or sharded gates

`QA-Lab - All Lanes` uses explicit fast Matrix on scheduled runs; manual dispatch keeps `matrix_profile=all` as the default and always shards that full Matrix selection. `OpenClaw Release Checks` uses explicit fast Matrix; run the all-lanes workflow when release investigation needs full Matrix media/E2EE inventory.

### Reusable Live/E2E Checks

`OpenClaw Live And E2E Checks (Reusable)` (`openclaw-live-and-e2e-checks-reusable.yml`) is the preferred entry point for targeted live, Docker, model, and E2E proof. Inputs let you turn off unrelated lanes:

```bash
gh workflow run openclaw-live-and-e2e-checks-reusable.yml \
  --repo openclaw/openclaw \
  --ref main \
  -f ref=<sha> \
  -f include_repo_e2e=false \
  -f include_release_path_suites=false \
  -f include_openwebui=false \
  -f include_live_suites=true \
  -f live_models_only=true \
  -f live_model_providers=fireworks
```

Useful knobs:

- `docker_lanes='<lane[,lane]>'`: run selected Docker scheduler lanes against prepared artifacts instead of the release chunk matrix. Multiple selected lanes fan out as parallel targeted Docker jobs after one shared package/image preparation step.
- `include_live_suites=false`: skip live/provider suites when testing Docker scheduler or release packaging only.
- `live_models_only=true`: run only Docker live model coverage.
- `live_model_providers=fireworks` (or comma/space separated providers): run one targeted Docker live model job instead of the full provider matrix.
- blank `live_model_providers`: run the full live-model provider matrix.

Release-path Docker chunks are currently `core`, `package-update-openai`, `package-update-anthropic`, `package-update-core`, `plugins-runtime-plugins`, `plugins-runtime-services`, `plugins-runtime-install-a`, `plugins-runtime-install-b`, `plugins-runtime-install-c`, `plugins-runtime-install-d`, `bundled-channels-core`, `bundled-channels-update-a`, `bundled-channels-update-b`, and `bundled-channels-contracts`. The aggregate `bundled-channels`, `plugins-runtime-core`, `plugins-runtime`, and `plugins-integrations` chunks remain valid for manual one-shot reruns, but release checks use the split chunks.

When live suites are enabled, the workflow shards broad native `pnpm test:live` coverage through `scripts/test-live-shard.mjs` instead of one serial `live-all` job:

- `native-live-src-agents`
- `native-live-src-gateway-core`
- `native-live-src-gateway-profiles` (release CI runs this with provider filters such as `OPENCLAW_LIVE_GATEWAY_PROVIDERS=anthropic`)
- `native-live-src-gateway-backends`
- `native-live-test`
- `native-live-extensions-a-k`
- `native-live-extensions-l-n`
- `native-live-extensions-openai`
- `native-live-extensions-o-z`
- `native-live-extensions-o-z-other`
- `native-live-extensions-xai`
- `native-live-extensions-media`
- `native-live-extensions-media-audio`
- `native-live-extensions-media-music`
- `native-live-extensions-media-music-google`
- `native-live-extensions-media-music-minimax`
- `native-live-extensions-media-video`

Use `node scripts/test-live-shard.mjs <shard> --list` to see the exact files before rerunning a failed native live shard.
The aggregate `o-z` and `media` shards remain useful locally; release CI uses the smaller provider/media shards so one live-provider flake does not force a broad native live rerun.

[truncated; see full source: https://github.com/openclaw/openclaw]
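
Before rerunning a native live shard, it is cheap to confirm the shard name is one of those listed and to preview the `--list` command. A minimal sketch, assuming the shard list in this document is current: `live_shard_list_cmd` is a hypothetical helper that only prints the command and does not invoke `node`.

```bash
# Hypothetical helper (not part of the OpenClaw repo).
# Argument: <shard>
# Prints the listing command for a shard named in this document,
# or fails for any other name. It does not run the script.
live_shard_list_cmd() {
  case "$1" in
    native-live-src-agents|native-live-src-gateway-core|\
    native-live-src-gateway-profiles|native-live-src-gateway-backends|\
    native-live-test|native-live-extensions-a-k|native-live-extensions-l-n|\
    native-live-extensions-openai|native-live-extensions-o-z|\
    native-live-extensions-o-z-other|native-live-extensions-xai|\
    native-live-extensions-media|native-live-extensions-media-audio|\
    native-live-extensions-media-music|native-live-extensions-media-music-google|\
    native-live-extensions-media-music-minimax|native-live-extensions-media-video)
      echo "node scripts/test-live-shard.mjs $1 --list" ;;
    *)
      echo "unknown native live shard: $1" >&2
      return 1 ;;
  esac
}

live_shard_list_cmd native-live-extensions-xai
```

A hard-coded list like this will drift as shards are added, so treat a rejection as a prompt to recheck the workflow rather than as proof the shard does not exist.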
