warg/dotfiles


warg 603ca3d23e fix(claude): forbid audit self-reference in commit messages

2026-04-28 14:39:59 +02:00


name	description
audit	Run a deep, multi-lens review of existing code state (not a diff). Launches six specialized review agents in parallel - reuse, quality, efficiency, errors, api, bugs - then validates each finding before presenting. Optional scope (`/audit path/ path2/`) and optional lens subset (`/audit --lenses reuse,bugs`). Opt-in lenses for docs, tests, security, a11y, deps. Use when the user asks for a full review, deep review, codebase audit, cleanup pass, retrospective review, tech-debt sweep, or otherwise wants to surface issues across landed code - even if they say "clean up the project" or "look over the repo" without using the word "audit". Do NOT use for reviewing in-flight work (use /simplify), PR review (use /review or /code-review), or security-only review (use /security-review).


/audit: Retrospective multi-lens codebase review

/simplify reviews a diff; /audit reviews current file state. Use it when issues may have accumulated before a review gate existed, when rolling onto an unfamiliar codebase, or when the user wants a deliberate "what's lurking?" sweep.

Invocation

/audit                              # whole primary source tree
/audit crates/uitk/src/text/        # one directory
/audit src/auth.rs src/api.rs       # specific files
/audit --lenses reuse,bugs          # only those lenses
/audit src/foo/ --lenses docs,tests # both scope and lenses
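
The invocation forms above can be sketched as a small argument splitter. This is only an illustration of the intended semantics, not how the skill is implemented: `AuditRequest` and `parse_invocation` are hypothetical names, and an empty scope is taken to mean "infer the primary source tree".

```python
from dataclasses import dataclass, field

DEFAULT_LENSES = ("reuse", "quality", "efficiency", "errors", "api", "bugs")
OPT_IN_LENSES = ("docs", "tests", "security", "a11y", "deps")


@dataclass
class AuditRequest:
    # Hypothetical container; an empty scope means "infer the source tree".
    scope: list = field(default_factory=list)
    lenses: tuple = DEFAULT_LENSES


def parse_invocation(args):
    """Split /audit arguments into scope paths and an optional lens subset."""
    request = AuditRequest()
    i = 0
    while i < len(args):
        if args[i] == "--lenses":
            if i + 1 >= len(args):
                raise ValueError("--lenses needs a comma-separated list")
            names = tuple(n for n in args[i + 1].split(",") if n)
            unknown = [n for n in names if n not in DEFAULT_LENSES + OPT_IN_LENSES]
            if not names or unknown:
                raise ValueError(f"bad lens list: {args[i + 1]!r}")
            request.lenses = names  # run only the named lenses
            i += 2
        else:
            request.scope.append(args[i])  # anything else is a file or directory
            i += 1
    return request
```

For example, `parse_invocation(["src/foo/", "--lenses", "docs,tests"])` yields scope `["src/foo/"]` and lenses `("docs", "tests")`.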

If the user invokes /audit with no arguments, infer the primary source tree from the project layout (Rust: crates/*/src/; TS/JS: src/ or packages/*/src/; Python: the package directory). Ask briefly only if ambiguous.
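
A rough sketch of that inference, assuming marker files (`Cargo.toml`, `package.json`, `pyproject.toml`) are a good enough signal for the project type. The function name and the exact probing order are illustrative assumptions. An empty result, or one that mixes ecosystems, is the "ambiguous" case where asking the user makes sense.

```python
from pathlib import Path


def infer_source_roots(project):
    """Guess the primary source tree from marker files; [] means "ask the user"."""
    candidates = []
    if (project / "Cargo.toml").exists():
        # Rust workspace layout: crates/*/src/
        candidates += sorted(project.glob("crates/*/src"))
    if (project / "package.json").exists():
        # TS/JS monorepo layout: packages/*/src/
        candidates += sorted(project.glob("packages/*/src"))
    if not candidates and (project / "src").is_dir():
        # Single-crate or single-package layout: plain src/
        candidates.append(project / "src")
    if not candidates and (project / "pyproject.toml").exists():
        # Python: treat top-level directories containing __init__.py as the package.
        candidates += sorted(p.parent for p in project.glob("*/__init__.py"))
    return [p for p in candidates if p.is_dir()]
```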

Communicating with the user

Phases are internal scaffolding for organizing this skill, not concepts the user needs to track. Do not announce them in user-facing text. No "Phase 3: validating findings before reporting", no "moving on to Phase 5", no "Phase 4 triage complete". Brief, plain progress notes are fine when warranted ("validating findings before reporting", "running the gate"), but they should describe the action, not name a phase.

Phase 1: Context gather

Before spawning review agents:

Identify scope from args or inference (see above).
Read CLAUDE.md (project root, plus any in touched directories) and any memory index. Capture project-specific conventions to feed each agent as "do NOT flag these" directives (example: "no em-dashes", "we intentionally keep cargo test --all-targets off", "focus state uses two bools deliberately, scheduled for refactor in TODO.md").
Read TODO.md / BACKLOG.md / equivalents for items explicitly deferred. Agents must not re-raise known debt.
Identify the gate script (e.g. scripts/prepare.sh, pnpm check, make test) so fixes can be verified at the end.

The scope, the project conventions, and the deferred items get fed into every agent prompt so the agents respect the project's existing shape. The gate script is only needed later, to verify fixes.

Phase 2: Launch lens agents in parallel

Send a single message with multiple Agent tool uses, each subagent_type: general-purpose. The default set is six lenses; if --lenses <list> was given, run only the lenses named in that list, which is also the only way to enable an opt-in lens.

Model selection per lens

The Agent tool accepts a model: "sonnet" | "opus" | "haiku" parameter. Pick deliberately - some lenses are pattern-matching (cheap), others are reasoning-heavy (expensive but worth it).

lens	model	why
reuse	sonnet	pattern recognition across files, fits sonnet's strengths
quality	sonnet	structural critique, naming, dead code; sonnet is enough
efficiency	opus	needs reasoning about hot paths, allocations, asymptotic patterns
errors	opus	control-flow analysis, silent-failure detection wants careful reading
api	sonnet	visibility analysis, type design - mostly mechanical
bugs	opus	correctness reasoning is the place not to skimp
docs (opt-in)	haiku	"does the comment still match the code?" - cheap
tests (opt-in)	sonnet	gap analysis with semantic context
security (opt-in)	opus	high-stakes correctness, needs careful reading
a11y (opt-in)	sonnet	pattern matching with semantic context
deps (opt-in)	haiku	mostly file scanning

The validation agent in Phase 3 also runs on opus - false negatives drop real findings, so this is the wrong place to economize.

These are defaults; if a lens is unusually subtle for a given project (e.g. obscure embedded language, novel runtime), bump its model up a tier.
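
The table reduces to a lookup plus an optional bump. The sketch below reads "bump up" as "move one step along haiku, sonnet, opus", which is an assumption; `model_for` and its `subtle` flag are hypothetical names.

```python
# Defaults copied from the table above.
LENS_MODELS = {
    "reuse": "sonnet",
    "quality": "sonnet",
    "efficiency": "opus",
    "errors": "opus",
    "api": "sonnet",
    "bugs": "opus",
    "docs": "haiku",
    "tests": "sonnet",
    "security": "opus",
    "a11y": "sonnet",
    "deps": "haiku",
}
MODEL_ORDER = ["haiku", "sonnet", "opus"]  # cheapest to most capable


def model_for(lens, subtle=False):
    """Return the default model for a lens, one step higher when the project is subtle."""
    model = LENS_MODELS[lens]
    if subtle:
        step = min(MODEL_ORDER.index(model) + 1, len(MODEL_ORDER) - 1)
        model = MODEL_ORDER[step]
    return model
```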

Default lenses

Each lens prompt must include:

One-paragraph project summary (language, domain, what the code does).
The scope: exact file/directory list the agent must read.
The lens's concrete focus (see below).
Project conventions to skip (from Phase 1).
Deferred TODO items to skip (from Phase 1).
Explicit "skip" list: the other lenses' topics (so findings don't overlap).
Output format: bulleted findings, each with file:line (or range), the issue (concrete, one line), suggested fix (one line).
Word cap: 400-700 words per agent. Findings scale with scope, so give bigger caps when auditing whole repos, smaller when auditing a single file.
"HIGH SIGNAL only. If you are not certain a finding is real, don't flag it. Don't invent findings to fill space. If the area is clean, say so."

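The checklist above maps naturally onto a prompt builder. This is a minimal sketch with hypothetical names (`build_lens_prompt`, `word_cap`); the section labels are illustrative wording, and the 50-file breakpoint used to scale the word cap is an assumption, since the skill only gives the 400-700 range.

```python
HIGH_SIGNAL = (
    "HIGH SIGNAL only. If you are not certain a finding is real, don't flag it. "
    "Don't invent findings to fill space. If the area is clean, say so."
)


def word_cap(file_count):
    """Scale the cap inside 400-700 words; the 50-file breakpoint is an assumption."""
    return 400 + 6 * min(max(file_count, 0), 50)


def bullets(items):
    return "\n".join(f"- {item}" for item in items) or "- (none)"


def build_lens_prompt(lens, focus, summary, scope, conventions, deferred,
                      active_lenses, file_count):
    """Assemble one lens agent's prompt from the required parts, in order."""
    others = [name for name in active_lenses if name != lens]
    return "\n\n".join([
        f"Project: {summary}",
        "Scope (read exactly these):\n" + bullets(scope),
        f"Focus ({lens}): {focus}",
        "Project conventions, do NOT flag these:\n" + bullets(conventions),
        "Deferred TODO items, do NOT re-raise:\n" + bullets(deferred),
        "Covered by other lenses, skip: " + (", ".join(others) or "(none)"),
        "Output: bulleted findings, each with file:line (or range), "
        "the issue in one line, a suggested fix in one line.",
        f"Word cap: {word_cap(file_count)} words.",
        HIGH_SIGNAL,
    ])
```

Passing `active_lenses` in keeps the skip list accurate when `--lenses` narrows the run: an agent is only told to skip topics another agent is actually covering.
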
reuse

Duplicated logic, reinvented std / framework primitives, inline patterns that match an existing helper in the same codebase, inconsistent import paths (e.g. some files use top-level re-exports, others use deep paths). Flag the concrete duplication with file:line of each duplicate site. Don't propose new abstractions where no duplication exists yet.

quality

Structural and maintainability issues: redundant state (fields that duplicate each other, derived values cached unnecessarily, bools that encode the same thing as an adjacent enum), leaky abstractions (pub(crate) fields poked directly when a method would be cleaner), stringly-typed code, parameter sprawl, unnecessary comments (especially WHAT-not-WHY narration, section dividers, PR/commit/task references in code), nested conditionals 3+ levels deep that could flatten, dead code, brittle test fixtures. Skip anything a linter/formatter would catch - the gate handles those.

efficiency

Hot-path bloat (anything that runs per-frame / per-event / per-request / per-render): redundant allocations, repeated hashmap lookups, multiple tree walks where one would do, reconstructing immutable objects every call. Recurring no-op updates (state writes that trigger downstream invalidation even when the value didn't change). Unbounded growth in caches or maps. Overly broad operations (scanning entire collections to find one thing). Note "hot path" context per project - for GUI/game code it's paint/layout/event loops; for servers it's request handlers; for data pipelines it's per-record transforms.

errors

Error-handling hygiene. Silent failures (catch/Result discarded, unwrap/expect on fallible ops that could surface meaningful errors), inconsistent error propagation patterns within one codebase, expect("...") messages that don't explain why, panic locations that could be Result returns, missing error context at boundaries. Inspired by Anthropic's silent-failure-hunter agent.

api

Public API surface appropriateness. pub on items that could be pub(crate) (check whether external callers exist), missing #[non_exhaustive] on enums that will grow, doc-commented-but-private items (doc comment misplaced), trait methods with confusing defaults, constructors/builders inconsistent with the rest of the crate. Type design: invariants expressed via state instead of type (e.g. a pair of Option + bool that could be an enum). Inspired by Anthropic's type-design-analyzer.

bugs

Correctness issues: logic errors, off-by-one, missing bounds checks, wrong condition in if, incorrect loop termination, type confusion that compiles but is wrong, borrow patterns that compile but violate invariants (lifetimes too permissive / not permissive enough). Only flag with high confidence - "this might be wrong depending on inputs" is NOT a finding. Include language-specific bug profiles: Rust bugs often involve lifetimes/Send/Sync; JS/TS bugs often involve null/undefined, async Promise lifetimes, reference equality mistakes.

Opt-in lenses

Enabled only via explicit --lenses containing their name.

docs

Public items without doc comments. Stale / rotted comments (code has moved on, comment hasn't). Outdated examples in doc comments. Missing module-level docs on non-trivial modules. Inspired by Anthropic's comment-analyzer.

tests

Coverage gaps (public API without tests), brittle fixtures (parallel arrays that should be tuples, over-complex setup), test-only code leaking into production, missing edge case assertions (empty input, single element, boundary values), assertions that don't match their descriptions. Inspired by Anthropic's pr-test-analyzer.

security

Common vulnerability patterns for the project type: injection (SQL / shell / template), hardcoded secrets, unsafe deserialization, missing input validation at trust boundaries, auth/session flaws, path traversal. Skip entirely for projects with no security surface (pure algorithm libraries, graphics code, offline tools).

a11y

UI accessibility: missing labels on inputs / buttons, colour-only signalling, tabindex / focus management, screen reader compatibility, keyboard-only navigation support. Only meaningful for UI-layer code.

deps

Dependency hygiene: duplicate deps at different versions, unused deps, feature flags that enable more than needed, dev-deps used in production code paths.

Phase 3: Validation pass

Once lens agents return, do NOT present findings to the user yet. Launch a single validation agent with all raw findings as input:

"Each finding below was flagged by a lens agent. For each one, confirm independently whether it's real by reading the referenced file(s) and the surrounding context. Classify each as: confirmed (high-confidence real issue), misfire (wrong reading of the code, semantics differ from what the agent thought), or context-dependent (real only under unstated assumptions - treat as misfire). Return the confirmed list, with the reasoning for any misfires you're dropping so the aggregator can double-check."

This mirrors the confidence-scoring approach in Anthropic's /code-review. Misfires are noisy; validation keeps signal high.

Skip validation only if the raw finding count is ≤3 and each one is obviously right (saves tokens when the audit turns up almost nothing).
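
The classification and the skip rule can be sketched as follows. `Finding`, `Verdict`, and the two helpers are hypothetical names; the verdicts themselves would come from the validation agent, not from this code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    CONFIRMED = "confirmed"
    MISFIRE = "misfire"
    CONTEXT_DEPENDENT = "context-dependent"  # treated as a misfire


@dataclass
class Finding:
    location: str  # file:line or range
    issue: str
    verdict: Optional[Verdict] = None
    reason: str = ""  # the validator's reasoning, kept for dropped findings


def needs_validation(raw_count, all_obviously_right):
    """Validation may be skipped only for three or fewer obviously correct findings."""
    return not (raw_count <= 3 and all_obviously_right)


def split_validated(findings):
    """Return (confirmed, dropped); dropped findings keep their reason for double-checking."""
    if any(f.verdict is None for f in findings):
        raise ValueError("every finding needs a verdict before triage")
    confirmed = [f for f in findings if f.verdict is Verdict.CONFIRMED]
    dropped = [f for f in findings if f.verdict is not Verdict.CONFIRMED]
    return confirmed, dropped
```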

Phase 4: Triage

Classify each confirmed finding into one of four tiers:

Trivial fix - small local change, clear improvement, no judgment call (e.g. "use existing helper at file.rs:42 instead of inline arithmetic").
Substantive fix - real value, more than a few lines, clear scope (e.g. "merge two near-duplicate functions into one walker").
Needs discussion - chunky refactor, public API change, enum redesign, hot-path caching with lifetime gymnastics. Outcome shouldn't be assumed.
Backlog item - real but larger than cleanup. Should land in TODO.md (or equivalent) so it's not lost.

This phase is classification only. Do NOT apply any fixes here, do NOT edit TODO.md here. Recording happens in the next phase, after the user has seen the proposed plan.

Phase 5: Report and apply tier by tier

Don't dump every tier at once. The user shouldn't have to scroll back through a wall of findings to track decisions. Walk through one tier at a time: present, get approval, apply, commit, gate, then move to the next.

Set up internal tracking

Before presenting anything, use TaskCreate to record one task per non-empty tier in the order below. The full finding set lives in those tasks, so you can hold detail internally and surface only the active tier to the user. Mark each tier's task complete as you finish it.

Tier order

Suggested backlog additions - lock these in first. A single TODO.md append is cheap and ensures nothing is lost if a later code change goes sideways.
Trivial fixes - grouped by theme (e.g. "use existing helpers", "drop dead code"), one commit per theme.
Substantive fixes - one commit per logical change. Commit message explains the why.
Needs discussion - present each as: issue, two options, tradeoff. Apply only if the user gives specific direction.

Skip any tier that has zero items.
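
One way to picture the ordering rule, as a sketch with hypothetical names: the enum's definition order stands in for the presentation order, and empty tiers drop out of the queue.

```python
from enum import Enum


class Tier(Enum):
    # Definition order is the presentation order.
    BACKLOG = "Suggested backlog additions"
    TRIVIAL = "Trivial fixes"
    SUBSTANTIVE = "Substantive fixes"
    DISCUSSION = "Needs discussion"


def presentation_queue(findings_by_tier):
    """Return the tiers to present, in order, skipping any with zero items."""
    return [tier for tier in Tier if findings_by_tier.get(tier)]
```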

Opening the report

Only on the first non-empty tier, lead with a single summary line:

## Audit findings

Ran <N> lens(es) over <scope>. <K> raw findings, <V> confirmed. Working through them one tier at a time.

If there's a useful cross-cutting observation, mention it in one line here. Don't pad.

Format per tier

Items are numbered 1..N within the tier, resetting each tier so the user can say "skip 2 and 5" without ambiguity.

### <Tier name> (<count>)

1. file.rs:42 - <issue>. Fix: <one-line change>.
2. file.rs:88 - <issue>. Fix: <one-line change>.
3. ...

Ready to apply? "Go ahead" for all, or tell me which numbers to skip.

For Needs discussion, expand each item:

### Needs discussion (<count>)

1. file.rs:220 - <issue>.
   - Option a: <option>
   - Option b: <option>
   - Tradeoff: <one line>
2. ...

Stop after presenting a tier and wait for the user. They may:

Approve all in the tier ("go ahead").
Skip specific numbers ("skip 2 and 5, apply the rest").
Reject the whole tier.
Ask for more detail on a specific number before deciding.

Match their direction precisely. Don't slip auto-applied items past their stated scope.
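
A small sketch of the per-tier numbering and of honouring a "skip 2 and 5" reply. Both function names are hypothetical, and turning the user's free-text reply into a list of numbers is assumed to happen elsewhere.

```python
def render_tier(name, items):
    """Format one tier; items are (location, issue, fix) tuples numbered from 1."""
    lines = [f"### {name} ({len(items)})", ""]
    for number, (location, issue, fix) in enumerate(items, start=1):
        lines.append(f"{number}. {location} - {issue}. Fix: {fix}.")
    lines += ["", 'Ready to apply? "Go ahead" for all, or tell me which numbers to skip.']
    return "\n".join(lines)


def approved_items(items, skip=()):
    """Return the items to apply, given the 1-based numbers the user asked to skip."""
    skipped = set(skip)
    out_of_range = [n for n in skipped if not 1 <= n <= len(items)]
    if out_of_range:
        # Ask again rather than guess which item was meant.
        raise ValueError(f"no such item numbers: {sorted(out_of_range)}")
    return [item for number, item in enumerate(items, start=1) if number not in skipped]
```

Rejecting out-of-range numbers instead of ignoring them keeps the applied set strictly within what the user stated.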

After approval, before the next tier

Apply the approved items in the current tier.
Commit per the tier's rule (one commit for the backlog append, one per theme for trivials, one per logical change for substantives).
Run the project's gate (scripts/prepare.sh / pnpm check / etc.). If it fails, fix the underlying issue rather than reverting or bypassing.
Mark the tier's task complete.
Present the next non-empty tier the same way.

Public-API guard

A generic "go ahead" on Trivial or Substantive does NOT extend to items that touch public API surface (pub items, exported types, breaking signature changes). If such an item is sitting in those tiers, lift it into Needs discussion before presenting, so it gets explicit attention.
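
The guard amounts to a re-tiering pass that runs before anything is presented. In this sketch `TriagedFinding` and its `touches_public_api` flag are hypothetical; deciding whether a finding touches public API is left to the triage step.

```python
from dataclasses import dataclass, replace

AUTO_APPLY_TIERS = ("Trivial fix", "Substantive fix")


@dataclass(frozen=True)
class TriagedFinding:
    location: str
    issue: str
    tier: str
    touches_public_api: bool = False  # assumed to be set during triage


def lift_public_api(findings):
    """Move public-API findings out of auto-apply tiers into Needs discussion."""
    return [
        replace(f, tier="Needs discussion")
        if f.touches_public_api and f.tier in AUTO_APPLY_TIERS
        else f
        for f in findings
    ]
```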

Closing

After the final tier, give a brief summary: what was applied, what was backlogged, what's still open. If items got dropped during discussion, note them so the user can confirm the bookkeeping.

Do not

Edit any file before the user has approved the active tier. This includes TODO.md and any code file. Validation is the last automatic step. Everything after waits on per-tier go-ahead.
Dump every tier at once or rely on a single big report. Walk tier by tier so the user only has to track one decision at a time.
Surface phase numbers/names to the user. Phases are scaffolding for this skill, not vocabulary the user should have to learn.
Auto-apply changes that affect public API surface even after a generic "go ahead". If a finding touches pub items, lift it into the Needs discussion tier before presenting.
Reference the audit run itself in commit titles or bodies (no "from audit", "audit cleanup", "via /audit", "found by audit"). Commits should describe the change, not how it surfaced. The audit is the trigger, not the subject.
Stack findings from multiple lenses into one commit without clear grouping.
Invent findings to fill space if a lens comes up empty. "Nothing to flag" is a valid outcome and should be reported as such.
Re-raise items already in TODO.md / BACKLOG.md.
Run /audit against a codebase that was just audited - diminishing returns.
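
The commit-message rule in the list above lends itself to a mechanical check. The sketch below covers only the four example phrasings the rule names; the pattern and function name are illustrative assumptions, and a message that legitimately discusses an "audit" feature could need a narrower pattern.

```python
import re

# Matches "from audit", "via /audit", "found by audit", "audit cleanup", or a bare "/audit".
AUDIT_SELF_REFERENCE = re.compile(
    r"\b(?:from|via|by)\s+(?:the\s+)?/?audit\b|\baudit\s+cleanup\b|/audit\b",
    re.IGNORECASE,
)


def mentions_audit_run(message):
    """Return True if a commit title or body refers to the audit run itself."""
    return AUDIT_SELF_REFERENCE.search(message) is not None
```

A message such as "refactor: dedupe walkers (found by audit)" is flagged, while "refactor(text): merge near-duplicate tree walkers" passes.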

Parallelism note

All lens agents run in parallel (single message, multiple Agent tool uses). Six agents is the upper bound where parallelism still pays; above that, coordination overhead catches up. Keep opt-in lenses opt-in for this reason - running all 11 in parallel would be wasteful for most projects.

When NOT to use this skill

Reviewing in-flight work → /simplify against the diff.
PR review → /review (built-in) or /code-review:code-review if that plugin is installed.
Security-only audit of a diff → /security-review.
Back-to-back on the same code → re-runs produce sharply diminishing returns.
