Daeseon Yoo

Building a project log system for AI-pair-programmed work

A live build of the logging system that records every substantive decision Claude Code makes on my behalf — including the trigger layer that catches author-judgment slips, the cross-repo aggregation that pulls multiple satellite projects into one timeline, the tiered decision templates drawn from established frameworks, and the human-annotation surface that keeps me in the loop without forcing me to write every entry.

9 min read · Korean version →

This post was written while the system was built. Not after. The goal is to capture not just what shipped but the decisions and trade-offs in real time — so a reader (or future me) can reconstruct why the system looks the way it does.

What I was building

A logging system designed for someone who delegates substantial code work to AI and still wants every decision visible, queryable, and annotatable. Four properties I care about:

  1. Nothing missed by default. Author-judgment slips were the largest source of gaps in the prior version; a positive-trigger layer overrides them now.
  2. AI-author + human-annotator. Every AI-written entry has a sibling .human.mdx slot. Empty slots render as "REVIEW NEEDED" so absence is visible.
  3. Cross-repo aggregation. Five projects (one hub + four satellites) feed one portfolio timeline via pull-on-demand. No sync, no second source of truth.
  4. Methodology as portfolio. The system itself is documented as a project, not buried in process docs. This post and the /method page are part of that.
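To make property 2 concrete, here is a minimal Python sketch of how an empty human slot could surface as a visible placeholder. The naming rule (entry.mdx paired with entry.human.mdx in the same directory) and both function names are assumptions for illustration, not the site's actual code.

```python
from pathlib import Path

PLACEHOLDER = "REVIEW NEEDED"


def human_slot_path(ai_entry: Path) -> Path:
    """Return the sibling annotation path (assumed rule: foo.mdx -> foo.human.mdx)."""
    return ai_entry.with_name(ai_entry.stem + ".human.mdx")


def human_annotation(ai_entry: Path) -> str:
    """Return the human annotation text, or the placeholder if the slot is missing or blank."""
    slot = human_slot_path(ai_entry)
    if slot.exists():
        text = slot.read_text(encoding="utf-8").strip()
        if text:
            return text
    # A missing file and a whitespace-only file are treated the same way,
    # so absence is always rendered rather than silently dropped.
    return PLACEHOLDER
```

Treating a blank file like a missing one is a deliberate choice in this sketch: creating an empty slot should not be enough to clear the placeholder.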

Why this post exists

Reading other people's logging systems, I noticed the same gap: they document the finished system, not the build. The build is where the actual decisions live — what was tried, what was rejected, which trade-offs hurt. So this is the build, not the polish.

There's a more concrete reason too. In 2026, anyone can ship code with AI. The differentiator for engineers is the judgment layer above it — what you choose to build, what trade-offs you accept, what AI suggestions you reject, what gaps in your own understanding you notice and close. None of this lives in the code; it lives in records or it disappears. So the system itself becomes a portfolio piece.

Phase A — Stop the bleeding

Positive triggers

The prior version of the system had a [no-log] mechanism that trusted author judgment. A cold audit across five projects found that judgment was unreliable about a third of the time. An 855-LOC admin CRUD change with no log entry. A 367-LOC change tagged "routine" that turned out to be the literal cause of a production hang. An ADR record (the canonical "why hooks" decision in another project) buried in a [no-log] commit.

The fix: positive triggers that override [no-log] when one of three conditions fires.

Override mechanism for false positives: a deliberate <!-- override-trigger: hash subject — real rationale --> line in docs/troubleshooting.md. Silent overrides are disabled because the whole point is to preserve human reasoning in the audit trail.
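A rough Python sketch of how such override lines could be read back out of docs/troubleshooting.md. The regex assumes the hash is 7 to 40 hex characters and that the subject and rationale are separated by the long dash shown in the marker format above. The function name and the returned shape are hypothetical.

```python
import re

# \u2014 is the long dash that separates subject from rationale in the marker.
# Requiring a non-empty rationale means a "silent" override never parses.
OVERRIDE_RE = re.compile(
    r"<!--\s*override-trigger:\s*(?P<hash>[0-9a-f]{7,40})\s+"
    r"(?P<subject>\S.*?)\s+\u2014\s+(?P<rationale>\S.*?)\s*-->"
)


def parse_overrides(text: str) -> dict:
    """Map commit hash -> (subject, rationale) for every well-formed override line."""
    return {
        m.group("hash"): (m.group("subject"), m.group("rationale"))
        for m in OVERRIDE_RE.finditer(text)
    }
```

A marker with no rationale is simply ignored here, so the trigger would still fire for that commit.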

Verification before shipping: ran the trigger logic against three known-bad commits from the audit. All three classified correctly. The commit that installed the change (a9c4911) was itself the first dogfood — 378 LOC plus three sensitive paths fired triggers 1 and 2, requiring a full dual-write entry rather than the [no-log] tag the message would have allowed under the prior hook.
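The classification step can be sketched roughly as below. Only two conditions are modeled, a diff-size check and a sensitive-path check, because those are the two the dogfood commit is described as firing; the third condition is not modeled. The threshold, the path prefixes, and the trigger names are all hypothetical placeholders, not the hook's real values.

```python
from dataclasses import dataclass

LOC_THRESHOLD = 300  # hypothetical cutoff, not the hook's real number
SENSITIVE_PREFIXES = ("scripts/", ".claude/", "docs/DECISIONS.md")  # hypothetical


@dataclass(frozen=True)
class Commit:
    message: str
    loc_changed: int
    paths: tuple


def fired_triggers(commit: Commit) -> list:
    """Return the names of the positive triggers this commit fires."""
    fired = []
    if commit.loc_changed >= LOC_THRESHOLD:
        fired.append("large-diff")
    # str.startswith accepts a tuple of prefixes.
    if any(p.startswith(SENSITIVE_PREFIXES) for p in commit.paths):
        fired.append("sensitive-path")
    return fired


def requires_log_entry(commit: Commit) -> bool:
    """A [no-log] tag is honored only when no positive trigger fires."""
    wants_skip = "[no-log]" in commit.message
    return (not wants_skip) or bool(fired_triggers(commit))
```

With these placeholder values, a 378-LOC commit touching scripts/ fires both checks and needs an entry even if its message carries [no-log].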

Propagation across satellites

The hook lives in the hub repo but needs to run in four satellite repos too. A factory script (scripts/propagate-hook.sh) copies the current source-of-truth hook into each satellite, commits, and pushes — skipping any satellite with a dirty working tree to avoid surprising local work.

The trade-off: this assumes a known set of satellite paths on one machine. A multi-machine setup would need a global-hook pattern (one file at ~/.claude/global-hooks/, all satellites exec it). Deferred — solo dev on one machine is the current shape, and the factory script is enough.

In practice, two of four satellites took the new hook on first attempt; two were skipped because their working trees were dirty from in-progress unrelated work. The script is idempotent, so re-running once the satellite is clean catches up the rest.
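The skip-if-dirty rule is the part that makes re-running safe, and it can be sketched in a few lines. This is a Python illustration, not the contents of scripts/propagate-hook.sh. It relies on `git status --porcelain` printing nothing for a clean working tree. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def git_status_porcelain(repo: Path) -> str:
    """Return `git status --porcelain` output; an empty string means a clean tree."""
    return subprocess.run(
        ["git", "-C", str(repo), "status", "--porcelain"],
        check=True, capture_output=True, text=True,
    ).stdout


def plan_propagation(satellites, status=git_status_porcelain):
    """Split satellites into (to_update, skipped) without touching any of them."""
    to_update, skipped = [], []
    for repo in satellites:
        # Any porcelain output (modified or untracked files) counts as dirty.
        (skipped if status(repo).strip() else to_update).append(repo)
    return to_update, skipped
```

Because the plan is recomputed from the current tree state on every run, a satellite skipped today lands in to_update once its working tree is clean.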

Skip marker reconciliation

The audit also found that three of four satellites had zero <!-- skipped: --> markers despite dozens of [no-log] commits in their history. The Stop hook's auto-recording wasn't installed in those repos at all, so the audit trail of deliberate skips was missing.

A sync script (scripts/sync-skip-markers.sh) walks each satellite's git log, finds every [no-log] / [skip-log] commit whose hash isn't already in docs/troubleshooting.md, and appends the missing markers in one batch. The result on shadow-ai: 33 of 34 historical skips reconciled. On jarvis-pc: 21 of 22.
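The reconciliation logic reduces to a set difference between tagged commits and hashes already present in the doc. Here is a Python sketch of that core, assuming input lines in the shape produced by `git log --format='%h %s'`. The text after `skipped:` in the generated marker (hash plus subject) is an assumed format, and the function name is hypothetical.

```python
import re

SKIP_TAG_RE = re.compile(r"\[(?:no-log|skip-log)\]")


def missing_skip_markers(log_lines, troubleshooting_text):
    """Return markers for tagged commits whose hash is not yet in the doc."""
    markers = []
    for line in log_lines:
        commit_hash, _, subject = line.strip().partition(" ")
        if not commit_hash or not SKIP_TAG_RE.search(subject):
            continue  # not a deliberate skip
        if commit_hash in troubleshooting_text:
            continue  # already recorded
        # Assumed marker body: "<hash> <subject>".
        markers.append(f"<!-- skipped: {commit_hash} {subject} -->")
    return markers
```

Appending the returned markers and running the function again yields an empty list, which is what makes a batch append safe to repeat.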

Phase B — Architecture overviews and audit backfill

The audit's most-cited critical gap: no project had a single log entry describing its system shape. Every dated entry assumed the reader already knew the architecture and proceeded straight to incidents.

The fix was straightforward but tedious: one architecture-overview entry per project, dated today, kind snapshot, describing the system at the boundary between code and aggregation. To avoid a serial slog through five repos by hand, I ran the writes as a fan-out workflow: twelve agents in parallel, each reading one project's local state (manifest, top-level dir tree, recent git log, existing log entries) and producing one MDX draft. Five of the drafts were the architecture overviews; the other seven were backfill entries. The workflow returned all twelve drafts in 97 seconds.

The same workflow also produced backfill entries for seven specific audit-flagged commits — the substantial commits that had been tagged [no-log] or shipped without an entry. The backfills are explicitly marked backfilled: true in frontmatter so future readers know they were reconstructed from the diff, not written contemporaneously.

The most consequential backfill: the ADR-001 record from dalkkak-ai, which decided per-session status would be implemented via Claude Code hooks rather than TUI scraping or process-check polling. The decision had been recorded in docs/DECISIONS.md but tagged [no-log], so it never surfaced in the project timeline. As a backfilled kind: decision with Tier 1, it now does.
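For illustration, the frontmatter of a backfilled decision entry could be rendered along these lines. The keys kind and backfilled come from the text above; the tier key name and the helper itself are assumptions, and a real entry would carry more fields (title, date) that are not guessed at here.

```python
def render_frontmatter(fields: dict) -> str:
    """Render a flat dict as a YAML frontmatter block for an MDX entry."""
    lines = ["---"]
    for key, value in fields.items():
        if isinstance(value, bool):  # check bool before int: bool is an int subclass
            rendered = "true" if value else "false"
        elif isinstance(value, int):
            rendered = str(value)
        else:
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"{key}: {rendered}")
    lines.append("---")
    return "\n".join(lines) + "\n"


# Prints a three-field block: kind, tier, backfilled.
print(render_frontmatter({"kind": "decision", "tier": 1, "backfilled": True}))
```

Keeping backfilled as a boolean in frontmatter means the timeline can badge reconstructed entries without parsing the body.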

Phase C — Status header on each project page

The visible portfolio signal isn't "this project shipped X commits." It's "this project recorded Y decisions and Z learning gaps." So /projects/<slug> now renders a status block above the timeline:

N logged · Judgment layer: 3 Decision · 1 Discussion · 2 Learning gap
Last update: 2026-05-31 · Architecture overview ↓

Decision, discussion, and learning-gap are first-class kinds with their own rendering. Activity heatmap (calendar grid colored by entry kind) is deferred — the data is there but the visualization needs an afternoon I don't have right now.
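A small Python sketch of how the status block could be derived from a project's entries. Entries are assumed to be dicts with a kind string and an ISO date string. The kind identifiers (decision, discussion, learning-gap, snapshot) and the function name are illustrative assumptions.

```python
from collections import Counter

# (kind identifier, display label) in the order shown in the header.
JUDGMENT_KINDS = (
    ("decision", "Decision"),
    ("discussion", "Discussion"),
    ("learning-gap", "Learning gap"),
)


def status_header(entries) -> str:
    """Build the two-line status block shown above a project timeline."""
    if not entries:
        return "0 logged"
    counts = Counter(e["kind"] for e in entries)
    judgment = " · ".join(f"{counts[kind]} {label}" for kind, label in JUDGMENT_KINDS)
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    last = max(e["date"] for e in entries)
    return (
        f"{len(entries)} logged · Judgment layer: {judgment}\n"
        f"Last update: {last} · Architecture overview ↓"
    )
```

Entries of other kinds, such as snapshots, count toward the logged total but not toward the judgment layer.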

The decision tiers (the MBA-grade layer)

Three tiers, scaled to the weight of the choice. All borrowed from established frameworks, not invented:

Frameworks the tiers draw on:

The learning-gap kind

The most novel piece. Anyone can record what they learned. Almost nobody records what they didn't know and the path their understanding took.

The kind has five required slots:

  1. What I (initially) didn't understand
  2. Where the gap came from (prior assumption, missing context, mental model)
  3. What clicked
  4. Still confused (if anything remains)
  5. Related wiki entries to update

Trigger: the human says "I asked this before" / re-asks a question in different form / acknowledges a knowledge gap mid-conversation. Claude Code proposes the entry in the same turn. Quarterly: walk the entries, consolidate into wiki updates.
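The five required slots lend themselves to a simple completeness check before an entry is accepted. In this Python sketch the slot key names are invented for illustration. Only the "still confused" slot is allowed to be empty, following the "(if anything remains)" qualifier above.

```python
# Hypothetical key names for the five slots, in the order listed above.
REQUIRED_SLOTS = (
    "didnt_understand",
    "gap_source",
    "what_clicked",
    "still_confused",
    "wiki_updates",
)
MAY_BE_EMPTY = {"still_confused"}  # present but possibly blank


def slot_problems(entry: dict) -> list:
    """List missing or empty slots; an empty result means the entry is complete."""
    problems = []
    for slot in REQUIRED_SLOTS:
        if slot not in entry:
            problems.append(f"missing: {slot}")
        elif slot not in MAY_BE_EMPTY and not entry[slot]:
            problems.append(f"empty: {slot}")
    return problems
```

Requiring the "still confused" key to exist even when blank forces an explicit answer to "is anything still unclear?" rather than letting the question go unasked.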

The bet here is that recording confusion paths is the AI engineer's growth artifact. If recruiters care about who actually understands the systems they're shipping (not just who shipped them), this is the thing to look at. Empirically unverified.

The /method page

A surface dedicated to the methodology itself, linked from the header nav. Covers the four properties, decision tiers with framework citations, the three judgment-layer kinds, and pointers to the install guide and this post.

The framing is "how I work", not "what I built." A recruiter or external reviewer can read it in 5-10 min without needing to scan individual log entries. The page exists because the spec is reviewable but specs are dry; the methodology page is the entry point that earns the spec read.

Lives at /method.

What's queued for the next sprint

What this isn't

The honest bets

What I'm assuming that could be wrong:

  1. The .human.mdx annotation habit will form. Audit data shows I haven't reliably done this yet. The whole AI/human dual-author surface depends on someone actually filling the human column. If I don't, "REVIEW NEEDED" placeholders accumulate forever and become noise. Mitigation: weekly Sunday-morning annotation ritual on the calendar.
  2. Tier 1 decisions don't burn me out. 20-30 min per T1 entry × 3-5/month = 1-2.5 hr/month (60 min at the low end, 150 min at the high end). Manageable in theory; in practice the slot-filling can feel like homework. Mitigation: tiers exist precisely so T1 is rare.
  3. The learning-gap kind doesn't become a vehicle for performative humility. "Look how much I'm learning!" is a real failure mode. Discipline must come from the author.
  4. Recruiters reading my portfolio actually weigh judgment trail over projects shipped. Empirically untested. Anti-evidence: most recruiters skim and never click through.

Repo and spec