Composing over reimplementing: how I structure Claude Code skills
At the end of the ship-it post I mentioned that the skill exists in two places: a team version in the repo and a personal version in my dotfiles. That’s not an accident. It’s how I develop skills without breaking things for my team.
The problem is simple. I want to experiment with workflow changes freely, test them on real tickets, and only push what works. But “what works for me” and “what works for the team” aren’t the same thing. My personal start-ticket has a team config step, uses custom skills I built (design-gate, impl-plan), and talks to my personal knowledge vault. The team version uses Superpowers skills (brainstorming, writing-plans) and doesn’t know about my dotfiles. If I push my version as-is, it breaks for everyone else.
This post is about the pattern I landed on for structuring skills so they can be developed personally, tested on real work, and shared incrementally. It’s also about a related problem that forced the same solution: context budget.
The two-version problem
The shape I settled on: the team version is canonical and the personal version shadows it.
The team version lives in tools/claude-skills/ in the repo. It goes through code review like any other change, and it can only reference tools and skills that everyone on the team has access to. When someone clones the project they get the skills automatically.
The personal version lives in ~/.claude/skills/. It mirrors the team version but deviates where behavior is specific to my setup. My start-ticket has a team config step that dispatches work to specialized agents. It uses design-gate instead of Superpowers brainstorming, impl-plan instead of writing-plans. It queries my personal memento vault on top of the team’s knowledge directory. The team version doesn’t know about any of that.
The way I develop changes is: experiment in the personal copy, test it on real tickets for a few days, then port what worked back to the team version through a PR. The team gets a subset of what I have, just the parts that don’t depend on my specific setup.
The problem is they diverge silently. I update my personal version, forget to port the change, and now the two behave differently in ways nobody notices until something breaks. I’ve been bitten by this three times already. Once my personal version had a vault capture step that the team version didn’t, so teammates were losing domain knowledge that my sessions were keeping. I only found out when I went looking for a note I expected to exist and it wasn’t there, because my teammate’s session had used the team skill.
There’s no automation for keeping them in sync yet. It’s manual discipline: any personal change gets a note to port it back, and I check both files side by side every couple of weeks. Not great, but it beats the alternatives of never experimenting or pushing experimental changes directly to the team.
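That manual check is mechanical enough to script. Here's a minimal sketch in Python, assuming the two trees mirror each other's layout; the function name and report shape are hypothetical, not an existing tool:

```python
import filecmp
from pathlib import Path

def skill_drift(team_dir: str, personal_dir: str) -> dict[str, list[str]]:
    """Report how a personal skill tree has drifted from the team copy.

    Returns files that exist on only one side, plus files present in
    both trees whose contents differ. Paths are compared relative to
    each root, so the trees can live anywhere (e.g. the repo's
    tools/claude-skills/ vs ~/.claude/skills/).
    """
    team, personal = Path(team_dir), Path(personal_dir)
    team_files = {p.relative_to(team) for p in team.rglob("*") if p.is_file()}
    personal_files = {p.relative_to(personal) for p in personal.rglob("*") if p.is_file()}

    changed = [
        str(rel) for rel in sorted(team_files & personal_files)
        if not filecmp.cmp(team / rel, personal / rel, shallow=False)
    ]
    return {
        "only_in_team": sorted(str(p) for p in team_files - personal_files),
        "only_in_personal": sorted(str(p) for p in personal_files - team_files),
        "changed": changed,
    }
```

The "only_in_personal" bucket is the interesting one: it's exactly the vault-capture failure mode, a step that exists in one copy and silently not in the other.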
Context budget
The two-version problem pushed me toward smaller, composable skills. A completely separate problem pushed me in the same direction from a different angle.
I ran an audit on my Claude Code setup in March and found 109 skills loaded into every session. Every one registers its name and description into the agent’s context at startup. That’s tokens spent before I’ve typed anything.
The biggest offenders were two plugin suites. Superpowers had 14 skills, about 108KB of SKILL.md files plus 75KB of supporting docs. It also had an always-on meta-skill called using-superpowers that cost roughly 1,300 tokens per session just to tell the agent how to use the other skills. Octo was worse: 32 personas, 50+ skills, 10 “droids,” all registered into every session. I was using maybe 5 of those capabilities across both plugins.
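An audit like this is easy to reproduce. A rough sketch, assuming one SKILL.md per skill directory and the common approximation of ~4 bytes per token for English prose (both assumptions, not measurements):

```python
from pathlib import Path

def audit_skills(skills_dir: str) -> list[tuple[str, int, int]]:
    """List each skill with its SKILL.md size in bytes and a rough
    token estimate, sorted largest-first so the biggest context
    offenders surface at the top.
    """
    rows = []
    for skill_md in Path(skills_dir).rglob("SKILL.md"):
        size = skill_md.stat().st_size
        rows.append((skill_md.parent.name, size, size // 4))  # ~4 bytes/token
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

This only counts the SKILL.md bodies; the name-and-description registration cost per session is separate and smaller, but it applies to every skill whether invoked or not.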
The ETH Zurich study I mentioned in the harness engineering post found that more instructions mean worse adherence. The QRSPI framework puts the sweet spot at 150-200 instructions max, each step under 40 words. I was blowing past that before doing any actual work.
I replaced Superpowers (14 skills, 108KB) with 4 custom skills totaling 8.5KB: design-gate, impl-plan, check-done, debug-method. TDD got baked into impl-plan as the default rhythm instead of being a separate opt-in. The always-on meta-skill disappeared entirely.
Octo (32 personas, 50+ skills) became 3 skills at about 140 lines total: /research, /debate, /security-audit. A grill-me session before the replacement turned up something useful: the multi-perspective value in these tools comes from prompt framing (giving agents different angles), not from carrying dozens of personas in context. Different LLM providers matter for debate where you want genuine disagreement, and barely matter for research where they converge on the same docs anyway.
Fewer skills loaded at session start, but each one sharper and more opinionated. The skills I kept are stricter about control flow (hard gates, explicit approval points) precisely because there are fewer of them. With 109 skills the agent guesses which one applies. With 25, each has a clear trigger and a clear scope.
What composition looks like in practice
Start-ticket is a good example. Seven steps, but it barely does anything itself. It’s an orchestrator.
Step 1 fetches the ticket from Linear through MCP tools. Step 2 explores the codebase with grep and read, then spawns a dala-expert agent to check the knowledge vault. Step 5 invokes design-gate, which is its own skill with its own flow: clarifying questions, multiple approaches, incremental approval. When design-gate finishes, start-ticket records discoveries through dala-expert again, updates Linear, and hands off to impl-plan for task decomposition and TDD.
Each of those sub-skills is standalone. Design-gate doesn’t know it’s being called by start-ticket. Impl-plan doesn’t know design-gate ran before it. They communicate through conversation context, not direct coupling. I can use design-gate on its own for a spike that doesn’t have a ticket, or use impl-plan directly when I already know what I want to build and don’t need the design phase.
The practical benefit is I can replace one piece without touching the rest. When I swapped Superpowers brainstorming for design-gate, start-ticket’s step 5 changed from “invoke brainstorming” to “invoke design-gate.” One line. The rest of the skill didn’t know or care. When I eventually improve how impl-plan handles parallel task dispatch, start-ticket won’t change at all.
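Skills are markdown prompts, not code, but the coupling pattern maps cleanly onto ordinary function composition. A Python analogy, where every name is illustrative:

```python
def design_gate(ticket: dict) -> dict:
    """Standalone skill: doesn't know who invoked it."""
    return {**ticket, "design": f"approved design for {ticket['id']}"}

def impl_plan(ticket: dict) -> dict:
    """Standalone skill: doesn't know design_gate ran before it."""
    return {**ticket, "tasks": ["write failing test", "implement", "refactor"]}

# The orchestrator is just a sequence. Swapping Superpowers
# brainstorming for design_gate was a change to this one line.
START_TICKET_STEPS = [design_gate, impl_plan]

def start_ticket(ticket: dict) -> dict:
    # Sub-skills communicate through the shared context (here, the
    # ticket dict), never through direct references to each other.
    for step in START_TICKET_STEPS:
        ticket = step(ticket)
    return ticket
```

Because `design_gate` takes and returns the same shape as every other step, it also works called on its own, which is the spike-without-a-ticket case.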
Ship-it has the same shape. Scope detection, quality gates, guided walkthrough, code review, PR creation, each phase mostly self-contained. The guided walkthrough is actually a separate skill (/guided-review) that my coworker Guðjón built. Ship-it just calls it conditionally when the diff is big enough. Code review is another skill (requesting-code-review). Ship-it is the sequencer that decides when to call what and which gates to enforce between them. That’s composition: different people can build different pieces, and the orchestrator snaps them together.
The pattern
Looking at what survived across all of this, a few things held up.
Load on demand, not always-on. The Superpowers meta-skill cost 1,300 tokens every session whether I used it or not. Design-gate costs zero until start-ticket invokes it. Skills should be invisible until they’re needed.
Orchestrate, don’t reimplement. Start-ticket doesn’t contain brainstorming logic. Ship-it doesn’t contain code review logic. They call skills that do those things, and when the sub-skill improves the orchestrator gets better for free.
Fewer skills, stricter control flow. 109 skills with loose triggers means the agent guesses. 25 with explicit triggers and hard gates means the agent follows a path. Cutting quantity is what made it possible to raise quality per skill.
Personal version leads, team version follows. Experiment in the personal copy, prove it on real work, extract the portable parts into a PR. The divergence problem is real but manageable.
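Load-on-demand has a direct analogue in code: keep only name and description resident, and read the body on first invocation. A sketch of that registry shape (hypothetical, not Claude Code's actual implementation):

```python
from pathlib import Path

class SkillRegistry:
    def __init__(self):
        self._descriptions = {}  # resident in context from session start
        self._paths = {}         # where each full SKILL.md body lives
        self._bodies = {}        # loaded lazily, on first invocation

    def register(self, name: str, description: str, path: str) -> None:
        self._descriptions[name] = description
        self._paths[name] = Path(path)

    def startup_context(self) -> str:
        # The per-session cost: one line per skill, no bodies.
        return "\n".join(f"{n}: {d}" for n, d in self._descriptions.items())

    def invoke(self, name: str) -> str:
        if name not in self._bodies:  # first use pays the read
            self._bodies[name] = self._paths[name].read_text()
        return self._bodies[name]
```

The always-on meta-skill is the degenerate case of this design: a body that gets loaded every session regardless of whether anything ever calls it.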
What connects the two-version problem and the context budget problem is that both push toward the same design: small, single-purpose skills that compose through orchestration. Small skills are easier to shadow because there’s less surface area to diverge. They cost less context budget. They’re easier to replace (one line in the orchestrator). The constraints reinforce each other.
I’m still figuring out how to automate the sync between personal and team versions, and the context budget keeps growing as I add MCP servers and tools. For now the manual discipline holds. I think the better answer is path-scoped skill loading, where skills only register into context when the agent touches files in their domain. More on that in the next post.
Next up: our AGENTS.md hit 714 lines. I audited it and about half was stuff the model already knows. How to treat your convention file like a constitution, not an encyclopedia.
Part of a series on agentic development tooling. See also: Harness engineering: the model is just the horse, Five revisions of start-ticket, Ship-it