# **Skill Mining**: Extracting What Your Codebase Already Knows

Published: 2026-06-04T10:00:00.000-05:00
Tags: agents, agent-skills, skills, llm, open-source
Canonical: https://www.voodootikigod.com/skill-mining

> Your codebase already encodes how your team builds. Skill mining extracts that latent know-how into reusable agent skills. Here is the loop.

---

As the Head of Forward Deployed Engineering at [Vercel](https://vercel.com), I drop into customer codebases for a living. Startups, enterprises, teams of two and teams of two hundred. The repository is always unfamiliar, the deadline is always real, and I am always the person who cloned it this morning.

So is the agent.

Port 5441, not 5432. `.env.test`, not `.env.local`. A filter flag you write from memory in three seconds, if you have been in this repo before. Neither of us has. That gap costs real time, and it compounds across every engagement.

Repos accumulate that kind of knowledge. The unwritten rule that money is always stored in cents, never floats. The folder a new feature is supposed to go in, enforced only when a reviewer catches you putting it somewhere else. The four-step dance to run a schema migration safely, which you get wrong exactly once and never forget again.

None of that is in a file you can point to. It lives in commit history, in scattered validators, in review comments, in the heads of whoever has been here longest. It is the difference between someone who has worked in your repo for two years and someone who cloned it this morning.

Every agent starts as the person who cloned it this morning. Brilliant and completely without context, every single time.

Skill mining is how you fix that.

## Mining, because the value is already in the ground

I picked the word deliberately. You are not inventing skills. You are extracting something that is already there.

A codebase is a sedimentary record of how a team builds software. Every commit is a decision. Every cluster of bug-fix commits around the same file is a sign that says this part is hard, and here is how we eventually got it right. Every convention the team follows is a pattern that an agent currently has to re-discover by reading three other files and guessing.

That latent know-how has real value, and right now it is locked in a form only humans (and only some humans) can read fluently. Skill mining is the dig: you survey the terrain, find the rich seams, score them by leverage, and pull the valuable ones up into a form your agents can use directly.

The form already exists. The open [Agent Skills Specification](https://agentskills.io/?ref=voodootikigod.com) standardized on a simple format: a folder with a `SKILL.md` file, a name, a trigger-rich description, and a body of instructions. Drop one into the right directory and any compliant harness loads it on demand when the description matches what you are doing. Claude Code, Codex, Antigravity, Cursor, Zed. The package manager (`npx skills`) and the registry at [skills.sh](https://skills.sh/?ref=voodootikigod.com) handle distribution. The plumbing is solved. What has been missing is a disciplined way to figure out which skills are worth having for a given codebase.

That is the gap skill mining fills.

## The prompt that started it

This practice started, like a lot of good practices do, as a long prompt I kept reusing. Something like: 

> do a thorough review of this project's codebase with the goal of building up a set of high-value agent skills, find existing ones to reuse where possible, create new skills only where something is bespoke or unique to this app, then define agents that leverage them to drive implementation, fixing, and improvements as a team.

It worked. But a one-off prompt is itself un-mined knowledge. So I turned it into a proper, repeatable skill with a defined loop, a scoring rubric, and templates. That skill is open source under MIT, installs cross-harness, and you can run it on your own repos today.

The rest of this post is what is inside it.

## The survey comes first, the scalpel comes later

Skill mining runs as seven phases. The first half is broad and parallel. You are surveying and judging. The second half is surgical and sequential. You are writing artifacts you will have to maintain, so restraint matters more than coverage.

Survey, Detect, Score, Dedupe, Author, Compose, Verify. With two adversarial gates wired in between.

```mermaid
flowchart TD
    S[Survey] --> D[Detect]
    D --> SC[Score]
    SC --> DD[Dedupe]
    DD --> GA{Gate A}:::gate
    GA -->|reuse or reject| X[Logged, never built]
    GA -->|build earned| AU[Author]
    AU --> CO[Compose]
    CO --> GB{Gate B}:::gate
    GB -->|fix| AU
    GB -->|ship| V[Verify and report]
    classDef gate fill:#d70000,stroke:#0f0f0f,color:#f2f2f0
```

### Survey: map the territory before judging it

Before deciding anything, the agent builds a factual map. Languages, frameworks, package layout, the actual build/test/lint/deploy commands, and the hotspots. `git log` churn tells you which files change most often, and high churn means high leverage. A skill that speeds up work on the hottest files in the repo pays back faster than one for a corner nobody touches.
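To make the churn signal concrete, here is a minimal sketch that counts how many commits touched each path. It assumes the input is the output of `git log --name-only --pretty=format:`. The function names, the 90-day window, and the sample paths are illustrative assumptions, not something the skill prescribes.

```python
from collections import Counter
import subprocess


def churn_from_log(log_text: str) -> Counter:
    """Count how many commits touched each path.

    Expects `git log --name-only --pretty=format:` output: one changed
    path per line, with blank lines separating commits.
    """
    paths = (line.strip() for line in log_text.splitlines())
    return Counter(p for p in paths if p)


def repo_churn(repo_dir: str = ".", since: str = "90 days ago") -> Counter:
    """Run git in repo_dir and return per-file churn (the window is an assumption)."""
    out = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return churn_from_log(out)


# Hypothetical sample: three commits, two of which touch the same file.
sample = (
    "src/billing/charge.ts\n"
    "\n"
    "src/billing/charge.ts\n"
    "README.md\n"
    "\n"
    "src/auth/session.ts\n"
)
hotspots = churn_from_log(sample).most_common(1)
```

In a real survey you would call `repo_churn()` and read the top of `most_common()` as the shortlist of places where a skill is most likely to pay back.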

It also reads the pain markers: clusters of `TODO`/`FIXME`/`HACK`, files with the most bug-fix commits, recurring reverts, flaky-test annotations. Pain is signal. Pain that recurs is a skill waiting to be written.
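The marker sweep is just as mechanical. A rough sketch, assuming file contents are already loaded into a dict keyed by path; `pain_markers` and the sample files are hypothetical names for illustration only.

```python
import re
from collections import Counter

# Whole-word match so identifiers like "TODOS" are not counted.
PAIN = re.compile(r"\b(TODO|FIXME|HACK)\b")


def pain_markers(files: dict[str, str]) -> Counter:
    """Return the number of TODO/FIXME/HACK markers per file path."""
    counts = Counter()
    for path, text in files.items():
        hits = len(PAIN.findall(text))
        if hits:
            counts[path] = hits
    return counts


# Hypothetical sample files.
sample_files = {
    "src/billing/charge.ts": "// TODO: stop using floats\n// HACK retry twice\n",
    "src/auth/session.ts": "const ttl = 3600;\n",
}
markers = pain_markers(sample_files)
```

Files that rank high on both churn and markers are the ones worth reading first.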

If your harness can run subagents in parallel (Claude's workflow engine, Codex's parallel tasks), this is where you fan out: one explorer per subsystem, then converge. If it cannot, you iterate. The method is identical; only the wall-clock differs.

### Detect: surface candidates everywhere they hide

Here is where most people get it wrong: they think skills means code patterns. The highest-leverage skills almost always live in the operational and tribal layers instead. The stuff nobody wrote down because everybody just knows it.

The mining loop sweeps a deliberate taxonomy: build/test/run incantations (the cheapest, highest-hit-rate skills in existence; every agent re-derives "how do I run this" on every task), domain rules and invariants (allowed state transitions, tenancy isolation, money handling, PII boundaries, enforced today only by scattered validators and reviewer memory), architectural conventions, review checklists (whatever your reviewers reliably catch is a skill), debugging playbooks, and migration and deploy recipes. Multi-step, error-prone, infrequent. Exactly the things people get wrong.

For each candidate, the agent captures the evidence: the files, line ranges, and commit history that prove the pattern actually recurs. That evidence is what makes the authored skill specific instead of generic, which turns out to be the whole game.

### Score: rank by leverage, not enthusiasm

It is very easy to get excited and mine forty skills. Forty skills is noise. The loop forces discipline with a five-axis rubric.

Frequency and leverage are a pair: high frequency with low leverage is noise, low frequency with high leverage is a trap. Bespokeness is the tiebreaker. A pattern that recurs constantly but already has a maintained community skill is a REUSE, not a BUILD. Stability matters because a skill that churns out of date in six weeks is worse than no skill. And verifiability is the gate on everything: if an agent cannot check whether it followed the skill, the skill can never get better.
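One way to encode that rubric is sketched below. The five axes come from the text; the 1 to 5 scale, the verifiability cutoff of 3, and the way the axes combine are my assumptions for illustration, not the formula the skill uses. The candidate names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    frequency: int      # how often the pattern comes up (assumed 1-5 scale)
    leverage: int       # how much each use saves
    bespokeness: int    # how specific it is to this repo
    stability: int      # how slowly it goes out of date
    verifiability: int  # how checkable the outcome is


def rank_key(c: Candidate) -> tuple[int, int]:
    """Sort key: (core score, bespokeness as the tiebreaker)."""
    if c.verifiability < 3:  # assumed cutoff: unverifiable candidates never rank
        return (0, c.bespokeness)
    # Frequency and leverage are a pair, so the weaker of the two caps the
    # score; stability then scales it.
    core = min(c.frequency, c.leverage) * c.stability
    return (core, c.bespokeness)


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Highest-priority candidates first."""
    return sorted(candidates, key=rank_key, reverse=True)


# Hypothetical candidates.
candidates = [
    Candidate("run-integration-tests", 5, 4, 4, 4, 5),
    Candidate("money-in-cents", 4, 5, 5, 4, 4),
    Candidate("write-good-react", 5, 2, 1, 3, 2),
]
ranked = [c.name for c in rank(candidates)]
```

The first two candidates tie on the core score, so bespokeness decides the order; the third fails the verifiability gate and sinks to the bottom regardless of how often it recurs.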

The scores do not just rank the backlog. Their shape drives the next decision, which is the one that matters most.

### Dedupe: reuse before you build

This is the heart of skill mining, and the part that separates it from "the agent wrote me a pile of markdown."

For every candidate that survives scoring, the agent checks the existing ecosystem before authoring anything. It does that with a skill built exactly for the job: `find-skills`. Point it at a candidate and it searches your installed skills, runs `npx skills find <query>`, and checks the [skills.sh](https://skills.sh/?ref=voodootikigod.com) leaderboard for a maintained skill that already covers the need. The cheapest skill in the world is the one somebody else already maintains. There are battle-tested community skills for React, Next.js, testing, security review, ClickHouse, and dozens more, with hundreds of thousands of installs and active maintainers. Re-implementing those in your repo is not a flex. It is a liability you now own.

Most candidates resolve to REUSE: a maintained public skill already covers this, install it and move on. A few are close but need a thin overlay. That is EXTEND. BUILD is the exception that has to earn its place: genuinely bespoke, high-leverage, nothing in the ecosystem that covers it. REJECT is not failure; it is the loop doing its job correctly. A good mining run might surface thirty candidates and build six. The other twenty-four are the things you chose not to re-invent, because they did not need re-inventing.

```mermaid
flowchart TD
    C[Scored candidate] --> Q1{Public skill exists?}
    Q1 -->|fits as-is| RE[REUSE]
    Q1 -->|close, needs overlay| EX[EXTEND]
    Q1 -->|nothing covers it| Q2{Bespoke and high-leverage?}
    Q2 -->|yes| BU[BUILD]:::build
    Q2 -->|no| RJ[REJECT]
    classDef build fill:#d70000,stroke:#0f0f0f,color:#f2f2f0
```
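The decision tree above is small enough to write down as a function. A minimal sketch: the four verdicts are from the text, while the `decide` signature and the `"as-is"` / `"close"` labels are hypothetical stand-ins for whatever the ecosystem search reports.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    REUSE = "REUSE"
    EXTEND = "EXTEND"
    BUILD = "BUILD"
    REJECT = "REJECT"


def decide(public_fit: Optional[str], bespoke: bool, high_leverage: bool) -> Decision:
    """Route a scored candidate.

    public_fit is "as-is" or "close" when a maintained public skill was
    found, or None when nothing in the ecosystem covers the need.
    """
    if public_fit == "as-is":
        return Decision.REUSE
    if public_fit == "close":
        return Decision.EXTEND
    if public_fit is not None:
        raise ValueError(f"unknown public_fit value: {public_fit!r}")
    # Nothing public covers it: building still has to be earned.
    if bespoke and high_leverage:
        return Decision.BUILD
    return Decision.REJECT
```

Note the ordering: the bespokeness and leverage flags are never consulted when a public skill exists, which is the reuse bias expressed in code.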

## The failure mode nobody talks about

The moment skills become easy to create, you get a new problem that looks a lot like the old `utils.js` problem, or the company wiki nobody trusts.

Skill sprawl. Dozens of overlapping, half-maintained skills piling up faster than anyone can curate them. And its close cousin, skill redundancy: three different write-good-React skills, each slightly different, none authoritative, all drifting apart over time. Sprawl and redundancy do not just waste effort. They actively degrade your agents. When five skills could match a task, the agent loads the wrong one, or loads two that contradict each other, and the quality you were trying to add turns into noise.

A bloated skill library is worse than a small one. Same way a 4,000-line `utils` file is worse than a tight standard library.

`find-skills` is the guardrail against both. By forcing a does-this-already-exist check before anything gets authored, it makes reuse the default and authoring the exception. Every candidate it can route to an existing skill is one fewer thing you maintain, one fewer near-duplicate for an agent to trip over, one fewer source of drift. A skill you did not write (because a maintained one already existed) is the highest-leverage outcome of the entire process. If you want the full quality toolkit once skills are in place (audit, lint, token budgeting, security scanning), that is [a separate post](https://voodootikigod.com/the-missing-quality-toolkit-for-agent-skills/?ref=voodootikigod.com).

## Author: write skills that are specific, or not at all

For the candidates you do build, the loop writes a `SKILL.md` with a few non-negotiable properties.

A trigger-rich description: this single field is how the skill gets discovered, and it has to contain the phrases a person would actually say. You write it last, once you know exactly what the skill does. One job per skill: if describing it needs the word "and," split it. Real commands and real paths: "write good tests" is not a skill; "run `pnpm test:int`; integration tests live in `tests/integration` and need `.env.test` with port 5441" is a skill. The specificity is the value. And a verification step: if an agent cannot tell whether it followed the skill, the skill can never improve.
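To show what those properties look like on disk, here is a small sketch that renders a `SKILL.md` from the integration-test example above. It assumes the frontmatter carries `name` and `description` fields; the `render_skill` helper, the section headings, the skill name, and the verification wording are hypothetical, so check the spec before relying on the exact layout.

```python
import json


def render_skill(name: str, description: str, steps: list[str], verify: str) -> str:
    """Render a minimal SKILL.md: frontmatter, numbered steps, verification."""
    lines = [
        "---",
        f"name: {name}",
        # json.dumps yields a double-quoted string, which is also valid YAML,
        # so colons or quotes in the description cannot break the frontmatter.
        f"description: {json.dumps(description)}",
        "---",
        "",
        "## Steps",
        "",
    ]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    lines += ["", "## Verify", "", verify, ""]
    return "\n".join(lines)


# Hypothetical skill built from the example in the text.
skill_md = render_skill(
    name="run-integration-tests",
    description=(
        "Run this repo's integration tests. Use when asked to run "
        "integration tests or debug a failing integration test."
    ),
    steps=[
        "Make sure `.env.test` exists and points at port 5441.",
        "Run `pnpm test:int`; integration tests live in `tests/integration`.",
    ],
    verify="`pnpm test:int` exits with status 0.",
)
```

One job, real commands, real paths, and a check the agent can run: everything else in the file is optional.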

## Compose: mine the agents, not just the skills

Skills are capabilities. Agents are the roles that wield them.

A generic code reviewer agent is weak. It reviews like a smart stranger. A reviewer agent that loads your repo's convention skill and your repo's security skill reviews like a senior engineer who has been on the team for years. It is carrying the same institutional knowledge they are. Same for an implementer that loads your architecture skill, a fixer that loads your debugging playbooks, a migrator that loads your deploy recipes.

A typical mined roster is an implementer, a fixer, a reviewer, and a migrator, each one a thin definition that names the specific skills it loads and the procedure it follows. It is [the same separation of roles](https://voodootikigod.com/gemini-plugin-cc/?ref=voodootikigod.com) that makes multi-model agent teams work. The skills are the shared knowledge base; the agents are the specialists who have studied it.

The loop then writes a team manifest: who hands off to whom, and in what order. Implementer ships a diff to Reviewer; Reviewer routes findings to Fixer; Fixer's patch goes back to Reviewer; anything touching schema or deploy branches to Migrator. That manifest is what makes "drive improvements as a team" a runnable workflow instead of an aspiration.

```mermaid
flowchart LR
    I[Implementer] -->|diff| R[Reviewer]
    R -->|findings| F[Fixer]
    F -->|patch| R
    R -->|schema or deploy| M[Migrator]
    R -->|clean| SH[Ship]:::ship
    classDef ship fill:#d70000,stroke:#0f0f0f,color:#f2f2f0
```
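The handoffs in the diagram reduce to a lookup table. A sketch under that assumption: the roles and edges mirror the diagram, while the `HANDOFFS` structure, the outcome labels, and the helper names are hypothetical and say nothing about the manifest format the skill actually writes.

```python
# (current role, outcome of its step) -> who receives the work next
HANDOFFS = {
    ("implementer", "diff"): "reviewer",
    ("reviewer", "findings"): "fixer",
    ("fixer", "patch"): "reviewer",
    ("reviewer", "schema-or-deploy"): "migrator",
    ("reviewer", "clean"): "ship",
}


def next_role(role: str, outcome: str) -> str:
    """Look up the next stop, failing loudly on an undefined handoff."""
    try:
        return HANDOFFS[(role, outcome)]
    except KeyError:
        raise ValueError(
            f"no handoff defined for {role!r} with outcome {outcome!r}"
        ) from None


def trace(start: str, outcomes: list[str]) -> list[str]:
    """Follow a sequence of outcomes and return every role visited."""
    path, role = [start], start
    for outcome in outcomes:
        role = next_role(role, outcome)
        path.append(role)
    return path


# One review round with a fix, then a clean pass.
path = trace("implementer", ["diff", "findings", "patch", "clean"])
```

Keeping the table explicit means an undefined handoff is an error rather than a guess, which is the property you want from a workflow agents will follow.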

## Verify: prove it, and hide nothing

A mined skill is a hypothesis until it is tested. A fresh-context agent gets only the authored skill and has to complete a real recent task with it. The verdict is SHIP, FIX, or REJECT, recorded with concrete evidence. A skill is verified only once a cold agent used it and it actually worked. Then lint the artifacts: valid frontmatter, unique names, descriptions that contain real trigger phrases.
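The lint pass is the easy part to automate. A minimal sketch of the three checks named above, assuming flat `key: value` frontmatter and a per-skill list of phrases you expect people to say; `parse_frontmatter`, `lint_skills`, and the sample skills are hypothetical, and a real linter would use a proper YAML parser.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # closing delimiter never found


def lint_skills(skills: dict[str, str], triggers: dict[str, list[str]]) -> list[str]:
    """Return human-readable problems; an empty list means the set passes."""
    problems, seen = [], {}
    for path, text in skills.items():
        meta = parse_frontmatter(text)
        name, desc = meta.get("name"), meta.get("description", "")
        if not name or not desc:
            problems.append(f"{path}: frontmatter needs name and description")
            continue
        if name in seen:
            problems.append(f"{path}: name {name!r} already used by {seen[name]}")
        seen.setdefault(name, path)
        wanted = triggers.get(name, [])
        if not any(phrase.lower() in desc.lower() for phrase in wanted):
            problems.append(f"{path}: description has none of the expected trigger phrases")
    return problems


# Hypothetical skills: one good, one duplicate with a vague description.
good = "---\nname: run-integration-tests\ndescription: Use when asked to run integration tests.\n---\nbody\n"
dupe = "---\nname: run-integration-tests\ndescription: Helps with testing.\n---\n"
expected = {"run-integration-tests": ["run integration tests"]}
problems = lint_skills({"a/SKILL.md": good, "b/SKILL.md": dupe}, expected)
```

Lint only proves the artifact is well-formed. The cold-agent run is still what earns the SHIP verdict.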

Then the loop writes `SKILLS_MINED.md`: every candidate considered, its scores, its decision, and why. Nothing is dropped silently. A rejected candidate with a clear reason is a real output. It stops the next person, or the next mining run, from re-mining the same dead end. A deferred list captures the mid-scoring candidates worth a second look next pass, so the practice compounds instead of restarting from zero.

## Built-in skepticism: the two gates

Running this on my own repos taught me the most important lesson the hard way: the loop is biased toward building things. Left alone, the agent talks itself into bespoke skills it thinks are clever, inflates its own leverage scores, and writes skills that read perfectly to the agent that wrote them and uselessly to everyone else. An author always fills its own gaps from memory. A cold reader cannot.

The fix is to put an adversary in the loop. An independent reviewer, fresh context, prompted to refute rather than to check.

**Gate A challenges the decision.** Before any candidate gets built, a skeptic re-scores it with the burden of proof reversed. Default verdict is reuse or reject; it has to be talked into a build. It attacks the recurrence evidence and the bespokeness claim. This is what keeps the loop honest about reuse. It is where most of the accuracy comes from.

**Gate B red-teams the artifact.** After a skill is written, a fresh agent gets only that file. Not the survey, not the reasoning that produced it. It has to complete a real task with nothing else to go on. A skill that says "handle money correctly" with no commands, no invariants, no specific paths is not a skill. The agent guesses. The guess is wrong. You do not find out until a float slips into a charge.

The non-negotiable property is independence. A skill grading its own homework rubber-stamps every time. The reviewer has to be a separate pass with no stake in the answer, told its job is to break things, not to bless them. That single discipline is the difference between a pile of plausible markdown and a portfolio you can trust.

## Three things check into the repo

Run this on a real repo and three things check into the project. A handful of sharp, repo-specific skills your agents load automatically when the work matches. A small team of agent definitions that compose those skills into roles. And a report that documents what exists, what you reused, and what you chose not to build: institutional memory about your institutional memory.

The compounding effect is the real prize. Every skill is something your agents no longer re-derive. Every agent is a specialist you can summon. And because the skills are versioned files in the repo, they improve through normal pull requests. Someone hits a sharp edge, files a fix, and now every agent and every teammate inherits the lesson. The codebase starts teaching itself how to build itself.

## The same loop runs an organization's AI portfolio

Skill mining is the individual loop: one developer or team, one repo, run on demand. But look at the shape of it. Detect recurring know-how. Match it against what already exists. Find the gaps. Build only what is missing. Measure how much of the important work is now covered.

That is not just a repo workflow. That is a governance loop. And the moment you have more than a handful of people using AI across an organization, the absence of that loop starts costing real money.

Most enterprises rolling out AI cannot answer the questions that actually matter. What work are people trying to do with AI? Which of those patterns recur often enough to be worth standardizing? Which approved skills are being adopted and which are quietly ignored? Where are people reinventing the same capability, ad hoc, across teams that never compare notes?

I have watched teams spend weeks building a custom SQL generation skill when three other teams in the same org had already built functionally identical skills and quietly abandoned them six months earlier. Nobody compared notes. Nobody knew.

When nobody can answer those questions, three things happen at once. People do not adopt agents, because there is no trustworthy approved capability for the work they actually do. The org pays for the same capability to be re-invented badly, dozens of times. Finance sees a token bill climbing with no way to tie it to value, so the instinct becomes throttle the spend rather than curate the portfolio. Which kills adoption from the other direction.

Cost dashboards describe the symptoms. They tell you that you spent a lot of tokens. They do not give you the operating loop to turn scattered AI usage into an intentional portfolio of capabilities. That loop is exactly skill mining, run continuously over an organization's AI traffic instead of over a single repo.

The metric that makes it legible is the org-scale version of the `SKILLS_MINED.md` report: portfolio coverage, the percentage of your top recurring work patterns backed by an approved, trusted capability. That single number reframes the whole conversation. When it goes up, people actually adopt agents because the work they do has a capability built for it. When it goes down, you know exactly where the gap is and what to build next.
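The arithmetic behind that number is simple. A sketch, assuming you already have a list of top recurring work patterns and a set of patterns with an approved capability behind them; how you derive either list is the hard part and is out of scope here. The function names and pattern labels are hypothetical.

```python
def portfolio_coverage(top_patterns: list[str], approved: set[str]) -> float:
    """Percentage of top recurring work patterns backed by an approved capability."""
    if not top_patterns:
        return 0.0
    covered = sum(1 for pattern in top_patterns if pattern in approved)
    return 100.0 * covered / len(top_patterns)


def coverage_gaps(top_patterns: list[str], approved: set[str]) -> list[str]:
    """Patterns people keep doing that have no approved capability yet."""
    return [pattern for pattern in top_patterns if pattern not in approved]


# Hypothetical portfolio.
top = ["sql-generation", "release-notes", "incident-triage", "schema-migration"]
approved = {"sql-generation", "schema-migration", "brand-voice"}

coverage = portfolio_coverage(top, approved)  # 2 of 4 patterns covered
gaps = coverage_gaps(top, approved)
```

An approved capability that matches no top pattern (`brand-voice` here) does not move the number, which is the point: coverage measures the work people actually do, not the size of the library.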

The enterprise-scale version of this is what I am building toward next. Skill mining gives an individual team that loop on a repository; the same loop run across an organization is how enterprise AI stops being a cost center and becomes an operating model. One is the dig site; the other is running the whole mineral economy.

The skill is open source under MIT and installs cross-harness. Two steps:

```bash
# Install the skill into your harness
npx skills add voodootikigod/skill-mining

# Then, inside your agent (this is a prompt, not a shell command),
# on the project you want to mine, ask:
#   mine this repo for skills
```

It will survey the codebase, score what it finds, tell you what to reuse versus build, write the skills and agents it recommends, and hand you a report of everything it considered. Read the report critically. The rejected and deferred candidates are often as informative as the built ones. Commit the skills you keep, and watch the next task go faster.

Trust the reuse bias. Let `find-skills` win the argument. The instinct to build everything bespoke is the thing to fight; it is how you end up with skill sprawl. Re-mine after big changes: a major refactor or a new subsystem lays down a fresh seam of conventions, and capturing them while they are hot is the whole point. Treat skills like code: review them, version them, fix them in PRs. A [skill that drifts out of date](https://voodootikigod.com/your-agents-knowledge-has-a-shelf-life/?ref=voodootikigod.com) is worse than no skill at all.

The knowledge of how your team builds is already in your codebase. It is just locked in a form only your most senior people can read fluently. Skill mining is how you get it out, and how you make sure the next change, by a human or an agent, starts two years ahead instead of from scratch.

* * *

_The skill-mining skill is MIT-licensed and available via `npx skills add voodootikigod/skill-mining`. The [Agent Skills Specification](https://agentskills.io/?ref=voodootikigod.com) lives at agentskills.io. The enterprise-scale version of this loop is a story for soon._
