How I Built the LLM Council Inside My Notion AI Agent (And Why It Produces Better Answers Than a Single Prompt)

Adapting Karpathy's multi-model deliberation framework to work inside my Notion AI

May 07, 2026


I spent a Saturday morning reading through Andrej Karpathy’s LLM Council GitHub repo.

By lunchtime, I’d built a new mode for my Notion AI agent that simulates the same multi-model deliberation process, but with a single AI model.

This post walks through exactly how I did it, why I made certain design choices, and what happened when I tested it head-to-head against my existing setup.

The Spark: Karpathy’s LLM Council

Andrej Karpathy (co-founder of OpenAI, former head of AI at Tesla) published a repo called llm-council in late 2025. He described it as a “fun Saturday hack” he vibe-coded while reading books with LLMs.

The concept is simple and clever.

Instead of asking one AI model a hard question, you ask four of them the same question at the same time. Then you make each model review and rank the others’ answers (anonymized, so they can’t play favorites). Finally, a designated “Chairman” model reads everything and produces one unified final answer.

Three stages:

First Opinions — all models answer independently, in parallel
Peer Review — each model evaluates and ranks the others’ responses (anonymized as “Response A”, “Response B”, etc.)
Chairman Synthesis — one model reads all answers and all reviews, then writes the final response

Karpathy’s default council uses GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5, and Grok 4, all routed through OpenRouter. Gemini 3 Pro doubles as the Chairman.

The anonymization in Stage 2 is the smartest design choice. Models can’t self-boost because they don’t know which response is theirs. It forces honest evaluation.
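To make the anonymization step concrete, here is a minimal sketch of how answers could be relabeled before peer review. It is not Karpathy's code: the function name `anonymize`, the placeholder model names, and the shuffle are my assumptions for illustration.

```python
import random
import string
from typing import Dict, Optional, Tuple


def anonymize(
    responses: Dict[str, str], seed: Optional[int] = None
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (label -> answer text, label -> model name)."""
    rng = random.Random(seed)
    models = list(responses)
    rng.shuffle(models)  # assumption: shuffle so label order does not hint at identity
    labelled: Dict[str, str] = {}
    key: Dict[str, str] = {}
    for letter, model in zip(string.ascii_uppercase, models):
        label = f"Response {letter}"
        labelled[label] = responses[model]  # what the reviewers see
        key[label] = model  # kept private until the reviews are in
    return labelled, key


# Placeholder model names and answers, for illustration only.
answers = {
    "model-1": "Ship the small version first.",
    "model-2": "Validate demand before building.",
    "model-3": "Partner with an existing tool.",
}
labelled, key = anonymize(answers, seed=7)
for label, text in labelled.items():
    print(label, "->", text)
```

Reviewers only ever see `labelled`. The `key` mapping is used afterwards to translate rankings back to model names.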

Why This Caught My Attention

I run a custom Notion AI agent called NOVA that handles everything from daily briefs to content repurposing to strategic decisions. NOVA has 16 specialized modes (think of them as sub-agents), each with its own instructions page and behavior rules.

One of my most-used modes is the AI Sparring Partner. I use it to stress-test ideas, challenge assumptions, and get counterarguments before committing to a direction.

Related post: How I Built an AI Sparring Partner in Notion AI to Challenge My Ideas (Before I Waste Time and Money), by Anfernee, December 16, 2025

It’s good. But it has a limitation: it gives me one perspective at a time. I ask a question, I get one answer. If I want a different angle, I have to prompt for it. The thinking is linear.

Karpathy’s council solves this by design. Multiple perspectives, evaluated against each other, synthesized into one answer. The output quality is structurally better because the process forces disagreement and reconciliation.

I wanted that inside Notion, as part of NOVA.

Related post: How I Built NOVA, The Notion AI Agent That Actually Works, by Anfernee, October 3, 2025

The Problem: One Model, Not Four

Karpathy’s version works because it uses four different frontier models. GPT thinks differently from Claude. Gemini reasons differently from Grok. You get genuine diversity of thought because the models have different training data, architectures, and tendencies.

Inside Notion AI, I’m working with one model. I can’t spin up parallel API calls to four different providers. Every response comes from the same underlying model.

So the question became: can a single model simulate a multi-perspective deliberation process and still produce meaningfully better results than a standard single-pass answer?

The answer, after testing, is yes. With constraints.

Designing Council Mode: The Key Decisions

I didn’t try to replicate Karpathy’s system exactly. That would’ve been pointless inside a single-model environment. Instead, I adapted the core idea to work within Notion AI’s constraints.

Here’s what I kept, what I changed, and why.

Decision 1: Keep the 3-Stage Structure

The three stages (generate perspectives → evaluate → synthesize) are the engine of the concept. Removing any stage weakens the output.

Without Stage 2 (cross-evaluation), you just get a list of perspectives with no analysis of which is strongest. That’s brainstorming, not deliberation.

Without Stage 3 (synthesis), you leave the user to reconcile four different viewpoints themselves. The whole point is that the AI does the reconciliation.

I kept all three stages intact.

Decision 2: Replace Models with Perspectives (Lenses)

This is the biggest adaptation. Instead of four different AI models, I defined four distinct thinking lenses:

The Strategist - long-term positioning, leverage, second-order effects
The Operator - speed, simplicity, what ships this week
The Skeptic - what could go wrong, hidden costs, blind spots
The Creator - fresh angles, unconventional approaches, audience resonance

Why these four?

I chose perspectives that naturally produce tension. The Strategist and Operator almost always disagree (long-term vs. short-term). The Skeptic challenges everyone. The Creator offers angles the others miss.

If all four lenses agreed on everything, the mode would be useless. The value comes from structured disagreement. These four lenses maximize the chance of genuine divergence in the output.

I also made them swappable. If a question is about pricing, The Creator might not be the right lens. The instructions allow NOVA to swap in “The Economist” or “The Teacher” when a lens doesn’t fit the question. Flexibility without losing structure.
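If it helps to see the lenses as data, here is a rough sketch of the four defaults and a swap. In NOVA this lives in plain-language instructions, not code. The `Lens` class and `swap_lens` function are hypothetical, and the focus text for The Economist is a placeholder I made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Lens:
    name: str
    focus: str


DEFAULT_LENSES = [
    Lens("The Strategist", "long-term positioning, leverage, second-order effects"),
    Lens("The Operator", "speed, simplicity, what ships this week"),
    Lens("The Skeptic", "what could go wrong, hidden costs, blind spots"),
    Lens("The Creator", "fresh angles, unconventional approaches, audience resonance"),
]


def swap_lens(lenses: List[Lens], old_name: str, new_lens: Lens) -> List[Lens]:
    """Return a new list with one lens replaced; the council size stays the same."""
    if old_name not in {lens.name for lens in lenses}:
        raise ValueError(f"No lens named {old_name!r}")
    return [new_lens if lens.name == old_name else lens for lens in lenses]


# Placeholder focus text: the post names The Economist but does not define it.
economist = Lens("The Economist", "pricing, margins, willingness to pay")
pricing_council = swap_lens(DEFAULT_LENSES, "The Creator", economist)
print([lens.name for lens in pricing_council])
```

Swapping returns a new list and leaves the defaults untouched, which mirrors the idea of flexibility without losing structure.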

Decision 3: Keep Cross-Evaluation Explicit

In Stage 2, the model evaluates each perspective against the others: what each one gets right, what each one misses, where they agree (high-confidence signal), and where they disagree (the interesting part).

Then it ranks them with a one-sentence justification per ranking.

I considered skipping this stage to save output length. But in testing, the cross-evaluation is where the real insight lives. It forces the model to do comparative reasoning, not just generate multiple answers. The ranking also gives me a quick signal about which direction to lean without reading everything.

I added one rule: if all four perspectives agree, skip Stage 2 entirely. Unanimous agreement means the question wasn’t complex enough for Council Mode. No point in evaluating perspectives that all say the same thing.
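The skip rule is a simple branch. Here is a sketch of the control flow, assuming each lens ends with a one-line verdict. In practice the model judges agreement itself, so the exact string comparison here is only a stand-in, and `plan_stages` is a hypothetical name.

```python
from typing import Dict, List


def plan_stages(verdicts: Dict[str, str]) -> List[str]:
    """Decide which stages to run from each lens's one-line verdict."""
    distinct = {verdict.strip().lower() for verdict in verdicts.values()}
    if len(distinct) <= 1:
        # Unanimous: nothing to compare, so go straight to synthesis.
        return ["perspectives", "synthesis"]
    return ["perspectives", "cross-evaluation", "synthesis"]


unanimous = plan_stages({"The Strategist": "Build it", "The Operator": "build it "})
split = plan_stages({"The Strategist": "Wait", "The Operator": "Build it"})
print(unanimous)
print(split)
```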

Decision 4: The Chairman’s Synthesis Has Rules

Stage 3 doesn’t just summarize. It follows specific rules:

Lead with the strongest insight from the top-ranked perspective
Integrate non-obvious points from lower-ranked perspectives
Flag any genuine unresolved tension (don’t pretend consensus when it doesn’t exist)
End with a concrete next action

The last two are important. AI tends to smooth over disagreements and end with vague encouragement. Council Mode explicitly instructs against both. If the perspectives genuinely conflict, the synthesis says so. And every output ends with something actionable, not a motivational sentence.

Decision 5: Hard Word Limit

I capped the full output at 800 words unless the question demands more. Without this constraint, the three-stage process balloons into 2,000+ words. That defeats the purpose. Council Mode should be a sharper answer, not a longer one.
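If you want to check outputs against the cap outside Notion, a whitespace word count is enough. This is a small sketch: `word_overage` is a hypothetical helper, and splitting on whitespace is only an approximation of how any tool counts words.

```python
def word_overage(text: str, cap: int = 800) -> int:
    """Return how many words the text runs over the cap (0 if it fits)."""
    return max(0, len(text.split()) - cap)


print(word_overage("word " * 2000))  # a ballooned three-stage answer
print(word_overage("word " * 750))   # an answer within the cap
```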

Decision 6: Composable with Other Modes

Council Mode can stack with other NOVA modes. “Council Mode + Strategy Mode” for a strategic decision. “Council Mode + Sales Mode” for offer positioning. This makes it a reasoning layer, not a standalone function.
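One way to picture the stacking is to treat Council Mode as a flag on top of whichever domain modes were requested. This is a sketch under my own assumptions: both function names are hypothetical, and NOVA handles this through its instructions, not a parser.

```python
from typing import List, Tuple


def parse_modes(request: str) -> List[str]:
    """Split 'Council Mode + Strategy Mode' into an ordered list of mode names."""
    return [part.strip() for part in request.split("+") if part.strip()]


def split_layers(
    modes: List[str], reasoning_layer: str = "Council Mode"
) -> Tuple[bool, List[str]]:
    """Separate the reasoning layer from the domain modes it stacks on."""
    uses_council = reasoning_layer in modes
    domain_modes = [mode for mode in modes if mode != reasoning_layer]
    return uses_council, domain_modes


modes = parse_modes("Council Mode + Strategy Mode")
print(split_layers(modes))
```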

Building It: The Exact Steps

Once the design was finalized, here’s exactly what I did to get Council Mode live inside NOVA. The whole process took about 30 minutes.

Step 1: Read the Source Material

I started by asking NOVA to read through Karpathy’s entire GitHub repo. Not just the README, but the actual source code: council.py, openrouter.py, main.py, config.py, the frontend components, everything.

This mattered because the README only describes what the tool does. The source code shows how it works. Reading council.py revealed the exact prompt structure Karpathy uses for the ranking stage (including the FINAL RANKING: parsing format) and how the Chairman prompt is constructed with both Stage 1 responses and Stage 2 rankings as context. Those implementation details shaped how I designed the cross-evaluation and synthesis rules for my version.
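The ranking format mentioned above can be parsed in a few lines. This is a minimal sketch, not a copy of Karpathy's code: the function name `parse_final_ranking` is hypothetical, and the numbered-list layout after the FINAL RANKING: marker is an assumption for illustration.

```python
import re
from typing import List


def parse_final_ranking(review: str) -> List[str]:
    """Extract ordered labels such as 'Response C' from the text after the marker."""
    marker = "FINAL RANKING:"
    if marker not in review:
        return []  # the reviewer did not follow the format
    tail = review.split(marker, 1)[1]
    return re.findall(r"Response [A-Z]", tail)


# Made-up review text, for illustration only.
review = """Response A is thorough but slow to get to the point.
Response B misses the pricing risk.

FINAL RANKING:
1. Response C
2. Response A
3. Response B
"""
print(parse_final_ranking(review))
```

Searching only the text after the marker matters, because the critique above it also mentions the labels.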

Step 2: Evaluate Feasibility Inside Notion AI

Before designing anything, I had NOVA assess what’s replicable and what isn’t within a single-model environment.

What I could replicate:

The 3-stage structure (perspectives → evaluation → synthesis)
The structured output format
The cross-evaluation and ranking logic
The Chairman synthesis with specific rules

What I couldn’t replicate:

Actual model diversity (four different AI architectures producing genuinely independent outputs)
True anonymized peer review (a model reviewing its own outputs knows they’re all from itself)
Parallel API calls (everything runs sequentially in one response)

This assessment set the constraints. Instead of pretending I could replicate Karpathy’s system exactly, I focused on adapting the core value (structured multi-perspective reasoning) to work within a single model.

Step 3: Draft the Mode Instructions

I wrote the full Council Mode specification in chat first, before touching any pages. This included:

The trigger phrases (“LLM Council”, “llm council”, or “Use Council”)
The four default perspectives with their definitions
The Stage 2 cross-evaluation criteria (what each perspective gets right, misses, where they agree, where they disagree)
The Stage 3 synthesis rules (lead with top-ranked insight, integrate minority viewpoints, flag unresolved tension, end with next action)
The output format template
Edge case rules (skip Stage 2 if all perspectives agree, swap lenses when they don’t fit)
The 800-word cap
Composability with other modes

I reviewed this draft in chat before committing anything to the workspace. This is a habit I recommend: treat the chat as a staging area. Get the spec right before you write it into your agent’s instructions.
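If you run an agent outside Notion and need to detect the trigger phrases yourself, a sketch could look like this. Matching case-insensitively is my assumption, and `is_council_request` is a hypothetical name. Inside Notion AI, the model reads the phrases from the instructions page.

```python
TRIGGER_PHRASES = ("LLM Council", "llm council", "Use Council")


def is_council_request(message: str) -> bool:
    """Return True if the message contains any trigger phrase, ignoring case."""
    lowered = message.casefold()
    return any(phrase.casefold() in lowered for phrase in TRIGGER_PHRASES)


print(is_council_request("Use Council: should I raise my prices?"))
print(is_council_request("What's on my calendar today?"))
```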

Step 4: Update the NOVA Instructions Page (Numbered List)

My NOVA agent has a master instructions page with a numbered list of all 16 modes. I added Council Mode as #17:

17. **Council Mode** — multi-perspective deliberation for complex questions, inspired by Karpathy's LLM Council.

This goes in the “NOVA Modes (Sub-Agents)” section, right after Dan Koe Style Mode (#16). The numbered list is what NOVA references for auto-select routing, so the mode needs to be registered here to be discoverable.

I will talk about Dan Koe Style Mode in the next post :)

Step 5: Create the AI Instructions Page

Each NOVA mode has its own dedicated instructions page. I created a new page titled “Council Mode — AI Instructions” with the full specification from Step 3.

The page includes:

The trigger phrases
When to use it (strategic decisions, trade-off evaluations, content positioning, ambiguous questions)
The complete 3-stage process with detailed instructions for each stage
The output format template (with the 🏛️ emoji header)
All rules and constraints

This page lives as a subpage under the main NOVA instructions page, alongside the other 16 mode instruction pages.

Step 6: Link the Instructions Page

I added the new page to the callout block that contains links to all mode instruction pages. This is how NOVA knows where to find the detailed instructions when Council Mode is activated. Without this link, the mode would appear in the numbered list but NOVA wouldn’t have access to the full behavioral spec.

You can duplicate my complete NOVA instruction template to get the exact Notion AI agent setup I use, ready to customize with your business details.


Step 7: Add a Row to the NOVA Modes Database

NOVA has an inline database called “NOVA Modes” that tracks all modes with their status, description, and mode number. I added a new row:

Mode Name: Council Mode
Mode #: 17
Status: Active
Description: Multi-perspective deliberation for complex questions, inspired by Karpathy’s LLM Council. 3 stages: perspectives, cross-evaluation, synthesis.

This database is the operational registry. The numbered list on the instructions page tells NOVA what modes exist. The database tracks their status (Active, Draft, Archived) and provides a quick reference.

Why Three Touch Points?

You might wonder why adding one mode requires updates in three places (instructions page list, dedicated instructions page, modes database). It’s intentional redundancy:

The numbered list is for routing. NOVA reads this to decide which mode to activate.
The instructions page is for behavior. It contains the full rules the mode follows.
The database is for management. It lets me track status, search across modes, and archive old ones without editing the main instructions page.

This structure means I can add, modify, or retire modes without breaking the agent’s core instructions. Each piece has a single job.
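Because a mode has to be registered in three places, it is easy to miss one. Here is a sketch of a consistency check, assuming you have exported the three sources into plain Python structures. The function name and data shapes are hypothetical, and nothing here calls the Notion API.

```python
from typing import Dict, List, Set


def missing_touch_points(
    mode: str,
    numbered_list: List[str],
    instruction_pages: Set[str],
    database_rows: List[Dict[str, object]],
) -> List[str]:
    """Return the touch points where a mode is not registered yet."""
    checks = {
        "numbered list (routing)": mode in numbered_list,
        "instructions page (behavior)": mode in instruction_pages,
        "modes database (management)": any(
            row.get("Mode Name") == mode for row in database_rows
        ),
    }
    return [place for place, present in checks.items() if not present]


# Example state: the list and the database are updated, the page is not.
gaps = missing_touch_points(
    "Council Mode",
    numbered_list=["Dan Koe Style Mode", "Council Mode"],
    instruction_pages={"Dan Koe Style Mode"},
    database_rows=[{"Mode Name": "Council Mode", "Mode #": 17, "Status": "Active"}],
)
print(gaps)
```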


Testing: Council Mode vs. AI Sparring Partner

Designing the mode is one thing. Proving it works better than what I already had is another.

I ran a head-to-head comparison using the same question in both modes.

The test question: “Is creating a blog to social media repurposing app the perfect solution for solopreneurs?”

I chose this because it’s a real strategic decision I’m evaluating for my business. It’s complex, has multiple valid answers, and requires weighing trade-offs. Exactly the kind of question Council Mode was designed for.

Test 1: AI Sparring Partner Mode (Standard)

Build the Content Repurposer as your v1 because it’s the fastest path to learning. But treat it as an entry product, not the destination. The real play is a Solopreneur Content Engine that starts with repurposing and expands into content strategy, scheduling, and performance tracking.

The 24-hour move: write the v1 brief with one avatar (Substack writers), three features (paste article, choose platforms, get copy-ready output), and one price ($15/month). Ship in 2 weeks. Learn in 4.

The Sparring Partner gave me a solid, linear analysis:

Listed reasons the app is a good fit (existing prototype, recurring pain, built-in distribution)
Listed where it falls short (crowded market, AI commoditizes the feature, solopreneurs are price-sensitive)
Offered a conditional recommendation based on my goal (fast revenue vs. defensible SaaS vs. deep audience service)
Ended with a next step

Strengths: Clear structure. Practical. Addressed real concerns. Cited my own workspace pages.

Weakness: One voice, one perspective. The “pros and cons” format is useful but predictable. It didn’t surface any insight I couldn’t have generated by thinking about it for 10 minutes.

Test 2: Council Mode (LLM Council)

A blog-to-social repurposing app is a good first product for you, not the perfect one. It becomes closer to perfect only if you solve the commodity problem.

The function (paste post → get social assets) is necessary but not sufficient. Every AI tool does this now. Your edge is building a voice-aware repurposing engine that learns and enforces a solopreneur’s unique style, hook preferences, and platform rules over time. That’s what no generic tool offers, and it’s what your “Nova Repurpose” prompt already does for you (with 20+ sections of brand rules, voice constraints, and format specifications).

The cannibalization risk is real. You already sell repurposing prompts in the AI Toolbox. The app needs to deliver something prompts alone can’t: saved voice profiles, one-click publishing-ready formatting, a content calendar view, and iterative learning from past performance. If the app is just “your prompt with a UI,” it won’t retain users past the first month.

The verdict: Build it, but build the voice engine version, not the generic repurposer. Position it as “the repurposing tool that sounds like you, not like AI.” Use your Nova Repurpose prompt as the foundation, add voice profile onboarding, and launch to your Substack audience first.

Concrete next step:

Write a one-page v1 brief that separates the two layers: (1) the repurposing engine (commodity, ship fast), and (2) the voice profile system (differentiator, build in v1.1). Price the combo at $29/month. Ship the engine in 3 weeks, add voice profiles within 60 days based on early user feedback.

Council Mode produced a fundamentally different output.

Stage 1 gave me four genuinely distinct perspectives:

The Strategist framed the app as a wedge product (viable but with a ceiling)
The Operator said “ship it in 3 weeks, don’t overthink”
The Skeptic flagged the cannibalization risk (I already sell repurposing prompts in my AI Toolbox, so what’s the app actually adding?)
The Creator proposed a “brand voice engine” as the real product, with repurposing as just the entry point

Stage 2 ranked them and identified where the real tension was. The Creator’s “voice engine” idea was ranked #1 because it was the only perspective that solved the commodity problem. The Skeptic’s cannibalization point was ranked #2 because it raised a risk nobody else addressed.

Stage 3 synthesized everything into a verdict: build the voice-aware version, not the generic repurposer. Position it as “the repurposing tool that sounds like you, not like AI.” The concrete next step was specific: write a v1 brief that separates the repurposing engine (ship fast) from the voice profile system (build in v1.1).

The Comparison

Council Mode’s output was better on every dimension that matters for hard decisions. The Sparring Partner is still great for quick challenges and pressure-testing a direction you’ve already chosen. But for genuinely ambiguous questions where you need multiple angles weighed against each other, Council Mode produces a stronger result.

The two insights Council Mode surfaced that Sparring Partner didn’t:

The cannibalization risk. I already sell repurposing prompts in my AI Toolbox. If the app is just “my prompt with a UI,” I’m competing with my own product. The Skeptic caught this. The Sparring Partner mentioned the crowded market in general terms but didn’t connect it to my own product line.
The voice engine as moat. The Creator’s idea to build a brand voice engine (not just a repurposer) was the single most valuable insight from either test. It reframed the entire product concept. The Sparring Partner never generated this angle.

What I Learned

Structured disagreement produces better outputs than balanced analysis. The Sparring Partner tries to be fair. Council Mode forces conflict. Conflict surfaces insights that balance hides.

The cross-evaluation stage is where the magic happens. Stage 1 generates options. Stage 3 synthesizes them. But Stage 2 (ranking and critiquing) is what elevates the output above a standard brainstorm. It’s the step most people would skip, and it’s the most valuable one.

Single-model deliberation works, with limits. A single model debating with itself won’t match four different frontier models with genuinely different architectures and training data. But the structured process compensates for a lot of that gap. The perspective lenses force the model into different reasoning modes it wouldn’t default to in a single pass.

Not every question needs a council. Simple, well-defined questions get over-engineered by this process. Council Mode is for the “it depends” questions, the ones where smart people would genuinely disagree.

How to Build Your Own

If you run a custom AI agent in Notion (or anywhere else), you can build your own version in under an hour. Here’s the framework:

Define 3-5 perspectives that naturally create tension for your domain. For business: strategist, operator, skeptic, creative. For content: editor, audience advocate, SEO specialist, brand voice guardian. For product: user, engineer, business, designer.
Require cross-evaluation where each perspective is critiqued and ranked. Don’t skip this stage.
Set synthesis rules that force the final answer to lead with the top-ranked insight, integrate minority viewpoints, flag unresolved tensions, and end with a specific next action.
Add a word limit. Without it, the output gets bloated. 800 words is a good cap for most questions.
Make perspectives swappable. Not every question fits the same four lenses. Let the AI substitute when a lens doesn’t apply.

The full process takes one AI instructions page and a trigger phrase. That’s it.
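As a starting point, the framework can be turned into an instructions draft with a small template. This is a sketch, not my actual Council Mode page: the wording of the template and the function name `build_council_instructions` are assumptions, and you should edit the text to fit your own agent.

```python
from typing import List


def build_council_instructions(perspectives: List[str], word_cap: int = 800) -> str:
    """Assemble a three-stage instructions draft from a list of lens names."""
    lens_lines = "\n".join(f"- {name}" for name in perspectives)
    return f"""Stage 1, Perspectives: answer the question once per lens.
{lens_lines}
Swap in a better-fitting lens when one does not apply.

Stage 2, Cross-evaluation: critique each perspective, then rank them
with a one-sentence justification per ranking.

Stage 3, Synthesis: lead with the top-ranked insight, integrate minority
viewpoints, flag unresolved tensions, and end with a specific next action.

Keep the full output under {word_cap} words."""


content_council = build_council_instructions(
    ["editor", "audience advocate", "SEO specialist", "brand voice guardian"]
)
print(content_council)
```

Paste the printed text into a new instructions page, add your trigger phrase, and adjust from there.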

What’s Next

Council Mode is now Mode #17 in my NOVA agent. I’m using it for all major strategic decisions, offer positioning, and content angle selection going forward.

I’m also exploring a variation where Stage 2 uses a weighted scoring system instead of simple ranking, and another where the perspectives are pulled dynamically based on the question topic rather than using a fixed set of four.
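Here is a rough sketch of what the weighted variant might look like. Everything in it is hypothetical: the criteria, the weights, and the 1-5 scores are made up for illustration, and in practice the model would assign the scores during Stage 2.

```python
from typing import Dict, List, Tuple


def weighted_rank(
    scores: Dict[str, Dict[str, int]], weights: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Rank perspectives by the weighted sum of their criterion scores."""
    totals = {
        name: sum(weights[criterion] * value for criterion, value in by_criterion.items())
        for name, by_criterion in scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical criteria, weights, and scores, for illustration only.
weights = {"impact": 5, "feasibility": 3, "risk_coverage": 2}
scores = {
    "The Strategist": {"impact": 4, "feasibility": 2, "risk_coverage": 3},
    "The Operator": {"impact": 3, "feasibility": 5, "risk_coverage": 2},
    "The Skeptic": {"impact": 4, "feasibility": 3, "risk_coverage": 5},
    "The Creator": {"impact": 5, "feasibility": 3, "risk_coverage": 3},
}
for name, total in weighted_rank(scores, weights):
    print(name, total)
```

Compared with a simple ranking, the totals also show how close two perspectives are, which a plain 1-to-4 order hides.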

But even the v1 is already producing noticeably better outputs for hard questions. Sometimes the best upgrade to your AI workflow isn’t a new tool. It’s a better thinking process.

Want the Full Council Mode Setup?

You don’t have to build this from scratch. Paid subscribers get the complete Council Mode instructions page, ready to drop into your own Notion AI agent.

No prompt engineering required. Just import, activate, and start using it.

👉 Upgrade to a paid subscription to download the full LLM Council setup and start making better decisions with your AI today.

If you are a paid subscriber, head to the bottom of the page to duplicate the LLM Council.