We Built an AI Strategy Team. Here's What Worked, What Didn't, and What Surprised Us.

A few days ago, I started building a tool that would change how we evaluate strategic decisions from here on out. Not replace our team — augment it. The idea was simple: What if you could put a CPO, a Board Advisor, a VP of Sales, an existing customer, a skeptical prospect who likes our competitors more, a CFO, and a handful of other specialists around a virtual table and have them debate a real strategic question using our actual business data?
That’s what we built. The results are better than we hoped. And the things we learned along the way surprised us as much as the quality of the results.
What It Actually Is
The SmartSights Strategy Advisor is a Python application that orchestrates up to 11 AI agents — each with a distinct role, perspective, and set of priorities — through structured debate rounds. You feed it a question like “How do we accelerate adoption of ABLE, our production intelligence product?” and it runs for about 45–50 minutes, pulling in live data from our Salesforce CRM, Gong, uploaded documents, and web search.
The agents debate across multiple rounds, challenge each other, verify claims against real data mid-debate, and produce a final recommendation. We also gave the agents the ability to read industry message boards where some of our users post — and not always positively. That turned out to be an amazing source of unfiltered signal that you simply can’t get from your own CRM.
The output comes in three forms, and each turned out to matter more than we expected. There’s a 3–4 page executive summary PDF — concise enough that a busy leader can read it in five minutes and get the full picture. There’s a detailed recommendation document that typically runs about 20 pages — comprehensive enough to load into Claude separately and start asking follow-up questions, drilling into specific points, or stress-testing the logic from a human perspective. And there’s a full transcript of every round of debate, useful for understanding how the team arrived at its conclusions.
The full transcripts can get pretty funny. In one run, an agent called another “intellectually lazy” and demanded more evidence. In another, one said “Corey needs to start asking these questions before an acquisition — not 16 months later.” The agents are remarkably blunt with each other. We have not needed an HR agent yet.
We run it exclusively on Claude Opus 4.6. We tested the faster, cheaper models and they weren’t worth it — the quality difference in strategic reasoning is significant. Each run costs about $25–35 and takes roughly 45–50 minutes. That’s a meaningful investment per question, but compare it to what a strategy consulting engagement costs and it’s a rounding error.
Why We Use the Paid API — And Why It Matters
This isn’t just about quality. We had to make absolutely certain that none of our sensitive internal data — Salesforce pipeline, financials, internal plans — ends up in the public domain. The paid API gives us that guarantee. Our data is not used for training, and it doesn’t leave our control. For any company considering this approach with real business data, this is non-negotiable.
The Debate Structure — And Why It Matters
Early versions were essentially “ask a bunch of AI agents the same question and staple the answers together.” The output was polished but shallow — agents agreed with each other too quickly and produced the kind of vaguely reasonable analysis that sounds good in a meeting but doesn’t really help you decide anything.
The breakthrough was treating it like actual organizational dynamics. We got help from Dr. Jim Kestenbaum, a psychologist with deep expertise in corporate leadership and group decision-making. His insights about how groups actually make decisions — versus how they think they make decisions — directly shaped three structural changes that transformed the output quality. We then tested different variations until we found structural changes that measurably improved the analysis. One key observation: many best practices from human group decision-making apply here, but certain things make more sense for AI agents than for humans.
Here are a few things we learned.
Independent first rounds. In Round 1 of each phase, agents form their positions without seeing anyone else’s responses. This was Jim’s recommendation to combat anchoring bias, and it turned out to be the single most impactful change we made. Without it, the first agent to speak frames the entire discussion and everyone else riffs on that frame. With it, you get 6–8 genuinely independent perspectives, which means Round 2 produces real disagreement that must be resolved rather than surface-level head-nodding.
Shuffled speaking order. Agents speak in a randomized order every round. No one gets to consistently frame the conversation.
Targeted rebuttals. A Board Advisor agent speaks last every round, challenges the weakest arguments by name, and assigns specific agents to respond to specific challenges — “VP of Sales, defend your claim that the win rate justifies the investment.” Those rebuttals are deliberately constrained to 2–3 sentences. Short rebuttals are sharper, more direct, and more likely to actually shift the conversation than long diplomatic responses that bury the disagreement in caveats.
Role specificity creates better disagreement than general intelligence. We tested generic smart agents (“You are a brilliant strategist”) versus specifically-roled agents (“You are VP of Sales at SmartSights selling to manufacturing, utilities, water/wastewater… You think in win rates, deal size, and sales cycle”). The generic agents were more eloquent. The specific agents produced better recommendations. Role specificity constrains what each agent notices and prioritizes, creating natural disagreement. The CFO genuinely sees different things in the same data than the VP of Engineering — not because they’re told to disagree, but because their cognitive frame highlights different features. Emergent disagreement rather than performed disagreement.
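The round mechanics above can be sketched in a few lines. This is an illustrative reduction, not the production code — the `Agent` class, its `respond` method, and the role names are hypothetical stand-ins for what would be role-prompted Claude API calls:

```python
import random

class Agent:
    """Hypothetical stub; in the real system each agent wraps a Claude API
    call with a role-specific system prompt."""
    def __init__(self, role):
        self.role = role

    def respond(self, question, visible_transcript):
        # Placeholder for a model call; returns this agent's position.
        return f"[{self.role}] position on: {question}"

def run_round(agents, board_advisor, question, transcript, independent=False):
    """One debate round: shuffled speaking order, optional blind first round,
    Board Advisor always speaks last with full visibility."""
    order = agents[:]
    random.shuffle(order)  # no one consistently frames the conversation
    round_responses = []
    for agent in order:
        # Round 1 of each phase is "independent": agents see nothing
        # from each other, which combats anchoring bias.
        visible = [] if independent else transcript + round_responses
        round_responses.append(agent.respond(question, visible))
    # The Board Advisor sees everything and challenges weak arguments.
    round_responses.append(
        board_advisor.respond(question, transcript + round_responses))
    transcript.extend(round_responses)
    return transcript

agents = [Agent(r) for r in ["CPO", "VP of Sales", "CFO"]]
advisor = Agent("Board Advisor")
transcript = run_round(agents, advisor,
                       "How do we accelerate ABLE adoption?", [],
                       independent=True)
```

The point of the sketch is that the anti-anchoring and anti-framing mechanisms live in the loop structure, not in any agent's prompt.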
Live Data Changes Everything
The agents pull from our actual Salesforce pipeline, Gong, uploaded documents, and web search. But the real game-changer was giving agents the ability to request data verification mid-debate.
Any agent can say “I want to check that claim against Salesforce” or “search the web for what our competitors’ product reviews actually say.” The system translates the request into a SOQL (Salesforce) query or web search, executes it, and injects the results back into the debate for everyone to see.
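A minimal sketch of that routing step, assuming a keyword-based dispatcher (the real system uses the model itself to translate requests; `run_soql` and `web_search` here are hypothetical stand-ins for the Salesforce and search integrations):

```python
def run_soql(query):
    # Stand-in for executing a SOQL query against Salesforce.
    return {"source": "salesforce", "query": query, "rows": []}

def web_search(terms):
    # Stand-in for a live web search.
    return {"source": "web", "terms": terms, "results": []}

def verify(request):
    """Route an agent's mid-debate verification request to the right
    data source; the result gets injected back into the shared
    transcript so every agent sees it."""
    text = request.lower()
    if "salesforce" in text or "pipeline" in text or "win rate" in text:
        # In production the query itself would be model-generated.
        return run_soql(
            "SELECT Id, StageName FROM Opportunity WHERE IsClosed = false")
    return web_search(request)

result = verify("I want to check that claim against Salesforce")
```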
In one run, the agents pulled information from an interview one of our product managers had participated in and compared it to conversations from the sales team. The agents decided that we immediately needed to get that product manager in a training situation with the sales team because his framing of the product’s value was much stronger than how the sales team had been positioning it. After looking into it, we agreed — and we’re doing just that.
We also built a deduplication layer because multiple agents often request the same data in different words. A two-layer cache — exact match plus a semantic similarity check — catches duplicates so the same Salesforce query doesn’t run three times. Cached results still get shared with all agents; they just don’t waste API calls.
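The two-layer cache can be sketched as follows. One assumption to flag: the production system's second layer is a semantic (embedding-based) similarity check, while this sketch substitutes `difflib.SequenceMatcher` so it stays self-contained:

```python
from difflib import SequenceMatcher

class QueryCache:
    """Two-layer cache: (1) exact match on the normalized request,
    (2) a similarity check against previously seen requests so the
    same Salesforce query doesn't run three times under different
    wordings."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = {}  # normalized request -> cached result

    def _norm(self, request):
        return " ".join(request.lower().split())

    def get(self, request):
        key = self._norm(request)
        if key in self.entries:                    # layer 1: exact match
            return self.entries[key]
        for seen, result in self.entries.items():  # layer 2: similarity
            if SequenceMatcher(None, key, seen).ratio() >= self.threshold:
                return result
        return None

    def put(self, request, result):
        self.entries[self._norm(request)] = result

cache = QueryCache()
cache.put("open pipeline for ABLE", {"rows": 42})
hit = cache.get("Open  pipeline for ABLE")    # exact after normalization
near = cache.get("open pipeline for ABLE?")   # caught by the similarity layer
```

Cached results are still shared with every agent; the cache only avoids re-executing the underlying query.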
What We Learned That We Didn’t Expect
Structure beats prompting. We spent a lot of time trying to make agents “smarter” through better system prompts. Marginal returns. The Board Advisor isn’t effective because of its prompt — it’s effective because it always speaks last, assigns rebuttals, can trigger data checks, and delivers a final approve/reject assessment. Giving an agent structural mechanisms to act on its skepticism matters more than telling it to be skeptical.
AI agents converge faster than humans — and for different reasons. Human teams rush to consensus because of social pressure, status dynamics, and conflict avoidance. AI agents have none of those motivations, yet they converge just as fast. The fix is the same (force structural divergence), but the underlying cause seems to be different — more about how language models process sequential text than about social dynamics.
The tool found communication gaps we didn’t know we had. This was completely unexpected. Cody Bann, our VP of Software Engineering, reviewed one of the early runs and noticed the agents were recommending features we’d already built. We realized the system couldn’t know about those features because we’d never documented them anywhere externally. And if the AI couldn’t find that information — pulling from our website, public docs, and the entire web — then our customers and prospects couldn’t either. A tool we built for strategy analysis accidentally became an audit of our external communication. We’re now addressing those gaps.
The agents occasionally go off the rails. Just like a human team sometimes takes a meeting in a completely unproductive direction, the AI team occasionally fixates on a tangent or builds an elaborate argument on a faulty premise. It doesn’t happen often, but it happens. The Board Advisor catches most of these, and the data checks catch others. When it does happen, you can usually spot it in the transcript. We treat those runs the same way you’d treat a bad meeting — note what went wrong, adjust your question framing, and run it again.
The 20-page recommendation is the real product. The executive summary is useful for quick decisions. But the detailed recommendation — loaded into a separate Claude conversation — becomes an incredibly rich starting point for human analysis. You can ask “what if the pricing assumption is wrong?”, “walk me through the competitive risks in more detail”, or “how does this change if our engineering capacity is half what you assumed?” and get substantive, grounded answers because the full context is right there. This is key: we don’t see any situation where we would just offload decision-making to the agents, but they give us an amazing depth of analysis as a starting point for real discussion.
The Context Setup That Makes It Work
Two folders feed the system beyond the live integrations. A context/ folder holds standing information that applies to every run — our org chart, headcount, the last twelve months of financial snapshots. This gives agents grounding they can’t get from Salesforce, Gong, or the web. A data/ folder holds question-specific materials — a customer case study, a competitive analysis, a product spec. The combination of live CRM data, standing context, question-specific documents, and real-time web search gives the agents a remarkably complete picture.
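The folder setup described above is simple enough to sketch. This assumes plain-text files and uses the folder names from the description; real document handling (PDFs, spreadsheets) would need more than `read_text`:

```python
from pathlib import Path

def load_context(context_dir="context", data_dir="data"):
    """Assemble the standing context (applies to every run) and the
    question-specific materials (this run only) into one prompt
    preamble for the agents."""
    sections = []
    for label, folder in [("Standing context", context_dir),
                          ("Question-specific materials", data_dir)]:
        for path in sorted(Path(folder).glob("*.txt")):
            sections.append(f"## {label}: {path.name}\n{path.read_text()}")
    return "\n\n".join(sections)
```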
Where We’re Going
The Strategy Advisor keeps getting better with each run — not because the code changes dramatically, but because we get better at asking questions, loading the right context, and knowing when to trust the output versus when to push back.
The technology is accessible — it’s a single Python script, the Claude API, and our existing business data. I started building this four days ago and now I’m using it several times a day. I don’t even know Python; I just told Claude I wanted this thing, and it suggested Python and started pumping out code. The hard part isn’t the code. It’s designing the debate structure so the agents actually fight instead of politely agreeing with each other.
That’s where the real insight lives — in the disagreements.
Corey Rhoden is CEO of SmartSights. Claude is at Anthropic. The Strategy Advisor was built collaboratively — Corey provides the domain expertise, strategic direction, and business data; Claude provides the architecture, code, and multi-agent orchestration. Neither of us could have built it alone.