Every enterprise software vendor now claims to sell “AI agents.” The term has become a checkbox on pitch decks, right next to “cloud-native” and “zero trust.” But beneath the marketing inflation, real agent deployments are running in production at scale — and the data on what works and what does not is finally concrete enough to draw useful conclusions.
The pattern across successful deployments is consistent: narrow scope, measurable outcomes, human oversight, and relentless focus on the boring problems that actually cost companies money. The pattern across failures is equally consistent: vague goals, no success metrics, and the assumption that an LLM with access to everything will figure it out.
A 2026 survey of 1,200 enterprises by Gartner found that 73% have at least one AI agent in production, up from 31% in early 2025. But the distribution of use cases is heavily concentrated in a handful of proven categories.
These numbers tell a clear story: agents have found product-market fit in tasks that are repetitive, well-defined, and expensive when done by humans. Let’s examine what the main categories look like in practice.
Customer support is the most proven enterprise agent deployment, and it is worth examining in detail because it illustrates both the potential and the limits.
What “40-60% ticket resolution” actually means. This statistic appears in every vendor pitch, and it is real — but it is more nuanced than it sounds. Klarna reported that its AI agent handled 2.3 million customer conversations in its first month of operation (early 2024), performing the equivalent work of 700 full-time human agents. The agent resolved 67% of inbound conversations without human intervention. But “resolved” has a specific definition: the customer’s issue was addressed and the customer did not re-open the ticket or contact support again within 7 days.
What falls within that 67%: password resets, order status inquiries, refund requests for clear-cut cases (item not delivered, wrong item shipped), subscription cancellations, billing explanations, and FAQ-type questions where the answer exists in the knowledge base.
What does not: disputes requiring judgment (partial refunds for “item not as described”), technical troubleshooting beyond standard playbooks, escalated complaints from angry customers, and anything requiring access to systems the agent is not integrated with.
The economics are stark. Klarna estimated $40 million in annualized savings. The average cost per AI-resolved conversation is $0.50-2.00, compared to $5-15 for a human agent. But the human agents who remain handle the harder cases, and their average handle time has increased because the easy tickets are gone.
The architecture that works: The successful support agent pattern is not “give the LLM access to everything and hope for the best.” It is a structured pipeline:
The agent does not freelance. It operates within defined guardrails for each intent category, and the escalation path is the safety valve.
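To make the guardrail idea concrete, here is a minimal sketch of per-intent limits with escalation as the fallback. The intent names, action names, and the $100 refund cap are hypothetical, chosen only for illustration; none of them come from a specific vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    """What the agent may do for one intent category (illustrative fields)."""
    allowed_actions: frozenset
    max_refund_usd: float = 0.0


# Hypothetical intent categories and limits, for illustration only.
GUARDRAILS = {
    "order_status": Guardrail(frozenset({"lookup_order"})),
    "password_reset": Guardrail(frozenset({"send_reset_link"})),
    "refund_request": Guardrail(
        frozenset({"lookup_order", "issue_refund"}), max_refund_usd=100.0
    ),
}


def route(intent: str, action: str, refund_usd: float = 0.0) -> str:
    """Return "execute" only when the action sits inside the intent's guardrail.

    Anything else (unknown intent, disallowed action, refund over the cap)
    goes down the escalation path to a human.
    """
    rail = GUARDRAILS.get(intent)
    if rail is None:
        return "escalate"
    if action not in rail.allowed_actions:
        return "escalate"
    if action == "issue_refund" and refund_usd > rail.max_refund_usd:
        return "escalate"
    return "execute"
```

The design choice worth noting is that escalation is the default: the agent has to match an explicit allowance to act, rather than match an explicit prohibition to be stopped.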
Every large company has the same problem: critical information is scattered across Confluence wikis, Google Docs, Slack threads, Jira tickets, Notion databases, and email chains. An employee looking for “what is our policy on returning customers who cancelled?” might need to search four systems and read twelve documents to find the answer.
Knowledge search agents solve this by searching across all integrated sources, synthesizing the relevant information, and providing an answer with citations. The best implementations:
Dropbox reported that its internal AI search agent reduced the average time to find internal information from 15 minutes to under 90 seconds. Atlassian’s Rovo agent, which searches across Confluence, Jira, and connected third-party tools, is used by over 50,000 enterprise teams.
The ROI math: if 1,000 employees each save 20 minutes per day on information retrieval, that is 333 hours per day, or roughly 42 full-time employees’ worth of productive time recovered. At a fully loaded cost of $150,000 per employee per year, that is $6.3 million in recovered productivity annually.
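The arithmetic above can be reproduced in a few lines, which makes it easy to plug in your own headcount and cost figures. The 8-hour workday is an assumption implied by the figures rather than stated in them, and the function name is illustrative.

```python
def recovered_productivity(employees, minutes_saved_per_day,
                           loaded_cost_per_year, hours_per_workday=8.0):
    """Return (hours saved per day, FTE equivalent, annual dollar value).

    Assumes an 8-hour workday by default; adjust for your organization.
    """
    hours_per_day = employees * minutes_saved_per_day / 60
    fte_equivalent = hours_per_day / hours_per_workday
    annual_value = fte_equivalent * loaded_cost_per_year
    return hours_per_day, fte_equivalent, annual_value


hours, fte, dollars = recovered_productivity(1000, 20, 150_000)
print(f"{hours:.0f} hours/day, {fte:.1f} FTEs, ${dollars / 1e6:.2f}M per year")
```

Without intermediate rounding the result is $6.25 million; the $6.3 million figure comes from rounding to 42 FTEs before multiplying.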
Legal, finance, and compliance teams have embraced agent-based document processing because the alternative — reading thousands of pages manually — is both expensive and error-prone.
The use cases are concrete:
Harvey, the legal AI startup valued at $3 billion, reports that its agent reduces first-pass contract review time by 70-80%. But every major law firm using it emphasizes the same point: the agent produces the first draft of the review. A human lawyer reviews the agent’s work before it goes to a client. The error rate for extraction tasks is approximately 2-5%, which is lower than the human error rate of 5-10% on the same tasks — but the consequences of errors in legal documents are severe enough that human review remains mandatory.
The build-versus-buy question is the first strategic decision every enterprise faces. The answer depends on three variables: how unique the use case is, how much data sensitivity is involved, and whether you have the engineering team to maintain a custom system.
| Use case | Build Custom | Buy Platform |
|---|---|---|
| Customer support | Only if your product requires deep proprietary integrations (e.g., telecom billing systems) | Default choice -- Intercom, Zendesk, Salesforce all offer production-ready agents |
| Internal search | Build if your data sources are unusual or your security requirements prohibit third-party access | Buy for standard setups (Confluence + Slack + Google Workspace) -- Glean, Atlassian Rovo |
| Document processing | Build for industry-specific document types (medical records, insurance claims) | Buy for common document types -- Harvey (legal), Eigen Technologies (finance) |
| Code review | Build if you need deep integration with proprietary build systems and custom linting rules | Buy for standard setups -- GitHub Copilot code review, CodeRabbit, Sourcery |
| Data analysis | Build for proprietary data schemas and domain-specific analysis patterns | Buy for standard BI use cases -- ThoughtSpot, Mode with AI features |
The hidden cost of building: a custom agent requires ongoing maintenance that most teams underestimate. Models change (for example, Claude 3 to Claude 4), prompt formats shift, APIs evolve, tool definitions need updating, and prompt engineering is an iterative process that never really ends. Budget 0.5-1.0 full-time engineers for ongoing maintenance of each custom agent.
The hidden cost of buying: vendor lock-in and limited customization. If Zendesk’s agent resolves 45% of your tickets but you need 60%, you will hit a ceiling where the platform’s configuration options are not sufficient and you cannot modify the underlying prompts or tool definitions.
The single most important deployment pattern in enterprise AI agents is human-in-the-loop (HITL). Every successful deployment we have examined uses some form of it. The implementations that skip human oversight are the ones that generate the horror stories.
The mechanics: the agent completes its task and produces a confidence score (either explicitly calibrated or derived from model log-probabilities). Actions that score above a preset threshold execute automatically. Actions that score below it enter a human review queue with the agent’s proposed action and reasoning attached.
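A minimal sketch of that gate follows. The class and field names are hypothetical, and the in-memory lists stand in for whatever execution layer and review tool a real deployment would use. Treating a score exactly equal to the threshold as auto-approved is a choice made here, not something the pattern dictates.

```python
from dataclasses import dataclass, field


@dataclass
class ProposedAction:
    ticket_id: str
    action: str
    reasoning: str
    confidence: float  # 0.0 to 1.0, however the deployment derives it


@dataclass
class ConfidenceGate:
    threshold: float
    executed: list = field(default_factory=list)
    review_queue: list = field(default_factory=list)

    def submit(self, proposal: ProposedAction) -> str:
        """Auto-execute confident proposals; queue the rest for a human."""
        if proposal.confidence >= self.threshold:
            self.executed.append(proposal)
            return "auto_executed"
        # The full proposal, including reasoning, travels with the item
        # so the reviewer can accept, edit, or reject it in context.
        self.review_queue.append(proposal)
        return "queued_for_review"


gate = ConfidenceGate(threshold=0.9)
gate.submit(ProposedAction("T-1", "send_reset_link", "User asked for a reset", 0.97))
gate.submit(ProposedAction("T-2", "issue_refund", "Item may be damaged", 0.62))
```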
The ratchet effect: this is where the real value compounds. In month one, you set the threshold high — perhaps only 20% of actions auto-approve. Human reviewers process the rest, and their decisions become training data (or at least evaluation data) for the agent. By month six, with better prompts and more examples, 50-60% auto-approve. By month twelve, 70-80%. The agent gets steadily more autonomous, but only as fast as the data supports.
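One way to let the data drive the ratchet is to recompute the threshold from reviewed decisions: lower it only as far as human approvals support. The sketch below is one possible approach, not a description of any vendor's method; the 98% target approval rate and the 50-sample minimum are illustrative assumptions.

```python
def pick_threshold(reviewed, target_precision=0.98, min_samples=50):
    """Choose the lowest confidence cutoff the review data supports.

    reviewed: list of (confidence, human_approved) pairs from the review queue.
    Returns the lowest cutoff such that the actions at or above it were
    approved at a rate of at least target_precision, with at least
    min_samples actions behind the estimate. Returns None if no cutoff
    qualifies, meaning everything should stay in human review.
    """
    ranked = sorted(reviewed, key=lambda pair: pair[0], reverse=True)
    best = None
    approved = 0
    for count, (confidence, ok) in enumerate(ranked, start=1):
        approved += bool(ok)
        # Evaluate only once a whole group of equal scores has been counted.
        if count < len(ranked) and ranked[count][0] == confidence:
            continue
        if count >= min_samples and approved / count >= target_precision:
            best = confidence
    return best
```

Rerunning this on each month's reviewed actions gives the ratchet behaviour described above: the cutoff drops, and the auto-approve share rises, only when the approval rate at the lower cutoff holds up.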
Intercom’s Fin agent is the canonical example. When first deployed for a new customer, it operates in “suggest mode” — proposing answers that human agents can accept, edit, or reject. The accept/reject data tunes the agent’s knowledge base and confidence calibration. Customers typically move from suggest mode to auto-resolve mode over 4-8 weeks, by which point the agent has seen enough real conversations to calibrate well.
Enterprise AI agent deployments produce measurable ROI, but the numbers vary enormously based on implementation quality. Here are the ranges we see across multiple published case studies and analyst reports:
Customer support: $2-8 million in annual savings for an operation handling 1,000 tickets per day, depending on ticket complexity and resolution rate. Payback period: 3-6 months.
Internal search: $1-5 million in recovered productivity per 1,000 employees, depending on how information-intensive the work is. Payback period: 6-12 months (harder to measure).
Document processing: 60-80% reduction in first-pass review time. For a legal team processing 10,000 contracts per year, this translates to $500K-2M in labor savings. Payback period: 3-9 months.
Code review: 30-50% reduction in review cycle time, with measurable improvement in defect detection rates. Harder to quantify in dollars, but engineering teams report significant velocity improvements.
The cost side: a typical enterprise agent deployment costs $200K-500K in the first year (engineering time, API costs, integration work) and $100K-200K annually thereafter. The API costs alone, for Claude 4 or GPT-4o at enterprise scale, typically come to $5,000-30,000 per month depending on volume.
Honesty about failures is as valuable as documenting successes.
Fully autonomous decision-making without human oversight fails consistently in enterprises. An agent that can approve purchase orders, modify customer accounts, or change production configurations without a human check will eventually make an expensive mistake. The error rate for current frontier models on complex reasoning tasks is 5-15%, which is fine for draft documents but unacceptable for irreversible actions.
“Boil the ocean” deployments that try to build one agent to handle everything — support, sales, internal ops — fail because the system prompt becomes incoherent, the tool set becomes too large for reliable selection, and no single team owns the quality. The successful pattern is one agent, one use case, one owner.
Deployments without feedback loops stagnate. If nobody is reviewing what the agent does and feeding corrections back into the system, performance plateaus or slowly degrades as the world changes and the agent’s knowledge becomes stale.
For enterprises evaluating their first agent deployment, here is the pattern that works:
Days 1-30: Pick one use case. Choose the highest-volume, most repetitive task you can find. Customer support triage is the default recommendation because it has the most mature tooling and the clearest metrics. Define success quantitatively: “Resolve 30% of tier-1 tickets without human intervention within 90 days.”
Days 31-60: Build or buy and deploy in shadow mode. The agent runs on every ticket but does not take action. Instead, it proposes what it would do, and humans evaluate whether the proposals are correct. This gives you a baseline accuracy number before you let the agent act.
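The baseline number from shadow mode can be computed from a simple log of human verdicts. This is a minimal sketch assuming each record is an `(intent, judged_correct)` pair; the record shape and intent names are hypothetical.

```python
from collections import defaultdict


def shadow_report(records):
    """Summarize shadow-mode results.

    records: iterable of (intent, judged_correct) pairs, where judged_correct
    is the human reviewer's verdict on the agent's proposal.
    Returns (overall accuracy, accuracy per intent).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for intent, judged_correct in records:
        totals[intent] += 1
        correct[intent] += bool(judged_correct)
    per_intent = {intent: correct[intent] / totals[intent] for intent in totals}
    n = sum(totals.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_intent
```

Breaking accuracy out by intent matters because the overall number can hide a category that is not ready to go live; a conservative rollout would enable auto-resolution only for the intents that score well in shadow mode.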
Days 61-90: Go live with human-in-the-loop. Start with a conservative confidence threshold. Auto-resolve only the tickets where the agent is near-certain. Route everything else to human review with the agent’s proposed action attached. Measure resolution rate, customer satisfaction, and escalation frequency weekly.
If you reach 25-30% autonomous resolution by day 90 with customer satisfaction at or above previous levels, you have a successful deployment. Scale from there.
Enterprise AI agents are real and delivering measurable value, but the success stories are more boring than the marketing suggests. They handle support tickets. They search documents. They review contracts. They do not run companies, replace management, or eliminate departments.
The companies seeing the best results share three traits: they chose a narrow, well-defined use case; they deployed with human oversight from day one; and they treated the agent as a system to be continuously improved, not a product to be installed and forgotten.
That is not as exciting as the conference keynote version. But it is what actually works.