- Advisory
- Solutions
- Customers
- About Zimply
- Resources
- Contact
EN
- Book a Demo
12 May, 2026

There is something fitting about a talk on agentic operations that was — partly — built by an agent. At this year's Data Innovation Summit in Stockholm, Zimply's CTO Jesper Fredriksson toAgentic Operations: The Missing Discipline in Data Scienceok the stage at the Databases & Data Quality Stage [M9] with a session titled "Agentic Operations: The Missing Discipline in Data Science?" The session had been on the schedule for less than two weeks. Jesper had used his own AI agent to gather data and assemble the slides. He admitted — with a grin — that he accidentally uploaded the agent's version rather than his final edited one.
It was the right accident for the right talk.
Jesper opened with the METR benchmark — a graph that tracks how long a task an AI model can complete with 50% probability, plotted against the model's release date. When GPT-4 launched, that number was a matter of minutes. As of early 2026, Claude Opus 4.6 clocks in at approximately 12 hours of sustained human-equivalent work.
The implication: we are likely to see AI agents capable of tackling tasks that would take a human 100 hours, within this calendar year. That is not a distant forecast — it is the trajectory we are already on.
But AI capability is only one side of the story.
The second data point Jesper presented came from CrowdStrike's RSA 2026 analysis of Falcon endpoint telemetry. Their data shows 1,800 distinct AI applications running on enterprise endpoints, with 160 million unique application instances across their customer base.
Source: CrowdStrike, RSA 2026 — Falcon endpoint telemetry
Agents are not coming. Instead, they are already running inside most organisations. Many of them are unsanctioned — shadow AI deployed by individuals and teams without organisational oversight.
And yet, according to a March 2026 report from DigitalApplied:
Source: DigitalApplied, AI Agent Scaling Gap (March 2026)
The gap between pilot and production is enormous. The organisations that have successfully bridged it share one structural practice: they created a dedicated AI operations function — separate from both IT and the business unit.
Zimply has been automating business processes since 2018. Today, the company operates roughly 100 AI assistants in production across clients in accounting, order management, and other domains. Deploying agents is one thing. Keeping them working reliably — week after week, as the underlying models, tools, and data change — is something else entirely.
That operational reality is what the rest of the talk was built around.
Jesper referenced a statement from Boris Cherny, Head of Claude Code at Anthropic, in February 2026: "Coding is largely solved."
Whether or not one agrees completely with that framing, the point Jesper was making is accurate and important. The hard parts of agentic AI are not the code generation. They are:
These are operational and organisational challenges, not engineering ones in the traditional sense.
This is the conceptual heart of the talk. The distinction Jesper drew is worth dwelling on.
In classical MLOps, you are managing a model. Inputs and outputs are relatively well-defined. Drift means a measurable distribution shift on a metric. Failure looks like accuracy dropping below an SLA threshold. You train, validate, deploy, and monitor.
With agents, you are managing a system — one that includes the model, its memory structures, the tools it can call, prompt scaffolding, other agents it may coordinate with, and the environments it acts in. Failure modes are different in kind:
LLM upgrades: A new minor model version ships and your carefully constructed prompts subtly break. There is no regression test to catch it.
Prompt drift: Edits accumulate over months. Nobody remembers why a clause was added. Nobody wants to remove it. The prompt becomes archaeology.
Changing components: A tool, retriever, or downstream API changes and the agent's reasoning path shifts silently with it.
Silent quality decay: Outputs get worse before anyone notices. There is no red bar on a dashboard. There is just a customer complaint.
Jesper cited IBM Research's 2025 paper "Agentic AI Needs a Systems Theory" for the framing of emergent failures — defined as failures that arise not from any single component, but from the complex interactions between agents, tools, environments, and humans.
Source: IBM Research, Agentic AI Needs a Systems Theory (2025)
This is the frontier of what makes agentic operations genuinely difficult. Unit tests on the model do not catch emergent failures. You need observability over the entire trajectory — every tool call, every reasoning step, every branch the agent took to arrive at its action.
One of the most quotable moments of the talk was Jesper's observation about trust in operational contexts. It was not phrased as a data point. It was phrased as hard-won experience.
Trust is earned slowly: months of clean trajectories, predictable cost, predictable latency, no surprises in the audit log.
Trust is lost instantly: one agent that fires off the wrong refund, one that leaks data into the wrong tenant, one unexplained loop that runs for six hours. When that happens, the system goes back to manual approval that quarter.
This is the operational reality at Zimply, and it shapes everything about how they approach deployment.
Jesper outlined three near-term trends from the operator's seat:
The buyer side wakes up. Procurement will start asking: show me the evals, show me the work. Governance readiness will become a commercial requirement.
Agents consuming agents. As multi-agent architectures become more common, the blast radius of any single failure expands. Failures propagate through chains.
LLM deprecations as everyday work. The question will increasingly be: which model from vendor X is actually stable right now? Model lifecycle management becomes an operational discipline.
Jesper cited the MIT Sloan Management Review — specifically a piece by Davenport and Bean, Five Trends in AI and Data Science for 2026 — for the prediction that within five years, AI agents will handle most transactions in large-scale business processes.
Source: Davenport & Bean, Five Trends in AI and Data Science for 2026, MIT Sloan Management Review
That future requires a new operating model. The Berkeley California Management Review introduced a framework for what they call the Agentic Operating Model, built around four interdependent layers:
Source: Berkeley CMR, Governing the Agentic Enterprise
The concept of "human in the loop" is giving way to "human on the loop" — where humans design the governance structures and protocols, then step back from moment-to-moment validation.
A Grant Thornton 2026 AI Impact Survey of approximately 1,000 senior leaders found that 78% of C-level executives lack strong confidence they could pass an independent AI governance audit within 90 days.
Source: Grant Thornton, 2026 AI Impact Survey (n ≈ 1,000 senior leaders)
That number is striking. The majority of organisations deploying agents cannot fully account for what those agents are doing.
Jesper's central argument is that the discipline of agentic operations — "AgentOps" — is the natural successor to MLOps, and that data scientists are the right people to own it.
The instincts are the same: measurement, reproducibility, regression testing. What changes is the surface area.
| MLOps | AgentOps | |
| Unit of work | One model, scored | A system, behaving |
| Workflow | Train · validate · deploy · monitor | Trajectory · trace · rollback · re-eval |
| I/O definition | Well-defined inputs and outputs | Spans tools, memory, other agents |
| Drift definition | Distribution shift on a metric | Subtle change in how the agent reasons |
| Failure modes | Accuracy below SLA | Looping, deception, leakage, silent decay |
Engineering will not take this on alone. Compliance cannot. Somebody has to. The argument is that data scientists — with their native fluency in evaluation, monitoring, and systematic thinking about model behaviour — are the right seat at the table.
Jesper closed with a practical framework — five operational disciplines that turn a pilot into a production system you can stand behind.
Agentic AI is not a future problem. It is a present operational challenge that most organisations are not yet structured to address. The pilot-to-production gap is real, and it will not close on its own.
The organisations that will succeed are the ones that treat agentic operations as a discipline — not an afterthought. That means building the harness around the model: the observability, the versioning, the governance, the evals, and the escalation pathways.
At Zimply, this is the work we do every week. We deploy agents and we operate them — in production, for real clients, where the cost of failure is real.
If you are thinking about compliance, governance, and agent behaviour at scale, this is the conversation worth having.
This article is based on a talk by Jesper Fredriksson, CTO at Zimply, presented at Data Innovation Summit 2026, Stockholm, May 7. He previously served as AI Engineer Lead at Volvo Cars, with five years in automotive data science and a decade in medical imaging research. He has been a returning Data Innovation Summit speaker since 2016.
Increase productivity
Less errors
Reduce costs
Save time
Book a meeting, digital or physical and we will tell you more