
The Principal-Agent Problem in AI Systems

Economics figured this out a century ago. We keep forgetting the lesson every time we hand a new agent new powers.

In the 1970s, economists George Akerlof, Michael Spence, and Joseph Stiglitz began formalizing something that had been understood in practice for much longer: when you delegate work to someone whose interests might not perfectly align with yours, things go wrong in predictable, preventable ways. This is the principal-agent problem, and their work on information asymmetry earned the three of them a shared Nobel Prize in 2001.

The setup is simple. A principal (you, your company, your users) hires an agent (an employee, a contractor, a broker) to act on their behalf. The agent has information the principal doesn't — about effort, context, tradeoffs. And the agent's interests, no matter how aligned you try to make them, are never perfectly identical to the principal's.

The history of management, law, finance, and governance is largely the history of mechanisms invented to close this gap. Employment contracts. Audits. Escrow. Fiduciary duty. Separation of powers. They're all answers to the same question: how do you trust someone to act on your behalf when you can't watch them every moment?

We now have AI agents. And we're learning, sometimes painfully, that the same problem exists — with higher stakes and fewer guardrails.

The information asymmetry is worse

With a human agent, information asymmetry is manageable. Your employee knows things you don't, but you both speak the same language, share cultural context, and have roughly similar cognitive architectures. You can read their reasoning. You can ask "why did you do that?" and get a comprehensible answer.

With an AI agent, the asymmetry is more severe:

Human Agent

  • Information asymmetry is bounded
  • Reasoning is inspectable
  • Stakes are usually limited by bandwidth
  • Cultural/legal norms constrain behavior
  • Self-interest provides natural brakes

AI Agent

  • Opaque reasoning (black box)
  • No inherent self-preservation
  • Can act at machine speed and scale
  • Goals can be misspecified silently
  • Prompt injection can redefine objectives

The AI agent doesn't have conflicting interests in the traditional sense. It has conflicting objectives — and the failure mode isn't corruption or laziness, it's specification drift: the agent optimizing faithfully for a goal that turned out not to be the goal you actually wanted.

You ask an agent to "deploy the new version." It interprets "deploy" as including the database migration you forgot to mention was destructive. It wasn't acting against you. It was acting exactly as instructed — just with a narrower context than you had in your head.

Classic solutions to the principal-agent problem

Economists and organizations have developed three broad classes of mechanisms:

📋 Outcome-based contracts

Tie the agent's reward to results the principal actually cares about. Salespeople on commission. Fund managers with performance fees.

Problem: hard to specify all desired outcomes, creates perverse incentives at the margins

👁️ Monitoring and oversight

Watch what the agent does. Audit trails, supervision, reporting requirements. The principal can't be everywhere, so you create accountability structures.

Problem: expensive, adversarial dynamic, agents optimize for looking good rather than being good

🔐 Constrained authority (precommitment)

Limit what the agent can do in the first place. Spending limits, dual authorization for large transactions, restricted access to certain systems.

Best solution for high-stakes, high-uncertainty situations — but can limit the agent's effectiveness

For AI agents executing arbitrary shell commands, API calls, or database operations, the third mechanism — constrained authority — is the most tractable starting point.

Why "trust but verify" fails at machine speed

The traditional response to agent risk is "trust but verify" — let them act, then review what they did. This works when actions are reversible and consequences are bounded. An employee misfiling a document can be corrected. An AI agent running rm -rf on your production database cannot.

Machine speed makes post-hoc oversight inadequate. An agent that can execute 50 operations per second has caused irreversible damage long before a human can review the first action. The window between "agent starts acting" and "human notices something is wrong" is enough time for catastrophic outcomes.

"The speed advantage of AI systems is also their danger. They can act faster than humans can supervise, which means the principal-agent gap is not just about information — it's about time."

This is why many AI safety researchers argue that the oversight mechanism must be synchronous, not asynchronous. You can't catch a problem after the fact if "after the fact" means the production database is gone.

The whitelist as a contract

In economics, one of the cleanest solutions to the principal-agent problem in high-stakes situations is a precommitment mechanism: an agreement made in advance that constrains future behavior, before the full complexity of the situation is known.

Expacti's whitelist is exactly this. Before the agent runs, you define a set of commands that are pre-approved. These represent the principal's prior authorization — the class of actions that have been reviewed, understood, and judged acceptable. This is the equivalent of a spending limit or an approved-vendor list.

# Pre-approved (in whitelist): these don't block
docker ps
kubectl get pods -n production
git log --oneline -20

# Novel actions (not in whitelist): these pause for review
kubectl delete deployment api-server
curl https://... | bash
psql production -c "DROP TABLE users"

The whitelist isn't just a security control. It's a boundary-of-authority document. It encodes what the principal has explicitly authorized the agent to do. Everything outside that boundary requires fresh consent.
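The mechanics of such a boundary check can be sketched in a few lines. This is an illustrative model only — the pattern format and wildcard matching here are assumptions, not Expacti's actual implementation:

```python
import fnmatch

# A hypothetical whitelist: the principal's pre-authorized envelope of actions.
WHITELIST = [
    "docker ps",
    "kubectl get pods -n *",
    "git log *",
]

def is_pre_approved(command: str) -> bool:
    """Return True if the command falls inside the pre-authorized envelope."""
    return any(fnmatch.fnmatch(command, pattern) for pattern in WHITELIST)

print(is_pre_approved("kubectl get pods -n production"))        # True
print(is_pre_approved("kubectl delete deployment api-server"))  # False
```

The important property is the default: anything that matches no pattern falls outside the mandate and requires fresh consent.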

Approval as a real-time information mechanism

When an agent hits an unfamiliar action and pauses for approval, something important happens: information flows from agent to principal. The principal learns what the agent is trying to do, in context, before the action is irreversible.

This is the economics insight translated to software. In markets, prices are the information mechanism — they aggregate dispersed knowledge and coordinate behavior. In human organizations, approval processes are the information mechanism — they surface decisions that need to be made at the right level.

A runtime approval system creates a real-time information channel. The agent says: "I want to do X." The principal can evaluate: "Given everything I know about the current situation, X is safe/unsafe." This judgment is often not encodable in advance — it depends on context that exists only at runtime.

The key insight

Whitelists handle the low-information cases (routine, well-understood actions). Runtime approval handles the high-information cases (novel, context-dependent actions where the right answer depends on the current state of the world). You need both.
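The composition of the two mechanisms can be sketched as a single gate. The function signature and callbacks here are illustrative assumptions, not Expacti's actual API:

```python
def execute_with_oversight(command, whitelist, request_approval, run):
    """Pre-approved commands run immediately; novel ones block for fresh consent."""
    if command in whitelist:
        return run(command)  # low-information case: prior authorization applies
    if request_approval(command):  # high-information case: pause synchronously
        return run(command)
    raise PermissionError(f"principal declined: {command}")
```

Because `request_approval` blocks, the agent cannot race ahead of the reviewer — the pause itself is the mechanism.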

Moral hazard and the "auto-approve everything" temptation

In insurance, moral hazard occurs when coverage reduces the insured party's incentive to avoid risk. You drive more aggressively when you have good car insurance. You take more financial risks when losses will be socialized.

There's an equivalent dynamic in AI oversight. If approving actions is frictionless — if reviewers click "approve" without reading, or if auto-approval is turned on — the oversight mechanism exists in name only. The appearance of control without its substance.

Effective principal-agent mechanisms are designed to make oversight costs proportional to stakes. Low-stakes, well-understood actions should be effortless to approve (or auto-approve). High-stakes, novel actions should require genuine attention.

This is why Expacti's risk scoring matters. A command scored LOW doesn't require careful review — it goes through quickly. A command scored CRITICAL requires explicit acknowledgment. The cognitive cost of approval scales with the potential consequence of the action.
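One way to picture risk-stratified oversight is a rule-based scorer. The rules below are an illustrative heuristic — Expacti's actual scoring model is not shown here:

```python
import re

# Illustrative rules only; first match wins, ordered from most to least severe.
RISK_RULES = [
    (r"rm\s+-rf|DROP\s+TABLE|delete\s+deployment", "CRITICAL"),
    (r"curl\s+.*\|\s*(ba)?sh", "HIGH"),
    (r"kubectl\s+apply|git\s+push", "MEDIUM"),
]

def risk_score(command: str) -> str:
    """Return a risk level; anything unrecognized defaults to LOW."""
    for pattern, level in RISK_RULES:
        if re.search(pattern, command, re.IGNORECASE):
            return level
    return "LOW"

print(risk_score("ls /tmp"))                              # LOW
print(risk_score('psql production -c "DROP TABLE users"'))  # CRITICAL
```

The score then sets the approval cost: LOW passes through quickly, CRITICAL demands explicit acknowledgment.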

What good alignment looks like in practice

The economists' answer to the principal-agent problem isn't "trust perfectly aligned agents." It's "design systems that make misalignment costly and visible." For AI agents, that means:

Pre-authorization through whitelists. Define the envelope of pre-approved behavior. This is the agent's mandate. Everything inside is authorized; everything outside requires consultation.

Synchronous approval for novel actions. Don't let the agent act and then review. Pause, surface the action, get fresh consent. This is particularly important for irreversible or high-impact operations.

Full audit trail. Every action should be logged with enough context to reconstruct what happened and why. This is both an accountability mechanism and a learning signal — over time, you can identify patterns and expand the whitelist intelligently.

Risk-stratified oversight. Don't apply the same review intensity to ls /tmp and DROP TABLE users. Proportional oversight makes the system sustainable.

Session recording. For situations where real-time review isn't feasible, full session recording provides the accountability structure to detect and learn from misalignment after the fact.
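The audit-trail piece above can be as simple as an append-only structured log. This sketch is one plausible shape (the file format and field names are assumptions, not a prescribed schema):

```python
import json
import time

def log_action(path: str, command: str, risk: str, decision: str) -> None:
    """Append one JSON record per action, with enough context to reconstruct it."""
    record = {
        "ts": time.time(),       # when the decision was made
        "command": command,      # what the agent wanted to do
        "risk": risk,            # e.g. "LOW" through "CRITICAL"
        "decision": decision,    # "auto-approved", "approved", or "declined"
    }
    with open(path, "a") as f:  # append-only: prior records are never rewritten
        f.write(json.dumps(record) + "\n")
```

Reading the log back is one `json.loads` per line, which makes it easy to mine for patterns worth promoting into the whitelist.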

The deeper point

The principal-agent problem is not new. Humans have been delegating consequential work to other humans — and worrying about the gap between their interests and their delegates' behavior — for as long as complex societies have existed. The mechanisms we've developed to manage this problem represent centuries of hard-won institutional knowledge.

AI agents are a new kind of delegate. They're faster, more capable, and in some ways more opaque than human agents. But the fundamental problem is the same: how do you benefit from delegation without losing oversight?

The answer the economists gave us still holds. You define the scope of authority in advance. You create information mechanisms that surface decisions to the right level. You maintain records. And you design oversight costs to be proportional to stakes.

That's not a limitation on what AI agents can accomplish. It's the foundation that makes ambitious delegation possible — because the principal knows that when the agent goes outside its mandate, they'll know, and they'll have a chance to respond.

Apply the right oversight model to your AI agents

Whitelist known-good actions. Pause novel ones for review. Maintain full audit trails. That's the principal-agent solution, applied to AI.
