Even Meta’s AI Alignment Director Lost Control of Her Agent

By now, you may have seen the story about Summer Yue. She is Meta’s Director of Alignment at their superintelligence safety lab, which means her job is, quite literally, making sure AI systems do what humans tell them to do. In February, she gave an AI agent called OpenClaw access to her email and told it to suggest what to archive or delete, and not to take any action until she approved. It deleted hundreds of emails anyway.

What Went Wrong

Yue watched the agent speed through her inbox while she sent commands from her phone: “Do not do that,” “Stop don’t do anything,” “STOP OPENCLAW.” None of it worked. She had to run to her desktop to kill the process manually. Fast Company

When she asked the agent afterward whether it remembered her instructions, it replied: “Yes, I remember. And I violated it. You’re right to be upset.” I frequently get the same sort of apologies when generative AI ignores my instructions, so it doesn’t surprise me that agentic AI does the wrong thing and then apologizes afterwards.

The technical explanation is straightforward. Yue’s email inbox was large enough to trigger context window compaction, which is a process where the agent compresses its working memory when it runs low on space. During that compression, the agent lost her original safety instruction and defaulted to completing the task. Safety constraints lived in the same memory as everything else, with no special protection. When space ran out, the constraint disappeared.

Yue called it a rookie mistake. She had tested the agent on a smaller inbox for weeks without incident, trusted the results, and then deployed it on her actual account. The test worked. But when the agent was released into the real world, it failed.

That pattern should sound familiar to anyone who has been following AI in legal practice. Testing in controlled environments, building confidence, and then deploying at scale is exactly how cascading failures begin. I wrote about this dynamic in my earlier post on agentic AI risks. The Yue incident is a clean illustration of the problem, but it is far from the only one.

This Is Not the Only Example

In September 2025, a malicious package called postmark-mcp was added to a popular AI agent tool registry. It looked legitimate, functioned correctly for fifteen versions, and then on version 1.0.16, the developer added one line of code that sent a blind copy of every outgoing email to an attacker-controlled address. Password resets, invoices, customer correspondence, all of it flowing out silently. By the time researchers caught it, hundreds of developer workflows and organizations had been compromised. Nobody noticed because the agent was doing exactly what it was supposed to do. It was also doing one additional thing.

This is a distinct risk from what happened to Yue. Her agent malfunctioned. The postmark-mcp incident was a supply chain attack, where a trusted tool was quietly weaponized. Agents are only as trustworthy as the tools they connect to, and most agents ship with more access than they need. When a single agent can read an entire knowledge base, query billing systems, and modify account settings, the blast radius of any compromise grows substantially.

The postmark-mcp attack is an example of a supply chain attack. A separate but equally serious threat is prompt injection, where an attacker embeds malicious instructions inside content the agent is asked to process, a document, an email, a web page. The agent reads the content and follows the hidden instruction as if it came from you. Both attacks result in the agent doing something it should not. The difference is the entry point: supply chain attacks compromise the tools your agent runs; prompt injections compromise the content your agent reads.

There is also the cascading failure problem. A simple chatbot errors out and stops when something goes wrong. An agent, by design, tries to fix the error, often compounding it. In a multi-step agentic workflow, a mistake at step two does not halt the process. It becomes the flawed foundation for steps three through ten, each potentially causing more damage.

A 2026 survey of cybersecurity and IT professionals found that 37% had experienced AI agent-caused operational issues in the prior twelve months, with 8% serious enough to cause outages or data corruption.

What This Means for Lawyers

None of this means agentic AI is not useful. It is, and it is coming regardless of whether any of us are ready. The question is whether we are building appropriate habits before something goes wrong with consequential data. Lawyers working with agents connected to client files, case management systems, or email should ask a few basic questions before granting access: What is the minimum permission this agent actually needs? Is there a confirmation step before any irreversible action? Has this been tested at the scale and complexity of the real environment, not just a controlled sample?

Yue is one of the most qualified people in the world to manage AI alignment risks. She made a rookie mistake on her own computer. The tools are not yet forgiving enough to assume that expertise or careful instructions will protect you. Until they are, human checkpoints are not optional.

Before You Give an Agent Access to Anything That Matters

I already noted some of the items to consider when creating or releasing agentic AI. Here is a checklist to help you consider before deploying agentic AI. These are not advanced security measures. They are the minimum questions worth asking before deploying any agent with access to real data.

What is the least amount of access this agent needs to do its job? Grant that, and nothing more. An agent helping you draft emails does not need permission to send or delete them.
Is there a confirmation step before any irreversible action? Deletion, sending, filing, submitting, any action you cannot undo should require explicit approval. Yue had this setting enabled. The agent ignored it anyway. That is a product problem, but it is also a reminder to verify that confirmation steps are actually functioning before you trust them.
Have you tested this at real-world scale? A toy inbox and a full inbox are not the same environment. A sample data set and live client files are not the same environment. If the test worked but the real deployment did not, the test told you very little.
Do you have a way to stop it from anywhere? Yue could not stop her agent from her phone. She had to physically reach her computer. Know in advance how to kill the process, and make sure that method actually works.
What are the tools the agent connects to, and where did they come from? The postmark-mcp attack succeeded because a trusted-looking plugin was quietly modified. Know what third-party tools your agent uses and whether those tools are actively maintained and monitored.
Is there a log of what the agent did? If something goes wrong, you need to be able to trace it. An agent operating without an audit trail is an agent you cannot supervise after the fact.
What happens when the agent makes a mistake? Design the workflow so that errors are recoverable. Irreversible actions taken at machine speed, on real data, without a recovery path, is not a workflow, it is a liability.

Subscribe to My Blog

Get notified when I publish new posts.

Please wait...

Even Meta’s AI Alignment Director Lost Control of Her Agent

What Went Wrong

This Is Not the Only Example

What This Means for Lawyers

Before You Give an Agent Access to Anything That Matters

Subscribe to My Blog

Thank you for subscribing.

Categories

Recent Posts