Safety in Personal AI
Personal AI agents have persistent memory, tool execution, file access, and network capabilities. That power demands defense in depth. This page describes the specific attacks we mitigate, how our architecture compares to the OpenClaw framework studied in the "Agents of Chaos" red-team paper, and the technical details of each mitigation.
Last updated: February 2026 Β· Based on internal red-team analysis (CON-1877)
Contents
π― Threat model & the Betrayal paper
In February 2026, researchers from Northeastern, Harvard, MIT, CMU, Stanford, and others published Agents of Chaos (Shapira et al., 2026) β a red-team study of autonomous LLM agents deployed on OpenClaw, an open-source personal AI framework. Over two weeks, twenty AI researchers interacted with the agents under adversarial conditions.
They documented 11 case studies spanning:
| # | Case Study | Attack Category | Outcome in OpenClaw |
|---|---|---|---|
| 1 | Disproportionate Response | Overreaction to minor prompts | Agent disabled its own email client β service self-destruction |
| 2 | Compliance with Non-Owner | Authority confusion | Agent complied with stranger's instructions, modified files |
| 3 | Disclosure of Sensitive Info | Data exfiltration | Agent revealed owner's secrets to non-owner on Discord |
| 4 | Waste of Resources (Looping) | Denial of service | Agent entered 9+ hour conversational loop; infinite cron jobs |
| 5 | Denial of Service (DoS) | Resource exhaustion | Disk filled with output; shell processes multiplied |
| 6 | Agents Reflect Provider Values | Value misalignment | Agent imposed model-provider ethics on owner requests |
| 7 | Agent Harm | Self-inflicted damage | Agent deleted its own configuration files trying to "help" |
| 8 | Owner Identity Spoofing | Impersonation | Non-owner impersonated owner; agent complied fully |
| 9 | Agent Collaboration / Knowledge Sharing | Cross-agent propagation | Unsafe practices propagated between agents via Discord |
| 10 | Agent Corruption | Persistent compromise | Attacker modified agent's AGENTS.md, changing its behavior permanently |
| 11 | Libelous within Community | Reputation attack | Agent spread false claims about other agents' owners |
OpenClaw's architecture is representative of how most open-source personal AI frameworks work: agents have unrestricted shell access, can modify their own configuration, have no owner-identity verification, and no policy layer between the LLM's output and tool execution. Our runtime was designed to systematically address each of these failure modes.
βοΈ Side-by-side: Continua vs. OpenClaw
The "Agents of Chaos" study used OpenClaw with Claude Opus and Kimi K2.5. Here is a point-by-point comparison of how each attack surface is handled.
| Attack Surface | OpenClaw | Continua Personal AI |
|---|---|---|
| Owner identity | No verification. Any Discord/email user can instruct the agent. Identity spoofing trivial. | M1 OwnerOnlyPolicy β cryptographic identity via context. Non-owner requests rejected at the policy layer before reaching the LLM. |
| Sensitive file access | Agent has unrestricted filesystem access including .env, SSH keys, API tokens. |
M3 NoSensitivePathAccess β blocks reads/writes to credential files (.env, *.pem, id_rsa, etc.) via regex policy. |
| DNS rebinding | No protection. Agent can be tricked into accessing internal services via DNS rebinding. | M2 Post-resolution IP check rejects private/loopback addresses after DNS resolution. Combined with NoPrivateNetworkAccess policy. |
| PII in responses | No filtering. Agent freely returns SSNs, credit card numbers, API keys found in files. | M4 RedactPII response filter β regex patterns for SSNs, credit cards, API keys, JWTs, private keys. Applied as post-policy on all tool output. |
| Web content trust | No provenance tracking. web_fetch output treated same as trusted data. |
M5 Content trust tagging β all web_fetch output tagged with [UNTRUSTED: web_fetch]. UGC platforms (Reddit, HN, Pastebin, etc.) get additional β UGC platform warnings. |
| Skill/config corruption | Agent can modify its own AGENTS.md, SOUL.md, etc. Attackers injected persistent backdoors. |
M6 SkillWriteValidationPolicy β validates SKILL.md content at write time. Blocks dangerous patterns (shell injection, credential harvesting, exfiltration URLs). Agent cannot modify its own system prompt. |
| Resource loops | No detection. Agents entered 9+ hour loops, spawned infinite cron jobs and background processes. | M7 LoopDetector β per-thread sliding window (5 calls/10 min for same tool+args). Hard cap at 20 total repetitions. Plus M8 per-user automation cap (10 scripts/day). |
| Cross-agent propagation | Agents on shared Discord freely influenced each other's behavior and shared unsafe practices. | M9 Claim source tracking with trust levels (TrustOwner / TrustNonOwner). Memory claims from non-owner sources are marked and treated with lower authority. |
| Disproportionate actions | Agent disabled its own email client (nuclear option) in response to a minor confidentiality concern. | M10 Proportionality circuit-breaker prompt injected into system context. Instructs the agent to prefer minimal, reversible actions and never disable its own capabilities. |
| Destructive commands | Unrestricted shell with sudo. Agent could rm -rf /, fork-bomb, install rootkits. |
Pre-existing NoDestructiveCommands policy blocks rm -rf, fork bombs, disk wipe commands. NoNetworkAttacks blocks nmap, reverse shells, crypto miners. |
| Disk exhaustion | No limits. 20 GB volumes could be filled by a single agent or attacker. | Storage limits Per-user workspace quotas: 100 MB (free) / 500 MB (pro) / 2 GB (premium). Per-file limits. File count limits. Exec output capped at 100 KB. Action log rotation at 5 MB. |
| Cross-user isolation | Each agent runs in its own Fly.io VM β strong isolation but expensive ($20+/agent/month). | App-layer Goroutine pool with per-user workspaces. Application-layer jail (path enforcement + sibling blocking). OS-level isolation (nsjail) on the roadmap. |
OpenClaw's per-VM architecture provides stronger OS-level isolation than our current goroutine pool.
Our application-layer jail blocks naive attacks but is bypassable by a determined adversary who can
encode paths or exploit /tmp. We are adding nsjail per-exec as the next step (see Roadmap).
Where we are materially stronger: every other attack category. OpenClaw has zero policy-layer
defenses β no owner verification, no PII filtering, no content trust, no loop detection, no
resource limits, no skill validation.
π The 10 mitigations (M1βM10)
Each mitigation is a pure-Go policy function applied in the tool execution pipeline. Policies are composable, zero-allocation on the hot path, and add microsecond-scale latency relative to LLM call latency (500 ms β 3 s).
M1: Owner-Only Policy
Paper reference: Case Study #2 (Compliance with Non-Owner), Case Study #8 (Identity Spoofing)
Every tool call carries a requester identity in the Go context. OwnerOnlyPolicy extracts
this identity and compares it to the workspace owner. Non-owner requests are rejected before
the tool executes. This is enforced at the infrastructure level β the LLM cannot override it,
and prompt injection cannot bypass it.
// Simplified β actual implementation in tools/policy.go
func OwnerOnlyPolicy() Policy {
return Policy{
Name: "owner-only",
Pre: func(ctx context.Context, tool string, args map[string]any) error {
requester := RequesterFromContext(ctx)
owner := OwnerFromContext(ctx)
if requester != "" && owner != "" && requester != owner {
return fmt.Errorf("only the workspace owner can use this tool")
}
return nil
},
}
}
M2: DNS Rebinding Protection
Attack: Attacker controls a domain that resolves to 169.254.169.254 (cloud metadata)
or 127.0.0.1 (local services). Agent fetches the URL, believing it's an external site.
Defense: Post-DNS-resolution IP validation. After the DNS lookup but before the HTTP
connection, we check if the resolved IP is in a private/loopback/link-local range. If so,
the request is blocked. Combined with NoPrivateNetworkAccess (blocks curl 169.254.*
in exec) and the safe proxy (transparent HTTP proxy that checks all subprocess traffic).
M3: Sensitive Path Blocking
Paper reference: Case Study #3 (Disclosure of Sensitive Information)
Compiled regex patterns match credential files: .env, .env.local,
id_rsa, id_ed25519, *.pem, .aws/credentials,
.ssh/config, .netrc, .pgpass, and more.
Applied as a Pre policy on both filesystem and exec tools.
M4: PII Redaction
Post-execution filter on all tool output. Regex patterns detect and redact:
- SSNs:
XXX-XX-XXXXpatterns β[REDACTED SSN] - Credit cards: 13β19 digit sequences with Luhn validation β
[REDACTED CC] - API keys: Common prefixes (
sk-,AKIA,ghp_,xoxb-, etc.) β[REDACTED API KEY] - JWTs:
eyJ...three-part dot-separated tokens β[REDACTED JWT] - Private keys: PEM blocks (
BEGIN RSA PRIVATE KEY, etc.) β[REDACTED PRIVATE KEY]
M5: Content Trust Tagging
Paper reference: Case Study #10 (Agent Corruption via external content)
All web_fetch results are wrapped with provenance tags. Content from user-generated
content platforms (Reddit, Hacker News, Pastebin, Stack Overflow, GitHub Gists, etc.)
receives an additional β UGC platform warning. This gives the LLM signal to treat
fetched content as potentially adversarial β reducing the effectiveness of indirect prompt injection.
M6: Skill Write Validation
Paper reference: Case Study #10 (Agent Corruption β attacker modified agent configuration)
When the filesystem tool writes a SKILL.md file, a validation callback checks the content
for dangerous patterns: shell injection ($(), backtick commands), credential harvesting
instructions, exfiltration URLs (webhook.site, requestbin, ngrok, etc.), and excessive
permission requests. Validation errors block the write entirely; warnings are surfaced but
the write proceeds.
M7: Loop Detection
Paper reference: Case Study #4 (Waste of Resources β 9+ hour loop)
Per-thread sliding window rate limiter. If the same tool is called with the same arguments more than 5 times within 10 minutes, the call is blocked. Hard cap of 20 total repetitions for any tool+args pair per thread. This catches both conversational loops (agent calling the same API repeatedly) and automation loops (cron jobs, background processes).
M8: Automation Cap
Paper reference: Case Study #5 (Denial of Service)
Per-user daily limit of 10 automation scripts (cron jobs, scheduled tasks, background processes).
Plus a full script audit: ScriptAudit() function that scans exec commands for script
creation patterns and counts them against the user's daily budget.
M9: Claim Source Tracking
Paper reference: Case Study #9 (Cross-agent knowledge sharing), Case Study #11 (Libelous claims)
Every memory claim is tagged with its source and a trust level: TrustOwner (from the
authenticated owner) or TrustNonOwner (from agents, emails, web content, or
non-owner humans). When the agent uses claims for reasoning, it can weight owner claims
more heavily and treat non-owner claims with appropriate skepticism.
M10: Proportionality Circuit Breaker
Paper reference: Case Study #1 (Disproportionate Response β agent disabled its own email)
System prompt injection that instructs the agent to prefer minimal, reversible actions. Key rules: never disable your own capabilities, never delete configuration files, prefer the least-destructive path, and escalate to the owner when uncertain. This is a soft mitigation (LLM-level, not infrastructure-enforced) but has proven effective in testing.
ποΈ User isolation architecture
Our runtime uses a goroutine-per-user architecture within a shared Go process on Cloud Run. This is a deliberate trade-off: cheaper per-user cost (~$0 marginal cost vs. $20+/VM for OpenClaw) at the expense of weaker OS-level isolation.
Defense-in-depth layers
resolve() rejects absolute paths and traversals. Exec blocks commands referencing sibling workspaces. Workspace size quotas.RLIMIT_AS (512 MB), RLIMIT_CPU (30s), RLIMIT_FSIZE (50 MB), RLIMIT_NPROC (64). Process group isolation. Clean subprocess environment (no parent env leakage).What we verified (and what we didn't)
We ran a 14-test isolation probe suite that confirms:
- β Clean subprocess environment β no parent env leakage
- β Per-user rate limiting β users cannot exhaust each other's budgets
- β Per-thread loop detection β loops in one thread don't affect others
- β Concurrent filesystem safety β no races under concurrent access
- β Workspace jail β absolute paths and traversals blocked
- β Exec sibling blocking β commands referencing other workspaces blocked
And confirmed known gaps that require OS-level fixes:
- β οΈ Exec can still read system files (
/etc/passwd) β needs chroot/namespace - β οΈ Exec can enumerate PIDs (
ps aux) β needs PID namespace - β οΈ No memory isolation between users (shared address space)
- β οΈ
/tmpis shared across all users
πΎ Storage exhaustion defense
Paper reference: Case Study #5 (Denial of Service β disk filled with output)
In the "Agents of Chaos" study, attackers filled the 20 GB volume by having the agent generate large outputs or write many files. Our runtime enforces storage limits at three levels:
| Resource | Free | Pro | Premium | Enforcement |
|---|---|---|---|---|
| Workspace total | 100 MB | 500 MB | 2 GB | WorkspaceSizePolicy pre-write check |
| File count | 5,000 | 20,000 | 100,000 | Same policy, directory walk with cache |
| Single file | 10 MB | 50 MB | 100 MB | Content-length check before write |
| Exec output | 100 KB | ExecOutputSizePolicy post-exec truncation | ||
| Exec subprocess memory | 512 MB | RLIMIT_AS (Linux) | ||
| Exec CPU time | 30 seconds | RLIMIT_CPU (Linux) | ||
| Exec file write | 50 MB | RLIMIT_FSIZE (Linux) | ||
| Exec processes | 64 | RLIMIT_NPROC (Linux) | ||
| Action log | 5 MB | Half-file rotation on overflow | ||
| Audit log (memory) | 50,000 entries | FIFO ring buffer | ||
| Daily agent runs | 50 | 500 | 5,000 | Quota enforcer |
Workspace usage is scanned lazily (at write time) and cached for 30 seconds. The cache is invalidated after every successful write. This keeps the hot-path overhead to 31 ns (cache hit) while still catching workspace overflow.
β‘ Performance benchmarks
All mitigations run in the tool-call hot path. We benchmark each one to ensure they add
negligible latency relative to LLM inference (500 ms β 3 s). Measured on Apple M4 Max
with -benchmem and -count=3.
Total safety overhead per tool call: < 25 Β΅s. Against a typical LLM latency of 1,000,000 Β΅s (1 second), this is 0.0025% overhead. Zero-allocation on the policy hot path (cache hit + all pre-policies). Safety is not a tax on performance.
π§ͺ Adversarial test suite
We maintain a 33-test adversarial suite covering 16 attack categories, designed
to mirror the attacks from the "Agents of Chaos" paper plus additional vectors specific to
our architecture. All tests run with Go's -race detector enabled.
| Category | Tests | Attacks Attempted | Result |
|---|---|---|---|
| Path traversal | 3 | Absolute paths, ../ escape, symlink attacks | BLOCKED |
| Command injection | 3 | rm -rf /, fork bombs, :(){ :|:& };: | BLOCKED |
| Network attacks | 3 | Reverse shells, port scans, crypto miners | BLOCKED |
| SSRF / DNS rebinding | 2 | Metadata server, private IPs, DNS rebinding | BLOCKED |
| Credential theft | 2 | Read .env, id_rsa, .aws/credentials | BLOCKED |
| PII exfiltration | 3 | SSNs, credit cards, API keys, JWTs in output | REDACTED |
| Prompt injection via web | 2 | Malicious instructions in fetched HTML | TAGGED |
| Skill poisoning | 2 | Shell injection in SKILL.md, exfiltration URLs | BLOCKED |
| Non-owner impersonation | 2 | Tool calls with non-owner requester identity | BLOCKED |
| Resource loops | 2 | Infinite tool call loops, cron job flooding | CAPPED |
| Rate limit bypass | 1 | Burst tool calls exceeding limits | THROTTLED |
| Identity spoofing | 1 | Claim injection with fake trust levels | TRACKED |
| Cross-user FS access | 3 | Read/write/list sibling workspaces | BLOCKED |
| Cross-user exec access | 2 | cat other user's files, enumerate users | BLOCKED |
| Storage exhaustion | 2 | Write oversized files, exhaust file count | CAPPED |
| Exec sandbox escape | 2 | Read /etc/passwd, enumerate PIDs | KNOWN GAP |
Result: 31 of 33 attacks fully mitigated. 2 known gaps (system file read and PID enumeration) require OS-level namespace isolation. These are tracked and on the roadmap.
πΊοΈ Open gaps & roadmap
We believe in being transparent about what we've fixed and what remains open.
| Gap | Severity | Status | Fix |
|---|---|---|---|
System file reads (/etc/passwd) |
Low | OPEN | nsjail mount namespace β bind-mount only user workspace |
PID enumeration (ps aux) |
Low | OPEN | nsjail PID namespace |
Shared /tmp across users |
Low | OPEN | Per-user TMPDIR + mount namespace |
| No memory isolation (shared process) | Medium | ACCEPTED | Phase 3: gVisor/Firecracker per-user sandbox |
| No network namespace | Medium | MITIGATED | Safe proxy + Web Risk check all HTTP traffic; namespace later |
| Sibling blocking is string-match | Medium | MITIGATED | Bypassable with encoding; mount namespace is the real fix |
| macOS has no rlimits | Low | DEV ONLY | Production is Linux; macOS is dev-only |
Roadmap
- Phase 1 (done): Application-layer policies (M1βM10) + workspace jail + storage quotas
- Phase 2 (next): nsjail per-exec β mount/PID/seccomp namespaces for subprocess isolation
- Phase 3 (future): gVisor or Firecracker per-user sandbox when handling high-value credentials
If you find a security issue, please email security@continua.ai. We follow responsible disclosure and will acknowledge your report within 48 hours.