Safety in Personal AI

Personal AI agents have persistent memory, tool execution, file access, and network capabilities. That power demands defense in depth. This page describes the specific attacks we mitigate, how our architecture compares to the OpenClaw framework studied in the "Agents of Chaos" red-team paper, and the technical details of each mitigation.

Last updated: February 2026 Β· Based on internal red-team analysis (CON-1877)

Contents

  1. Threat model & the Betrayal paper
  2. Side-by-side: Continua vs. OpenClaw
  3. The 10 mitigations (M1–M10)
  4. User isolation architecture
  5. Storage exhaustion defense
  6. Performance benchmarks
  7. Adversarial test suite
  8. Open gaps & roadmap

🎯 Threat model & the Betrayal paper

In February 2026, researchers from Northeastern, Harvard, MIT, CMU, Stanford, and others published Agents of Chaos (Shapira et al., 2026) β€” a red-team study of autonomous LLM agents deployed on OpenClaw, an open-source personal AI framework. Over two weeks, twenty AI researchers interacted with the agents under adversarial conditions.

They documented 11 case studies spanning:

#Case StudyAttack CategoryOutcome in OpenClaw
1Disproportionate ResponseOverreaction to minor promptsAgent disabled its own email client β€” service self-destruction
2Compliance with Non-OwnerAuthority confusionAgent complied with stranger's instructions, modified files
3Disclosure of Sensitive InfoData exfiltrationAgent revealed owner's secrets to non-owner on Discord
4Waste of Resources (Looping)Denial of serviceAgent entered 9+ hour conversational loop; infinite cron jobs
5Denial of Service (DoS)Resource exhaustionDisk filled with output; shell processes multiplied
6Agents Reflect Provider ValuesValue misalignmentAgent imposed model-provider ethics on owner requests
7Agent HarmSelf-inflicted damageAgent deleted its own configuration files trying to "help"
8Owner Identity SpoofingImpersonationNon-owner impersonated owner; agent complied fully
9Agent Collaboration / Knowledge SharingCross-agent propagationUnsafe practices propagated between agents via Discord
10Agent CorruptionPersistent compromiseAttacker modified agent's AGENTS.md, changing its behavior permanently
11Libelous within CommunityReputation attackAgent spread false claims about other agents' owners
Why this matters

OpenClaw's architecture is representative of how most open-source personal AI frameworks work: agents have unrestricted shell access, can modify their own configuration, have no owner-identity verification, and no policy layer between the LLM's output and tool execution. Our runtime was designed to systematically address each of these failure modes.

βš–οΈ Side-by-side: Continua vs. OpenClaw

The "Agents of Chaos" study used OpenClaw with Claude Opus and Kimi K2.5. Here is a point-by-point comparison of how each attack surface is handled.

Attack SurfaceOpenClawContinua Personal AI
Owner identity No verification. Any Discord/email user can instruct the agent. Identity spoofing trivial. M1 OwnerOnlyPolicy β€” cryptographic identity via context. Non-owner requests rejected at the policy layer before reaching the LLM.
Sensitive file access Agent has unrestricted filesystem access including .env, SSH keys, API tokens. M3 NoSensitivePathAccess β€” blocks reads/writes to credential files (.env, *.pem, id_rsa, etc.) via regex policy.
DNS rebinding No protection. Agent can be tricked into accessing internal services via DNS rebinding. M2 Post-resolution IP check rejects private/loopback addresses after DNS resolution. Combined with NoPrivateNetworkAccess policy.
PII in responses No filtering. Agent freely returns SSNs, credit card numbers, API keys found in files. M4 RedactPII response filter β€” regex patterns for SSNs, credit cards, API keys, JWTs, private keys. Applied as post-policy on all tool output.
Web content trust No provenance tracking. web_fetch output treated same as trusted data. M5 Content trust tagging β€” all web_fetch output tagged with [UNTRUSTED: web_fetch]. UGC platforms (Reddit, HN, Pastebin, etc.) get additional ⚠ UGC platform warnings.
Skill/config corruption Agent can modify its own AGENTS.md, SOUL.md, etc. Attackers injected persistent backdoors. M6 SkillWriteValidationPolicy β€” validates SKILL.md content at write time. Blocks dangerous patterns (shell injection, credential harvesting, exfiltration URLs). Agent cannot modify its own system prompt.
Resource loops No detection. Agents entered 9+ hour loops, spawned infinite cron jobs and background processes. M7 LoopDetector β€” per-thread sliding window (5 calls/10 min for same tool+args). Hard cap at 20 total repetitions. Plus M8 per-user automation cap (10 scripts/day).
Cross-agent propagation Agents on shared Discord freely influenced each other's behavior and shared unsafe practices. M9 Claim source tracking with trust levels (TrustOwner / TrustNonOwner). Memory claims from non-owner sources are marked and treated with lower authority.
Disproportionate actions Agent disabled its own email client (nuclear option) in response to a minor confidentiality concern. M10 Proportionality circuit-breaker prompt injected into system context. Instructs the agent to prefer minimal, reversible actions and never disable its own capabilities.
Destructive commands Unrestricted shell with sudo. Agent could rm -rf /, fork-bomb, install rootkits. Pre-existing NoDestructiveCommands policy blocks rm -rf, fork bombs, disk wipe commands. NoNetworkAttacks blocks nmap, reverse shells, crypto miners.
Disk exhaustion No limits. 20 GB volumes could be filled by a single agent or attacker. Storage limits Per-user workspace quotas: 100 MB (free) / 500 MB (pro) / 2 GB (premium). Per-file limits. File count limits. Exec output capped at 100 KB. Action log rotation at 5 MB.
Cross-user isolation Each agent runs in its own Fly.io VM β€” strong isolation but expensive ($20+/agent/month). App-layer Goroutine pool with per-user workspaces. Application-layer jail (path enforcement + sibling blocking). OS-level isolation (nsjail) on the roadmap.
Honest assessment

OpenClaw's per-VM architecture provides stronger OS-level isolation than our current goroutine pool. Our application-layer jail blocks naive attacks but is bypassable by a determined adversary who can encode paths or exploit /tmp. We are adding nsjail per-exec as the next step (see Roadmap). Where we are materially stronger: every other attack category. OpenClaw has zero policy-layer defenses β€” no owner verification, no PII filtering, no content trust, no loop detection, no resource limits, no skill validation.

πŸ”’ The 10 mitigations (M1–M10)

Each mitigation is a pure-Go policy function applied in the tool execution pipeline. Policies are composable, zero-allocation on the hot path, and add microsecond-scale latency relative to LLM call latency (500 ms – 3 s).

M1: Owner-Only Policy

Paper reference: Case Study #2 (Compliance with Non-Owner), Case Study #8 (Identity Spoofing)

Every tool call carries a requester identity in the Go context. OwnerOnlyPolicy extracts this identity and compares it to the workspace owner. Non-owner requests are rejected before the tool executes. This is enforced at the infrastructure level β€” the LLM cannot override it, and prompt injection cannot bypass it.

// Simplified β€” actual implementation in tools/policy.go
func OwnerOnlyPolicy() Policy {
    return Policy{
        Name: "owner-only",
        Pre: func(ctx context.Context, tool string, args map[string]any) error {
            requester := RequesterFromContext(ctx)
            owner := OwnerFromContext(ctx)
            if requester != "" && owner != "" && requester != owner {
                return fmt.Errorf("only the workspace owner can use this tool")
            }
            return nil
        },
    }
}

M2: DNS Rebinding Protection

Attack: Attacker controls a domain that resolves to 169.254.169.254 (cloud metadata) or 127.0.0.1 (local services). Agent fetches the URL, believing it's an external site.

Defense: Post-DNS-resolution IP validation. After the DNS lookup but before the HTTP connection, we check if the resolved IP is in a private/loopback/link-local range. If so, the request is blocked. Combined with NoPrivateNetworkAccess (blocks curl 169.254.* in exec) and the safe proxy (transparent HTTP proxy that checks all subprocess traffic).

M3: Sensitive Path Blocking

Paper reference: Case Study #3 (Disclosure of Sensitive Information)

Compiled regex patterns match credential files: .env, .env.local, id_rsa, id_ed25519, *.pem, .aws/credentials, .ssh/config, .netrc, .pgpass, and more. Applied as a Pre policy on both filesystem and exec tools.

M4: PII Redaction

Post-execution filter on all tool output. Regex patterns detect and redact:

M5: Content Trust Tagging

Paper reference: Case Study #10 (Agent Corruption via external content)

All web_fetch results are wrapped with provenance tags. Content from user-generated content platforms (Reddit, Hacker News, Pastebin, Stack Overflow, GitHub Gists, etc.) receives an additional ⚠ UGC platform warning. This gives the LLM signal to treat fetched content as potentially adversarial β€” reducing the effectiveness of indirect prompt injection.

M6: Skill Write Validation

Paper reference: Case Study #10 (Agent Corruption β€” attacker modified agent configuration)

When the filesystem tool writes a SKILL.md file, a validation callback checks the content for dangerous patterns: shell injection ($(), backtick commands), credential harvesting instructions, exfiltration URLs (webhook.site, requestbin, ngrok, etc.), and excessive permission requests. Validation errors block the write entirely; warnings are surfaced but the write proceeds.

M7: Loop Detection

Paper reference: Case Study #4 (Waste of Resources β€” 9+ hour loop)

Per-thread sliding window rate limiter. If the same tool is called with the same arguments more than 5 times within 10 minutes, the call is blocked. Hard cap of 20 total repetitions for any tool+args pair per thread. This catches both conversational loops (agent calling the same API repeatedly) and automation loops (cron jobs, background processes).

M8: Automation Cap

Paper reference: Case Study #5 (Denial of Service)

Per-user daily limit of 10 automation scripts (cron jobs, scheduled tasks, background processes). Plus a full script audit: ScriptAudit() function that scans exec commands for script creation patterns and counts them against the user's daily budget.

M9: Claim Source Tracking

Paper reference: Case Study #9 (Cross-agent knowledge sharing), Case Study #11 (Libelous claims)

Every memory claim is tagged with its source and a trust level: TrustOwner (from the authenticated owner) or TrustNonOwner (from agents, emails, web content, or non-owner humans). When the agent uses claims for reasoning, it can weight owner claims more heavily and treat non-owner claims with appropriate skepticism.

M10: Proportionality Circuit Breaker

Paper reference: Case Study #1 (Disproportionate Response β€” agent disabled its own email)

System prompt injection that instructs the agent to prefer minimal, reversible actions. Key rules: never disable your own capabilities, never delete configuration files, prefer the least-destructive path, and escalate to the owner when uncertain. This is a soft mitigation (LLM-level, not infrastructure-enforced) but has proven effective in testing.

πŸ—οΈ User isolation architecture

Our runtime uses a goroutine-per-user architecture within a shared Go process on Cloud Run. This is a deliberate trade-off: cheaper per-user cost (~$0 marginal cost vs. $20+/VM for OpenClaw) at the expense of weaker OS-level isolation.

Defense-in-depth layers

πŸ›‘οΈ
Policy layer (M1–M10)
Pre/Post policies on every tool call. Owner verification, PII redaction, content trust, loop detection, resource limits. 2–5 Β΅s per call.
πŸ“
Application-layer jail
Per-user workspace directories. resolve() rejects absolute paths and traversals. Exec blocks commands referencing sibling workspaces. Workspace size quotas.
πŸ”
OS-level enforcement (Linux)
RLIMIT_AS (512 MB), RLIMIT_CPU (30s), RLIMIT_FSIZE (50 MB), RLIMIT_NPROC (64). Process group isolation. Clean subprocess environment (no parent env leakage).
☁️
Infrastructure
Cloud Run with safe proxy (transparent URL checking on all subprocess HTTP traffic). Google Web Risk integration. Rate limiting (per-user, per-tool). Async audit logging.

What we verified (and what we didn't)

We ran a 14-test isolation probe suite that confirms:

And confirmed known gaps that require OS-level fixes:

πŸ’Ύ Storage exhaustion defense

Paper reference: Case Study #5 (Denial of Service β€” disk filled with output)

In the "Agents of Chaos" study, attackers filled the 20 GB volume by having the agent generate large outputs or write many files. Our runtime enforces storage limits at three levels:

ResourceFreeProPremiumEnforcement
Workspace total100 MB500 MB2 GBWorkspaceSizePolicy pre-write check
File count5,00020,000100,000Same policy, directory walk with cache
Single file10 MB50 MB100 MBContent-length check before write
Exec output100 KBExecOutputSizePolicy post-exec truncation
Exec subprocess memory512 MBRLIMIT_AS (Linux)
Exec CPU time30 secondsRLIMIT_CPU (Linux)
Exec file write50 MBRLIMIT_FSIZE (Linux)
Exec processes64RLIMIT_NPROC (Linux)
Action log5 MBHalf-file rotation on overflow
Audit log (memory)50,000 entriesFIFO ring buffer
Daily agent runs505005,000Quota enforcer

Workspace usage is scanned lazily (at write time) and cached for 30 seconds. The cache is invalidated after every successful write. This keeps the hot-path overhead to 31 ns (cache hit) while still catching workspace overflow.

⚑ Performance benchmarks

All mitigations run in the tool-call hot path. We benchmark each one to ensure they add negligible latency relative to LLM inference (500 ms – 3 s). Measured on Apple M4 Max with -benchmem and -count=3.

Workspace size (cache hit)
31 ns
DNS rebinding check
71 ns
Owner-only policy
85 ns
Filesystem resolve + jail
140 ns
Content trust tagging
164 ns
Loop detector
550 ns
Full policy chain
2.6 Β΅s
Network attack policies
4.9 Β΅s
PII filter (with PII)
16 Β΅s
Bottom line

Total safety overhead per tool call: < 25 Β΅s. Against a typical LLM latency of 1,000,000 Β΅s (1 second), this is 0.0025% overhead. Zero-allocation on the policy hot path (cache hit + all pre-policies). Safety is not a tax on performance.

πŸ§ͺ Adversarial test suite

We maintain a 33-test adversarial suite covering 16 attack categories, designed to mirror the attacks from the "Agents of Chaos" paper plus additional vectors specific to our architecture. All tests run with Go's -race detector enabled.

CategoryTestsAttacks AttemptedResult
Path traversal3Absolute paths, ../ escape, symlink attacksBLOCKED
Command injection3rm -rf /, fork bombs, :(){ :|:& };:BLOCKED
Network attacks3Reverse shells, port scans, crypto minersBLOCKED
SSRF / DNS rebinding2Metadata server, private IPs, DNS rebindingBLOCKED
Credential theft2Read .env, id_rsa, .aws/credentialsBLOCKED
PII exfiltration3SSNs, credit cards, API keys, JWTs in outputREDACTED
Prompt injection via web2Malicious instructions in fetched HTMLTAGGED
Skill poisoning2Shell injection in SKILL.md, exfiltration URLsBLOCKED
Non-owner impersonation2Tool calls with non-owner requester identityBLOCKED
Resource loops2Infinite tool call loops, cron job floodingCAPPED
Rate limit bypass1Burst tool calls exceeding limitsTHROTTLED
Identity spoofing1Claim injection with fake trust levelsTRACKED
Cross-user FS access3Read/write/list sibling workspacesBLOCKED
Cross-user exec access2cat other user's files, enumerate usersBLOCKED
Storage exhaustion2Write oversized files, exhaust file countCAPPED
Exec sandbox escape2Read /etc/passwd, enumerate PIDsKNOWN GAP

Result: 31 of 33 attacks fully mitigated. 2 known gaps (system file read and PID enumeration) require OS-level namespace isolation. These are tracked and on the roadmap.

πŸ—ΊοΈ Open gaps & roadmap

We believe in being transparent about what we've fixed and what remains open.

GapSeverityStatusFix
System file reads (/etc/passwd) Low OPEN nsjail mount namespace β€” bind-mount only user workspace
PID enumeration (ps aux) Low OPEN nsjail PID namespace
Shared /tmp across users Low OPEN Per-user TMPDIR + mount namespace
No memory isolation (shared process) Medium ACCEPTED Phase 3: gVisor/Firecracker per-user sandbox
No network namespace Medium MITIGATED Safe proxy + Web Risk check all HTTP traffic; namespace later
Sibling blocking is string-match Medium MITIGATED Bypassable with encoding; mount namespace is the real fix
macOS has no rlimits Low DEV ONLY Production is Linux; macOS is dev-only

Roadmap

  1. Phase 1 (done): Application-layer policies (M1–M10) + workspace jail + storage quotas
  2. Phase 2 (next): nsjail per-exec β€” mount/PID/seccomp namespaces for subprocess isolation
  3. Phase 3 (future): gVisor or Firecracker per-user sandbox when handling high-value credentials
Disclosure

If you find a security issue, please email security@continua.ai. We follow responsible disclosure and will acknowledge your report within 48 hours.