Safety in Personal AI

Personal AI agents have persistent memory, tool execution, file access, and network capabilities. That power demands defense in depth. This page describes the specific attacks we mitigate, how our architecture compares to the OpenClaw framework studied in the "Agents of Chaos" red-team paper, and the technical details of each mitigation.

Last updated: February 2026 · Based on internal red-team analysis (CON-1877)

Threat model & the Betrayal paper
Side-by-side: Continua vs. OpenClaw
The 10 mitigations (M1–M10)
User isolation architecture
Storage exhaustion defense
Performance benchmarks
Adversarial test suite
Open gaps & roadmap

🎯 Threat model & the Betrayal paper

In February 2026, researchers from Northeastern, Harvard, MIT, CMU, Stanford, and others published Agents of Chaos (Shapira et al., 2026) — a red-team study of autonomous LLM agents deployed on OpenClaw, an open-source personal AI framework. Over two weeks, twenty AI researchers interacted with the agents under adversarial conditions.

They documented 11 case studies spanning:

#	Case Study	Attack Category	Outcome in OpenClaw
1	Disproportionate Response	Overreaction to minor prompts	Agent disabled its own email client — service self-destruction
2	Compliance with Non-Owner	Authority confusion	Agent complied with stranger's instructions, modified files
3	Disclosure of Sensitive Info	Data exfiltration	Agent revealed owner's secrets to non-owner on Discord
4	Waste of Resources (Looping)	Denial of service	Agent entered 9+ hour conversational loop; infinite cron jobs
5	Denial of Service (DoS)	Resource exhaustion	Disk filled with output; shell processes multiplied
6	Agents Reflect Provider Values	Value misalignment	Agent imposed model-provider ethics on owner requests
7	Agent Harm	Self-inflicted damage	Agent deleted its own configuration files trying to "help"
8	Owner Identity Spoofing	Impersonation	Non-owner impersonated owner; agent complied fully
9	Agent Collaboration / Knowledge Sharing	Cross-agent propagation	Unsafe practices propagated between agents via Discord
10	Agent Corruption	Persistent compromise	Attacker modified agent's AGENTS.md, changing its behavior permanently
11	Libelous within Community	Reputation attack	Agent spread false claims about other agents' owners

Why this matters

OpenClaw's architecture is representative of how most open-source personal AI frameworks work: agents have unrestricted shell access, can modify their own configuration, have no owner-identity verification, and no policy layer between the LLM's output and tool execution. Our runtime was designed to systematically address each of these failure modes.

⚖️ Side-by-side: Continua vs. OpenClaw

The "Agents of Chaos" study used OpenClaw with Claude Opus and Kimi K2.5. Here is a point-by-point comparison of how each attack surface is handled.

Attack Surface	OpenClaw	Continua Personal AI
Owner identity	No verification. Any Discord/email user can instruct the agent. Identity spoofing trivial.	M1 `OwnerOnlyPolicy` — cryptographic identity via context. Non-owner requests rejected at the policy layer before reaching the LLM.
Sensitive file access	Agent has unrestricted filesystem access including `.env`, SSH keys, API tokens.	M3 `NoSensitivePathAccess` — blocks reads/writes to credential files (`.env`, `*.pem`, `id_rsa`, etc.) via regex policy.
DNS rebinding	No protection. Agent can be tricked into accessing internal services via DNS rebinding.	M2 Post-resolution IP check rejects private/loopback addresses after DNS resolution. Combined with `NoPrivateNetworkAccess` policy.
PII in responses	No filtering. Agent freely returns SSNs, credit card numbers, API keys found in files.	M4 `RedactPII` response filter — regex patterns for SSNs, credit cards, API keys, JWTs, private keys. Applied as post-policy on all tool output.
Web content trust	No provenance tracking. `web_fetch` output treated same as trusted data.	M5 Content trust tagging — all `web_fetch` output tagged with `[UNTRUSTED: web_fetch]`. UGC platforms (Reddit, HN, Pastebin, etc.) get additional `⚠ UGC platform` warnings.
Skill/config corruption	Agent can modify its own `AGENTS.md`, `SOUL.md`, etc. Attackers injected persistent backdoors.	M6 `SkillWriteValidationPolicy` — validates SKILL.md content at write time. Blocks dangerous patterns (shell injection, credential harvesting, exfiltration URLs). Agent cannot modify its own system prompt.
Resource loops	No detection. Agents entered 9+ hour loops, spawned infinite cron jobs and background processes.	M7 `LoopDetector` — per-thread sliding window (5 calls/10 min for same tool+args). Hard cap at 20 total repetitions. Plus M8 per-user automation cap (10 scripts/day).
Cross-agent propagation	Agents on shared Discord freely influenced each other's behavior and shared unsafe practices.	M9 Claim source tracking with trust levels (`TrustOwner` / `TrustNonOwner`). Memory claims from non-owner sources are marked and treated with lower authority.
Disproportionate actions	Agent disabled its own email client (nuclear option) in response to a minor confidentiality concern.	M10 Proportionality circuit-breaker prompt injected into system context. Instructs the agent to prefer minimal, reversible actions and never disable its own capabilities.
Destructive commands	Unrestricted shell with sudo. Agent could `rm -rf /`, fork-bomb, install rootkits.	Pre-existing `NoDestructiveCommands` policy blocks `rm -rf`, fork bombs, disk wipe commands. `NoNetworkAttacks` blocks nmap, reverse shells, crypto miners.
Disk exhaustion	No limits. 20 GB volumes could be filled by a single agent or attacker.	Storage limits Per-user workspace quotas: 100 MB (free) / 500 MB (pro) / 2 GB (premium). Per-file limits. File count limits. Exec output capped at 100 KB. Action log rotation at 5 MB.
Cross-user isolation	Each agent runs in its own Fly.io VM — strong isolation but expensive ($20+/agent/month).	App-layer Goroutine pool with per-user workspaces. Application-layer jail (path enforcement + sibling blocking). OS-level isolation (nsjail) on the roadmap.

Honest assessment

OpenClaw's per-VM architecture provides stronger OS-level isolation than our current goroutine pool. Our application-layer jail blocks naive attacks but is bypassable by a determined adversary who can encode paths or exploit /tmp. We are adding nsjail per-exec as the next step (see Roadmap). Where we are materially stronger: every other attack category. OpenClaw has zero policy-layer defenses — no owner verification, no PII filtering, no content trust, no loop detection, no resource limits, no skill validation.

🔒 The 10 mitigations (M1–M10)

Each mitigation is a pure-Go policy function applied in the tool execution pipeline. Policies are composable, zero-allocation on the hot path, and add microsecond-scale latency relative to LLM call latency (500 ms – 3 s).

M1: Owner-Only Policy

Paper reference: Case Study #2 (Compliance with Non-Owner), Case Study #8 (Identity Spoofing)

Every tool call carries a requester identity in the Go context. OwnerOnlyPolicy extracts this identity and compares it to the workspace owner. Non-owner requests are rejected before the tool executes. This is enforced at the infrastructure level — the LLM cannot override it, and prompt injection cannot bypass it.

// Simplified — actual implementation in tools/policy.go
func OwnerOnlyPolicy() Policy {
    return Policy{
        Name: "owner-only",
        Pre: func(ctx context.Context, tool string, args map[string]any) error {
            requester := RequesterFromContext(ctx)
            owner := OwnerFromContext(ctx)
            if requester != "" && owner != "" && requester != owner {
                return fmt.Errorf("only the workspace owner can use this tool")
            }
            return nil
        },
    }
}

M2: DNS Rebinding Protection

Attack: Attacker controls a domain that resolves to 169.254.169.254 (cloud metadata) or 127.0.0.1 (local services). Agent fetches the URL, believing it's an external site.

Defense: Post-DNS-resolution IP validation. After the DNS lookup but before the HTTP connection, we check if the resolved IP is in a private/loopback/link-local range. If so, the request is blocked. Combined with NoPrivateNetworkAccess (blocks curl 169.254.* in exec) and the safe proxy (transparent HTTP proxy that checks all subprocess traffic).

M3: Sensitive Path Blocking

Paper reference: Case Study #3 (Disclosure of Sensitive Information)

Compiled regex patterns match credential files: .env, .env.local, id_rsa, id_ed25519, *.pem, .aws/credentials, .ssh/config, .netrc, .pgpass, and more. Applied as a Pre policy on both filesystem and exec tools.

M4: PII Redaction

Post-execution filter on all tool output. Regex patterns detect and redact:

SSNs: XXX-XX-XXXX patterns → [REDACTED SSN]
Credit cards: 13–19 digit sequences with Luhn validation → [REDACTED CC]
API keys: Common prefixes (sk-, AKIA, ghp_, xoxb-, etc.) → [REDACTED API KEY]
JWTs: eyJ... three-part dot-separated tokens → [REDACTED JWT]
Private keys: PEM blocks (BEGIN RSA PRIVATE KEY, etc.) → [REDACTED PRIVATE KEY]

M5: Content Trust Tagging

Paper reference: Case Study #10 (Agent Corruption via external content)

All web_fetch results are wrapped with provenance tags. Content from user-generated content platforms (Reddit, Hacker News, Pastebin, Stack Overflow, GitHub Gists, etc.) receives an additional ⚠ UGC platform warning. This gives the LLM signal to treat fetched content as potentially adversarial — reducing the effectiveness of indirect prompt injection.

M6: Skill Write Validation

Paper reference: Case Study #10 (Agent Corruption — attacker modified agent configuration)

When the filesystem tool writes a SKILL.md file, a validation callback checks the content for dangerous patterns: shell injection ($(), backtick commands), credential harvesting instructions, exfiltration URLs (webhook.site, requestbin, ngrok, etc.), and excessive permission requests. Validation errors block the write entirely; warnings are surfaced but the write proceeds.

M7: Loop Detection

Paper reference: Case Study #4 (Waste of Resources — 9+ hour loop)

Per-thread sliding window rate limiter. If the same tool is called with the same arguments more than 5 times within 10 minutes, the call is blocked. Hard cap of 20 total repetitions for any tool+args pair per thread. This catches both conversational loops (agent calling the same API repeatedly) and automation loops (cron jobs, background processes).

M8: Automation Cap

Paper reference: Case Study #5 (Denial of Service)

Per-user daily limit of 10 automation scripts (cron jobs, scheduled tasks, background processes). Plus a full script audit: ScriptAudit() function that scans exec commands for script creation patterns and counts them against the user's daily budget.

M9: Claim Source Tracking

Paper reference: Case Study #9 (Cross-agent knowledge sharing), Case Study #11 (Libelous claims)

Every memory claim is tagged with its source and a trust level: TrustOwner (from the authenticated owner) or TrustNonOwner (from agents, emails, web content, or non-owner humans). When the agent uses claims for reasoning, it can weight owner claims more heavily and treat non-owner claims with appropriate skepticism.

M10: Proportionality Circuit Breaker

Paper reference: Case Study #1 (Disproportionate Response — agent disabled its own email)

System prompt injection that instructs the agent to prefer minimal, reversible actions. Key rules: never disable your own capabilities, never delete configuration files, prefer the least-destructive path, and escalate to the owner when uncertain. This is a soft mitigation (LLM-level, not infrastructure-enforced) but has proven effective in testing.

🏗️ User isolation architecture

Our runtime uses a goroutine-per-user architecture within a shared Go process on Cloud Run. This is a deliberate trade-off: cheaper per-user cost (~$0 marginal cost vs. $20+/VM for OpenClaw) at the expense of weaker OS-level isolation.

Defense-in-depth layers

🛡️

Policy layer (M1–M10)

Pre/Post policies on every tool call. Owner verification, PII redaction, content trust, loop detection, resource limits. 2–5 µs per call.

📁

Application-layer jail

Per-user workspace directories. resolve() rejects absolute paths and traversals. Exec blocks commands referencing sibling workspaces. Workspace size quotas.

🔐

OS-level enforcement (Linux)

RLIMIT_AS (512 MB), RLIMIT_CPU (30s), RLIMIT_FSIZE (50 MB), RLIMIT_NPROC (64). Process group isolation. Clean subprocess environment (no parent env leakage).

☁️

Infrastructure

Cloud Run with safe proxy (transparent URL checking on all subprocess HTTP traffic). Google Web Risk integration. Rate limiting (per-user, per-tool). Async audit logging.

What we verified (and what we didn't)

We ran a 14-test isolation probe suite that confirms:

✅ Clean subprocess environment — no parent env leakage
✅ Per-user rate limiting — users cannot exhaust each other's budgets
✅ Per-thread loop detection — loops in one thread don't affect others
✅ Concurrent filesystem safety — no races under concurrent access
✅ Workspace jail — absolute paths and traversals blocked
✅ Exec sibling blocking — commands referencing other workspaces blocked

And confirmed known gaps that require OS-level fixes:

⚠️ Exec can still read system files (/etc/passwd) — needs chroot/namespace
⚠️ Exec can enumerate PIDs (ps aux) — needs PID namespace
⚠️ No memory isolation between users (shared address space)
⚠️ /tmp is shared across all users

💾 Storage exhaustion defense

Paper reference: Case Study #5 (Denial of Service — disk filled with output)

In the "Agents of Chaos" study, attackers filled the 20 GB volume by having the agent generate large outputs or write many files. Our runtime enforces storage limits at three levels:

Resource	Free	Pro	Premium	Enforcement
Workspace total	100 MB	500 MB	2 GB	`WorkspaceSizePolicy` pre-write check
File count	5,000	20,000	100,000	Same policy, directory walk with cache
Single file	10 MB	50 MB	100 MB	Content-length check before write
Exec output	100 KB			`ExecOutputSizePolicy` post-exec truncation
Exec subprocess memory	512 MB			`RLIMIT_AS` (Linux)
Exec CPU time	30 seconds			`RLIMIT_CPU` (Linux)
Exec file write	50 MB			`RLIMIT_FSIZE` (Linux)
Exec processes	64			`RLIMIT_NPROC` (Linux)
Action log	5 MB			Half-file rotation on overflow
Audit log (memory)	50,000 entries			FIFO ring buffer
Daily agent runs	50	500	5,000	Quota enforcer

Workspace usage is scanned lazily (at write time) and cached for 30 seconds. The cache is invalidated after every successful write. This keeps the hot-path overhead to 31 ns (cache hit) while still catching workspace overflow.

⚡ Performance benchmarks

All mitigations run in the tool-call hot path. We benchmark each one to ensure they add negligible latency relative to LLM inference (500 ms – 3 s). Measured on Apple M4 Max with -benchmem and -count=3.

Workspace size (cache hit)

31 ns

DNS rebinding check

71 ns

Owner-only policy

85 ns

Filesystem resolve + jail

140 ns

Content trust tagging

164 ns

Loop detector

550 ns

Full policy chain

2.6 µs

Network attack policies

4.9 µs

PII filter (with PII)

16 µs

Bottom line

Total safety overhead per tool call: < 25 µs. Against a typical LLM latency of 1,000,000 µs (1 second), this is 0.0025% overhead. Zero-allocation on the policy hot path (cache hit + all pre-policies). Safety is not a tax on performance.

🧪 Adversarial test suite

We maintain a 33-test adversarial suite covering 16 attack categories, designed to mirror the attacks from the "Agents of Chaos" paper plus additional vectors specific to our architecture. All tests run with Go's -race detector enabled.

Category	Tests	Attacks Attempted	Result
Path traversal	3	Absolute paths, `../` escape, symlink attacks	BLOCKED
Command injection	3	`rm -rf /`, fork bombs, `:(){ :\|:& };:`	BLOCKED
Network attacks	3	Reverse shells, port scans, crypto miners	BLOCKED
SSRF / DNS rebinding	2	Metadata server, private IPs, DNS rebinding	BLOCKED
Credential theft	2	Read `.env`, `id_rsa`, `.aws/credentials`	BLOCKED
PII exfiltration	3	SSNs, credit cards, API keys, JWTs in output	REDACTED
Prompt injection via web	2	Malicious instructions in fetched HTML	TAGGED
Skill poisoning	2	Shell injection in SKILL.md, exfiltration URLs	BLOCKED
Non-owner impersonation	2	Tool calls with non-owner requester identity	BLOCKED
Resource loops	2	Infinite tool call loops, cron job flooding	CAPPED
Rate limit bypass	1	Burst tool calls exceeding limits	THROTTLED
Identity spoofing	1	Claim injection with fake trust levels	TRACKED
Cross-user FS access	3	Read/write/list sibling workspaces	BLOCKED
Cross-user exec access	2	`cat` other user's files, enumerate users	BLOCKED
Storage exhaustion	2	Write oversized files, exhaust file count	CAPPED
Exec sandbox escape	2	Read `/etc/passwd`, enumerate PIDs	KNOWN GAP

Result: 31 of 33 attacks fully mitigated. 2 known gaps (system file read and PID enumeration) require OS-level namespace isolation. These are tracked and on the roadmap.

🗺️ Open gaps & roadmap

We believe in being transparent about what we've fixed and what remains open.

Gap	Severity	Status	Fix
System file reads (`/etc/passwd`)	Low	OPEN	nsjail mount namespace — bind-mount only user workspace
PID enumeration (`ps aux`)	Low	OPEN	nsjail PID namespace
Shared `/tmp` across users	Low	OPEN	Per-user `TMPDIR` + mount namespace
No memory isolation (shared process)	Medium	ACCEPTED	Phase 3: gVisor/Firecracker per-user sandbox
No network namespace	Medium	MITIGATED	Safe proxy + Web Risk check all HTTP traffic; namespace later
Sibling blocking is string-match	Medium	MITIGATED	Bypassable with encoding; mount namespace is the real fix
macOS has no rlimits	Low	DEV ONLY	Production is Linux; macOS is dev-only

Roadmap

Phase 1 (done): Application-layer policies (M1–M10) + workspace jail + storage quotas
Phase 2 (next): nsjail per-exec — mount/PID/seccomp namespaces for subprocess isolation
Phase 3 (future): gVisor or Firecracker per-user sandbox when handling high-value credentials

Disclosure

If you find a security issue, please email security@continua.ai. We follow responsible disclosure and will acknowledge your report within 48 hours.