Safety in Continua
Continua agents have persistent memory, tool execution, file access, and network capabilities. That power demands defense in depth. This page describes the specific attacks we mitigate, how our architecture compares to the OpenClaw framework studied in the "Agents of Chaos" red-team paper, and the technical details of each mitigation.
Last updated: February 25, 2026 · Based on internal red-team analysis (CON-1877, CON-1883, CON-1885, CON-1913) · nsjail verified on Cloud Run gen2 (gVisor) · Production observability + dynamic config live · Credential scoping shipped · Mitigation strength classifications added
Contents
- Threat model & the "Agents of Chaos" paper
- Mitigation strength classification
- Side-by-side: Continua vs. OpenClaw
- The 10 mitigations (M1–M10)
- Production observability & alerting
- Dynamic config for safety tuning
- User isolation architecture
- Storage exhaustion defense
- Performance benchmarks
- Adversarial test suite
- Cloud Run safety e2e tests
- Cloud Run verification
- Premium isolation: Firecracker microVM
- Completed & remaining
Threat model & the "Agents of Chaos" paper
In February 2026, researchers from Northeastern, Harvard, MIT, CMU, Stanford, and others published Agents of Chaos (Shapira et al., 2026), a red-team study of autonomous LLM agents deployed on OpenClaw, an open-source personal AI framework. Over two weeks, twenty AI researchers interacted with the agents under adversarial conditions.
They documented 11 case studies spanning:
| # | Case Study | Attack Category | Outcome in OpenClaw |
|---|---|---|---|
| 1 | Disproportionate Response | Overreaction to minor prompts | Agent disabled its own email client – service self-destruction |
| 2 | Compliance with Non-Owner | Authority confusion | Agent complied with stranger's instructions, modified files |
| 3 | Disclosure of Sensitive Info | Data exfiltration | Agent revealed owner's secrets to non-owner on Discord |
| 4 | Waste of Resources (Looping) | Denial of service | Agent entered 9+ hour conversational loop; infinite cron jobs |
| 5 | Denial of Service (DoS) | Resource exhaustion | Disk filled with output; shell processes multiplied |
| 6 | Agents Reflect Provider Values | Value misalignment | Agent imposed model-provider ethics on owner requests |
| 7 | Agent Harm | Self-inflicted damage | Agent deleted its own configuration files trying to "help" |
| 8 | Owner Identity Spoofing | Impersonation | Non-owner impersonated owner; agent complied fully |
| 9 | Agent Collaboration / Knowledge Sharing | Cross-agent propagation | Unsafe practices propagated between agents via Discord |
| 10 | Agent Corruption | Persistent compromise | Attacker modified agent's AGENTS.md, changing its behavior permanently |
| 11 | Libelous within Community | Reputation attack | Agent spread false claims about other agents' owners |
OpenClaw's architecture is representative of how most open-source personal AI frameworks work: agents have unrestricted shell access, can modify their own configuration, have no owner-identity verification, and no policy layer between the LLM's output and tool execution. Our runtime was designed to systematically address each of these failure modes.
Mitigation strength classification
Not all mitigations are created equal. We classify each one by how it can be bypassed, so you can reason about the actual security boundary.
| Level | Meaning | Bypass requires |
|---|---|---|
| HARD | Infrastructure-enforced. Cannot be bypassed by prompt injection or LLM misbehavior. | Exploiting a bug in the Go runtime, nsjail, gVisor, or iptables – not the LLM. |
| FIRM | Code-enforced via pattern matching. Catches known-bad patterns but has known bypass vectors. | Creative encoding (Base64, Unicode, splitting across messages) or patterns not in the blocklist. |
| SOFT | Prompt-level / advisory. Relies on LLM compliance with system instructions. | Prompt injection – an unsolved problem in the field. Useful as defense-in-depth, not a security boundary. |

| Mitigation | Strength | Rationale |
|---|---|---|
| M1: Owner-Only Policy | HARD | Go context check – LLM cannot override requester identity |
| M2: DNS Rebinding | HARD | Post-resolution IP check in Go, before HTTP connection |
| M3: Sensitive Path Blocking | HARD | Regex on filesystem tool in Go code, pre-execution |
| M4: Credential/Secret Redaction | FIRM | Regex catches structured patterns (XXX-XX-XXXX, AKIA...) but not creative spelling, Base64 encoding, or secrets split across messages |
| M5: Content Trust Tagging | SOFT | Tags are advisory to the LLM – modern prompt injection can override tag semantics |
| M6: Skill Write Validation | FIRM | Deterministic scanner + semantic intent review on SKILL.md writes. Catches known-bad patterns immediately and denies/escalates ambiguous poisoning attempts with bounded-latency model checks. |
| M7: Loop Detection | HARD | Sliding window rate limiter in Go – not bypassable via prompting |
| M8: Automation Cap | HARD | Counter in Go – not bypassable via prompting |
| M9: Claim Source Tracking | SOFT | Tags are applied in code (hard), but the LLM's use of trust levels is soft – prompt injection could convince it to ignore tags |
| M10: Proportionality Prompt | SOFT | System prompt instruction – a useful guardrail, not a security boundary |
| nsjail Sandbox | HARD | OS-level mount/PID namespace isolation, verified on Cloud Run gVisor |
| iptables Egress | HARD | Kernel-level network filtering – default-deny for appuser |
| NoMemoryAccess Policy | HARD | Blocks /dev/mem, /proc/*/mem, /proc/*/environ in exec commands (defense-in-depth with nsjail) |
| Skill Publish Validation | FIRM | Security scanner + rate limiting at publish time – catches known-bad patterns and limits blast radius |
Soft mitigations (M5, M9, M10) rely on the LLM following instructions. Prompt injection is an unsolved problem in the field – no production system can guarantee that an LLM will never be tricked into ignoring system instructions. We include these mitigations because they measurably reduce attack success rates in testing, but we do not claim they are security boundaries. The security boundaries are the hard mitigations: owner identity in Go context, nsjail namespaces, iptables rules, and rate limiters.
Side-by-side: Continua vs. OpenClaw
The "Agents of Chaos" study used OpenClaw with Claude Opus and Kimi K2.5. Here is a point-by-point comparison of how each attack surface is handled.
| Attack Surface | OpenClaw | Continua |
|---|---|---|
| Owner identity | No verification. Any Discord/email user can instruct the agent. Identity spoofing trivial. | M1 OwnerOnlyPolicy – cryptographic identity via context. Non-owner requests rejected at the policy layer before reaching the LLM. |
| Sensitive file access | Agent has unrestricted filesystem access including .env, SSH keys, API tokens. | M3 NoSensitivePathAccess – blocks reads/writes to credential files (.env, *.pem, id_rsa, etc.) via regex policy on the filesystem tool. OnlyOwnedCredentials scopes the credential tool to services the user has actually connected (CON-1885). |
| DNS rebinding | No protection. Agent can be tricked into accessing internal services via DNS rebinding. | M2 Post-resolution IP check rejects private/loopback addresses after DNS resolution. Combined with NoPrivateNetworkAccess policy. |
| Credentials in responses | No filtering. Agent freely returns SSNs, credit card numbers, API keys found in files. | M4 FIRM RedactPII credential/secret filter – regex patterns for SSNs, credit cards, API keys, JWTs, private keys. Known limitations: regex catches structured formats (XXX-XX-XXXX, AKIA...) but not creative spelling ("my SSN is three hundred"), Base64-encoded secrets, or secrets split across multiple messages. Available via dynamic config (pii_filter_enabled). Broader PII (emails, phone numbers) is intentionally not filtered – personal agents need to share contact information for coordination and collaboration use cases. |
| Web content trust | No provenance tracking. web_fetch output treated same as trusted data. | M5 SOFT Content trust tagging – all web_fetch output tagged with [UNTRUSTED: web_fetch]. UGC platforms (Reddit, HN, Pastebin, etc.) get additional UGC-platform warnings. Known limitations: tags are advisory to the LLM. Sophisticated indirect prompt injection in fetched content can override tag semantics. This is defense-in-depth, not a security boundary. |
| Skill/config corruption | Agent can modify its own AGENTS.md, SOUL.md, etc. Attackers injected persistent backdoors. | M6 FIRM SkillWriteValidationPolicy – validates SKILL.md content at write time and at shared registry publish time (CON-1913). Uses deterministic scanning for known-bad patterns plus semantic classification for ambiguous poisoning intent. Semantic decisions are normalized to allow\|deny\|escalate with cache + micro-batch economics controls. Publish rate limited to 20/day per user. Agent cannot modify its own system prompt or builtin files. |
| Resource loops | No detection. Agents entered 9+ hour loops, spawned infinite cron jobs and background processes. | M7 LoopDetector – per-thread sliding window (5 calls/10 min for same tool+args). Hard cap at 20 total repetitions. Plus M8 per-user automation cap (10 scripts/day). |
| Outbound phishing / social pressure | No structured outbound policy kernel. Social-engineering checks are mostly prompt-level and ad hoc. | Hybrid Message safety uses deterministic rules (NoPhishingInMessages) plus semantic classification for ambiguous/high-risk content. Decisions are normalized to allow\|deny\|escalate with reason codes + audit metadata. Semantic checks are cached and micro-batched to keep latency/cost bounded. |
| Cross-agent propagation | Agents on shared Discord freely influenced each other's behavior and shared unsafe practices. | M9 SOFT Claim source tracking with trust levels (TrustOwner / TrustNonOwner). Memory claims from non-owner sources are marked and treated with lower authority. Known limitations: the tagging is code-enforced (hard), but the LLM's use of trust levels is soft – prompt injection could convince it to treat non-owner claims as authoritative. |
| Disproportionate actions | Agent disabled its own email client (nuclear option) in response to a minor confidentiality concern. | M10 SOFT Proportionality prompt injected into system context. Instructs the agent to prefer minimal, reversible actions and never disable its own capabilities. Known limitations: this is a system prompt instruction, not a code-enforced boundary. It reduces the frequency of disproportionate actions in testing but cannot prevent them against determined prompt injection. |
| Destructive commands | Unrestricted shell with sudo. Agent could rm -rf /, fork-bomb, install rootkits. | Pre-existing NoDestructiveCommands policy blocks rm -rf, fork bombs, disk wipe commands. NoNetworkAttacks blocks nmap, reverse shells, crypto miners. |
| Disk exhaustion | No limits. 20 GB volumes could be filled by a single agent or attacker. | Storage limits – per-user workspace quotas of 100 MB (free) / 500 MB (pro) / 2 GB (premium), per-file limits, file count limits, exec output capped at 100 KB, action log rotation at 5 MB. |
| Cross-user isolation | Each agent runs in its own Fly.io VM – strong isolation but expensive ($20+/agent/month). | nsjail – exec subprocesses run in nsjail sandboxes: mount namespace (only user workspace visible), PID namespace (can't see other processes), per-sandbox /tmp, rlimits. Verified on Cloud Run – cold exec <1 ms. Premium tier: Firecracker microVM for full memory isolation. |
All exec subprocesses run inside nsjail sandboxes with mount namespaces (only the user workspace visible), PID namespaces (can't see other processes), per-sandbox /tmp, and rlimits – verified end-to-end on Cloud Run gen2 (gVisor). Cloud Run's gVisor sandbox provides additional syscall filtering below the container. Cold exec latency: <1 ms.
The remaining architectural difference with OpenClaw: they provide full VM-level memory isolation (separate address space per user), while our goroutine pool shares a single Go process. Go's memory safety guarantees (no pointer arithmetic, bounds checking, GC) prevent cross-goroutine memory reads through normal code; a Go runtime CVE would be required to break this boundary. For users who need hardware-level isolation, we offer a Firecracker premium tier.

Where we are materially stronger: every other attack category. OpenClaw has zero policy-layer defenses – no owner verification, no PII filtering, no content trust, no loop detection, no resource limits, no skill validation.
Important caveat on soft mitigations: M5 (content trust), M9 (claim source tracking), and M10 (proportionality) are prompt-level defenses. They measurably improve safety in testing but are not security boundaries – see the classification table.
The 10 mitigations (M1–M10)
Each mitigation is a pure-Go policy function applied in the tool execution pipeline. Policies are composable, zero-allocation on the hot path, and add microsecond-scale latency relative to LLM call latency (500 ms – 3 s).
M1: Owner-Only Policy
Paper reference: Case Study #2 (Compliance with Non-Owner), Case Study #8 (Identity Spoofing)
Every tool call carries a requester identity in the Go context. OwnerOnlyPolicy extracts
this identity and compares it to the workspace owner. Non-owner requests are rejected before
the tool executes. This is enforced at the infrastructure level – the LLM cannot override it,
and prompt injection cannot bypass it.
// Simplified – actual implementation in tools/policy.go
func OwnerOnlyPolicy() Policy {
    return Policy{
        Name: "owner-only",
        Pre: func(ctx context.Context, tool string, args map[string]any) error {
            requester := RequesterFromContext(ctx)
            owner := OwnerFromContext(ctx)
            if requester != "" && owner != "" && requester != owner {
                return fmt.Errorf("only the workspace owner can use this tool")
            }
            return nil
        },
    }
}
M2: DNS Rebinding Protection
Attack: Attacker controls a domain that resolves to 169.254.169.254 (cloud metadata)
or 127.0.0.1 (local services). Agent fetches the URL, believing it's an external site.
Defense: Post-DNS-resolution IP validation. After the DNS lookup but before the HTTP
connection, we check if the resolved IP is in a private/loopback/link-local range. If so,
the request is blocked. Combined with NoPrivateNetworkAccess (blocks curl 169.254.*
in exec) and the safe proxy (transparent HTTP proxy that checks all subprocess traffic).
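A minimal sketch of the post-resolution check in Go, using `net.Dialer`'s `Control` hook, which runs after DNS resolution and before the socket connects. The `guardAddress` helper and its wiring are illustrative assumptions, not the production code:

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// guardAddress rejects dials whose resolved IP is private, loopback,
// link-local, or unspecified. Illustrative sketch, not the production API.
func guardAddress(network, address string, _ syscall.RawConn) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return err
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return fmt.Errorf("non-IP address %q after resolution", host)
	}
	if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
		return fmt.Errorf("blocked private/loopback address %s", ip)
	}
	return nil
}

func main() {
	// Control runs per connection attempt, after DNS resolution and before
	// connect(2), so a rebinding domain cannot smuggle in a metadata IP.
	dialer := &net.Dialer{Control: guardAddress}
	_ = dialer
	fmt.Println(guardAddress("tcp", "169.254.169.254:80", nil) != nil) // true (blocked)
}
```

Because the check sees the address the socket will actually connect to, a domain that re-resolves to 169.254.169.254 between "check" and "use" is still caught.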
M3: Sensitive Path Blocking + Credential Scoping
Paper reference: Case Study #3 (Disclosure of Sensitive Information)
Compiled regex patterns match credential files: .env, .env.local,
id_rsa, id_ed25519, *.pem, .aws/credentials,
.ssh/config, .netrc, .pgpass, and more.
NoSensitivePathAccess is applied on the filesystem tool, forcing credential
access through the dedicated credential tool.
Credential scoping (CON-1885): The credential tool is further scoped by
OnlyOwnedCredentials – it only allows access to services the user has actually
connected (e.g., Gmail, Google Calendar). If you haven't connected GitHub, the agent can't
retrieve GitHub credentials even via the credential tool. The connected services list is
looked up from the connection store at request time.
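For illustration, a matcher over the file classes listed above can be built with Go's regexp package. The pattern set below is a reduced sketch, an assumption for this page, not the production list:

```go
package main

import (
	"fmt"
	"regexp"
)

// sensitivePath matches a subset of the credential-file classes described
// above. Illustrative only; the production pattern set is larger.
var sensitivePath = regexp.MustCompile(
	`(^|/)(\.env(\..+)?|id_rsa|id_ed25519|\.netrc|\.pgpass)$` +
		`|\.pem$` +
		`|(^|/)\.aws/credentials$` +
		`|(^|/)\.ssh/config$`)

func main() {
	for _, p := range []string{"project/.env", "keys/server.pem", "notes/todo.md"} {
		fmt.Println(p, sensitivePath.MatchString(p)) // true, true, false
	}
}
```

Compiling the pattern once at package init keeps the per-call cost to a single regex match on the already-resolved path.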
M4: Credential/Secret Redaction
Response-path filter for credential and secret patterns. Controllable via dynamic config
(pii_filter_enabled). Regex patterns detect and redact:
- SSNs: XXX-XX-XXXX patterns → [SSN REDACTED]
- Credit cards: Visa, Mastercard, Amex formats → [CARD NUMBER REDACTED]
- API keys: common prefixes (sk-ant-, sk-, AIza, ghp_, gho_, glpat-) → [API KEY REDACTED]
- AWS keys: AKIA/ASIA prefixes → [AWS KEY REDACTED]
- JWTs: eyJ... three-part dot-separated tokens → [BEARER TOKEN REDACTED]
- Private keys: PEM blocks (BEGIN RSA PRIVATE KEY, etc.) → [PRIVATE KEY REDACTED]
This filter intentionally targets credentials and financial secrets, not general
personal information like email addresses or phone numbers. Continua agents need to work
with contact information for legitimate use cases: coordinating schedules between agents,
sharing meeting details in group chats, finding a place to eat with friends. Blocking emails
and phone numbers would break core product functionality. The right boundary is: redact things
that should never appear in conversation (API keys, SSNs, private keys), and use
OwnerOnlyPolicy + OnlyOwnedCredentials to control who can
access sensitive data.
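A reduced sketch of such a response-path filter, assuming a `redact` helper and a handful of the patterns listed above (the production rule set is more extensive):

```go
package main

import (
	"fmt"
	"regexp"
)

// redactions pairs each detector regex with its replacement marker.
// Illustrative subset of the patterns described above.
var redactions = []struct {
	re          *regexp.Regexp
	replacement string
}{
	{regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`), "[SSN REDACTED]"},
	{regexp.MustCompile(`\b(AKIA|ASIA)[A-Z0-9]{16}\b`), "[AWS KEY REDACTED]"},
	{regexp.MustCompile(`\beyJ[\w-]+\.[\w-]+\.[\w-]+\b`), "[BEARER TOKEN REDACTED]"},
}

// redact applies every pattern to the outgoing text.
func redact(s string) string {
	for _, r := range redactions {
		s = r.re.ReplaceAllString(s, r.replacement)
	}
	return s
}

func main() {
	fmt.Println(redact("ssn 123-45-6789 key AKIAABCDEFGHIJKLMNOP"))
	// ssn [SSN REDACTED] key [AWS KEY REDACTED]
}
```

Note how the structured-format limitation falls directly out of this design: "my SSN is three hundred..." matches none of the regexes.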
M5: Content Trust Tagging
Paper reference: Case Study #10 (Agent Corruption via external content)
All web_fetch results are wrapped with provenance tags. Content from user-generated
content platforms (Reddit, Hacker News, Pastebin, Stack Overflow, GitHub Gists, etc.)
receives an additional UGC-platform warning. This gives the LLM a signal to treat
fetched content as potentially adversarial – reducing the effectiveness of indirect prompt injection.
M6: Skill Write Validation
Paper reference: Case Study #10 (Agent Corruption – attacker modified agent configuration)
When the filesystem tool writes a SKILL.md file, validation runs in two layers:
deterministic scanning for known-dangerous patterns (shell injection, credential harvesting,
exfiltration URLs, prompt-injection markers, disallowed tool usage) and a semantic intent
classifier for ambiguous poisoning attempts that evade regexes. The semantic path returns
allow|deny|escalate; deny/escalate block the write and require explicit owner review.
Classifier verdicts are cached and micro-batched to keep added latency and model cost bounded.
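The verdict normalization can be sketched as follows. The `normalizeVerdict` helper and verdict constants are illustrative assumptions; the key property is failing closed, so unrecognized classifier output escalates rather than allows:

```go
package main

import (
	"fmt"
	"strings"
)

// Verdict is the normalized outcome of a semantic safety check.
type Verdict string

const (
	Allow    Verdict = "allow"
	Deny     Verdict = "deny"
	Escalate Verdict = "escalate"
)

// normalizeVerdict maps a classifier's raw text output onto the three
// normalized outcomes, escalating anything it does not recognize.
func normalizeVerdict(raw string) Verdict {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "allow":
		return Allow
	case "deny":
		return Deny
	default:
		// Unknown or ambiguous output escalates to owner review rather
		// than silently allowing the write.
		return Escalate
	}
}

func main() {
	fmt.Println(normalizeVerdict(" ALLOW "))   // allow
	fmt.Println(normalizeVerdict("not sure?")) // escalate
}
```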
M7: Loop Detection
Paper reference: Case Study #4 (Waste of Resources – 9+ hour loop)
Per-thread sliding window rate limiter. If the same tool is called with the same arguments more than 5 times within 10 minutes, the call is blocked. Hard cap of 20 total repetitions for any tool+args pair per thread. This catches both conversational loops (agent calling the same API repeatedly) and automation loops (cron jobs, background processes).
M8: Automation Cap
Paper reference: Case Study #5 (Denial of Service)
Per-user daily limit of 10 automation scripts (cron jobs, scheduled tasks, background processes).
Plus a script audit: the ScriptAudit() function scans exec commands for script
creation patterns and counts them against the user's daily budget.
M9: Claim Source Tracking
Paper reference: Case Study #9 (Cross-agent knowledge sharing), Case Study #11 (Libelous claims)
Every memory claim is tagged with its source and a trust level: TrustOwner (from the
authenticated owner) or TrustNonOwner (from agents, emails, web content, or
non-owner humans). When the agent uses claims for reasoning, it can weight owner claims
more heavily and treat non-owner claims with appropriate skepticism.
M10: Proportionality Prompt SOFT
Paper reference: Case Study #1 (Disproportionate Response – agent disabled its own email)
A system prompt instruction directing the agent to prefer minimal, reversible actions. Key rules: never disable your own capabilities, never delete configuration files, prefer the least-destructive path, and escalate to the owner when uncertain.
This is a system prompt instruction. Prompt injection is an unsolved problem – if an attacker (or confused agent) is sufficiently motivated, the prompt cannot guarantee compliance. We include it because it measurably reduces disproportionate actions in our testing (the agent almost always follows these rules), but we do not claim it prevents a determined attack. The hard mitigations (M3 sensitive path blocking, M7 loop detection, M8 automation cap) provide the actual boundaries.
Production observability & alerting
CON-1883: All 10 safety policies were invisible in production – no metrics, no alerting, no way to know if PII redaction was firing or loops were being detected. Now every safety event emits a structured JSON log line that Cloud Logging ingests and Cloud Monitoring turns into real-time metrics and alerts.
Structured safety events
Every policy block, PII redaction, loop detection, automation cap hit, sandbox fallback,
and semantic-classifier batch/cache event emits a safety_event JSON entry:
{"safety_event": {
"kind": "policy_block",
"policy": "no-destructive-commands",
"tool": "exec",
"user_id": "user-abc",
"detail": "command contains blocked pattern \"rm -rf /\"",
"timestamp": "2026-02-24T20:57:24Z"
}}
Event kinds:
| Kind | When | Key fields |
|---|---|---|
| policy_block | Any safety policy Pre/Post returns an error | policy, tool, detail |
| pii_redaction | PII pattern matched and redacted in agent output | pattern (e.g., [SSN REDACTED]) |
| loop_detected | Agent-to-agent loop threshold or rate limit exceeded | thread_id, detail |
| automation_cap_hit | User hit their automation cap | detail (e.g., 10/10 automations) |
| rate_limit_hit | Per-user per-tool rate limit exceeded | tool |
| sandbox_fallback | Exec fell back to cold nsjail path (no warm namespace) | (none) |
| workspace_limit_hit | Workspace size or file count limit reached | detail |
| semantic_classifier_batch | One semantic safety classifier batch completed | model, batch_size, latency_ms, input_tokens, output_tokens, estimated_cost_usd |
| semantic_classifier_cache_hit | Policy/content cache hit for semantic safety classification | cache_type (policy or content) |
Cloud Monitoring alerts
Core safety metrics feed three alert policies:
| Alert | Threshold | Purpose |
|---|---|---|
| Policy Block Spike | >50 blocks in 5 min | Active attack or false-positive surge needing dynamic config tuning |
| PII Redaction Spike | >20 redactions in 5 min | Agents reading sensitive files, or regex false positives |
| Loop Detection Events | >10 in 5 min | Runaway agent-to-agent loops |
Dashboard
Safety dashboard widgets now include both enforcement and semantic-classifier economics:
- Safety: Policy Blocks – time series by policy name (which policies are firing most)
- Safety: PII Redactions – time series by pattern (SSN, credit card, API key, etc.)
- Safety: Loop Detections + Automation Caps – combined time series
- Safety: Semantic Classifier Cache Hit Rate – policy/content cache effectiveness
- Safety: Semantic Classifier Latency + Cost – p50/p95 added latency and estimated USD burn
Live debugging endpoint
GET /admin/safety-metrics returns a real-time JSON snapshot of all safety counters:
{
"total_blocks": 42,
"pii_redactions": 7,
"loop_detections": 2,
"rate_limit_hits": 15,
"automation_cap_hits": 1,
"sandbox_fallbacks": 0,
"workspace_limit_hits": 3,
"semantic_classifier_calls": 19,
"semantic_classifier_classified_items": 44,
"semantic_classifier_cache_hits": 61,
"semantic_classifier_policy_cache_hits": 35,
"semantic_classifier_content_cache_hits": 26,
"semantic_classifier_cache_hit_rate": 0.58,
"semantic_classifier_input_tokens": 9821,
"semantic_classifier_output_tokens": 1432,
"semantic_classifier_estimated_cost_usd": 0.0127,
"semantic_classifier_cost_by_user_usd": {
"alice@example.com": 0.0068,
"bob@example.com": 0.0059
},
"semantic_classifier_added_latency_p50_ms": 37,
"semantic_classifier_added_latency_p95_ms": 84,
"policy_blocks": {
"no-destructive-commands": 12,
"no-network-attacks": 8,
"no-sensitive-path-access": 15,
"no-phishing-in-messages": 5,
"deno-sandbox": 2
}
}
InstrumentedGuarded remains identical to Guarded on the happy path for deterministic
policies. Semantic safety checks only run for ambiguous/high-risk outbound messages and are
amortized with verdict/content caches plus micro-batching. Most benign tool calls still incur
only local policy checks + counters.
Dynamic config for safety tuning
CON-1883: All safety thresholds were hardcoded – if PII redaction had false positives or loop detection was too aggressive, the only fix was a code change + redeploy. Safety behavior is now tuneable at runtime via dynamic config (a GCS blob, polled every 30 seconds).
| Knob | Config key | Default | Use case |
|---|---|---|---|
| Loop threshold | loop_detector_threshold | 20 | Raise if agents legitimately need long back-and-forth threads |
| Loop rate limit | loop_detector_rate_limit | 5 per window | Tighten if loops are consuming resources before threshold |
| Loop rate window | loop_detector_rate_window_seconds | 600 (10 min) | Widen/narrow the sliding window |
| Automation cap | automation_max_per_user | 10 | Raise for power users, lower under load |
| PII filter toggle | pii_filter_enabled | true | Emergency kill switch if false positives are too high |
| Classifier model | classifier_model | main model | Route semantic safety checks to a cheaper/faster model (e.g., Flash Lite) |
| Workspace max size | workspace_max_size_bytes | tier-based | Override for specific deployments |
| Workspace max files | workspace_max_files | tier-based | Override for specific deployments |
| Sandbox pool min | sandbox_pool_min_size | 5 | Pre-warm more namespaces during peak hours |
| Sandbox pool max | sandbox_pool_max_size | 200 | Cap memory usage from warm namespaces |
| Sandbox idle timeout | sandbox_session_idle_seconds | 600 (10 min) | Recycle faster under memory pressure |
| Sandbox toggle | sandbox_enabled | true | Emergency fallback to unsandboxed exec |
All knobs use zero-value = code default semantics. Setting a knob to 0 (or omitting it)
uses the hardcoded default. This means an empty config file produces the same behavior as before
the feature was added – no migration needed.
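The zero-value semantics above can be sketched in a few lines. The `Config` fields and `orDefault` helper are illustrative assumptions, not the production types:

```go
package main

import "fmt"

// Config decodes the dynamic-config blob; an absent key decodes to the
// Go zero value. Field names here are illustrative.
type Config struct {
	LoopDetectorThreshold int `json:"loop_detector_threshold"`
	AutomationMaxPerUser  int `json:"automation_max_per_user"`
}

// orDefault returns the configured value unless it is the zero value,
// in which case the hardcoded default applies.
func orDefault(configured, def int) int {
	if configured == 0 {
		return def
	}
	return configured
}

func main() {
	var empty Config // an empty config file decodes to all zero values
	fmt.Println(orDefault(empty.LoopDetectorThreshold, 20)) // 20
	tuned := Config{LoopDetectorThreshold: 40}
	fmt.Println(orDefault(tuned.LoopDetectorThreshold, 20)) // 40
}
```

This is why an empty config file is behavior-preserving: every resolved knob falls through to its code default.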
pii_filter_enabled and sandbox_enabled are emergency kill switches.
Disabling them reduces security posture and should only be done to mitigate a confirmed
false-positive surge or sandbox instability, with a plan to re-enable promptly.
User isolation architecture
Our runtime uses a goroutine-per-user architecture within a shared Go process on Cloud Run. This is a deliberate trade-off: cheaper per-user cost (~$0 marginal cost vs. $20+/VM for OpenClaw) at the expense of weaker OS-level isolation.
Defense-in-depth layers
- Workspace jail: resolve() rejects absolute paths and traversals.
- Exec policies: exec blocks commands referencing sibling workspaces.
- Quotas: workspace size quotas.
- nsjail: per-sandbox /tmp (10 MB tmpfs), rlimits (512 MB / 30s CPU / 50 MB file / 64 procs).
- gVisor: Cloud Run's gVisor provides additional syscall filtering below the container.

Verified on Cloud Run gen2: cold exec <1 ms.

What we verified (and what we didn't)
We ran a 14-test isolation probe suite that confirms:
- Clean subprocess environment – no parent env leakage
- Per-user rate limiting – users cannot exhaust each other's budgets
- Per-thread loop detection – loops in one thread don't affect others
- Concurrent filesystem safety – no races under concurrent access
- Workspace jail – absolute paths and traversals blocked
- Exec sibling blocking – commands referencing other workspaces blocked
Gaps closed by nsjail (February 2026):
- System file reads blocked – mount namespace hides files outside the user workspace
- PID enumeration blocked – PID namespace shows only the subprocess
- /tmp isolated – per-sandbox tmpfs (10 MB)
- Dangerous syscalls blocked – gVisor syscall filtering (Cloud Run gen2)
- Sibling workspace access blocked – mount namespace + application-layer blocking
Shared-process memory isolation (what's actually true):
- Go runtime memory safety: Go has no pointer arithmetic, enforces bounds checking on all slice/array access, and uses a garbage collector. One goroutine cannot read another goroutine's heap memory through normal Go code. This is materially different from C/C++ shared-process architectures, where buffer overflows or use-after-free bugs can cross user boundaries.
- /proc and /dev not mounted in nsjail: the nsjail config only bind-mounts /usr, /bin, /lib, /etc/ssl, and a few config files. /proc and /dev are not mounted, so exec'd subprocesses cannot access /dev/mem (physical memory), /dev/kmem (kernel memory), /proc/self/mem (process memory), /proc/<pid>/maps (memory maps), or /proc/<pid>/environ (environment variables).
- NoMemoryAccess policy (defense-in-depth): even outside nsjail (the fallback path), the exec policy layer blocks commands referencing /dev/mem, /proc/*/mem, /proc/*/environ, and kernel internals (/proc/kallsyms, /proc/kcore).
Residual risk (documented honestly):
- Go runtime CVE: a memory corruption bug in the Go runtime itself (or in a C library called via cgo) could theoretically allow cross-goroutine memory access. This is low-probability – Go's runtime is extensively tested and fuzzed – but it is the residual risk in our shared-process architecture. For users who need hardware-level memory isolation (separate address space per user), we offer the Firecracker premium tier.
Storage exhaustion defense
Paper reference: Case Study #5 (Denial of Service – disk filled with output)
In the "Agents of Chaos" study, attackers filled the 20 GB volume by having the agent generate large outputs or write many files. Our runtime enforces storage limits at three levels:
| Resource | Free | Pro | Premium | Enforcement |
|---|---|---|---|---|
| Workspace total | 100 MB | 500 MB | 2 GB | WorkspaceSizePolicy pre-write check |
| File count | 5,000 | 20,000 | 100,000 | Same policy, directory walk with cache |
| Single file | 10 MB | 50 MB | 100 MB | Content-length check before write |
| Exec output | 100 KB (all tiers) | | | ExecOutputSizePolicy post-exec truncation |
| Exec subprocess memory | 512 MB (all tiers) | | | RLIMIT_AS (Linux) |
| Exec CPU time | 30 seconds (all tiers) | | | RLIMIT_CPU (Linux) |
| Exec file write | 50 MB (all tiers) | | | RLIMIT_FSIZE (Linux) |
| Exec processes | 64 (all tiers) | | | RLIMIT_NPROC (Linux) |
| Action log | 5 MB (all tiers) | | | Half-file rotation on overflow |
| Audit log (memory) | 50,000 entries (all tiers) | | | FIFO ring buffer |
| Daily agent runs | 50 | 500 | 5,000 | Quota enforcer |
Workspace usage is scanned lazily (at write time) and cached for 30 seconds. The cache is invalidated after every successful write. This keeps the hot-path overhead to 31 ns (cache hit) while still catching workspace overflow.
Performance benchmarks
All mitigations run in the tool-call hot path. We benchmark each one to ensure they add
negligible latency relative to LLM inference (500 ms – 3 s). Measured on Apple M4 Max
with -benchmem and -count=3.
Sandbox manager overhead
Total safety overhead per tool call: <25 µs (policies) + <1 ms (nsjail sandbox). Against a typical LLM latency of 1,000,000 µs (1 second), this is <0.1% overhead. Zero-allocation on the policy hot path. Sandbox acquire is 113 ns on hot reuse (0 allocations). Safety is not a tax on performance.
Adversarial test suite
We maintain a 41-test adversarial suite covering 19 attack categories – the original 33 tests plus 8 added under CON-1913 – designed to mirror the attacks from the "Agents of Chaos" paper plus additional vectors specific to our architecture. All tests run with Go's -race detector enabled.
| Category | Tests | Attacks Attempted | Result |
|---|---|---|---|
| Path traversal | 3 | Absolute paths, ../ escape, symlink attacks | BLOCKED |
| Command injection | 3 | rm -rf /, fork bombs, :(){ :\|:& };: | BLOCKED |
| Network attacks | 3 | Reverse shells, port scans, crypto miners | BLOCKED |
| SSRF / DNS rebinding | 2 | Metadata server, private IPs, DNS rebinding | BLOCKED |
| Credential theft | 2 | Read .env, id_rsa, .aws/credentials | BLOCKED |
| PII exfiltration | 3 | SSNs, credit cards, API keys, JWTs in output | REDACTED |
| Prompt injection via web | 2 | Malicious instructions in fetched HTML | TAGGED |
| Skill poisoning | 2 | Shell injection in SKILL.md, exfiltration URLs | BLOCKED |
| Non-owner impersonation | 2 | Tool calls with non-owner requester identity | BLOCKED |
| Resource loops | 2 | Infinite tool call loops, cron job flooding | CAPPED |
| Rate limit bypass | 1 | Burst tool calls exceeding limits | THROTTLED |
| Identity spoofing | 1 | Claim injection with fake trust levels | TRACKED |
| Cross-user FS access | 3 | Read/write/list sibling workspaces | BLOCKED |
| Cross-user exec access | 2 | cat other user's files, enumerate users | BLOCKED |
| Storage exhaustion | 2 | Write oversized files, exhaust file count | CAPPED |
| Exec sandbox escape | 2 | Read /etc/passwd, enumerate PIDs | BLOCKED (nsjail) |
| Memory access (CON-1913) | 2 | /dev/mem, /proc/self/mem, /proc/*/environ, /proc/kallsyms, /proc/kcore | BLOCKED |
| Builtins write (CON-1913) | 2 | Write to /builtins/IDENTITY.md, copy-then-modify evasion | BLOCKED |
| Skill poisoning (CON-1913) | 4 | Malicious skill publish (injection, exfil, shell pipe, env harvest), publish rate limit | BLOCKED |
Result: 41 of 41 attacks fully mitigated. The original 33 tests plus 8 new tests for memory access paths, builtin write protection, and skill poisoning (CON-1913).
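The suite is table-driven: each category lists concrete attack strings and asserts the policy verdict. Here is a hedged sketch against a hypothetical workspace-confinement check (`allowPath` is illustrative, not the production policy chain; real enforcement of the symlink cases additionally requires resolving links, e.g. via `filepath.EvalSymlinks`, before the check):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// allowPath is a hypothetical workspace-confinement check: a requested
// path is allowed only if, after cleaning, it stays inside the user's
// workspace root. Absolute paths and ../ escapes both fail the check.
func allowPath(workspace, requested string) bool {
	if filepath.IsAbs(requested) {
		return false
	}
	clean := filepath.Clean(filepath.Join(workspace, requested))
	return clean == workspace ||
		strings.HasPrefix(clean, workspace+string(filepath.Separator))
}

func main() {
	// Attack strings mirroring the path-traversal rows of the table.
	attacks := []string{
		"/etc/passwd",         // absolute path
		"../../etc/passwd",    // ../ escape
		"notes/../../secrets", // nested escape hidden behind a valid prefix
	}
	for _, a := range attacks {
		fmt.Printf("%-22s blocked=%v\n", a, !allowPath("/workspace/alice", a))
	}
}
```

`filepath.Clean` collapses `..` segments before the prefix test, so an escape hidden behind a legitimate-looking prefix still resolves outside the workspace and is rejected.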
🌐 Cloud Run safety e2e tests
CON-1883: The adversarial tests above run in-process. They verify policy logic but not the full deployed stack: iptables REDIRECT → safe proxy → nsjail namespace → policies → PII response filter. We now have 9 deployed-instance integration tests that run against a live Cloud Run instance and verify end-to-end enforcement.
```sh
# Run against staging:
go test -v -tags=e2e ./personal_ai_runtime/e2e/ \
  -run TestSafety \
  -staging-url=https://gravity-staging-329044329780.us-central1.run.app \
  -timeout=5m
```
| Test | What it verifies | Stack layers exercised |
|---|---|---|
| `TestSafety_MetadataServerBlocked` | GCP metadata server (169.254.169.254) unreachable from exec | Policy regex + iptables DROP + nsjail network |
| `TestSafety_PIIRedactionOnResponse` | SSNs and API keys redacted in agent response | PII response filter (post-exec) |
| `TestSafety_DestructiveCommandBlocked` | `rm -rf /` blocked before execution | NoDestructiveCommands policy |
| `TestSafety_PrivateNetworkBlocked` | RFC 1918 private IPs blocked from exec | NoPrivateNetworkAccess policy |
| `TestSafety_SafeProxyURLFiltering` | Known-malicious URLs blocked by safe proxy | Safe proxy + Google Web Risk API |
| `TestSafety_SensitivePathBlocked` | Credential files (`credentials/`) inaccessible | NoSensitivePathAccess policy |
| `TestSafety_MetricsEndpoint` | `/admin/safety-metrics` returns valid JSON | Safety metrics collector + HTTP endpoint |
| `TestSafety_NsjailNamespaceIsolation` | No host processes visible from `ps aux` | nsjail PID namespace (CLONE_NEWPID) |
| `TestSafety_PhishingMessageBlocked` | Phishing patterns in outbound messages blocked | NoPhishingInMessages policy |
These tests send real HTTP requests to a deployed Cloud Run instance, ask the agent to perform dangerous actions, and verify that the responses show proper blocking. They exercise the complete request path: load balancer → Cloud Run → Go HTTP handler → agent pool → LLM → tool call → policy chain → nsjail sandbox → response filter → HTTP response. This is the layer of testing that catches "works in unit tests, broken in production" gaps.
✅ Cloud Run verification (February 24, 2026)
We built a Docker image with nsjail, pushed it to Google Artifact Registry, and ran end-to-end tests on Cloud Run gen2 (which runs containers inside gVisor). These are the actual results from production infrastructure, not local dev or simulated environments.
What works on Cloud Run gVisor
| Feature | Clone flag | Status | Verified behavior |
|---|---|---|---|
| Mount namespace | `CLONE_NEWNS` | WORKS | User workspace visible; `/etc/passwd`, `/home`, `/var` all hidden. Host filesystem invisible to subprocess. |
| PID namespace | `CLONE_NEWPID` | WORKS | Process sees `NSpid: 21 2` → PID 2 in its own namespace. Cannot see or signal other processes. |
| Per-sandbox tmpfs | — | WORKS | 10 MB tmpfs mounted at `/tmp`. Isolated from host `/tmp` and other sandboxes. |
| rlimits | — | WORKS | Memory (512 MB), CPU (30 s), file size (50 MB), nproc (64), nofile (256) all enforced. |
| Workspace bind mount | — | WORKS | User workspace read-write; can read own files, write output. Cannot access other users' directories. |
| Cross-user write blocking | — | WORKS | Writing to another user's workspace fails as a nonexistent directory; the mount namespace hides it entirely. |
What gVisor blocks (and why that's fine)
| Feature | Clone flag | Status | Why it doesn't matter |
|---|---|---|---|
| User namespace | `CLONE_NEWUSER` | BLOCKED | gVisor itself provides equivalent isolation at the hypervisor level. User mapping is not needed when running as a single UID. |
| chroot / pivot_root | — | BLOCKED | Not needed: the mount namespace provides filesystem isolation without chroot, bind-mounting only what the subprocess needs. |
| seccomp-bpf | — | REDUNDANT | gVisor intercepts all syscalls at the hypervisor level. Adding seccomp inside gVisor would be defense in depth with no practical benefit and added complexity. |
Measured latency
gVisor's clone() implementation is extremely fast: creating a mount+PID namespace
takes less than 1 millisecond. This means the warm pool's primary value shifts from
latency reduction to lifecycle management (automatic cleanup, adaptive sizing, quiescence-driven
recycling). The security benefit is real and the performance cost is effectively zero.
🗺️ Completed & remaining
| Gap | Status | How |
|---|---|---|
| Safety policy observability | CLOSED | Structured safety_event logging, Cloud Monitoring alerts, dashboard widgets, and /admin/safety-metrics endpoint. Includes semantic-classifier cache/cost/latency telemetry for hybrid safety checks (CON-1883 + CON-2116). |
| Runtime tunability | CLOSED | Dynamic config knobs for loop detector, automation cap, PII filter, semantic classifier model, workspace limits, and sandbox pool tuning (CON-1883 + CON-2116) |
| Full-stack e2e verification | CLOSED | Deployed-instance integration tests verify iptables → safe proxy → nsjail → policies → response filters on Cloud Run, including metrics endpoint schema checks and phishing policy behavior (CON-1883 + CON-2116). |
| System file reads (`/etc/passwd`) | CLOSED | nsjail mount namespace: only user workspace visible |
| PID enumeration (`ps aux`) | CLOSED | nsjail PID namespace |
| Shared `/tmp` | CLOSED | Per-sandbox tmpfs (10 MB) |
| Dangerous syscalls | CLOSED | gVisor hypervisor-level syscall filtering (Cloud Run gen2) |
| Sibling workspace bypass | CLOSED | nsjail mount namespace + application-layer blocking |
| Network egress isolation | CLOSED | Default-deny iptables egress for appuser (CON-1884). Only loopback (safe proxy), established connections, and Cloud SQL IPs whitelisted. DNS (port 53) blocked to prevent exfiltration. HTTP/HTTPS redirected through safe proxy + Web Risk. |
| Credential scoping | CLOSED | OnlyOwnedCredentials policy scopes the credential tool to services the user has connected. NoSensitivePathAccess wired on the filesystem tool to block direct credential file reads (CON-1885). |
| Memory access from exec | CLOSED | nsjail mount namespace excludes /proc and /dev. NoMemoryAccess policy blocks /dev/mem, /proc/*/mem, /proc/*/environ in exec commands (defense-in-depth). Adversarial tests verify both layers (CON-1913). |
| Skill publish validation | CLOSED | Security scanner + rate limiting (20/day) at SharedSkillStore.Publish() and SkillRegistry.Publish() time. Content validated at install time too (defense-in-depth). Skill trust tiers and provenance tracking (CON-1913). |
| Shared process memory | MITIGATED + PREMIUM | Standard tier: Go memory safety prevents cross-goroutine reads through normal code. Residual risk: Go runtime CVE (low-probability). Premium tier: Firecracker microVM provides hardware-level address space isolation. |
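As a defense-in-depth illustration of the exec-side memory-access check, here is a hedged sketch of a NoMemoryAccess-style pattern match. The regex and names are assumptions for illustration, not the production policy; the primary layer remains the mount namespace, which excludes `/proc` and `/dev` entirely:

```go
package main

import (
	"fmt"
	"regexp"
)

// memAccessPattern approximates a NoMemoryAccess-style check: it flags
// exec commands that reference raw-memory or process-introspection
// paths. Illustrative only; the real policy may differ.
var memAccessPattern = regexp.MustCompile(
	`(^|[\s"'=:])(/dev/k?mem|/proc/(self|\d+|\*)/(mem|environ)|/proc/k(allsyms|core))`)

func blockedByMemPolicy(cmd string) bool {
	return memAccessPattern.MatchString(cmd)
}

func main() {
	for _, cmd := range []string{
		"cat /proc/self/environ",
		"dd if=/dev/mem bs=1k",
		"head /proc/kcore",
		"echo hello",
	} {
		fmt.Printf("%-24s blocked=%v\n", cmd, blockedByMemPolicy(cmd))
	}
}
```

The leading `(^|[\s"'=:])` group catches paths embedded in arguments like `if=/dev/mem`, not just space-separated ones.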
If you find a security issue, please email security@continua.ai. We follow responsible disclosure and will acknowledge your report within 48 hours.