Safety in Continua

Continua agents have persistent memory, tool execution, file access, and network capabilities. That power demands defense in depth. This page describes the specific attacks we mitigate, how our architecture compares to the OpenClaw framework studied in the "Agents of Chaos" red-team paper, and the technical details of each mitigation.

Last updated: February 25, 2026 · Based on internal red-team analysis (CON-1877, CON-1883, CON-1885, CON-1913) · nsjail verified on Cloud Run gen2 (gVisor) · Production observability + dynamic config live · Credential scoping shipped · Mitigation strength classifications added

Contents

  1. Threat model & the "Agents of Chaos" paper
  2. Mitigation strength classification
  3. Side-by-side: Continua vs. OpenClaw
  4. The 10 mitigations (M1–M10)
  5. Production observability & alerting
  6. Dynamic config for safety tuning
  7. User isolation architecture
  8. Storage exhaustion defense
  9. Performance benchmarks
  10. Adversarial test suite
  11. Cloud Run safety e2e tests
  12. Cloud Run verification
  13. Premium isolation: Firecracker microVM
  14. Completed & remaining

🎯 Threat model & the "Agents of Chaos" paper

In February 2026, researchers from Northeastern, Harvard, MIT, CMU, Stanford, and others published Agents of Chaos (Shapira et al., 2026) — a red-team study of autonomous LLM agents deployed on OpenClaw, an open-source personal AI framework. Over two weeks, twenty AI researchers interacted with the agents under adversarial conditions.

They documented 11 case studies spanning:

| # | Case Study | Attack Category | Outcome in OpenClaw |
| --- | --- | --- | --- |
| 1 | Disproportionate Response | Overreaction to minor prompts | Agent disabled its own email client — service self-destruction |
| 2 | Compliance with Non-Owner | Authority confusion | Agent complied with a stranger's instructions, modified files |
| 3 | Disclosure of Sensitive Info | Data exfiltration | Agent revealed owner's secrets to a non-owner on Discord |
| 4 | Waste of Resources (Looping) | Denial of service | Agent entered a 9+ hour conversational loop; infinite cron jobs |
| 5 | Denial of Service (DoS) | Resource exhaustion | Disk filled with output; shell processes multiplied |
| 6 | Agents Reflect Provider Values | Value misalignment | Agent imposed model-provider ethics on owner requests |
| 7 | Agent Harm | Self-inflicted damage | Agent deleted its own configuration files trying to "help" |
| 8 | Owner Identity Spoofing | Impersonation | Non-owner impersonated the owner; agent complied fully |
| 9 | Agent Collaboration / Knowledge Sharing | Cross-agent propagation | Unsafe practices propagated between agents via Discord |
| 10 | Agent Corruption | Persistent compromise | Attacker modified the agent's AGENTS.md, changing its behavior permanently |
| 11 | Libelous within Community | Reputation attack | Agent spread false claims about other agents' owners |
Why this matters

OpenClaw's architecture is representative of how most open-source personal AI frameworks work: agents have unrestricted shell access, can modify their own configuration, have no owner-identity verification, and no policy layer between the LLM's output and tool execution. Our runtime was designed to systematically address each of these failure modes.

📊 Mitigation strength classification

Not all mitigations are created equal. We classify each one by how it can be bypassed, so you can reason about the actual security boundary.

| Level | Meaning | Bypass requires |
| --- | --- | --- |
| HARD | Infrastructure-enforced. Cannot be bypassed by prompt injection or LLM misbehavior. | Exploiting a bug in the Go runtime, nsjail, gVisor, or iptables — not the LLM. |
| FIRM | Code-enforced via pattern matching. Catches known-bad patterns but has known bypass vectors. | Creative encoding (Base64, Unicode, splitting across messages) or patterns not in the blocklist. |
| SOFT | Prompt-level / advisory. Relies on LLM compliance with system instructions. | Prompt injection — an unsolved problem in the field. Useful as defense-in-depth, not a security boundary. |
| Mitigation | Strength | Rationale |
| --- | --- | --- |
| M1: Owner-Only Policy | HARD | Go context check — the LLM cannot override requester identity |
| M2: DNS Rebinding | HARD | Post-resolution IP check in Go, before the HTTP connection |
| M3: Sensitive Path Blocking | HARD | Regex on the filesystem tool in Go code, pre-execution |
| M4: Credential/Secret Redaction | FIRM | Regex catches structured patterns (XXX-XX-XXXX, AKIA...) but not creative spelling, Base64 encoding, or secrets split across messages |
| M5: Content Trust Tagging | SOFT | Tags are advisory to the LLM — modern prompt injection can override tag semantics |
| M6: Skill Write Validation | FIRM | Deterministic scanner + semantic intent review on SKILL.md writes. Catches known-bad patterns immediately and denies/escalates ambiguous poisoning attempts with bounded-latency model checks. |
| M7: Loop Detection | HARD | Sliding-window rate limiter in Go — not bypassable via prompting |
| M8: Automation Cap | HARD | Counter in Go — not bypassable via prompting |
| M9: Claim Source Tracking | SOFT | Tags are applied in code (hard), but the LLM's use of trust levels is soft — prompt injection could convince it to ignore tags |
| M10: Proportionality Prompt | SOFT | System prompt instruction — a useful guardrail, not a security boundary |
| nsjail Sandbox | HARD | OS-level mount/PID namespace isolation, verified on Cloud Run gVisor |
| iptables Egress | HARD | Kernel-level network filtering — default-deny for appuser |
| NoMemoryAccess Policy | HARD | Blocks /dev/mem, /proc/*/mem, /proc/*/environ in exec commands (defense-in-depth with nsjail) |
| Skill Publish Validation | FIRM | Security scanner + rate limiting at publish time — catches known-bad patterns and limits blast radius |
Honest about soft mitigations

Soft mitigations (M5, M9, M10) rely on the LLM following instructions. Prompt injection is an unsolved problem in the field — no production system can guarantee that an LLM will never be tricked into ignoring system instructions. We include these mitigations because they measurably reduce attack success rates in testing, but we do not claim they are security boundaries. The security boundaries are the hard mitigations: owner identity in Go context, nsjail namespaces, iptables rules, and rate limiters.

βš–οΈ Side-by-side: Continua vs. OpenClaw

The "Agents of Chaos" study used OpenClaw with Claude Opus and Kimi K2.5. Here is a point-by-point comparison of how each attack surface is handled.

| Attack Surface | OpenClaw | Continua |
| --- | --- | --- |
| Owner identity | No verification. Any Discord/email user can instruct the agent. Identity spoofing is trivial. | M1 OwnerOnlyPolicy — cryptographic identity via context. Non-owner requests are rejected at the policy layer before reaching the LLM. |
| Sensitive file access | Agent has unrestricted filesystem access, including .env, SSH keys, API tokens. | M3 NoSensitivePathAccess — blocks reads/writes to credential files (.env, *.pem, id_rsa, etc.) via a regex policy on the filesystem tool. OnlyOwnedCredentials scopes the credential tool to services the user has actually connected (CON-1885). |
| DNS rebinding | No protection. Agent can be tricked into accessing internal services via DNS rebinding. | M2 — post-resolution IP check rejects private/loopback addresses after DNS resolution. Combined with the NoPrivateNetworkAccess policy. |
| Credentials in responses | No filtering. Agent freely returns SSNs, credit card numbers, API keys found in files. | M4 (FIRM) RedactPII credential/secret filter — regex patterns for SSNs, credit cards, API keys, JWTs, private keys. Known limitations: regex catches structured formats (XXX-XX-XXXX, AKIA...) but not creative spelling ("my SSN is three hundred"), Base64-encoded secrets, or secrets split across multiple messages. Available via dynamic config (pii_filter_enabled). Broader PII (emails, phone numbers) is intentionally not filtered — personal agents need to share contact information for coordination and collaboration use cases. |
| Web content trust | No provenance tracking. web_fetch output treated the same as trusted data. | M5 (SOFT) content trust tagging — all web_fetch output is tagged with [UNTRUSTED: web_fetch]. UGC platforms (Reddit, HN, Pastebin, etc.) get additional ⚠ UGC platform warnings. Known limitations: tags are advisory to the LLM; sophisticated indirect prompt injection in fetched content can override tag semantics. This is defense-in-depth, not a security boundary. |
| Skill/config corruption | Agent can modify its own AGENTS.md, SOUL.md, etc. Attackers injected persistent backdoors. | M6 (FIRM) SkillWriteValidationPolicy — validates SKILL.md content at write time and at shared-registry publish time (CON-1913). Uses deterministic scanning for known-bad patterns plus semantic classification for ambiguous poisoning intent. Semantic decisions are normalized to allow\|deny\|escalate with cache + micro-batch economics controls. Publishing is rate-limited to 20/day per user. The agent cannot modify its own system prompt or builtin files. |
| Resource loops | No detection. Agents entered 9+ hour loops, spawned infinite cron jobs and background processes. | M7 LoopDetector — per-thread sliding window (5 calls/10 min for the same tool+args). Hard cap at 20 total repetitions. Plus M8, a per-user automation cap (10 scripts/day). |
| Outbound phishing / social pressure | No structured outbound policy kernel. Social-engineering checks are mostly prompt-level and ad hoc. | Hybrid — message safety uses deterministic rules (NoPhishingInMessages) plus semantic classification for ambiguous/high-risk content. Decisions are normalized to allow\|deny\|escalate with reason codes + audit metadata. Semantic checks are cached and micro-batched to keep latency/cost bounded. |
| Cross-agent propagation | Agents on a shared Discord freely influenced each other's behavior and shared unsafe practices. | M9 (SOFT) claim source tracking with trust levels (TrustOwner / TrustNonOwner). Memory claims from non-owner sources are marked and treated with lower authority. Known limitations: the tagging is code-enforced (hard), but the LLM's use of trust levels is soft — prompt injection could convince it to treat non-owner claims as authoritative. |
| Disproportionate actions | Agent disabled its own email client (the nuclear option) in response to a minor confidentiality concern. | M10 (SOFT) proportionality prompt injected into system context. Instructs the agent to prefer minimal, reversible actions and never disable its own capabilities. Known limitations: this is a system prompt instruction, not a code-enforced boundary. It reduces the frequency of disproportionate actions in testing but cannot prevent them against determined prompt injection. |
| Destructive commands | Unrestricted shell with sudo. Agent could rm -rf /, fork-bomb, install rootkits. | Pre-existing NoDestructiveCommands policy blocks rm -rf, fork bombs, and disk-wipe commands. NoNetworkAttacks blocks nmap, reverse shells, crypto miners. |
| Disk exhaustion | No limits. 20 GB volumes could be filled by a single agent or attacker. | Storage limits — per-user workspace quotas: 100 MB (free) / 500 MB (pro) / 2 GB (premium). Per-file limits. File-count limits. Exec output capped at 100 KB. Action log rotation at 5 MB. |
| Cross-user isolation | Each agent runs in its own Fly.io VM — strong isolation but expensive ($20+/agent/month). | nsjail — exec subprocesses run in nsjail sandboxes: mount namespace (only user workspace visible), PID namespace (can't see other processes), per-sandbox /tmp, rlimits. Verified on Cloud Run; cold exec <1 ms. Premium tier: Firecracker microVM for full memory isolation. |
Current posture (verified February 24, 2026)

All exec subprocesses run inside nsjail sandboxes with mount namespaces (only user workspace visible), PID namespaces (can't see other processes), per-sandbox /tmp, and rlimits — verified end-to-end on Cloud Run gen2 (gVisor). Cloud Run's gVisor hypervisor provides additional syscall filtering at the kernel level. Cold exec latency: <1 ms.

The remaining architectural difference with OpenClaw: they provide full VM-level memory isolation (a separate address space per user), while our goroutine pool shares a single Go process. Go's memory safety guarantees (no pointer arithmetic, bounds checking, GC) prevent cross-goroutine memory reads through normal code; a Go runtime CVE would be required to break this boundary. For users who need hardware-level isolation, we offer a Firecracker premium tier.

Where we are materially stronger: every other attack category. OpenClaw has zero policy-layer defenses — no owner verification, no PII filtering, no content trust, no loop detection, no resource limits, no skill validation.

Important caveat on soft mitigations: M5 (content trust), M9 (claim source tracking), and M10 (proportionality) are prompt-level defenses. They measurably improve safety in testing but are not security boundaries — see the classification table.

🔒 The 10 mitigations (M1–M10)

Each mitigation is a pure-Go policy function applied in the tool execution pipeline. Policies are composable, zero-allocation on the hot path, and add microsecond-scale latency relative to LLM call latency (500 ms – 3 s).

M1: Owner-Only Policy

Paper reference: Case Study #2 (Compliance with Non-Owner), Case Study #8 (Identity Spoofing)

Every tool call carries a requester identity in the Go context. OwnerOnlyPolicy extracts this identity and compares it to the workspace owner. Non-owner requests are rejected before the tool executes. This is enforced at the infrastructure level — the LLM cannot override it, and prompt injection cannot bypass it.

// Simplified — actual implementation in tools/policy.go
func OwnerOnlyPolicy() Policy {
    return Policy{
        Name: "owner-only",
        Pre: func(ctx context.Context, tool string, args map[string]any) error {
            requester := RequesterFromContext(ctx)
            owner := OwnerFromContext(ctx)
            if requester != "" && owner != "" && requester != owner {
                return fmt.Errorf("only the workspace owner can use this tool")
            }
            return nil
        },
    }
}

M2: DNS Rebinding Protection

Attack: Attacker controls a domain that resolves to 169.254.169.254 (cloud metadata) or 127.0.0.1 (local services). Agent fetches the URL, believing it's an external site.

Defense: Post-DNS-resolution IP validation. After the DNS lookup but before the HTTP connection, we check if the resolved IP is in a private/loopback/link-local range. If so, the request is blocked. Combined with NoPrivateNetworkAccess (blocks curl 169.254.* in exec) and the safe proxy (transparent HTTP proxy that checks all subprocess traffic).

M3: Sensitive Path Blocking + Credential Scoping

Paper reference: Case Study #3 (Disclosure of Sensitive Information)

Compiled regex patterns match credential files: .env, .env.local, id_rsa, id_ed25519, *.pem, .aws/credentials, .ssh/config, .netrc, .pgpass, and more. NoSensitivePathAccess is applied on the filesystem tool, forcing credential access through the dedicated credential tool.

Credential scoping (CON-1885): The credential tool is further scoped by OnlyOwnedCredentials β€” it only allows access to services the user has actually connected (e.g., Gmail, Google Calendar). If you haven't connected GitHub, the agent can't retrieve GitHub credentials even via the credential tool. The connected services list is looked up from the connection store at request time.
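A simplified sketch of the compiled-regex check on the filesystem tool, using an illustrative subset of the blocked filenames (the production pattern list is longer, and checkPath is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// sensitivePath matches common credential file names anywhere in a path.
// Illustrative subset only.
var sensitivePath = regexp.MustCompile(
	`(^|/)(\.env(\..+)?|id_rsa|id_ed25519|\.netrc|\.pgpass)$|\.pem$|(^|/)\.aws/credentials$|(^|/)\.ssh/config$`)

// checkPath normalizes the path, then rejects it if it names a credential
// file. Runs before the filesystem tool executes.
func checkPath(p string) error {
	clean := filepath.ToSlash(filepath.Clean(p))
	if sensitivePath.MatchString(clean) {
		return fmt.Errorf("access to sensitive path %q denied", p)
	}
	return nil
}

func main() {
	fmt.Println(checkPath(".env"))             // non-nil error
	fmt.Println(checkPath("certs/server.pem")) // non-nil error
	fmt.Println(checkPath("notes/todo.md"))    // nil
}
```

Normalizing with filepath.Clean first matters: without it, a path like `subdir/../.env` would evade a purely textual match.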

M4: Credential/Secret Redaction

Response-path filter for credential and secret patterns, controllable via dynamic config (pii_filter_enabled). Regex patterns detect and redact structured secrets: SSNs, credit card numbers, API keys, JWTs, and private keys.

Design choice: credentials, not all PII

This filter intentionally targets credentials and financial secrets, not general personal information like email addresses or phone numbers. Continua agents need to work with contact information for legitimate use cases: coordinating schedules between agents, sharing meeting details in group chats, finding a place to eat with friends. Blocking emails and phone numbers would break core product functionality. The right boundary is: redact things that should never appear in conversation (API keys, SSNs, private keys), and use OwnerOnlyPolicy + OnlyOwnedCredentials to control who can access sensitive data.
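A minimal sketch of the structured-pattern approach, with illustrative SSN and AWS-access-key regexes only (the production filter covers more formats). It also demonstrates the documented FIRM limitation: unstructured phrasings pass through untouched.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns for two structured secret formats.
var (
	ssnPattern    = regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`)
	awsKeyPattern = regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)
)

// redactSecrets rewrites matches on the response path before the text
// reaches the user or another agent.
func redactSecrets(s string) string {
	s = ssnPattern.ReplaceAllString(s, "[SSN REDACTED]")
	s = awsKeyPattern.ReplaceAllString(s, "[AWS KEY REDACTED]")
	return s
}

func main() {
	fmt.Println(redactSecrets("SSN 123-45-6789, key AKIAIOSFODNN7EXAMPLE"))
	// "my SSN is three hundred..." is unchanged: regex only catches
	// structured formats, which is why M4 is classified FIRM, not HARD.
	fmt.Println(redactSecrets("my SSN is three hundred"))
}
```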

M5: Content Trust Tagging

Paper reference: Case Study #10 (Agent Corruption via external content)

All web_fetch results are wrapped with provenance tags. Content from user-generated content platforms (Reddit, Hacker News, Pastebin, Stack Overflow, GitHub Gists, etc.) receives an additional ⚠ UGC platform warning. This gives the LLM signal to treat fetched content as potentially adversarial — reducing the effectiveness of indirect prompt injection.

M6: Skill Write Validation

Paper reference: Case Study #10 (Agent Corruption β€” attacker modified agent configuration)

When the filesystem tool writes a SKILL.md file, validation runs in two layers: deterministic scanning for known-dangerous patterns (shell injection, credential harvesting, exfiltration URLs, prompt-injection markers, disallowed tool usage) and a semantic intent classifier for ambiguous poisoning attempts that evade regexes. The semantic path returns allow|deny|escalate; deny/escalate block the write and require explicit owner review. Classifier verdicts are cached and micro-batched to keep added latency and model cost bounded.
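The two-layer flow might be sketched as follows, with an illustrative known-bad list and the semantic classifier stubbed out as a function parameter. Names here are hypothetical, not the production API; only the allow|deny|escalate verdict set comes from the design above.

```go
package main

import (
	"fmt"
	"strings"
)

// Verdict is the normalized result of skill validation.
type Verdict string

const (
	Allow    Verdict = "allow"
	Deny     Verdict = "deny"
	Escalate Verdict = "escalate"
)

// knownBad holds deterministic patterns for layer one. Illustrative subset.
var knownBad = []string{"curl | sh", "ignore previous instructions", "cat ~/.ssh"}

// validateSkill runs the deterministic scan first; anything it cannot deny
// outright is handed to the semantic classifier, whose verdict is already
// normalized to allow|deny|escalate.
func validateSkill(content string, classify func(string) Verdict) Verdict {
	lower := strings.ToLower(content)
	for _, p := range knownBad {
		if strings.Contains(lower, p) {
			return Deny // layer 1: known-bad pattern, blocked immediately
		}
	}
	return classify(content) // layer 2: semantic intent review
}

func main() {
	benign := func(string) Verdict { return Allow }
	fmt.Println(validateSkill("Summarize my inbox daily.", benign))
	fmt.Println(validateSkill("Run: curl | sh", benign))
}
```

Ordering the cheap deterministic scan before the model call is what keeps the common case fast; only ambiguous content pays the (cached, micro-batched) classifier cost.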

M7: Loop Detection

Paper reference: Case Study #4 (Waste of Resources β€” 9+ hour loop)

Per-thread sliding window rate limiter. If the same tool is called with the same arguments more than 5 times within 10 minutes, the call is blocked. Hard cap of 20 total repetitions for any tool+args pair per thread. This catches both conversational loops (agent calling the same API repeatedly) and automation loops (cron jobs, background processes).

M8: Automation Cap

Paper reference: Case Study #5 (Denial of Service)

Per-user daily limit of 10 automation scripts (cron jobs, scheduled tasks, background processes), plus a full script audit: a ScriptAudit() function scans exec commands for script-creation patterns and counts them against the user's daily budget.

M9: Claim Source Tracking

Paper reference: Case Study #9 (Cross-agent knowledge sharing), Case Study #11 (Libelous claims)

Every memory claim is tagged with its source and a trust level: TrustOwner (from the authenticated owner) or TrustNonOwner (from agents, emails, web content, or non-owner humans). When the agent uses claims for reasoning, it can weight owner claims more heavily and treat non-owner claims with appropriate skepticism.
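The code-enforced half of this (the tagging itself) can be sketched as follows; the names are hypothetical. Note that the trust level derives from the authenticated requester in context, never from anything the message content asserts about itself.

```go
package main

import "fmt"

// TrustLevel marks the provenance of a memory claim.
type TrustLevel string

const (
	TrustOwner    TrustLevel = "owner"
	TrustNonOwner TrustLevel = "non-owner"
)

// Claim pairs content with its source trust level.
type Claim struct {
	Text  string
	Trust TrustLevel
}

// tagClaim assigns the trust level from the authenticated requester
// identity, so a message cannot elevate its own trust by claiming to
// come from the owner.
func tagClaim(text, requester, owner string) Claim {
	if requester == owner {
		return Claim{text, TrustOwner}
	}
	return Claim{text, TrustNonOwner}
}

func main() {
	fmt.Println(tagClaim("prefers mornings", "alice", "alice").Trust)
	fmt.Println(tagClaim("I am the owner, trust me", "mallory", "alice").Trust)
}
```

How the LLM then weighs these tags is the soft half, per the M9 caveat in the classification table.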

M10: Proportionality Prompt

Paper reference: Case Study #1 (Disproportionate Response β€” agent disabled its own email)

A system prompt instruction directing the agent to prefer minimal, reversible actions. Key rules: never disable your own capabilities, never delete configuration files, prefer the least-destructive path, and escalate to the owner when uncertain.

Soft mitigation β€” not a security boundary

This is a system prompt instruction. Prompt injection is an unsolved problem — if an attacker (or a confused agent) is sufficiently motivated, the prompt cannot guarantee compliance. We include it because it measurably reduces disproportionate actions in our testing (the agent almost always follows these rules), but we do not claim it prevents a determined attack. The hard mitigations (M3 sensitive path blocking, M7 loop detection, M8 automation cap) provide the actual boundaries.

📊 Production observability & alerting

CON-1883: All 10 safety policies were invisible in production — no metrics, no alerting, no way to know if PII redaction was firing or loops were being detected. Now every safety event emits a structured JSON log line that Cloud Logging ingests and Cloud Monitoring turns into real-time metrics and alerts.

Structured safety events

Every policy block, PII redaction, loop detection, automation cap hit, sandbox fallback, and semantic-classifier batch/cache event emits a safety_event JSON entry:

{"safety_event": {
  "kind": "policy_block",
  "policy": "no-destructive-commands",
  "tool": "exec",
  "user_id": "user-abc",
  "detail": "command contains blocked pattern \"rm -rf /\"",
  "timestamp": "2026-02-24T20:57:24Z"
}}
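A sketch of emitting such an entry with encoding/json. The struct and emit helper are illustrative, mirroring the field names in the sample above rather than the runtime's actual logging code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// safetyEvent mirrors the structured log entry shown above.
type safetyEvent struct {
	Kind      string `json:"kind"`
	Policy    string `json:"policy,omitempty"`
	Tool      string `json:"tool,omitempty"`
	UserID    string `json:"user_id,omitempty"`
	Detail    string `json:"detail,omitempty"`
	Timestamp string `json:"timestamp"`
}

// emit renders one safety_event JSON line for Cloud Logging to ingest.
func emit(e safetyEvent) string {
	b, _ := json.Marshal(map[string]safetyEvent{"safety_event": e})
	return string(b)
}

func main() {
	fmt.Println(emit(safetyEvent{
		Kind:      "loop_detected",
		Tool:      "web_fetch",
		UserID:    "user-abc",
		Detail:    "rate limit exceeded",
		Timestamp: time.Now().UTC().Format(time.RFC3339),
	}))
}
```

Keeping the payload a flat, typed struct is what lets Cloud Monitoring build log-based metrics on fields like kind and policy without custom parsing.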

Event kinds:

| Kind | When | Key fields |
| --- | --- | --- |
| policy_block | Any safety policy Pre/Post returns an error | policy, tool, detail |
| pii_redaction | PII pattern matched and redacted in agent output | pattern (e.g., [SSN REDACTED]) |
| loop_detected | Agent-to-agent loop threshold or rate limit exceeded | thread_id, detail |
| automation_cap_hit | User hit their max concurrent automations | detail (e.g., 10/10 automations) |
| rate_limit_hit | Per-user per-tool rate limit exceeded | tool |
| sandbox_fallback | Exec fell back to cold nsjail path (no warm namespace) | — |
| workspace_limit_hit | Workspace size or file count limit reached | detail |
| semantic_classifier_batch | One semantic safety classifier batch completed | model, batch_size, latency_ms, input_tokens, output_tokens, estimated_cost_usd |
| semantic_classifier_cache_hit | Policy/content cache hit for semantic safety classification | cache_type (policy or content) |

Cloud Monitoring alerts

Core safety metrics feed three alert policies:

| Alert | Threshold | Purpose |
| --- | --- | --- |
| Policy Block Spike | >50 blocks in 5 min | Active attack or a false-positive surge needing dynamic config tuning |
| PII Redaction Spike | >20 redactions in 5 min | Agents reading sensitive files, or regex false positives |
| Loop Detection Events | >10 in 5 min | Runaway agent-to-agent loops |

Dashboard

Safety dashboard widgets now cover both enforcement events and semantic-classifier economics.

Live debugging endpoint

GET /admin/safety-metrics returns a real-time JSON snapshot of all safety counters:

{
  "total_blocks": 42,
  "pii_redactions": 7,
  "loop_detections": 2,
  "rate_limit_hits": 15,
  "automation_cap_hits": 1,
  "sandbox_fallbacks": 0,
  "workspace_limit_hits": 3,
  "semantic_classifier_calls": 19,
  "semantic_classifier_classified_items": 44,
  "semantic_classifier_cache_hits": 61,
  "semantic_classifier_policy_cache_hits": 35,
  "semantic_classifier_content_cache_hits": 26,
  "semantic_classifier_cache_hit_rate": 0.58,
  "semantic_classifier_input_tokens": 9821,
  "semantic_classifier_output_tokens": 1432,
  "semantic_classifier_estimated_cost_usd": 0.0127,
  "semantic_classifier_cost_by_user_usd": {
    "alice@example.com": 0.0068,
    "bob@example.com": 0.0059
  },
  "semantic_classifier_added_latency_p50_ms": 37,
  "semantic_classifier_added_latency_p95_ms": 84,
  "policy_blocks": {
    "no-destructive-commands": 12,
    "no-network-attacks": 8,
    "no-sensitive-path-access": 15,
    "no-phishing-in-messages": 5,
    "deno-sandbox": 2
  }
}
Hot-path overhead stays minimal

InstrumentedGuarded remains identical to Guarded on the happy path for deterministic policies. Semantic safety checks only run for ambiguous/high-risk outbound messages and are amortized with verdict/content caches plus micro-batching. Most benign tool calls still incur only local policy checks + counters.

πŸŽ›οΈ Dynamic config for safety tuning

CON-1883: All safety thresholds were hardcoded — if PII redaction had false positives or loop detection was too aggressive, the only fix was a code change + redeploy. Safety behavior is now tuneable at runtime via dynamic config (a GCS blob, polled every 30 seconds).

| Knob | Config key | Default | Use case |
| --- | --- | --- | --- |
| Loop threshold | loop_detector_threshold | 20 | Raise if agents legitimately need long back-and-forth threads |
| Loop rate limit | loop_detector_rate_limit | 5 per window | Tighten if loops are consuming resources before the threshold |
| Loop rate window | loop_detector_rate_window_seconds | 600 (10 min) | Widen/narrow the sliding window |
| Automation cap | automation_max_per_user | 10 | Raise for power users, lower under load |
| PII filter toggle | pii_filter_enabled | true | Emergency kill switch if false positives are too high |
| Classifier model | classifier_model | main model | Route semantic safety checks to a cheaper/faster model (e.g., Flash Lite) |
| Workspace max size | workspace_max_size_bytes | tier-based | Override for specific deployments |
| Workspace max files | workspace_max_files | tier-based | Override for specific deployments |
| Sandbox pool min | sandbox_pool_min_size | 5 | Pre-warm more namespaces during peak hours |
| Sandbox pool max | sandbox_pool_max_size | 200 | Cap memory usage from warm namespaces |
| Sandbox idle timeout | sandbox_session_idle_seconds | 600 (10 min) | Recycle faster under memory pressure |
| Sandbox toggle | sandbox_enabled | true | Emergency fallback to unsandboxed exec |

All knobs use zero-value = code default semantics. Setting a knob to 0 (or omitting it) uses the hardcoded default. This means an empty config file produces the same behavior as before the feature was added β€” no migration needed.
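The rule reduces to a one-line helper per knob type (intKnob is a hypothetical name; the real config loader is elsewhere in the runtime):

```go
package main

import "fmt"

// intKnob returns the configured value, or the hardcoded default when the
// knob is zero or omitted: the "zero-value = code default" rule.
func intKnob(configured, def int) int {
	if configured == 0 {
		return def
	}
	return configured
}

func main() {
	fmt.Println(intKnob(0, 20))  // omitted knob falls back to the code default
	fmt.Println(intKnob(50, 20)) // explicitly set knob wins
}
```

A side effect of this convention: no knob can be meaningfully set to literal 0, which is acceptable here because 0 would disable every one of these limits.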

Kill switches

pii_filter_enabled and sandbox_enabled are emergency kill switches. Disabling them reduces security posture and should only be done to mitigate a confirmed false-positive surge or sandbox instability, with a plan to re-enable promptly.

πŸ—οΈ User isolation architecture

Our runtime uses a goroutine-per-user architecture within a shared Go process on Cloud Run. This is a deliberate trade-off: cheaper per-user cost (~$0 marginal cost vs. $20+/VM for OpenClaw) at the expense of weaker OS-level isolation.

Defense-in-depth layers

🛡️
Policy layer (M1–M10)
Pre/Post policies on every tool call. Owner verification, credential scoping, credential/secret redaction, content trust, loop detection, resource limits. 2–5 µs per call.
📁
Application-layer jail
Per-user workspace directories. resolve() rejects absolute paths and traversals. Exec blocks commands referencing sibling workspaces. Workspace size quotas.
🔐
nsjail sandbox (Linux)
Every exec subprocess runs inside an nsjail namespace. Mount namespace (only user workspace visible), PID namespace (can't see other processes), per-sandbox /tmp (10 MB tmpfs), rlimits (512 MB / 30 s CPU / 50 MB file / 64 procs). Cloud Run's gVisor provides additional syscall filtering at the hypervisor level. Verified on Cloud Run gen2: cold exec <1 ms.
☁️
Infrastructure
Cloud Run with safe proxy (transparent URL checking on all subprocess HTTP traffic). Google Web Risk integration. Rate limiting (per-user, per-tool). Async audit logging.

What we verified (and what we didn't)

We ran a 14-test isolation probe suite that confirms:

Gaps closed by nsjail (February 2026):

Shared-process memory isolation (what's actually true):

Residual risk (documented honestly):

💾 Storage exhaustion defense

Paper reference: Case Study #5 (Denial of Service — disk filled with output)

In the "Agents of Chaos" study, attackers filled the 20 GB volume by having the agent generate large outputs or write many files. Our runtime enforces storage limits at three levels:

| Resource | Free | Pro | Premium | Enforcement |
| --- | --- | --- | --- | --- |
| Workspace total | 100 MB | 500 MB | 2 GB | WorkspaceSizePolicy pre-write check |
| File count | 5,000 | 20,000 | 100,000 | Same policy, directory walk with cache |
| Single file | 10 MB | 50 MB | 100 MB | Content-length check before write |
| Exec output | 100 KB (all tiers) | | | ExecOutputSizePolicy post-exec truncation |
| Exec subprocess memory | 512 MB (all tiers) | | | RLIMIT_AS (Linux) |
| Exec CPU time | 30 seconds (all tiers) | | | RLIMIT_CPU (Linux) |
| Exec file write | 50 MB (all tiers) | | | RLIMIT_FSIZE (Linux) |
| Exec processes | 64 (all tiers) | | | RLIMIT_NPROC (Linux) |
| Action log | 5 MB (all tiers) | | | Half-file rotation on overflow |
| Audit log (memory) | 50,000 entries (all tiers) | | | FIFO ring buffer |
| Daily agent runs | 50 | 500 | 5,000 | Quota enforcer |

Workspace usage is scanned lazily (at write time) and cached for 30 seconds. The cache is invalidated after every successful write. This keeps the hot-path overhead to 31 ns (cache hit) while still catching workspace overflow.

⚡ Performance benchmarks

All mitigations run in the tool-call hot path. We benchmark each one to ensure they add negligible latency relative to LLM inference (500 ms – 3 s). Measured on Apple M4 Max with -benchmem and -count=3.

| Benchmark | Latency |
| --- | --- |
| Workspace size (cache hit) | 31 ns |
| DNS rebinding check | 71 ns |
| Owner-only policy | 85 ns |
| Filesystem resolve + jail | 140 ns |
| Content trust tagging | 164 ns |
| Loop detector | 550 ns |
| Full policy chain | 2.6 µs |
| Network attack policies | 4.9 µs |
| PII filter (with PII) | 16 µs |

Sandbox manager overhead

| Benchmark | Overhead |
| --- | --- |
| Sandbox acquire (hot reuse) | 113 ns / 0 allocs |
| Sandbox acquire + release | 225 ns |
| nsjail cold exec (Cloud Run) | <1 ms |
Bottom line

Total safety overhead per tool call: <25 µs (policies) + <1 ms (nsjail sandbox). Against a typical LLM latency of 1,000,000 µs (1 second), that is <0.1% overhead. Zero allocation on the policy hot path. Sandbox acquire is 113 ns on hot reuse (0 allocations). Safety is not a tax on performance.

🧪 Adversarial test suite

We maintain a 33-test adversarial suite covering 16 attack categories, designed to mirror the attacks from the "Agents of Chaos" paper plus additional vectors specific to our architecture. All tests run with Go's -race detector enabled.

| Category | Tests | Attacks Attempted | Result |
| --- | --- | --- | --- |
| Path traversal | 3 | Absolute paths, ../ escape, symlink attacks | BLOCKED |
| Command injection | 3 | rm -rf /, fork bombs, :(){ :\|:& };: | BLOCKED |
| Network attacks | 3 | Reverse shells, port scans, crypto miners | BLOCKED |
| SSRF / DNS rebinding | 2 | Metadata server, private IPs, DNS rebinding | BLOCKED |
| Credential theft | 2 | Read .env, id_rsa, .aws/credentials | BLOCKED |
| PII exfiltration | 3 | SSNs, credit cards, API keys, JWTs in output | REDACTED |
| Prompt injection via web | 2 | Malicious instructions in fetched HTML | TAGGED |
| Skill poisoning | 2 | Shell injection in SKILL.md, exfiltration URLs | BLOCKED |
| Non-owner impersonation | 2 | Tool calls with non-owner requester identity | BLOCKED |
| Resource loops | 2 | Infinite tool call loops, cron job flooding | CAPPED |
| Rate limit bypass | 1 | Burst tool calls exceeding limits | THROTTLED |
| Identity spoofing | 1 | Claim injection with fake trust levels | TRACKED |
| Cross-user FS access | 3 | Read/write/list sibling workspaces | BLOCKED |
| Cross-user exec access | 2 | cat other user's files, enumerate users | BLOCKED |
| Storage exhaustion | 2 | Write oversized files, exhaust file count | CAPPED |
| Exec sandbox escape | 2 | Read /etc/passwd, enumerate PIDs | BLOCKED (nsjail) |
| Memory access (CON-1913) | 2 | /dev/mem, /proc/self/mem, /proc/*/environ, /proc/kallsyms, /proc/kcore | BLOCKED |
| Builtins write (CON-1913) | 2 | Write to /builtins/IDENTITY.md, copy-then-modify evasion | BLOCKED |
| Skill poisoning (CON-1913) | 4 | Malicious skill publish (injection, exfil, shell pipe, env harvest), publish rate limit | BLOCKED |

Result: 41 of 41 attacks fully mitigated. The original 33 tests plus 8 new tests for memory access paths, builtin write protection, and skill poisoning (CON-1913).

🚀 Cloud Run safety e2e tests

CON-1883: The adversarial tests above run in-process. They verify policy logic but not the full deployed stack: iptables REDIRECT → safe proxy → nsjail namespace → policies → PII response filter. We now have 9 deployed-instance integration tests that run against a live Cloud Run instance and verify end-to-end enforcement.

# Run against staging:
go test -v -tags=e2e ./personal_ai_runtime/e2e/ \
  -run TestSafety \
  -staging-url=https://gravity-staging-329044329780.us-central1.run.app \
  -timeout=5m

| Test | What it verifies | Stack layers exercised |
| --- | --- | --- |
| TestSafety_MetadataServerBlocked | GCP metadata server (169.254.169.254) unreachable from exec | Policy regex + iptables DROP + nsjail network |
| TestSafety_PIIRedactionOnResponse | SSNs and API keys redacted in agent response | PII response filter (post-exec) |
| TestSafety_DestructiveCommandBlocked | rm -rf / blocked before execution | NoDestructiveCommands policy |
| TestSafety_PrivateNetworkBlocked | RFC 1918 private IPs blocked from exec | NoPrivateNetworkAccess policy |
| TestSafety_SafeProxyURLFiltering | Known-malicious URLs blocked by safe proxy | Safe proxy + Google Web Risk API |
| TestSafety_SensitivePathBlocked | Credential files (credentials/) inaccessible | NoSensitivePathAccess policy |
| TestSafety_MetricsEndpoint | /admin/safety-metrics returns valid JSON | Safety metrics collector + HTTP endpoint |
| TestSafety_NsjailNamespaceIsolation | No host processes visible from ps aux | nsjail PID namespace (CLONE_NEWPID) |
| TestSafety_PhishingMessageBlocked | Phishing patterns in outbound messages blocked | NoPhishingInMessages policy |
Full-stack verification

These tests send real HTTP requests to a deployed Cloud Run instance, ask the agent to perform dangerous actions, and verify the responses show proper blocking. They exercise the complete request path: load balancer → Cloud Run → Go HTTP handler → agent pool → LLM → tool call → policy chain → nsjail sandbox → response filter → HTTP response. This is the test layer that catches "works in unit tests, broken in production" gaps.

✅ Cloud Run verification (February 24, 2026)

We built a Docker image with nsjail, pushed it to Google Artifact Registry, and ran end-to-end tests on Cloud Run gen2 (which runs containers inside gVisor). These are the actual results from production infrastructure — not local dev or simulated environments.

What works on Cloud Run gVisor

| Feature | Clone flag | Status | Verified behavior |
|---|---|---|---|
| Mount namespace | CLONE_NEWNS | WORKS | User workspace visible; /etc/passwd, /home, /var all hidden. Host filesystem invisible to subprocess. |
| PID namespace | CLONE_NEWPID | WORKS | Process sees `NSpid: 21 2` — PID 2 in its own namespace. Cannot see or signal other processes. |
| Per-sandbox tmpfs | — | WORKS | 10 MB tmpfs mounted at /tmp. Isolated from host /tmp and other sandboxes. |
| rlimits | — | WORKS | Memory (512 MB), CPU (30 s), file size (50 MB), nproc (64), nofile (256) all enforced. |
| Workspace bind mount | — | WORKS | User workspace read-write; can read own files, write output. Cannot access other users' directories. |
| Cross-user write blocking | — | WORKS | Attempting to write to another user's workspace: nonexistent directory. Mount namespace hides it entirely. |

What gVisor blocks (and why that's fine)

| Feature | Clone flag | Status | Why it doesn't matter |
|---|---|---|---|
| User namespace | CLONE_NEWUSER | BLOCKED | gVisor itself provides equivalent isolation at the hypervisor level. User mapping not needed when running as a single UID. |
| chroot / pivot_root | — | BLOCKED | Not needed — the mount namespace provides filesystem isolation without chroot; we bind-mount only what the subprocess needs. |
| seccomp-bpf | — | REDUNDANT | gVisor intercepts all syscalls at the hypervisor level. Adding seccomp inside gVisor would be defense in depth with no practical benefit and added complexity. |
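
Put together, the tables above correspond to an nsjail invocation along these lines. This is an illustrative sketch built from nsjail's CLI flags: the limits mirror the tables, `--disable_clone_newuser` reflects the gVisor restriction, but exact flag names can vary by nsjail version and the paths here are invented:

```go
package main

import (
	"fmt"
	"strings"
)

// NsjailArgs builds an illustrative nsjail command line matching the
// verified configuration: mount + PID namespaces, no user namespace
// (CLONE_NEWUSER is blocked under gVisor), a per-sandbox tmpfs, and
// the rlimits from the table. Not the production launcher.
func NsjailArgs(workspace string) []string {
	return []string{
		"nsjail", "-Mo", // run the command once, then exit
		"--disable_clone_newuser",                // unavailable on gVisor
		"--bindmount", workspace + ":/workspace", // read-write user workspace
		"--tmpfsmount", "/tmp",                   // per-sandbox tmpfs
		"--rlimit_as", "512", // memory, MB
		"--rlimit_cpu", "30", // CPU seconds
		"--rlimit_fsize", "50", // max file size, MB
		"--rlimit_nproc", "64", // max processes
		"--rlimit_nofile", "256", // max open files
		"--", "/bin/sh", "-c", "echo hello",
	}
}

func main() {
	fmt.Println(strings.Join(NsjailArgs("/data/users/u123"), " "))
}
```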

Measured latency

| Metric | Measured value |
|---|---|
| nsjail cold exec (Cloud Run) | <1 ms |
| Sandbox acquire + release | 225 ns |
| Sandbox acquire (hot reuse) | 113 ns / 0 allocs |
Key finding

gVisor's clone() implementation is extremely fast — creating a mount+PID namespace takes less than 1 millisecond. This means the warm pool's primary value shifts from latency reduction to lifecycle management (automatic cleanup, adaptive sizing, quiescence-driven recycling). The security benefit is real and the performance cost is effectively zero.
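
With namespace creation this cheap, the warm pool's hot path can be as simple as a buffered channel: acquire is a non-blocking channel receive with zero allocations, and a miss falls back to cold creation. A minimal sketch — the `Sandbox` and `Pool` types here are illustrative, not our production sandbox manager:

```go
package main

import "fmt"

// Sandbox stands in for a prepared nsjail namespace.
type Sandbox struct{ ID int }

// Pool is a minimal warm-pool sketch: hot acquire pulls a prepared
// sandbox from a buffered channel; a miss creates one cold.
// Single-goroutine sketch; the real pool would synchronize next.
type Pool struct {
	warm chan *Sandbox
	next int
}

func NewPool(size int) *Pool {
	p := &Pool{warm: make(chan *Sandbox, size)}
	for i := 0; i < size; i++ {
		p.next++
		p.warm <- &Sandbox{ID: p.next}
	}
	return p
}

// Acquire returns a warm sandbox if one is ready, else creates cold.
func (p *Pool) Acquire() *Sandbox {
	select {
	case sb := <-p.warm: // hot path: reuse, no allocation
		return sb
	default: // cold path: <1 ms on gVisor per the measurements above
		p.next++
		return &Sandbox{ID: p.next}
	}
}

// Release returns a sandbox to the warm pool, or drops it when full.
func (p *Pool) Release(sb *Sandbox) {
	select {
	case p.warm <- sb: // recycle into the warm pool
	default: // pool full: tear the namespace down instead
	}
}

func main() {
	p := NewPool(2)
	sb := p.Acquire()
	fmt.Println("acquired sandbox", sb.ID)
	p.Release(sb)
}
```

The channel-based shape is what makes the 113 ns / 0 allocs hot-acquire number plausible: the fast path is a single lock-free-ish channel operation with no heap traffic.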

🔥 Premium isolation: Firecracker microVM

The standard tier provides strong isolation for all users: nsjail namespaces for subprocesses, Go's memory safety for the shared process, and exclusion of /proc and /dev from sandboxed exec. The residual risk is a Go runtime memory corruption CVE — low-probability but nonzero. For users who handle high-value credentials (banking APIs, healthcare data, enterprise secrets) and need hardware-level memory isolation, we offer Firecracker microVM isolation as a premium tier.

| | Standard (all users) | Premium (opt-in) |
|---|---|---|
| Subprocess isolation | nsjail (mount/PID namespace + gVisor syscall filter) | Firecracker microVM (separate kernel) |
| Memory isolation | Shared Go process (memory-safe runtime; no pointer arithmetic, bounds-checked, GC-managed) | Separate address space per user (hardware-level) |
| Network isolation | Default-deny iptables + safe proxy (CON-1884) | Dedicated network namespace + firewall |
| Kernel | Shared host kernel | Dedicated guest kernel (5.10 minimal) |
| Exec cold start | <1 ms cold (verified on Cloud Run) | 0 ms (pre-warmed VM) / 125–500 ms (cold) |
| Per-user cost | ~$0 | ~$0.08/mo |
| Gaps closed | 5 of 6 | 6 of 6 |

How warm pools work

Both tiers use adaptive warm pools to minimize latency:

Agent quiescence signaling

Agents signal when they're done making tool calls (quiescence). This lets the sandbox manager recycle namespaces/VMs earlier, keeping the warm pool healthy. External signals (incoming email, webhooks, scheduled proactive tasks) pre-warm sandboxes before the agent run starts — so the namespace is ready when the first exec call arrives.

Adaptive to traffic patterns

The warm pool grows during peak hours and shrinks during quiet periods. The system measures demand rate over a 5-minute sliding window and targets enough warm capacity for a 2-minute burst. Combined with proactive pre-warming from external signals, most exec calls hit the warm path even during traffic spikes.
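The sizing rule above reduces to one small computation: estimate the demand rate from the 5-minute window, then provision enough warm sandboxes for a 2-minute burst at that rate. A sketch — `TargetWarmSize` and its parameters are illustrative, and it assumes a hypothetical counter of exec calls in the window:

```go
package main

import "fmt"

// TargetWarmSize implements the adaptive sizing rule described above:
// demand rate over a sliding window, scaled to a burst horizon, then
// clamped to [min, max]. Illustrative sketch, not the production tuner.
func TargetWarmSize(execsInWindow int, windowSeconds, burstSeconds float64, min, max int) int {
	rate := float64(execsInWindow) / windowSeconds // execs per second
	target := int(rate*burstSeconds + 0.5)         // round to nearest
	if target < min {
		return min
	}
	if target > max {
		return max
	}
	return target
}

func main() {
	// 150 execs in the last 5 minutes -> 0.5/s -> 60 warm sandboxes
	// to cover a 2-minute burst.
	fmt.Println(TargetWarmSize(150, 300, 120, 2, 256))
}
```

The clamp is what gives the "grows during peak hours and shrinks during quiet periods" behavior a floor and a ceiling: quiet periods settle at the minimum rather than zero, so off-peak exec calls still usually hit the warm path.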

Verified on Cloud Run gen2 (gVisor) — February 2026

E2E tested on Cloud Run with real nsjail sandboxes: mount namespace (user workspace visible, /etc/passwd hidden, /home hidden), PID namespace (process sees only itself), cross-user write blocking (can't write to another user's workspace), per-sandbox /tmp, rlimits (memory, CPU, file size, nproc all enforced). Cold exec latency: <1 ms (gVisor's namespace implementation is near-zero overhead). User namespace (CLONE_NEWUSER) is not available on gVisor — gVisor itself provides equivalent isolation at the hypervisor level.

πŸ—ΊοΈ Completed & remaining

| Gap | Status | How |
|---|---|---|
| Safety policy observability | CLOSED | Structured safety_event logging, Cloud Monitoring alerts, dashboard widgets, and /admin/safety-metrics endpoint. Includes semantic-classifier cache/cost/latency telemetry for hybrid safety checks (CON-1883 + CON-2116). |
| Runtime tunability | CLOSED | Dynamic config knobs for loop detector, automation cap, PII filter, semantic classifier model, workspace limits, and sandbox pool tuning (CON-1883 + CON-2116). |
| Full-stack e2e verification | CLOSED | Deployed-instance integration tests verify iptables → safe proxy → nsjail → policies → response filters on Cloud Run, including metrics endpoint schema checks and phishing policy behavior (CON-1883 + CON-2116). |
| System file reads (/etc/passwd) | CLOSED | nsjail mount namespace — only user workspace visible. |
| PID enumeration (ps aux) | CLOSED | nsjail PID namespace. |
| Shared /tmp | CLOSED | Per-sandbox tmpfs (10 MB). |
| Dangerous syscalls | CLOSED | gVisor hypervisor-level syscall filtering (Cloud Run gen2). |
| Sibling workspace bypass | CLOSED | nsjail mount namespace + application-layer blocking. |
| Network egress isolation | CLOSED | Default-deny iptables egress for appuser (CON-1884). Only loopback (safe proxy), established connections, and Cloud SQL IPs whitelisted. DNS (port 53) blocked to prevent exfiltration. HTTP/HTTPS redirected through safe proxy + Web Risk. |
| Credential scoping | CLOSED | OnlyOwnedCredentials policy scopes the credential tool to services the user has connected. NoSensitivePathAccess wired on the filesystem tool to block direct credential file reads (CON-1885). |
| Memory access from exec | CLOSED | nsjail mount namespace excludes /proc and /dev. NoMemoryAccess policy blocks /dev/mem, /proc/*/mem, /proc/*/environ in exec commands (defense in depth). Adversarial tests verify both layers (CON-1913). |
| Skill publish validation | CLOSED | Security scanner + rate limiting (20/day) at SharedSkillStore.Publish() and SkillRegistry.Publish() time. Content validated at install time too (defense in depth). Skill trust tiers and provenance tracking (CON-1913). |
| Shared process memory | MITIGATED + PREMIUM | Standard tier: Go memory safety prevents cross-goroutine reads through normal code. Residual risk: Go runtime CVE (low-probability). Premium tier: Firecracker microVM provides hardware-level address space isolation. |
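
As one concrete example of a policy-layer check from the table, the NoMemoryAccess command screen can be approximated with a single regex over the exec command string. A sketch — the pattern and `ViolatesNoMemoryAccess` helper are illustrative, not the production policy:

```go
package main

import (
	"fmt"
	"regexp"
)

// memAccessPattern approximates the NoMemoryAccess policy: block exec
// commands that reference /dev/mem, /proc/<pid>/mem, or
// /proc/<pid>/environ. Defense-in-depth only: the nsjail mount
// namespace already excludes /proc and /dev from the sandbox.
var memAccessPattern = regexp.MustCompile(`/dev/mem\b|/proc/[^/\s]+/(mem|environ)\b`)

// ViolatesNoMemoryAccess reports whether a command should be blocked
// before it reaches the sandbox.
func ViolatesNoMemoryAccess(cmd string) bool {
	return memAccessPattern.MatchString(cmd)
}

func main() {
	fmt.Println(ViolatesNoMemoryAccess("cat /proc/self/environ")) // blocked
	fmt.Println(ViolatesNoMemoryAccess("ls /tmp"))                // allowed
}
```

Because the mount namespace is the real barrier, a regex bypass here degrades gracefully: the command still runs against a filesystem where /proc and /dev simply do not exist.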
Disclosure

If you find a security issue, please email security@continua.ai. We follow responsible disclosure and will acknowledge your report within 48 hours.