Overhaul of Reverse Tunnel Code

2025-12-06 20:07:08 -07:00
parent 737bf1faef
commit 178257c588
42 changed files with 1240 additions and 357 deletions


@@ -20,6 +20,9 @@ Use this doc for agent-only work (Borealis agent runtime under `Data/Agent` →
- Validates script payloads with backend-issued Ed25519 signatures before execution.
- Outbound-only; API/WebSocket calls flow through `AgentHttpClient.ensure_authenticated` for proactive refresh. Logs bootstrap, enrollment, token refresh, and signature events in `Agent/Logs/`.
## Reverse Tunnels
- Design, orchestration, domains, limits, and lifecycle are documented in `Docs/Codex/REVERSE_TUNNELS.md`. Agent role implementation lives in `Data/Agent/Roles/role_ReverseTunnel.py` with per-domain protocol handlers under `Data/Agent/Roles/Reverse_Tunnels/`.
## Execution Contexts & Roles
- Auto-discovers roles from `Data/Agent/Roles/`; no loader changes needed.
- Naming: `role_<Purpose>.py` with `ROLE_NAME`, `ROLE_CONTEXTS`, and optional hooks (`register_events`, `on_config`, `stop_all`).


@@ -23,6 +23,10 @@ Use this doc for Engine work (successor to the legacy server). For shared guidan
- Enrollment: operator approvals, conflict detection, auditor recording, pruning of expired codes/refresh tokens.
- Background jobs and service adapters maintain compatibility with legacy DB schemas while enabling gradual API takeover.
## Reverse Tunnels
- Full design and lifecycle are in `Docs/Codex/REVERSE_TUNNELS.md` (domains, limits, framing, APIs, stop path, UI hooks).
- Engine orchestrator is `Data/Engine/services/WebSocket/Agent/reverse_tunnel_orchestrator.py` with domain handlers under `Data/Engine/services/WebSocket/Agent/Reverse_Tunnels/`.
## WebUI & WebSocket Migration
- Static/template handling: `Data/Engine/services/WebUI`; deployment copy paths are wired through `Borealis.ps1` with TLS-aware URL generation.
- Stage 6 tasks: migration switch in the legacy server for WebUI delegation and porting device/admin API endpoints into Engine services.


@@ -1,302 +0,0 @@
# Codex Prompt
Read `Docs/Codex/FEATURE_IMPLEMENTATION_TRACKING/Agent_Reverse_Tunneling.md` and follow the checklist. Preserve existing Engine/Agent behavior, reuse TLS/identity, and implement the dedicated reverse tunnel system (WebSocket-over-TLS) with ephemeral Engine ports and browser-based operator access. Update this ledger with progress, deviations, and next steps before and after changes.
# Context & Goals
- Build a high-performance reverse tunnel between Engine and Agent to carry interactive protocols (first target: interactive PowerShell) while keeping the existing control Socket.IO untouched.
- Agents use a single fixed outbound port to reach the Engine's tunnel listener; Engine allocates ephemeral ports from 30000-40000 per session.
- Tunnels are operator-triggered, ephemeral, and torn down after inactivity (1h idle) or if the agent stays offline beyond the grace window (1h).
- WebUI-only operator experience: Engine exposes a WebSocket endpoint that bridges browser sessions to agent channels.
- Security: reuse pinned TLS bundle + Ed25519 device identity and short-lived signed tokens per tunnel/channel.
- Concurrency: per-domain caps (one PowerShell session per agent; RDP/VNC/WebRTC grouped to one concurrent session across that domain; WinRM/SSH can scale higher later).
- Logging: Device Activity entries for session start/stop (with operator identity when available) plus `reverse_tunnel.log` on both Agent and Engine.
- Transport: WebSocket-over-TLS for the tunnel socket to keep proxy-friendliness and reuse libraries.
- Motivation: Provide NAT-friendly, high-throughput, operator-initiated remote access (PowerShell first) without touching the existing lightweight control Socket.IO; agents remain outbound-only, Engine leases high ports on demand.
# Background for a New Codex Agent (assumes you read AGENTS.md first)
- AGENTS.md already points you to `BOREALIS_ENGINE.md`, `BOREALIS_AGENT.md`, and shared UI docs; follow their logging paths, runtime locations, and UI rules.
- Keep the existing Socket.IO control channel untouched; this tunnel is a new dedicated listener/port.
- Reuse security: pinned TLS bundle + Ed25519 identity + existing token signing. Agent stays outbound-only; no inbound openings on devices.
- UI reminders specific to this feature: PowerShell page should mirror `Assemblies/Assembly_Editor.jsx` syntax highlighting and general layout from `Admin/Page_Template.jsx` per UI doc.
- Licensing: project is AGPL; PowerShell runs in pipe mode (no pywinpty/ConPTY dependency).
- Non-destructive: new code must be gated/dormant until invoked; avoid regressions to existing roles/pages.
# Non-Destructive Expectations
- Do not break existing Agent/Engine comms. New code should be additive and dormant until wired.
- WebUI additions must not impact current pages; new routes/components should be isolated.
- Note any temporary breakage during development here before committing.
# Architecture Plan (High Level)
- Transport: Dedicated WebSocket-over-TLS listener (Engine) on fixed port; Agents dial outbound only.
- Handshake: API on port 443 negotiates an ephemeral tunnel port + token/lease; Agent opens tunnel socket to that port; Engine maps operator channels to agent channels.
- Framing: Binary frames `version | msg_type | flags | reserved | channel_id | length | payload`; supports heartbeat, back-pressure, close codes, and resize events for terminals.
- Lease/idle: 1h idle timeout; 1h grace if agent drops mid-session before freeing port.
- PowerShell v1: Agent spawns PowerShell via pipes (no ConPTY), Engine provides browser terminal bridge with syntax highlighting like `Assembly_Editor.jsx`.
# Terminology & IDs
- agent_id: existing composed ID (hostname + GUID + scope).
- tunnel_id: UUID per tunnel lease.
- channel_id: uint32 per logical stream inside a tunnel (PowerShell uses one channel for stdio + control subframes for resize).
- lease: mapping of tunnel_id -> agent_id -> assigned_port -> expiry/idle timers -> domain/protocol metadata -> token.
# Handshake / Flow (end-to-end)
1) Operator in WebUI clicks “Remote PowerShell” on a device.
2) WebUI calls Engine API (port 443) `POST /api/tunnel/request` with agent_id, protocol=ps, domain=ps, operator_id/context. API:
- Consults lease manager (port pool 30000-40000). Enforce domain concurrency per agent (PowerShell max 1).
- Allocates port, tunnel_id, expiry (idle 1h, grace 1h), signs token (binds agent_id, tunnel_id, port, protocol, expires_at).
- Persists lease (in-memory + db/kv if available) and logs Device Activity (pending).
- Returns {tunnel_id, port, token, expires_at, idle_seconds, protocol, domain}.
- Spins up/keeps alive tunnel listener for that port in the async service (engine-side).
3) Engine notifies agent over existing control channel (Socket.IO or REST push) with the lease payload.
4) Agent ReverseTunnel role receives the task, validates token signature (Engine public key) + expiry, opens WebSocket-over-TLS to engine_host:port, sends CONNECT frame {agent_id, tunnel_id, token, protocol, domain, client_version}.
5) Engine listener validates token, binds the WebSocket to the lease, marks agent_connected_at, starts idle timer.
6) Operator browser opens Engine WebUI bridge WebSocket `/ws/tunnel/<tunnel_id>` (port 443) with operator auth. Engine maps browser session to lease and assigns channel 1 for PowerShell.
7) Engine sends CHANNEL_OPEN to agent for channel 1 (protocol ps). Agent starts a pipes-only PowerShell subprocess, streams stdout/stderr as DATA frames, and accepts stdin from the browser via the Engine bridge.
8) Heartbeats: ping/pong every 20s; idle timer resets on traffic; if idle > 1h, Engine and Agent send CLOSE (code=idle) and free lease/port.
9) If agent disconnects mid-session, lease is held for 1h grace; if reconnects within grace, Engine allows resume, else frees port.
10) On session end (operator closes or process exits), Engine/Agent exchange CLOSE, teardown channel, release port, write Device Activity (stop) and service logs.
# Framing (binary over WebSocket)
- Fixed header (little endian): version(1) | msg_type(1) | flags(1) | reserved(1) | channel_id(4) | length(4) | payload(length).
- msg_types:
- 0x01 CONNECT (payload: JSON or CBOR {agent_id, tunnel_id, token, protocol, domain, version})
- 0x02 CONNECT_ACK / CONNECT_ERR (payload: code, message)
- 0x03 CHANNEL_OPEN (payload: protocol, metadata)
- 0x04 CHANNEL_ACK / CHANNEL_ERR
- 0x05 DATA (payload: raw bytes)
- 0x06 WINDOW_UPDATE (payload: uint32 credits) for back-pressure if needed
- 0x07 HEARTBEAT (ping/pong)
- 0x08 CLOSE (payload: code, reason)
- 0x09 CONTROL (payload: JSON for resize: {cols, rows} or future control)
- Back-pressure: default to simple socket pausing; WINDOW_UPDATE optional for high-rate protocols. Start with pause/resume read on buffers > threshold.
- Close codes: 0=ok, 1=idle_timeout, 2=grace_expired, 3=protocol_error, 4=auth_failed, 5=server_shutdown, 6=agent_shutdown, 7=domain_limit, 8=unexpected_disconnect.
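The fixed header above can be sketched with Python's `struct` module. This is a minimal illustration, not the project's framing helper: the function names, the default `version=1`, and the dict returned by `decode_frame` are assumptions made for the example.

```python
import struct

# Little-endian fixed header: version, msg_type, flags, reserved (1 byte each),
# then channel_id and payload length (4 bytes each), 12 bytes in total.
HEADER = struct.Struct("<BBBBII")

# msg_type values taken from the list above.
MSG_DATA = 0x05
MSG_HEARTBEAT = 0x07
MSG_CLOSE = 0x08


def encode_frame(msg_type: int, channel_id: int, payload: bytes = b"",
                 flags: int = 0, version: int = 1) -> bytes:
    """Prefix the payload with the 12-byte header (version=1 is assumed)."""
    header = HEADER.pack(version, msg_type, flags, 0, channel_id, len(payload))
    return header + payload


def decode_frame(buf: bytes) -> dict:
    """Parse one complete frame; raise ValueError on short or mismatched input."""
    if len(buf) < HEADER.size:
        raise ValueError("frame shorter than header")
    version, msg_type, flags, _reserved, channel_id, length = HEADER.unpack_from(buf)
    payload = buf[HEADER.size:]
    if len(payload) != length:
        raise ValueError("payload length does not match header")
    return {
        "version": version,
        "msg_type": msg_type,
        "flags": flags,
        "channel_id": channel_id,
        "payload": payload,
    }
```

Because each WebSocket binary message already has its own boundary, the sketch treats one message as one frame and only uses `length` as a consistency check.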
# Port Lease / Domain Policy
- Port pool: 30000-40000 inclusive. Persist active leases (memory + lightweight state file/db row).
- Lease fields: tunnel_id, agent_id, assigned_port, protocol, domain, operator_id, created_at, expires_at, idle_timeout=3600s, grace_timeout=3600s, state {pending, active, closing, expired}, last_activity_ts.
- Allocation: first free port in pool (wrap). Refuse if pool exhausted; emit error to operator.
- Domain concurrency per agent:
- ps domain: max 1 active tunnel.
- rdp/vnc/webrtc domain: max 1 active (future).
- ssh/winrm domain: allow >1 (configurable).
- Idle: reset on any DATA/CONTROL. After 1h idle, send CLOSE, free lease.
- Grace: if agent disconnects, hold lease for 1h; allow reconnect to same tunnel_id/port with valid token; else free.
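The policy above could be sketched roughly as follows. This is an in-memory illustration only: the class and method names are hypothetical, "first free port (wrap)" is read here as a scan from a rotating cursor, and grace handling and persistence are left out.

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class Lease:
    tunnel_id: str
    agent_id: str
    assigned_port: int
    domain: str
    last_activity_ts: float
    idle_timeout: float = 3600.0
    state: str = "pending"


class LeaseManager:
    """Hypothetical in-memory lease table; not the Engine's actual implementation."""

    def __init__(self, ports=range(30000, 40001), domain_limits=None):
        self._ports = list(ports)          # pool is inclusive of 40000
        self._next = 0                     # cursor where the next scan starts
        # A missing domain means "no limit" in this sketch.
        self._limits = {"ps": 1} if domain_limits is None else domain_limits
        self._leases = {}                  # tunnel_id -> Lease

    def allocate(self, agent_id, domain, now=None):
        now = time.time() if now is None else now
        limit = self._limits.get(domain)
        active = sum(1 for l in self._leases.values()
                     if l.agent_id == agent_id and l.domain == domain)
        if limit is not None and active >= limit:
            raise RuntimeError("domain_limit")
        in_use = {l.assigned_port for l in self._leases.values()}
        for offset in range(len(self._ports)):
            idx = (self._next + offset) % len(self._ports)
            if self._ports[idx] not in in_use:
                self._next = (idx + 1) % len(self._ports)
                lease = Lease(str(uuid.uuid4()), agent_id, self._ports[idx], domain, now)
                self._leases[lease.tunnel_id] = lease
                return lease
        raise RuntimeError("port pool exhausted")

    def touch(self, tunnel_id, now=None):
        """Reset the idle timer; called on any DATA/CONTROL frame."""
        self._leases[tunnel_id].last_activity_ts = time.time() if now is None else now

    def expire_idle(self, now=None):
        """Free leases idle for longer than their timeout; return their tunnel_ids."""
        now = time.time() if now is None else now
        expired = [tid for tid, l in self._leases.items()
                   if now - l.last_activity_ts > l.idle_timeout]
        for tid in expired:
            del self._leases[tid]
        return expired
```

Passing `now` explicitly keeps the timers testable without sleeping; a real sweeper would call `expire_idle()` on a timer and send CLOSE before freeing each lease.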
# Engine Components (detailed)
- Async service module `Data/Engine/services/WebSocket/Agent/ReverseTunnel.py`:
- Runs asyncio/uvloop TCP/TLS listener factory for the fixed port (or on-demand per allocated port).
- Manages per-port WebSocket acceptors bound to leases.
- Validates tokens (Ed25519 or existing JWT signer); ensures agent_id/tunnel_id/port/protocol match.
- Maintains lease manager (in-memory map + persistence hook).
- Provides API helpers to allocate/release leases and to push control messages to Agent via existing Socket.IO.
- Logging to `Engine/Logs/reverse_tunnel.log`.
- Emits Device Activity entries on session start/stop (with operator id if present).
- API endpoint (port 443) `POST /api/tunnel/request`:
- Inputs: agent_id, protocol, domain, operator_id (from auth), metadata (e.g., hostname).
- Output: tunnel_id, port, token, idle_seconds, grace_seconds, expires_at.
- Checks domain limits, pool availability.
- Browser bridge endpoint `/ws/tunnel/<tunnel_id>`:
- Auth via existing operator session/JWT.
- Binds to lease, opens channel 1 to agent, relays DATA/CONTROL, handles CLOSE/idle.
- Enforces per-operator attach (one active browser per tunnel unless multi-view is allowed; start with one).
- WebUI wiring (later): PowerShell page uses this bridge; status toasts on errors/idle.
# Agent Components (detailed)
- Role file `Data/Agent/Roles/role_ReverseTunnel.py`:
- Registers control event handler to receive tunnel instructions (payload from Engine API push).
- Validates token signature/expiry and domain limits.
- Opens WebSocket-over-TLS to engine_host:assigned_port using existing TLS bundle/identity.
- Implements framing (CONNECT, CHANNEL_OPEN, DATA, HEARTBEAT, CLOSE).
- Manages sub-role registry for protocols (PowerShell first).
- Enforces per-domain concurrency; refuses new tunnel if violation (sends error).
- Heartbeat + idle tracking; stop_all closes active tunnels cleanly.
- Logging to `Agent/Logs/reverse_tunnel.log`.
- Submodules under `Data/Agent/Roles/ReverseTunnel/`:
- `tunnel_Powershell.py`: pipes-only PowerShell subprocess (stdin/stdout piping, control frames are no-ops for resize), exit codes.
- Common helpers: channel dispatcher, back-pressure (pause reads if outbound buffer high).
# PowerShell v1 (end-to-end)
- Engine:
- `Data/Engine/services/WebSocket/Agent/ReverseTunnel/Powershell.py` handles protocol-specific channel setup, translates browser resize/control messages to CONTROL frames, and passes stdin/out via DATA.
- Integrates with Device Activity logging (start/stop, operator id, agent id, tunnel_id).
- Agent:
- Spawn PowerShell as a pipes-only subprocess (no ConPTY) with configurable shell path if needed; set a minimal environment; capture stdout/stderr; forward to channel.
- Handle EXIT -> send CLOSE with exit code; tear down channel and release lease.
- WebUI:
- `Data/Engine/web-interface/src/ReverseTunnel/Powershell.jsx` page/modal with terminal UI, syntax highlighting like `Assemblies/Assembly_Editor.jsx`, copy support, status toasts, idle timeout banner.
- Uses `/ws/tunnel/<tunnel_id>` bridge; shows reconnect/spinner while agent connects; shows errors for domain limit or auth failure.
# Logging & Auditing
- Service logs: `Engine/Logs/reverse_tunnel.log`, `Agent/Logs/reverse_tunnel.log` with tunnel_id/channel_id/agent_id/operator_id.
- Device Activity: add entries on session start/stop with operator info if available, reason (idle timeout, exit, error).
- Metrics (optional later): count active tunnels, per-domain usage, port pool pressure.
# Config Knobs (defaults)
- fixed_tunnel_port: 8443 (or reuse 443 listener with path split if desired).
- port_range: 30000-40000.
- idle_timeout_seconds: 3600.
- grace_timeout_seconds: 3600.
- heartbeat_interval_seconds: 20.
- domain concurrency limits: ps=1, rdp/vnc/webrtc=1 shared, ssh=unbounded/configurable, winrm=unbounded/configurable.
- enable_compression: false (initially).
# Testing & Validation (detailed)
- Engine unit tests: lease manager allocations, domain limit enforcement, idle/grace expiry, token validation, port pool exhaustion.
- Engine integration: simulated agent CONNECT, channel open, data echo, idle timeout, reconnect within grace.
- Agent unit tests: role lifecycle start/stop, token rejection, domain enforcement, heartbeat handling.
- Manual plan: start Engine and Agent in dev; request tunnel via API; agent receives task; tunnel connects; run PowerShell commands; test resize; test idle timeout; test agent drop and reconnect within 1h; test domain limit error when a second PS session is requested.
- WebUI manual: open PS page, run commands, copy output, observe Device Activity entries, see idle timeout banner.
# Credits & Attribution
- Add pywinpty attribution to `Data/Engine/web-interface/src/Dialogs.jsx` CreditsDialog under “Code Shared in this Project.”
# Risks / Watchpoints
- Eventlet vs asyncio coexistence: ensure tunnel service uses dedicated loop/thread/process to avoid blocking existing Socket.IO handlers.
- Port exhaustion: detect and return meaningful errors; ensure cleanup on process exit.
- Buffer growth: enforce back-pressure; pause reads when output queue high.
- Packaging: pywinpty wheels availability for supported Python versions; note in doc if build needed.
- Security: strict token binding (agent_id, tunnel_id, port, protocol, expiry); reject mismatches; always TLS.
# Next Actions (on approval)
- Document handshake and framing (done above) — refine in code comments.
- Scaffold Engine tunnel service + lease manager + logging (no wiring to main app yet).
- Scaffold Agent role + PowerShell submodule (dormant until enabled).
- Add WebUI PowerShell page and CreditsDialog attribution.
- Wire negotiation API to lease manager; push control payload to Agent; wire browser bridge to tunnel service.
# Implementation Sequence (follow in order)
1) Read project docs (`Docs/Codex/BOREALIS_ENGINE.md`, `Docs/Codex/BOREALIS_AGENT.md`, `Docs/Codex/SHARED.md`, `Docs/Codex/USER_INTERFACE.md`).
2) Add config defaults (fixed tunnel port, port range, timeouts) without enabling by default.
3) Build Engine lease manager (allocations, domain rules, idle/grace timers, persistence hook) with unit tests.
4) Implement framing helper (encode/decode headers, heartbeats, close codes, back-pressure hooks).
5) Stand up Engine async WebSocket-over-TLS listener (fixed port), token validation, and per-lease bindings.
6) Implement negotiation API `/api/tunnel/request`; sign tokens; push control payload to Agent via existing channel.
7) Add Engine browser bridge `/ws/tunnel/<tunnel_id>`; relay to agent channel; enforce auth and single attachment.
8) Scaffold Agent role `role_ReverseTunnel.py` (token verify, connect, channel dispatch, heartbeat, stop_all, domain limits).
9) Implement Agent PowerShell submodule with ConPTY/pywinpty, resize, stdout/stderr piping, exit handling.
10) Implement Engine PowerShell handler to translate browser events and route frames.
11) Build WebUI PowerShell page `ReverseTunnel/Powershell.jsx` with terminal UI, syntax highlighting, status/idle handling.
12) Wire Device Activity logging for session start/stop; surface in Device Activity tab.
13) Add Credits dialog attribution for pywinpty.
14) Run tests (unit + manual end-to-end); verify idle/grace, domain limits, resize, reconnect.
15) Gate feature (config off by default), clean up logs, update this ledger with status and deviations.
# Handoff Notes for the Next Codex Agent
- Treat this file as the single source of truth for the tunnel feature; document deviations and progress here.
- Keep changes additive; do not modify unrelated Socket.IO handlers or existing roles/pages.
- Maintain outbound-only design for Agents; Engine listens on fixed + leased ports.
- Performance matters: use asyncio/uvloop for tunnel service; avoid blocking eventlet paths; add basic back-pressure.
- Security is mandatory: TLS + signed tokens bound to agent_id/tunnel_id/port/protocol/expiry; close on mismatch or framing errors.
# Execution Protocol for the Codex Agent
- Work one checklist item at a time.
- After finishing an item, mark it in the Detailed Checklist and briefly summarize what changed.
- Before starting the next item, ask the operator for permission to proceed. Pause if permission is not granted.
- If you must reorder items (e.g., dependency), note the rationale here before proceeding.
- Keep the codebase functional at all times. If interim work breaks Borealis, either complete the set of dependent checklist items needed to restore functionality in the same session or revert your own local changes before handing back.
- Only prompt for a GitHub sync when a tangible piece of functionality is validated (e.g., API call works, tunnel connects, UI interaction tested). Pair the prompt with the explicit question: “Did you sync a commit to GitHub?” after validation or operator testing.
# Detailed Checklist (update statuses)
- [x] Repo hygiene
- [x] Confirm no conflicting changes; avoid touching legacy Socket.IO handlers.
- [x] PowerShell transport: pipe-only (pywinpty/ConPTY removed from Agent deps).
- [x] Engine tunnel service
- [x] Add reverse tunnel config defaults (fixed port, port range, timeouts, log path) without enabling.
- [x] Create `Data/Engine/services/WebSocket/Agent/ReverseTunnel.py` (async/uvloop listener, port pool 30000-40000).
- [x] Implement lease manager (DHCP-like) keyed by agent GUID, with idle/grace timers and per-domain concurrency rules.
- [x] Define handshake/negotiation API on port 443 to issue leases and signed tunnel tokens.
- [x] Implement channel framing, flow control, heartbeats, close semantics.
- [x] Logging: `Engine/Logs/reverse_tunnel.log`; audit into Device Activity (session start/stop, operator id, agent id, tunnel_id, port).
- [x] WebUI operator bridge endpoint (WebSocket) that maps browser sessions to agent channels.
- [x] Idle/grace sweeper + heartbeat wiring for tunnel sockets.
- [x] TLS-aware per-port listener and agent CONNECT_ACK handling.
- [x] Agent tunnel role
- [x] Add `Data/Agent/Roles/role_ReverseTunnel.py` (manages tunnel socket, reconnect, heartbeats, channel dispatch).
- [x] Per-protocol submodules under `Data/Agent/Roles/ReverseTunnel/` (first: `tunnel_Powershell.py`).
- [x] Enforce per-domain concurrency (one PowerShell; prevent multiple RDP/VNC/WebRTC; allow extensible policies).
- [x] Logging: `Agent/Logs/reverse_tunnel.log`; include tunnel_id/channel_id.
- [x] Integrate token validation, TLS reuse, idle teardown, and graceful stop_all.
- [ ] PowerShell v1 (feature target)
- [x] Engine side `Data/Engine/services/WebSocket/Agent/ReverseTunnel/Powershell.py` (channel server, resize handling, translate browser events).
- [x] Agent side `Data/Agent/Roles/ReverseTunnel/tunnel_Powershell.py` using pipes-only PowerShell subprocess; map stdin/stdout to frames; resize no-op.
- [ ] WebUI: `Data/Engine/web-interface/src/ReverseTunnel/Powershell.jsx` with terminal UI, syntax highlighting matching `Assemblies/Assembly_Editor.jsx`, copy support, status toasts.
- [ ] Device Activity entries and UI surface in `Devices/Device_List.jsx` Device Activity tab.
- [ ] Credits & attribution
- [x] pywinpty removed (no attribution needed); revisit if new third-party deps added.
- [ ] Testing & validation
- [ ] Unit/behavioral tests for lease manager, framing, and idle teardown (Engine side).
- [ ] Agent role lifecycle tests (start/stop, reconnect, single-session enforcement).
- [ ] Manual test plan: request port, start PowerShell session, send commands, resize, idle timeout, offline grace recovery, concurrent domain policy.
- [ ] Operational notes
- [ ] Document config knobs: fixed tunnel port, port range, idle/grace durations, domain concurrency limits.
- [ ] Warn about potential resource usage (FD count, port exhaustion) and mitigation.
## Progress Log
- 2025-11-30: Repo hygiene complete—git tree clean with no Socket.IO touches; added Windows-only `pywinpty` dependency to Agent requirements for future PowerShell ConPTY work (watch packaging/test impact). Next: start Engine tunnel service scaffolding pending operator go-ahead.
- 2025-11-30: Added reverse tunnel config defaults to Engine settings (fixed port 8443, port pool 30000-40000, idle/grace 3600s, heartbeat 20s, log path Engine/Logs/reverse_tunnel.log); feature still dormant and not wired.
- 2025-11-30: Scaffolded Engine reverse tunnel service module (`Data/Engine/services/WebSocket/Agent/ReverseTunnel.py`) with domain policy defaults, port allocator, and lease manager (idle/grace enforcement). Service stays dormant; listener/bridge wiring and framing remain TODO.
- 2025-11-30: Added framing helpers (header encode/decode, heartbeat/close builders) plus negotiation API `/api/tunnel/request` (operator-authenticated) that allocates leases via the tunnel service and returns signed tokens/lease metadata; listener/bridge/logging still pending.
- 2025-11-30: Wired dedicated reverse tunnel log writer (daily rotation) and elevated lease allocation/release events to log file via `ReverseTunnelService`; Device Activity logging still pending.
- 2025-11-30: Added token decode/validation helpers (signature-aware when signer present) to `ReverseTunnelService` for future agent handshake verification; still not wiring listeners/bridge.
- 2025-11-30: Added bridge scaffolding with token validation hook and placeholder Device Activity logger; no sockets bound yet and DB-backed Device Activity still outstanding.
- 2025-11-30: Device Activity logging now writes to `activity_history` (start/stop with reverse_tunnel entries) and emits `device_activity_changed` when socketio is available; bridge uses token validation on agent attach. Listener wiring still pending.
- 2025-11-30: Added async listener hooks/bridge attach entrypoints (`handle_agent_connect`, `handle_operator_connect`) as scaffolding; still no sockets bound or frame routing.
- 2025-11-30: Moved negotiation API to `services/API/devices/tunnel.py` (device domain), injected db/socket handles into the service, and added a placeholder Socket.IO handler `tunnel_bridge_attach` that calls operator_attach (no data plane yet).
- 2025-11-30: Added bridge queues for agent/operator frames (placeholder), and ensured ReverseTunnelService is shared across API/WebSocket registration via context to avoid duplicate state; sockets/frame routing still not implemented.
- 2025-11-30: Added WebUI-facing Socket.IO namespace `/tunnel` with join/send/poll events that map browser sessions to tunnel bridges, using base64-encoded frames and operator auth from session/cookies.
- 2025-11-30: Enabled async WebSocket listener per assigned port (TLS-aware via Engine certs) for agent CONNECT frames, with frame routing between agent socket and browser bridge queues; Engine tunnel service checklist marked complete.
- 2025-11-30: Added idle/grace sweeper, CONNECT_ACK to agents, heartbeat loop, and token-touched operator sends; per-port listener now runs on dedicated loop/thread. (Original instructions didn't call out sweeper/heartbeat wiring explicitly.)
- 2025-12-01: Added Agent reverse tunnel role (`Data/Agent/Roles/role_ReverseTunnel.py`) with TLS-aware WebSocket dialer, token validation against signed leases, domain-limit guard, heartbeat/idle watchdogs, and reverse_tunnel.log status emits; protocol handlers remain stubbed until PowerShell module lands.
- 2025-12-01: Implemented Agent PowerShell channel (initially pywinpty/ConPTY path, later simplified to pipes-only) and Engine PowerShell handler with Socket.IO helpers (`ps_open`/`ps_send`/`ps_resize`/`ps_poll`); added ps channel logging and domain-aware attach. WebUI remains pending.
- 2025-12-06: Simplified PowerShell handler to pipes-only, removed pywinpty dependency, added robust handler import for non-package agent runtimes, and cleaned UI status messaging.
## Engine Tunnel Service Architecture
```mermaid
sequenceDiagram
participant UI as WebUI (Browser)
participant API as Engine API (443)
participant RTSVC as ReverseTunnelService
participant Lease as LeaseMgr/DB
participant Agent as Agent
participant Port as Ephemeral TLS WS (30000-40000)
UI->>API: POST /api/tunnel/request {agent_id, protocol, domain}
API->>RTSVC: request_lease(agent_id, protocol, domain, operator_id)
RTSVC->>Lease: allocate(port, tunnel_id, token, expiries)
RTSVC-->>API: lease summary (port, token, tunnel_id, idle/grace, fixed_port)
API-->>UI: {port, token, tunnel_id, expires_at}
API-->>RTSVC: ensure shared service / listeners (context)
Agent-)Port: WebSocket TLS to assigned port
Agent->>Port: CONNECT frame {agent_id, tunnel_id, token}
Port->>RTSVC: validate token, bind bridge, Device Activity start
Port-->>Agent: CONNECT_ACK + HEARTBEATs
UI->>API: (out-of-band) receives lease payload via control push
UI->>RTSVC: Socket.IO /tunnel join (tunnel_id, operator auth)
RTSVC->>Lease: mark operator attached
UI->>RTSVC: send frames (stdin/controls)
RTSVC->>Port: enqueue to agent socket
Agent->>RTSVC: frames (stdout/stderr/resize)
RTSVC-->>UI: poll frames back to browser
RTSVC->>Lease: touch activity/idle timers
loop Heartbeats / Sweeper
RTSVC->>Agent: HEARTBEAT
RTSVC->>Lease: expire_idle()/grace sweep every 15s
end
Note over RTSVC,Lease: on idle/grace expiry -> CLOSE, release port, Device Activity stop
Note over RTSVC,Port: on agent socket close -> bridge stop, release port
```
## Future Changes in Generation 2
These items are out of scope for the current milestone but should be considered for a production-ready generation after minimum functionality is achieved in the early stages of development. This section is a place to note things that were not implemented in Generation 1, but should be added in future iterations of the Reverse Tunneling system.
- Harden operator auth/authorization: enforce per-operator session binding, ownership checks, audited attach/detach, and offer a pure WebSocket `/ws/tunnel/<tunnel_id>` bridge.
- Replace Socket.IO browser bridge with a dedicated binary WebSocket bridge for higher throughput and simpler framing.
- Back-pressure and flow control: implement window-based credits, buffer thresholds, and circuit breakers to prevent unbounded queues.
- Graceful loop/server lifecycle: join the loop thread on shutdown, await per-port server close, and expose health/metrics.
- Resilience and reconnect: agent/browser resume with sequence numbers, replay protection, and deterministic recovery within grace.
- Observability: structured metrics (active tunnels, port utilization, back-pressure events), alerting on port exhaustion/auth failures.
- Configuration and hardening: pin `websockets`, validate TLS at bootstrap, and expose feature flags/env overrides for listener enablement.


@@ -0,0 +1,92 @@
# Borealis Reverse Tunnels - Operator & Developer Guide
This document is the single reference for how Borealis reverse tunnels are organized, secured, and orchestrated. It is written for Codex agents extending the feature (new protocols, UI, or policy changes).
## 1) High-Level Model
- Outbound-only: Agents initiate all tunnel sockets to the Engine. No inbound openings on devices.
- Transport: WebSocket-over-TLS carrying a binary frame header (version | msg_type | flags | reserved | channel_id | length) plus payload.
- Leases: Engine issues short-lived leases per agent/domain/protocol. Each lease binds a tunnel_id to an ephemeral Engine port and a signed token.
- Domains: Concurrency “lanes” keep protocols isolated: `remote-interactive-shell` (2), `remote-management` (1), `remote-video` (2). Legacy aliases (`ps`, etc.) normalize into these lanes.
- Channels: Logical streams inside a tunnel (channel_id u32). PS uses channel 1; future protocols can open more channels per tunnel as needed.
- Tear-down: Idle/grace timeouts plus explicit operator stop. Closing a tunnel must close its protocol channel(s) and kill the agent process for interactive shells.
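As a rough illustration of how the domain lanes and legacy aliases might fit together, the sketch below normalizes a requested domain and checks for a free slot. The function names are hypothetical, and only the `ps` alias is included because it is the only one named in this guide.

```python
from typing import Sequence

# Per-agent concurrency limits for each lane, as listed above.
DOMAIN_LIMITS = {
    "remote-interactive-shell": 2,
    "remote-management": 1,
    "remote-video": 2,
}

# Only `ps` is named in this guide; any other legacy aliases would be added here.
LEGACY_ALIASES = {"ps": "remote-interactive-shell"}


def normalize_domain(domain: str) -> str:
    """Map a requested domain (or legacy alias) onto one of the known lanes."""
    key = domain.strip().lower()
    key = LEGACY_ALIASES.get(key, key)
    if key not in DOMAIN_LIMITS:
        raise ValueError(f"unknown domain: {domain!r}")
    return key


def has_free_slot(domain: str, active_domains: Sequence[str]) -> bool:
    """Return True if one more tunnel in this lane stays within its limit."""
    lane = normalize_domain(domain)
    used = sum(1 for d in active_domains if normalize_domain(d) == lane)
    return used < DOMAIN_LIMITS[lane]
```

Normalizing before counting means a tunnel opened under `ps` and one opened under `remote-interactive-shell` share the same two slots.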
## 2) Engine Components
- Orchestrator: `Data/Engine/services/WebSocket/Agent/reverse_tunnel_orchestrator.py`
- Lease manager: Port pool allocator, domain limit enforcement, idle/grace sweeper.
- Token issuer/validator: Binds agent_id, tunnel_id, domain, protocol, port, expires_at.
- Bridge: Maps agent sockets ↔ operator sockets; stores per-tunnel protocol server instances.
- Logging: `Engine/Logs/reverse_tunnel.log` plus Device Activity start/stop entries.
- Stop path: `stop_tunnel` closes protocol servers, emits `reverse_tunnel_stop` to agents, releases lease/bridge.
- Protocol registry: Domain/protocol handlers under `Data/Engine/services/WebSocket/Agent/Reverse_Tunnels/`:
- `remote_interactive_shell/Protocols/Powershell.py` (live), `Bash.py` (placeholder).
- `remote_management/Protocols/SSH.py`, `WinRM.py` (placeholders).
- `remote_video/Protocols/VNC.py`, `RDP.py`, `WebRTC.py` (placeholders).
- API Endpoints:
- `POST /api/tunnel/request` → allocates lease, returns {tunnel_id, port, token, idle_seconds, grace_seconds, domain, protocol}.
- `DELETE /api/tunnel/<tunnel_id>` → operator-driven stop; pushes stop to agent and releases the lease.
- Domain default for PowerShell requests is `remote-interactive-shell` (legacy `ps` still accepted).
- Operator Socket.IO namespace `/tunnel`:
- `join`, `send`, `poll`, `ps_open`, `ps_send`, `ps_resize`, `ps_poll`.
- Operator socket disconnect triggers `stop_tunnel` if no other operators remain attached.
- WebUI (current): `Data/Engine/web-interface/src/Devices/ReverseTunnel/Powershell.jsx` requests PS leases in `remote-interactive-shell`, sends CLOSE frames, and calls DELETE on disconnect/unload.
## 3) Agent Components
- Role: `Data/Agent/Roles/role_ReverseTunnel.py`
- Validates signed lease tokens; enforces domain limits (2/1/2 with legacy fallbacks).
- Outbound TLS WS connect to assigned port; heartbeats + idle/grace watchdog; stop_all closes channels and sends CLOSE.
- Protocol registry: loads handlers from `Data/Agent/Roles/Reverse_Tunnels/*/Protocols/*` (PowerShell live; others stubbed to close unsupported channels cleanly).
- PowerShell channel: `Data/Agent/Roles/ReverseTunnel/tunnel_Powershell.py` (pipes-only, no PTY); re-exported under `Reverse_Tunnels/remote_interactive_shell/Protocols/Powershell.py`.
- Logging: `Agent/Logs/reverse_tunnel.log` with channel/tunnel lifecycle.
## 4) Framing, Heartbeats, Close
- Header: version(1) | msg_type(1) | flags(1) | reserved(1) | channel_id(u32 LE) | length(u32 LE).
- Messages: CONNECT/ACK, CHANNEL_OPEN/ACK, DATA, CONTROL (resize), WINDOW_UPDATE (reserved), HEARTBEAT (ping/pong), CLOSE.
- Close codes: ok, idle_timeout, grace_expired, protocol_error, auth_failed, server_shutdown, agent_shutdown, domain_limit, unexpected_disconnect.
- Heartbeats: Engine → Agent loop; the Engine's idle/grace sweeper runs roughly every 15s; the Agent watchdog closes the tunnel on idle/grace expiry.
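
The header layout above is 12 bytes and can be sketched with `struct`. This is a minimal illustration, not the orchestrator's actual codec: it assumes `length` counts payload bytes only, and the numeric value `MSG_DATA = 0x05` is hypothetical, since the wire codes for message types are not listed here.

```python
import struct

# version(1) | msg_type(1) | flags(1) | reserved(1) | channel_id(u32 LE) | length(u32 LE)
HEADER = struct.Struct("<BBBBII")  # 12 bytes total

# Hypothetical wire code; the real value is not given in this document.
MSG_DATA = 0x05


def encode_frame(msg_type, channel_id, payload=b"", version=1, flags=0):
    """Prefix the payload with a 12-byte header (reserved byte set to 0)."""
    return HEADER.pack(version, msg_type, flags, 0, channel_id, len(payload)) + payload


def decode_frame(buf):
    """Decode one frame from the front of buf; return (frame, remaining bytes)."""
    if len(buf) < HEADER.size:
        raise ValueError("truncated header")
    version, msg_type, flags, _reserved, channel_id, length = HEADER.unpack_from(buf)
    end = HEADER.size + length
    if len(buf) < end:
        raise ValueError("truncated payload")
    frame = {
        "version": version,
        "msg_type": msg_type,
        "flags": flags,
        "channel_id": channel_id,
        "payload": buf[HEADER.size:end],
    }
    return frame, buf[end:]
```

Raising on a short buffer mirrors the "reject framing errors" guidance; a streaming reader would instead buffer until a full frame is available.
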
## 5) Lifecycle (PowerShell example)
1. UI calls `POST /api/tunnel/request` with agent_id, protocol=ps, domain=remote-interactive-shell.
2. Engine allocates port/tunnel_id, signs token, starts listener, pushes `reverse_tunnel_start` to agent.
3. Agent dials WS to assigned port, sends CONNECT with token. Engine validates, binds bridge, sends CONNECT_ACK + heartbeat.
4. Operator Socket.IO `/tunnel` joins; Engine attaches operator, instantiates PS server, issues CHANNEL_OPEN.
5. Agent launches PowerShell (pipes), streams stdout/stderr as DATA; operator input via `ps_send`; optional resize via `ps_resize` (no-op on agent pipes).
6. On operator Disconnect/tab close, UI sends CLOSE frame and calls DELETE; Engine stop path notifies agent (`reverse_tunnel_stop`), closes channel, releases lease/domain slot.
7. Idle/grace expiry or agent disconnect also triggers close/release; domain slots free immediately.
## 6) Security & Auth
- TLS: Reuse existing pinned bundle; outbound-only agent sockets.
- Token: short-lived, binds agent_id/tunnel_id/domain/protocol/port/expires_at; optional signature verification (Ed25519 signer when configured).
- Operator auth: uses existing Engine session/cookie/bearer for `/tunnel` namespace and API endpoints.
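
The token binding described above can be sketched as a claim-by-claim comparison against the lease the receiver expects. This is a sketch under assumptions: claims are taken to be an already-decoded dict, `expires_at` is assumed to be a Unix timestamp, and the helper name `check_token_claims` is hypothetical. Ed25519 signature verification is deliberately left out and would run before this check when a signer is configured.

```python
import time

# Fields the token is documented to bind, besides expires_at.
BOUND_FIELDS = ("agent_id", "tunnel_id", "domain", "protocol", "port")


def check_token_claims(claims, expected, now=None):
    """Return (ok, reason). Rejects any mismatch in a bound field or an expired token."""
    now = time.time() if now is None else now
    for name in BOUND_FIELDS:
        if name not in claims or claims[name] != expected.get(name):
            return False, "mismatch:" + name
    expires_at = claims.get("expires_at")
    if not isinstance(expires_at, (int, float)) or expires_at <= now:
        return False, "expired"
    return True, "ok"
```

A failed check would map naturally onto the `auth_failed` close code, though how the real role reports it is not specified here.
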
## 7) Configuration Knobs (defaults)
- Port pool: 30000–40000; fixed port optional (context settings).
- Idle timeout: 3600s; Grace timeout: 3600s.
- Heartbeat interval: 20s (Engine → Agent).
- Domain limits: remote-interactive-shell=2, remote-management=1, remote-video=2; legacy aliases preserved.
- Log path: `Engine/Logs/reverse_tunnel.log`; `Agent/Logs/reverse_tunnel.log`.
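
To make the defaults concrete, here is a small sketch of how they might drive lease allocation. It reads the port pool as 30000 to 40000 inclusive, which is an assumption, and it counts domain slots within a single allocator instance because the scope of the limits (per agent or global) is not stated in this list. `TunnelDefaults` and `LeaseAllocator` are hypothetical names, not the Engine's lease manager.

```python
from dataclasses import dataclass

# Per-domain limits from the list above.
DOMAIN_LIMITS = {
    "remote-interactive-shell": 2,
    "remote-management": 1,
    "remote-video": 2,
}


@dataclass
class TunnelDefaults:
    port_start: int = 30000
    port_end: int = 40000       # treated as inclusive (assumption)
    idle_seconds: int = 3600
    grace_seconds: int = 3600
    heartbeat_seconds: int = 20


class LeaseAllocator:
    """Tracks active ports and enforces per-domain slot limits."""

    def __init__(self, defaults=None, limits=None):
        self.defaults = defaults or TunnelDefaults()
        self.limits = dict(limits or DOMAIN_LIMITS)
        self.active = {}  # port -> domain

    def allocate(self, domain):
        if domain not in self.limits:
            raise ValueError("unknown domain: " + domain)
        in_use = sum(1 for d in self.active.values() if d == domain)
        if in_use >= self.limits[domain]:
            raise RuntimeError("domain_limit")
        for port in range(self.defaults.port_start, self.defaults.port_end + 1):
            if port not in self.active:
                self.active[port] = domain
                return port
        raise RuntimeError("port pool exhausted")

    def release(self, port):
        # Safe to call more than once; stop, idle, and grace paths may all release.
        self.active.pop(port, None)
```

Making `release` idempotent is a design choice that keeps the stop path, the idle/grace sweeper, and disconnect handling from having to coordinate who frees the slot.
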
## 8) Logs & Telemetry
- Engine: lease events, socket events, close reasons in `reverse_tunnel.log`; Device Activity start/stop with tunnel_id/operator_id when available.
- Agent: role lifecycle, channel start/stop, errors in `reverse_tunnel.log`.
## 9) Extending to New Protocols
- Add Engine handler under the appropriate domain folder and register it in the orchestrator's protocol registry.
- Add Agent handler under matching domain folder; update role registry to load it.
- Define channel open semantics (metadata), DATA/CONTROL usage, and close behavior.
- Update API/UI to allow selecting the protocol/domain and to send protocol-specific controls.
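
One way to picture the registration step is a registry keyed by `(domain, protocol)`. Everything named below (`HANDLERS`, `register`, `ChannelHandler`, `EchoHandler`, `open_channel`) is hypothetical and does not reflect the real orchestrator or role API; returning `protocol_error` for an unregistered pair is likewise an assumption about which close code fits.

```python
# Registry mapping (domain, protocol) -> handler class.
HANDLERS = {}


def register(domain, protocol):
    """Class decorator that records a handler under its domain/protocol pair."""
    def decorator(cls):
        HANDLERS[(domain, protocol)] = cls
        return cls
    return decorator


class ChannelHandler:
    """Minimal handler interface: open metadata, DATA payloads, close reason."""

    def on_open(self, metadata):
        self.metadata = dict(metadata or {})

    def on_data(self, payload):
        raise NotImplementedError

    def on_close(self, reason):
        self.closed_reason = reason


@register("remote-interactive-shell", "echo")
class EchoHandler(ChannelHandler):
    """Toy protocol used only to exercise the registry."""

    def on_data(self, payload):
        return payload


def open_channel(domain, protocol, metadata=None):
    """Instantiate the registered handler, or report a close code if none exists."""
    cls = HANDLERS.get((domain, protocol))
    if cls is None:
        return None, "protocol_error"
    handler = cls()
    handler.on_open(metadata)
    return handler, "ok"
```

With this shape, adding a protocol means dropping a decorated class into the matching domain folder and making sure the module is imported, which is in the spirit of the auto-discovery used for roles.
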
## 10) Outstanding Work
- Implement real handlers for Bash/SSH/WinRM/RDP/VNC/WebRTC and surface in UI.
- Add tests for DELETE stop path, per-domain limits, and browser disconnect cleanup.
- Consider a binary WebSocket browser bridge to replace Socket.IO for high-throughput protocols.
## 11) Risks & Watchpoints
- Eventlet/asyncio coexistence: tunnel loop runs on its own thread/loop; avoid blocking Socket.IO handlers.
- Port exhaustion: handle allocation failures cleanly; always release on stop/idle/grace.
- Buffer growth: add back-pressure before enabling high-throughput protocols.
- Security: strict token binding (agent_id/tunnel_id/domain/protocol/port/expiry) and TLS; reject framing errors.
## 12) Change Log (not exhaustive)
- 2025-11-30: Initial scaffold (lease manager, framing, tokens, API, Agent role, PS handlers).
- 2025-12-06: Simplified PS to pipes-only; improved handler imports; UI status tweaks.
- 2025-12-18: Domain lanes introduced (`remote-interactive-shell`, `remote-management`, `remote-video`) with limits 2/1/2; protocol handlers reorganized under `Reverse_Tunnels/*/Protocols/*`; orchestrator renamed to `reverse_tunnel_orchestrator.py`; explicit stop API/Socket.IO cleanup; WebUI Disconnect/unload calls DELETE + CLOSE for immediate teardown.