QueryMT Agent - Mesh Networking
Mesh networking enables QueryMT Agent to collaborate across multiple machines, allowing sessions to be shared, delegates to run remotely, and LLM calls to be routed to specific nodes.
Overview
Mesh networking uses the kameo actor framework with libp2p for peer-to-peer communication. This enables:
- Cross-machine sessions: Share sessions across multiple machines
- Remote agents: Access agents running on other machines
- Distributed computation: Run heavy tasks on specialized hardware
- Load balancing: Distribute work across multiple nodes
Architecture
flowchart LR
subgraph A["Machine A"]
AS[Agent Session]
AG[GPU Worker]
end
subgraph B["Machine B"]
BS[Agent Session]
BL[LLM Provider]
end
AS <-->|"Internet<br/>(libp2p Mesh)"| BS
Quick Start
Starting a Mesh Node
Connecting to a Mesh
Configuration
Basic Mesh Configuration
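As a minimal sketch of a LAN mesh node, the fragment below sets only fields listed in the Configuration Reference that follows. The `[mesh]` table name is taken from the multi-transport reference later in this document; the port and node name are illustrative values, not defaults.

```toml
# Minimal LAN mesh node (sketch; values are illustrative)
[mesh]
enabled = true                      # mesh networking is off by default
listen = "/ip4/0.0.0.0/tcp/9000"    # fixed port instead of the default tcp/0
transport = "lan"                   # default transport
discovery = "mdns"                  # default discovery method
node_name = "dev-workstation"       # hypothetical name; defaults to the OS hostname
```

Pinning the port is optional, but it makes firewall rules easier to write than with an OS-assigned port.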
Configuration Reference
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable mesh networking |
| listen | string | /ip4/0.0.0.0/tcp/0 | Multiaddr to listen on |
| transport | string | "lan" | Transport layer: "lan" or "iroh" |
| discovery | string | "mdns" | Peer discovery method |
| auto_fallback | bool | false | Allow mesh provider discovery fallback |
| node_name | string | OS hostname | Human-readable node name advertised to peers |
| identity_file | string | ~/.qmt/mesh_identity.key | Path to ed25519 identity keypair |
| request_timeout_secs | u64 | 300 | Timeout for non-streaming requests (seconds) |
| stream_reconnect_grace_secs | u64 | 120 | Grace period for stream reconnection (seconds) |
| invite | string | - | Invite token to join existing mesh (supports ${VAR} interpolation) |
Transport Modes
QueryMT supports two transport modes optimized for different network environments:
LAN Transport (Default)
Traditional TCP + QUIC transport optimized for local area networks:
Characteristics:
- ✅ Fast: Optimized for low-latency LAN connections
- ✅ Zero-config discovery: Automatic peer discovery via mDNS
- ✅ Low overhead: Direct TCP/QUIC connections
- ❌ LAN only: Cannot traverse NAT or connect over internet
- ❌ Subnet limitations: mDNS may not work across different subnets

Use cases:
- Development environments on local network
- Office networks with multiple machines
- High-bandwidth, low-latency requirements
Iroh Transport
Internet-capable transport with NAT traversal using the iroh networking library:
Characteristics:
- ✅ Internet-capable: Works across the internet
- ✅ NAT traversal: Automatic hole punching and relay fallback
- ✅ Encrypted: Built-in encryption and authentication
- ✅ Relay network: Falls back to relay servers when direct connection fails
- ❌ Higher latency: Slightly higher overhead than direct LAN
- ❌ Requires invite: Typically uses invite tokens for secure joining

Use cases:
- Remote team collaboration
- Distributed nodes across different networks
- Nodes behind NAT or firewalls
- Cloud + on-premises hybrid setups
Comparison Table
| Feature | LAN Transport | Iroh Transport |
|---|---|---|
| Network | Local network only | Internet-wide |
| Discovery | mDNS (automatic) | Invite tokens |
| NAT Traversal | No | Yes (hole punching + relay) |
| Latency | Very low | Low-medium |
| Setup Complexity | Minimal | Requires invite workflow |
| Security | Network-level | Built-in encryption |
| Best For | Development, LAN | Production, remote teams |
Discovery Methods
mDNS (Default)
Automatic discovery on local network:
- Pros: Zero-config, automatic
- Cons: Local network only, may miss peers on different subnets
Kademlia DHT
Distributed discovery across the internet:
- Pros: Cross-subnet, internet-wide
- Cons: Requires bootstrap nodes, more complex
Manual Peers
Explicit peer connections:
- Pros: Precise control, reliable
- Cons: Manual configuration required
Iroh Relay Discovery
Automatic discovery via iroh relay network (used with iroh transport):
How it works:
1. Host creates invite token with embedded PeerId
2. Joiner connects to inviter via iroh relay
3. Relay network handles NAT traversal
4. Direct connection established when possible (hole punching)
5. Falls back to relay if direct connection fails

Pros:
- Works across the internet
- Automatic NAT traversal
- No manual peer configuration needed
- Encrypted by default

Cons:
- Requires invite token workflow
- Slightly higher latency than LAN
- Depends on iroh relay network availability
Discovery Method Selection Guide
| Scenario | Recommended Method | Reason |
|---|---|---|
| Development on local LAN | mdns + lan transport | Zero-config, automatic |
| Office network, multiple subnets | kademlia + lan transport | Cross-subnet discovery |
| Remote team, internet nodes | Invite tokens + iroh transport | NAT traversal, secure |
| Production, fixed infrastructure | none + explicit peers | Predictable, reliable |
| Mixed environment | Multi-transport (see below) | Best of both worlds |
Multi-Transport Setup
QueryMT supports running multiple transports simultaneously, enabling nodes to participate in both LAN and internet meshes at the same time.
Multi-Transport Architecture
flowchart TB
subgraph Node["Mesh Node"]
direction TB
LAN["LAN Transport<br/>(mDNS discovery)"]
Iroh1["Iroh Scope: 'personal'<br/>(Invite-based)"]
Iroh2["Iroh Scope: 'team'<br/>(Invite-based)"]
end
LAN <--> Peers1["LAN Peers"]
Iroh1 <--> Peers2["Internet Peers<br/>(personal mesh)"]
Iroh2 <--> Peers3["Internet Peers<br/>(team mesh)"]
Basic Multi-Transport Configuration
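A sketch of a node that joins the LAN mesh plus two Iroh scopes, mirroring the diagram above. It uses only the fields documented in the two reference tables below; the scope names come from the diagram, and the environment variable name `TEAM_MESH_INVITE` is a placeholder chosen for this example.

```toml
[mesh]
enabled = true

# LAN transport; listen and discovery fall back to the [mesh] values when omitted
[mesh.lan]
enabled = true
discovery = "mdns"

# Scope without an invite: the name becomes the mesh_id
[[mesh.iroh]]
enabled = true
name = "personal"

# Scope that joins an existing mesh through an invite token
[[mesh.iroh]]
enabled = true
name = "team"
invite = "${TEAM_MESH_INVITE}"   # placeholder variable name
```

Each `[[mesh.iroh]]` entry is one element of an array of tables, so further scopes can be added by repeating the header.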
Configuration Reference
LAN Subtable ([mesh.lan])
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable LAN transport |
| listen | string | Inherited from [mesh] listen | LAN-specific listen address |
| discovery | string | Inherited from [mesh] discovery | LAN discovery method |
Iroh Scope Array ([[mesh.iroh]])
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable this Iroh scope |
| name | string | - | Scope identifier (becomes mesh_id when no invite) |
| invite | string | - | Invite token to join existing mesh (supports ${VAR}) |
Multi-Transport Use Cases
Use Case 1: LAN + Personal Internet Mesh
Local development with ability to connect from home:
Benefits:
- Fast LAN connections for local machines
- Access from home via internet mesh
- Single node participates in both networks
Use Case 2: Multiple Team Meshes
Participate in multiple internet meshes:
Benefits:
- Single node accessible from multiple teams
- Team-specific mesh isolation
- Centralized node management
Use Case 3: LAN with Internet Fallback
Primary LAN mesh with internet access for remote workers:
Multi-Transport Routing
When a node participates in multiple meshes, routing considers:
- Direct connections: Prefer direct peer connections
- LAN priority: LAN connections preferred over internet (lower latency)
- Scope matching: Route to peers in same scope when possible
- Fallback: Use relay if direct connection fails
Multi-Transport Security
Each transport maintains separate security context:
- LAN: Network-level security (same subnet trust)
- Iroh scopes: Cryptographic isolation between scopes
- Invite tokens: Per-scope access control
- Identity: Same node identity across all transports
flowchart LR
subgraph Security["Security Layers"]
LAN["LAN<br/>(Network trust)"]
Iroh["Iroh Scopes<br/>(Crypto isolation)"]
Token["Invite Tokens<br/>(Access control)"]
end
LAN --> Iroh --> Token
Migrating to Multi-Transport
From LAN-only to Multi-Transport
Before (LAN only):
After (LAN + Internet):
From Single Iroh to Multi-Transport
Before (Single Iroh):
After (Multiple Iroh scopes):
Multi-Transport Best Practices
- Name scopes descriptively: Use meaningful names for Iroh scopes
- Environment variables: Store invites in env vars, not config files
- Scope isolation: Keep different teams/projects in separate scopes
- Monitor connectivity: Check all transports are healthy
- Test failover: Verify behavior when one transport fails
- Document topology: Keep track of which nodes are in which scopes
Multiaddr Format
Mesh addresses use libp2p multiaddr format:
Examples
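For illustration, here are a few multiaddrs as they might appear in the `listen` field. The address shapes follow the general multiaddr convention of `/protocol/value` pairs. The port numbers are arbitrary, and which of these forms QueryMT accepts is an assumption not confirmed by this document.

```toml
[mesh]
# IPv4, all interfaces, OS-assigned TCP port (the documented default)
listen = "/ip4/0.0.0.0/tcp/0"

# Alternatives (uncomment one; a TOML key may only be set once):
# listen = "/ip4/0.0.0.0/tcp/9000"            # IPv4, fixed TCP port
# listen = "/ip6/::/tcp/9000"                 # IPv6, all interfaces
# listen = "/ip4/0.0.0.0/udp/9000/quic-v1"    # QUIC over UDP
```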
Invite Token System
QueryMT provides an invite token system for joining meshes dynamically, which is particularly useful with the internet-capable iroh transport. Tokens are cryptographically signed so that only nodes holding a valid invite can join.
Overview
Invite tokens are signed grants that allow nodes to securely join an existing mesh:
- Cryptographically signed: Each invite is signed with the host's ed25519 identity keypair
- Self-verifying: Joiners can verify the signature offline (no network required)
- Time-limited: Tokens expire after configurable duration (default: 24 hours)
- Use-limited: Tokens can be single-use or have a maximum use count
- Role-based: Tokens grant specific permissions (member vs client role)
- Shareable: Compact format fits in QR codes or URLs
Token Format
Invite tokens are encoded in multiple formats:
- Base64url string: Compact string for CLI usage (~470 characters)
- QR code: Scannable QR code (version 14, 73x73 modules)
- URL format: qmt://mesh/join/<base64_token>
- Direct CLI: Pass token directly to --mesh-join
Creating Invites
Via CLI
Via TOML Configuration
Joining via Invite Tokens
Via CLI
Via TOML Configuration
Invite Security Model
Ed25519 Identity Keypair
Each node has a persistent identity keypair:
- Location: ~/.qmt/mesh_identity.key (configurable via identity_file)
- Generated: Automatically on first run
- Persistent: Same PeerId across restarts
- Secure: Private key never leaves the host
Signature Verification
The invite token verification process:
- Host creates invite: Signs invite grant with ed25519 private key
- Token shared: Base64url encoded token shared out-of-band
- Joiner receives token: Decodes and extracts grant + signature
- Offline verification: Verifies signature using inviter's public key (embedded in token)
- Connection attempt: Dials inviter's PeerId via iroh relay network
- Mesh join: Establishes encrypted connection and joins mesh
sequenceDiagram
participant Host as Mesh Host
participant Token as Invite Token
participant Joiner as Joining Node
Host->>Host: Generate invite grant
Host->>Host: Sign with ed25519 private key
Host->>Token: Encode as base64url (~470 chars)
Token->>Joiner: Share via URL/QR/CLI
Joiner->>Joiner: Decode and verify signature (offline)
Joiner->>Host: Dial PeerId via iroh relay
Host->>Joiner: Establish encrypted connection
Joiner->>Host: Request mesh membership
Host->>Joiner: Grant membership (check expiry, use count)
Invite Grant Structure
QR Code Support
Invite tokens fit in QR codes for easy mobile sharing:
- Size: QR version 14 (73x73 modules)
- Capacity: ~470 characters
- Format: qmt://mesh/join/<base64_token>
Invite Management
Environment Variable Interpolation
TOML configuration supports ${VAR} syntax for secure token storage:
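A sketch of joining a mesh with an invite token kept out of the config file. `invite` and `transport` are fields from the Configuration Reference; the variable name `QMT_MESH_INVITE` is a placeholder chosen for this example, and the exact point at which interpolation happens is an assumption.

```toml
[mesh]
enabled = true
transport = "iroh"
# Expected to be resolved from the environment when the config is loaded
# (assumption), so the token itself never appears in the file
invite = "${QMT_MESH_INVITE}"
```

The variable would be exported in the shell or supplied by a CI secret store before the node starts.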
Token Lifecycle
- Creation: Host generates signed grant with TTL and use limits
- Distribution: Token shared via secure channel (URL, QR, CLI)
- Verification: Joiner verifies signature offline
- Connection: Joiner connects via iroh relay network
- Validation: Host checks expiry and use count
- Membership: Joiner granted mesh access
- Revocation: Token expires or use limit reached
Example Workflows
Workflow 1: Quick Team Setup
Workflow 2: QR Code Sharing
Workflow 3: Environment Variable for CI/CD
Best Practices
- Use short TTLs: Default 24h is good for temporary access
- Limit uses: Set invite_uses=1 for single-use tokens
- Secure distribution: Share tokens via encrypted channels
- Environment variables: Store tokens in env vars, not config files
- Rotate tokens: Create new tokens periodically
- Monitor usage: Track invite acceptance in logs
Security Considerations
- ✅ Strong cryptography: Ed25519 signatures (64 bytes)
- ✅ Offline verification: No network needed to verify token
- ✅ Time-limited: Tokens expire automatically
- ✅ Use-limited: Prevents token reuse
- ✅ Role-based: Granular permission control
- ✅ Signed: Tamper-proof (signature invalid if modified)
- ✅ Auditable: Each invite has unique ID for tracking
- ⚠️ Token exposure: Treat tokens like passwords - don't commit to git
- ⚠️ Transport security: Always use iroh (encrypted) for internet meshes
- ⚠️ Revocation: Tokens cannot be revoked before expiry (use short TTLs)
Remote Agents
Define agents that run on remote mesh nodes:
Remote Delegate
Delegates can run on remote nodes:
Behavior:
- LLM calls are routed to the remote node
- Tool execution happens locally on the planner node
- Enables "remote model, local session" pattern
Identity Management
Each mesh node has a persistent cryptographic identity, which keeps its PeerId stable across restarts.
Identity Keypair
- Algorithm: Ed25519 (elliptic curve)
- Location: ~/.qmt/mesh_identity.key (configurable)
- Persistence: Same PeerId across restarts
- Auto-generation: Created automatically on first run
Configuration
Custom Identity Path
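A sketch of pointing a node at a non-default identity file, using the `identity_file` field from the Configuration Reference. The path shown is illustrative only.

```toml
[mesh]
enabled = true
# Illustrative path; the documented default is ~/.qmt/mesh_identity.key
identity_file = "/var/lib/qmt/mesh_identity.key"
```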
Identity Management Best Practices
- Backup identity: Copy ~/.qmt/mesh_identity.key to preserve PeerId
- Secure storage: Protect identity file (chmod 600)
- Don't share: Identity contains private key
- Regeneration: Delete file to generate new identity (changes PeerId)
Node Name
Override the default hostname advertised to peers:
Useful when:
- OS hostname is meaningless (e.g., "unknown" on mobile)
- Multiple nodes on same machine
- Want descriptive names for peers
Session Management
Creating Remote Sessions
Listing Remote Nodes
Use list_remote_nodes() to discover available mesh peers and their capabilities:
Attaching Existing Sessions
Attach to a session running on another node via the dashboard UI or the attach_remote_session API.
Forking Sessions
Create a copy of a remote session at a specific message point:
Use cases:
- Branching conversations: Try different approaches from same starting point
- Experimentation: Fork to test risky changes without affecting original
- Collaboration: Team members fork to work on different aspects
- Rollback: Fork from known good state if current path isn't working
Resuming Sessions
Reconnect to sessions across node restarts or disconnections:
Session recovery features:
- Automatic reconnection: Handles transient network failures
- State preservation: Session state persisted across restarts
- History recovery: Full conversation history maintained
- Graceful degradation: Continues working even if some peers unavailable
Remote Session Lifecycle
stateDiagram-v2
[*] --> Created: create_remote_session()
Created --> Active: First message
Active --> Active: Messages
Active --> Forked: fork_remote_session()
Active --> Suspended: Node restart/disconnect
Suspended --> Active: resume_remote_session()
Active --> Archived: Explicit close
state Active {
[*] --> Processing
Processing --> Waiting: Tool execution
Waiting --> Processing: Tool complete
Processing --> Streaming: LLM response
Streaming --> Processing: Stream complete
}
Session Persistence
Remote sessions are persisted on the host node:
Persistence features:
- SQLite storage: Sessions stored in SQLite database
- History preservation: Full message history with metadata
- Tool results: Tool execution results preserved
- Context snapshots: Compacted context stored for recovery
Error Handling
Connection Failures
Remote operations may fail due to network issues, peer unavailability, or timeout. Check logs for specific error details and adjust request_timeout_secs or stream_reconnect_grace_secs as needed.
Stream Disconnections
During stream disconnection:
1. Buffer messages: Messages buffered locally
2. Attempt reconnect: Automatic reconnection attempts
3. Grace period: Configurable wait time (default: 120s)
4. Resume or fail: Either reconnects successfully or fails gracefully
Best Practices for Remote Sessions
- Use descriptive session IDs: Makes it easier to resume
- Regular checkpoints: Fork sessions before major changes
- Monitor node health: Check node availability before delegation
- Handle failures gracefully: Implement fallback to local execution
- Set appropriate timeouts: Balance responsiveness vs reliability
- Use session names: Add metadata to identify sessions
Routing
Routing Table
The mesh maintains a routing table that maps agents to nodes:
Routing Snapshot
Use Cases
1. GPU-Accelerated Coding
flowchart LR
subgraph Local["Local Machine (CPU)"]
PA["Planner Agent<br/>(Lightweight)"]
LM[Local Model]
end
subgraph Remote["Remote Machine (GPU)"]
CA["Coder Agent<br/>(GPU-accelerated)"]
GM[GPU Model]
end
PA -->|"Delegates task"| CA
PA -->|"LLM calls<br/>(routed to GPU)"| GM
CA -->|"Fast model inference"| GM
Configuration:
2. Distributed Team Collaboration
flowchart TD
S1["Session 1<br/>(Feature A)"]
S2["Session 2<br/>(Feature B)"]
S3["Session 3<br/>(Feature C)"]
SS[("Shared State")]
S1 & S2 & S3 <--> SS
Benefits:
- Share session state across team members
- Collaborate on same codebase
- Real-time synchronization
3. Load Distribution
flowchart LR
subgraph LB["Load Balancer Node"]
SR["Session Router<br/>- Distribute<br/>- Monitor"]
end
subgraph WN["Worker Nodes"]
W1["Worker 1<br/>(Handle Tasks)"]
W2["Worker 2<br/>(Handle Tasks)"]
end
SR --> W1 & W2
Configuration:
4. Specialized Hardware
flowchart LR
subgraph DM["Development Machine"]
DA[Development Agent]
end
subgraph SN["Specialized Nodes"]
GPU[GPU Worker]
TPU[TPU Worker]
FPGA[FPGA Worker]
end
DA --> GPU & TPU & FPGA
Streaming Stability
QueryMT handles network interruptions during LLM streaming across the mesh with automatic reconnection, local message buffering, and configurable timeouts.
Stream Reconnection
When a remote streaming connection is interrupted, QueryMT automatically attempts reconnection:
Reconnection behavior:
- Detection: Stream interruption detected via heartbeat timeout
- Buffering: Messages buffered locally during disconnection
- Reconnection attempts: Automatic reconnection with exponential backoff
- Grace period: Configurable wait time before failing (default: 120s)
- Resume or fail: Stream resumes or fails gracefully after timeout
sequenceDiagram
participant Client as Client Node
participant Stream as Stream Handler
participant Remote as Remote Node
Client->>Stream: Start streaming request
Stream->>Remote: Initiate stream
loop Stream Active
Remote->>Stream: Stream chunks
Stream->>Client: Forward chunks
end
Note over Stream,Remote: Network interruption
Stream->>Stream: Detect disconnection
Stream->>Stream: Buffer messages
Stream->>Remote: Attempt reconnection
alt Reconnect Success
Remote->>Stream: Resume stream
Stream->>Client: Continue forwarding
else Reconnect Timeout
Stream->>Client: Fail gracefully
end
Streaming Timeouts
QueryMT distinguishes between different timeout scenarios for better reliability:
Request Timeout
Timeout for non-streaming mesh requests (e.g., compaction, metadata queries):
Use cases: - Agent metadata queries - Session operations (create, fork, list) - Model listing - Node info requests
Streaming Behavior
For streaming LLM responses:
- First-chunk timeout: The mesh waits for the first response chunk after initiating a stream
- Idle detection: Monitors for stalled streams during streaming
- Reconnection: If a stream disconnects, the system attempts reconnection within the stream_reconnect_grace_secs window
These behaviors are managed internally by the mesh transport. Configure the grace period for reconnection:
Transient Disconnect Handling
QueryMT gracefully handles temporary network issues:
Symptoms of Transient Disconnects
- Brief network interruptions (WiFi switching, VPN reconnect)
- Packet loss
- Temporary DNS failures
- NAT re-binding
Handling Strategy
Configuration for Unstable Networks
For networks with frequent interruptions:
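A sketch of relaxed timeouts for such a network, using the two timeout fields from the Configuration Reference. The specific values are illustrative rather than recommended settings; the grace period sits at the upper end of the 120-300s range this document suggests for slow or unreliable links.

```toml
[mesh]
enabled = true
request_timeout_secs = 600          # illustrative; the default is 300
stream_reconnect_grace_secs = 300   # illustrative; the default is 120
```

Longer windows trade responsiveness for reliability: a genuinely dead peer takes longer to be reported as failed.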
Mesh Provider Stability
For remote model providers in the mesh:
Fallback Strategies
Configure automatic fallback when remote providers fail:
Fallback chain:
1. Try primary remote provider
2. If unavailable, try other providers in mesh
3. If all remote providers fail, use local provider
4. If local provider fails, return error
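Assuming the `auto_fallback` field from the Configuration Reference is the switch for the chain above (the reference describes it only as "mesh provider discovery fallback"), enabling it might look like this sketch:

```toml
[mesh]
enabled = true
auto_fallback = true   # off by default
```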
Monitoring Stream Health
Log Monitoring
Enable detailed stream logging:
Key log messages:
- Stream reconnected successfully - Good
- Stream reconnection timeout - Check network
- Provider health check failed - Provider issue
- Transient disconnect detected - Network instability
Metrics to Watch
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Stream reconnection rate | < 1% | 1-5% | > 5% |
| Average latency | < 100ms | 100-500ms | > 500ms |
| Provider availability | > 99% | 95-99% | < 95% |
| Error rate | < 0.1% | 0.1-1% | > 1% |
Best Practices for Streaming Stability
- Set appropriate timeouts: Balance responsiveness vs reliability
  - Fast networks: Lower timeouts (30-60s)
  - Slow/unreliable: Higher timeouts (120-300s)
- Monitor reconnection rates: High rates indicate network issues
- Use health checks: Implement periodic provider health checks
- Implement client-side buffering: Buffer messages during reconnection
- Provide user feedback: Show reconnection status in UI
- Test network resilience: Simulate network failures in testing
- Log stream events: Enable detailed logging for debugging
- Configure per-network: Adjust settings for different network conditions
- Use exponential backoff: Prevent thundering herd on reconnection
- Graceful degradation: Fall back to local execution when mesh fails
Troubleshooting Streaming Issues
Symptom: Frequent Stream Timeouts
Possible causes:
- Network latency too high
- Remote node overloaded
- Timeout values too low
Solutions:
1. Increase stream_reconnect_grace_secs
2. Check remote node health
3. Reduce mesh complexity
4. Use closer geographic nodes
Symptom: Stream Never Reconnects
Possible causes:
- Network partition
- Remote node crashed
- Identity mismatch

Solutions:
1. Check network connectivity
2. Verify remote node is running
3. Check mesh peer configuration
4. Restart mesh on affected nodes
Symptom: High Latency During Streaming
Possible causes:
- Network congestion
- Too many hops in mesh
- Provider overload

Solutions:
1. Check bandwidth utilization
2. Optimize mesh topology
3. Load balance across providers
4. Use LAN transport when possible
Security
Peer Authentication
Mesh nodes authenticate using libp2p's built-in peer ID system:
Firewall Configuration
Required ports for mesh networking:
| Direction | Port | Protocol | Purpose |
|---|---|---|---|
| Inbound | 9000 (default) | TCP | Mesh connections |
| Outbound | Any | TCP/UDP | Peer discovery |
Example firewall rules:
NAT Traversal
For nodes behind NAT:
- Port forwarding: Forward mesh port to internal node
- UPnP: Enable UPnP for automatic port forwarding
- Relay: Use libp2p relay servers
Monitoring
Node Status
Event Logging
Enable mesh logging:
Metrics
Key metrics to monitor:
- Connected peers: Number of active connections
- Latency: Round-trip time to peers
- Bandwidth: Data transfer rates
- Session count: Number of sessions per node
Troubleshooting
Cannot Connect to Peer
Symptoms: Mesh node shows no connected peers
Solutions:
1. Check firewall allows mesh port
2. Verify peer address is correct
3. Ensure peer is running and listening
4. Check NAT/firewall configuration
High Latency
Symptoms: Slow responses from remote agents
Solutions:
1. Check network bandwidth
2. Reduce mesh complexity (fewer peers)
3. Use closer geographic nodes
4. Increase request_timeout_secs
Peer Discovery Issues
Symptoms: Cannot find peers automatically
Solutions:
1. Try explicit peer configuration
2. Check mDNS is enabled on network
3. Verify firewall allows multicast
4. Use Kademlia for cross-subnet discovery
Session Attachment Fails
Symptoms: Cannot attach to remote session
Solutions:
1. Verify session exists on remote node
2. Check peer has correct permissions
3. Ensure mesh is properly configured
4. Review error logs for details
Best Practices
Network Configuration
- Use static IPs for mesh nodes
- Configure port forwarding for NAT environments
- Monitor bandwidth usage
- Use dedicated ports for mesh traffic
Node Organization
- Group by function: Separate planner and worker nodes
- Consider geography: Place nodes close to users
- Plan for redundancy: Multiple nodes for critical tasks
- Document topology: Keep track of node roles
Security
- Use strong peer IDs: Generate unique keys
- Limit peer access: Only allow known peers
- Monitor connections: Watch for unauthorized access
- Encrypt traffic: Use TLS where possible
Examples
Full Mesh Configuration
Command Line Examples
Related Documentation
- Configuration Guide - Mesh configuration options
- Profiles - Mesh-enabled profiles
- Delegation - Remote delegation patterns
- Examples - Mesh networking examples
- API Reference - Mesh API types and functions