Design Audit: radicle-reticulum

April 2026 — based on codebase at src/radicle_reticulum/ and protocol documentation


1. What Radicle Needs (Protocol Summary)

Transport

Radicle Heartwood runs a single TCP listener on port 8776 (configurable). Every peer-to-peer session is a persistent, full-duplex TCP stream. There is no UDP, no QUIC; TCP is non-negotiable.

Session handshake

After TCP connection, the two nodes perform a Noise XK handshake — the same pattern used by the Lightning Network. The initiating node knows the responder's static public key (the Node ID / NID, a did:key:z6Mk… Ed25519 key) before connecting. After the handshake both ends have a shared symmetric key and a verified identity. Everything from that point on is encrypted and authenticated.

Application protocol on top of Noise

Over the established Noise session, Radicle runs a gossip + multiplexed git protocol:

  1. Hello / version exchange — negotiated immediately after handshake.
  2. Gossip messages (three types):
    • Node Announcement — broadcasts NID and reachable address(es); enables peer discovery and routing table updates.
    • Inventory Announcement — broadcasts the list of repository IDs (RIDs) this node hosts; received by all connected peers who relay it further.
    • Ref Announcement — broadcasts that a particular ref changed in a particular RID; relayed only to nodes that seed that RID.
  3. Git fetch / upload-pack — multiplexed over the same TCP+Noise stream using the raw Git smart-HTTP wire protocol. When a node receives a ref announcement for a repo it seeds, it opens a git-fetch sub-session to the announcing node's TCP socket.

The gossip messages are framed with a length prefix and serialised (the exact codec is CBOR in the current implementation, though the public docs describe it as "compact binary"). The Git sub-sessions use Git's native pkt-line / pack-protocol framing, negotiated inline.

What rad node connect does

rad node connect <NID>@<host>:<port> tells the local radicle-node daemon to:

  1. Open a TCP connection to <host>:<port>.
  2. Perform the Noise XK handshake using <NID> as the expected remote static key.
  3. Send a Hello message and enter the gossip loop.
  4. From that point the two nodes exchange inventory and ref announcements, and trigger git-fetches as needed.

The call is persistent-session setup, not a one-shot sync. Once two nodes are "connected peers" they stay connected and sync automatically as refs change.

What rad push / rad fetch / rad sync do

All three commands speak to the local radicle-node daemon over a Unix socket (not over the network directly). The daemon then handles the network side:

  • rad push — writes new commits to local storage, then the daemon sends a ref announcement to all connected peers.
  • rad fetch / rad sync --fetch — asks the daemon to pull refs for a given RID from known seeds; the daemon opens git-fetch sub-sessions over existing (or newly established) Noise sessions.

The user-facing commands have no network code of their own. They are thin CLI wrappers around the local daemon IPC. This is the critical insight for the bridge: the bridge only needs to make radicle-node believe it has a working TCP peer. All gossip, inventory management, and git transfer happen inside radicle-node itself once the TCP connection is in place.
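For concreteness, the registration step the bridge performs reduces to one CLI invocation against the local daemon. A minimal sketch, assuming a hypothetical helper name and port argument (bridge.py's actual code differs):

import subprocess

def register_remote_peer(nid: str, listen_port: int) -> None:
    # Tell the local radicle-node that a peer with this NID is reachable
    # at the bridge's local listen port. radicle-node then dials
    # 127.0.0.1:<listen_port>, runs its Noise XK handshake through the
    # tunnel, and treats the bridge as an ordinary TCP peer.
    subprocess.run(
        ["rad", "node", "connect", f"{nid}@127.0.0.1:{listen_port}"],
        check=True,
    )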

Peer discovery (without the bridge)

On the public internet, radicle-node bootstraps from two hard-coded seed DNS names (seed.radicle.garden:8776, seed.radicle.xyz:8776). From those seeds it learns about other peers through Node Announcements. The seeds relay Inventory Announcements so that every node eventually knows which node hosts which repo.


2. What Reticulum Provides (Relevant Primitives)

Addressing and routing (no configuration needed)

Every RNS node has a 128-bit destination hash derived from its public key plus application-name aspects. Routing is entirely source-agnostic: transport nodes relay packets one hop closer to the destination hash without knowing the full path. A new node discovers reachable peers within ~1 minute purely through the announce mechanism.

Announce mechanism

destination.announce(app_data=...) propagates a signed packet across all interfaces with a 2% bandwidth cap per interface. Transport nodes re-broadcast with randomised delays. Any node can embed arbitrary app_data (up to ~400 bytes on LoRa) in the announce. This is free peer discovery — no tracker, no DNS seed, no configuration.
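A minimal sketch of the announce pattern the bridge relies on, assuming illustrative app-name aspects ("radicle"/"bridge") and payload layout rather than bridge.py's exact values:

import RNS

reticulum = RNS.Reticulum()        # bridge.py reuses an existing instance
identity = RNS.Identity()          # persisted across restarts in practice
destination = RNS.Destination(
    identity, RNS.Destination.IN, RNS.Destination.SINGLE,
    "radicle", "bridge",
)
# Embed the Radicle NID so remote bridges know which peer to register.
radicle_nid = "z6Mk..."            # the local radicle-node's Node ID
destination.announce(app_data=radicle_nid.encode("utf-8"))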

RNS.Link — encrypted sessions

RNS.Link(destination) performs a 3-packet (297 bytes total) ECDH handshake and then provides:

  • Encrypted, forward-secret bidirectional channel.
  • Per-packet delivery confirmation via signed proof.
  • Callbacks: set_packet_callback, set_link_closed_callback, set_link_established_callback.
  • link.identify(identity) — reveal the initiator's identity to the responder inside the encrypted channel.

This is conceptually equivalent to Radicle's Noise XK session but implemented by RNS transparently.
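Initiating a link from a destination hash learned through an announce looks roughly like this (standard RNS calls; dest_hash and my_identity are assumed to be in scope):

import time
import RNS

if not RNS.Transport.has_path(dest_hash):
    RNS.Transport.request_path(dest_hash)      # ask the network for a route
    while not RNS.Transport.has_path(dest_hash):
        time.sleep(0.1)

remote_identity = RNS.Identity.recall(dest_hash)
destination = RNS.Destination(
    remote_identity, RNS.Destination.OUT, RNS.Destination.SINGLE,
    "radicle", "bridge",
)
link = RNS.Link(
    destination,
    established_callback=lambda l: l.identify(my_identity),  # reveal initiator
    closed_callback=lambda l: print("link closed"),
)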

RNS.Packet — fire-and-forget (< ENCRYPTED_MDU = 383 bytes)

Used for small messages that fit in a single LoRa frame. No delivery guarantee. RNS.Packet(dest_or_link, data).send(). Suitable for gossip notifications.

RNS.Resource — reliable large-data transfer

RNS.Resource(data_or_filehandle, link, callback=...) handles arbitrary-size reliable transfer over an established Link: automatic chunking, sequencing, compression, integrity check. This is the right primitive for git pack transfers; it avoids the per-packet 383-byte limit and provides end-to-end reliability. Not currently used anywhere in the codebase.
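A sketch of what a pack transfer would look like as a Resource over an established link (pack_bytes and handle_pack are placeholders; the receiving side must opt in):

import RNS

# Sender: chunking, compression, sequencing and integrity are automatic.
def on_concluded(resource):
    if resource.status == RNS.Resource.COMPLETE:
        print("pack delivered")

RNS.Resource(pack_bytes, link, callback=on_concluded)

# Receiver: accept incoming resources and collect the reassembled data.
link.set_resource_strategy(RNS.Link.ACCEPT_ALL)
link.set_resource_concluded_callback(
    lambda resource: handle_pack(resource.data.read())
)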

RNS.Channel / RNS.Buffer — socket-like streams

link.get_channel() returns a Channel for typed message exchange. RNS.Buffer.create_bidirectional_buffer(...) wraps a Channel in Python file-like IO objects (BufferedRWPair). This enables streaming reads and writes exactly like a TCP socket — a natural fit for tunnelling Radicle's persistent TCP stream.
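In isolation the pattern looks like this (a generic sketch: both ends must pair stream IDs, here 0/0; link is an established RNS.Link):

import RNS

def on_ready(ready_bytes):
    # Fires when ready_bytes of received data are available to read.
    data = stream.read(ready_bytes)
    print("received:", data)

channel = link.get_channel()
stream = RNS.Buffer.create_bidirectional_buffer(
    receive_stream_id=0, send_stream_id=0,
    channel=channel, ready_callback=on_ready,
)
stream.write(b"hello over RNS")    # buffered write, socket-like
stream.flush()                     # push queued bytes into the Channel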

Interface diversity

RNS handles LoRa (RNode), packet radio (KISS/AX.25), TCP, UDP, I2P, serial — through the same API. The application code does not change between interfaces.

LoRa specifics (RNode interface)

  • Typical LoRa physical data rate: ~0.3 kbps (SF12) to ~5.5 kbps (SF7) at 125 kHz bandwidth.
  • Duty cycle limits in Europe: 1–10 % per sub-band; at the common 1 % limit a node may transmit at most ~36 seconds per hour.
  • RNS caps announce bandwidth at 2% per interface, which on a 1.2 kbps LoRa link is ~24 bps — one announce every few minutes.
  • RNS.Packet.ENCRYPTED_MDU = 383 bytes per packet.
  • Link establishment costs 297 bytes (2–3 frames at SF12).

3. Current Code: What's Right, What's Redundant, What's Missing

What's right

bridge.py — RadicleBridge

This is the core value of the project and it is essentially correct. The design — listen on TCP, accept from radicle-node, open RNS link to remote bridge, forward bytes bidirectionally — is the minimal correct architecture. Key good decisions:

  • Reuses an existing RNS instance (RNS.Reticulum.get_instance()).
  • Embeds the radicle NID in announce app_data, so remote bridges know which NID to register without a separate handshake.
  • Chunks TCP reads to RNS.Packet.ENCRYPTED_MDU (383 B) in _send_over_link — fixes the real LoRa blocker.
  • Per-bridge dedicated TCP server ports — avoids multiplexing confusion if multiple remote bridges are discovered.
  • State persistence (_save_state / _load_state) for NID-to-bridge-hash mapping survives restarts.
  • Path maintenance loop warms RNS routing table so connections are not delayed.
  • Reconnect logic in _forward_tcp_to_rns handles transient link drops.

gossip.py — GossipRelay

The gossip relay addresses a real gap: radicle-node only polls/syncs when it already knows about a peer. The gossip relay provides a lightweight side-channel to wake up remote nodes when local refs change, without sending git data over LoRa. This is good design. Specific strengths:

  • Watchdog inotify integration (_start_watcher) for instant detection on push.
  • Debounce (WATCHDOG_DEBOUNCE = 2.0s) absorbs multi-commit push bursts (see the sketch after this list).
  • MDU-aware payload splitting in _build_ref_payloads — one ref change = one or a few 383-byte packets.
  • Delta vs. full broadcast distinction reduces bandwidth.
  • auto_seed and auto_discover make the seed mode zero-configuration.
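The debounce pattern referenced in the list above, sketched with the watchdog library (class and function names here are illustrative, not gossip.py's actual API):

import threading
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

WATCHDOG_DEBOUNCE = 2.0  # seconds of quiet before firing, as in gossip.py

class RefWatcher(FileSystemEventHandler):
    def __init__(self, callback):
        self._callback = callback
        self._timer = None
        self._lock = threading.Lock()

    def on_any_event(self, event):
        # Restart the timer on every filesystem event: the callback fires
        # only after WATCHDOG_DEBOUNCE seconds of quiet, so a multi-commit
        # push burst collapses into a single notification.
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(WATCHDOG_DEBOUNCE, self._callback)
            self._timer.daemon = True
            self._timer.start()

def watch_refs(storage_path, on_refs_changed):
    observer = Observer()
    observer.schedule(RefWatcher(on_refs_changed), storage_path, recursive=True)
    observer.start()
    return observer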

seed.py — SeedNode

Correct and minimal: spawns a separate radicle-node process with its own RAD_HOME and a permissive seedingPolicy. Using DEVNULL for stdout/stderr prevents pipe buffer deadlock — a real bug that was fixed.

identity.py — RadicleIdentity

The DID ↔ RNS identity mapping is well-implemented. load_or_generate for persistent identity across restarts is exactly right. The clear documentation of the RNS identity vs. destination hash distinction is accurate.

cli.py

--lora shortcut flag for conservative announce delays and longer poll intervals is a good UX touch. cmd_setup health-checker is useful.

What's redundant

adapter.py — RNSTransportAdapter

This module is entirely superseded by bridge.py and should be deleted. It implements a parallel peer-discovery and connection mechanism using APP_NAME/ASPECT_NODE destinations that nothing in the working system uses. It creates its own RNS instance unconditionally (line 69) which will conflict if RadicleBridge is also running. The announce_repository method, ASPECT_REPO destinations, and the connect/connect_to_peer plumbing are all dead code. The cmd_node, cmd_ping, and cmd_peers CLI commands in cli.py use this adapter; these commands are vestigial and do not contribute to the bridge-based architecture.

link.py — RadicleLink

This wrapper around RNS.Link is not used by bridge.py or gossip.py. bridge.py calls RNS.Link directly and manages callbacks inline. RadicleLink adds a deque receive buffer and a recv(timeout) blocking call — which implies a request/response programming model that doesn't fit the streaming tunnel. The file can be deleted unless a future protocol layer needs it.

messages.py

The binary framing layer (NodeAnnouncement, InventoryAnnouncement, RefAnnouncement, Ping, Pong) duplicates what Radicle already does natively. radicle-node sends its own Node and Inventory Announcements over the Noise session. gossip.py uses JSON-over-RNS.Packet for its ref-change notifications (not this module). messages.py is used only by cmd_ping in cli.py — itself a dead command. This file can be deleted.

Identity mismatch

RadicleIdentity generates a fresh Ed25519 keypair for the RNS bridge identity. This RNS key has no relationship to the Radicle Node ID (NID). The bridge announces the Radicle NID as a string in app_data, but the RNS identity is unrelated. This is fine architecturally — the bridge is a transparent proxy, not a Radicle node — but it means:

  • You cannot derive the RNS bridge hash from a Radicle NID (they are unrelated).
  • The from_did path in RadicleIdentity (which tries to import a Radicle DID as an RNS identity) is impossible to use correctly: RNS needs both Ed25519 signing and X25519 encryption keys, while Radicle DIDs carry only an Ed25519 public key. This path should be removed or clearly documented as unsupported.

RNS.Packet for streaming tunnel data

bridge.py currently uses RNS.Packet(link, chunk).send() in a loop to stream TCP data over the RNS link. This works but is not optimal:

  • Each packet is a distinct encrypted unit with its own AES-256-CBC + HMAC overhead.
  • Packet ordering over a Link is guaranteed by RNS, but delivery is not: RNS.Packet over a Link is best-effort (no automatic retransmission at the packet level; only Resources provide that).
  • For Radicle's Noise session over the tunnel, dropped packets will corrupt the stream, causing the Noise session to fail and radicle-node to disconnect.
  • The reconnect logic in _forward_tcp_to_rns handles link drops (link went down entirely) but not within-link packet loss.

The correct primitive for streaming byte data over an RNS Link is RNS.Buffer (wrapping RNS.Channel), which provides ordered, reliable delivery. This is the most important correctness gap in the bridge.

What's missing

  1. RNS.Buffer / RNS.Channel for the tunnel stream (critical — see above). Replace _send_over_link / _on_rns_data with a BufferedRWPair in _handle_local_connection and _on_incoming_link. The buffer handles chunking, ordering, and reliability automatically.

  2. RNS.Resource for initial git clone / large pack transfers. When a node first syncs a repo (initial clone, large commit), the pack can be tens or hundreds of megabytes. RNS.Resource handles this with compression, sequencing, and checksumming. The bridge doesn't need this if it's purely a TCP proxy (the radicle-node-to-radicle-node git transfer goes through the tunnel naturally), but it matters for LoRa where the TCP tunnel is too slow for large transfers.

  3. No flow control on the TCP→RNS direction. Currently _forward_tcp_to_rns reads TCP as fast as available and sends RNS packets without any backpressure. If the RNS path is slow (LoRa), the TCP socket buffer fills, TCP flow control kicks in against radicle-node, which may time out its side of the connection. Need either RNS.Buffer (which handles this) or explicit rate limiting.

  4. RNS.Link per TCP connection is expensive on LoRa. Every new TCP connection from radicle-node triggers a full RNS link establishment (297-byte handshake = 2–3 LoRa frames). For a single rad sync session, radicle-node opens one connection and keeps it, so this is fine. But if radicle-node opens multiple parallel connections (e.g., for concurrent repo syncs), each gets its own link. A future optimisation is link multiplexing via RNS.Channel streams over a single link.

  5. No handling of radicle-node restart. If radicle-node restarts, it forgets all connected peers. The bridge detects this via TCP error and closes tunnels, but it does not re-register known NIDs with the new radicle-node instance. _load_state runs on bridge startup, not on radicle-node reconnect. A watchdog that polls the Unix socket or attempts rad node connect periodically would fix this (a minimal sketch follows this list).

  6. The RNSTransportAdapter (adapter.py) and the RadicleBridge both register announce handlers via RNS.Transport.register_announce_handler. If both are running (e.g., via cmd_node started alongside the bridge), every announce fires both handlers. This is harmless but wasteful.
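A minimal sketch of the watchdog item 5 calls for, assuming a known_peers mapping and that re-issuing rad node connect is harmless for already-connected peers:

import subprocess
import threading

def peer_watchdog(known_peers: dict, interval: float = 60.0) -> None:
    # known_peers maps NID -> local bridge listen port.
    def reconnect_all():
        for nid, port in known_peers.items():
            # Assumed idempotent: reconnecting an already-connected peer
            # should be a no-op for radicle-node.
            subprocess.run(
                ["rad", "node", "connect", f"{nid}@127.0.0.1:{port}"],
                check=False,
            )
        timer = threading.Timer(interval, reconnect_all)
        timer.daemon = True
        timer.start()

    reconnect_all()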


4. Minimum Viable Rewrite

The bridge architecture is correct. The rewrite goal is to make it more correct and simpler, not to add features. The recommended target:

radicle-node (TCP 8776)
    ↕ localhost TCP (one connection per peer session)
RadicleBridge (bridge.py — keep, refine)
    ↕ RNS.Buffer over RNS.Channel over RNS.Link
Remote RadicleBridge (same code)
    ↕ localhost TCP
radicle-node (TCP 8776)

What to keep

  • bridge.py — keep, replace RNS.Packet stream with RNS.Buffer
  • gossip.py — keep as-is; correct and complete
  • seed.py — keep as-is
  • identity.py — keep, but remove or clearly gate the from_did path (it cannot produce a usable RNS identity)
  • cli.py — keep cmd_bridge, cmd_seed, cmd_gossip, cmd_setup; remove cmd_node, cmd_ping, cmd_peers

What to cut

  • adapter.py — delete entirely
  • link.py — delete (not used by the working path)
  • messages.py — delete (binary gossip framing not used; Radicle handles this natively)
  • __init__.py — remove exports for RNSTransportAdapter, RadicleLink, MessageType, NodeAnnouncement, InventoryAnnouncement, RefAnnouncement, Ping, Pong, decode_message

The one structural fix: RNS.Buffer for the tunnel

In bridge.py, replace _send_over_link and _on_rns_data with Buffer-based IO:

# On outbound connection (_handle_local_connection):
channel = rns_link.get_channel()
buf = RNS.Buffer.create_bidirectional_buffer(
    receive_stream_id=0, send_stream_id=1, channel=channel,
    ready_callback=lambda ready: _drain_buffer_to_tcp(tunnel_id, ready),
)
# tunnel.buf = buf
# Forward TCP→RNS: tcp_socket.recv() → buf.write(data); buf.flush()
# Forward RNS→TCP: ready_callback fires → tcp_socket.sendall(buf.read(ready))

# On incoming link (_on_incoming_link): mirror with swapped stream IDs
# (receive_stream_id=1, send_stream_id=0)

This single change provides ordered, reliable delivery and eliminates the packet-loss-corrupts-stream problem.

Flow: what happens on rad push

  1. User runs rad push in their checkout.
  2. rad CLI writes the new commits to ~/.radicle/storage/, then tells the local radicle-node daemon via Unix socket.
  3. radicle-node sends a Ref Announcement over all its active TCP connections.
  4. One of those active connections is the bridge tunnel. From radicle-node's perspective the bridge is just another TCP peer: radicle-node itself dialled the bridge's per-peer listen port, because the bridge allocated that port and registered the remote peer's NID there via rad node connect.

The connection topology deserves a careful walk-through. There are two directions:

Outbound sync (local node to remote):

  • Remote bridge discovers local bridge via RNS announce.
  • Remote bridge calls rad node connect NID@127.0.0.1:<port> on its local radicle-node.
  • Remote radicle-node opens TCP to that port.
  • Remote bridge's accept loop picks it up, opens an RNS Link to the local bridge.
  • Local bridge's _on_incoming_link fires, opens TCP to local radicle-node at port 8776.
  • The session is now: remote radicle-node ↔ remote bridge ↔ RNS ↔ local bridge ↔ local radicle-node.

Push propagation:

  • Local user runs rad push → local radicle-node emits Ref Announcement on all sessions.
  • One of those sessions goes through the bridge tunnel.
  • Remote radicle-node receives the Ref Announcement; if it seeds the repo, it initiates git-fetch back on the same session (same TCP connection, multiplexed by the Radicle protocol).
  • GossipRelay detects the local ref change independently (via inotify or poll), sends a lightweight RNS packet to all known gossip peers.
  • Remote gossip relay receives this, calls rad sync --fetch --rid <RID> → remote radicle-node pulls via the existing bridge session.

The gossip layer is a belt-and-suspenders trigger: if the bridge TCP session is active, radicle-node gets the Ref Announcement natively and syncs automatically. The gossip relay is useful for nodes that are not currently bridged (bridge is down, no active RNS link) — they receive the gossip packet and re-establish the bridge + do a manual sync.
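The receive path of that gossip trigger, sketched (the JSON field name "rid" is illustrative; the CLI invocation follows the form described above):

import json
import subprocess

def on_gossip_packet(data: bytes, packet) -> None:
    # Packet callback registered on the gossip destination.
    try:
        notification = json.loads(data.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return                     # not a payload we understand
    rid = notification.get("rid")
    if rid:
        # Ask the local daemon to pull; the transfer itself runs over the
        # existing (or re-established) bridge session.
        subprocess.run(["rad", "sync", "--fetch", "--rid", rid], check=False)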

What radicle-rns should expose

The minimal UX is:

radicle-rns seed     # on always-on nodes (combines radicle-node + bridge + gossip)
radicle-rns bridge   # on user laptops (bridge only; user's radicle-node handles their own storage)

rad push, rad fetch, rad sync require no changes. They speak to the local daemon as always. The daemon believes it has normal TCP peers. The bridge is invisible.


5. LoRa-Specific Considerations

What is realistic over LoRa

LoRa at SF7/125 kHz gives about 5.5 kbps physical; SF12 (max range) is about 290 bps. After duty cycle (1% in EU868), RNS overhead, and RNS announce cap (2%), the effective throughput for application data is:

SF     Physical rate   Practical throughput   Time for 1 MB
SF7    5.5 kbps        ~4 kbps                ~33 min
SF10   1.2 kbps        ~800 bps               ~2.5 hr
SF12   290 bps         ~180 bps               ~11 hr
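A back-of-envelope check of the table (the ~70% efficiency factor is an assumption standing in for RNS framing overhead):

for sf, phys_bps in [("SF7", 5500), ("SF10", 1200), ("SF12", 290)]:
    practical = phys_bps * 0.7                    # assumed RNS efficiency
    hours_per_mb = 1_000_000 * 8 / practical / 3600
    print(f"{sf}: ~{practical:.0f} bps practical, 1 MB in ~{hours_per_mb:.1f} h")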

Feasible over LoRa:

  • Gossip ref-change notifications (a few hundred bytes per event) — always feasible
  • Small commits with small pack objects (< 50 KB) — feasible at SF7, slow at SF12
  • Ref announcements and node discovery — feasible, handled by RNS announce
  • Link establishment (297 bytes) — 3 frames at SF12, under 1 second at SF7

Not feasible over LoRa:

  • Initial clone of any non-trivial repository (pack objects typically 1–100 MB)
  • Large commit batches (many changed files, binary assets)
  • Frequent polling (gossip poll interval should be >= 120s on LoRa, not the default 30s — the --lora flag correctly sets this)

The workable pattern

  1. Initial clone via fast link (WiFi, Ethernet, internet): rad clone rad:<RID> in the normal way.
  2. Incremental sync over LoRa: subsequent rad push / rad sync for small commits. A 1-commit diff is typically 5–50 KB of pack data — a few minutes at SF10.
  3. Gossip relay always running: on LoRa, the gossip relay is more important than the bridge because it can send a 300-byte "go fetch" signal even when the bridge TCP session is not live. The bridge then re-establishes only when there is data to transfer.
  4. LoRa-safe announce delays: the --lora flag sets announce delays to 60/300/900 seconds. This matters because LoRa duty cycle limits mean frequent announces drain the airtime budget.

The case for RNS.Resource on LoRa

For medium-sized pack objects (1–500 KB), streaming them as raw TCP through the bridge is fragile: if a single RNS packet drops, the Noise session errors and radicle-node disconnects. RNS.Resource retransmits failed segments automatically. For the LoRa case, the recommended approach is:

  1. Intercept git pack data at the bridge layer (parse git pkt-line to detect pack boundaries).
  2. Transfer pack objects as RNS.Resource instead of streaming TCP bytes.
  3. Re-inject on the remote side before forwarding to radicle-node.

This is a significant complexity increase and may not be worth it for an initial version. The simpler alternative is to rely on TCP retransmission: if the tunnel drops, TCP times out, radicle-node retries, and the bridge re-establishes the link. This works but results in poor user experience (multi-minute timeouts on LoRa).

The minimum correct fix (RNS.Buffer instead of RNS.Packet for streaming) makes the bridge reliable over all media including LoRa, because RNS.Channel (which Buffer uses) provides per-message acknowledgement and retransmission.

Airtime budget example (EU868, SF10)

  • RNS link establishment: 297 bytes → ~2 seconds airtime
  • Gossip ref notification: ~300 bytes → ~2 seconds airtime
  • 10 KB pack object: ~27 frames, ~68 seconds airtime; at 1% duty cycle that consumes the airtime budget of ~6800 seconds (~2 hours) of calendar time
  • 100 KB pack object: ~10 minutes of airtime, needs ~17 hours of calendar time at 1% duty cycle
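The arithmetic behind these bullets, for checking (SF10 taken as 1.2 kbps; using ENCRYPTED_MDU as the frame size is a simplification that ignores LoRa preamble overhead):

RATE_BPS = 1200          # SF10, 125 kHz
DUTY = 0.01              # EU868 1% duty cycle
MDU = 383                # RNS.Packet.ENCRYPTED_MDU

def airtime_s(nbytes: int) -> float:
    return nbytes * 8 / RATE_BPS

for label, size in [("link handshake", 297), ("gossip packet", 300),
                    ("10 KB pack", 10 * 1024), ("100 KB pack", 100 * 1024)]:
    air = airtime_s(size)
    frames = -(-size // MDU)                     # ceiling division
    print(f"{label}: {frames} frame(s), ~{air:.0f} s airtime, "
          f"~{air / DUTY / 3600:.1f} h calendar time at 1% duty")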

These numbers confirm: LoRa is viable for ref notifications and small commits, and impractical for initial clones or large repos.


6. Summary Table

Item                                        Status                                      Recommendation
bridge.py — TCP↔RNS tunnel                  Correct architecture, packet-loss gap       Keep; switch to RNS.Buffer
bridge.py — announce + NID in app_data      Correct                                     Keep
bridge.py — per-bridge TCP ports            Correct                                     Keep
bridge.py — state persistence               Correct                                     Keep
bridge.py — path maintenance loop           Correct                                     Keep
bridge.py — reconnect on link drop          Correct                                     Keep
gossip.py — ref-change relay                Correct and necessary                       Keep
gossip.py — inotify + debounce              Correct                                     Keep
gossip.py — MDU-aware splitting             Correct                                     Keep
seed.py — seed node manager                 Correct                                     Keep
identity.py — RNS identity persistence      Correct                                     Keep; remove from_did path
adapter.py — RNSTransportAdapter            Dead code, conflicts                        Delete
link.py — RadicleLink wrapper               Dead code                                   Delete
messages.py — binary gossip frames          Dead code, duplicates Radicle               Delete
cmd_node, cmd_ping, cmd_peers               Use dead adapter                            Remove
Custom gossip protocol                      Radicle handles natively over bridge        Remove (messages.py)
RNS.Packet for stream data                  Best-effort, packet loss corrupts stream    Replace with RNS.Buffer
RNS.Resource for large transfers            Not implemented                             Consider for LoRa path
Initial clone over LoRa                     Impractical                                 Document; clone over fast link first