Debugging an SSH Timeout Over Tailscale Userspace: Anatomy of a Transport Problem

The Context

I was working on the gradual migration of TazLab cluster secrets from Infisical to a Vault instance running on a Hetzner VM. The project, 09-vault-k8s-integration-prep, was well advanced: the VM was operational, Vault was initialized and running, the Tailscale hostname had converged to the MagicDNS name lushycorp-vault.magellanic-gondola.ts.net, and the Ansible playbook had completed several full destroy/create cycles.

But the pipeline was not yet solid enough. Every so often, create.sh would hang. Not randomly — almost always at the same points: a Vault service restart, a fetch task, an apt install. At first they looked like playbook issues: systemd_service handlers that never returned, “dead SSH session during restart.”

After converting all 7 synchronous systemd_service tasks to async and splitting the playbook into three separate ones — one for installation, one for convergence, one for post-convergence finalization — things improved, but the problem did not disappear entirely.

Something still did not add up.

The 76-Second Threshold

The next step was to make the problem measurable. I built a systematic test matrix from inside the TazPod container against the Hetzner VM, using raw SSH:

Operation	Result	Time
`echo ok`	✅	immediate
`sleep 120`	✅	2 minutes
continuous output for 2 minutes	✅	2 minutes
`curl` download of an 8 MB .deb file	✅	~1 minute
throttled `curl` at 100k/s (80s)	✅	80 seconds
`apt-get update`	✅	4 seconds
`apt-get download awscli`	✅	2 seconds
`sudo apt-get install --reinstall -y awscli`	❌	~76 seconds

The pattern was clear: long idle sessions, sustained output, and bulk downloads all completed, but the one operation that actually installed a package killed the SSH connection after roughly 76 seconds.
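A matrix like this is easy to script so that it can be re-run after every change. The following is a minimal sketch, not the harness used for the measurements above. It assumes a hypothetical `RUNNER` variable that holds the command prefix for remote execution (for example `ssh vault-vm`, where `vault-vm` would be an alias defined in `~/.ssh/config` that carries the ProxyCommand); by default it runs commands locally so the script can be dry-run anywhere.

```shell
# Hypothetical harness. RUNNER is the command prefix used to execute each case,
# e.g. RUNNER='ssh vault-vm' (an assumed ssh_config alias). Defaults to a
# local shell so the script can be tried without a remote host.
RUNNER=${RUNNER:-sh -c}

# run_case LABEL COMMAND
# Runs COMMAND through $RUNNER and prints: LABEL <tab> PASS|FAIL <tab> SECONDSs
run_case() {
    label=$1
    cmd=$2
    start=$(date +%s)
    # stdin is redirected so ssh never waits on the terminal
    if $RUNNER "$cmd" </dev/null >/dev/null 2>&1; then
        status=PASS
    else
        status=FAIL
    fi
    end=$(date +%s)
    printf '%s\t%s\t%ss\n' "$label" "$status" "$((end - start))"
}

# Example invocations (commented out because they need a reachable host):
# run_case "echo"        "echo ok"
# run_case "sleep 120"   "sleep 120"
# run_case "apt install" "sudo apt-get install --reinstall -y awscli"
```

Because a dropped SSH connection makes the client exit non-zero, a transport failure shows up as FAIL together with the elapsed time, which is how a threshold like the 76 seconds above becomes visible.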

To rule out Ansible from the diagnosis, I repeated the failing case with raw SSH and verbose logging:

ssh -vvv -o ProxyCommand="tailscale --socket=... nc %h %p" \
  admin@<tailnet-ip> \
  "sudo apt-get install --reinstall -y awscli"

Same result: dead after ~76 seconds, identical whether via Ansible or direct SSH.

The -vvv log showed a precise sequence:

debug1: channel 0: new session
debug1: Entering interactive session.
debug2: exec request accepted on channel 0
debug2: channel 0: read failed ... Broken pipe
debug2: channel 0: send eof
debug3: send packet: type 80
debug3: send packet: type 80
...
Timeout, server <ip> not responding.

The session entered interactive mode and the command started. Then the channel read failed with Broken pipe, the client sent repeated type 80 packets (SSH global requests, which OpenSSH uses for its keepalive probes), and eventually it gave up with a timeout.
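When the same failure has to be recognized across many runs, the sequence can be detected mechanically. The sketch below is an illustration only: `classify_ssh_log` is a hypothetical helper name, and the patterns are taken from the log excerpt above, so they may need adjusting for other OpenSSH versions.

```shell
# classify_ssh_log: reads `ssh -vvv` stderr on stdin and prints a one-line summary.
# Patterns are based on the excerpt above and are an assumption about log wording.
classify_ssh_log() {
    awk '
        /exec request accepted/             { started = 1 }
        /Broken pipe/                       { broken = 1 }
        /send packet: type 80/              { if (broken) probes++ }
        /Timeout, server .* not responding/ { timed_out = 1 }
        END {
            if (started && broken && timed_out)
                printf "transport-collapse probes_after_break=%d\n", probes
            else if (timed_out)
                print "timeout-without-broken-pipe"
            else
                print "no-timeout"
        }'
}

# Example (commented out, needs a real host):
# ssh -vvv vault-vm "sudo apt-get install --reinstall -y awscli" 2>&1 >/dev/null | classify_ssh_log
```

Counting the probes sent after the break gives a rough sense of how long the client kept trying before it declared the server unresponsive.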

What Works and What Doesn’t

The pattern ruled out many hypotheses:

It was not session duration: sleep 120 completed without issues
It was not traffic volume: an 8 MB curl download completed in about a minute
It was not reduced bandwidth: a curl throttled to 100k/s ran for 80 seconds and completed
It was not Ansible: raw SSH failed the same way
It was not apt itself: apt-get update and apt-get download worked
It was not the last playbook task: the problem appeared even in early tasks after connection

The most important logical leap was this: failure happened specifically during apt install, not during download, upload, or long idle periods. There was something in the I/O pattern generated by installation — disk writes, post-install scripts, dpkg database updates — that collapsed the SSH transport over tailscale nc.

Suspecting the Transport Layer

The TazPod container runs Tailscale in a particular configuration. When Docker creates it, there is no /dev/net/tun, so Tailscale must run in userspace networking mode: tailscaled handles tailnet traffic in-process through a userspace network stack instead of exposing a kernel TUN interface. As a result the container has no routable tailnet interface, and SSH reaches the VM through a ProxyCommand:

ssh -o ProxyCommand="tailscale nc %h %p" ...

This tells SSH not to connect directly to the VM, but to go through tailscale nc, which forwards TCP traffic over the tailnet using the userspace stack.

The combination of three layers — Docker bridge networking + Tailscale userspace + ProxyCommand “nc” — was a functional architecture for short commands, but proved fragile for operations requiring a stable multi-minute connection with I/O bursts.

The strongest hint came when I compared the Tailscale peer state between “good” and “bad” sessions. In historical logs from previous successful create runs, the peer often showed as active; direct 178.104.84.205:41641, meaning a direct peer-to-peer WireGuard path. In problematic sessions, the peer appeared in an ambiguous state, often going through a DERP relay, sometimes with inconsistent metadata between ping and status.

This did not prove DERP was the cause, but it suggested the transport path was not clean.
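To make that comparison repeatable, the path type can be extracted from `tailscale status` and logged next to each run. This is a hedged sketch: `peer_path` is a hypothetical helper, and it assumes the status line contains either `direct <ip:port>` (as in the logs quoted above) or `relay "<region>"`, a text format that may differ between Tailscale versions.

```shell
# peer_path STATUS_LINE
# Prints direct, relay, or unknown based on one line of `tailscale status` output.
# The matched substrings are an assumption about the CLI's human-readable format.
peer_path() {
    case "$1" in
        *" direct "*) echo direct ;;
        *" relay "*)  echo relay ;;
        *)            echo unknown ;;
    esac
}

# Example (commented out, needs a running tailscaled):
# line=$(tailscale status | grep lushycorp-vault)
# echo "$(date +%s) $(peer_path "$line")" >> peer-path.log
```

Recording the result before each create run would show whether failures actually correlate with relayed sessions or whether the relay observation was a coincidence.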

The Definitive Test: Escaping Userspace

At this point I decided to change one variable at a time. The biggest one was: “What happens if we run Tailscale with a kernel TUN device, a real /dev/net/tun, instead of userspace networking?”

I prepared a test container with a different configuration:

docker run -d --name tazpod-test \
  --network host \
  --cap-add NET_ADMIN \
  --device /dev/net/tun \
  tazzo/tazpod-ai:latest \
  sleep infinity

Then I started Tailscale in normal TUN mode, using a helper script now included in the image:

tazpod-tailscale-up

And repeated the exact same command that had always failed:

ansible ... -m shell -a \
  'sudo DEBIAN_FRONTEND=noninteractive apt-get install --reinstall -y awscli'

Result: completed in 9 seconds.

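Same VM, same command, same Ansible, same secrets. The only difference was the transport: no more tailscaled --tun=userspace-networking + ProxyCommand tailscale nc, but a direct tailnet connection through a kernel TUN interface. (As far as I know, Tailscale's WireGuard implementation runs in userspace in both modes; what changes is how packets get in and out of it.)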

The problem was not in the playbook, not in Ansible, not in apt or dpkg. It was in the combination of userspace networking and ProxyCommand via nc which, for reasons still worth deeper investigation, could not sustain the package installation workload.

The Pipeline Revived

With the cause isolated, the changes were surprisingly minimal.

The TazPod container runtime now defaults to:

--network host — no Docker bridge
--cap-add NET_ADMIN — required for TUN
--device /dev/net/tun — the kernel interface
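Two of these defaults can be verified from inside the container before Tailscale is started. The sketch below is a hypothetical preflight, not part of the TazPod image. It relies on two Linux facts: /dev/net/tun is a character device, and CAP_NET_ADMIN is capability number 12, so it appears as bit 12 of the CapEff mask in /proc/self/status.

```shell
# has_net_admin CAPEFF_HEX
# Succeeds if bit 12 (CAP_NET_ADMIN) is set in the given hex capability mask.
has_net_admin() {
    [ $(( 0x$1 & (1 << 12) )) -ne 0 ]
}

# tun_preflight [TUN_PATH]
# Prints what is missing and returns non-zero if TUN mode cannot work.
tun_preflight() {
    tun=${1:-/dev/net/tun}
    capeff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
    rc=0
    [ -c "$tun" ] || { echo "missing character device: $tun"; rc=1; }
    has_net_admin "$capeff" || { echo "CAP_NET_ADMIN not in effective set"; rc=1; }
    if [ "$rc" -eq 0 ]; then
        echo "TUN mode prerequisites present"
    fi
    return "$rc"
}

# Example (commented out): fail fast before starting tailscaled
# tun_preflight || exit 1
```

A container started without the flags fails this check with a clear message before Tailscale has a chance to end up in userspace mode unnoticed.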

The tazpod-tailscale-up helper starts tailscaled in the background, generates an auth key using the same OAuth credentials (with API key fallback) already in the vault, and connects the container to the tailnet.

The Ansible inventory is generated dynamically: if /dev/net/tun is present, it uses direct tailnet SSH without ProxyCommand; otherwise it falls back to the old tailscale nc path. This auto-detection logic lives in the render-tailscale-inventory.sh helper.
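The detection logic described here can be sketched in a few lines. This is not the actual render-tailscale-inventory.sh; the function name, the `[vault]` group, and the argument order are assumptions made for illustration. Only the decision rule (TUN present means plain SSH, otherwise ProxyCommand via tailscale nc) comes from the description above.

```shell
# render_inventory HOST USER [TUN_PATH]
# Prints an INI-style Ansible inventory. Group name and layout are hypothetical.
render_inventory() {
    host=$1
    user=$2
    tun=${3:-/dev/net/tun}
    echo "[vault]"
    if [ -c "$tun" ]; then
        # Kernel TUN present: the tailnet address is routable, plain SSH works.
        printf '%s ansible_user=%s\n' "$host" "$user"
    else
        # No TUN device: tunnel SSH through the userspace stack with tailscale nc.
        printf "%s ansible_user=%s ansible_ssh_common_args='-o ProxyCommand=\"tailscale nc %%h %%p\"'\n" \
            "$host" "$user"
    fi
}

# Example (commented out; the output file name is hypothetical):
# render_inventory lushycorp-vault.magellanic-gondola.ts.net admin > inventory.ini
```

Keeping the fallback branch means the same playbooks still run in environments where a TUN device cannot be granted, at the cost of the fragile path described earlier.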

The Vault playbook, previously a monolith, was split into three phases with separate timing:

Phase	Duration
Runtime installation (packages, config, service)	175s
Convergence (classification, restore, unseal, health)	90s
Post-convergence (token, backup, persistence)	38s
Total	~344s (5.7 min)

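The previous best time was around 1200 seconds (20 minutes) with frequent hangs. The gap is substantial and, more importantly, the pipeline is now deterministic: zero timeouts, zero UNREACHABLE, zero manual intervention. Note that the three phases sum to 303 seconds; the remaining ~41 seconds of the ~344-second total are not broken down in the table.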
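Per-phase timings like these can be collected with a small wrapper around each playbook run. The sketch below is an assumption about how one might do it, not the method used for the numbers above: `time_phase` is a hypothetical helper and the playbook file names in the comments are placeholders. The arithmetic at the end uses only the figures from the table and the 1200-second baseline.

```shell
# time_phase NAME COMMAND [ARGS...]
# Runs the command, prints NAME <tab> elapsed seconds, and preserves the exit code.
time_phase() {
    name=$1
    shift
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    printf '%s\t%ss\n' "$name" "$((end - start))"
    return "$rc"
}

# Example (commented out; playbook names are placeholders):
# time_phase install  ansible-playbook install.yml
# time_phase converge ansible-playbook converge.yml
# time_phase finalize ansible-playbook finalize.yml

# Arithmetic on the reported figures.
phases_sum=$((175 + 90 + 38))
speedup=$(awk 'BEGIN { printf "%.1f", 1200 / 344 }')
echo "sum of phases: ${phases_sum}s"
echo "speedup vs previous best: ${speedup}x"
```

With the reported numbers this works out to roughly a 3.5x reduction in wall-clock time for a full create run.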

What We Learned

The first lesson was methodological. The problem was buried under several layers of abstraction: I created the container with TazPod, which launched Docker, which had no /dev/net/tun, so Tailscale fell back to userspace mode, which required a ProxyCommand via nc, and that could not sustain certain traffic patterns. I had become so accustomed to this configuration that I no longer considered it a possible cause.

The second lesson is that isolation tests work. Reducing the problem to raw SSH, then comparing different transports (public SSH vs tailnet SSH, userspace vs TUN) gave a clear answer in a few hours. Had I kept “fixing” the playbook, I would still be circling.

The third lesson is that Tailscale’s userspace-networking mode, while remarkably useful for environments without kernel privileges (PaaS containers, Lambda, CI/CD), has operational limits that do not show up with short commands and surface only under specific workloads. It is not a bug in Tailscale per se. It is a combination of layers that together become fragile: Docker bridge + userspace networking + ProxyCommand = a dependency chain that is hard to debug.

Current State

The Hetzner Vault VM is operational and the create pipeline is now stable and measurable. The 09-vault-k8s-integration-prep project closed Phase 1 (runtime convergence + transport validation) successfully with a known execution time.

The next step — Phase 2 — is the cluster side: configuring CoreDNS to correctly resolve lushycorp-vault.magellanic-gondola.ts.net inside the tailnet, creating the ClusterSecretStore in Kubernetes to read secrets from Vault, and verifying the whole setup with an ESO smoke test.

But that is another day’s work.
