Introduction: The Persistence Paradox#
In the Cloud Native paradigm, we treat workloads as cattle, not pets. Pods are ephemeral, expendable, and stateless. However, operational reality imposes an inescapable constraint: state must reside somewhere. Whether it is a database, system logs, or, as in our specific case, SSL certificates dynamically generated by an Ingress Controller, the need for persistent and distributed Block Storage is the first real obstacle that transforms a “toy” cluster into a production infrastructure.
The goal of this session was not trivial: to implement Longhorn, SUSE/Rancher’s distributed storage engine, on an immutable operating system like Talos Linux. The challenge is twofold: Talos, by design, prevents modification of the root filesystem and runtime package installation. This makes installing storage drivers (like iSCSI) an operation that must be planned at the OS image architecture level, not via simple apt or yum commands.
This chronicle documents the infrastructure hardening process, physical storage provisioning on Proxmox, and a complex troubleshooting session related to persistent volume permissions during integration with Traefik.
Phase 1: The Immutability Obstacle and System Extensions#
The first technical barrier encountered concerns the very nature of Longhorn. To function, Longhorn creates a virtual block device on each node, which is then mounted by the Pod. This operation relies heavily on the iSCSI (Internet Small Computer Systems Interface) protocol.
In a traditional Linux distribution (Ubuntu, CentOS), the Longhorn installation would check for the presence of open-iscsi and, if missing, the administrator would install it. On Talos Linux, this is impossible. The filesystem is read-only; there is no package manager.
Analysis and Solution: Sidero Image Factory#
A preliminary check on the cluster revealed the lack of necessary extensions:
talosctl get extensions
# Output: No critical extensions installed

Without iscsi-tools and util-linux-tools, Longhorn pods would have remained indefinitely in the ContainerCreating state, unable to mount volumes.
The architectural solution adopted was the use of Sidero Image Factory. Instead of modifying the existing node, we generated a new OS image definition (a “schematic”) that natively included the required drivers.
The selected extensions were:
- siderolabs/iscsi-tools: The daemon and user-space utilities for iSCSI.
- siderolabs/util-linux-tools: Essential filesystem management utilities for automatic volume formatting.
- siderolabs/qemu-guest-agent: To improve integration with the Proxmox hypervisor.
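A schematic for Sidero Image Factory is a small YAML document. A minimal sketch covering the three extensions (the exact schematic we used is not reproduced here; the field names follow the Image Factory schematic format):

```
# Image Factory schematic: submit it to factory.talos.dev to receive
# the schematic ID used in the installer image URL.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools
      - siderolabs/util-linux-tools
      - siderolabs/qemu-guest-agent
```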
The update was performed in “rolling” mode, one node at a time, ensuring the cluster remained operational (or nearly so) during the transition.
# Example of the surgical upgrade command
talosctl upgrade --image factory.talos.dev/installer/[ID_SCHEMA]:v1.12.0 --preserve=true

This step highlights a fundamental lesson of modern DevOps: infrastructure is managed declaratively. You don’t “patch” servers; you replace the images that govern them.
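The rolling rollout described above can be sketched as a simple loop. This is a dry run (the echo prints the command instead of executing it); the second node IP is a placeholder for illustration, and [ID_SCHEMA] stands for your actual schematic ID:

```shell
#!/bin/sh
# Dry-run sketch of a rolling Talos upgrade, one node at a time.
# Node IPs and the schematic ID are placeholders for illustration.
IMAGE="factory.talos.dev/installer/[ID_SCHEMA]:v1.12.0"
for node in 192.168.1.127 192.168.1.128; do
  echo "talosctl upgrade --nodes $node --image $IMAGE --preserve=true"
  # In a real run: remove the echo, then wait for the node to come back
  # healthy before proceeding to the next one.
done
```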
Phase 2: Physical Storage Provisioning (Proxmox & Talos)#
Once the software was enabled to “speak” to the storage, we needed to provide the physical storage. Although it is possible to use the main operating system disk for data, this is a discouraged practice (anti-pattern) for several reasons:
- I/O Contention: System logs or etcd operations must not compete with database writes.
- Lifecycle: Reinstalling the operating system (e.g., a Talos reset) could result in formatting the /var partition, deleting persistent data.
The Dedicated Disk Strategy#
We opted to add a second virtual disk (virtio-scsi or virtio-blk) on each Proxmox VM. Here a critical operational risk emerged: device identification.
On Linux, device names (/dev/sda, /dev/sdb, /dev/vda) are not guaranteed to be persistent or deterministic, especially in virtualized environments where boot order can vary. Applying a Talos configuration that formats /dev/sdb when /dev/sdb is actually the system disk would lead to catastrophe (total data loss).
Mitigation Technique: Identification via Size#
To mitigate this risk, we adopted a hardware “flagging” technique: instead of creating data disks identical in size to the system disks (34GB), we sized the new disks at a distinctive 43GB.
# Pre-formatting verification
NODE            DISK   SIZE    TYPE
192.168.1.127   sda    34 GB   QEMU HARDDISK (OS)
192.168.1.127   vda    43 GB   (Data Target)

Only after unequivocally confirming that /dev/vda was the 43GB disk on all nodes did we apply the Talos MachineConfig to partition, format in XFS, and mount the disk at /var/mnt/longhorn.
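In MachineConfig terms, the extra-disk handling looks roughly like this (a sketch, not our exact patch; Talos formats such partitions as XFS by default, and the device name must be confirmed on every node first):

```
machine:
  disks:
    - device: /dev/vda              # the 43GB data disk, confirmed per node
      partitions:
        - mountpoint: /var/mnt/longhorn
```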
The Kubelet Mount Trick#
A technical detail often overlooked is mount visibility. The Kubelet runs inside an isolated container. Mounting a disk on the host at /var/mnt/longhorn does not automatically make it visible to the Kubelet.
We had to explicitly configure extraMounts with rshared propagation:
kubelet:
  extraMounts:
    - destination: /var/lib/longhorn
      type: bind
      source: /var/mnt/longhorn
      options:
        - bind
        - rshared
        - rw

Without rshared, Longhorn would have attempted to mount the volumes, but the Kubelet would not have been able to pass them to the Pods, resulting in “MountPropagation” errors.
Phase 3: Longhorn Installation and Configuration#
Installation via Helm was relatively painless, thanks to meticulous preparation. However, configuring Longhorn in a two-node environment (one Control Plane and one Worker) requires specific compromises.
Replica Configuration#
By default, Longhorn tries to maintain 3 replicas of data on different nodes to guarantee High Availability (HA). In a 2-node cluster, this requirement is impossible to satisfy (Hard Anti-Affinity).
We had to reduce the numberOfReplicas to 2. This configures a “minimum fault tolerance” situation: if a node goes down, data is still accessible on the other, but redundancy is lost until recovery. This is an acceptable trade-off for a Homelab environment, but critical to understand for production.
Additionally, we customized the defaultDataPath to point to /var/lib/longhorn (the path internal to the Kubelet container that maps our dedicated disk), ensuring data never touched the OS disk.
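In Helm terms, the two customizations translate to something like the following values.yaml fragment (key names as used by the Longhorn chart; verify them against your chart version):

```
persistence:
  defaultClassReplicaCount: 2       # replicas for the default StorageClass
defaultSettings:
  defaultReplicaCount: 2            # cluster-wide default, fits a 2-node setup
  defaultDataPath: /var/lib/longhorn  # the bind-mounted dedicated disk
```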
Phase 4: Traefik Integration and the Permissions Nightmare#
The real technical battle began when we attempted to use this new storage to persist Traefik SSL certificates (acme.json file).
The Problem: Init:CrashLoopBackOff#
After configuring Traefik to use a Longhorn PVC, the pod entered a continuous crash loop.
Log analysis revealed:
chmod: /data/acme.json: Operation not permitted
Root Cause Analysis#
The conflict arose from three contrasting security vectors:
- Kubernetes fsGroup: We instructed Kubernetes to mount the volume making it writable for group 65532 (Traefik’s non-root user). This sets permissions to 660 (Read/Write for User and Group).
- Let’s Encrypt / Traefik: For security, Traefik demands that the acme.json file has very strict permissions: 600 (only the owner user can read/write). If permissions are more open (e.g., 660), Traefik refuses to start.
- HostNetwork & Privileged Ports: Since we are using hostNetwork: true to expose Traefik directly on the node IP, Traefik must be able to bind to ports 80 and 443. On Linux, ports under 1024 require root privileges (or the NET_BIND_SERVICE capability).
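The clash between 660 and 600 is easy to reproduce locally, without a cluster (the file name is just an example; stat -c is GNU coreutils syntax):

```shell
#!/bin/sh
# Local reproduction of the permission clash (no cluster required).
touch acme.json
chmod 660 acme.json       # what the fsGroup-managed volume produced
stat -c '%a' acme.json    # prints 660 -> Traefik refuses to start
chmod 600 acme.json       # the only mode Traefik/ACME accepts
stat -c '%a' acme.json    # prints 600
rm acme.json
```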
The Infinite Troubleshooting Loop#
Initially, we tried to force permissions with an initContainer. This failed: the initContainer did not have root privileges on the mounted filesystem.
We then tried changing the user (runAsUser: 65532), but this prevented binding to port 80 (bind: permission denied).
The situation was paradoxical:
- If we ran as Root, we could open port 80, but Kubernetes (via fsGroup) altered file permissions to 660, angering Traefik.
- If we ran as Non-Root, we could not open port 80.
The Definitive Solution: “Clean Slate”#
Resolution required a radical approach:
- Removal of fsGroup: We removed every fsGroup directive from the Helm values.yaml. This tells Kubernetes: “Mount the volume as is, do not touch file permissions”.
- Execution as Root (Temporary): We configured Traefik to run as runAsUser: 0 (Root). This resolves the port 80 binding problem.
- Volume Reset: Since the existing acme.json file was by now “corrupted” by previous attempts (it had 660 permissions), Traefik continued to fail even with the new configuration. We had to manually delete the file (rm /data/acme.json) from inside the pod.
At the next restart, Traefik (running as Root) created a new acme.json. Since there was no fsGroup interfering, the file was created with the correct default permissions (600).
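In the Traefik Helm chart, the final state corresponds roughly to these values (a sketch; nulling fsGroup assumes the chart defaults include one to override):

```
securityContext:
  runAsUser: 0          # temporary: root, until MetalLB removes the need
  runAsGroup: 0
  runAsNonRoot: false
podSecurityContext:
  fsGroup: null         # leave volume file permissions untouched
```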
The final log was a relief:
Testing certificate renew... Register... providerName=myresolver.acme
Post-Lab Reflections#
Implementing Longhorn on a bare-metal (or low-level virtualized) Kubernetes cluster is an exercise that exposes the hidden complexity of distributed storage. It is not enough to “install the chart”. One must understand how the operating system manages devices, how the Kubelet manages mount points, and how containers manage user permissions.
Lessons Learned:
- Immutability requires planning: On systems like Talos, kernel and userspace dependencies must be “baked” into the image, not installed retrospectively.
- Permissions in persistent storage are tricky: Kubernetes’ fsGroup mechanism is useful for standard databases but can be destructive for applications requiring paranoid file permissions (like Traefik/ACME or SSH keys).
- Hardware Identification: Never trust device names (/dev/sda). Use UUIDs or, during provisioning, unique disk sizes to avoid catastrophic human errors.
The cluster now possesses a resilient persistence layer. The next logical step will be to remove the dependency on hostNetwork and Root by introducing a BGP Load Balancer like MetalLB, allowing Traefik to run as an unprivileged user and completing the security architecture.
Generated via Gemini CLI


