proxmox-infra/docs/DECISIONS.md
kavren e0a64b1b92 docs: Add DHCP-based network isolation strategy
- Document OPNsense WAN configuration (pm4 vmbr1 with USB NIC)
- Add DHCP-based isolation workaround for unmanaged Gigabyte switches
- Plan subnet scheme: LAN (10.4.2.0/24), IoT (10.4.10.0/24), Guest (10.4.20.0/24)
- Document planned OPNsense firewall rules for isolation
- Update tasks with OPNsense migration and isolation steps
- Fix Claude Code hooks settings (remove matcher from Stop hook)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 19:20:07 -05:00


Architecture Decisions & Patterns

Purpose: Record of important decisions, patterns, and "why we do it this way"
Update Frequency: When making significant architectural choices

Service Organization

Authentication Strategy

Decision: Services use their own built-in authentication, not Authelia
Reason: Most *arr services and media tools have robust auth systems
Exception: Consider Authelia for future services that lack authentication

LXC vs Docker

Keep in Docker:

  • NZBGet (requires specific volume mapping, works well in Docker)
  • Multi-container stacks
  • Services requiring Docker-specific features

Migrate to LXC:

  • Single-purpose services (Sonarr, Radarr, etc.)
  • Services benefiting from isolation
  • Stateless applications

File Permissions

Media Files

Standard: All media files and folders must be 777
Reason:

  • NFS mounts between multiple systems with different UID mappings
  • Jellyfin runs in LXC with UID namespace mapping (100107)
  • Sonarr runs in LXC with different UID mapping
  • NZBGet runs in Docker with UID 1000
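
The UID mismatch behind this standard can be illustrated with a little arithmetic. The sketch below assumes the Proxmox default for unprivileged containers (container UIDs shifted by 100000 on the host, range of 65536); under that assumption, the host UID 100107 noted above corresponds to UID 107 inside the Jellyfin container. The function names are hypothetical.

```python
# Assumption: default unprivileged LXC mapping (offset 100000, 65536 IDs).
DEFAULT_OFFSET = 100000
DEFAULT_COUNT = 65536


def host_uid(container_uid, offset=DEFAULT_OFFSET, count=DEFAULT_COUNT):
    """Return the UID the host (and an NFS server) sees for a container UID."""
    if not 0 <= container_uid < count:
        raise ValueError("container UID outside the mapped range")
    return offset + container_uid


def container_uid(host_uid_value, offset=DEFAULT_OFFSET, count=DEFAULT_COUNT):
    """Return the UID inside the container for a mapped host UID."""
    uid = host_uid_value - offset
    if not 0 <= uid < count:
        raise ValueError("host UID is not part of this container's mapping")
    return uid


if __name__ == "__main__":
    print(host_uid(107))          # UID as seen on the host
    print(container_uid(100107))  # UID as seen inside the container
```

Because NZBGet writes as UID 1000 while the LXC services appear as shifted UIDs, owner-based permissions do not line up across systems, which is why the standard falls back to world-writable modes.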

Implementation:

  • NZBGet: UMask=0000 to create files with 777
  • Sonarr: Media management → Set permissions → chmod 777
  • Manual fixes: chmod -R 777 on media directories as needed
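
As a rough illustration of what a umask of 0000 does, the sketch below creates a directory and a file under a given umask and reports the resulting modes. It only shows the masking step: the final mode also depends on the mode the creating program requests (here 777 for the directory and 666 for the file, which are common defaults), so this is not a statement about NZBGet's internals. The function name and file names are hypothetical.

```python
import os
import stat
import tempfile


def modes_under_umask(mask):
    """Create one directory and one file under `mask`; return their octal modes."""
    old = os.umask(mask)  # set the process umask, remembering the previous one
    try:
        with tempfile.TemporaryDirectory() as tmp:
            d = os.path.join(tmp, "season01")
            f = os.path.join(tmp, "episode.mkv")
            os.mkdir(d, 0o777)  # requested mode 777, then masked
            fd = os.open(f, os.O_CREAT | os.O_WRONLY, 0o666)  # requested 666
            os.close(fd)
            return (
                oct(stat.S_IMODE(os.stat(d).st_mode)),
                oct(stat.S_IMODE(os.stat(f).st_mode)),
            )
    finally:
        os.umask(old)  # always restore the original umask


if __name__ == "__main__":
    print(modes_under_umask(0o000))  # nothing masked
    print(modes_under_umask(0o022))  # group/other write bits masked
```

If imported files show up with modes other than 777, the manual chmod -R 777 fix listed above still applies.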

Network Architecture

Network Isolation Strategy

Goal: Isolate IoT (KavCorp-IOT) and Guest (KavCorp-Guest) WiFi networks from the main LAN, while allowing Smart Home VMs to access IoT devices.

Constraint: Unmanaged Gigabyte Switches

The Gigabyte 10G switches provide 10G backhaul and 2.5G PoE to UniFi APs, but they are unmanaged and don't support VLAN tagging. This means VLAN tags from UniFi APs are stripped when traffic passes through.

Workaround: DHCP-based isolation (L3 firewall rules instead of L2 VLANs)

IP Subnet Scheme

| Subnet | Range | Purpose | DHCP Source |
|---|---|---|---|
| Main LAN | 10.4.2.0/24 | Trusted devices, Proxmox hosts, services | OPNsense |
| IoT | 10.4.10.0/24 | KavCorp-IOT SSID devices | OPNsense or UniFi |
| Guest | 10.4.20.0/24 | KavCorp-Guest SSID devices | OPNsense or UniFi |

OPNsense Firewall Rules (Planned)

| Source | Destination | Action | Notes |
|---|---|---|---|
| 10.4.10.0/24 (IoT) | 10.4.2.0/24 (LAN) | Block | Isolate IoT from LAN |
| 10.4.20.0/24 (Guest) | 10.4.2.0/24 (LAN) | Block | Isolate Guest from LAN |
| 10.4.20.0/24 (Guest) | 10.4.10.0/24 (IoT) | Block | Isolate Guest from IoT |
| Smart Home VMs | 10.4.10.0/24 (IoT) | Allow | Home Assistant → IoT devices |
| 10.4.10.0/24 (IoT) | Internet | Allow | IoT internet access |
| 10.4.20.0/24 (Guest) | Internet | Allow | Guest internet access |
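
To sanity-check the planned rules before entering them in OPNsense, the intent of the table can be modeled in a few lines. This is only a sketch of the policy, not OPNsense configuration: the Smart Home VM address (10.4.2.50) is a made-up placeholder, "Internet" is approximated as "any destination outside the three internal subnets", and traffic the table does not mention is reported as "unspecified".

```python
from ipaddress import ip_address, ip_network

LAN = ip_network("10.4.2.0/24")
IOT = ip_network("10.4.10.0/24")
GUEST = ip_network("10.4.20.0/24")
INTERNAL = (LAN, IOT, GUEST)

# Hypothetical Smart Home VM address; the document does not list one.
SMART_HOME_VMS = {ip_address("10.4.2.50")}


def decide(src, dst):
    """Return 'block', 'allow', or 'unspecified' for a src -> dst flow."""
    s, d = ip_address(src), ip_address(dst)
    is_internet = not any(d in net for net in INTERNAL)

    if s in IOT and d in LAN:
        return "block"  # isolate IoT from LAN
    if s in GUEST and (d in LAN or d in IOT):
        return "block"  # isolate Guest from LAN and IoT
    if s in SMART_HOME_VMS and d in IOT:
        return "allow"  # Home Assistant -> IoT devices
    if (s in IOT or s in GUEST) and is_internet:
        return "allow"  # IoT/Guest internet access
    return "unspecified"  # not covered by the planned table


if __name__ == "__main__":
    for flow in [("10.4.10.5", "10.4.2.10"), ("10.4.2.50", "10.4.10.5"),
                 ("10.4.20.7", "1.1.1.1")]:
        print(flow, decide(*flow))
```

The model matches on source IP only, so it shares the IP-spoofing weakness of the real workaround.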

Limitations of DHCP Workaround

  • Not true L2 isolation: All traffic on same broadcast domain
  • IP spoofing possible: Malicious device could use LAN IP range
  • Sufficient for: IoT devices and guests (low-threat actors)
  • Future upgrade: Replace Gigabyte switches with managed 2.5G PoE switches for proper VLANs

VLAN IDs (For Future Reference)

| VLAN | Name | Subnet | Purpose |
|---|---|---|---|
| 1 | Default | 10.4.2.0/24 | Management, trusted PCs, Proxmox hosts |
| 10 | IoT | 10.4.10.0/24 | IoT devices, cameras, smart home |
| 20 | Guest | 10.4.20.0/24 | Guest WiFi, isolated |

Router/Firewall

Decision: OPNsense VM 130 on pm4 (server closet)
Status: Deployed, pending WAN cutover

Reason:

  • Free, full-featured firewall/router
  • Inter-subnet firewall rules for IoT/Guest isolation
  • IDS/IPS capability
  • pm4 is in server closet next to AT&T modem (avoids routing WAN over backhaul)

Network Interfaces (VM 130):

| Interface | Bridge | Purpose | Status |
|---|---|---|---|
| net0 | vmbr0 | LAN (10.4.2.0/24) | Configured |
| net1 | vmbr1 | WAN (to AT&T modem) | Configured |

pm4 Bridge Configuration:

| Bridge | Physical NIC | Purpose |
|---|---|---|
| vmbr0 | eno1 (Intel I226-V) | LAN - all VMs/LXCs |
| vmbr1 | enx6c1ff76e4d47 (USB 2.5G) | WAN - OPNsense only |

HA/Failover Consideration:

  • Current: Single OPNsense on pm4 (SPOF)
  • Future options:
    1. OPNsense HA with CARP (requires second USB NIC on another node)
    2. Keep current router as cold standby (swap cables if pm4 fails)

Alternative Considered: Ubiquiti Dream Machine

  • Rejected due to cost and ecosystem lock-in
  • OPNsense more flexible for homelab

Alternative Considered: OPNsense on Elantris (basement)

  • Rejected because WAN would need to traverse 10G backhaul
  • Would require managed switches for WAN VLAN isolation

10G Backhaul (Planned)

Decision: 10G RJ45 between server closet and basement
Hardware: 2× GiGaPlus 6-Port 10G PoE switches ($101 each)
Why GiGaPlus over UniFi:

  • Native 10G RJ45 (no SFP+ transceivers needed)
  • Includes PoE for APs
  • $202 total vs $800+ for UniFi equivalent
  • Cat6 can handle 10G at house distances (<55m)

WiFi (Planned)

Decision: UniFi APs with mixed models
Hardware:

  • 1× U6 Enterprise (existing) - server closet/upstairs
  • 2× U7 Pro ($189 each) - basement + main floor

Why UniFi:

  • Multiple SSIDs mapped to VLANs
  • Seamless roaming between APs
  • Centralized management via controller
  • Better than Asus mesh for VLAN support

Controller: LXC on Proxmox (free) via community helper script

Reverse Proxy

Decision: Single Traefik instance handles all external access
Location: LXC 104 on pm2
Benefits:

  • Single point for SSL/TLS management
  • Automatic Let's Encrypt certificate renewal
  • Centralized routing configuration
  • DNS-01 challenge for wildcard certificates

Service Domains

Pattern: <service>.kavcorp.com
DNS: All subdomains point to public IP (99.74.188.161)
Routing: Traefik inspects Host header and routes internally
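
Host-header routing boils down to a lookup from hostname to internal backend. The sketch below shows that idea in isolation; it is not how Traefik is implemented, and the backend addresses and ports in ROUTES are placeholders rather than values from this infrastructure.

```python
# Hypothetical routing table: hostnames follow the <service>.kavcorp.com
# pattern, but the backend IPs and ports are made up for illustration.
ROUTES = {
    "sonarr.kavcorp.com": "http://10.4.2.20:8989",
    "jellyfin.kavcorp.com": "http://10.4.2.21:8096",
}


def route(host_header):
    """Return the backend URL for a Host header, or None if no router matches."""
    # Drop an optional port and normalise case, since hostnames are
    # case-insensitive. IPv6 literals are not handled in this sketch.
    host = host_header.split(":")[0].strip().lower()
    return ROUTES.get(host)


if __name__ == "__main__":
    print(route("Sonarr.kavcorp.com:443"))
    print(route("unknown.kavcorp.com"))
```

Since every subdomain resolves to the same public IP, the Host header is the only thing that distinguishes one service from another at the proxy.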

Storage Architecture

Media Storage

Decision: NFS mount from elantris for all media
Path: /mnt/pve/elantris-media → elantris /el-pool/media
Reason:

  • Centralized storage
  • Accessible from all cluster nodes
  • Large capacity (24TB ZFS pool)
  • Easy to back up/snapshot

LXC Root Filesystems

Decision: Store on KavNas NFS for most services
Reason:

  • Easy backups
  • Portable between nodes
  • Network storage sufficient for most workloads

Exception: High I/O services use local-lvm

Monitoring & Maintenance

Configuration Management

Decision: Manual configuration with documentation
Reason: Small scale doesn't justify Ansible/Terraform complexity
Trade-off: Requires disciplined documentation updates

Backup Strategy

Decision: Proxmox built-in backup to KavNas
Frequency: [To be determined]
Retention: [To be determined]

Common Patterns

Adding a New Service Behind Traefik

  1. Deploy service with static IP in 10.4.2.0/24 range
  2. Create Traefik config in /etc/traefik/conf.d/<service>.yaml
  3. Use pattern:
    http:
      routers:
        <service>:
          rule: "Host(`<service>.kavcorp.com`)"
          entryPoints: [websecure]
          service: <service>
          tls:
            certResolver: letsencrypt
      services:
        <service>:
          loadBalancer:
            servers:
              - url: "http://<ip>:<port>"
    
  4. Traefik auto-reloads (no restart needed)
  5. Update docs/INFRASTRUCTURE.md with service details
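
Since the configuration is maintained by hand, a small helper can reduce copy-paste mistakes when filling in the pattern from step 3. The sketch below renders that YAML for a given service and rejects addresses outside 10.4.2.0/24, as step 1 requires. The function name and the example service, IP, and port are hypothetical; writing the result to /etc/traefik/conf.d/ is left to the operator.

```python
from ipaddress import ip_address, ip_network

LAN = ip_network("10.4.2.0/24")

# Mirrors the pattern from step 3; only name, ip, and port are substituted.
TEMPLATE = """\
http:
  routers:
    {name}:
      rule: "Host(`{name}.kavcorp.com`)"
      entryPoints: [websecure]
      service: {name}
      tls:
        certResolver: letsencrypt
  services:
    {name}:
      loadBalancer:
        servers:
          - url: "http://{ip}:{port}"
"""


def render_traefik_config(name, ip, port):
    """Return the Traefik dynamic-config YAML for one service."""
    if ip_address(ip) not in LAN:
        raise ValueError(f"{ip} is not in {LAN}")
    if not 1 <= int(port) <= 65535:
        raise ValueError(f"invalid port: {port}")
    return TEMPLATE.format(name=name, ip=ip, port=int(port))


if __name__ == "__main__":
    # Example values are placeholders, not real service details.
    print(render_traefik_config("sonarr", "10.4.2.20", 8989))
```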

Troubleshooting Permission Issues

  1. Check file ownership: ls -la /path/to/file
  2. Check if 777: stat -c '%a %U:%G' /path/to/file (prints the octal mode and owner)
  3. Fix permissions: chmod -R 777 /path/to/directory
  4. For NZBGet: Verify UMask=0000 in nzbget.conf
  5. For Sonarr/Radarr: Check Settings → Media Management → Set Permissions
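
When a whole library needs checking rather than a single file, steps 1 and 2 can be automated. The sketch below walks a directory tree and lists every entry whose mode is not 777; it is read-only and does not check the root directory itself. The function name is hypothetical, and the path in the usage line is the media mount named elsewhere in this document.

```python
import os
import stat


def find_not_777(root):
    """Return sorted (path, octal mode) pairs for entries under root that are not 777."""
    offenders = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects the entry itself rather than following symlinks
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            if mode != 0o777:
                offenders.append((path, oct(mode)))
    return sorted(offenders)


if __name__ == "__main__":
    for path, mode in find_not_777("/mnt/pve/elantris-media"):
        print(mode, path)
```

Anything it reports can then be fixed with the chmod command from step 3.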

Node SSH Access

From local machine:

  • User: kavren
  • Key: ~/.ssh/id_ed25519

Between cluster nodes:

  • User: root
  • Each node has other nodes' keys in /root/.ssh/authorized_keys
  • Proxmox web UI uses node SSH for shell access

Known Issues & Workarounds

Jellyfin Not Seeing Media After Import

Symptom: Files imported to /media/tv but Jellyfin shows empty
Cause: Jellyfin LXC mount not active or permissions wrong
Fix:

  1. Restart Jellyfin LXC: pct stop 121 && pct start 121
  2. Verify mount inside LXC: pct exec 121 -- ls -la /media/tv/
  3. Fix permissions if needed: chmod -R 777 /mnt/pve/elantris-media/tv/

Sonarr/Radarr Import Failures

Symptom: "Access denied" errors in logs
Cause: Permission mismatch between download client and *arr service
Fix: Ensure download folder has 777 permissions

Future Considerations

  • Automated backup strategy
  • Monitoring/alerting system (Prometheus + Grafana?)
  • Consider Authelia for future services without built-in auth
  • Document disaster recovery procedures
  • Consider consolidating Docker hosts