# Architecture Decisions & Patterns

> **Purpose**: Record of important decisions, patterns, and "why we do it this way"
>
> **Update Frequency**: When making significant architectural choices

## Service Organization

### Authentication Strategy

**Decision**: Services use their own built-in authentication, not Authelia
**Reason**: Most *arr services and media tools have robust auth systems
**Exception**: Consider Authelia for future services that lack authentication

### LXC vs Docker

**Keep in Docker**:
- NZBGet (requires specific volume mapping, works well in Docker)
- Multi-container stacks
- Services requiring Docker-specific features

**Migrate to LXC**:
- Single-purpose services (Sonarr, Radarr, etc.)
- Services benefiting from isolation
- Stateless applications

## File Permissions

### Media Files

**Standard**: All media files and folders must be 777
**Reason**:
- NFS mounts are shared between multiple systems with different UID mappings
- Jellyfin runs in an LXC with UID namespace mapping (appears on the host as UID 100107)
- Sonarr runs in an LXC with a different UID mapping
- NZBGet runs in Docker with UID 1000

**Implementation**:
- NZBGet: `UMask=0000` so new files and folders are created world-readable and world-writable
- Sonarr: Settings → Media Management → Set Permissions → chmod 777
- Manual fixes: `chmod -R 777` on media directories as needed

## Network Architecture

### Network Isolation Strategy

**Goal**: Isolate the IoT (KavCorp-IOT) and Guest (KavCorp-Guest) WiFi networks from the main LAN, while allowing Smart Home VMs to access IoT devices.
**Status**: Implemented via OPNsense VLANs and firewall rules.

#### VLAN Architecture

The unmanaged Gigabyte switches forward VLAN-tagged frames unchanged; they do not interpret the tags. The UniFi APs tag each SSID's traffic with its VLAN ID, and OPNsense receives that tagged traffic on its VLAN interfaces.
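Each SSID is tagged onto its own VLAN and subnet, so the address a device receives shows which segment it joined. The following is a minimal bash sketch, assuming only the four /24 subnets in the VLAN table below. `segment_for_ip` is a hypothetical helper, not an OPNsense or UniFi command.

```bash
# Hypothetical helper (not an OPNsense or UniFi command): map an IPv4 address
# to its network segment using the /24 subnets from the VLAN table.
# A plain prefix match is enough because every segment is a /24.
segment_for_ip() {
  case "$1" in
    10.4.2.*)  echo "LAN (untagged, vtnet0)" ;;
    10.4.10.*) echo "Trusted (VLAN 10, vlan01)" ;;
    10.4.20.*) echo "IoT (VLAN 20, vlan02)" ;;
    10.4.30.*) echo "Guest (VLAN 30, vlan03)" ;;
    *)         echo "unknown"; return 1 ;;
  esac
}

# Example: classify a device's DHCP-assigned address
segment_for_ip 10.4.20.37   # prints: IoT (VLAN 20, vlan02)
```

Suppose a device joined to KavCorp-IOT reports a 10.4.2.x address. Its traffic is most likely reaching OPNsense untagged, so it lands on the LAN interface and receives a LAN DHCP lease.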
| VLAN | Interface | Subnet | Gateway | Purpose |
|------|-----------|--------|---------|---------|
| none (untagged) | vtnet0 (LAN) | 10.4.2.0/24 | 10.4.2.1 | Infrastructure (Proxmox, core services) |
| 10 | vlan01 | 10.4.10.0/24 | 10.4.10.1 | Trusted (user devices) |
| 20 | vlan02 | 10.4.20.0/24 | 10.4.20.1 | IoT (KavCorp-IOT SSID) |
| 30 | vlan03 | 10.4.30.0/24 | 10.4.30.1 | Guest (KavCorp-Guest SSID) |

#### DHCP Configuration

All DHCP is served by OPNsense:
- LAN: 10.4.2.100-200, DNS: 10.4.2.129 (Pi-hole)
- Trusted: 10.4.10.100-200, DNS: 10.4.2.129
- IoT: 10.4.20.100-200, DNS: 10.4.2.129
- Guest: 10.4.30.100-200, DNS: 10.4.2.129

#### OPNsense Firewall Rules (Implemented)

| Rule | Source | Destination | Action |
|------|--------|-------------|--------|
| Allow DNS | IoT/Guest | 10.4.2.129:53 | Pass |
| Block IoT→LAN | 10.4.20.0/24 | 10.4.2.0/24 | Block |
| Block Guest→LAN | 10.4.30.0/24 | 10.4.2.0/24 | Block |
| Block Guest→IoT | 10.4.30.0/24 | 10.4.20.0/24 | Block |
| Allow LAN→IoT | 10.4.2.0/24 | 10.4.20.0/24 | Pass |
| Allow IoT Internet | 10.4.20.0/24 | any | Pass |
| Allow Guest Internet | 10.4.30.0/24 | any | Pass |

**Note**: The LAN→IoT rule lets Home Assistant, Frigate, and other LAN services reach IoT devices (cameras, sensors, etc.).
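The rule table above is order-sensitive. The Allow DNS pass has to sit above the IoT→LAN block, otherwise IoT devices could not reach Pi-hole. The following is a minimal bash sketch of that evaluation. It assumes first-match evaluation in the listed order, which is how OPNsense treats rules with its default "quick" setting. It ignores protocol and which interface a rule is attached to. `fw_verdict`, `ip_in_cidr` and `ip_to_int` are hypothetical helpers, not OPNsense commands. Traffic that none of the listed rules covers prints `no-match`.

```bash
# Hypothetical helpers for illustration only; not part of OPNsense.
# ip_to_int: dotted quad -> 32-bit integer (runs in a subshell so IFS stays local)
ip_to_int() (
  IFS=.
  set -- $1                                  # split the dotted quad into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# ip_in_cidr ADDR CIDR: succeeds if ADDR lies in CIDR; "any" matches everything
ip_in_cidr() {
  [ "$2" = "any" ] && return 0
  local bits=${2#*/}
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "${2%/*}") & mask )) ]
}

# fw_verdict SRC DST PORT: print the first rule (in table order) that matches.
# Rule fields: action source destination port ("*" = any port) name
fw_verdict() {
  local from=$1 to=$2 dport=$3 action src dst port name
  while read -r action src dst port name; do
    ip_in_cidr "$from" "$src" || continue
    ip_in_cidr "$to" "$dst" || continue
    [ "$port" = "*" ] || [ "$port" = "$dport" ] || continue
    echo "$action: $name"
    return 0
  done <<'EOF'
pass  10.4.20.0/24 10.4.2.129/32 53 Allow DNS
pass  10.4.30.0/24 10.4.2.129/32 53 Allow DNS
block 10.4.20.0/24 10.4.2.0/24   *  Block IoT→LAN
block 10.4.30.0/24 10.4.2.0/24   *  Block Guest→LAN
block 10.4.30.0/24 10.4.20.0/24  *  Block Guest→IoT
pass  10.4.2.0/24  10.4.20.0/24  *  Allow LAN→IoT
pass  10.4.20.0/24 any           *  Allow IoT Internet
pass  10.4.30.0/24 any           *  Allow Guest Internet
EOF
  echo "no-match"
}

fw_verdict 10.4.20.15 10.4.2.129 53   # prints: pass: Allow DNS
fw_verdict 10.4.20.15 10.4.2.50 22    # prints: block: Block IoT→LAN
```

Run against the listed rules, `fw_verdict 10.4.20.15 10.4.10.20 443` returns `pass: Allow IoT Internet`. No listed rule covers the Trusted subnet (10.4.10.0/24), so the IoT `any` rule matches that traffic. If IoT and Guest devices should not reach Trusted devices, that case needs its own block rule placed above the Internet rules.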
#### Network Segmentation Philosophy

| Network | Contains | Access Level |
|---------|----------|--------------|
| 10.4.2.0/24 (LAN) | Proxmox hosts, OPNsense, Pi-hole, Traefik, NAS | Full infrastructure access |
| 10.4.10.0/24 (Trusted) | User PCs, laptops | Full access to LAN and services |
| 10.4.20.0/24 (IoT) | Smart devices, cameras | Internet + DNS only, no LAN access |
| 10.4.30.0/24 (Guest) | Guest WiFi | Internet + DNS only, no local access |

#### Future Considerations

- Consider adding a **Servers VLAN** to isolate services (media stack, Bitwarden) from infrastructure
- Consider OPNsense HA (CARP) with a second USB NIC on another node for failover

### Router/Firewall

**Decision**: OPNsense VM 130 on pm4 (server closet)
**Status**: Deployed, pending WAN cutover
**Reason**:
- Free, full-featured firewall/router
- Inter-subnet firewall rules for IoT/Guest isolation
- IDS/IPS capability
- pm4 is in the server closet next to the AT&T modem (avoids routing WAN over the backhaul)

**Network Interfaces (VM 130)**:

| Interface | Bridge | Purpose | Status |
|-----------|--------|---------|--------|
| net0 | vmbr0 | LAN (10.4.2.0/24) | Configured |
| net1 | vmbr1 | WAN (to AT&T modem) | Configured |

**pm4 Bridge Configuration**:

| Bridge | Physical NIC | Purpose |
|--------|--------------|---------|
| vmbr0 | eno1 (Intel I226-V) | LAN: all VMs/LXCs |
| vmbr1 | enx6c1ff76e4d47 (USB 2.5G) | WAN: OPNsense only |

**HA/Failover Consideration**:
- Current: Single OPNsense on pm4 (single point of failure)
- Future options:
  1. OPNsense HA with CARP (requires a second USB NIC on another node)
  2. Keep the current router as a cold standby (swap cables if pm4 fails)

**Alternative Considered**: Ubiquiti Dream Machine
- Rejected due to cost and ecosystem lock-in
- OPNsense is more flexible for a homelab

**Alternative Considered**: OPNsense on Elantris (basement)
- Rejected because WAN would need to traverse the 10G backhaul
- Would require managed switches for WAN VLAN isolation

### 10G Backhaul (Planned)

**Decision**: 10G RJ45 between server closet and basement
**Hardware**: 2× GiGaPlus 6-Port 10G PoE switches ($101 each)
**Why GiGaPlus over UniFi**:
- Native 10G RJ45 (no SFP+ transceivers needed)
- Includes PoE for APs
- $202 total vs $800+ for the UniFi equivalent
- Cat6 can carry 10G at house distances (<55m)

### WiFi (Planned)

**Decision**: UniFi APs with mixed models
**Hardware**:
- 1× U6 Enterprise (existing): server closet/upstairs
- 2× U7 Pro ($189 each): basement + main floor

**Why UniFi**:
- Multiple SSIDs mapped to VLANs
- Seamless roaming between APs
- Centralized management via controller
- Better VLAN support than Asus mesh

**Controller**: LXC on Proxmox (free) via community helper script

### OPNsense Configuration Patterns

**Interface Names in config.xml** (IMPORTANT):

| UI Name | config.xml | Physical | Subnet |
|---------|------------|----------|--------|
| LAN | opt1 | vtnet0 | 10.4.2.0/24 |
| WAN | wan | vtnet1 | DHCP |
| Trusted | opt2 | vlan01 | 10.4.10.0/24 |
| IoT | opt3 | vlan02 | 10.4.20.0/24 |
| Guest | opt4 | vlan03 | 10.4.30.0/24 |

**Why This Matters**: When editing config.xml directly, refer to the LAN interface as `opt1`, not `lan`. Using the wrong name causes rules to fail silently.

**Firewall Rule Reload Commands**:

```bash
# Regenerate and reload the firewall ruleset
configctl filter reload

# Show the filter rules currently loaded in pf
pfctl -sr

# Parse the generated rules file without loading it (syntax check only)
pfctl -nf /tmp/rules.debug

# Inspect the ruleset OPNsense generated
cat /tmp/rules.debug
```

**Common Gotchas**:
1. An IPv6 rule that contains IPv4 addresses makes the entire ruleset fail to load
2. Rules added via config.xml need the config.xml interface names (`opt1`, not `lan`)
3. After config.xml edits, run `configctl filter reload` to apply them

### Reverse Proxy

**Decision**: Single Traefik instance handles all external access
**Location**: LXC 104 on pm2
**Benefits**:
- Single point for SSL/TLS management
- Automatic Let's Encrypt certificate renewal
- Centralized routing configuration
- DNS-01 challenge for wildcard certificates

### Service Domains

**Pattern**: `.kavcorp.com`
**DNS**: All subdomains point to the public IP (99.74.188.161)
**Routing**: Traefik matches on the Host header and routes to the internal service

## Storage Architecture

### Media Storage

**Decision**: NFS mount from elantris for all media
**Path**: `/mnt/pve/elantris-media` → elantris `/el-pool/media`
**Reason**:
- Centralized storage
- Accessible from all cluster nodes
- Large capacity (24TB ZFS pool)
- Easy to back up and snapshot

### LXC Root Filesystems

**Decision**: Store on KavNas NFS for most services
**Reason**:
- Easy backups
- Portable between nodes
- Network storage is sufficient for most workloads

**Exception**: High-I/O services use local-lvm

## Monitoring & Maintenance

### Configuration Management

**Decision**: Manual configuration with documentation
**Reason**: Small scale doesn't justify Ansible/Terraform complexity
**Trade-off**: Requires disciplined documentation updates

### Backup Strategy

**Decision**: Proxmox built-in backup to KavNas
**Frequency**: [To be determined]
**Retention**: [To be determined]

## Common Patterns

### Adding a New Service Behind Traefik

1. Deploy the service with a static IP in the 10.4.2.0/24 range
2. Create a Traefik config in `/etc/traefik/conf.d/.yaml`
3. Use this pattern:
   ```yaml
   http:
     routers:
       :
         rule: "Host(`.kavcorp.com`)"
         entryPoints: [websecure]
         service:
         tls:
           certResolver: letsencrypt
     services:
       :
         loadBalancer:
           servers:
             - url: "http://:"
   ```
4. Traefik auto-reloads (no restart needed)
5. Update `docs/INFRASTRUCTURE.md` with service details

### Troubleshooting Permission Issues

1. Check file ownership: `ls -la /path/to/file`
2. Check the mode: `stat -c '%a' /path/to/file` (should print `777`)
3. Fix permissions: `chmod -R 777 /path/to/directory`
4. For NZBGet: Verify `UMask=0000` in nzbget.conf
5. For Sonarr/Radarr: Check Settings → Media Management → Set Permissions

### Node SSH Access

**From local machine**:
- User: `kavren`
- Key: `~/.ssh/id_ed25519`

**Between cluster nodes**:
- User: `root`
- Each node has the other nodes' keys in `/root/.ssh/authorized_keys`
- The Proxmox web UI uses node SSH for shell access

## Known Issues & Workarounds

### Jellyfin Not Seeing Media After Import

**Symptom**: Files are imported to `/media/tv`, but Jellyfin shows an empty library
**Cause**: The Jellyfin LXC's media mount is not active, or permissions are wrong
**Fix**:
1. Restart the Jellyfin LXC: `pct stop 121 && pct start 121`
2. Verify the mount inside the LXC: `pct exec 121 -- ls -la /media/tv/`
3. Fix permissions if needed: `chmod -R 777 /mnt/pve/elantris-media/tv/`

### Sonarr/Radarr Import Failures

**Symptom**: "Access denied" errors in logs
**Cause**: Permission mismatch between the download client and the *arr service
**Fix**: Ensure the download folder has 777 permissions

## Future Considerations

- [ ] Automated backup strategy
- [ ] Monitoring/alerting system (Prometheus + Grafana?)
- [ ] Consider Authelia for future services without built-in auth
- [ ] Document disaster recovery procedures
- [ ] Consider consolidating Docker hosts