# Architecture Decisions & Patterns

> **Purpose**: Record of important decisions, patterns, and "why we do it this way"
> **Update Frequency**: When making significant architectural choices

## Service Organization

### Authentication Strategy

**Decision**: Services use their own built-in authentication, not Authelia
**Reason**: Most *arr services and media tools have robust auth systems
**Exception**: Consider Authelia for future services that lack authentication

### LXC vs Docker

**Keep in Docker**:
- NZBGet (requires specific volume mapping, works well in Docker)
- Multi-container stacks
- Services requiring Docker-specific features

**Migrate to LXC**:
- Single-purpose services (Sonarr, Radarr, etc.)
- Services benefiting from isolation
- Stateless applications

## File Permissions

### Media Files

**Standard**: All media files and folders must be 777
**Reason**:
- NFS mounts between multiple systems with different UID mappings
- Jellyfin runs in LXC with UID namespace mapping (100107)
- Sonarr runs in LXC with different UID mapping
- NZBGet runs in Docker with UID 1000

**Implementation**:
- NZBGet: `UMask=0000` to create files with 777
- Sonarr: Media management → Set permissions → chmod 777
- Manual fixes: `chmod -R 777` on media directories as needed

## Network Architecture

### Local DNS (.kav TLD)

**Decision**: Use `.kav` as the local top-level domain for internal services
**Reason**:
- Unique to KavCorp network, avoids conflicts with real TLDs
- Short and memorable
- Works without additional configuration
- Pi-hole handles resolution via `dns.hosts` in pihole.toml

**Alternatives Considered**:
- `.lan` - Common but can conflict with some routers
- `.local` - Conflicts with mDNS/Bonjour
- `.home.arpa` - RFC 8375 compliant but verbose

**Usage**:
- **HTTPS (recommended)**: `https://<service>.kavcorp.com` - valid Let's Encrypt certs, works internally and externally
- **HTTP (optional)**: `http://<service>.kav:8080/` - internal only, no certs needed

**Internal DNS Configuration**:
- Pi-hole resolves `*.kavcorp.com` to Traefik (10.4.2.10) for internal HTTPS access
- Pi-hole resolves `.kav` domains to Traefik for HTTP:8080 access
- Direct access (no Traefik): pm1-4.kav, elantris.kav, kavnas.kav, docker hosts, mqtt.kav, zwave.kav

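As a rough illustration of the `.kav` records described above, the sketch below prints a `hosts` array in the "IP hostname" string form that Pi-hole's `dns.hosts` setting is understood to take. The helper name `render_dns_hosts` and the choice of services are assumptions for illustration, not taken from the real pihole.toml.

```bash
# Hypothetical helper: emit a dns.hosts-style array that maps each
# <service>.kav name to a single target IP (here, Traefik).
render_dns_hosts() {
    target_ip="$1"
    shift
    echo "hosts = ["
    for svc in "$@"; do
        printf '  "%s %s.kav",\n' "$target_ip" "$svc"
    done
    echo "]"
}

# Example with two service names that appear elsewhere in this document.
render_dns_hosts 10.4.2.10 jellyfin sonarr
```

The output would sit under the `[dns]` table of pihole.toml. Hosts listed under "Direct access" would use their own addresses rather than Traefik's.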
### SSH Access Policy

**Decision**: SSH from workstation only, no container-to-container SSH
**Reason**:
- Reduces attack surface
- Single key to manage
- Containers don't need to communicate via SSH

**Implementation**:
- Workstation ed25519 key added to all containers
- `PermitRootLogin prohibit-password` (key-only)
- Provisioning script: `scripts/provisioning/setup-ssh-access.sh`

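One way to spot-check the `PermitRootLogin` setting is sketched below. It is a simplified check under stated assumptions: the function name `root_login_is_key_only` is hypothetical, it does not reflect the contents of `setup-ssh-access.sh`, and it only looks for an active line in a single file rather than resolving includes or match blocks.

```bash
# Hypothetical check: succeed only if the file has an active (uncommented)
# "PermitRootLogin prohibit-password" line.
root_login_is_key_only() {
    grep -Eq '^[[:space:]]*PermitRootLogin[[:space:]]+prohibit-password[[:space:]]*$' "$1"
}

# Demonstration against a scratch file, not a real /etc/ssh/sshd_config.
demo_conf="$(mktemp)"
printf '%s\n' '#PermitRootLogin yes' 'PermitRootLogin prohibit-password' > "$demo_conf"

if root_login_is_key_only "$demo_conf"; then
    echo "root login is key-only"
else
    echo "root login policy needs attention"
fi

rm -f "$demo_conf"
```

On a live container, `sshd -T` prints the effective configuration and is the more reliable source than grepping a file.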
### IP Allocation Scheme

**Decision**: Organized IP ranges by service type
**Reason**: Easy to identify service type from IP, logical grouping

| Range | Purpose |
|-------|---------|
| 10.4.2.1 | Gateway (OPNsense) |
| 10.4.2.2-9 | Proxmox nodes |
| 10.4.2.10-19 | Core infrastructure |
| 10.4.2.20-29 | Media stack |
| 10.4.2.30-39 | Other services |
| 10.4.2.40-49 | Game servers |
| 10.4.2.50-99 | IoT / Reserved |
| 10.4.2.100-199 | DHCP pool |
| 10.4.2.200-209 | Docker hosts |

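The "identify service type from IP" idea can be sketched as a small lookup. This is only an illustration of the table above: the function name `ip_purpose` is hypothetical, and addresses the table does not list are reported as unallocated.

```bash
# Hypothetical helper: map a 10.4.2.x address to its allocation range.
ip_purpose() {
    case "$1" in
        10.4.2.*) host="${1#10.4.2.}" ;;      # keep only the last octet
        *) echo "Outside 10.4.2.0/24"; return 1 ;;
    esac
    case "$host" in
        ''|*[!0-9]*) echo "Invalid address"; return 1 ;;
    esac
    if   [ "$host" -lt 1 ];   then echo "Unallocated"
    elif [ "$host" -eq 1 ];   then echo "Gateway (OPNsense)"
    elif [ "$host" -le 9 ];   then echo "Proxmox nodes"
    elif [ "$host" -le 19 ];  then echo "Core infrastructure"
    elif [ "$host" -le 29 ];  then echo "Media stack"
    elif [ "$host" -le 39 ];  then echo "Other services"
    elif [ "$host" -le 49 ];  then echo "Game servers"
    elif [ "$host" -le 99 ];  then echo "IoT / Reserved"
    elif [ "$host" -le 199 ]; then echo "DHCP pool"
    elif [ "$host" -le 209 ]; then echo "Docker hosts"
    else echo "Unallocated"
    fi
}

ip_purpose 10.4.2.10    # Core infrastructure
ip_purpose 10.4.2.25    # Media stack
ip_purpose 10.4.2.150   # DHCP pool
```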
### Network Isolation Strategy

**Goal**: Isolate IoT (KavCorp-IOT) and Guest (KavCorp-Guest) WiFi networks from the main LAN, while allowing Smart Home VMs to access IoT devices.

**Status**: Implemented via OPNsense VLANs and firewall rules.

#### VLAN Architecture

Unmanaged gigabit switches pass VLAN tags through unchanged (they forward tagged frames without interpreting them). UniFi APs tag traffic per SSID, and OPNsense receives the tagged traffic on its VLAN interfaces.

| VLAN | Interface | Subnet | Gateway | Purpose |
|------|-----------|--------|---------|---------|
| - | vtnet0 (LAN) | 10.4.2.0/24 | 10.4.2.1 | Infrastructure (Proxmox, core services) |
| 10 | vlan01 | 10.4.10.0/24 | 10.4.10.1 | Trusted (user devices) |
| 20 | vlan02 | 10.4.20.0/24 | 10.4.20.1 | IoT (KavCorp-IOT SSID) |
| 30 | vlan03 | 10.4.30.0/24 | 10.4.30.1 | Guest (KavCorp-Guest SSID) |

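When reading logs or DHCP leases it can help to turn an address back into its network. The lookup below is a minimal sketch of the table above. The function name `network_for_ip` is hypothetical, and the prefix match does no strict address validation.

```bash
# Hypothetical helper: report which network from the VLAN table an
# address belongs to, based on its /24 prefix.
network_for_ip() {
    case "$1" in
        10.4.2.*)  echo "LAN (untagged, vtnet0), gateway 10.4.2.1" ;;
        10.4.10.*) echo "Trusted (VLAN 10, vlan01), gateway 10.4.10.1" ;;
        10.4.20.*) echo "IoT (VLAN 20, vlan02), gateway 10.4.20.1" ;;
        10.4.30.*) echo "Guest (VLAN 30, vlan03), gateway 10.4.30.1" ;;
        *)         echo "Not a local subnet" ;;
    esac
}

network_for_ip 10.4.20.115   # IoT (VLAN 20, vlan02), gateway 10.4.20.1
network_for_ip 10.4.2.11     # LAN (untagged, vtnet0), gateway 10.4.2.1
```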
#### DHCP Configuration

All DHCP served by OPNsense:
- LAN: 10.4.2.100-200, DNS: 10.4.2.11 (Pi-hole)
- Trusted: 10.4.10.100-200, DNS: 10.4.2.11
- IoT: 10.4.20.100-200, DNS: 10.4.2.11
- Guest: 10.4.30.100-200, DNS: 10.4.2.11

#### OPNsense Firewall Rules (Implemented)

| Rule | Source | Destination | Action |
|------|--------|-------------|--------|
| Allow DNS | IoT/Guest | 10.4.2.11:53 | Pass |
| Allow Guest→Traefik | 10.4.30.0/24 | 10.4.2.10:443 | Pass |
| Allow Guest→Media | 10.4.30.0/24 | 10.4.2.25, 10.4.2.26 | Pass |
| Block IoT→LAN | 10.4.20.0/24 | 10.4.2.0/24 | Block |
| Block Guest→LAN | 10.4.30.0/24 | 10.4.2.0/24 | Block |
| Block Guest→IoT | 10.4.30.0/24 | 10.4.20.0/24 | Block |
| Allow LAN→IoT | 10.4.2.0/24 | 10.4.20.0/24 | Pass |
| Allow IoT Internet | 10.4.20.0/24 | any | Pass |
| Allow Guest Internet | 10.4.30.0/24 | any | Pass |

**Note**: LAN→IoT rule allows Home Assistant, Frigate, and other LAN services to access IoT devices (cameras, sensors, etc.).

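Rule order matters here: the specific Pass rules sit above the broad Block rules, which sit above the "any" Pass rules. The sketch below models that table as a first-match list to show why. It is a simplification under stated assumptions: it treats the table as one ordered list where the first match wins, it ignores per-interface rule placement and connection state, and it covers only the rules listed above. The function name `evaluate` and the glob encoding of the /24 subnets are illustrative.

```bash
# Simplified model of the rule table: the first matching rule wins.
# Fields per line: source-glob destination-glob port action rule-name
# The globs stand in for the /24 subnets; "*" matches anything.
evaluate() {
    src_ip="$1"; dst_ip="$2"; dst_port="$3"
    while read -r src dst port action name; do
        case "$src_ip" in $src) ;; *) continue ;; esac
        case "$dst_ip" in $dst) ;; *) continue ;; esac
        if [ "$port" = "*" ] || [ "$port" = "$dst_port" ]; then
            echo "$action ($name)"
            return 0
        fi
    done <<'EOF'
10.4.20.* 10.4.2.11 53 pass Allow DNS
10.4.30.* 10.4.2.11 53 pass Allow DNS
10.4.30.* 10.4.2.10 443 pass Allow Guest→Traefik
10.4.30.* 10.4.2.25 * pass Allow Guest→Media
10.4.30.* 10.4.2.26 * pass Allow Guest→Media
10.4.20.* 10.4.2.* * block Block IoT→LAN
10.4.30.* 10.4.2.* * block Block Guest→LAN
10.4.30.* 10.4.20.* * block Block Guest→IoT
10.4.2.* 10.4.20.* * pass Allow LAN→IoT
10.4.20.* * * pass Allow IoT Internet
10.4.30.* * * pass Allow Guest Internet
EOF
    echo "block (default deny)"
}

evaluate 10.4.30.50 10.4.2.10 443   # guest reaching Traefik over HTTPS
evaluate 10.4.30.50 10.4.2.10 22    # guest trying SSH on the same host
evaluate 10.4.20.7 10.4.2.30 80     # IoT device trying a LAN service
```

In this model the first call passes on the Guest→Traefik rule, while the second falls through to Block Guest→LAN because only port 443 is opened.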
#### Network Segmentation Philosophy

| Network | Contains | Access Level |
|---------|----------|--------------|
| 10.4.2.0/24 (LAN) | Proxmox hosts, OPNsense, Pi-hole, Traefik, NAS | Full infrastructure access |
| 10.4.10.0/24 (Trusted) | User PCs, laptops | Full access to LAN and services |
| 10.4.20.0/24 (IoT) | Smart devices, cameras | Internet + DNS only, no LAN access |
| 10.4.30.0/24 (Guest) | Guest WiFi | Internet + DNS, plus Traefik on 443 and the media hosts (10.4.2.25, 10.4.2.26); no other local access |

#### Future Considerations

- Consider adding a **Servers VLAN** to isolate services (media stack, Bitwarden) from infrastructure
- Consider OPNsense HA (CARP) with second USB NIC on another node for failover

### Router/Firewall

**Decision**: OPNsense VM 130 on pm4 (server closet)
**Status**: Deployed, pending WAN cutover

**Reason**:
- Free, full-featured firewall/router
- Inter-subnet firewall rules for IoT/Guest isolation
- IDS/IPS capability
- pm4 is in server closet next to AT&T modem (avoids routing WAN over backhaul)

**Network Interfaces (VM 130)**:

| Interface | Bridge | Purpose | Status |
|-----------|--------|---------|--------|
| net0 | vmbr0 | LAN (10.4.2.0/24) | Configured |
| net1 | vmbr1 | WAN (to AT&T modem) | Configured |

**pm4 Bridge Configuration**:

| Bridge | Physical NIC | Purpose |
|--------|--------------|---------|
| vmbr0 | eno1 (Intel I226-V) | LAN - all VMs/LXCs |
| vmbr1 | enx6c1ff76e4d47 (USB 2.5G) | WAN - OPNsense only |

**HA/Failover Consideration**:
- Current: Single OPNsense on pm4 (SPOF)
- Future options:
  1. OPNsense HA with CARP (requires second USB NIC on another node)
  2. Keep current router as cold standby (swap cables if pm4 fails)
  3. Protectli Vault as backup router (limited by port speeds)

**Alternative Considered**: Ubiquiti Dream Machine
- Rejected due to cost and ecosystem lock-in
- OPNsense more flexible for homelab

**Alternative Considered**: OPNsense on Elantris (basement)
- Rejected because WAN would need to traverse 10G backhaul
- Would require managed switches for WAN VLAN isolation

### 10G Backhaul (Planned)

**Decision**: 10G RJ45 between server closet and basement
**Hardware**: 2× GiGaPlus 6-Port 10G PoE switches ($101 each)
**Why GiGaPlus over UniFi**:
- Native 10G RJ45 (no SFP+ transceivers needed)
- Includes PoE for APs
- $202 total vs $800+ for UniFi equivalent
- Cat6 can handle 10G at house distances (<55m)

### WiFi (Planned)

**Decision**: UniFi APs with mixed models
**Hardware**:
- 1× U6 Enterprise (existing) - server closet/upstairs
- 2× U7 Pro ($189 each) - basement + main floor

**Why UniFi**:
- Multiple SSIDs mapped to VLANs
- Seamless roaming between APs
- Centralized management via controller
- Better than Asus mesh for VLAN support

**Controller**: LXC on Proxmox (free) via community helper script

### OPNsense Configuration Patterns

**Interface Names in config.xml** (IMPORTANT):

| UI Name | config.xml | Physical | Subnet |
|---------|------------|----------|--------|
| LAN | opt1 | vtnet0 | 10.4.2.0/24 |
| WAN | wan | vtnet1 | DHCP |
| Trusted | opt2 | vlan01 | 10.4.10.0/24 |
| IoT | opt3 | vlan02 | 10.4.20.0/24 |
| Guest | opt4 | vlan03 | 10.4.30.0/24 |

**Why This Matters**: When editing config.xml directly, use `opt1` not `lan`. Using the wrong name causes rules to fail silently.

**Firewall Rule Reload Commands**:
```bash
# Reload firewall rules (safe, full filter reload)
configctl filter reload

# Check active rules
pfctl -sr

# Test rules file for syntax errors
pfctl -nf /tmp/rules.debug

# View generated rules before loading
cat /tmp/rules.debug
```

**Common Gotchas**:
1. IPv6 rules with IPv4 addresses cause entire ruleset to fail loading
2. Rules added via config.xml need proper interface names (opt1, not lan)
3. After config.xml edits, run `configctl filter reload` to apply
4. NAT port range rules: `<local-port>` must be just the starting port, not the full range
   - Correct: `<port>2223-2323</port>` with `<local-port>2223</local-port>`
   - Wrong: `<port>2223-2323</port>` with `<local-port>2223-2323</local-port>` (rule will be commented out)
5. NAT reflection requires `enablenatreflectionhelper` (not just purenat) when clients and servers are on the same subnet - pure NAT doesn't source-NAT so return traffic bypasses OPNsense

### Reverse Proxy

**Decision**: Single Traefik instance handles all external access
**Location**: LXC 104 on pm2
**Benefits**:
- Single point for SSL/TLS management
- Automatic Let's Encrypt certificate renewal
- Centralized routing configuration
- DNS-01 challenge for wildcard certificates

### Service Domains

**Pattern**: `<service>.kavcorp.com`
**DNS**: Public DNS points all subdomains to the public IP (99.74.188.161); internally, Pi-hole resolves them to Traefik (10.4.2.10)
**Routing**: Traefik inspects Host header and routes internally

## Storage Architecture

### Media Storage

**Decision**: NFS mount from elantris for all media
**Path**: `/mnt/pve/elantris-media` → elantris `/el-pool/media`
**Reason**:
- Centralized storage
- Accessible from all cluster nodes
- Large capacity (24TB ZFS pool)
- Easy to backup/snapshot

### LXC Root Filesystems

**Decision**: Store on KavNas NFS for most services
**Reason**:
- Easy backups
- Portable between nodes
- Network storage sufficient for most workloads

**Exception**: High I/O services use local-lvm

## Monitoring & Maintenance

### Configuration Management

**Decision**: Manual configuration with documentation
**Reason**: Small scale doesn't justify Ansible/Terraform complexity
**Trade-off**: Requires disciplined documentation updates

### Backup Strategy

**Decision**: Proxmox built-in backup to KavNas
**Frequency**: [To be determined]
**Retention**: [To be determined]

## Common Patterns

### Adding a New Service Behind Traefik

1. Deploy service with static IP in 10.4.2.0/24 range
2. Create Traefik config in `/etc/traefik/conf.d/<service>.yaml`
3. Use pattern:
   ```yaml
   http:
     routers:
       <service>:
         rule: "Host(`<service>.kavcorp.com`)"
         entryPoints: [websecure]
         service: <service>
         tls:
           certResolver: letsencrypt
     services:
       <service>:
         loadBalancer:
           servers:
             - url: "http://<ip>:<port>"
   ```
4. Traefik auto-reloads (no restart needed)
5. Update `docs/INFRASTRUCTURE.md` with service details

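Because the pattern in step 3 only varies by service name, IP, and port, it could be filled in by a small template function. The sketch below is one hedged way to do that. The helper name `render_traefik_config` is hypothetical, and the example IP and port are placeholder values rather than a real service's address.

```bash
# Hypothetical helper: print a Traefik dynamic-config file for one service,
# following the pattern in step 3.
render_traefik_config() {
    svc="$1"; ip="$2"; port="$3"
    cat <<EOF
http:
  routers:
    ${svc}:
      rule: "Host(\`${svc}.kavcorp.com\`)"
      entryPoints: [websecure]
      service: ${svc}
      tls:
        certResolver: letsencrypt
  services:
    ${svc}:
      loadBalancer:
        servers:
          - url: "http://${ip}:${port}"
EOF
}

# Preview the file for an example service (placeholder IP and port).
render_traefik_config sonarr 10.4.2.20 8989
```

Redirecting the output to `/etc/traefik/conf.d/<service>.yaml` would correspond to step 2. Reviewing the preview first keeps typos out of the live config directory.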
### Troubleshooting Permission Issues

1. Check file ownership: `ls -la /path/to/file`
2. Check if 777: `stat /path/to/file`
3. Fix permissions: `chmod -R 777 /path/to/directory`
4. For NZBGet: Verify `UMask=0000` in nzbget.conf
5. For Sonarr/Radarr: Check Settings → Media Management → Set Permissions

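Steps 2 and 3 can be combined into a quick audit that lists only the entries that are not 777, which is easier to scan than `stat` output on a large tree. The sketch below assumes a hypothetical helper name, `list_not_777`, and demonstrates it on a scratch directory rather than the real media mount.

```bash
# Hypothetical helper: list every entry under "$1" whose mode is not exactly 777.
list_not_777() {
    find "$1" ! -perm 777
}

# Demonstration on a scratch directory, not the real media mount.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/tv"
touch "$demo_dir/tv/episode.mkv"
chmod 777 "$demo_dir" "$demo_dir/tv"
chmod 644 "$demo_dir/tv/episode.mkv"

list_not_777 "$demo_dir"     # prints only the path of episode.mkv

chmod -R 777 "$demo_dir"     # the same fix as step 3
list_not_777 "$demo_dir"     # prints nothing once everything is 777

rm -rf "$demo_dir"
```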
### Node SSH Access

**From local machine**:
- User: `kavren`
- Key: `~/.ssh/id_ed25519`

**Between cluster nodes**:
- User: `root`
- Each node has other nodes' keys in `/root/.ssh/authorized_keys`
- Proxmox web UI uses node SSH for shell access

## Known Issues & Workarounds

### Jellyfin Not Seeing Media After Import

**Symptom**: Files imported to `/media/tv` but Jellyfin shows empty
**Cause**: Jellyfin LXC mount not active or permissions wrong
**Fix**:
1. Restart Jellyfin LXC: `pct stop 121 && pct start 121`
2. Verify mount inside LXC: `pct exec 121 -- ls -la /media/tv/`
3. Fix permissions if needed: `chmod -R 777 /mnt/pve/elantris-media/tv/`

### Sonarr/Radarr Import Failures

**Symptom**: "Access denied" errors in logs
**Cause**: Permission mismatch between download client and *arr service
**Fix**: Ensure download folder has 777 permissions

## Future Considerations

- [ ] Automated backup strategy
- [ ] Monitoring/alerting system (Prometheus + Grafana?)
- [ ] Consider Authelia for future services without built-in auth
- [ ] Document disaster recovery procedures
- [ ] Consider consolidating Docker hosts