proxmox-infra/docs/DECISIONS.md
kavren e93030ba9b docs: Complete OPNsense VLAN and firewall configuration
- Updated CHANGELOG with implemented VLAN config (VLANs 10, 20, 30)
- Updated DECISIONS with complete VLAN architecture and firewall rules
- Updated INFRASTRUCTURE with VLANs/subnets table and bridge configs
- Updated TASKS to mark VLAN/firewall work complete, add UniFi VLAN tasks
- Updated README last updated date

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 20:52:38 -05:00

# Architecture Decisions & Patterns
> **Purpose**: Record of important decisions, patterns, and "why we do it this way"
> **Update Frequency**: When making significant architectural choices
## Service Organization
### Authentication Strategy
**Decision**: Services use their own built-in authentication, not Authelia
**Reason**: Most *arr services and media tools have robust auth systems
**Exception**: Consider Authelia for future services that lack authentication
### LXC vs Docker
**Keep in Docker**:
- NZBGet (requires specific volume mapping, works well in Docker)
- Multi-container stacks
- Services requiring Docker-specific features
**Migrate to LXC**:
- Single-purpose services (Sonarr, Radarr, etc.)
- Services benefiting from isolation
- Stateless applications
## File Permissions
### Media Files
**Standard**: All media files and folders must be 777
**Reason**:
- NFS mounts between multiple systems with different UID mappings
- Jellyfin runs in LXC with UID namespace mapping (100107)
- Sonarr runs in LXC with different UID mapping
- NZBGet runs in Docker with UID 1000
**Implementation**:
- NZBGet: `UMask=0000` so that no permission bits are masked off newly created files and directories
- Sonarr: Media management → Set permissions → chmod 777
- Manual fixes: `chmod -R 777` on media directories as needed
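
To illustrate why the umask setting matters, here is a minimal Python sketch (not part of the actual setup; the helper name `created_mode` and the temporary paths are purely illustrative). It creates a file or directory under a given umask and reports the resulting permission bits, assuming a Linux host.

```python
import os
import tempfile


def created_mode(umask: int, requested: int, make_dir: bool) -> int:
    """Create a file or directory under `umask` and return its permission bits."""
    old_umask = os.umask(umask)  # os.umask returns the previous mask
    try:
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "item")
            if make_dir:
                os.mkdir(path, requested)
            else:
                os.close(os.open(path, os.O_CREAT | os.O_WRONLY, requested))
            # Keep only the rwx bits for user, group, and others.
            return os.stat(path).st_mode & 0o777
    finally:
        os.umask(old_umask)  # always restore the caller's umask


if __name__ == "__main__":
    for mask in (0o000, 0o022):
        dir_mode = created_mode(mask, 0o777, make_dir=True)
        file_mode = created_mode(mask, 0o666, make_dir=False)
        print(f"umask {mask:03o}: dir {dir_mode:03o}, file {file_mode:03o}")
```

A umask of 0 removes nothing from the requested mode, so the final mode still depends on what the creating program asks for. This is why the manual `chmod -R 777` remains a useful fallback.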
## Network Architecture
### Network Isolation Strategy
**Goal**: Isolate IoT (KavCorp-IOT) and Guest (KavCorp-Guest) WiFi networks from the main LAN, while allowing Smart Home VMs to access IoT devices.
**Status**: Implemented via OPNsense VLANs and firewall rules.
#### VLAN Architecture
The unmanaged Gigabit switches pass VLAN-tagged frames through unchanged (they do not interpret the tags). UniFi APs tag traffic per SSID, and OPNsense receives the tagged traffic on its VLAN interfaces.
| VLAN | Interface | Subnet | Gateway | Purpose |
|------|-----------|--------|---------|---------|
| - | vtnet0 (LAN) | 10.4.2.0/24 | 10.4.2.1 | Infrastructure (Proxmox, core services) |
| 10 | vlan01 | 10.4.10.0/24 | 10.4.10.1 | Trusted (user devices) |
| 20 | vlan02 | 10.4.20.0/24 | 10.4.20.1 | IoT (KavCorp-IOT SSID) |
| 30 | vlan03 | 10.4.30.0/24 | 10.4.30.1 | Guest (KavCorp-Guest SSID) |
#### DHCP Configuration
All DHCP served by OPNsense:
- LAN: 10.4.2.100-200, DNS: 10.4.2.129 (Pi-hole)
- Trusted: 10.4.10.100-200, DNS: 10.4.2.129
- IoT: 10.4.20.100-200, DNS: 10.4.2.129
- Guest: 10.4.30.100-200, DNS: 10.4.2.129
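
As a sanity check on the addressing plan, the following Python sketch verifies that each gateway and DHCP pool lies inside its subnet. The values are copied from the table and list above; the function names `check_network` and `in_pool` are hypothetical helpers, not part of OPNsense.

```python
import ipaddress
from typing import List

# name -> (subnet, gateway, DHCP pool start, DHCP pool end), as documented above
NETWORKS = {
    "LAN": ("10.4.2.0/24", "10.4.2.1", "10.4.2.100", "10.4.2.200"),
    "Trusted": ("10.4.10.0/24", "10.4.10.1", "10.4.10.100", "10.4.10.200"),
    "IoT": ("10.4.20.0/24", "10.4.20.1", "10.4.20.100", "10.4.20.200"),
    "Guest": ("10.4.30.0/24", "10.4.30.1", "10.4.30.100", "10.4.30.200"),
}


def in_pool(address: str, pool_start: str, pool_end: str) -> bool:
    """Return True if `address` falls inside the inclusive DHCP range."""
    addr = ipaddress.ip_address(address)
    return ipaddress.ip_address(pool_start) <= addr <= ipaddress.ip_address(pool_end)


def check_network(subnet: str, gateway: str, pool_start: str, pool_end: str) -> List[str]:
    """Return a list of problems; an empty list means the entry is consistent."""
    net = ipaddress.ip_network(subnet)
    problems = []
    if ipaddress.ip_address(gateway) not in net:
        problems.append(f"gateway {gateway} is outside {subnet}")
    for label, addr in (("pool start", pool_start), ("pool end", pool_end)):
        if ipaddress.ip_address(addr) not in net:
            problems.append(f"{label} {addr} is outside {subnet}")
    if ipaddress.ip_address(pool_start) > ipaddress.ip_address(pool_end):
        problems.append("pool start is after pool end")
    if in_pool(gateway, pool_start, pool_end):
        problems.append(f"gateway {gateway} is inside the DHCP pool")
    return problems


if __name__ == "__main__":
    for name, entry in NETWORKS.items():
        print(name, check_network(*entry) or "ok")
    # Static addresses worth checking against the LAN pool.
    print("Pi-hole in LAN pool:", in_pool("10.4.2.129", "10.4.2.100", "10.4.2.200"))
```

Note that the Pi-hole address (10.4.2.129) falls inside the LAN pool as documented, so a static mapping or reservation is likely needed to avoid a conflict. Whether one already exists is not recorded here.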
#### OPNsense Firewall Rules (Implemented)
| Rule | Source | Destination | Action |
|------|--------|-------------|--------|
| Allow DNS | IoT/Guest | 10.4.2.129:53 | Pass |
| Block IoT→LAN | 10.4.20.0/24 | 10.4.2.0/24 | Block |
| Block Guest→LAN | 10.4.30.0/24 | 10.4.2.0/24 | Block |
| Block Guest→IoT | 10.4.30.0/24 | 10.4.20.0/24 | Block |
| Allow Home Assistant→IoT | 10.4.2.62 | 10.4.20.0/24 | Pass |
| Allow IoT Internet | 10.4.20.0/24 | any | Pass |
| Allow Guest Internet | 10.4.30.0/24 | any | Pass |
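
The order of these rules matters because the first matching rule decides. The Python sketch below models the table as a single ordered list with first-match evaluation. It is a simplification under stated assumptions: real OPNsense rules are attached per interface, the final "default deny" is an assumption of this model, and the `Rule` and `evaluate` names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    source: str                 # CIDR or single address
    destination: str            # CIDR, single address, or "any"
    action: str                 # "pass" or "block"
    port: Optional[int] = None  # None matches any port


# Rows from the table above, in the same order ("IoT/Guest" split into two rules).
RULES = [
    Rule("Allow DNS (IoT)", "10.4.20.0/24", "10.4.2.129", "pass", port=53),
    Rule("Allow DNS (Guest)", "10.4.30.0/24", "10.4.2.129", "pass", port=53),
    Rule("Block IoT→LAN", "10.4.20.0/24", "10.4.2.0/24", "block"),
    Rule("Block Guest→LAN", "10.4.30.0/24", "10.4.2.0/24", "block"),
    Rule("Block Guest→IoT", "10.4.30.0/24", "10.4.20.0/24", "block"),
    Rule("Allow Home Assistant→IoT", "10.4.2.62", "10.4.20.0/24", "pass"),
    Rule("Allow IoT Internet", "10.4.20.0/24", "any", "pass"),
    Rule("Allow Guest Internet", "10.4.30.0/24", "any", "pass"),
]


def _matches(spec: str, address: str) -> bool:
    """Return True if `address` is covered by `spec` ("any" matches everything)."""
    if spec == "any":
        return True
    return ipaddress.ip_address(address) in ipaddress.ip_network(spec)


def evaluate(src: str, dst: str, port: int) -> Tuple[str, str]:
    """Return (action, rule name) for the first rule that matches the packet."""
    for rule in RULES:
        if not _matches(rule.source, src):
            continue
        if not _matches(rule.destination, dst):
            continue
        if rule.port is not None and rule.port != port:
            continue
        return rule.action, rule.name
    return "block", "default deny (assumed)"


if __name__ == "__main__":
    print(evaluate("10.4.20.50", "10.4.2.129", 53))   # IoT device querying Pi-hole
    print(evaluate("10.4.20.50", "10.4.2.10", 443))   # IoT device reaching into LAN
    print(evaluate("10.4.20.50", "10.4.10.25", 443))  # IoT device reaching Trusted
```

Under this simplified model, IoT or Guest traffic to the Trusted subnet (10.4.10.0/24) matches the "Allow ... Internet" rules, because the table lists no block for that subnet. The real rule set may handle this differently (for example with an alias for private ranges), which the table does not show.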
#### Network Segmentation Philosophy
| Network | Contains | Access Level |
|---------|----------|--------------|
| 10.4.2.0/24 (LAN) | Proxmox hosts, OPNsense, Pi-hole, Traefik, NAS | Full infrastructure access |
| 10.4.10.0/24 (Trusted) | User PCs, laptops | Full access to LAN and services |
| 10.4.20.0/24 (IoT) | Smart devices, cameras | Internet + DNS only, no LAN access |
| 10.4.30.0/24 (Guest) | Guest WiFi | Internet + DNS only, no local access |
#### Future Considerations
- Consider adding a **Servers VLAN** to isolate services (media stack, Bitwarden) from infrastructure
- Consider OPNsense HA (CARP) with second USB NIC on another node for failover
### Router/Firewall
**Decision**: OPNsense VM 130 on pm4 (server closet)
**Status**: Deployed, pending WAN cutover
**Reason**:
- Free, full-featured firewall/router
- Inter-subnet firewall rules for IoT/Guest isolation
- IDS/IPS capability
- pm4 is in server closet next to AT&T modem (avoids routing WAN over backhaul)
**Network Interfaces (VM 130)**:
| Interface | Bridge | Purpose | Status |
|-----------|--------|---------|--------|
| net0 | vmbr0 | LAN (10.4.2.0/24) | Configured |
| net1 | vmbr1 | WAN (to AT&T modem) | Configured |
**pm4 Bridge Configuration**:
| Bridge | Physical NIC | Purpose |
|--------|--------------|---------|
| vmbr0 | eno1 (Intel I226-V) | LAN - all VMs/LXCs |
| vmbr1 | enx6c1ff76e4d47 (USB 2.5G) | WAN - OPNsense only |
**HA/Failover Consideration**:
- Current: Single OPNsense on pm4 (SPOF)
- Future options:
  1. OPNsense HA with CARP (requires second USB NIC on another node)
  2. Keep current router as cold standby (swap cables if pm4 fails)
**Alternative Considered**: Ubiquiti Dream Machine
- Rejected due to cost and ecosystem lock-in
- OPNsense more flexible for homelab
**Alternative Considered**: OPNsense on Elantris (basement)
- Rejected because WAN would need to traverse 10G backhaul
- Would require managed switches for WAN VLAN isolation
### 10G Backhaul (Planned)
**Decision**: 10G RJ45 between server closet and basement
**Hardware**: 2× GiGaPlus 6-Port 10G PoE switches ($101 each)
**Why GiGaPlus over UniFi**:
- Native 10G RJ45 (no SFP+ transceivers needed)
- Includes PoE for APs
- $202 total vs $800+ for UniFi equivalent
- Cat6 supports 10GBASE-T on runs up to about 55 m, which covers in-house distances
### WiFi (Planned)
**Decision**: UniFi APs with mixed models
**Hardware**:
- 1× U6 Enterprise (existing) - server closet/upstairs
- 2× U7 Pro ($189 each) - basement + main floor
**Why UniFi**:
- Multiple SSIDs mapped to VLANs
- Seamless roaming between APs
- Centralized management via controller
- Better than Asus mesh for VLAN support
**Controller**: LXC on Proxmox (free) via community helper script
### Reverse Proxy
**Decision**: Single Traefik instance handles all external access
**Location**: LXC 104 on pm2
**Benefits**:
- Single point for SSL/TLS management
- Automatic Let's Encrypt certificate renewal
- Centralized routing configuration
- DNS-01 challenge for wildcard certificates
### Service Domains
**Pattern**: `<service>.kavcorp.com`
**DNS**: All subdomains point to public IP (99.74.188.161)
**Routing**: Traefik inspects Host header and routes internally
## Storage Architecture
### Media Storage
**Decision**: NFS mount from elantris for all media
**Path**: `/mnt/pve/elantris-media` → elantris `/el-pool/media`
**Reason**:
- Centralized storage
- Accessible from all cluster nodes
- Large capacity (24TB ZFS pool)
- Easy to backup/snapshot
### LXC Root Filesystems
**Decision**: Store on KavNas NFS for most services
**Reason**:
- Easy backups
- Portable between nodes
- Network storage sufficient for most workloads
**Exception**: High I/O services use local-lvm
## Monitoring & Maintenance
### Configuration Management
**Decision**: Manual configuration with documentation
**Reason**: Small scale doesn't justify Ansible/Terraform complexity
**Trade-off**: Requires disciplined documentation updates
### Backup Strategy
**Decision**: Proxmox built-in backup to KavNas
**Frequency**: [To be determined]
**Retention**: [To be determined]
## Common Patterns
### Adding a New Service Behind Traefik
1. Deploy service with static IP in 10.4.2.0/24 range
2. Create Traefik config in `/etc/traefik/conf.d/<service>.yaml`
3. Use pattern:
   ```yaml
   http:
     routers:
       <service>:
         rule: "Host(`<service>.kavcorp.com`)"
         entryPoints: [websecure]
         service: <service>
         tls:
           certResolver: letsencrypt
     services:
       <service>:
         loadBalancer:
           servers:
             - url: "http://<ip>:<port>"
   ```
4. Traefik auto-reloads (no restart needed)
5. Update `docs/INFRASTRUCTURE.md` with service details
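
Steps 1 to 3 can be scripted. The following Python sketch renders the pattern above for a given service and writes it to a config directory. The function names and the example service values (`sonarr`, `10.4.2.20`, `8989`) are illustrative assumptions, not existing tooling in this repo.

```python
import ipaddress
import re
from pathlib import Path

LAN = ipaddress.ip_network("10.4.2.0/24")  # static service IPs live here (step 1)

TEMPLATE = """\
http:
  routers:
    {name}:
      rule: "Host(`{name}.kavcorp.com`)"
      entryPoints: [websecure]
      service: {name}
      tls:
        certResolver: letsencrypt
  services:
    {name}:
      loadBalancer:
        servers:
          - url: "http://{ip}:{port}"
"""


def render_service_config(name: str, ip: str, port: int) -> str:
    """Return the Traefik dynamic config for one service, validating the inputs."""
    if not re.fullmatch(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?", name):
        raise ValueError(f"invalid service name: {name!r}")
    if ipaddress.ip_address(ip) not in LAN:
        raise ValueError(f"{ip} is not in {LAN}")
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    return TEMPLATE.format(name=name, ip=ip, port=port)


def write_service_config(name: str, ip: str, port: int, conf_dir: str) -> Path:
    """Write <conf_dir>/<name>.yaml and return its path."""
    path = Path(conf_dir) / f"{name}.yaml"
    path.write_text(render_service_config(name, ip, port))
    return path


if __name__ == "__main__":
    # Print instead of writing, so the sketch is safe to run anywhere.
    print(render_service_config("sonarr", "10.4.2.20", 8989))
```

On the Traefik LXC, `write_service_config(..., "/etc/traefik/conf.d")` would place the file where step 2 expects it. Validating the IP against 10.4.2.0/24 catches a service deployed outside the range named in step 1.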
### Troubleshooting Permission Issues
1. Check file ownership: `ls -la /path/to/file`
2. Check whether the mode is 777: `stat -c '%a %n' /path/to/file`
3. Fix permissions: `chmod -R 777 /path/to/directory`
4. For NZBGet: Verify `UMask=0000` in nzbget.conf
5. For Sonarr/Radarr: Check Settings → Media Management → Set Permissions
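
For large media trees, checking paths one by one is slow. This Python sketch walks a directory and lists every entry whose permission bits are not 777, so the `chmod` can be targeted. The function name `find_not_777` is a hypothetical helper, and the sketch assumes it runs on the host that mounts the media share.

```python
import os
import sys
from typing import Iterator, Tuple


def find_not_777(root: str) -> Iterator[Tuple[str, int]]:
    """Yield (path, mode) for each entry under `root` whose mode is not 777."""
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        candidates.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    for path in candidates:
        if os.path.islink(path):
            continue  # symlink permission bits are not meaningful on Linux
        mode = os.lstat(path).st_mode & 0o777
        if mode != 0o777:
            yield path, mode


if __name__ == "__main__":
    # Usage: python3 check_perms.py /mnt/pve/elantris-media
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, mode in find_not_777(target):
        print(f"{mode:03o} {path}")
```

Directories that cannot be read are silently skipped by `os.walk`, so an empty result on a path with access problems does not prove that everything is 777.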
### Node SSH Access
**From local machine**:
- User: `kavren`
- Key: `~/.ssh/id_ed25519`
**Between cluster nodes**:
- User: `root`
- Each node has other nodes' keys in `/root/.ssh/authorized_keys`
- Proxmox web UI uses node SSH for shell access
## Known Issues & Workarounds
### Jellyfin Not Seeing Media After Import
**Symptom**: Files imported to `/media/tv` but Jellyfin shows empty
**Cause**: Jellyfin LXC mount not active or permissions wrong
**Fix**:
1. Restart Jellyfin LXC: `pct stop 121 && pct start 121`
2. Verify mount inside LXC: `pct exec 121 -- ls -la /media/tv/`
3. Fix permissions if needed: `chmod -R 777 /mnt/pve/elantris-media/tv/`
### Sonarr/Radarr Import Failures
**Symptom**: "Access denied" errors in logs
**Cause**: Permission mismatch between download client and *arr service
**Fix**: Ensure download folder has 777 permissions
## Future Considerations
- [ ] Automated backup strategy
- [ ] Monitoring/alerting system (Prometheus + Grafana?)
- [ ] Consider Authelia for future services without built-in auth
- [ ] Document disaster recovery procedures
- [ ] Consider consolidating Docker hosts