Initial commit: KavCorp infrastructure documentation
- CLAUDE.md: Project configuration for Claude Code
- docs/: Infrastructure documentation
  - INFRASTRUCTURE.md: Service map, storage, network
  - CONFIGURATIONS.md: Service configs and credentials
  - CHANGELOG.md: Change history
  - DECISIONS.md: Architecture decisions
  - TASKS.md: Task tracking
- scripts/: Automation scripts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
163
docs/DECISIONS.md
Normal file
@@ -0,0 +1,163 @@
# Architecture Decisions & Patterns

> **Purpose**: Record of important decisions, patterns, and "why we do it this way"
> **Update Frequency**: When making significant architectural choices

## Service Organization

### Authentication Strategy

**Decision**: Services use their own built-in authentication, not Authelia
**Reason**: Most *arr services and media tools have robust auth systems
**Exception**: Consider Authelia for future services that lack authentication

### LXC vs Docker

**Keep in Docker**:
- NZBGet (requires specific volume mapping, works well in Docker)
- Multi-container stacks
- Services requiring Docker-specific features

**Migrate to LXC**:
- Single-purpose services (Sonarr, Radarr, etc.)
- Services benefiting from isolation
- Stateless applications

## File Permissions

### Media Files

**Standard**: All media files and folders must be 777
**Reason**:
- NFS mounts between multiple systems with different UID mappings
- Jellyfin runs in LXC with UID namespace mapping (100107)
- Sonarr runs in LXC with different UID mapping
- NZBGet runs in Docker with UID 1000
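
The UID mismatch behind this rule can be illustrated with a short sketch. It assumes the Proxmox default for unprivileged containers, where container UIDs are shifted by 100000 on the host. The function name and the offset constant are illustrative, and the container-side UID of 107 is inferred from the 100107 value above rather than stated in this document.

```python
# Assumption: unprivileged LXC containers use the Proxmox default idmap,
# which shifts container UIDs by 100000 on the host.
DEFAULT_IDMAP_OFFSET = 100000
DEFAULT_IDMAP_RANGE = 65536


def host_uid(container_uid: int, offset: int = DEFAULT_IDMAP_OFFSET) -> int:
    """Return the UID that a container process appears as on the host."""
    if not 0 <= container_uid < DEFAULT_IDMAP_RANGE:
        raise ValueError("container UID is outside the default 65536-wide map")
    return offset + container_uid


if __name__ == "__main__":
    # A process running as UID 107 inside the container shows up as 100107,
    # which matches neither Docker's UID 1000 nor another container's mapping.
    print(host_uid(107))
    print(host_uid(0))
```

Because each writer ends up with a different effective UID on the shared storage, ownership-based access cannot satisfy all of them at once, which is why the standard relies on the "other" permission bits instead.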

**Implementation**:
- NZBGet: `UMask=0000` so that no permission bits are masked out on newly created files and directories
- Sonarr: Media management → Set permissions → chmod 777
- Manual fixes: `chmod -R 777` on media directories as needed
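
Before running a recursive `chmod`, it can help to see which paths actually deviate from the standard. The following is a minimal sketch of such an audit. The function name is hypothetical, and the default path is the NFS mount point used elsewhere in this document.

```python
import sys
from pathlib import Path


def find_not_777(root: str) -> list[str]:
    """Return paths under root (root included) whose rwx bits are not 777."""
    base = Path(root)
    offenders = []
    for path in [base, *base.rglob("*")]:
        if path.is_symlink():
            # Symlink modes are not meaningful on Linux, so skip them.
            continue
        # Compare only the rwx bits; setuid/setgid/sticky bits are ignored.
        if path.lstat().st_mode & 0o777 != 0o777:
            offenders.append(str(path))
    return sorted(offenders)


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/mnt/pve/elantris-media"
    for offender in find_not_777(target):
        print(offender)
```

The script only reports. Fixing the listed paths is still done with `chmod` as described above.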

## Network Architecture

### Reverse Proxy

**Decision**: Single Traefik instance handles all external access
**Location**: LXC 104 on pm2
**Benefits**:
- Single point for SSL/TLS management
- Automatic Let's Encrypt certificate renewal
- Centralized routing configuration
- DNS-01 challenge for wildcard certificates

### Service Domains

**Pattern**: `<service>.kavcorp.com`
**DNS**: All subdomains point to public IP (99.74.188.161)
**Routing**: Traefik inspects Host header and routes internally
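
Since every subdomain resolves to the same public IP, the Host header is the only thing that distinguishes one service from another. A rough sketch of that lookup is shown here. It is not how Traefik is implemented, and the backend addresses in the table are made up for illustration.

```python
from typing import Optional

# Hypothetical routing table: hostnames follow the <service>.kavcorp.com
# pattern, but the backend IPs and ports are invented for this example.
ROUTES = {
    "jellyfin.kavcorp.com": "http://10.4.2.21:8096",
    "sonarr.kavcorp.com": "http://10.4.2.22:8989",
}


def pick_backend(host_header: str, routes: dict = ROUTES) -> Optional[str]:
    """Map a request's Host header to an internal backend URL, if any."""
    # Drop an optional ":port" suffix and normalise case before the lookup.
    host = host_header.split(":", 1)[0].strip().lower()
    return routes.get(host)


if __name__ == "__main__":
    print(pick_backend("Sonarr.kavcorp.com:443"))
    print(pick_backend("unknown.kavcorp.com"))
```

A request for a hostname with no matching entry has nowhere to go, which corresponds to a subdomain that has DNS but no router defined yet.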

## Storage Architecture

### Media Storage

**Decision**: NFS mount from elantris for all media
**Path**: `/mnt/pve/elantris-media` → elantris `/el-pool/media`
**Reason**:
- Centralized storage
- Accessible from all cluster nodes
- Large capacity (24TB ZFS pool)
- Easy to back up and snapshot

### LXC Root Filesystems

**Decision**: Store on KavNas NFS for most services
**Reason**:
- Easy backups
- Portable between nodes
- Network storage sufficient for most workloads

**Exception**: High I/O services use local-lvm

## Monitoring & Maintenance

### Configuration Management

**Decision**: Manual configuration with documentation
**Reason**: Small scale doesn't justify Ansible/Terraform complexity
**Trade-off**: Requires disciplined documentation updates

### Backup Strategy

**Decision**: Proxmox built-in backup to KavNas
**Frequency**: [To be determined]
**Retention**: [To be determined]

## Common Patterns

### Adding a New Service Behind Traefik

1. Deploy service with static IP in 10.4.2.0/24 range
2. Create Traefik config in `/etc/traefik/conf.d/<service>.yaml`
3. Use pattern:
   ```yaml
   http:
     routers:
       <service>:
         rule: "Host(`<service>.kavcorp.com`)"
         entryPoints: [websecure]
         service: <service>
         tls:
           certResolver: letsencrypt
     services:
       <service>:
         loadBalancer:
           servers:
             - url: "http://<ip>:<port>"
   ```
4. Traefik auto-reloads (no restart needed)
5. Update `docs/INFRASTRUCTURE.md` with service details
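
Steps 1 to 3 can be sketched as a small generator that fills in the pattern and rejects addresses outside the service range. This is only an illustration: the function name is hypothetical, and the example service values (`sonarr`, `10.4.2.22`, `8989`) are placeholders rather than the real deployment details.

```python
import ipaddress

# Service range from step 1.
SERVICE_SUBNET = ipaddress.ip_network("10.4.2.0/24")

# Same structure as the pattern in step 3.
TEMPLATE = """\
http:
  routers:
    {name}:
      rule: "Host(`{name}.kavcorp.com`)"
      entryPoints: [websecure]
      service: {name}
      tls:
        certResolver: letsencrypt
  services:
    {name}:
      loadBalancer:
        servers:
          - url: "http://{ip}:{port}"
"""


def render_traefik_config(name: str, ip: str, port: int) -> str:
    """Render the per-service Traefik file, validating the IP and port first."""
    if ipaddress.ip_address(ip) not in SERVICE_SUBNET:
        raise ValueError(f"{ip} is outside {SERVICE_SUBNET}")
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    return TEMPLATE.format(name=name, ip=ip, port=port)


if __name__ == "__main__":
    # Placeholder values; redirect the output to
    # /etc/traefik/conf.d/<service>.yaml on the Traefik host.
    print(render_traefik_config("sonarr", "10.4.2.22", 8989))
```

Generating the file from one template keeps router and service names consistent, which is the most common place for a hand-written config to drift.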

### Troubleshooting Permission Issues

1. Check file ownership: `ls -la /path/to/file`
2. Check whether the mode is 777: `stat -c '%a' /path/to/file`
3. Fix permissions: `chmod -R 777 /path/to/directory`
4. For NZBGet: Verify `UMask=0000` in nzbget.conf
5. For Sonarr/Radarr: Check Settings → Media Management → Set Permissions
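
Step 4 can be checked without opening the file by hand. The sketch below parses `Key=Value` lines from nzbget.conf-style text and reports whether the umask is fully open. The function names are hypothetical, and matching the option name case-insensitively is a convenience assumption.

```python
from typing import Optional


def read_umask(conf_text: str) -> Optional[str]:
    """Return the last UMask value found in the config text, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip().lower() == "umask":
            value = val.strip()
    return value


def umask_is_open(conf_text: str) -> bool:
    """True when UMask is set and parses as octal zero (e.g. 0000)."""
    value = read_umask(conf_text)
    if value is None:
        return False
    try:
        return int(value, 8) == 0
    except ValueError:
        return False


if __name__ == "__main__":
    sample = "# sample config\nMainDir=/downloads\nUMask=0000\n"
    print(read_umask(sample), umask_is_open(sample))
```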

### Node SSH Access

**From local machine**:
- User: `kavren`
- Key: `~/.ssh/id_ed25519`

**Between cluster nodes**:
- User: `root`
- Each node has other nodes' keys in `/root/.ssh/authorized_keys`
- Proxmox web UI uses node SSH for shell access

## Known Issues & Workarounds

### Jellyfin Not Seeing Media After Import

**Symptom**: Files imported to `/media/tv` but Jellyfin shows an empty library
**Cause**: Jellyfin LXC mount not active or permissions wrong
**Fix**:
1. Restart Jellyfin LXC: `pct stop 121 && pct start 121`
2. Verify mount inside LXC: `pct exec 121 -- ls -la /media/tv/`
3. Fix permissions if needed: `chmod -R 777 /mnt/pve/elantris-media/tv/`
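
The first two fix steps can be wrapped in a small helper for repeat use. This is a sketch only: the function names are hypothetical, it must run on the Proxmox node that hosts the container, and it defaults to a dry run that prints the commands without executing them.

```python
import shlex
import subprocess


def jellyfin_fix_commands(ctid: int = 121, media_path: str = "/media/tv/") -> list:
    """Build the restart and mount-check commands from the fix steps above."""
    return [
        ["pct", "stop", str(ctid)],
        ["pct", "start", str(ctid)],
        ["pct", "exec", str(ctid), "--", "ls", "-la", media_path],
    ]


def run(commands: list, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing command,
            # mirroring the `&&` in the manual steps.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(jellyfin_fix_commands())
```

The permission fix in step 3 is left out on purpose, since it is only needed when the mount check shows the files but Jellyfin still cannot read them.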

### Sonarr/Radarr Import Failures

**Symptom**: "Access denied" errors in logs
**Cause**: Permission mismatch between download client and *arr service
**Fix**: Ensure download folder has 777 permissions

## Future Considerations

- [ ] Automated backup strategy
- [ ] Monitoring/alerting system (Prometheus + Grafana?)
- [ ] Consider Authelia for future services without built-in auth
- [ ] Document disaster recovery procedures
- [ ] Consider consolidating Docker hosts