A question came up on Reddit about what a sysadmin should ask for at the handoff between Dev and Ops.
Below is my response, which I thought was good enough to continue here.
So, here's my mental checklist for deploying things:
1. A list of external interfaces for the project
- DB server
- TLS certs
- External FTP / NFS / data sources
- File paths (data files, cache files, etc.)
- Webserver (Frontend)
- Load balancer (Frontend)
- Web cache (Frontend)
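
For the TLS certs item, it helps to know exactly which certificate a service is presenting. Here is a minimal Python sketch, assuming the service speaks TLS directly on a host and port you supply. The function names `fingerprint_from_pem` and `served_fingerprint` are made up for illustration. Note that `ssl.get_server_certificate` fetches the certificate without validating it, which is fine for inspection but not for trust decisions.

```python
import hashlib
import ssl


def fingerprint_from_pem(pem: str) -> str:
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()


def served_fingerprint(host: str, port: int = 443) -> str:
    """Fetch the certificate a server presents and fingerprint it.

    Assumption: host/port are whatever endpoint you are taking over.
    The certificate is NOT validated here; this is inspection only.
    """
    pem = ssl.get_server_certificate((host, port))
    return fingerprint_from_pem(pem)
```

Write the fingerprint down at handoff. After a certificate is renewed, the value should change once the service has actually picked up the new file.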
2. A list of internal dependencies
- Configuration files & paths
- Java / Python / PHP versions and packages
- OS requirements (minimum disk space in specific places, etc.)
- Incoming ports
- Outgoing ports (Why do you have your own DNS resolver that bypasses our name server and dies because of the firewall?)
- Outgoing ports (How do I proxy this shit for logging / audit purposes?)
- Service accounts
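
The "minimum disk in places" requirement is easy to turn into something checkable. Here is a rough Python sketch, assuming the developers hand over a mapping of path to minimum free bytes. The function name `disk_shortfalls` and the paths and sizes in the usage comment are hypothetical.

```python
import shutil
from typing import Callable, Dict, List

GIB = 1024 ** 3


def _free_bytes(path: str) -> int:
    """Free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def disk_shortfalls(
    minimum_free: Dict[str, int],
    free_bytes: Callable[[str], int] = _free_bytes,
) -> List[str]:
    """Return one message per path whose free space is below the stated minimum.

    `free_bytes` is injectable so the check can be exercised without
    touching real filesystems.
    """
    problems = []
    for path, needed in minimum_free.items():
        available = free_bytes(path)
        if available < needed:
            problems.append(f"{path}: {available} bytes free, {needed} required")
    return problems


# Hypothetical usage; the paths and sizes are made up:
#   disk_shortfalls({"/var/lib/myapp": 20 * GIB, "/var/log/myapp": 5 * GIB})
```

An empty list means every stated minimum is met. Anything else is a conversation to have before go-live, not after.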
3. A list of bad habits
- Writable mounted paths in the application
- Things it both writes and executes
- Outgoing TCP / UDP
- IPsec configurations
- Does it demand R/W access to configuration files?
- Writing its own cron jobs
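
One of these habits can be spotted mechanically: files that are both writable and executable. Below is a small Python sketch, assuming the application lives under a single directory tree you can walk. The function name `writable_and_executable` is made up, and mode bits are only a rough proxy: whether the service account can actually write a file also depends on ownership, ACLs and mount options.

```python
import os
import stat
from typing import List

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def writable_and_executable(root: str) -> List[str]:
    """List regular files under `root` whose mode has a write bit AND an execute bit."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
            if not stat.S_ISREG(mode):
                continue
            if mode & WRITE_BITS and mode & EXEC_BITS:
                hits.append(path)
    return hits
```

Every hit is a question for the developers: does the application really need to modify something it also runs?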
4. Maintenance areas
- Does it rotate its own log files, or does it need a logrotate configuration?
- systemd service file (or init file)
- Which cron jobs should be set up, and how
- Can cache files be cleaned out or not?
- Incoming firewall rules (persistent)
- Outgoing firewall rules (persistent)
5. Management areas
- Does it obey SIGHUP?
- Does it reload TLS certs? (Test this!)
- How do you monitor that it runs?
- How do you monitor that it works?
- When are the downtimes scheduled?
- If there's no planned downtime, make sure there is unplanned downtime every Tuesday at 9 am.
- How does it upgrade?
- Request an older version and a current version, demand to test an upgrade.
- How does it downgrade?
- How do you back it up?
- How do you restore it again?
- Test upgrade & downgrade again (DB migrations, etc.)
- Test restoring a backup
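
The difference between "it runs" and "it works" is worth making concrete, because they are two separate checks. Here is a minimal Python sketch, assuming a plain-HTTP service. The host, port, path and expected marker text are whatever the developers tell you; `process_runs` and `service_works` are names I picked for illustration.

```python
import http.client
import socket


def process_runs(host: str, port: int, timeout: float = 2.0) -> bool:
    """'Runs': something accepts TCP connections on the service port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_works(host: str, port: int, path: str, expected: str,
                  timeout: float = 5.0) -> bool:
    """'Works': a real request returns 200 and the body contains `expected`."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        body = resp.read().decode("utf-8", errors="replace")
        return resp.status == 200 and expected in body
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

A service that answers on its port but returns errors "runs" without "working". The second check only means something if the path it requests exercises the real dependencies, such as the database.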
6. Pre-flight checklist
- Firewall everything, see that "monitor that it works" fails.
- Mount all filesystems read-only. See that "monitor that it works" fails, make sure "monitor that it runs" succeeds.
- Iterate over all external dependencies, block them, make sure that "monitor that it works" fails, make sure "monitor that it runs" succeeds.
- Kill its DNS, test whether it works or fails. Does monitoring notice?
- Fill its drives, test whether it fails properly or not.
- Cold reboot the server, repeatedly, ensure it comes back without issues.
- How do you monitor its performance? (HTTP logs with latencies? Perf counters? Others?)
- Load test: how do you do that?
- Passive artificial load on the system for the first week in production, then remove the load tool.
- How do you scale this in the future?
- I/O / CPU / network bound?
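
The first three fault tests in this checklist each state what the two monitors should report, so the results can be compared against a small table. This Python sketch makes some assumptions: the fault labels are hypothetical names of my own, and the observed values are whatever your "runs" and "works" checks returned while the fault was active. Where the checklist states no expectation, the table uses `None`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Expectation:
    runs: Optional[bool]   # None means the checklist states no expectation
    works: Optional[bool]


# Expected monitor results while each fault is active.
# The keys are illustrative labels, not names from any tool.
EXPECTED: Dict[str, Expectation] = {
    "firewall_everything": Expectation(runs=None, works=False),
    "filesystems_read_only": Expectation(runs=True, works=False),
    "external_dependency_blocked": Expectation(runs=True, works=False),
}


def preflight_failures(observed: Dict[str, Tuple[bool, bool]]) -> List[str]:
    """Compare observed (runs, works) results per fault against EXPECTED."""
    failures = []
    for fault, expected in EXPECTED.items():
        if fault not in observed:
            failures.append(f"{fault}: not tested")
            continue
        runs, works = observed[fault]
        if expected.runs is not None and runs != expected.runs:
            failures.append(
                f"{fault}: 'runs' reported {runs}, expected {expected.runs}")
        if expected.works is not None and works != expected.works:
            failures.append(
                f"{fault}: 'works' reported {works}, expected {expected.works}")
    return failures
```

The usual surprise is a "works" monitor that stays green while everything is firewalled off, which means the monitor is not testing what you think it is.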