Prelude

A question came up on Reddit about what a sysadmin should ask for at the handoff between Dev and Ops.

Below is my response, which I thought was good enough to expand on here.

So, here's my mental checklist for deploying things:

1. A list of external interfaces for the project

  • DB server
  • TLS certs
  • Redis/memcached/etc
  • External FTP / NFS / data sources
  • File paths (data files, cache files, etc.)
  • Webserver (Frontend)
  • Loadbalancer (Frontend)
  • Web cache (Frontend)

2. A list of internal dependencies

  • Configuration files & paths
  • Java / Python / PHP versions and packages
  • OS requirements (minimum disk space in specific places, etc.)
  • Incoming ports
  • Outgoing ports (Why do you have your own DNS resolver that bypasses our name server and dies because of the firewall?)
  • Outgoing ports (How do I proxy this shit for logging / audit purposes?)
  • Service accounts

3. A list of bad habits

  • Write-mounted paths in the application
  • Writing / executing things
  • Outgoing TCP / UDP
  • IPsec configurations
  • Does it demand R/W access to configuration files?
  • Writing its own cron jobs
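
The R/W question is one of the few here you can check mechanically. A minimal sketch, assuming a POSIX system: `writable_configs` and its arguments are names I made up for illustration, and it only looks at classic permission bits (no ACLs, no SELinux, no supplementary groups). It is deliberately conservative and flags anything that looks writable.

```python
import os
import stat
from typing import List


def writable_configs(config_dir: str, service_uid: int, service_gid: int) -> List[str]:
    """Return config files the service account could write to, judged by mode bits only."""
    findings = []
    for root, _dirs, files in os.walk(config_dir):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            # Owner write bit only matters if the service account owns the file.
            owner_write = st.st_uid == service_uid and bool(st.st_mode & stat.S_IWUSR)
            # Group write bit only matters if the file belongs to the service's primary group.
            group_write = st.st_gid == service_gid and bool(st.st_mode & stat.S_IWGRP)
            # World-writable files are always flagged.
            other_write = bool(st.st_mode & stat.S_IWOTH)
            if owner_write or group_write or other_write:
                findings.append(path)
    return sorted(findings)
```

An empty list is what you want to see. Anything else is worth a conversation with the developers before go-live.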

4. Maintenance areas

  • Does it rotate its own log files, or does it need a logrotate configuration?
  • systemd service file (or init script)
  • Which cron jobs should be set up, and how?
  • Can cache files be cleaned out or not?
  • Incoming Firewall rules (persistent)
  • Outgoing Firewall rules (persistent)

5. Management areas

  • Does it obey SIGHUP?
  • Does it reload TLS certs? (Test this!)
  • How do you monitor that it runs?
  • How do you monitor that it works?
  • When are the downtimes scheduled?
  • If there's no planned downtime, make sure there is unplanned downtime every Tuesday at 9 am.
  • How does it upgrade?
  • Request an older version and a current version, demand to test an upgrade.
  • How does it downgrade?

6. Backups

  • How do you back it up?
  • How do you restore it again?
  • Test upgrade & downgrade again (DB migrations, etc.)
  • Test restoring the backup
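
For file-based data, a restore test can be as simple as comparing checksums of the original tree against the restored one. A minimal sketch under that assumption: `tree_digests` and `restore_differences` are names I picked for illustration, and this says nothing about databases, permissions or ownership, which need their own checks.

```python
import hashlib
import os
from typing import Dict, List


def tree_digests(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(full, "rb") as fh:
                # Read in 1 MiB chunks so large files do not end up in memory at once.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(full, root)] = h.hexdigest()
    return digests


def restore_differences(original: str, restored: str) -> List[str]:
    """List files that are missing, unexpected, or changed in the restored tree."""
    before, after = tree_digests(original), tree_digests(restored)
    problems = []
    for path in sorted(set(before) | set(after)):
        if path not in after:
            problems.append(f"missing: {path}")
        elif path not in before:
            problems.append(f"unexpected: {path}")
        elif before[path] != after[path]:
            problems.append(f"changed: {path}")
    return problems
```

An empty result means the restored files match byte for byte. Anything else is a restore you do not want to discover during a real incident.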

7. Pre-flight checklist

  • Firewall everything, see that "monitor that it works" fails.
  • Mount all filesystems read-only. See that "monitor that it works" fails, make sure "monitor that it runs" succeeds.
  • Iterate over all external dependencies, block them, make sure that "monitor that it works" fails, make sure "monitor that it runs" succeeds.
  • Kill its DNS, test whether it works or fails. Does monitoring notice?
  • Fill its drives, test whether it fails properly or not.
  • Cold reboot server, repeatedly, ensure it comes back without issues
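
The expectations above can be written down as a small table, so each failure test has a recorded pass/fail instead of a gut feeling. This is only a sketch: the `Scenario` class and `verdict` function are invented for illustration, breaking things (firewalling, remounting) is left to you, and `None` marks the cases where the list above does not say what should happen.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Scenario:
    name: str
    expect_runs: Optional[bool]   # None: no stated expectation, just observe
    expect_works: Optional[bool]  # None: no stated expectation, just observe


SCENARIOS = [
    Scenario("firewall everything", expect_runs=None, expect_works=False),
    Scenario("filesystems read-only", expect_runs=True, expect_works=False),
    Scenario("external dependency blocked", expect_runs=True, expect_works=False),
    Scenario("DNS killed", expect_runs=None, expect_works=None),
]


def verdict(scenario: Scenario, runs: bool, works: bool) -> List[str]:
    """Compare what the two monitors reported against what the scenario expects."""
    problems = []
    checks = (("runs", scenario.expect_runs, runs), ("works", scenario.expect_works, works))
    for label, expected, observed in checks:
        if expected is not None and expected != observed:
            problems.append(
                f"{scenario.name}: 'monitor that it {label}' reported {observed}, expected {expected}"
            )
    return problems
```

For each scenario, break the thing, read both monitors, and feed the two booleans to `verdict`. A monitor that still says "works" with every dependency blocked is a monitor that is not checking anything useful.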

8. Extras

  • How do you monitor its performance? (HTTP logs with latencies? Perf counters? Others?)
  • Load test: how do you do that?
  • Passive artificial load on the system for the first week in production, then remove the load tool.
  • How do you scale this in the future?
  • Is it IO / CPU / network bound?
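
If the answer to the performance question is "HTTP logs with latencies", getting useful numbers out of them takes very little code. A minimal sketch, assuming each log line ends with the request time in seconds as its last whitespace-separated field. That log layout is my assumption and so are the function names, so adjust the parsing to whatever your web server actually writes.

```python
import math
from typing import Iterable, List


def parse_latencies(lines: Iterable[str]) -> List[float]:
    """Extract the trailing request-time field from each log line, skipping bad lines."""
    latencies = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            latencies.append(float(fields[-1]))
        except ValueError:
            # Last field is not a number, so this is not a line we understand.
            continue
    return latencies


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Look at the 95th and 99th percentiles before and after the load test, not just the average. Averages hide the slow requests that users actually complain about.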