So, for my day job I prepared some lectures and workshops. At least one of them should be useful to a wider audience, so I'm reproducing parts of it here.

This post will be edited for formatting at some point later.

In our business, we work with _many_ servers that aren't really like each other. Most of them get little in the way of administration, which means that debugging turns into a day-to-day occurrence.

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

– Brian W. Kernighan

Basics: The environment

Free disk space
\$ df -h

Free inodes
\$ df -i
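
With many machines to look at, both checks can be scripted. What follows is a minimal sketch, assuming the POSIX output format of `df -P` (usage percentage in column 5, mount point in column 6) and mount points without spaces. The function name `over_threshold` and the 90% limit are made up for illustration.

```shell
# Print mount points whose usage is at or above a limit (default 90%).
# Reads `df -P` or `df -Pi` output on stdin. Assumes column 5 holds the
# usage percentage and column 6 the mount point.
over_threshold() {
    limit="${1:-90}"
    awk -v limit="$limit" 'NR > 1 {
        pct = $5
        sub(/%$/, "", pct)
        # Skip rows where the percentage is not a number (for example "-").
        if (pct ~ /^[0-9]+$/ && pct + 0 >= limit) print $6 " " $5
    }'
}

# Usage:
#   df -P  | over_threshold 90    # disk space
#   df -Pi | over_threshold 90    # inodes
```

Reading from stdin keeps the function easy to try against saved `df` output from another machine.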

CPU usage
\$ top

  • Is something using a lot of CPU?
  • Is a process wedged?
  • Does the list look like what you expect?
  • Load average? Sustained load above the number of CPU cores (over 1.0 on a single-core machine) is usually trouble.
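
On a multi-core machine the load average is easier to judge relative to the number of cores. Here is a small sketch of that comparison; the function name `load_status` is hypothetical, and the usage line assumes Linux (`/proc/loadavg`) and `nproc` from GNU coreutils.

```shell
# Compare a load average with the number of CPU cores.
# Arguments: load average, core count. Prints "ok" or "overloaded".
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 > cores + 0) print "overloaded"; else print "ok"
    }'
}

# Usage on Linux (1-minute load average):
#   load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```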

Memory usage
\$ top

Check if things look bogus there.
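
For a quick number instead of eyeballing `top`, the kernel's own figures can be read directly. This sketch assumes Linux with a `MemAvailable` line in `/proc/meminfo` (kernel 3.14 or newer). The function name `mem_available_pct` is made up for illustration.

```shell
# Print the percentage of memory the kernel considers available.
# Reads /proc/meminfo formatted text on stdin.
mem_available_pct() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Usage:
#   mem_available_pct < /proc/meminfo
```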

System logs
\$ less /var/log/messages

Look out for errors and repeated messages.
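
Repeated messages are easier to spot when counted. This is a rough sketch, assuming the traditional syslog line layout (month, day, time, host, then the message); the function name `top_repeats` is hypothetical.

```shell
# Print the most frequent log messages, ignoring timestamp, host and PID.
# Reads syslog lines on stdin; the optional argument is how many to show.
top_repeats() {
    awk '{ $1 = $2 = $3 = $4 = ""; sub(/^ +/, ""); print }' |
        sed 's/\[[0-9]*\]//' |
        sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage:
#   top_repeats 20 < /var/log/messages
```

Stripping the PID makes the same message from a restarted daemon count as one message.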

Network connectivity

Some basics:
If you can SSH into the machine, you only need to check name resolution and reachability from the outside/inside networks.

\$ host
=> network is reachable
=> DNS works
If you can't resolve, try internal networks.

Local name resolution
\$ hostname
\$ host \$(hostname)
Not able to resolve your own hostname? Bad news, my friend. Check /etc/hosts, hostname should point at
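
Whether the hostname is listed in `/etc/hosts` at all can be checked without reading the file by hand. A minimal sketch follows; the function name `hosts_has_name` is made up, and it only checks that the name is present, not which address it maps to.

```shell
# Succeed if the given name appears as a hostname or alias in
# hosts-file formatted input. Comments and blank lines are ignored.
hosts_has_name() {
    awk -v name="$1" '
        { sub(/#.*/, "") }
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit (found ? 0 : 1) }'
}

# Usage:
#   hosts_has_name "$(hostname)" < /etc/hosts || echo "hostname not in /etc/hosts"
```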

\$ ping
If you get replies, routing works. Check for speed patterns and slow jumps.

\$ route -n
UG 0 0 0 eth0
Do you have a routing table? The default route goes via a gateway (flags UG in the output above); if you don't have a default route, you have issues.
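
The default-route check can be sketched as a filter over `route -n` output. It assumes the net-tools column order (Destination, Gateway, Genmask, Flags, Metric, Ref, Use, Iface), where the default route has destination `0.0.0.0`. The function name `default_route` is hypothetical.

```shell
# Print gateway and interface of the default route and succeed,
# or fail silently if the routing table has no default route.
# Reads `route -n` output on stdin.
default_route() {
    awk '$1 == "0.0.0.0" && $4 ~ /G/ { print $2 " dev " $8; found = 1 }
         END { exit (found ? 0 : 1) }'
}

# Usage:
#   route -n | default_route || echo "no default route"
```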

Can it be reached over the network?
\$ curl -v http://my-hostname-internally/
If not, check the server firewall:
\$ iptables -v -L

For debugging, make sure that the INPUT policy is ACCEPT.
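
The INPUT policy sits in the chain header of the `iptables -L` listing, so it can be pulled out with a one-line filter. This is a sketch under that assumption; the function name `input_policy` is made up for illustration.

```shell
# Print the policy of the INPUT chain (for example ACCEPT or DROP).
# Reads `iptables -L` or `iptables -v -L` output on stdin.
input_policy() {
    awk '$1 == "Chain" && $2 == "INPUT" && $3 == "(policy" {
        policy = $4
        sub(/\)$/, "", policy)   # non-verbose output ends in "ACCEPT)"
        print policy
    }'
}

# Usage (needs root):
#   iptables -v -L | input_policy
```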

For further debugging of the various services you may need, check the result of
\$ ps aux |grep
R and S states are good (running, sleeping); Z, D and T are bad (zombie, waiting for IO, stopped).
Compare the running user with the expected user. Also check the timestamp for when it was started.
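
To list only the processes in the bad states, the state column can be filtered directly. A small sketch, assuming the state is the first column of the input; the function name `bad_states` is hypothetical.

```shell
# Print processes whose state starts with Z, D or T.
# Expects the state code in the first column of stdin.
bad_states() {
    awk '$1 ~ /^[ZDT]/ { print }'
}

# Usage (the trailing "=" suppresses each column header):
#   ps -e -o stat= -o pid= -o user= -o comm= | bad_states
```

A process that shows up in D only briefly is usually just doing IO. One that stays there is worth a closer look.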

Dig into the application you're looking at: config files, logs, and check ports and connectivity with `curl`.

Explore further using:
\$ lsof -nP -p \$PID
\$ strace -ttTf -s0 -e open,connect,send,recv -p \$PID
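
With `-T`, strace appends the time spent in each call as `<seconds>` at the end of the line, which makes slow calls easy to filter out. A minimal sketch of such a filter; the function name `slow_calls` and the 0.1 second default are assumptions for illustration.

```shell
# Print strace lines whose call took at least the given number of
# seconds (default 0.1). Expects output produced with the -T option.
slow_calls() {
    awk -v min="${1:-0.1}" '
        match($0, /<[0-9.]+>$/) {
            t = substr($0, RSTART + 1, RLENGTH - 2)
            if (t + 0 >= min + 0) print
        }'
}

# Usage (strace writes its trace to stderr):
#   strace -ttTf -s0 -e open,connect,send,recv -p "$PID" 2>&1 | slow_calls 0.1
```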