kale @ 2009-02-20 10:08:00
The 10 Golden Rules for Troubleshooting Linux
I'm doing a brain vomit, and you're the lucky recipient of my geek bile!
1. Man pages exist and should be used. Seriously, everything's there, from application docs to syscall docs to syntax and formatting of log files.
2. Don't reinvent the wheel. Somebody's already gone through 99% of the problems you're experiencing or ever will experience, and figured them out. Google is your friend.
3. If you don't know what something's doing, or why it's not working, strace it! Nobody ever uses strace, yet I find it invaluable. Especially for Apache issues. Is your site or PHP code or whatever not working? Try this:
3a. Make sure your timeout is set to something long enough for manual human interaction, and fire up two terminal windows.
3b. SSH to server on one and gain root access.
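3c. In the other window, telnet to your server on port 80. Make a GET request for the page causing you issues, such as `GET /page.php HTTP/1.1', and hit enter once. Don't finish the request yet; the idea is to leave the connection open while you find the process that's handling it.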
3d. Switch to SSH session. Do `netstat -plant |grep your.ip.add.ress' and find the ESTABLISHED one with an apache process attached to it.
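3e. Run `strace -p (pid of apache process from above) -f -v -s 256'. -f follows any children the process forks, and -s 256 prints up to 256 bytes of each string instead of the default 32.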
3f. Back in telnet session, type "Host: domain.com" and hit enter twice.
3g. Switch back to SSH session and watch the syscalls go! The answer is held within. Always.
(Note: you can also launch Apache in debug mode with `httpd -X', but this requires taking down the service. Debug mode runs a single process that doesn't fork child processes, effectively MaxClients 1, so you don't have to switch back and forth to find the pid of the child you're connected to. That makes it easier to strace, but it's not feasible on a live server.)
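Steps 3d and 3e can be glued together with a bit of awk. This is only a rough sketch: it assumes the Linux net-tools layout of `netstat -plant' output (last column looks like `1234/httpd'), the function name apache_pid_for is made up for illustration, and 203.0.113.7 stands in for your own IP.

```shell
# Hypothetical helper: read `netstat -plant` output on stdin and print the PID
# of the first ESTABLISHED connection from the given client IP that belongs to
# an httpd/apache process.
apache_pid_for() {
    awk -v ip="$1" '
        $6 == "ESTABLISHED" && $7 ~ /^[0-9]+\/(httpd|apache)/ {
            addr = $5
            sub(/:[0-9]+$/, "", addr)    # drop the client port
            sub(/^::ffff:/, "", addr)    # drop the IPv4-mapped prefix, if any
            if (addr == ip) {
                split($7, owner, "/")    # $7 looks like "1234/httpd"
                print owner[1]
                exit
            }
        }'
}

# Usage, as root, after typing the GET line in the telnet window:
#   pid=$(netstat -plant | apache_pid_for 203.0.113.7)
#   strace -p "$pid" -f -v -s 256
```

If nothing comes back, check the process name first; the pattern only knows about names starting with httpd or apache.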
4. Logs exist for a reason. Read them.
5. Applications crash, servers don't. If your server crashes, it's either bad hardware or a kernel bug (fairly rare on popular distros).
6. Always make backups. Always.
7. Always mount NFS shares with the 'intr' option. Having to reboot because of a network blip is uncool. (Caveat: kernels newer than 2.6.25 accept 'intr' but ignore it; there, only SIGKILL can interrupt a stuck NFS operation.) (Humorous aside: Macs mount Bonjour-introduced shares via AFP, which appears to have all the awesome negatives of NFS. If the mount goes away (other server going down or whatever), Finder will freak the fuck out and your programs will start having bizarre issues. My Finder was hung, and attempting to restart it failed. Then iTunes got stuck in a loop. Then Quicksilver crashed. The only thing I could do, literally, was run `shutdown -r now' in the Terminal window I had open. Lesson learned: unmount the share when doing software updates on the other server.)
8. Learn to use `grep', `sed' and `awk'. Learning to manipulate text is surprisingly important for a text-based interface.
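To give a taste of what the three do together, here's a small sketch that pulls the most-requested missing pages out of an Apache access log. It assumes the "combined" log format (request path in field 7, status code in field 9); the function name top_404s and the log path in the usage line are made up for illustration.

```shell
# Hypothetical helper: read Apache "combined"-format access log lines on stdin
# and print the paths that returned 404, most requested first.
top_404s() {
    grep -F '" 404 ' |                    # grep: cheap prefilter on the status
        awk '$9 == 404 { print $7 }' |    # awk: confirm the field, keep the path
        sed 's/?.*$//' |                  # sed: strip query strings
        sort | uniq -c | sort -rn |       # count identical paths, busiest first
        awk '{ print $1, $2 }'            # tidy up the padding uniq -c adds
}

# Usage (adjust the path for your distro):
#   top_404s < /var/log/httpd/access_log | head
```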
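9. Load average does not mean CPU usage: on Linux it also counts processes stuck waiting on I/O. 100% memory usage does not mean there's nothing left for new applications: the kernel fills otherwise idle RAM with cache and hands it back when something needs it. You can run out of inodes before you run out of disk space.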
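Here's a rough sketch of how to check each of those three. The function names are made up for illustration, the 90% threshold is arbitrary, and the memory helper assumes Linux /proc/meminfo field names (MemAvailable only exists on newer kernels, so it falls back to adding up MemFree, Buffers and Cached as an approximation).

```shell
# Load: compare the load average to the CPU count instead of reading it as a
# percentage. usage: load_per_cpu <1-min loadavg> <cpu count>
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Memory: read /proc/meminfo-format text on stdin and print the kB that new
# applications could still get, cache included.
mem_available_kb() {
    awk '$1 == "MemAvailable:" { avail = $2; have = 1 }
         $1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { print (have ? avail : sum) }'
}

# Inodes: read `df -Pi` output on stdin and print mount points whose inode
# usage is at or above the given percentage (default 90).
full_inodes() {
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct ~ /^[0-9]+$/ && pct + 0 >= limit) print $6, $5
    }'
}

# Usage:
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(getconf _NPROCESSORS_ONLN)"
#   mem_available_kb < /proc/meminfo
#   df -Pi | full_inodes 90
```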
10. TCP wrappers suck. If you've been hacking at an issue for over 3 hours, look to your TCP wrappers. /etc/hosts, /etc/hosts.allow and /etc/hosts.deny will hold the answer.
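When you do end up staring at hosts.allow and hosts.deny, a quick filter saves some squinting. This is only a sketch under simple assumptions: one rule per line in the usual `daemon_list : client_list' form, no backslash continuations, and a made-up function name (wrapper_rules_for).

```shell
# Hypothetical helper: read a hosts.allow or hosts.deny file on stdin and print
# the rules whose daemon list names the given daemon or ALL.
wrapper_rules_for() {
    sed -e '/^#/d' -e '/^[[:space:]]*$/d' |    # drop comments and blank lines
        awk -F: -v d="$1" '{
            n = split($1, daemons, /[ ,\t]+/)  # daemon list is before the first ":"
            for (i = 1; i <= n; i++)
                if (daemons[i] == d || daemons[i] == "ALL") { print; break }
        }'
}

# Usage:
#   wrapper_rules_for sshd < /etc/hosts.deny
#   wrapper_rules_for sshd < /etc/hosts.allow
```

Keep in mind the wrappers only apply to services that were built against them, so an empty result doesn't clear every suspect.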