Fubra Blog

Pro-active Monitoring of a Network

Posted 6:26 PM Thursday February 7, 2008 by Paul Maunders

The Fubra Network has been growing steadily over the past few years, and we now manage over 100 physical servers across 3 geographically diverse sites.

In the past we operated a fairly reactive strategy for fixing server problems. For example, if we noticed, or someone told us, that a site was running slowly, we would look into it. If a site went down, we would fix it. Of course, in the long run, this isn't a great way to look after your network. As the sayings go, "a stitch in time saves nine" and "an ounce of prevention is worth a pound of cure". The same is true of server hosting.

So over the last year, we have begun to implement a much improved approach to network maintenance: proactively monitoring all our server resources and identifying potential problems before they occur.

In this blog post, we describe a real-world example of this kind of problem spotting that happened to us today.

Zabbix

We now use Zabbix as our main monitoring system. Zabbix monitors indicators such as disk space, memory available, CPU usage, load average and network usage across all our physical machines. Today we spotted a Zabbix graph that looked a little extraordinary.



As you can see from the graph, there was a spike in activity every hour. Look closer and you can see that it happened at 28 minutes past the hour.

From that, most server admins will think: cron job! So did we, so we had a look at the server. It houses several vServers, and on it we run mirror.fubra.com, a mirror service we provide to the open source community.
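The "same minute every hour" pattern can also be checked numerically rather than by eye. The following is a minimal sketch, assuming load readings are available as (timestamp, load) pairs; the function name `busiest_minute` and the synthetic data are illustrative only and are not taken from Zabbix or from our real graph.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def busiest_minute(samples):
    """Return (minute_of_hour, average_load) for the minute with the highest average load.

    samples: iterable of (datetime, float) pairs, e.g. one load reading per minute.
    """
    by_minute = defaultdict(list)
    for when, load in samples:
        # Group readings by minute-of-hour so hourly patterns line up.
        by_minute[when.minute].append(load)
    averages = {minute: mean(loads) for minute, loads in by_minute.items()}
    minute = max(averages, key=averages.get)
    return minute, averages[minute]


# Synthetic data (an assumption, not the real graph): a day of per-minute
# readings, idle at 0.2 except for a burst to 4.0 at 28 past each hour.
start = datetime(2008, 2, 7, 0, 0)
samples = []
for i in range(24 * 60):
    when = start + timedelta(minutes=i)
    samples.append((when, 4.0 if when.minute == 28 else 0.2))

minute, avg = busiest_minute(samples)
print(f"load peaks at :{minute:02d} past the hour (average {avg:.1f})")
```

A peak that sits on one fixed minute of the hour is a strong hint that a scheduled job is responsible, which is where the crontab comes in.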

One of the sites we mirror is uk3.php.net and, as it is a PHP mirror, webalizer is installed on it so that the traffic can be tracked. When we set the mirror up, it was done in a bit of a hurry and we forgot something. Paul describes his chain of thought: "I was thinking to myself, how the @£@$ is that webalizer process causing a constant load avg of 4 for an hour? There are only 6000 visitors per month so it should only take a few seconds!"

Mark checked the log files and it turned out they were not being rotated, so we had a 2.7GB log file that webalizer was processing each and every hour over an ATAoE storage network that we have set up.

This meant that one problem was causing network load, file server load and web server load, all because of a simple error: forgetting to rotate a log. As it happens, this didn't bring anything to a halt, and everything worked just fine both before and after we fixed the problem.
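A quick way to catch unrotated logs like this one is to scan for unusually large log files. The following is a rough sketch rather than the check we actually ran: the `oversized_logs` helper, the `.log` suffix filter, the size threshold and the demo file names are all assumptions made for illustration.

```python
import os
import tempfile


def oversized_logs(root, limit_bytes):
    """Yield (path, size) for every *.log file under root larger than limit_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".log"):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size > limit_bytes:
                yield path, size


# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "access.log"), "wb") as f:
        f.write(b"x" * 2048)
    with open(os.path.join(root, "error.log"), "wb") as f:
        f.write(b"x" * 10)
    found = sorted(oversized_logs(root, limit_bytes=1024))
    for path, size in found:
        print(f"{os.path.basename(path)}: {size} bytes")
```

On a real server the root would be the log directory and the limit something far larger. A file in the gigabyte range that is re-read every hour, as ours was, would stand out immediately.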

To fix things, Mark simply installed logrotate, and the server is now running a lot better. The result is that we can squeeze more value out of our existing hardware.

Becoming more proactive in monitoring all our services is a luxury we have been working towards. Until recently we were fighting just to get things up and working, but now that we have more server admins (and are looking for even more talented ones), we are starting to get really proactive in this area.