Pro-active Monitoring of a Network
Posted 6:26 PM Thursday February 7, 2008 by Paul Maunders
The Fubra Network has been growing steadily over the past few years, and we now manage over 100 physical servers across 3 geographically diverse sites.
In the past we operated a fairly reactive strategy for fixing server problems. For example, if we noticed, or someone told us, that a site was running slowly, we would look into it. If a site went down, we would fix it. Of course, in the long run, this isn't a great way to look after your network. As the sayings go, "a stitch in time saves nine" and "an ounce of prevention is worth a pound of cure". The same is true of server hosting.
So over the last year, we have begun to implement a much improved approach to network maintenance that involves proactively monitoring all our server resources and identifying potential problems before they occur.
In this blog post, we will describe a real-world example of this form of problem spotting that happened to us today.
Zabbix
We now use Zabbix as our main monitoring system. Zabbix monitors indicators such as disk space, memory available, CPU usage, load average and network usage across all our physical machines. Today we spotted a Zabbix graph that looked a little extraordinary.

As you can see from the graph there was a spike in activity on an hourly basis. Look closer and you can see this was happening at 28 minutes past the hour.
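Spotting the minute by eye works, but the same check can be scripted. The following is a minimal sketch, not something we run in production: it assumes you have already exported load samples as (timestamp, load) pairs, and the function names and the made-up sample data are purely illustrative. It does not talk to Zabbix at all.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def busiest_minute(samples):
    """Return (minute_of_hour, average_load) for the minute with the highest average load.

    `samples` is an iterable of (datetime, float) pairs.
    """
    by_minute = defaultdict(list)
    for timestamp, load in samples:
        # Group every sample by its minute past the hour (0-59).
        by_minute[timestamp.minute].append(load)
    averages = {minute: mean(loads) for minute, loads in by_minute.items()}
    minute = max(averages, key=averages.get)
    return minute, averages[minute]


def synthetic_samples(hours=6, spike_minute=28):
    """Build made-up one-minute samples with a spike at `spike_minute` each hour."""
    start = datetime(2008, 2, 7, 0, 0)
    samples = []
    for i in range(hours * 60):
        timestamp = start + timedelta(minutes=i)
        load = 4.0 if timestamp.minute == spike_minute else 0.5
        samples.append((timestamp, load))
    return samples


if __name__ == "__main__":
    minute, load = busiest_minute(synthetic_samples())
    print(f"Highest average load at minute {minute:02d} past the hour ({load:.1f})")
```

A load that always peaks at the same minute past the hour is a strong hint that a scheduled job is responsible.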
From that, most server admins will think: cron job! So did we. We had a look at the server, which houses several vServers, and on this server we run mirror.fubra.com, a mirror service we provide to the open source community.
One of the sites we mirror is uk3.php.net and, as it is a PHP mirror, webalizer is installed on it so that the traffic can be tracked. When we set the mirror up, we did it in a bit of a hurry and we forgot something. Paul describes his chain of thought: "I was thinking to myself, how the @£@$ is that webalizer process causing a constant load avg of 4 for an hour? There are only 6000 visitors per month so it should only take a few seconds!"
Mark checked the log files and it turned out they were not being rotated, so we had a 2.7GB log file that webalizer was processing every hour over an ATAoE storage network that we have set up.
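A forgotten log like this is easy to look for. The sketch below is one hypothetical way to do it, assuming log files end in `.log` and live under a single directory tree; the function name, the directory and the size limit are all illustrative choices rather than anything from our actual setup.

```python
import os
import tempfile


def oversized_logs(root, limit_bytes):
    """Yield (path, size) for files ending in .log under `root` larger than `limit_bytes`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".log"):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if size > limit_bytes:
                yield path, size


if __name__ == "__main__":
    # Demonstrate on a throwaway directory with one large and one small file.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "access.log"), "wb") as handle:
            handle.write(b"x" * 2048)
        with open(os.path.join(root, "error.log"), "wb") as handle:
            handle.write(b"x" * 10)
        for path, size in oversized_logs(root, limit_bytes=1024):
            print(f"{size} bytes: {path}")
```

Run regularly against the real log directory with a sensible limit, a check like this would flag a log that nobody is rotating long before it reaches several gigabytes.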
This meant that one simple error, forgetting to rotate a log, was causing network load, fileserver load and web server load. As it happens, this didn't bring anything to a halt, and everything worked just fine both before and after we fixed the problem.
To fix things, Mark simply installed logrotate, and now it is working a lot better. The result is that we can squeeze more value out of our existing hardware.
Becoming more proactive in the monitoring of all our services is a luxury we have been working towards. Until recently we were fighting just to get things up and working, but now that we have some more server admins (and we are looking for even more talented ones) we are starting to get really proactive in this area.
