In every developer’s life, there is one must-have in any architecture: a monitoring system. Or maybe two, if you count the headset in an open-plan office!
I have to warn you: everything in this blog post is totally subjective and reflects only my point of view… Don’t worry if you don’t agree; isn’t that the basis for a debate?
IMPORTANT QUESTIONS TO ASK YOURSELF BEFORE STARTING
When you have to choose a network and system monitoring software, there are important questions to ask yourself:
1 – What do you already know?
Are you already comfortable with one system or another? Do you believe only in open-source systems, or only in proprietary ones? Have you already installed Nagios-like monitoring systems, or have you only used SCOM (poor you…)? Do you want to spend time on it?
→ I’m used to Nagios and Zabbix. I have neither the time nor the resources to learn a totally different system, but I love to play in configuration files for hours in my free time… during office hours.
2 – How is your physical architecture?
Are there several datacenters, located in different places?
→ Here at Streamdata.io, we have several places to monitor: our production datacenters, some dedicated servers (website, IT management) and our headquarters.
At this point, I know I don’t want a monolithic system, nor one monitoring tool with a different configuration at each geographic location.
3 – What is your logical architecture?
What kind of OS / services / resources do you need to check?
→ Full Linux, mostly Debian/Ubuntu. Nothing special on the hardware side for production: we will use AWS for now. Even if that evolves in the future, we’ll still rely on an IaaS layer. Let’s live in our time and concentrate on what’s important: dealing with hardware is such a waste of time (I’ve been scarred by painful nights trying to smoothly swap switches in production datacenters…).
4 – How do you want to monitor?
SNMP traps, NRPE server, SSH,…
→ I prefer to use NRPE / SSH. I think SNMP is very powerful but really complicated for needs outside standard monitoring.
5 – When does a service become critical?
What deserves to wake you up at 2 AM on a lazy Sunday…
→ Streamdata.io is a software vendor running in the Cloud. If our software is not reachable, it is always a real and critical problem, as it can potentially impact all of our customers.
If we detect that our website can’t be pinged in less than 1 second, that is critical too, but there is no need to disturb you while you sleep (or the person sleeping by your side, or your employer: the former for their peace of mind, the latter because you won’t bill them for the night… I hope for your sake it’s in that order!).
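As a rough sketch of that rule: the snippet below separates "critical" from "worth waking someone up". The `Alert` class, the `customer_facing` flag and the 22:00 to 07:00 quiet-hours window are all assumptions made up for illustration; they are not part of Shinken or of any other monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Alert:
    service: str           # hypothetical service name
    customer_facing: bool  # assumption: True if customers are impacted


def should_notify_now(alert: Alert, now: datetime,
                      quiet_start: time = time(22, 0),
                      quiet_end: time = time(7, 0)) -> bool:
    """Decide whether a critical alert should reach a human right away."""
    # Customer-facing outages always page, whatever the hour.
    if alert.customer_facing:
        return True
    # The quiet window wraps around midnight, hence the "or".
    in_quiet_hours = now.time() >= quiet_start or now.time() < quiet_end
    # Everything else waits until the quiet hours are over.
    return not in_quiet_hours
```

With these assumptions, an unreachable API at 2 AM pages immediately, while a slow website ping at the same hour waits until morning.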
6 – What kind of view is necessary?
Do you need SLA reports, a business view, an IT view…
→ I know a business view is valuable for marketing people, so I think it’s a must-have nowadays. Someone who has to manage a global view must be able to understand instantly what the problem is, but it is just as important to understand immediately what the impacts are!
WHAT AND WHY
With these answers, my interest was mainly focused on four tools: Zabbix, Nagios, Centreon and Shinken.
Zabbix is a great tool, very effective, but mainly used for SNMP monitoring, and that is not what I want. It’s also not as intuitive as the three other tools.
I quickly removed Nagios from the list: no more development in the main open-source branch, too monolithic. It is still a standard in the monitoring and IT world, but there are better alternatives, like…
The all-in-one solution with Centreon is pretty cool and not so hard to install, but I dislike the web-configuration principle: I’m faster when I can play with flat files directly on the filesystem.
There are also a lot of native modules I don’t need.
Here at Streamdata.io, we chose Shinken. I think it is the network monitoring software that best suits our needs, for the following reasons:
An active community is one of the most important points for me. The software must, of course, be production-ready, and for that the community has to be active. It is one of the great qualities of Shinken: Jean Gabès and his team are really involved and listen to the community, and Shinken is almost fully compatible with Nagios plugins (yes, it means you can use https://exchange.nagios.org/!).
I don’t know if big companies use it, but I believe in this project.
There is also a native packs repository, really easy to use with CLI commands: http://shinken.io/
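To show why plugin compatibility matters, here is a minimal sketch of a Nagios-style check. It follows the standard plugin convention (one status line, optional performance data after a `|`, and a return code of 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The function names and the default thresholds are assumptions of mine, not something shipped with Shinken or Nagios.

```python
import socket
import time

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(elapsed: float, warn: float, crit: float) -> int:
    """Map a response time in seconds to a plugin return code."""
    if elapsed >= crit:
        return CRITICAL
    if elapsed >= warn:
        return WARNING
    return OK


def check_tcp(host: str, port: int, warn: float = 0.5,
              crit: float = 1.0, timeout: float = 5.0):
    """Time a TCP connection and return (return_code, status_line)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        # Refused, unreachable or timed out: report it as critical.
        return CRITICAL, f"TCP CRITICAL - {host}:{port} unreachable ({exc})"
    code = classify(elapsed, warn, crit)
    label = ("OK", "WARNING", "CRITICAL")[code]
    # Text after the "|" is performance data: label=value;warn;crit
    return code, (f"TCP {label} - connected in {elapsed:.3f}s"
                  f" | time={elapsed:.3f}s;{warn};{crit}")
```

A wrapper script would print the status line and exit with the return code; any Nagios-compatible scheduler can then run it, which is what makes the plugins portable between the two tools.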
Peace of mind:
Of course, it might be a risk to run modern monitoring software without big references in production, but Shinken is Nagios-compatible, so it would not be too difficult to roll back to more proven software.
Here is the definition of the daemons (Source: http://shinken.readthedocs.org/en/latest)
Arbiter: Basically, it reads the configuration, cuts it into parts (N schedulers = N parts), and then sends them to all other elements. There is only one per architecture.
Scheduler: The Scheduler daemon is in charge of scheduling checks, analyzing results and triggering follow-up actions (e.g. if a service is down, ask for a host check). Schedulers do not launch checks or notifications.
Poller: They are in charge of launching plugins as requested by schedulers. When the check is finished they return the result to the schedulers. There can be many pollers.
Reactionner: The Reactionner daemon is in charge of notifications and launching event_handlers. There can be more than one Reactionner.
Broker: The Broker daemon provides access to Shinken internal data. Its role is to get data from schedulers (like status and logs) and manage them. The management is done by modules. Many different modules exist: export to graphite, export to syslog, export into ndo database (MySQL and Oracle backend), service-perfdata export, couchdb export and more.
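The "N schedulers = N parts" idea from the Arbiter definition can be pictured with a toy example. This is only an illustration under a strong simplification: it deals hosts out round-robin, which is not Shinken’s real dispatch algorithm, and the `split_config` function and host names are hypothetical.

```python
def split_config(hosts, n_schedulers):
    """Toy split of a host list into one part per scheduler (round-robin)."""
    if n_schedulers < 1:
        raise ValueError("at least one scheduler is required")
    parts = [[] for _ in range(n_schedulers)]
    # Sort first so the split is deterministic for a given host list.
    for index, host in enumerate(sorted(hosts)):
        parts[index % n_schedulers].append(host)
    return parts


# Hypothetical inventory, split for two schedulers.
parts = split_config(["web1", "web2", "db1", "api1", "api2"], 2)
for number, part in enumerate(parts):
    print(f"scheduler-{number}: {part}")
```

Each scheduler only has to deal with its own part, which is why adding schedulers spreads the load.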
This distributed architecture is a real advantage in a multi-datacenter setup: load-balancing or failover systems are easy to implement. Need to add a new site? Just add a poller in your new network!
All your configuration can easily be done on the main Arbiter. You won’t need to manage X different servers, just one configuration split into X parts.
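To make the "just add a poller" idea more concrete, here is a toy model of routing a check to a poller living in the same site, with a fallback to a spare. It is a sketch of the principle only: the `Poller` class, its fields and the site names are assumptions for illustration and do not reproduce Shinken’s actual code or configuration syntax.

```python
from dataclasses import dataclass


@dataclass
class Poller:
    name: str
    site: str            # hypothetical site label, e.g. "aws" or "hq"
    alive: bool = True
    spare: bool = False  # assumption: spares only take over on failure


def pick_poller(pollers, site):
    """Return a live poller for the site, preferring non-spare ones."""
    candidates = [p for p in pollers if p.site == site and p.alive]
    # False sorts before True, so regular pollers come before spares.
    candidates.sort(key=lambda p: p.spare)
    return candidates[0] if candidates else None


pollers = [
    Poller("poller-hq", "hq"),
    Poller("poller-aws-1", "aws"),
    Poller("poller-aws-spare", "aws", spare=True),
]
```

Supporting a new site then amounts to appending one more `Poller` entry to the list; nothing else in the routing logic changes.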
Monitoring software is one of the most forked areas of open source… It’s always complicated to find the right tool when you want or need to change. I hope this blog post helps you ask yourself the right questions, and maybe find some tools that will suit your needs!
The next blog post in this series will focus on more technical aspects of the deployment (tweaks I had to do…). I will keep you posted…