
Merlin---ISST


Our fork / variant of op5 Merlin - used for our distributed version of Nagios

This fork is based on Merlin stable 0.6.7.

http://www.op5.org/community/projects/merlin

CREDITS:

First and foremost, Andreas Ericsson and op5 for this project - it has a terrific database abstraction layer now and promises even better functionality in the near future with regard to creating easy distributed Nagios setups.

Our managers at Comcast, for allowing us to share our work with the open source community (thank you, this is one of the many reasons Comcast is a great place to work!):

  • Jason Livingood
  • Mike Fischer
  • Eric Scholz

Team members at Comcast who have contributed to this variant:

  • Jack Kidwell
  • Ryan Richins
  • Shaofeng Yang
  • Max Schubert

What is different between the op5 version and ours?

Major changes between op5 Merlin and our version, plus our overall Nagios methodology:

Our architecture:

  • Polling tier per project - projects range from 1,000 to 10,000+ hosts
  • Common notification tier - all pollers send notification requests to notification tier, which has a rules engine to perform notification actions.
  • Common data trending tier - We use PNP plus a custom in-house long term data trending solution that stores raw values for very long periods of time.
  • Common database tier - H/A master-master replicated database tier with GSLB name for all pollers to use; can be expanded to master1...masterN circular replication for large horizontal scaling
  • Common web UI tier - Centralized operational view, command and control, and administration for NOC, admin, configuration, and other uses.

In our vision, the operational SAs who maintain and manage projects also have full responsibility for their pollers - they do releases of Nagios configurations when they want, they ensure their pollers are up 24x7, they re-distribute checks as needed, and they watch poller capacity. We provide mentoring and training, and bake as much information (key performance indicators, capacity warnings, etc.) into our software as possible so they can do just about everything they need to do with regard to monitoring without our team having to be in the middle of the process.

Methodology

  • All pollers within a project write to a common database - the schema has been updated to allow for this.
  • Centralized UI (off the poller hosts) handles:
    • Display of current host and service states
    • Remote command and control to all remote pollers
    • Re-distribution of checks
  • Centralized admin UI
    • Loads a user's objects.pre-cache into staging tables (see the sketch after this list)
    • Lets the user select which pollers the checks should run on
    • Creates objects.pre-cache files for each poller along with a common retention.dat file
    • Pushes all code to pollers
    • Triggers restart
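
A hedged sketch of the staging load referenced above: it assumes objects.pre-cache uses the usual Nagios define-block syntax and emits inserts for a hypothetical staging_host table; the table and column names are illustrative, not taken from the real schema.

    #!/usr/bin/env python
    """Parse define-blocks out of a user's objects.pre-cache and emit
    INSERT statements for a hypothetical staging_host table."""
    import re
    import sys

    def parse_blocks(path):
        """Yield (object_type, attributes) pairs from a pre-cache file."""
        block_type, attrs = None, {}
        for raw in open(path):
            line = raw.strip()
            start = re.match(r'define\s+(\w+)\s*{', line)
            if start:
                block_type, attrs = start.group(1), {}
            elif line.startswith('}') and block_type:
                yield block_type, attrs
                block_type = None
            elif block_type and line:
                parts = line.split(None, 1)      # "key<whitespace>value"
                if len(parts) == 2:
                    attrs[parts[0]] = parts[1]

    if __name__ == '__main__':
        for obj_type, attrs in parse_blocks(sys.argv[1]):
            if obj_type == 'host':
                print("INSERT INTO staging_host (host_name, address) "
                      "VALUES ('%s', '%s');"
                      % (attrs.get('host_name'), attrs.get('address')))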

Re-distribution of checks

  • User creates configuration tree in SVN
  • User makes a tag of their configuration tree
  • User additionally checks a gzipped objects.pre-cache into the tag
  • From the administrative UI (we are working on this now):
    • User selects their release tag from SVN
    • Admin UI pulls down release
    • Populates staging versions of the host/service etc. tables from objects.pre-cache
    • User distributes checks through UI
    • User triggers distribution process
  • Distribution code (a sketch follows this list)
    • Create per-poller objects.pre-cache file
    • Create common retention.dat file
    • Push code to all pollers affected
    • Restart all affected pollers
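
The sketch below illustrates what the distribution code described above might look like. It assumes the host-to-poller assignments and the raw host define-blocks have already been pulled from the staging tables, and that pollers are reachable over rsync/ssh; paths, the remote directory, and the restart command are illustrative assumptions, not the actual implementation.

    #!/usr/bin/env python
    """Write one objects.pre-cache per poller, push it out, and restart the
    poller. All names, paths, and commands here are assumptions."""
    import subprocess

    def write_per_poller_precache(assignments, host_blocks, outdir):
        """assignments: {poller: [host_name, ...]}
        host_blocks: {host_name: raw 'define host { ... }' text}"""
        files = {}
        for poller, hosts in assignments.items():
            path = '%s/objects.pre-cache.%s' % (outdir, poller)
            with open(path, 'w') as fp:
                for host in hosts:
                    fp.write(host_blocks[host] + '\n')
            files[poller] = path
        return files

    def push_and_restart(files, remote_dir='/etc/nagios'):
        for poller, path in files.items():
            # push the per-poller pre-cache (retention.dat would be pushed the same way)
            subprocess.check_call(['rsync', '-a', path,
                                   '%s:%s/objects.pre-cache' % (poller, remote_dir)])
            # trigger the restart; the real mechanism may differ
            subprocess.check_call(['ssh', poller, 'service nagios restart'])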

Database

  • Each poller updates instance_id with its packed IP address (converted to an unsigned long) as the unique poller identifier - this also lets us know where to send remote commands from our centralized UI.
  • Added an is_online field to program_status so that new pollers can be added to the potential poller list our administrative UI uses for check distribution without the poller being active (see the sketch after this list).
  • The object import script is only called when a team wants to do a release - instead of being run on every poller, it is run on the admin web host. It populates staging tables that then allow the administrator to distribute hosts and their related services (our distribution granularity is at the host level) among pollers that are available for polling.
  • Once distribution is done from an admin perspective, objects.cache files are created and pushed to each poller - these then drive data back into the database as the remote pollers are restarted.
  • Field naming - where possible we aligned field names with retention.dat names / objects.cache names
  • Views - the database schema includes views we use in our in-process UI; all views can safely be removed from the schema without affecting Merlin operations (we have a single source of truth for our database on our middle tier / web UI project that writes out the Merlin schema to keep the two in sync).
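
A small sketch of how the central UI can use these two schema details; the query text and the byte order of the packed address are our assumptions:

    #!/usr/bin/env python
    """Turn a packed-IP instance_id back into an address the UI can send
    remote commands to, and list pollers eligible for check distribution."""
    import socket
    import struct

    # Pollers register a row in program_status; is_online marks the ones the
    # admin UI may offer as targets for check distribution.
    CANDIDATE_POLLERS_SQL = "SELECT instance_id FROM program_status WHERE is_online = 1"

    def instance_id_to_ip(instance_id):
        """instance_id is the poller's IPv4 address packed into an unsigned
        long (big-endian assumed), so unpacking it tells the UI where to
        send remote commands."""
        return socket.inet_ntoa(struct.pack('!I', instance_id))

    if __name__ == '__main__':
        print(instance_id_to_ip(167838211))   # -> 10.1.2.3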

Merlin:

  • Code added to populate instance_id based on the local host's IP address - the IP address is based on a PTR lookup of the host's hostname as returned by the C gethostname() call (a rough Python equivalent of this derivation is sketched after this list).
  • De-duplicated comment and downtime events by only responding to *_ADD NEB callback types.
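
For the instance_id population, here is a rough Python equivalent of the derivation described above (local hostname, resolved to an IPv4 address, packed into an unsigned 32-bit integer); the Merlin code itself is C and may resolve the address differently, and the byte order is again an assumption:

    #!/usr/bin/env python
    """Illustrative only: hostname -> IP -> packed unsigned 32-bit instance_id."""
    import socket
    import struct

    def local_instance_id():
        hostname = socket.gethostname()           # analogue of the C gethostname() call
        address = socket.gethostbyname(hostname)  # resolve the hostname to an IPv4 address
        return struct.unpack('!I', socket.inet_aton(address))[0]

    if __name__ == '__main__':
        print(local_instance_id())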