This fork is based on Merlin stable 0.6.7.
http://www.op5.org/community/projects/merlin
CREDITS:
First and foremost, Andreas Ericsson and op5 for this project - it has a
terrific database abstraction layer now and promises even better functionality
in the near future with regards to creating easy distributed Nagios setups.
Our managers at Comcast, for allowing us to share our work with the open
source community (thank you, this is one of the many reasons Comcast is
a great place to work!):
- Jason Livingood
- Mike Fischer
- Eric Scholz
Team members at Comcast who have contributed to this variant:
- Jack Kidwell
- Ryan Richins
- Shaofeng Yang
- Max Schubert
What is different between the op5 version and ours?
Major changes between the op5 Merlin and our version + our overall Nagios
methodology:
Our architecture:
- Polling tier per project - projects range from 1000 - 10000+ hosts
- Common notification tier - all pollers send notification requests to
notification tier, which has a rules engine
to perform notification actions.
- Common data trending tier - We use PNP plus a custom in-house long term
data trending solution that stores raw values
for very long periods of time.
- Common database tier - H/A master-master replicated database tier with
GSLB name for all pollers to use; can be
expanded to master1...masterN circular
replication for large horizontal scaling
- Common web UI tier - Centralized operational view, command and
control, administration for NOC, admin,
configuration etc use.
In our vision the operational SAs who maintain and manage projects also have
full responsibility for their pollers - they do releases of Nagios
configurations when they want, they ensure their pollers are up 24x7, they
re-distribute checks as needed, they watch poller capacity. We provide
mentoring, training, and bake in as much information (key performance
indicators, capacity warnings, etc) as we can into our software as possible
to let them do just about everything they need to do with regards to
monitoring without our team having to be in the middle of the process.
Methodology
- All pollers within a project write to a common database - schema has been
updated to allow for this.
- Centralized (off poller hosts) UI does
- Display of current host and service states
- Remote command and control to all remote pollers
- Re-distribution of checks
- Centralized admin UI
- Loads a users' objects.pre-cache into staging tables
- Lets the user select which pollers checks should run on
- Creates objects.pre-cache files for each poller along with
a common retention.dat file
- Pushes all code to pollers
- Triggers restart
Re-distribution of checks
- User creates configuration tree in SVN
- User makes a tag of their configuration tree
- User additionally checks in an gzipped objects.pre-cache into tag
- From administrative UI (we are working on this now):
- User selects their release tag from SVN
- Admin UI pulls down release
- Populates staging versions of host/service etc tables with objects.pre-cache
- User distributes checks through UI
- User triggers distribution process
- Distribution code
- Create per-poller objects.pre-cache file
- Create common retention.dat file
- Push code to all pollers affected
- Restart all affected pollers
Database
- Each poller updates instance_id with it's packed IP address -> unsigned
long as the unique poller identifier - this lets us know where to
send remote commands to from our centralized UI as well.
- Added is_online field to program_status so that new pollers can be added
to the potential poller list for our administrative UI to use for check
distribution without the poller being active.
- Object import script is only called when a team wants to do a release -
instead of being run on every poller, it is run on the admin web host.
It populates staging tables that then allow the administrator to
distribute hosts and their related services (our distribution granularity
is at the host level) among pollers that are available for polling.
- Once distribution is done from an admin perspective, objects.cache files
are created and pushed to each poller - these then drive data back into
the database as the remote pollers are restarted.
- Field naming - where possible we aligned field names with retention.dat
names / objects.cache names
- Views - database schema includes views we use in our in-process UI, all
views can safely be removed from the schema without affecting
Merlin operations (we have a single source of truth for our
database on our middle tier / web UI project that writes out
the Merlin schema to keep the two in sync).
Merlin:
- Code added to populate instance_id based on the local hosts' IP address -
the IP address is based on a PTR lookup of the host's hostname as returned
by the C gethostname() call.
- De-duplicated comment and downtime events by only responding to *_ADD
NEB callback types.