In a modern cloud-native, microservices, container-driven, highly available (insert your favorite catchy term here) application design, monitoring is not optional. Yet much time is being wasted in the technology debate (do not get me wrong, it is a good debate in which I have a bias, and it is the Elastic Stack if you care).
Instead, this blog post will focus on what I think is often overlooked: monitoring, like every other system, requires a good understanding of its users and goals. It is a system with a normal life cycle (Dev, QA, Prod), and it will improve not just your production but also your development (DevOps, you know!).
Start by understanding the ecosystem in which your monitoring platform will function. You have mostly three tenants that your monitoring system needs to ‘Enable’ (assuming a large enterprise that lacks the pink-unicorn DevOps talent).
Enabling your tenants
Now that your different engineering teams are identified and their requirements are clearer, you can start enabling those tenants. Some example tasks may be:
- Enable the development team to emit events (logging or otherwise) that provide enough information about the SLIs (Service Level Indicators).
For example, you may enable a Spring Boot/Node.js development team to emit the proper logs. You may build a few libraries and initializers to ensure the effective use of logging.
- Enable the operations team to stand up and tear down the monitoring environments at will.
- Enable the support team to achieve their target SLOs in three main ways:
  - Automatic notifications on certain events (proactive actions).
  - Automated responses to certain events (when possible, to avoid pager fatigue).
  - Dashboards to help investigate issues and decrease your MTTR (Mean Time To Resolution/Response).
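To make the first task in the list a little more concrete, here is a minimal sketch of the kind of small logging library a platform team might hand to developers. It is written in Python purely for illustration (the same idea applies to Spring Boot or Node.js), and every name in it (`JsonFormatter`, `build_sli_logger`, `emit_sli`, the field names) is an assumption of this sketch, not part of any particular monitoring product.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (one event per line)."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": round(record.created, 3),
            "level": record.levelname,
            "service": self.service,
            "event": record.getMessage(),
        }
        # SLI fields arrive through `extra={"sli": {...}}` (see emit_sli below).
        payload.update(getattr(record, "sli", {}))
        return json.dumps(payload)


def build_sli_logger(service: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON events; defaults to stderr."""
    logger = logging.getLogger(f"sli.{service}")
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False
    return logger


def emit_sli(logger: logging.Logger, event: str, **fields) -> None:
    """Emit one event carrying the SLI fields agreed with the support team."""
    logger.info(event, extra={"sli": fields})


if __name__ == "__main__":
    log = build_sli_logger("login-service")  # hypothetical service name
    emit_sli(log, "login_attempt", outcome="success", latency_ms=42)
```

Because the output is one JSON object per line, a log shipper can pick it up without the application knowing which monitoring stack sits behind it.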
Integrating monitoring into the project life cycle
In the classic story of a ‘Login’ screen, a typical flow will look like the following:
Once we include the support engineering team and monitoring in the discussion, we start asking a question:
“What could go wrong here that I may need to alert the administrators about, or simply plot on a dashboard for problem resolution?”
The simple act of asking that question may bring new ideas to the team, and you start to plot a story like this instead:
By integrating monitoring, operations and support tasks into our development environment, we gain the following:
- Inclusion of exception paths that were not handled at all.
- Further enforcement of the “Design for Failure” principle.
- Faster response to problems that would have gone unnoticed or would have taken longer to fix.
The previous steps only cement the status of our support efforts as an “Engineering” practice!
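As a rough illustration of what asking “what could go wrong here?” might do to the ‘Login’ story, the sketch below handles two exception paths explicitly and records an event for each. The event names, the alert threshold, `LoginMonitor` and `AuthBackendDown` are all hypothetical choices made for this example; a real team would agree on its own with the support engineers.

```python
from collections import Counter
from dataclasses import dataclass, field


class AuthBackendDown(Exception):
    """Raised when the (hypothetical) identity store cannot be reached."""


@dataclass
class LoginMonitor:
    """In-memory stand-in for a monitoring backend: counters plus alerts."""

    failure_alert_threshold: int = 5
    counts: Counter = field(default_factory=Counter)
    failures_by_user: Counter = field(default_factory=Counter)
    alerts: list = field(default_factory=list)

    def record(self, event: str, user: str) -> None:
        self.counts[event] += 1  # every event feeds the dashboard counters
        if event == "auth_backend_down":
            self.alerts.append("auth backend unreachable")
        elif event == "login_failed":
            self.failures_by_user[user] += 1
            # Alert once, when the threshold is reached, not on every failure.
            if self.failures_by_user[user] == self.failure_alert_threshold:
                self.alerts.append(f"repeated login failures for {user}")
        elif event == "login_succeeded":
            self.failures_by_user.pop(user, None)


def login(user: str, password: str, check_credentials, monitor: LoginMonitor) -> bool:
    """Login flow with the exception paths made visible to monitoring."""
    try:
        ok = check_credentials(user, password)
    except AuthBackendDown:
        monitor.record("auth_backend_down", user)
        return False
    monitor.record("login_succeeded" if ok else "login_failed", user)
    return ok
```

The happy path is unchanged; what the story gains is that a failing backend and a run of failed attempts are now distinct, countable events instead of the same silent `False`.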
Monitoring life cycle
Monitoring solutions involve all parties: they matter to developers, testers, operations and support. Building a monitoring environment can be a great way to break down the silos between those teams.
As monitoring systems are enabled and custom dashboards/alerts are built for each project, it makes sense to deploy those systems in development environments. That is the perfect way to get support teams involved as early as possible.
Monitoring systems can identify bugs that regular systems would not. Engage the support team and QA team in discussions about which operational situations are acceptable and which are not. Your monitoring will evolve along with your system, and the learnings from one project can flow into others as you improve your ‘base offering’ for each project over time.
Agile applies here!
Do not let the desire to engineer a great system bait you into the analysis paralysis trap. If you do not currently have a monitoring system, almost anything you do will help. Start somewhere, use the “Build to change” rather than the “Build to last” principle, and remember that “Perfect is the enemy of Good”.
Anytime I have a choice between an open-source and a paid tool, I lean towards open source, with the exception of hosted tools that may offer a fast and easy starting point.
Try to avoid tool lock-in (for example, prefer tools that monitor your logs over tools that require modifying your code).
Deploy your first system as soon as possible with minimal dashboards and features, collect feedback from your tenants, make small changes and redeploy. Sounds familiar?
Monitoring the Monitors
By that, I do not mean literally monitoring the monitoring systems (and yes, that is a thing!). I mean constantly evolving your monitoring solutions. As time goes by, you will learn that some of the features your tenants asked for are not used anymore or, even worse, that an event trigger fires so often that it is ignored.
Any dashboard that is rarely used, or any event that triggers so often it is ignored, should be put up for change or removal altogether. The experience shared by large organisations such as Google, and by some startups, tells us that the smaller and more targeted your monitoring system is, the more useful it becomes.
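A periodic review like this can be partly automated. The sketch below assumes you can export, for some review period, how often each alert fired and was acknowledged, and how often each dashboard was viewed. The data shapes and thresholds (`min_fired`, `max_ack_ratio`, `min_views`) are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertStats:
    name: str
    fired: int          # times the alert triggered in the review period
    acknowledged: int   # times someone actually acted on it


def alerts_to_review(stats, min_fired=20, max_ack_ratio=0.1):
    """Names of alerts that fire often but are almost never acted on."""
    flagged = []
    for s in stats:
        if s.fired > 0 and s.fired >= min_fired:
            if s.acknowledged / s.fired <= max_ack_ratio:
                flagged.append(s.name)
    return flagged


def dashboards_to_review(views_by_dashboard, min_views=5):
    """Names of dashboards viewed fewer than `min_views` times, sorted."""
    return sorted(
        name for name, views in views_by_dashboard.items() if views < min_views
    )


if __name__ == "__main__":
    stats = [
        AlertStats("disk_full", fired=3, acknowledged=3),
        AlertStats("cpu_spike", fired=400, acknowledged=4),
    ]
    print(alerts_to_review(stats))
    print(dashboards_to_review({"login": 120, "legacy-batch": 1}))
```

The output is only a list of candidates; whether an alert is changed or removed is still a conversation with the tenant who asked for it.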
Monitor the things you need and nothing else