Great!
I'm currently working on how to bring log and application performance monitoring under the same roof for cloud-native, highly distributed applications on top of OpenStack with Cloud Foundry or OpenShift and Kubernetes add-ons. I'm also defining some best practices (needs) for building a simple yet effective cloud-native application monitoring solution for BizDevOps (yet another buzzword :-)).
My 10 BizDevOps needs are:
- Bring log and performance monitoring under the same roof by providing seamless correlation between logs and performance metrics.
- Provide intuitive pre-built monitoring interfaces and dashboards for everybody and for different roles and organizations (BizDevOps) (note: people lack the time and sometimes the skills to configure a monitoring tool).
- Build dedicated dashboards for transaction and correlation analysis to track down the usual suspects, such as memory leaks, garbage collection and saturated thread pools, as well as the hundreds of unusual suspects that might be the root cause of problems.
- Enhance the quality of logs (at the PaaS and application level), define custom metrics that are specific to our cloud-native applications, and visualize these metrics on custom dashboards for tenants with different roles.
- Analyze long-term trends, such as: how big is my database and how fast is it growing? How quickly is my daily active user count growing?
- Implement innovative ideas such as data mining, forecasting and advanced analytics support to provide added value to the monitoring solution.
- Get alerts on issues before customers notice, use the monitoring tool as an early warning system, and analyze application performance before and after new code deployments.
- If remediation actions are triggered through the monitoring solution, require human approval before the script is executed (this provides a better understanding of the root cause of the problem and how to eliminate it in the long term).
- Implement a simple yet effective alerting system with a clear escalation path and low noise (rules that generate alerts for developers or operators should be simple to understand and represent a clear failure).
- Combine heavy use of white-box monitoring with modest but critical uses of black-box monitoring and learn from others like Google about how they are monitoring their highly distributed systems: https://www.oreilly.com/ideas/monitoring-distributed-systems
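
To make the human-approval point above a bit more concrete, here is a minimal Python sketch of an approval gate that sits between an alert and its remediation script. Everything in it (`ApprovalGate`, `RemediationRequest`, `propose`, `approve`, `reject`) is a hypothetical name I made up for illustration and is not the API of any particular monitoring tool. The idea is simply that an alert may only propose an action, and nothing runs until a named person approves it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RemediationRequest:
    """A remediation proposed by an alert, waiting for a human decision."""
    alert: str                  # what fired, e.g. "thread pool saturated"
    action_name: str            # human-readable name of the script
    action: Callable[[], str]   # the remediation itself; NOT run on proposal


class ApprovalGate:
    """Hypothetical gate: alerts propose actions, humans approve or reject."""

    def __init__(self) -> None:
        self.pending: Dict[int, RemediationRequest] = {}
        # Each entry: (request id, decision, who decided, action name)
        self.audit_log: List[Tuple[int, str, str, str]] = []
        self._next_id = 1

    def propose(self, alert: str, action_name: str,
                action: Callable[[], str]) -> int:
        """Queue a remediation without executing it; return its request id."""
        request_id = self._next_id
        self._next_id += 1
        self.pending[request_id] = RemediationRequest(alert, action_name, action)
        return request_id

    def approve(self, request_id: int, approver: str) -> str:
        """Run the queued action and record who approved it."""
        request = self.pending.pop(request_id)
        result = request.action()
        self.audit_log.append(
            (request_id, "approved", approver, request.action_name))
        return result

    def reject(self, request_id: int, approver: str) -> None:
        """Drop the queued action without running it and record who rejected it."""
        request = self.pending.pop(request_id)
        self.audit_log.append(
            (request_id, "rejected", approver, request.action_name))
```

In practice the monitoring side would call `propose(...)` when a rule fires, and the on-call person would call `approve(...)` or `reject(...)` after looking at the dashboards. The audit log then doubles as a record of which root causes keep coming back.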