At Sitewards, we have employed a number of monitoring systems over the years in an attempt to get some insight into our systems. At least the following have been used:
- New Relic
- Maybe more?
More or less, they all work fine. However, recently I have been pushing a system called Prometheus onto my colleagues. Broadly speaking, Prometheus is much the same as the aforementioned, but with a couple of exceptions:
- It's free, and always will be. It's "owned" by the CNCF (Linux Foundation)
- It has certain opinions about how to think about system performance that I like
- It's simple, and there is lots of tooling. Indeed, we use the stock Prometheus helm chart provided by Kubernetes
Prometheus is used here as our primary alerting tool. We monitor various things to various levels with it; most importantly simple "pingdom style" uptime monitoring.
At Sitewards, we do a thing called "Improvement Day". Every few weeks we take a day as a team to collectively improve our working situation. Quite some things have come out of these days, such as:
- A Wiki
- Various Kubernetes discussions
- Lots of knowledge sharing
Today, I was super excited when our CTO and a colleague decided they wanted to learn more about the systems that I have been building. In particular we discussed the following:
- The current email format is quite unclear. Naive users are left with no clear idea what to do to resolve an issue, and the content of the email is somewhat misleading.
- There are basically no New Relic style dashboards, which are the "lingua franca" of the developers currently -- they check a dashboard and see if anything looks out of the ordinary.
We set out to address these things, but were immediately hit with some challenges. They are largely of my own design, and this post is set out as somewhat of a warning about them.
- The emails do not provide super useful info.
Emails are how alerts have previously been delivered, and they are thus still the de facto alerting vehicle. Within the emails that are sent out, there are links to:
a. The alert manager,
b. The source,
c. A run book
However, a brief glance at the email shows that the alert manager link is by far the largest. It's also not so useful in most alert situations; it is clear an alert is happening -- that's how the email was delivered!
To help address this, the email was redesigned to pass on only the bare minimum of content, and instead direct the user to a Playbook that further describes the alert and provides them with useful information about the queries to make and how to investigate the problem further.
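As a rough illustration of what "bare minimum" means here, the sketch below renders an email body with only a status, a summary and a Playbook link. It is not our actual template (Alertmanager renders its notifications with Go templates); the `labels`, `annotations` and `status` fields mirror the shape of the alert data Alertmanager works with, while the `runbook_url` annotation, the `WebsiteDown` alert and the wiki URL are assumptions made up for the example.

```python
def render_alert_email(alert: dict) -> str:
    """Render a deliberately short plain-text email body for one alert."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})

    name = labels.get("alertname", "UnknownAlert")
    summary = annotations.get("summary", "No summary provided.")
    # Assumption: the Playbook link lives in a `runbook_url` annotation.
    playbook = annotations.get("runbook_url", "(no playbook linked yet)")
    status = alert.get("status", "firing").upper()

    # Three pieces of information only: what, a one-line why, and where to go.
    return "\n".join([
        f"[{status}] {name}",
        "",
        summary,
        "",
        f"What to do next: {playbook}",
    ])


# Hypothetical alert, for illustration only.
example_alert = {
    "status": "firing",
    "labels": {"alertname": "WebsiteDown", "severity": "critical"},
    "annotations": {
        "summary": "The uptime probe for the shop has failed.",
        "runbook_url": "https://wiki.example.invalid/playbooks/website-down",
    },
}

print(render_alert_email(example_alert))
```

The point of the design is what is left out: no link to the alert manager, no raw label dump, just a pointer to the place where the instructions live.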
- Prometheus is very, very opaque
When a user visits Prometheus they are confronted with a single input field, and a graph layout. More experienced users appreciate this minimal detail, as they will rapidly begin to make ad-hoc queries of the interface, view the graph, refine the query and repeat. But for new users, this blank page is very hard to decipher.
To help address this, the emphasis of the process has shifted. Previously, a user would get an alert and then open New Relic to determine whether something was wrong; now, they get an alert and then immediately open the Playbook.
The Playbooks are currently being modified such that they present a very clear set of instructions, including:
a. Notify the product owners,
b. Prepare the material required to debrief after resolution,
c. Specific queries to replicate the alert condition, and advice on how to further pinpoint the issue (either in Prometheus or on the node directly)
d. Steps to resolve the issue
It is hoped that, by moving the emphasis away from the Prometheus console and towards a procedural set of correction steps, the person actioning the alert can make their determination and resolve the issue much faster, without needing an in-depth knowledge of Prometheus.
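To make the shape of such a Playbook a little more concrete, here is a minimal sketch that models the four parts above as data and turns each query into a link against the Prometheus HTTP API (`/api/v1/query`). Everything specific in it is hypothetical: the `Playbook` class, the Prometheus hostname, the `shop` job and the example steps are invented for illustration, and our real Playbooks are written documents rather than code.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Hypothetical address; substitute wherever Prometheus is actually reachable.
PROMETHEUS_URL = "https://prometheus.example.invalid"


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a link that runs `expr` as an instant query via the HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


@dataclass
class Playbook:
    alert_name: str
    notify: list[str]            # a. who to tell
    debrief_material: list[str]  # b. what to collect for the debrief
    queries: dict[str, str]      # c. description -> PromQL expression
    resolution_steps: list[str]  # d. what to actually do

    def render(self) -> str:
        lines = [f"Playbook: {self.alert_name}", ""]
        lines.append("a. Notify the product owners")
        lines += [f"   - {person}" for person in self.notify]
        lines.append("b. Prepare the material for the debrief")
        lines += [f"   - {item}" for item in self.debrief_material]
        lines.append("c. Queries to replicate the alert condition")
        for description, expr in self.queries.items():
            lines.append(f"   - {description}: {expr}")
            lines.append(f"     {instant_query_url(expr)}")
        lines.append("d. Steps to resolve the issue")
        lines += [f"   {n}. {step}" for n, step in enumerate(self.resolution_steps, 1)]
        return "\n".join(lines)


# Hypothetical content, for illustration only.
website_down = Playbook(
    alert_name="WebsiteDown",
    notify=["Product owner of the shop"],
    debrief_material=["Time the alert fired", "Screenshots of the error"],
    queries={"Is the target still down?": 'up{job="shop"} == 0'},
    resolution_steps=["Check the node directly", "Restart the failing service"],
)

print(website_down.render())
```

The reader never has to type a query into the blank Prometheus page; the Playbook hands them the exact expression along with a link that runs it.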
- The deployment process is basically impossible to understand
The deployment to our monitoring stack is handled by a custom-built helm chart, which basically wraps a whole series of community charts in some configuration. However, Kubernetes is a complex system, and those who wish to make modifications to the alerting stack behaviour must first overcome the technical barrier of deploying to it.
Little was done today to resolve this. The custom Prometheus stack previously developed was swapped for the one defined by the upstream charts, which is far better documented, but this still relies on a broad knowledge of Kubernetes, helm, charts, etc.
In future, I hope to make this service continuously deployed as many of our other services are, and significantly reduce the barrier of deployment. Because this is a stateful service this comes at the risk of users accidentally breaking the stack, but there is monitoring on the monitoring that is carefully attended to and this should be picked up quickly.
To conclude, I am still confident that Prometheus is a good direction to move in as a monitoring stack. However, I have significantly underestimated the investment required to develop good process around Prometheus (and operations more generally). While it will be a painful process involving lots of conversations around "what is monitoring" and "what is alerting", Prometheus exposes superb primitives to build additional human processes around, and I believe it is a solid tool to underpin our metrics collection stack.
Still reading? Awesome! I'm glad you enjoyed the post. If the technology is something that interests you and you want to save yourself some of the pain of discovering these problems yourself, you can hire us and we'll teach you as much as we can, and give you the tools required.
If you're not interested and just want to be a consumer of this awesome stack, I suggest you come work for us. We're passionate about building beautiful, reliable web technology and we're given the agency to actually do it.
Thanks for your time. <3