How a single message broke all our monitoring and dashboards

So, this just happened (or, depending on when you are reading this, it happened some time ago), and while I look at my beautiful dashboards working again, I am contemplating how fragile software can still be these days, despite all the modern tech…

For the last 1.5 years, I have been using Axiom for all of my log ingestion, querying, and monitoring needs. It is a great product, and I never had a single issue with it in all that time. Spoiler alert: even today, when it failed, it was actually my fault. But let’s see what happened.

Axiom has a feature called virtual fields that allows you to compute log properties on the fly from other properties. One use case is parsing JSON payloads in logs and then expanding them with fields from that parsed JSON. For example, if I have the log entry:

{
  "path": "/example",
  "message": "{\"event\":\"impression\",\"id\":\"jhfdsf334j24j\"}"
}

I can use a virtual field to extract the event property from that log entry and then filter on it as with any other “real” log field. Virtual fields are (like other queries) written in Axiom Processing Language (APL). It looks something like this:

['dataset']
| extend event = iff(isnotnull(parse_json(message)), parse_json(message)["event"], "")

To explain it quickly:

iff returns its second argument when the condition is true and its third argument otherwise (here: the value of event, or an empty string)

isnotnull checks that the value is not null, to avoid indexing into messages that are not JSON

parse_json parses the value as JSON

With the above I could query all of my logs in Axiom by the event field, which was great, up until today. I got an email from Axiom saying that one of my monitors had been auto-disabled because the query it was using had failed multiple times. I opened it to see what was going on and quickly closed it because I was busy with something else. A few minutes later I noticed that one of the dashboards I was using was broken. I refreshed the page thinking it was a temporary error, but that did not help. The error said, “cannot index value of type string”.

Since I had not changed the dashboard or any setup on my side for weeks, I checked the Axiom status page and then reported an issue on their Discord server. I quickly got a reply that all systems were green on their end. One thing the Axiom representative suggested was to change the query time range, e.g. to query logs from yesterday instead of the last X hours as I had it set up. To my surprise, changing it to yesterday made my dashboard work again. At this point, I figured that some specific log entry in some period must be causing the error. I played with the FROM and TO dates and narrowed it down to a period of 1s containing 40 log entries. And yep, you read that right: out of millions of logs, it was down to 40 entries in the span of less than a second. Now I had to go in and find out which one of them was faulty.
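The narrowing above was done by hand in the query UI, but the same idea can be sketched as a simple bisection over the time range. This is only an illustration: queryFails is a hypothetical callback standing in for "run the dashboard query over this window and see whether it errors", and the sketch assumes the window contains a single faulty entry.

```javascript
// Bisect a time range [from, to) (milliseconds) down to a window of at most
// minWidthMs that still makes the query fail.
// queryFails(from, to) is a HYPOTHETICAL callback, not an Axiom API:
// it should return true when a query over [from, to) errors.
function narrowWindow(from, to, queryFails, minWidthMs = 1000) {
  while (to - from > minWidthMs) {
    const mid = from + Math.floor((to - from) / 2);
    if (queryFails(from, mid)) {
      to = mid;   // the faulty entry is in the first half
    } else {
      from = mid; // otherwise it must be in the second half
    }
  }
  return [from, to];
}

// Simulated usage: pretend one bad log entry sits at this made-up timestamp.
const badTimestamp = 1234567;
const simulatedQueryFails = (a, b) => a <= badTimestamp && badTimestamp < b;
const sixHoursMs = 6 * 60 * 60 * 1000;
console.log(narrowWindow(0, sixHoursMs, simulatedQueryFails));
```

Each step halves the window, so even a range of several hours shrinks to about a second in a couple dozen queries.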

Before continuing, try to guess which one of them was the culprit.

Knowing one of them was the culprit, I quickly opened the JavaScript console in Chrome, pasted in the entries, and executed JSON.parse on each of them. All of them threw a SyntaxError except for one entry. It was the “email” entry. If you parse that string (with the double quotes) as JSON, you get back the string email (without the double quotes), which perfectly matched the error from my dashboard: “cannot index value of type string”. My virtual field was trying to extract the event property from the string email, and a string cannot be indexed like a dictionary. Since I am using those virtual fields almost everywhere, it broke literally everything. What a mess, right?
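If you want to reproduce the console check, a minimal sketch could look like the one below. Only the quoted "email" message comes from the actual incident; the other sample messages and the classify helper are made up for illustration.

```javascript
// Sample messages. Only '"email"' is from the incident described in the post;
// the other two are invented stand-ins for ordinary non-JSON log lines.
const messages = [
  'plain text log line',
  '{event:impression}', // unquoted keys and values: not valid JSON
  '"email"',            // a valid JSON document that is just a string
];

// Report what JSON.parse produces for a message, or "SyntaxError" if it throws.
function classify(message) {
  try {
    const value = JSON.parse(message);
    if (value === null) return "null";
    return Array.isArray(value) ? "array" : typeof value;
  } catch (err) {
    return "SyntaxError";
  }
}

for (const message of messages) {
  console.log(message, "->", classify(message));
}
```

The point is that a quoted string is a complete, valid JSON document on its own, so a check that only asks "did parsing succeed?" lets it through.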

To fix this, I had to replace my isnotnull check with something else, since it did not cover the above case. Browsing the Axiom docs, I found the gettype utility function, which returns the type of the value passed to it. I also found an exact example in their docs showing that parse_json can return a string.

Using that function, my fixed virtual field now extracts the value only if the parsed JSON value is a dictionary (an object in JavaScript terms). It looks like this:

['dataset']
| extend event = iff(gettype(parse_json(message)) == "dictionary", parse_json(message)["event"], "")
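For readers more at home in JavaScript than APL, the same guard can be sketched roughly as follows. This is an analogue, not how Axiom evaluates the query: extractEvent is a hypothetical helper, and note that plain JavaScript would quietly return undefined when indexing a string rather than raising an error like APL did.

```javascript
// JavaScript analogue of the fixed virtual field (hypothetical helper).
// Returns the "event" property only when the message parses to a plain
// object (a "dictionary" in APL terms); otherwise returns an empty string.
function extractEvent(message) {
  let parsed;
  try {
    parsed = JSON.parse(message);
  } catch (err) {
    return ""; // not JSON at all
  }
  const isDictionary =
    parsed !== null && typeof parsed === "object" && !Array.isArray(parsed);
  if (!isDictionary) {
    return ""; // valid JSON, but a string, number, array, boolean, or null
  }
  return parsed.event ?? ""; // missing "event" also falls back to ""
}

console.log(extractEvent('{"event":"impression","id":"jhfdsf334j24j"}'));
console.log(extractEvent('"email"'));
```

The key design choice is the same as in the APL version: check the type of the parsed value, not just whether parsing succeeded.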

And just like that, everything was working again, and he lived happily ever after…

Thanks for reading! 🙈

