Logging Pitfalls

Over my career, I have seen and experienced many pitfalls related to logging. Today, I'll detail some of them and jot down my thoughts. These are my own opinions, so feel free to disagree.

What Logs Are Supposed to Be

The whole point of using logs is:

1. Logs should give us enough data to debug any errors.
2. Logs give a status update on the inner workings of the system.

Sometimes these two points go hand in hand; other times they are separate. To analyze improper usage of logs, we must first look at what proper usage is. We will limit this to logging within an application, since log retention, aggregation, protection, etc. are more infrastructure related and are something a typical engineer will not touch or work on.

According to Better Stack, when logging we should:

1. Establish clear objectives.
2. Use correct log levels.
3. Use structured logs.
4. Write meaningful logs.
5. Sample logs.
6. Use canonical logs.
7. Not ignore the performance cost of logging.
8. Not use logs for monitoring.
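To make items 3 and 6 of the list above a bit more concrete, here is a minimal sketch using only Python's standard `logging` module. The `JsonFormatter` class, the `orders` logger name, and the field names are all my own illustrative assumptions, not something Better Stack prescribes.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so fields can be queried by key."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes record.fields.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "orders") -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()
# One canonical line per request, emitted at the end of the request,
# instead of many partial lines scattered through the code path.
logger.info(
    "request completed",
    extra={"fields": {"order_id": "o-123", "status": 200, "duration_ms": 42}},
)
```

The idea is that a single line carries everything needed to understand the request, and every value sits under a named key rather than being buried in free text.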

Note that I skipped a few steps, as those deal with infrastructure. With these in mind, let's explore some of the pitfalls I have witnessed in production environments.

Excessive Logging

This is a common pitfall when developing a system from scratch. It can be tempting to log every step of the way in order to help debug the system as we develop it. In fact, I have seen many inexperienced developers log just about everything in a function to show correctness. It also makes for an easy demo: point to a log line and say the system reached this point. These are fine in a local development environment, but if these log lines get deployed to production, there are severe drawbacks.

Many times, these are just "it reached here" log lines. They give developers confidence that a particular function has been called or a particular condition has been reached, and they serve no other purpose. I have seen applications so full of them that performance started to suffer.

The only real value of these logs is as status updates, and status updates are only useful if a human is waiting on them. For example, when I start a Docker container and wait for it to come online, a log line that indicates progress helps sate my impatience.

Otherwise, these types of log lines are rarely needed. They exist because of a lack of tests surrounding the application. With sufficient tests, we should have enough confidence in the correctness of the function or application that we do not need them.
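As a rough illustration of the trade, here is a made-up `apply_discount` function written both ways. The function, the `pricing` logger name, and the numbers are hypothetical; the point is only that the test gives the same confidence the progress logs were trying to give, without shipping anything extra to production.

```python
import logging

logger = logging.getLogger("pricing")


def apply_discount_noisy(total_cents: int, percent: int) -> int:
    """The 'it reached here' style: every step announces itself."""
    logger.info("entered apply_discount")
    if percent > 0:
        logger.info("reached discount branch")
    discounted = total_cents * (100 - percent) // 100
    logger.info("leaving apply_discount")
    return discounted


def apply_discount(total_cents: int, percent: int) -> int:
    """Same arithmetic with no progress logs; tests cover correctness."""
    return total_cents * (100 - percent) // 100


def test_apply_discount() -> None:
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(1000, 25) == 750
    assert apply_discount(999, 50) == 499  # integer cents, rounds down
```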

Debug Logging

INFO, WARN, ERROR, and FATAL are well-established log levels. We can debate whether we need FATAL, since it signals an unrecoverable error and is accompanied by stack traces up and down the stack. These levels are fairly intuitive to use, but what of DEBUG?

To me, DEBUG should be avoided *UNLESS* there is an easy way of turning on debug logging in production. If there is no easy way (e.g. you need to redeploy the app to turn on DEBUG logging), DEBUG logs effectively function the same as INFO logs. Having a bunch of DEBUG logs in the codebase but not using them just adds bloat. Even when there is an easy way of turning them on in production, they should be transitory: remove them once you are done troubleshooting.
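For reference, an "easy way" could look something like the sketch below. It assumes Python's standard `logging` module; `set_log_level` and the `payments` logger name are hypothetical, and how the function gets called (an admin endpoint, a signal handler, a config watcher) is left open.

```python
import logging

logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)


def set_log_level(level_name: str) -> None:
    """Change the level of the running process; no redeploy required.

    Logger.setLevel accepts level names such as "DEBUG" and raises
    ValueError for names it does not recognize.
    """
    logger.setLevel(level_name.upper())


# While troubleshooting:
set_log_level("debug")
logger.debug("charge request payload validated")  # now passes the logger's level check

# Once done, go back to the normal level:
set_log_level("info")
```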

Even if you have an easy way of enabling DEBUG level logging in production, I'd argue it's not worth it. If you need to turn it on constantly, then the application is not in a good state and needs additional time and effort to get there. If it is needed less than 1% of the time, do you really need to devote lines and lines of code to something that rarely happens?

Logs as Metrics

I've seen teams rely on logs as metrics. Rather than taking the time to establish a metrics system, the team just aggregated specific log lines. The resulting dashboard is a bit slow to respond, but it works. I've even seen teams alert on specific log lines. This works until you need to refactor the code. Changing just the text of a log can mean losing alerts and rendering dashboards useless.

Do *NOT* rely on log lines for metrics and alerts, even though they seem interchangeable. Log text is subject to change, and relying on it can be ruinous. Furthermore, logs are not ideal for metrics: they are heavier, they contain more information, and they therefore take more resources to query and aggregate.
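Here is a small sketch of the separation I have in mind. The `METRICS` counter is only an in-process stand-in for whatever metrics client you actually use, and the metric and function names are made up for illustration.

```python
import logging
from collections import Counter

logger = logging.getLogger("checkout")

# Stand-in for a real metrics client: a plain in-process counter.
METRICS = Counter()


def record_payment_failure(reason: str) -> None:
    # The metric name is a stable contract for dashboards and alerts.
    METRICS[f"payment_failures_total.{reason}"] += 1
    # The log text is for humans and can be reworded freely.
    logger.warning("payment failed: reason=%s", reason)


record_payment_failure("card_declined")
record_payment_failure("card_declined")
record_payment_failure("timeout")
```

Dashboards and alerts read the counter, so rewording the warning during a refactor leaves them untouched.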

Logs as Verification in Tests

Sometimes, logs are used to verify that a step has completed within a test. This should be avoided whenever possible. As with the Logs as Metrics section above, the text of the logs is subject to change. This case is not as severe, since any change to the text will show up right away as a test failure, but why put ourselves in that position? If there is no way of verifying that a function did its job other than reading the log lines, this is a code smell, and a refactor is probably warranted.
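One possible shape for such a refactor is to have the function report what it did through its return value, so the test never needs to look at the log. This is only a sketch; `import_rows`, `ImportResult`, and the row format are hypothetical.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("importer")


@dataclass
class ImportResult:
    imported: int
    skipped: int


def import_rows(rows: list[dict]) -> ImportResult:
    """Count rows that have an 'id' as imported; report the outcome as a value."""
    imported = sum(1 for row in rows if "id" in row)
    result = ImportResult(imported=imported, skipped=len(rows) - imported)
    logger.info(
        "import finished: %d imported, %d skipped", result.imported, result.skipped
    )
    return result


def test_import_rows() -> None:
    result = import_rows([{"id": 1}, {"name": "no id"}, {"id": 2}])
    # Asserts on behavior, so rewording the log line cannot break this test.
    assert result == ImportResult(imported=2, skipped=1)
```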