Why Autometrics?

Why Metrics?

Too many logs!

Logs are great. We all use them. Printing a log line is the first thing we learn how to do in any new programming language. But our industry places too much emphasis on logs today.

The two main problems with logs are understandability and cost. When running a system in production, it can be hard to find the information you need in the mountains of logs you have probably collected. There's tons of noise for the amount of signal. Of course, many vendors promise to sift through your data to find useful insights -- for a price, of course. Processing and storing logs is not cheap!

Can we do better?

Metrics

Metrics are the next tool in the observability stack most teams turn to. Counting things. The good part about counting things is that it's a lot less expensive than producing and processing tons of text. And, they can help give a clear, overall understanding of how your system is performing.

While metrics aren't a panacea, a core hypothesis of the autometrics project is that metrics are underused relative to their potential usefulness and cost-efficiency.

Metrics today are hard to use

Metrics are a powerful and relatively inexpensive tool for understanding your system in production.

However, they are still hard to use. Developers need to:

  • Think about what metrics to track and which metric type to use (counter, histogram... 😕)
  • Figure out how to write PromQL or another query language to get some data 😖
  • Verify that the data returned actually answers the right question 😫

Why Autometrics?

Simplifying code-level observability

Many modern observability tools promise to make life "easy for developers" by automatically instrumenting your code with an agent or eBPF. Others ingest tons of logs or traces -- and charge high fees for the processing and storage.

Most of these tools treat your system as a black box and use complex and pricey processing to build up a model of your system. This, however, means that you need to map their model onto your mental model of the system in order to navigate the mountains of data.

Autometrics takes the opposite approach. Instead of throwing away valuable context and then using compute power to recreate it, it starts inside your code. It enables you to understand your production system at one of the most fundamental levels: from the function.

Standardizing function-level metrics

Functions are one of the most fundamental building blocks of code. Why not use them as the building block for observability?

A core part of autometrics is the simple idea of using standard metric names and a consistent scheme for tagging/labeling metrics.

The three metrics currently used are:

  • function.calls.count
  • function.calls.duration
  • function.calls.concurrent.

Labeling metrics with useful, low-cardinality function details

The following labels are added automatically to all three of the metrics:

  • function
  • module

For the function call counter, the following labels are also added:

  • caller - (see "Tracing Lite" below)
  • result - either ok or error
  • ok / error - the name of the concrete type of the returned value or error can also be attached as a label

Autometrics aims to only support static values as labels to avoid the footgun of attaching labels with too many possible values. The Prometheus docs (opens in a new tab) explain why this is important in the following warning:

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

"Tracing Lite"

A slightly unusual idea baked into autometrics is that by tracking more granular metrics, you can debug some issues that we would traditionally need to turn to tracing for.

Autometrics can be added to any function in your codebase, from HTTP handlers down to database methods.

This means that if you are looking into a problem with a specific HTTP handler, you can browse through the metrics of the functions called by the misbehaving function.

Simply hover over the function names of the nested function calls in your IDE to look at their metrics. Or, you can directly open the chart of the request or error rate of all functions called by a specific function.