This quote from Charity Majors is probably the best summary of the current state of observability in the tech industry: total, mass confusion. Everyone is confused. What is a trace? What is a span? Is a log line a span? Do I need traces if I have logs? Why do I need traces if I have great metrics? The list of questions like these goes on. Charity - together with other great folks from the observability system called
Thank you so much for writing this. You are so spot on!
- Logs capture information with the intent to diagnose issues. Logs should not capture success cases. They should focus on failure cases.
- Metrics capture aggregates like counters. Metrics should be derived from Events.
- Events are data points in a series + metadata. The metadata enables slicing and dicing data points in innumerable ways.
Events are the unlock. When you start with events, metrics are easy. When you start with metrics, events are impossible. And logs ... blech. What a mess. Logs are the junk drawer of observability.
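The "metrics are easy once you have events" point can be sketched in a few lines. This is a toy in-memory example with hypothetical event fields, not any particular system's API:

```python
from collections import Counter

# Hypothetical wide events: one record per request, with metadata
# ("service", "region", ...) that enables slicing and dicing after the fact.
events = [
    {"service": "checkout", "region": "eu", "status": 200, "duration_ms": 35},
    {"service": "checkout", "region": "us", "status": 500, "duration_ms": 120},
    {"service": "checkout", "region": "eu", "status": 200, "duration_ms": 41},
]

# Metrics fall out of events: aggregate by any dimension, chosen at query time.
errors_by_region = Counter(e["region"] for e in events if e["status"] >= 500)
print(errors_by_region)  # Counter({'us': 1})
```

Going the other way is impossible: a pre-aggregated counter of errors per region cannot be re-split by a dimension that was never recorded.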
Love this. It is very difficult to make sense of the OpenTelemetry concepts.
Great article, thanks!
Here is a link to the original Scuba paper I guess: https://research.facebook.com/publications/scuba-diving-into-data-at-facebook/
Maybe you know other public sources with its implementation details?
Scuba actually reminds me of Graylog's GELF logs: you can have a message with key/value pairs and then aggregate/filter/group them in the Graylog UI.
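For readers unfamiliar with GELF, a payload is a flat JSON document whose custom key/value pairs are underscore-prefixed (per the GELF 1.1 spec); a minimal sketch, with made-up host and field values:

```python
import json
import time

# Minimal GELF-style payload (sketch). "version", "host", "short_message",
# "timestamp" and "level" are standard GELF 1.1 fields; underscore-prefixed
# keys are the custom fields Graylog lets you filter and group on.
gelf = {
    "version": "1.1",
    "host": "checkout-7",            # hypothetical host name
    "short_message": "payment declined",
    "timestamp": time.time(),
    "level": 3,                      # syslog severity: error
    "_order_id": "o-123",            # custom, queryable key/value pairs
    "_region": "eu",
}
print(json.dumps(gelf))
```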
A colleague of mine recently sent me a link to your post. I couldn't agree with you more which is one reason I have been working on https://github.com/eventrelay/eventrelay.
I would say that Wide Events sound a lot like Structured Logging.
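The resemblance is easy to show: a wide event is essentially one structured log record per unit of work, with every field kept as a queryable key/value pair rather than interpolated into a message string. A minimal sketch; the helper and field names are hypothetical:

```python
import json

def request_event(**fields):
    # One wide, structured record per unit of work, rendered as a JSON line.
    return json.dumps(fields, sort_keys=True)

line = request_event(service="checkout", status=200, duration_ms=35)
print(line)  # {"duration_ms": 35, "service": "checkout", "status": 200}
```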
Horses for courses. I would argue this post covers only one use case for observability. There are more. E.g. when you want to track a specific distributed transaction, you must have a full trace available and be able to look at it as a tree. It's very powerful. But you cannot just sample random events in a trace and expect it to still work. Additionally, it imposes very different requirements on collection (the context needs to be passed). Also, all queries that look at traces (rather than individual spans/events) must group the spans into traces.
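The grouping requirement described above can be sketched as follows. The span records with trace_id/span_id/parent_id are illustrative, not any particular tracer's schema:

```python
from collections import defaultdict

# Spans carry a trace id plus a parent pointer; trace-level queries must
# first group spans by trace_id, then link children to parents into a tree.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "api"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "db"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "cache"},
]

traces = defaultdict(list)
for s in spans:
    traces[s["trace_id"]].append(s)

def children(trace, span_id):
    # Dropping a span at random would orphan its children here -- which is
    # why per-span sampling breaks trace reconstruction.
    return [s for s in trace if s["parent_id"] == span_id]

root = next(s for s in traces["t1"] if s["parent_id"] is None)
print(root["name"], [c["name"] for c in children(traces["t1"], root["span_id"])])
# api ['db', 'cache']
```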
The signals are not just about the different semantics associated with them (even when they eventually mean the same thing). It's also about the relationships between records.
Did Meta ever invest in a tool that could automatically tell you which dimension(s) had an abnormality, or how much each contributed to a regression? E.g., new OS versions, app versions, countries, carriers, etc. commonly point to the issue. Doing this automatically, maybe on demand, would be computationally expensive, but it would just give you the answer and prevent manual hunting and pecking.
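As a rough illustration of what such a tool would compute, one naive approach is to compare failure rates per value of each candidate dimension and flag the value that stands out. Toy data with hypothetical fields, not a description of any real Meta tooling:

```python
from collections import defaultdict

# Toy wide events; "ok" marks whether the request succeeded.
events = [
    {"os": "ios17", "country": "US", "ok": True},
    {"os": "ios18", "country": "US", "ok": False},
    {"os": "ios18", "country": "DE", "ok": False},
    {"os": "ios17", "country": "DE", "ok": True},
]

def failure_rate_by(dim):
    # Failure rate for each value of one dimension.
    total, failed = defaultdict(int), defaultdict(int)
    for e in events:
        total[e[dim]] += 1
        failed[e[dim]] += not e["ok"]
    return {v: failed[v] / total[v] for v in total}

print(failure_rate_by("os"))       # {'ios17': 0.0, 'ios18': 1.0} -> ios18 stands out
print(failure_rate_by("country"))  # both 0.5 -> no separation, not the culprit
```

A production version would need to handle correlated dimensions and statistical significance, which is where the computational cost comes in.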
How do you deal with schemas and versioning of wide events?
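One common answer (offered here only as a sketch, not as what the post describes) is to stamp each event with an explicit schema version and upgrade old shapes on read; all field names below are hypothetical:

```python
def upgrade(event):
    # Upgrade a v1 event to v2 on read; old data never has to be rewritten.
    event = dict(event)  # don't mutate the caller's record
    if event.get("schema_version", 1) == 1:
        # hypothetical change: v1's single "latency" field became "handler_ms"
        event["handler_ms"] = event.pop("latency", None)
        event["schema_version"] = 2
    return event

old = {"schema_version": 1, "latency": 40}
print(upgrade(old))  # {'schema_version': 2, 'handler_ms': 40}
```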
Nicely written!
Adding more fields to log.Errorf() is a pain, although Kibana does give nice visualizations on logs.
We log JSON/proto events to BQ, but I'm not sure about the UI.
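The pain of threading fields through log.Errorf-style format strings is what structured emitters avoid: fields stay as key/value pairs instead of being baked into the message. A minimal Python sketch; the helper is hypothetical, not a specific library's API:

```python
import json
import sys

def log_error(msg, **fields):
    # Emit the error as one structured record so every field stays queryable,
    # rather than interpolating values into a printf-style message.
    record = {"level": "error", "msg": msg, **fields}
    print(json.dumps(record, sort_keys=True), file=sys.stderr)
    return record  # returned here only for illustration

rec = log_error("payment failed", order_id="o-123", retries=3)
```

Adding a new field is then one keyword argument, with no format string to edit.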