25 Comments
Feb 28 · edited Feb 28 · Liked by Ivan Burmistrov

Thank you so much for writing this. You are so spot on!

- Logs capture information with the intent to diagnose issues. Logs should not capture success cases. They should focus on failure cases.

- Metrics capture aggregates like counters. Metrics should be derived from Events.

- Events are data points in a series + metadata. The metadata enables slicing and dicing data points in innumerable ways.

Events are the unlock. When you start with events, metrics are easy. When you start with metrics, events are impossible. And logs ... blech. What a mess. Logs are the junk drawer of observability.
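
A minimal sketch of that relationship in Go (all field names here are hypothetical): a wide event carries the raw data point plus arbitrary metadata, and any metric is just an aggregation over those events at query time.

```go
package main

import "fmt"

// Event is a single wide data point: a value in a series plus
// arbitrary metadata for slicing and dicing later.
type Event struct {
	Name string
	Meta map[string]string
}

func main() {
	events := []Event{
		{Name: "ad_impression", Meta: map[string]string{"country": "US", "app_version": "1.2.0"}},
		{Name: "ad_impression", Meta: map[string]string{"country": "DE", "app_version": "1.2.0"}},
		{Name: "ad_impression", Meta: map[string]string{"country": "US", "app_version": "1.3.0"}},
	}

	// A "metric" is just an aggregation over events, grouped by
	// whichever dimension matters at query time.
	byCountry := map[string]int{}
	for _, e := range events {
		byCountry[e.Meta["country"]]++
	}
	fmt.Println(byCountry) // map[DE:1 US:2]
}
```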

author
Feb 28 · edited Feb 28

> Events are the unlock. When you start with events, metrics are easy. When you start with metrics, events are impossible. And logs ... blech. What a mess. Logs are the junk drawer of observability.

Love this.

However, reading the HN discussion (https://news.ycombinator.com/item?id=39529775), it looks like some folks are offended by calling logs names. So I will avoid this in the future. Logs are observability history :)

May 31 · Liked by Ivan Burmistrov

Thank you so much for writing this. That's great!

Mar 10 · Liked by Ivan Burmistrov

Love this. It is very difficult to make sense of the OpenTelemetry concepts.


I would say that Wide Events sound a lot like Structured Logging.

author

Yes. I like the "wide events" term more because its "wide" component highlights two things:

- the desire to attach as much information as possible to these events

- the design of the solution that handles those events. Structured logs can be stored in a row-oriented database with a few indexes, but it's not wise to store wide events that way: columnar storage is a must.
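
A toy illustration of the storage difference (not tied to any particular database): row-oriented storage keeps each event's fields together, while columnar storage keeps each field's values together, so aggregating one dimension out of hundreds only has to touch that one column.

```go
package main

import "fmt"

// Row-oriented: each event's fields are stored together.
type EventRow struct {
	Country   string
	LatencyMs int64
	// ... a wide event would carry hundreds more fields
}

// Columnar: each field's values are stored together.
type EventColumns struct {
	Country   []string
	LatencyMs []int64
}

func main() {
	cols := EventColumns{
		Country:   []string{"US", "DE", "US"},
		LatencyMs: []int64{120, 80, 95},
	}

	// Averaging latency scans one contiguous column, regardless
	// of how many other fields the events carry.
	var sum int64
	for _, v := range cols.LatencyMs {
		sum += v
	}
	fmt.Println(sum / int64(len(cols.LatencyMs))) // 98
}
```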

When reading some replies in the HN discussion about structured logs vs wide events, I confirmed the feeling that structured logs are not necessarily associated with "a lot" of information. But it's just a personal preference in the end.

Feb 15 · edited Feb 15

Great article, thanks!

Here is a link to what I guess is the original Scuba paper: https://research.facebook.com/publications/scuba-diving-into-data-at-facebook/

Do you know of other public sources with its implementation details?

author

Yep, that's the one. There is nothing more in public, I think. It's weird how little it's promoted, given that the paper was published in 2013! It's cool that the paper shows UI screenshots as well and doesn't focus only on storage. Storage-wise, I think systems like ClickHouse can do the job well enough.

The UI is a different story though. It's simple yet powerful, but in the open-source world it's hard to find anything that comes close.


Did you check out Rill? https://github.com/rilldata/rill

author

Literally just saw a tweet about their integration with ClickHouse, and it made me wonder :) Do you recommend it?


I'm biased because I work for them :) You can take a quick look at a GIF demo here: https://docs.rilldata.com/notes/0.34. I believe it visualises pretty much the experience you've described in the post (comparing dimensions).

author

Looks nice! It would be really awesome if there were a playground like the one Honeycomb has: https://www.honeycomb.io/sandbox. Really useful for getting a quick feel.


Scuba actually reminds me of Graylog's GELF logs: you can have a message with key/value pairs and then aggregate/filter/group them in the Graylog UI.

author

Never used it, but from a quick look it sounds like a similar thing, yes.

However "native sampling" is a really important Scuba feature - not sure if graylog UI takes into account.


A colleague of mine recently sent me a link to your post. I couldn't agree with you more, which is one reason I have been working on https://github.com/eventrelay/eventrelay.

author

Thank you for sharing!

Feb 28 · edited Feb 28

Horses for courses. I would argue this post covers only one use case for observability. There are more. E.g. when you want to track a specific distributed transaction, you must have the full trace available and be able to look at it as a tree. That's very powerful. But you cannot just sample random events in a trace and expect it to work. Additionally, it imposes very different requirements on collection (the context needs to be propagated). Also, all queries that look at traces (rather than individual spans/events) must group the spans into traces.

The signals are not just about the different semantics attached to them (while eventually meaning the same thing); they are also about the relationships between records.

author

Nobody is denying the importance of smart sampling. When we need certain events to always appear together, we should sample based on some common field in the events (the traceId).

The Ad Impression example from the post has the same challenge. Once an ad impression is served, there can be events associated with that impression: clicks, shares, whatever. And it would be good to be able to see them all together. It's exactly the same problem of needing smart sampling.

And fwiw, I'm not saying that tracing is a bad concept. I'm arguing that it gets presented / defined as a completely different thing, while there are more similarities than differences between wide events / structured logs and traces / spans.
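
One way to implement that kind of sampling (a sketch of the general technique, not any specific system's implementation): make the keep/drop decision a deterministic function of the shared field, so every service that sees the same traceId reaches the same verdict.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sampled makes a deterministic keep/drop decision from a shared
// key (e.g. a traceId or an impression id), so all events that
// carry the same key are either all kept or all dropped.
func sampled(key string, rate float64) bool {
	h := fnv.New64a()
	h.Write([]byte(key))
	// Map the hash onto [0, 1) and compare with the sample rate.
	return float64(h.Sum64()%10000)/10000 < rate
}

func main() {
	// Every service that sees trace "abc123" reaches the same
	// verdict, so sampled traces stay complete.
	fmt.Println(sampled("abc123", 0.1))
	fmt.Println(sampled("abc123", 0.1)) // same answer, always
}
```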


Did Meta ever invest in a tool that could automatically tell you which dimension(s) showed an abnormality, or how much they contributed to a regression? E.g. new OS versions, app versions, countries, carriers, etc. commonly point to the issue. It seems like doing this automatically, maybe on demand, while computationally expensive, would just give you the answer and prevent manual hunting and pecking.

author

There wasn't such a thing at Meta while I was there; maybe it has been developed since. Among the tools available outside of Meta, I know about the BubbleUp feature in Honeycomb, which does exactly this: https://docs.honeycomb.io/working-with-your-data/bubbleup/

Indeed, AFAIK this feature is uber-popular among Honeycomb customers.
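
The core of the idea is conceptually simple (a rough sketch only; this is not Honeycomb's actual algorithm): split events into a "bad" set and a baseline set, then rank each dimension value by how over-represented it is among the bad events.

```go
package main

import "fmt"

type Event map[string]string

// overrepresented returns, per "dimension=value", the share among
// bad events minus the share among baseline events. Large positive
// scores point at dimension values correlated with the regression.
func overrepresented(bad, baseline []Event) map[string]float64 {
	share := func(events []Event) map[string]float64 {
		counts := map[string]float64{}
		for _, e := range events {
			for dim, val := range e {
				counts[dim+"="+val]++
			}
		}
		for k := range counts {
			counts[k] /= float64(len(events))
		}
		return counts
	}
	badShare, baseShare := share(bad), share(baseline)
	scores := map[string]float64{}
	for k, b := range badShare {
		scores[k] = b - baseShare[k]
	}
	return scores
}

func main() {
	bad := []Event{{"os": "iOS 17.4"}, {"os": "iOS 17.4"}, {"os": "Android 14"}}
	baseline := []Event{{"os": "iOS 17.4"}, {"os": "Android 14"}, {"os": "Android 14"}, {"os": "Android 13"}}
	fmt.Println(overrepresented(bad, baseline)) // "os=iOS 17.4" scores highest
}
```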


How do you deal with schemas and versioning of wide events?

author

You can always have a schema describing the type of each field (which may be useful for efficient storage). It just needs to be backward compatible version over version (e.g. normally only add fields, with some special process for deprecation); there is a sketch of this rule after the links below.

Some implementations infer schemas from the events themselves; see

- https://vimeo.com/331143124

- https://axiom.co/blog/a-database-hacker-story
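
To make the "backward compatible version over version" rule concrete, here is a minimal sketch (hypothetical types; a real system would also handle deprecation windows):

```go
package main

import "fmt"

// Schema maps a field name to its type, e.g. "country" -> "string".
type Schema map[string]string

// compatible reports whether next can replace prev without breaking
// readers: every existing field must survive with the same type;
// brand-new fields are fine.
func compatible(prev, next Schema) error {
	for field, typ := range prev {
		nt, ok := next[field]
		if !ok {
			return fmt.Errorf("field %q was removed", field)
		}
		if nt != typ {
			return fmt.Errorf("field %q changed type: %s -> %s", field, typ, nt)
		}
	}
	return nil
}

func main() {
	v1 := Schema{"country": "string", "latency_ms": "int64"}
	v2 := Schema{"country": "string", "latency_ms": "int64", "app_version": "string"}
	fmt.Println(compatible(v1, v2)) // <nil>: only an additive change
}
```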


Nicely written!!

Adding more fields to log.Errorf() is a pain. Although Kibana does give nice visualizations on logs.

We log JSON/proto events to BQ, but I'm not sure about the UI.

author

> Adding more fields to log.Errorf() is a pain

It actually depends on the language / framework. E.g. in the JVM world, passing various context data to logs is relatively easy (you add data to some context object or whatever, and it gets attached to the log records - something like that).
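
In Go, the closest equivalent I can think of (a sketch using the standard log/slog package; the field names are made up) is a logger that carries context fields, so individual call sites don't have to thread everything through a format string:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	base := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Attach request-scoped context once; every log line made through
	// this logger inherits the fields, no fmt-style plumbing needed.
	reqLog := base.With("user_id", "u-42", "endpoint", "/checkout")

	reqLog.Error("payment failed", "provider", "stripe", "attempt", 3)
}
```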

> Although Kibana does give nice visualizations on logs.

Right. The UX of Kibana is the opposite of "intuitive" IMO. It's not a tool for exploration. One can find the answer to a well-defined question there, but exploring unknown stuff? I would be surprised if many people can do that naturally.

> We log JSON/proto events to BQ, but I'm not sure about the UI.

Yep, exactly. Logging to somewhere is one thing, but it doesn't give you easy-to-use exploration capabilities by itself. Queries should be easy to write and fast.

I believe more in writing the data to ClickHouse for this purpose, because it's faster than BQ. The UI is still a question; I'm not sure there is something available.
