Version: 1.0

Observability 2.0

Observability 2.0 represents an evolution from the foundational "three pillars" (metrics, logs, and traces) toward a unified data foundation based on high-cardinality, wide-event datasets. Instead of maintaining separate systems for each signal type, this approach emphasizes a single source of truth that enables retroactive analysis rather than pre-aggregation.

Despite its contested naming, the core concept is clear: breaking down the silos between metrics, logs, and traces to provide a more comprehensive view of modern distributed systems.

The Limits of Three Pillars

For years, observability has relied on the three pillars of metrics, logs, and traces. While these pillars spawned countless successful tools (including OpenTelemetry), their limitations become evident as systems grow in complexity:

  1. Data silos: Metrics, logs, and traces are stored separately, leading to uncorrelated datasets. Correlating a spike in error metrics with log patterns requires manual context-switching between systems.

  2. Granularity vs. cost: Traditional metrics sacrifice detail through pre-aggregation, yet retaining full granularity creates millions of time series with redundant metadata across systems, driving costs up instead of down.

  3. Unstructured logs: Logs carry rich information, but most of it sits in free-form text; extracting meaning requires intensive parsing, indexing, and computational effort.

These limitations become even more pronounced in modern scenarios like AI agents and microservices, where high-dimensional, semi-structured data is the norm rather than the exception.

Wide Events: A Unified Data Model

Observability 2.0 addresses these issues by adopting wide events as its foundational data structure. A wide event is a context-rich, high-dimensional, and high-cardinality record that captures complete application state in a single event.

What is a Wide Event?

Instead of precomputing metrics or structuring logs upfront, wide events preserve raw, high-fidelity event data as the single source of truth. For example, a single wide event for a POST request might include:

  • User information and subscription data
  • Database queries with parameters
  • Cache operations
  • HTTP headers
  • Total: 2KB+ of contextual data in one record
{
  "method": "POST",
  "path": "/articles",
  "service": "articles",
  "outcome": "ok",
  "status_code": 201,
  "duration": 268,
  "user": {
    "id": "fdc4ddd4-8b30-4ee9-83aa-abd2e59e9603",
    "subscription": { "plan": "free", "trial": true }
  },
  "db": {
    "query": "INSERT INTO articles (...)",
    "parameters": { "$1": "f8d4d21c-..." }
  },
  "cache": { "operation": "write", "key": "..." },
  "headers": { "user-agent": "...", "cf-connecting-ip": "..." }
}
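To make the flow concrete, here is a minimal sketch (in Python, with hypothetical names such as `handle_request` and `emit`) of the usual pattern for producing wide events: middleware accumulates context throughout a request and emits everything as one record at the end, rather than scattering it across separate log lines and counters.

```python
import json
import time
import uuid

def handle_request(method, path, user, emit):
    """Hypothetical middleware: accumulate context while the request runs,
    then emit it all as ONE wide event."""
    event = {"method": method, "path": path, "service": "articles", "user": user}
    start = time.monotonic()
    # ... application code attaches context as it executes ...
    article_id = str(uuid.uuid4())
    event["db"] = {"query": "INSERT INTO articles (...)", "parameters": {"$1": article_id}}
    event["cache"] = {"operation": "write", "key": f"article:{article_id}"}
    event["status_code"] = 201
    event["outcome"] = "ok"
    event["duration"] = round((time.monotonic() - start) * 1000)  # ms
    emit(json.dumps(event))  # one context-rich record per request

emitted = []
handle_request("POST", "/articles",
               {"id": "u-1", "subscription": {"plan": "free", "trial": True}},
               emitted.append)
```

The key point is that no decision about metrics, log formats, or trace spans is made at write time; the raw event keeps every dimension.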

Metrics, Logs, and Traces as Projections

Wide events fundamentally change how we think about observability data. Metrics, logs, and traces are not separate data types—they are different projections of the same underlying events:

  • Metrics: SELECT COUNT(*) GROUP BY status, date_bin(INTERVAL '1 minute', timestamp) — aggregated projection
  • Logs: SELECT message, timestamp WHERE message @@ 'error' — text projection
  • Traces: SELECT span_id, duration WHERE trace_id = '...' — relational projection

This allows teams to perform exploratory analysis retroactively, deriving any metric, log query, or trace view from the original dataset—without pre-aggregation or code changes.
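The three projections above can be sketched against an in-memory stand-in for a wide-event table (the sample rows and field names are illustrative, not a GreptimeDB API):

```python
from collections import Counter
from datetime import datetime

# Tiny stand-in for a table of wide events (illustrative data only).
events = [
    {"timestamp": "2025-01-01T00:00:12+00:00", "status": 201, "message": "created",
     "trace_id": "t1", "span_id": "s1", "duration": 268},
    {"timestamp": "2025-01-01T00:00:40+00:00", "status": 500, "message": "db error: timeout",
     "trace_id": "t2", "span_id": "s2", "duration": 1875},
    {"timestamp": "2025-01-01T00:01:05+00:00", "status": 200, "message": "ok",
     "trace_id": "t1", "span_id": "s3", "duration": 41},
]

def minute_bucket(ts):
    """Equivalent of date_bin(INTERVAL '1 minute', timestamp)."""
    return datetime.fromisoformat(ts).replace(second=0, microsecond=0)

# Metrics: aggregated projection (COUNT(*) GROUP BY status, minute)
metrics = Counter((e["status"], minute_bucket(e["timestamp"])) for e in events)

# Logs: text projection (WHERE message matches 'error')
logs = [(e["message"], e["timestamp"]) for e in events if "error" in e["message"]]

# Traces: relational projection (WHERE trace_id = 't1')
trace = [(e["span_id"], e["duration"]) for e in events if e["trace_id"] == "t1"]
```

All three views read the same rows; none required deciding at ingest time which signal the data "is".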

AI and the Need for Fine-Grained Observability

AI agents introduce a new level of observability complexity due to their non-deterministic behavior. Unlike traditional applications with predictable code paths, agents make dynamic decisions—choosing tools, reasoning through multi-step plans, and adapting responses based on context. Debugging "why did the agent do X?" requires preserving complete execution state: the full prompt, reasoning chain, tool calls with parameters, memory state, and quality scores—all in a single queryable record.

This is where wide events become essential. Traditional three-pillar approaches fail here: stuffing prompts into logs loses structure and makes analysis impossible, forcing tool calls into traces is too rigid for dynamic behavior, and pre-aggregating token metrics loses the critical context needed for debugging. AI agents produce high-cardinality (millions of unique sessions), high-dimensional (dozens of fields per execution), context-rich events—exactly what wide events are designed to handle. This isn't "observability for the AI age" as a marketing slogan; it's a direct technical consequence: non-deterministic systems require fine-grained, structured, retroactive analysis that only wide events can provide.
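As a sketch of what this buys you, consider one wide event per agent execution (field names here are hypothetical, not a prescribed schema). A question like "which sessions scored poorly and used the search tool?" becomes a simple retroactive filter, because each record kept its full context:

```python
# Hypothetical wide events, one per agent execution (illustrative fields).
agent_events = [
    {"session_id": "sess-1", "model": "example-model",
     "prompt": "Summarize the incident report",
     "reasoning": ["plan: fetch report", "plan: summarize"],
     "tool_calls": [{"name": "search", "arguments": {"q": "incident 42"}, "latency_ms": 120}],
     "tokens": {"input": 812, "output": 143},
     "quality_score": 0.31},
    {"session_id": "sess-2", "model": "example-model",
     "prompt": "Draft a reply",
     "reasoning": ["plan: draft"],
     "tool_calls": [],
     "tokens": {"input": 95, "output": 210},
     "quality_score": 0.92},
]

def low_quality_with_tool(events, tool, threshold=0.5):
    """Retroactive analysis: sessions that scored below threshold AND called
    the given tool. Answerable only because full execution state was kept."""
    return [e["session_id"] for e in events
            if e["quality_score"] < threshold
            and any(t["name"] == tool for t in e["tool_calls"])]
```

Had the prompt been buried in a log line and the tool calls pre-aggregated into counters, this query would be impossible to express.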

Why GreptimeDB is Built for This

GreptimeDB's architecture naturally aligns with the Observability 2.0 paradigm. Its columnar engine efficiently compresses wide events (achieving 50% storage reduction compared to Loki and ~90% compared to Elasticsearch in production), and native object storage (S3, Azure Blob, GCS) keeps costs low as wide event volumes grow. Below are the capabilities that matter most for wide events.

Unified Tag + Timestamp + Field Model

All observability data—metrics, logs, traces—share the same schema model in GreptimeDB:

  • Tags: Entity identifiers (pod_name, service, region, trace_id, session_id)
  • Timestamp: Temporal tracking
  • Fields: Multi-dimensional values (message, duration, status_code, prompts, responses)

This unified model enables cross-signal correlation in a single SQL query.
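Conceptually, mapping a flat wide event onto this model is just a partition of its keys (a sketch; the tag set below is taken from the examples above, and GreptimeDB does this via the table schema, not at query time):

```python
# Tag columns from the examples above (entity identifiers).
TAG_KEYS = {"pod_name", "service", "region", "trace_id", "session_id"}

def partition(event):
    """Split a flat wide event into the unified tag / timestamp / field model."""
    tags = {k: v for k, v in event.items() if k in TAG_KEYS}
    fields = {k: v for k, v in event.items() if k not in TAG_KEYS and k != "timestamp"}
    return tags, event["timestamp"], fields

tags, ts, fields = partition({
    "service": "articles", "trace_id": "t1", "timestamp": 1735689600,
    "message": "created", "duration": 268, "status_code": 201,
})
```

Because every signal lands in the same three-part shape, a metrics row and a log row differ only in which fields they populate, not in where they live.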

SQL + PromQL for Cross-Signal Correlation

Use one SQL query to correlate metrics spikes, log patterns, and trace latency:

SELECT
  date_bin(INTERVAL '1 minute', timestamp) AS minute,
  COUNT(CASE WHEN status >= 500 THEN 1 END) AS errors,
  AVG(duration) AS avg_latency
FROM access_logs
WHERE timestamp >= NOW() - INTERVAL '1 hour'
  AND message @@ 'timeout'
GROUP BY date_bin(INTERVAL '1 minute', timestamp);

No context-switching between systems—all signals in one database. GreptimeDB also supports PromQL for metrics queries, maintaining compatibility with existing dashboards.

Flow Engine for Real-Time Derivation

GreptimeDB's Flow Engine derives metrics from raw events in real-time without preprocessing pipelines:

CREATE FLOW http_status_count
SINK TO status_metrics
AS
SELECT
  status,
  COUNT(*) AS count,
  date_bin('1 minute'::INTERVAL, timestamp) AS time_window
FROM access_logs
GROUP BY status, time_window;

Metrics are computed continuously from raw wide events, enabling both pre-aggregated dashboards and ad-hoc exploratory queries on the same dataset.
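The continuous-aggregation idea behind the Flow above can be sketched as a toy incremental counter (this is an illustration of the technique, not GreptimeDB's implementation):

```python
from collections import defaultdict

class StatusCountFlow:
    """Toy continuous aggregation: maintain COUNT(*) per (status, minute window)
    incrementally as raw events stream in, instead of re-scanning the table."""
    def __init__(self, window_secs=60):
        self.window = window_secs
        self.sink = defaultdict(int)  # (status, window_start) -> count

    def ingest(self, event):
        # date_bin analogue: floor the timestamp to the window boundary.
        bucket = event["timestamp"] - event["timestamp"] % self.window
        self.sink[(event["status"], bucket)] += 1

flow = StatusCountFlow()
for ts, status in [(0, 200), (30, 500), (61, 200), (90, 200)]:
    flow.ingest({"timestamp": ts, "status": status})
```

Each incoming event updates only its own bucket, so the "metrics" sink stays current without ever discarding the raw events it was derived from.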

Wide Events in Production

Wide events are proven in production at scale:

  • Poizon (得物): One of the earliest production deployments of wide events. Flow Engine with multi-level continuous aggregation reduced P99 latency from seconds to milliseconds.

  • OB Cloud: Replaced Loki for billions of logs daily across 170+ availability zones, delivering 10x query performance and a 30% TCO reduction.

  • Trace storage: Replaced Elasticsearch as a Jaeger backend, with a 45x storage cost reduction, 3x faster cold queries, and full-volume tracing at 400B rows/day.

Getting Started

Transitioning to Observability 2.0 doesn't require ripping out your entire stack. Start from any pillar—logs, metrics, or traces—and extend naturally. GreptimeDB supports PromQL, Jaeger, OpenTelemetry, and Grafana out of the box, so existing dashboards and alerts keep working. See Why GreptimeDB for detailed migration paths.

Further Reading