Version: Nightly

Read Replica

Read Replica is a key feature in GreptimeDB's Enterprise Cluster Edition, designed to enhance the overall read-write performance and scalability of the database system.

In the Read Replica mechanism, clients write data to the Leader Region, which then synchronizes the data to Follower Regions. Follower Regions serve as read-only replicas of the Leader Region. Since Leader and Follower Regions are deployed on different Datanode nodes, read and write requests are effectively isolated, preventing resource contention and delivering a smoother experience:

read-replica-overview

NOTE

The Read Replica feature is exclusive to the GreptimeDB Enterprise Cluster Edition.

Principles

GreptimeDB's Enterprise Cluster Edition leverages its architecture to enable near-zero-cost data synchronization between replicas. Additionally, Read Replicas can access newly written data with minimal latency. Below is a brief explanation of the data synchronization and read mechanisms.

Data Synchronization

In GreptimeDB, storage and compute resources are disaggregated. All data is stored in SST files on object storage. Thus, synchronizing data between Leader and Follower Regions does not require copying SST files -- only their metadata needs to be synced. Metadata is significantly smaller than SST files, making synchronization effortless. Once metadata is synced, the Read Replica "possesses" the same SST files and can access the data:

read-replica-data-sync

In practice, SST files metadata is persisted in a special manifest file, also stored in object storage. Each manifest file has a unique version number. Synchronizing metadata between Leader and Follower Regions essentially involves syncing this version number -- a simple integer, ensuring minimal overhead. After receiving the version number, the Follower Region fetches the manifest file from object storage, thereby obtaining the SST files metadata generated by the Leader Region.

The manifest version number is synchronized via heartbeats between Regions and Metasrv. The Leader Region includes the version number in its heartbeat to Metasrv, which then forwards it to Follower Regions in their heartbeat responses:

read-replica-heartbeat

It's easy to see, if there were only SST files synchronization mechanism in place, the delay for Read Replica to access written data would be the sum of the heartbeat intervals between Leader/Follower Regions and Metasrv. For example, with a default 3-second heartbeat interval, Read Replica would only see the data that are written to SST files and flushed to object store 3 to 6 seconds prior. While this suffices for clients with relaxed freshness requirements, additional mechanisms are needed for near-real-time reads.

Data Read

Newly written data are stored in the Leader Region’s memtable. To access the latest data, Follower Region needs to request the memtable data from the Leader Region. By combining this with SST files data (obtained via data sync above), the Follower Region provides clients with a complete dataset, including the most recent writes:

read-replica-data-read

Follower Region fetch memtable data from Leader Region via an internal gRPC interface. While this imposes some read load on the Leader Region, the impact is minimal since the memtable data resides in memory and is finite in size.

Adding Read Replicas

Adding a Read Replica is as simple as executing one SQL command:

ADMIN ADD_TABLE_FOLLOWER(<table_name>, <follower_datanodes>)

This is the function in GreptimeDB for adding Read Replicas. The parameters are:

table_name: The name of the table to add Read Replicas to.
follower_datanodes: A comma-separated list of Datanode IDs where Follower Regions should be placed.

Next is an example to illustrate steps to add Read Replicas to a table.

First start a GreptimeDB Enterprise Cluster with 3 Datanodes. Then create a table:

CREATE TABLE foo (
  ts TIMESTAMP TIME INDEX,
  i INT PRIMARY KEY,
  s STRING,
) PARTITION ON COLUMNS ('i') (
  i <= 0,
  i > 0,
);

Through the information_schema, we can find the Regions' information:

SELECT table_name, region_id, peer_id, is_leader FROM information_schema.region_peers WHERE table_name = 'foo';

+------------+---------------+---------+-----------+
| table_name | region_id     | peer_id | is_leader |
+------------+---------------+---------+-----------+
| foo        | 4402341478400 |       1 | Yes       |
| foo        | 4402341478401 |       2 | Yes       |
+------------+---------------+---------+-----------+

This shows two Leader Regions on Datanodes 1 and 2.

Then to add Read Replicas:

ADMIN ADD_TABLE_FOLLOWER('foo', '0,1,2');

After the Read Replicas are added, find the Regions' information again:

SELECT table_name, region_id, peer_id, is_leader FROM information_schema.region_peers WHERE table_name = 'foo';

+------------+---------------+---------+-----------+
| table_name | region_id     | peer_id | is_leader |
+------------+---------------+---------+-----------+
| foo        | 4402341478400 |       1 | Yes       |
| foo        | 4402341478400 |       0 | No        |
| foo        | 4402341478401 |       2 | Yes       |
| foo        | 4402341478401 |       1 | No        |
+------------+---------------+---------+-----------+

Now, Follower Regions can be found on Datanodes 0 and 1.

Using Read Replicas

Clients can specify whether to read from Read Replicas like this (for JDBC connections):

-- Directly read from the Follower Regions (errors if none exist):
SET READ_PREFERENCE='follower';

-- Prefer Follower Regions, fallback to Leader Regions if unavailable (won't error out if there're no Follower Regions):
SET READ_PREFERENCE='follower_preferred';

-- Directly read from the Leader Regions:
SET READ_PREFERENCE='leader';

We still use the table in the above example to show the reads from Follower Replicas. First insert some data:

INSERT INTO foo (ts, i, s) VALUES (1, -1, 's1'), (2, 0, 's2'), (3, 1, 's3');

Set the read preference to "follower" only:

SET READ_PREFERENCE='follower';

Read the data:

SELECT * FROM foo ORDER BY ts;

+----------------------------+------+------+
| ts                         | i    | s    |
+----------------------------+------+------+
| 1970-01-01 00:00:00.001000 |   -1 | s1   |
| 1970-01-01 00:00:00.002000 |    0 | s2   |
| 1970-01-01 00:00:00.003000 |    1 | s3   |
+----------------------------+------+------+

How to verify the reads are really executed by Follower Replicas? We can use the EXPLAIN ANALYZE:

EXPLAIN ANALYZE SELECT * FROM foo ORDER BY ts;

A non-zero "other_ranges" in the output confirms that Follower Region does handle the reads. If using the VERBOSE mode:

EXPLAIN ANALYZE VERBOSE SELECT * FROM foo ORDER BY ts;

There is more detailed output like this:

extension_ranges: [LeaderMemtableRange{leader: Peer { id: 1, addr: "192.168.50.189:14101" }, num_rows: 2, time_range: (1::Millisecond, 2::Millisecond)

Which can't be found if the reads are executed by Leader Replicas.

Read Replica

Principles​

Data Synchronization​

Data Read​

Adding Read Replicas​

Using Read Replicas​

Principles

Data Synchronization

Data Read

Adding Read Replicas

Using Read Replicas