
Using Custom Pipelines

GreptimeDB automatically parses and transforms logs into structured, multi-column data based on your pipeline configuration. When built-in pipelines cannot handle your specific log format, you can create custom pipelines to define exactly how your log data should be parsed and transformed.

Identify Your Original Log Format

Before creating a custom pipeline, it's essential to understand the format of your original log data. If you're using log collectors and aren't sure about the log format, there are two ways to examine your logs:

  1. Read the collector's official documentation, or configure your collector to output data to the console or a file to inspect the log format.
  2. Use the greptime_identity pipeline: Ingest sample logs directly into GreptimeDB using the built-in greptime_identity pipeline. The greptime_identity pipeline treats the entire text log as a single message field, making it easy to see the raw log content directly (see the example below).
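
For example, a quick way to inspect raw logs is to send one sample through the built-in greptime_identity pipeline. The sketch below assumes GreptimeDB is listening on localhost:4000 and uses a throwaway table named raw_logs (any table name works):

curl -X POST \
  "http://localhost:4000/v1/ingest?db=public&table=raw_logs&pipeline_name=greptime_identity" \
  -H "Content-Type: application/json" \
  -H "Authorization: Basic {{authentication}}" \
  -d '[{"message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0\""}]'

You can then run SELECT * FROM raw_logs to see exactly what was delivered.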

Once you understand the log format you want to process, you can create a custom pipeline. This document uses the following Nginx access log entry as an example:

127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

Create a Custom Pipeline

GreptimeDB provides an HTTP interface for creating pipelines. Here's how to create one.

First, create an example pipeline configuration file to process Nginx access logs, naming it pipeline.yaml:

version: 2
processors:
  - dissect:
      fields:
        - message
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true
  - date:
      fields:
        - timestamp
      formats:
        - "%d/%b/%Y:%H:%M:%S %z"
  - select:
      type: exclude
      fields:
        - message
  - vrl:
      source: |
        .greptime_table_ttl = "7d"
        .

transform:
  - fields:
      - ip_address
    type: string
    index: inverted
    tag: true
  - fields:
      - status_code
    type: int32
    index: inverted
    tag: true
  - fields:
      - request_line
      - user_agent
    type: string
    index: fulltext
  - fields:
      - response_size
    type: int32
  - fields:
      - timestamp
    type: time
    index: timestamp

The pipeline configuration above uses the version 2 format and contains processors and transform sections that work together to structure your log data:

Processors: Used to preprocess log data before transformation:

  • Data Extraction: The dissect processor uses pattern matching to parse the message field and extract structured data including ip_address, timestamp, http_method, request_line, status_code, response_size, and user_agent.
  • Timestamp Processing: The date processor parses the extracted timestamp field using the format %d/%b/%Y:%H:%M:%S %z and converts it to a proper timestamp data type.
  • Field Selection: The select processor excludes the original message field from the final output while retaining all other fields.
  • Table Options: The vrl processor sets table options based on the extracted fields, such as adding a suffix to the table name or setting the TTL (sketched below). Here, the .greptime_table_ttl = "7d" line configures the table data with a time-to-live of 7 days.
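
As a sketch of the table-name option mentioned above, a vrl processor could also derive a per-method table suffix. This assumes the greptime_table_suffix variable described in the Pipeline Configuration documentation; the routing rule itself is purely illustrative:

  - vrl:
      source: |
        # hypothetical: route logs into per-method tables such as custom_pipeline_logs_get
        .greptime_table_suffix = "_" + downcase(string!(.http_method))
        .greptime_table_ttl = "7d"
        .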

Transform: Defines how to convert and index the extracted fields:

  • Field Transformation: Each extracted field is converted to its appropriate data type with specific indexing configurations. Fields like http_method retain their default data types when no explicit configuration is provided.
  • Indexing Strategy:
    • ip_address and status_code use inverted indexing as tags for fast filtering
    • request_line and user_agent use full-text indexing for optimal text search capabilities
    • timestamp serves as the required time index column

For detailed information about pipeline configuration options, please refer to the Pipeline Configuration documentation.

Upload the Pipeline

Execute the following command to upload the pipeline configuration:

curl -X "POST" \
  "http://localhost:4000/v1/pipelines/nginx_pipeline" \
  -H 'Authorization: Basic {{authentication}}' \
  -F "file=@pipeline.yaml"

After successful execution, a pipeline named nginx_pipeline is created and the following result is returned:

{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}.

You can create multiple versions for the same pipeline name. All pipelines are stored in the greptime_private.pipelines table. Refer to Query Pipelines to view pipeline data.
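
For a quick look, you can query that table directly; the exact columns returned depend on your GreptimeDB version:

SELECT * FROM greptime_private.pipelines;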

Ingest Logs Using the Pipeline

The following example writes logs to the custom_pipeline_logs table using the nginx_pipeline pipeline to format and transform the log messages:

curl -X POST \
"http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline" \
-H "Content-Type: application/json" \
-H "Authorization: Basic {{authentication}}" \
-d '[
{
"message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""
},
{
"message": "192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""
},
{
"message": "10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""
},
{
"message": "172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""
}
]'

The command will return the following output upon success:

{"output":[{"affectedrows":4}],"execution_time_ms":79}

The custom_pipeline_logs table is automatically created based on the pipeline configuration, and contains the following data:

+-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+
| ip_address | http_method | status_code | request_line | user_agent | response_size | timestamp |
+-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+
| 10.0.0.1 | GET | 304 | /images/logo.png HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0 | 0 | 2024-05-25 20:18:37 |
| 127.0.0.1 | GET | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 |
| 172.16.0.1 | GET | 404 | /contact HTTP/1.1 | Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1 | 162 | 2024-05-25 20:19:37 |
| 192.168.1.1 | POST | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 |
+-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+

For more detailed information about the log ingestion API endpoint /ingest, including additional parameters and configuration options, please refer to the APIs for Writing Logs documentation.

Query Logs

We use the custom_pipeline_logs table as an example to query logs.

Query logs by tags

With the multiple tag columns in custom_pipeline_logs, you can flexibly query data by tags. For example, to query logs with status_code 200 and http_method GET:

SELECT * FROM custom_pipeline_logs WHERE status_code = 200 AND http_method = 'GET';
+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method |
+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET |
+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
1 row in set (0.02 sec)
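
Because timestamp is the time index column, range filters on time combine naturally with tag filters. For example, to count successful GET requests in a one-minute window (illustrative values):

SELECT count(*) FROM custom_pipeline_logs
WHERE status_code = 200
  AND http_method = 'GET'
  AND timestamp >= '2024-05-25 20:16:00'
  AND timestamp < '2024-05-25 20:17:00';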

For the text fields request_line and user_agent, you can use the matches_term function to search logs. Remember, we created full-text indexes for these two columns when creating the pipeline, which allows for high-performance full-text search.

For example, to query logs whose request_line contains /index.html or /api/login:

SELECT * FROM custom_pipeline_logs WHERE matches_term(request_line, '/index.html') OR matches_term(request_line, '/api/login');
+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method |
+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET |
| 192.168.1.1 | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | POST |
+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
2 rows in set (0.00 sec)

You can refer to the Full-Text Search document for detailed usage of the matches_term function.
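
Full-text conditions can also be combined with tag filters in the same query. For example, an illustrative query for successful requests from Chrome-based browsers:

SELECT * FROM custom_pipeline_logs
WHERE status_code = 200
  AND matches_term(user_agent, 'Chrome');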

Benefits of Using Pipelines

Using pipelines to process logs provides structured data and automatic field extraction, enabling more efficient querying and analysis.

You can also write logs directly to the database without pipelines, but this approach limits high-performance analysis capabilities.

Direct Log Insertion (Without Pipeline)

For comparison, you can create a table to store original log messages:

CREATE TABLE `origin_logs` (
  `message` STRING FULLTEXT INDEX,
  `time` TIMESTAMP TIME INDEX
) WITH (
  append_mode = 'true'
);

Use the INSERT statement to insert logs into the table. Note that you need to manually supply a timestamp value for each log entry:

INSERT INTO origin_logs (message, time) VALUES
('127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"', '2024-05-25 20:16:37.217'),
('192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"', '2024-05-25 20:17:37.217'),
('10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"', '2024-05-25 20:18:37.217'),
('172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"', '2024-05-25 20:19:37.217');
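
With only a full-text index on the message column, every condition against origin_logs has to be expressed as a text search over the whole log line. For example, finding the login requests again looks like this:

SELECT * FROM origin_logs WHERE matches_term(message, '/api/login');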

Schema Comparison: Pipeline vs Raw

In the above examples, the table custom_pipeline_logs was created automatically by writing logs through the pipeline, while the table origin_logs was created manually before inserting logs directly. Let's explore the differences between these two tables.

DESC custom_pipeline_logs;
+---------------+---------------------+------+------+---------+---------------+
| Column | Type | Key | Null | Default | Semantic Type |
+---------------+---------------------+------+------+---------+---------------+
| ip_address | String | PRI | YES | | TAG |
| status_code | Int32 | PRI | YES | | TAG |
| request_line | String | | YES | | FIELD |
| user_agent | String | | YES | | FIELD |
| response_size | Int32 | | YES | | FIELD |
| timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP |
| http_method | String | | YES | | FIELD |
+---------------+---------------------+------+------+---------+---------------+
7 rows in set (0.00 sec)
DESC origin_logs;
+---------+----------------------+------+------+---------+---------------+
| Column | Type | Key | Null | Default | Semantic Type |
+---------+----------------------+------+------+---------+---------------+
| message | String | | YES | | FIELD |
| time | TimestampMillisecond | PRI | NO | | TIMESTAMP |
+---------+----------------------+------+------+---------+---------------+

Comparing the table structures shows the key differences:

The custom_pipeline_logs table (created with pipeline) automatically structures log data into multiple columns:

  • ip_address, status_code as indexed tags for fast filtering
  • request_line, user_agent with full-text indexing for text search
  • response_size, http_method as regular fields
  • timestamp as the time index

The origin_logs table (direct insertion) stores everything in a single message column.

Why Use Pipelines?

It is recommended to use the pipeline method to split the log message into multiple columns, which lets you query specific values in specific columns explicitly. Column-based matching is superior to full-text search for several key reasons (see the comparison queries after this list):

  • Performance: Column-based queries are typically faster than full-text searches
  • Storage Efficiency: GreptimeDB's columnar storage compresses structured data better; inverted indexes for tags consume less storage than full-text indexes
  • Query Simplicity: Tag-based queries are easier to write, understand, and debug
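
As a concrete illustration, the same question, GET requests that returned 404, reads quite differently against the two schemas:

-- pipeline table: plain column predicates served by inverted indexes
SELECT * FROM custom_pipeline_logs WHERE status_code = 404 AND http_method = 'GET';

-- raw table: term matches over the whole message, which is slower and less precise
SELECT * FROM origin_logs WHERE matches_term(message, '404') AND matches_term(message, 'GET');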

Next Steps

  • Full-Text Search: Explore the Full-Text Search guide to learn advanced text search capabilities and query techniques in GreptimeDB
  • Pipeline Configuration: Explore the Pipeline Configuration documentation to learn more about creating and customizing pipelines for various log formats and processing needs