Using Custom Pipelines
GreptimeDB automatically parses and transforms logs into structured, multi-column data based on your pipeline configuration. When built-in pipelines cannot handle your specific log format, you can create custom pipelines to define exactly how your log data should be parsed and transformed.
Identify Your Original Log Format
Before creating a custom pipeline, it's essential to understand the format of your original log data. If you're using log collectors and aren't sure about the log format, there are two ways to examine your logs:
- Check the collector's official documentation, or configure your collector to output data to the console or a file so you can inspect the log format.
- Use the greptime_identity pipeline: Ingest sample logs directly into GreptimeDB using the built-in greptime_identity pipeline. The greptime_identity pipeline treats the entire text log as a single message field, which makes it convenient to see the raw log content directly, as shown in the example below.
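For example, here is a minimal sketch of ingesting one sample line through greptime_identity. The table name raw_logs is an arbitrary choice for illustration; the endpoint parameters mirror the ingestion command used later in this document:

curl -X POST \
  "http://localhost:4000/v1/ingest?db=public&table=raw_logs&pipeline_name=greptime_identity" \
  -H "Content-Type: application/json" \
  -H "Authorization: Basic {{authentication}}" \
  -d '[
  {
    "message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""
  }
]'

Querying raw_logs afterwards shows the untouched log line in the message column.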
Once you understand the log format you want to process, you can create a custom pipeline. This document uses the following Nginx access log entry as an example:
127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
Create a Custom Pipeline
GreptimeDB provides an HTTP interface for creating pipelines. Here's how to create one.
First, create an example pipeline configuration file to process Nginx access logs, naming it pipeline.yaml:
version: 2
processors:
  - dissect:
      fields:
        - message
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
      ignore_missing: true
  - date:
      fields:
        - timestamp
      formats:
        - "%d/%b/%Y:%H:%M:%S %z"
  - select:
      type: exclude
      fields:
        - message
  - vrl:
      source: |
        .greptime_table_ttl = "7d"
        .

transform:
  - fields:
      - ip_address
    type: string
    index: inverted
    tag: true
  - fields:
      - status_code
    type: int32
    index: inverted
    tag: true
  - fields:
      - request_line
      - user_agent
    type: string
    index: fulltext
  - fields:
      - response_size
    type: int32
  - fields:
      - timestamp
    type: time
    index: timestamp
The pipeline configuration above uses the version 2 format and contains processors and transform sections that work together to structure your log data.

Processors: Preprocess the log data before transformation:
- Data Extraction: The dissect processor uses pattern matching to parse the message field and extract structured data, including ip_address, timestamp, http_method, request_line, status_code, response_size, and user_agent (the values extracted from the example log are shown after this list).
- Timestamp Processing: The date processor parses the extracted timestamp field using the format %d/%b/%Y:%H:%M:%S %z and converts it to a proper timestamp data type.
- Field Selection: The select processor excludes the original message field from the final output while retaining all other fields.
- Table Options: The vrl processor sets table options based on the processed fields, such as adding a suffix to the table name or setting the TTL. The .greptime_table_ttl = "7d" line configures the table data to have a time-to-live of 7 days.
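For reference, applying the dissect pattern to the example log line produces the following field values (timestamp is shown before the date processor converts it to a timestamp type):

ip_address:    127.0.0.1
timestamp:     25/May/2024:20:16:37 +0000
http_method:   GET
request_line:  /index.html HTTP/1.1
status_code:   200
response_size: 612
user_agent:    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36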
Transform: Defines how to convert and index the extracted fields:
- Field Transformation: Each extracted field is converted to its appropriate data type with specific indexing configurations. Fields like http_method retain their default data types when no explicit configuration is provided.
- Indexing Strategy:
  - ip_address and status_code use inverted indexes as tags for fast filtering
  - request_line and user_agent use full-text indexes for optimal text search capabilities
  - timestamp serves as the required time index column
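You can optionally test a pipeline against sample data before writing anything. Recent GreptimeDB releases provide a dryrun HTTP endpoint for this; the request below is a sketch that assumes the /v1/pipelines/dryrun path and a body with pipeline (the YAML as a string) and data fields, so verify the exact interface against your version's API reference:

curl -X POST "http://localhost:4000/v1/pipelines/dryrun" \
  -H 'Authorization: Basic {{authentication}}' \
  -H "Content-Type: application/json" \
  -d '{
  "pipeline": "<contents of pipeline.yaml as a string>",
  "data": [
    {"message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""}
  ]
}'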
For detailed information about pipeline configuration options, please refer to the Pipeline Configuration documentation.
Upload the Pipeline
Execute the following command to upload the pipeline configuration:
curl -X "POST" \
"http://localhost:4000/v1/pipelines/nginx_pipeline" \
-H 'Authorization: Basic {{authentication}}' \
-F "[email protected]"
After successful execution, a pipeline named nginx_pipeline is created, and the command returns the following result:
{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}
You can create multiple versions for the same pipeline name. All pipelines are stored in the greptime_private.pipelines table. Refer to Query Pipelines to view pipeline data.
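For a quick look at the stored pipelines, you can also query this table directly (the exact output columns may vary by version):

SELECT * FROM greptime_private.pipelines;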
Ingest Logs Using the Pipeline
The following example writes logs to the custom_pipeline_logs table, using the nginx_pipeline pipeline to format and transform the log messages:
curl -X POST \
"http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline" \
-H "Content-Type: application/json" \
-H "Authorization: Basic {{authentication}}" \
-d '[
{
"message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""
},
{
"message": "192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""
},
{
"message": "10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""
},
{
"message": "172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""
}
]'
The command will return the following output upon success:
{"output":[{"affectedrows":4}],"execution_time_ms":79}
The custom_pipeline_logs table is created automatically based on the pipeline configuration.
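You can inspect the structured content with a simple query:

SELECT * FROM custom_pipeline_logs;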
+-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+
| ip_address | http_method | status_code | request_line | user_agent | response_size | timestamp |
+-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+
| 10.0.0.1 | GET | 304 | /images/logo.png HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0 | 0 | 2024-05-25 20:18:37 |
| 127.0.0.1 | GET | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 |
| 172.16.0.1 | GET | 404 | /contact HTTP/1.1 | Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1 | 162 | 2024-05-25 20:19:37 |
| 192.168.1.1 | POST | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 |
+-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+
For more detailed information about the log ingestion API endpoint /ingest, including additional parameters and configuration options, please refer to the APIs for Writing Logs documentation.
Query Logs
We use the custom_pipeline_logs table as an example to query logs.
Query Logs by Tags

With the multiple tag columns in custom_pipeline_logs, you can query data by tags flexibly. For example, to query the logs with status_code 200 and http_method GET:
SELECT * FROM custom_pipeline_logs WHERE status_code = 200 AND http_method = 'GET';
+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method |
+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET |
+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
1 row in set (0.02 sec)
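Because status_code is stored as a typed int32 column rather than as text, range predicates work as well. For example, to find all error responses:

SELECT ip_address, status_code, request_line FROM custom_pipeline_logs WHERE status_code >= 400;

This returns the single 404 request in the sample data.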
Full-Text Search
For the text fields request_line and user_agent, you can use the matches_term function to search logs. Remember, we created full-text indexes for these two columns when creating the pipeline; this enables high-performance full-text search. For example, to query the logs with request_line containing /index.html or /api/login:
SELECT * FROM custom_pipeline_logs WHERE matches_term(request_line, '/index.html') OR matches_term(request_line, '/api/login');
+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method |
+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET |
| 192.168.1.1 | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | POST |
+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
2 rows in set (0.00 sec)
You can refer to the Full-Text Search document for detailed usage of the matches_term function.
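You can apply matches_term to the other full-text indexed column as well. For example, to find requests from Firefox clients via the user_agent column:

SELECT ip_address, user_agent FROM custom_pipeline_logs WHERE matches_term(user_agent, 'Firefox');

Only the 10.0.0.1 entry in the sample data matches.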
Benefits of Using Pipelines
Using pipelines to process logs provides structured data and automatic field extraction, enabling more efficient querying and analysis.
You can also write logs directly to the database without pipelines, but this approach limits high-performance analysis capabilities.
Direct Log Insertion (Without Pipeline)
For comparison, you can create a table to store original log messages:
CREATE TABLE `origin_logs` (
  `message` STRING FULLTEXT INDEX,
  `time` TIMESTAMP TIME INDEX
) WITH (
  append_mode = 'true'
);
Use the INSERT statement to insert logs into the table. Note that you need to manually supply a timestamp for each log:
INSERT INTO origin_logs (message, time) VALUES
('127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"', '2024-05-25 20:16:37.217'),
('192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"', '2024-05-25 20:17:37.217'),
('10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"', '2024-05-25 20:18:37.217'),
('172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"', '2024-05-25 20:19:37.217');
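Without a pipeline, answering even simple questions requires full-text matching against the entire message column. For example, finding failed requests means searching for the status code as a term in the raw text:

SELECT * FROM origin_logs WHERE matches_term(message, '404');

Because there is no typed status_code column here, numeric predicates such as status_code >= 400 are not possible without parsing the message at query time.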
Schema Comparison: Pipeline vs Raw
In the examples above, the custom_pipeline_logs table was created automatically by writing logs through the pipeline, while the origin_logs table was created by writing logs directly. Let's explore the differences between these two tables.
DESC custom_pipeline_logs;
+---------------+---------------------+------+------+---------+---------------+
| Column | Type | Key | Null | Default | Semantic Type |
+---------------+---------------------+------+------+---------+---------------+
| ip_address | String | PRI | YES | | TAG |
| status_code | Int32 | PRI | YES | | TAG |
| request_line | String | | YES | | FIELD |
| user_agent | String | | YES | | FIELD |
| response_size | Int32 | | YES | | FIELD |
| timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP |
| http_method | String | | YES | | FIELD |
+---------------+---------------------+------+------+---------+---------------+
7 rows in set (0.00 sec)
DESC origin_logs;
+---------+----------------------+------+------+---------+---------------+
| Column | Type | Key | Null | Default | Semantic Type |
+---------+----------------------+------+------+---------+---------------+
| message | String | | YES | | FIELD |
| time | TimestampMillisecond | PRI | NO | | TIMESTAMP |
+---------+----------------------+------+------+---------+---------------+
Comparing the table structures shows the key differences.

The custom_pipeline_logs table (created with a pipeline) automatically structures log data into multiple columns:
- ip_address and status_code as indexed tags for fast filtering
- request_line and user_agent with full-text indexing for text search
- response_size and http_method as regular fields
- timestamp as the time index

The origin_logs table (direct insertion) stores everything in a single message column.
Why Use Pipelines?
It is recommended to use the pipeline method to split log messages into multiple columns, which lets you explicitly query specific values within specific columns. Column-based matching is superior to full-text search for several key reasons, as the example after this list illustrates:
- Performance: Column-based queries are typically faster than full-text searches
- Storage Efficiency: GreptimeDB's columnar storage compresses structured data better; inverted indexes for tags consume less storage than full-text indexes
- Query Simplicity: Tag-based queries are easier to write, understand, and debug
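As a concrete illustration, an aggregation over a typed column is trivial on the pipeline-created table, while the same question against origin_logs would require extracting the status code from message at query time:

SELECT status_code, count(*) AS requests FROM custom_pipeline_logs GROUP BY status_code;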
Next Steps
- Full-Text Search: Explore the Full-Text Search guide to learn advanced text search capabilities and query techniques in GreptimeDB
- Pipeline Configuration: Explore the Pipeline Configuration documentation to learn more about creating and customizing pipelines for various log formats and processing needs