MariaDB Binlog Indexer
This tool indexes MariaDB binlogs and stores them in a compact format for faster querying.
File Format
- DuckDB - Stores all binlog metadata.
- Parquet - Stores the actual queries.
DB Schema
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ binlog │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ db_name │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ table_name │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ timestamp │ INTEGER │ YES │ NULL │ NULL │ NULL │
│ type │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ row_id │ INTEGER │ YES │ NULL │ NULL │ NULL │
│ event_size │ INTEGER │ YES │ NULL │ NULL │ NULL │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
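The metadata database can also be inspected directly with DuckDB. A minimal sketch, assuming the metadata is stored in a table named queries inside metadata.db (the actual table name may differ; check the docstrings):

import duckdb

# Open the metadata database created by the indexer (read-only).
con = duckdb.connect("<path to store indexer related files>/metadata.db", read_only=True)

# Summarize events per table for one binlog; the table name "queries" is an assumption.
rows = con.execute(
    """
    SELECT db_name, table_name, type, COUNT(*) AS events, SUM(event_size) AS total_bytes
    FROM queries
    WHERE binlog = ?
    GROUP BY db_name, table_name, type
    ORDER BY events DESC
    """,
    ["binlog_1"],
).fetchall()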
Usage
Before getting started, please note:
- All timestamps are in seconds, UTC.
- The available query types are SELECT, INSERT, UPDATE, DELETE, and OTHER.
Also, check the Python docstrings for more details.
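For example, a seconds-UTC timestamp can be produced with the standard library:

from datetime import datetime, timezone

# Convert a human-readable UTC time into the seconds-UTC form the indexer expects.
start_timestamp = int(datetime(2025, 5, 6, tzinfo=timezone.utc).timestamp())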
Initialize Indexer
from mariadb_binlog_indexer import Indexer
indexer = Indexer(
    base_path="<path to store indexer related files>",
    db_name="metadata.db",  # The name of the database to store metadata
)
Index New Binlog
indexer.add(
    binlog_path="<path to binlog>",
    batch_size=10000,  # The batch size for inserting binlog events into DuckDB
)
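To index several binlogs at once, the call can simply be looped. A minimal sketch, assuming the files follow the usual mysql-bin.* naming (adjust the glob pattern to your setup):

from pathlib import Path

# Index every binlog file in a directory; the naming pattern is an assumption.
for binlog in sorted(Path("<path to binlog directory>").glob("mysql-bin.*")):
    indexer.add(binlog_path=str(binlog), batch_size=10000)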
Remove a Binlog's Indexes
indexer.remove(
    binlog_path="<path to binlog>",
)
Generate Timeline
This function provides a summary of binlog events in a given time range. It splits the range into 30 parts and returns event information for each part.
indexer.get_timeline(
    start_timestamp=1746534427,  # The start timestamp in seconds UTC
    end_timestamp=1756534427,  # The end timestamp in seconds UTC
    type="INSERT",  # Optional
    database="test_db",  # Optional
)
Get Row Ids
This function returns the matching row ids for each binlog. Its purpose is to find the required row ids up front so that pagination can be implemented on the caller's side; this narrows the subsequent search in the Parquet files (see the combined sketch at the end of this section).
indexer.get_row_ids(
    start_timestamp=1746534427,  # The start timestamp in seconds UTC
    end_timestamp=1756534427,  # The end timestamp in seconds UTC
    type="INSERT",  # Optional
    database="test_db",  # Optional
    table="test_table",  # Optional
    search_str="test",  # Optional
)
Get Queries from Parquet Files
This function fetches the actual queries from the Parquet files.
indexer.get_queries(
    row_ids={
        "binlog_1": [101, 102, 103],
        "binlog_2": [104, 105, 106],
    },
    database="test_db",  # Optional
)
The database name is used for filtering purposes only.
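Putting the last two calls together, a paginated search could look like the sketch below. It assumes get_row_ids returns a {binlog: [row_id, ...]} mapping, i.e. the same shape get_queries accepts; verify this against the docstrings.

# Find matching row ids first, then fetch only those rows from the Parquet files.
row_ids = indexer.get_row_ids(
    start_timestamp=1746534427,
    end_timestamp=1756534427,
    type="INSERT",
    database="test_db",
    search_str="test",
)

# Take the first ten ids per binlog as one page before touching the Parquet files.
page = {binlog: ids[:10] for binlog, ids in row_ids.items()}
queries = indexer.get_queries(row_ids=page, database="test_db")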