MariaDB Binlog Indexer
This tool indexes MariaDB binlogs and stores them in a compact format for faster querying.
File Format
- DuckDB - Stores all binlog metadata.
- Parquet - Stores the actual queries.
DB Schema
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ binlog │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ db_name │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ table_name │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ timestamp │ INTEGER │ YES │ NULL │ NULL │ NULL │
│ type │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ row_id │ INTEGER │ YES │ NULL │ NULL │ NULL │
│ event_size │ INTEGER │ YES │ NULL │ NULL │ NULL │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
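The metadata database can also be inspected directly with DuckDB. A minimal sketch, assuming the metadata is stored in a table named queries inside metadata.db (the actual table name may differ; check the docstrings):

import duckdb

# Open the metadata database created by the indexer (read-only).
con = duckdb.connect("<path to store indexer related files>/metadata.db", read_only=True)

# Summarize events per table for one binlog; the table name "queries" is an assumption.
rows = con.execute(
    """
    SELECT db_name, table_name, type, COUNT(*) AS events, SUM(event_size) AS total_bytes
    FROM queries
    WHERE binlog = ?
    GROUP BY db_name, table_name, type
    ORDER BY events DESC
    """,
    ["binlog_1"],
).fetchall()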
Usage
Before getting started, please note:
- All timestamps are in seconds, UTC.
- The available query types are SELECT, INSERT, UPDATE, DELETE, and OTHER.
Also, check the Python docstrings for more details.
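For example, a seconds-UTC timestamp can be produced with the standard library:

from datetime import datetime, timezone

# Convert a human-readable UTC time into the seconds-UTC form the indexer expects.
start_timestamp = int(datetime(2025, 5, 6, tzinfo=timezone.utc).timestamp())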
Initialize Indexer
from mariadb_binlog_indexer import Indexer
indexer = Indexer(
    base_path="<path to store indexer related files>",
    db_name="metadata.db",  # The name of the database to store metadata
)
Index New Binlog
indexer.add(
    binlog_path="<path to binlog>",
    batch_size=10000,  # The batch size for inserting binlog events into DuckDB
)
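To index several binlogs at once, the call can simply be looped. A minimal sketch, assuming the files follow the usual mysql-bin.* naming (adjust the glob pattern to your setup):

from pathlib import Path

# Index every binlog file in a directory; the naming pattern is an assumption.
for binlog in sorted(Path("<path to binlog directory>").glob("mysql-bin.*")):
    indexer.add(binlog_path=str(binlog), batch_size=10000)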
Remove a Binlog's Indexes
indexer.remove(
    binlog_path="<path to binlog>",
)
Generate Timeline
This function provides a summary of binlog events in a given time range. It splits the range into 30 parts and returns event information for each part.
indexer.get_timeline(
    start_timestamp=1746534427,  # The start timestamp in seconds UTC
    end_timestamp=1756534427,  # The end timestamp in seconds UTC
    type="INSERT",  # Optional
    database="test_db",  # Optional
)
Get Row Ids
This function returns the matching row ids for each binlog. Its purpose is to find the required row ids up front so that pagination can be implemented on the caller's side; this narrows the subsequent search in the Parquet files (see the combined sketch at the end of this section).
indexer.get_row_ids(
    start_timestamp=1746534427,  # The start timestamp in seconds UTC
    end_timestamp=1756534427,  # The end timestamp in seconds UTC
    type="INSERT",  # Optional
    database="test_db",  # Optional
    table="test_table",  # Optional
    search_str="test",  # Optional
)
Get Queries from Parquet Files
This function fetches the actual queries from the Parquet files.
indexer.get_queries(
    row_ids={
        "binlog_1": [101, 102, 103],
        "binlog_2": [104, 105, 106],
    },
    database="test_db",  # Optional
)
The database name is used for filtering purposes only.
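Putting the last two calls together, a paginated search could look like the sketch below. It assumes get_row_ids returns a {binlog: [row_id, ...]} mapping, i.e. the same shape get_queries accepts; verify this against the docstrings.

# Find matching row ids first, then fetch only those rows from the Parquet files.
row_ids = indexer.get_row_ids(
    start_timestamp=1746534427,
    end_timestamp=1756534427,
    type="INSERT",
    database="test_db",
    search_str="test",
)

# Take the first ten ids per binlog as one page before touching the Parquet files.
page = {binlog: ids[:10] for binlog, ids in row_ids.items()}
queries = indexer.get_queries(row_ids=page, database="test_db")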