DuckDB x TiDB -- 使用 DuckDB 唤醒 CSV 文件
- 2025-02-14 北京
本文字数:18585 字
阅读完需:约 61 分钟
作者: pepezzzz 原文来源:https://tidb.net/blog/5c2153ac
一、背景
在企业中经常以业务保护等级合规的目的设计数据库逻辑备份机制,逻辑备份生成平面文件放置在 S3 兼容的存储上。业务有需求要查询历史时段的数据,需要将备份文件重新导入数据库进行查询,由于查询的时间段很可能相对比较模糊,会出现多次的重复操作。
Dumpling 是 TiDB 的逻辑备份工具,兼容 MySQL,具有产品轻量化、使用便捷的特点,能将生产数据备份成 CSV 平面文件,具备支持自定义分隔符等特性,并兼容 S3 存储。DuckDB 是轻量的嵌入式 AP 场景数据库,能支持以外部表方式读取 S3 上面的 CSV 文件。
本次尝试通过 Dumpling 和 DuckDB 的结合,能够将 Dumpling 的备份文件交付给 DuckDB,实现快速打开、快速查询的目的。
二、环境准备
Duckdb 环境
https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip
# unzip duckdb_cli-linux-amd64.zip Archive: duckdb_cli-linux-amd64.zip inflating: duckdb # vi init.sql SET extension_directory = '/data/duckdb_ext';.prompt 'duckdb> ' .timer on # ./duckdb bak.duckdb --init init.sqlduckdb> SELECT current_database();┌────────────────────┐│ current_database() ││ varchar │├────────────────────┤│ bak │└────────────────────┘Run Time (s): real 0.001 user 0.000429 sys 0.000501duckdb> .exit
Minio 单机实验环境
注:
wget https://dl.min.io/server/minio/release/linux-amd64/minio
wget https://dl.min.io/client/mc/release/linux-amd64/mc
# MINIO_ROOT_USER=admin MINIO_ROOT_PASSWORD=password ./minio server /data/miniodata/ --console-address ":9001" &[1] 9664[root@vm172-16-201-85 data]# MinIO Object Storage ServerCopyright: 2015-2025 MinIO, Inc.License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.htmlVersion: RELEASE.2024-12-18T13-15-44Z (go1.23.4 linux/amd64)
API: http://172.16.201.85:9000 http://172.17.0.1:9000 http://127.0.0.1:9000 RootUser: admin RootPass: password
WebUI: http://172.16.201.85:9001 http://172.17.0.1:9001 http://127.0.0.1:9001 RootUser: admin RootPass: password
CLI: https://min.io/docs/minio/linux/reference/minio-mc.html#quickstart $ mc alias set 'myminio' 'http://172.16.201.85:9000' 'admin' 'password'
Docs: https://docs.min.ioWARN: Detected Linux kernel version older than 4.0 release, there are some known potential performance problems with this kernel version. MinIO recommends a minimum of 4.x linux kernel version for best performance
# s3cmd --no-ssl --access_key=admin --secret_key=password --host=172.16.201.85:9000 --host-bucket= mb s3://tidbbak
数据备份
mysql> select count(1) from users10m;+----------+| count(1) |+----------+| 10000000 |+----------+1 row in set (6 min 36.33 sec)
#export AWS_ACCESS_KEY_ID=admin#export AWS_SECRET_ACCESS_KEY=password#CURDATE=$(date +%Y%m%d%H%M%S)#/root/.tiup/components/dumpling/v8.5.0/dumpling -B bakdb -u "${BAKUSER}" -P ${TIDBPORT} -h ${TIDBIP} -p "${BAKPW}" --filetype csv -t 8 -o "s3://${Bucket}/dumpling/${CURDATE}" --s3.endpoint="http://${Endpoint}" -r 200000 -F 256MiB
# s3cmd --no-ssl --access_key=admin --secret_key=password --host=172.16.201.85:9000 --host-bucket= get s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csvdownload: 's3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv' -> './bakdb.users10m.0000000010000.csv' [1 of 1] 51980222 of 51980222 100% in 0s 113.40 MB/s done# more bakdb.users10m.0000000010000.csv"userid","name","phone","sex","birth","card_type","card_id","reg_date","reg_time","modify_date","unit_id","vip_level","corp_id","job_number","col00","col01","col02","col03","col04","col05","col06","col07","col08","col09","col10","col11","col12","col13","col14","col15","col16","col17","col18","col19","balance1","balance2","balance3","modify_datetime""004804a7-e96e-4cf8-8622-891e0ae9d87d","LKQKM","10240894028","女","1979-10-29","身份证","588802761557381279","2001-05-12","02:41:58","2016-07-09","hDY","4","22100","130353","EDRROXGRCYJ","TENOFL","MDBPTICNPYQKYDTRHSGC","YPUJVK","JSKWBZYHXR","IXBPDPDPEARSP","QDSLHMNZ","DJBRX","DYCCQDHJIMA","AFISVLDMXLQPZHVZZDQA","FYHYBEYQGYYDVIHG","LZ","NITVMZGXRTYKQFVPFTHB","NCFESYSN","WFSVITJTYPYBEOXYLXU","YZYYG","PZYQPLZ","GJIRATCSLPTFHCSVDH","STOBWVHFSQFIDP","KPRVIGNNFIDCIBL",2071.61,9169.1201,10000.1234,"2025-01-28 10:42:56""004804c4-3494-47de-a7ca-f5ab7ba8e135","OKTICUOYOJNNCULUCGKS","15002789114","男","1988-02-13","身份证","970410241703370828","2000-11-11","09:46:21","2013-01-15","dgjhS","2","6900","107337","SINTVGVWCUQROU","U","LRHRMWTGCKN","T","SADSQUTZD","OBJMWNAHSUERZJPMRG","TAPMJTRLQJIAYXSCAVUQ","VRJCSHSGUAOWAVBDST","OKOUTEUMWVHJYBTP","EYMJMMBWWJW","PALWUHU","NWKMJADTBYPJZ","OVZSQOMBUBXUT","XDFMQHJFIWXN","JFIA","FDD","XVQUOOODC","P","OVRLI","QKH",9722.42,5618.2924,10000.1234,"2025-01-28 10:42:56"
三、访问 S3 上 csv 备份
配置 S3 的访问密钥
安装插件和配置 s3 secret。
duckdb> INSTALL 'httpfs';duckdb> SELECT * FROM duckdb_extensions();┌──────────────────┬─────────┬───────────┬──────────────────────┬───────────────────────────────────────────────────────────────────────────────┬───────────────────┬───────────────────┬───────────────────┬────────────────┐│ extension_name │ loaded │ installed │ install_path │ description │ aliases │ extension_version │ install_mode │ installed_from ││ varchar │ boolean │ boolean │ varchar │ varchar │ varchar[] │ varchar │ varchar │ varchar │├──────────────────┼─────────┼───────────┼──────────────────────┼───────────────────────────────────────────────────────────────────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────────┤│ arrow │ false │ false │ │ A zero-copy data integration between Apache Arrow and DuckDB │ [] │ │ │ ││ autocomplete │ true │ true │ (BUILT-IN) │ Adds support for autocomplete in the shell │ [] │ │ STATICALLY_LINKED │ ││ aws │ false │ false │ │ Provides features that depend on the AWS SDK │ [] │ │ │ ││ azure │ false │ false │ │ Adds a filesystem abstraction for Azure blob storage to DuckDB │ [] │ │ │ ││ delta │ false │ false │ │ Adds support for Delta Lake │ [] │ │ │ ││ excel │ false │ false │ │ Adds support for Excel-like format strings │ [] │ │ │ ││ fts │ true │ true │ (BUILT-IN) │ Adds support for Full-Text Search Indexes │ [] │ v1.1.3 │ STATICALLY_LINKED │ ││ httpfs │ true │ true │ /data/duckdb_ext/v… │ Adds support for reading and writing files over a HTTP(S) connection │ [http, https, s3] │ v1.1.3 │ REPOSITORY │ core │...│ vss │ false │ false │ │ Adds indexing support to accelerate Vector Similarity Search │ [] │ │ │ │├──────────────────┴─────────┴───────────┴──────────────────────┴───────────────────────────────────────────────────────────────────────────────┴───────────────────┴───────────────────┴───────────────────┴────────────────┤│ 24 rows 9 columns │└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘Run Time (s): real 0.005 user 0.000519 sys 0.001037
duckdb> CREATE PERSISTENT SECRET miniosecret ( TYPE S3, KEY_ID 'admin', SECRET 'password', ENDPOINT '172.16.201.85:9000', URL_STYLE 'path', USE_SSL false );┌─────────┐│ Success ││ boolean │├─────────┤│ true │└─────────┘
duckdb> SELECT * FROM duckdb_secrets();┌─────────────┬─────────┬──────────┬────────────┬────────────┬──────────────────────┬──────────────────────────────────────────────────────────┐│ name │ type │ provider │ persistent │ storage │ scope │ secret_string ││ varchar │ varchar │ varchar │ boolean │ varchar │ varchar[] │ varchar │├─────────────┼─────────┼──────────┼────────────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────┤│ miniosecret │ s3 │ config │ true │ local_file │ [s3://, s3n://, s3… │ name=miniosecret;type=s3;provider=config;serializable=… │└─────────────┴─────────┴──────────┴────────────┴────────────┴──────────────────────┴──────────────────────────────────────────────────────────┘
注:PERSISTENT SECRET 存储在 ${HOME}/.duckdb/stored_secrets 目录下。
Duckdb 访问 S3 的 CSV 备份文件
通过 Duckdb 的 S3 expansion 结合 read_csv 访问数据。
duckdb> SELECT count(*) FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true);100% ▕████████████████████████████████████████████████████████████▏ ┌──────────────┐│ count_star() ││ int64 │├──────────────┤│ 10000000 │└──────────────┘Run Time (s): real 18.073 user 17.706364 sys 6.432855
duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true) limit 1;┌──────────────────────┬─────────┬─────────────┬─────────┬────────────┬───────────┬────────────────────┬───┬────────────────────┬────────────────┬─────────────────┬──────────┬───────────┬────────────┬─────────────────────┐│ userid │ name │ phone │ sex │ birth │ card_type │ card_id │ … │ col17 │ col18 │ col19 │ balance1 │ balance2 │ balance3 │ modify_datetime ││ varchar │ varchar │ int64 │ varchar │ date │ varchar │ varchar │ │ varchar │ varchar │ varchar │ varchar │ varchar │ double │ timestamp │├──────────────────────┼─────────┼─────────────┼─────────┼────────────┼───────────┼────────────────────┼───┼────────────────────┼────────────────┼─────────────────┼──────────┼───────────┼────────────┼─────────────────────┤│ 004804a7-e96e-4cf8… │ LKQKM │ 10240894028 │ 女 │ 1979-10-29 │ 身份证 │ 588802761557381279 │ … │ GJIRATCSLPTFHCSVDH │ STOBWVHFSQFIDP │ KPRVIGNNFIDCIBL │ 2071.61 │ 9169.1201 │ 10000.1234 │ 2025-01-28 10:42:56 │├──────────────────────┴─────────┴─────────────┴─────────┴────────────┴───────────┴────────────────────┴───┴────────────────────┴────────────────┴─────────────────┴──────────┴───────────┴────────────┴─────────────────────┤│ 1 rows 38 columns (14 shown) │└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘Run Time (s): real 3.981 user 6.215305 sys 1.525878
duckdb> .mode lineduckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true) limit 1; userid = 004804a7-e96e-4cf8-8622-891e0ae9d87d name = LKQKM phone = 10240894028 sex = 女 birth = 1979-10-29 card_type = 身份证 card_id = 588802761557381279 reg_date = 2001-05-12 reg_time = 02:41:58 modify_date = 2016-07-09 unit_id = hDY vip_level = 4 corp_id = 22100 job_number = 130353 col00 = EDRROXGRCYJ col01 = TENOFL col02 = MDBPTICNPYQKYDTRHSGC col03 = YPUJVK col04 = JSKWBZYHXR col05 = IXBPDPDPEARSP col06 = QDSLHMNZ col07 = DJBRX col08 = DYCCQDHJIMA col09 = AFISVLDMXLQPZHVZZDQA col10 = FYHYBEYQGYYDVIHG col11 = LZ col12 = NITVMZGXRTYKQFVPFTHB col13 = NCFESYSN col14 = WFSVITJTYPYBEOXYLXU col15 = YZYYG col16 = PZYQPLZ col17 = GJIRATCSLPTFHCSVDH col18 = STOBWVHFSQFIDP col19 = KPRVIGNNFIDCIBL balance1 = 2071.61 balance2 = 9169.1201 balance3 = 10000.1234modify_datetime = 2025-01-28 10:42:56Run Time (s): real 2.791 user 6.307715 sys 1.495764duckdb> .mode duckbox
注:.mode line 类似于 MySQL > 的 \G 结束符的输出,.mode table 类似于 MySQL > 的默认输出,.mode duckbox 是 Duckdb 的默认输出,自适应屏幕的简略输出。
标准备份数据导入 Duckdb
默认情况下,使用 auto_detect 访问数据不一定能使用正确的字段格式,先使用 sniff_csv 生成配置模板。
duckdb> .mode lineduckdb> SELECT Prompt FROM sniff_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv', sample_size = 10000);Prompt = FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv', auto_detect=false, delim=',', quote='"', escape='"', new_line='\r\n', skip=0, comment='', header=true, columns={'userid': 'VARCHAR', 'name': 'VARCHAR', 'phone': 'BIGINT', 'sex': 'VARCHAR', 'birth': 'DATE', 'card_type': 'VARCHAR', 'card_id': 'VARCHAR', 'reg_date': 'DATE', 'reg_time': 'TIME', 'modify_date': 'DATE', 'unit_id': 'VARCHAR', 'vip_level': 'BIGINT', 'corp_id': 'BIGINT', 'job_number': 'BIGINT', 'col00': 'VARCHAR', 'col01': 'VARCHAR', 'col02': 'VARCHAR', 'col03': 'VARCHAR', 'col04': 'VARCHAR', 'col05': 'VARCHAR', 'col06': 'VARCHAR', 'col07': 'VARCHAR', 'col08': 'VARCHAR', 'col09': 'VARCHAR', 'col10': 'VARCHAR', 'col11': 'VARCHAR', 'col12': 'VARCHAR', 'col13': 'VARCHAR', 'col14': 'VARCHAR', 'col15': 'VARCHAR', 'col16': 'VARCHAR', 'col17': 'VARCHAR', 'col18': 'VARCHAR', 'col19': 'VARCHAR', 'balance1': 'DOUBLE', 'balance2': 'DOUBLE', 'balance3': 'DOUBLE', 'modify_datetime': 'TIMESTAMP'}, dateformat='%Y-%m-%d', sample_size=10000);Run Time (s): real 0.174 user 0.277318 sys 0.060880
使用 read_csv 的 auto_detect 的自适应失败或 sniff_csv 的 sample_size 内的数据没有对应的格式,最终会导致缺少配置项的导入过程仍有可能会失败,需要手工调整。
需要调整配置模板后导入数据,也可以使用备份时对应的 schema 文件内关于字段的描述。
duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000120000.csv');Run Time (s): real 1.472 user 3.583090 sys 0.183995Conversion Error: CSV Error on Line: 154129Original Line: "04e471fa-0a6b-4181-b2a6-e4bc53af5fac","UVJXFMBEC","12308418577","女","2003-12-17","身份证","520850486289953447","2009-09-13",\N,"2012-04-30","8","0","4500","131043","EXDBHQFLZMAJZH","RVWDBE","UBLXVMPIC","CIHYSHCEQZZFBQZ","HPOARFSFFUQSPPN","PQKLEZ","DJFPVM","RBAKOUZZTN","V","PMVYO","V","BCIMNBFA","UOCMGLHVWU","JCUQJYT","KRFFJNMDIYCUBAJMFYK","OKLIMZU","VNSMFYDLH","NWJJLQZP","GKVL","UDUVEYSEBVZMQRKTLF",1606.15,9866.7268,10000.1234,"2025-01-27 09:39:44"Error when converting column "reg_time". Could not convert string "\N" to 'TIME'
Column reg_time is being converted as type TIMEThis type was auto-detected from the CSV file.
Dumpling 默认备份 CSV 时使用 ‘\N’,需要添加 nullstr=’\N’ 配置项。修改后,再导入。
duckdb> SELECT *FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='', nullstr='\N', header=true, columns = {'userid': 'VARCHAR','name': 'VARCHAR','phone': 'BIGINT','sex': 'VARCHAR','birth': 'DATE','card_type': 'VARCHAR','card_id': 'VARCHAR','reg_date': 'DATE','reg_time': 'TIME','modify_date': 'DATE','unit_id': 'VARCHAR','vip_level': 'BIGINT','corp_id': 'BIGINT','job_number': 'BIGINT','col00': 'VARCHAR','col01': 'VARCHAR','col02': 'VARCHAR','col03': 'VARCHAR','col04': 'VARCHAR','col05': 'VARCHAR','col06': 'VARCHAR','col07': 'VARCHAR','col08': 'VARCHAR','col09': 'VARCHAR','col10': 'VARCHAR','col11': 'VARCHAR','col12': 'VARCHAR','col13': 'VARCHAR','col14': 'VARCHAR','col15': 'VARCHAR','col16': 'VARCHAR','col17': 'VARCHAR','col18': 'VARCHAR','col19': 'VARCHAR','balance1': 'DOUBLE','balance2': 'DOUBLE','balance3': 'DOUBLE','modify_datetime': 'TIMESTAMP' });-- 输出略。
duckdb> create table users10m_ctas as SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='',nullstr='\N', header=true, columns = {'userid': 'VARCHAR','name': 'VARCHAR','phone': 'BIGINT','sex': 'VARCHAR','birth': 'DATE','card_type': 'VARCHAR','card_id': 'VARCHAR','reg_date': 'DATE','reg_time': 'TIME','modify_date': 'DATE','unit_id': 'VARCHAR','vip_level': 'BIGINT','corp_id': 'BIGINT','job_number': 'BIGINT','col00': 'VARCHAR','col01': 'VARCHAR','col02': 'VARCHAR','col03': 'VARCHAR','col04': 'VARCHAR','col05': 'VARCHAR','col06': 'VARCHAR','col07': 'VARCHAR','col08': 'VARCHAR','col09': 'VARCHAR','col10': 'VARCHAR','col11': 'VARCHAR','col12': 'VARCHAR','col13': 'VARCHAR','col14': 'VARCHAR','col15': 'VARCHAR','col16': 'VARCHAR','col17': 'VARCHAR','col18': 'VARCHAR','col19': 'VARCHAR','balance1': 'DOUBLE','balance2': 'DOUBLE','balance3': 'DOUBLE','modify_datetime': 'TIMESTAMP' });100% ▕████████████████████████████████████████████████████████████▏ Run Time (s): real 33.892 user 167.084411 sys 14.806282
# s3cmd --no-ssl --access_key=admin --secret_key=password --host=172.16.201.85:9000 --host-bucket= du s3://tidbbak/dumpling/20250128161903/ 4913936499 42 objects s3://tidbbak/dumpling/20250128161903/
# ls -lh /data/bak.duckdb -rw-r--r-- 1 root root 2.9G Jan 28 17:12 /data/bak.duckdb
查看 CTAS 的表结构和 DDL,已经按标识的字段类型创建。
duckdb> desc users10m_ctas;┌─────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐│ column_name │ column_type │ null │ key │ default │ extra ││ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │├─────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤│ userid │ VARCHAR │ YES │ │ │ ││ name │ VARCHAR │ YES │ │ │ ││ phone │ BIGINT │ YES │ │ │ ││ sex │ VARCHAR │ YES │ │ │ ││ birth │ DATE │ YES │ │ │ ││ card_type │ VARCHAR │ YES │ │ │ ││ card_id │ VARCHAR │ YES │ │ │ ││ reg_date │ DATE │ YES │ │ │ ││ reg_time │ TIME │ YES │ │ │ ││ modify_date │ DATE │ YES │ │ │ ││ unit_id │ VARCHAR │ YES │ │ │ ││ vip_level │ BIGINT │ YES │ │ │ ││ corp_id │ BIGINT │ YES │ │ │ ││ job_number │ BIGINT │ YES │ │ │ ││ col00 │ VARCHAR │ YES │ │ │ ││ col01 │ VARCHAR │ YES │ │ │ ││ col02 │ VARCHAR │ YES │ │ │ ││ col03 │ VARCHAR │ YES │ │ │ ││ col04 │ VARCHAR │ YES │ │ │ ││ col05 │ VARCHAR │ YES │ │ │ ││ col06 │ VARCHAR │ YES │ │ │ ││ col07 │ VARCHAR │ YES │ │ │ ││ col08 │ VARCHAR │ YES │ │ │ ││ col09 │ VARCHAR │ YES │ │ │ ││ col10 │ VARCHAR │ YES │ │ │ ││ col11 │ VARCHAR │ YES │ │ │ ││ col12 │ VARCHAR │ YES │ │ │ ││ col13 │ VARCHAR │ YES │ │ │ ││ col14 │ VARCHAR │ YES │ │ │ ││ col15 │ VARCHAR │ YES │ │ │ ││ col16 │ VARCHAR │ YES │ │ │ ││ col17 │ VARCHAR │ YES │ │ │ ││ col18 │ VARCHAR │ YES │ │ │ ││ col19 │ VARCHAR │ YES │ │ │ ││ balance1 │ DOUBLE │ YES │ │ │ ││ balance2 │ DOUBLE │ YES │ │ │ ││ balance3 │ DOUBLE │ YES │ │ │ ││ modify_datetime │ TIMESTAMP │ YES │ │ │ │├─────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤│ 38 rows 6 columns │└───────────────────────────────────────────────────────────────────────┘Run Time (s): real 0.001 user 0.001159 sys 0.000000duckdb> .mode lineduckdb> WITH table_columns AS ( SELECT name, type, CASE WHEN "notnull" = 1 THEN 'NOT NULL' ELSE '' END AS not_null FROM pragma_table_info('users10m_ctas')) SELECT 'CREATE TABLE users10m_ctas (' || string_agg(name || ' ' || type || ' ' || not_null, ', ') || ');' AS create_table_sql FROM table_columns;create_table_sql = CREATE TABLE users10m_ctas (userid VARCHAR , name VARCHAR , phone BIGINT , sex VARCHAR , birth DATE , card_type VARCHAR , card_id VARCHAR , reg_date DATE , reg_time TIME , modify_date DATE , unit_id VARCHAR , vip_level BIGINT , corp_id BIGINT , job_number BIGINT , col00 VARCHAR , col01 VARCHAR , col02 VARCHAR , col03 VARCHAR , col04 VARCHAR , col05 VARCHAR , col06 VARCHAR , col07 VARCHAR , col08 VARCHAR , col09 VARCHAR , col10 VARCHAR , col11 VARCHAR , col12 VARCHAR , col13 VARCHAR , col14 VARCHAR , col15 VARCHAR , col16 VARCHAR , col17 VARCHAR , col18 VARCHAR , col19 VARCHAR , balance1 DOUBLE , balance2 DOUBLE , balance3 DOUBLE , modify_datetime TIMESTAMP );Run Time (s): real 0.002 user 0.002692 sys 0.000146
注意:DuckDB 的 VARCHAR 没有显式长度限制,可以存储任意长度的字符串。实际使用中,字符串长度受限于内存大小和系统性能。
测试原生表和外部表的性能
duckdb> SELECT max(reg_date),min(balance1) FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='',nullstr='\N', header=true,columns = {中间格式略};
max(reg_date) = 2009-12-30min(balance1) = 0.02Run Time (s): real 18.868 user 19.898231 sys 6.671171
duckdb> SELECT max(reg_date),min(balance1) FROM users10m_ctas;max(reg_date) = 2009-12-30min(balance1) = 0.02Run Time (s): real 0.011 user 0.051131 sys 0.001120
duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='',nullstr='\N', header=true,columns = {中间格式略}) where userid='04e471fa-0a6b-4181-b2a6-e4bc53af5fac';100% userid = 04e471fa-0a6b-4181-b2a6-e4bc53af5fac name = UVJXFMBEC phone = 12308418577 sex = 女 birth = 2003-12-17 card_type = 身份证 card_id = 520850486289953447 reg_date = 2009-09-13 reg_time = modify_date = 2012-04-30 unit_id = 8 vip_level = 0 corp_id = 4500 job_number = 131043 col00 = EXDBHQFLZMAJZH col01 = RVWDBE col02 = UBLXVMPIC col03 = CIHYSHCEQZZFBQZ col04 = HPOARFSFFUQSPPN col05 = PQKLEZ col06 = DJFPVM col07 = RBAKOUZZTN col08 = V col09 = PMVYO col10 = V col11 = BCIMNBFA col12 = UOCMGLHVWU col13 = JCUQJYT col14 = KRFFJNMDIYCUBAJMFYK col15 = OKLIMZU col16 = VNSMFYDLH col17 = NWJJLQZP col18 = GKVL col19 = UDUVEYSEBVZMQRKTLF balance1 = 1606.15 balance2 = 9866.7268 balance3 = 10000.1234modify_datetime = 2025-01-27 09:39:44Run Time (s): real 15.471 user 25.564100 sys 7.191754
duckdb> SELECT * FROM users10m_ctas where userid='04e471fa-0a6b-4181-b2a6-e4bc53af5fac'; userid = 04e471fa-0a6b-4181-b2a6-e4bc53af5fac name = UVJXFMBEC phone = 12308418577 sex = 女 birth = 2003-12-17 card_type = 身份证 card_id = 520850486289953447 reg_date = 2009-09-13 reg_time = modify_date = 2012-04-30 unit_id = 8 vip_level = 0 corp_id = 4500 job_number = 131043 col00 = EXDBHQFLZMAJZH col01 = RVWDBE col02 = UBLXVMPIC col03 = CIHYSHCEQZZFBQZ col04 = HPOARFSFFUQSPPN col05 = PQKLEZ col06 = DJFPVM col07 = RBAKOUZZTN col08 = V col09 = PMVYO col10 = V col11 = BCIMNBFA col12 = UOCMGLHVWU col13 = JCUQJYT col14 = KRFFJNMDIYCUBAJMFYK col15 = OKLIMZU col16 = VNSMFYDLH col17 = NWJJLQZP col18 = GKVL col19 = UDUVEYSEBVZMQRKTLF balance1 = 1606.15 balance2 = 9866.7268 balance3 = 10000.1234modify_datetime = 2025-01-27 09:39:44Run Time (s): real 0.006 user 0.009642 sys 0.002340
对比场景成绩
使用自定义的 CSV 格式
使用控制字符 x01,即 SOH 作为间隔符。能规避字段中包含默认定义分隔符的情况,防止数据读取分列时出错。
注意:如果使用多字符的 delimiter,duckdb 会指定提示 Invalid Input Error: The delimiter option cannot exceed a size of 1 byte. 的超长错误。
export AWS_ACCESS_KEY_ID=admin; export AWS_SECRET_ACCESS_KEY=password; dumpling --filter 'bakdb.users10m' -u root -P 4500 -h 172.16.201.122 -p root --filetype csv -t 8 -o "s3://tidbbak/dumpling/${CURDATE}" --s3.endpoint="http://172.16.201.85:9000" -r 200000 -F 256MiB --csv-separator=$(echo -e "\x01") --escape-backslash=false --csv-null-value="NULL" --csv-line-terminator=$'\r\n' --csv-delimiter=""
# cat -e bakdb.users10m.0000000390000.csv|more userid^Aname^Aphone^Asex^Abirth^Acard_type^Acard_id^Areg_date^Areg_time^Amodify_date^Aunit_id^Avip_level^Acorp_id^Ajob_number^Acol00^Acol01^Acol02^Acol03^Acol04^Acol05^Acol06^Acol07^Acol08^Acol09^Acol10^Acol11^Acol12^Acol13^Acol14^Acol15^Acol16^Acol17^Acol18^Acol19^Abalance1^Abalance2^Abalance3^Amodify_datetime^M$1dbf161b-87e9-49e4-bd43-83e8179b4292^ACQAZEBAOQL^A14737946971^AM-gM-^TM-7^A2019-06-15^AM-hM-:M-+M-dM-;M-=M-hM-/M-^A^A599781278347424629^A2003-04-16^ANULL^A2018-04-03^Aeg^A1^A22200^A27564^ARLAWIETOMOH^ALQDFFCOHMXVOR^AYPBIIHQNIVDM^AXUUKIEXPJRVGQ^AZQMEPKMNHBACDFOYVT^AVFSWFI^ACPSIYHSUBFCI^AYGRFPCEGHAISOGAXIG^AUSAOTBNP^AWDUS^AUTYZOOMCHECUHYAB^ABS^AWJRVSSQVVUSBVZGL^AZ^AOCSGJZYQEIYIKIXVCEX^APYAXMFDV^AORLWDLPQHPMEMOB^AIZZVERY^AGCRJOF^AXKQDXOUXFXRJJYORJ^ANULL^ANULL^A10000.1234^A2025-01-27 09:39:44^M$
对应的配置 read_csv 的参数 delim=e’\x01’ 和 nullstr=‘NULL’
duckdb> create table users10m_ctas2 as SELECT * · FROM read_csv('s3://tidbbak/dumpling/20250129143910/bakdb.users10m.*.csv', union_by_name = true, delim=e'\x01', quote='', new_line='\r\n', skip=0, comment='', nullstr='NULL', header=true, · columns = {中间格式略});
duckdb> select * from users10m_ctas2 limit 10 ;┌──────────────────────┬──────────────────────┬─────────────┬─────────┬────────────┬───────────┬────────────────────┬───┬─────────────────┬──────────────────────┬──────────┬───────────┬────────────┬─────────────────────┐│ userid │ name │ phone │ sex │ birth │ card_type │ card_id │ … │ col18 │ col19 │ balance1 │ balance2 │ balance3 │ modify_datetime ││ varchar │ varchar │ int64 │ varchar │ date │ varchar │ varchar │ │ varchar │ varchar │ double │ double │ double │ timestamp │├──────────────────────┼──────────────────────┼─────────────┼─────────┼────────────┼───────────┼────────────────────┼───┼─────────────────┼──────────────────────┼──────────┼───────────┼────────────┼─────────────────────┤│ 004804a7-e96e-4cf8… │ LKQKM │ 10240894028 │ 女 │ 1979-10-29 │ 身份证 │ 588802761557381279 │ … │ STOBWVHFSQFIDP │ KPRVIGNNFIDCIBL │ 2071.61 │ 9169.1201 │ 10000.1234 │ 2025-01-28 10:42:56 ││ 004804c4-3494-47de… │ OKTICUOYOJNNCULUCGKS │ 15002789114 │ 男 │ 1988-02-13 │ 身份证 │ 970410241703370828 │ … │ OVRLI │ QKH │ 9722.42 │ 5618.2924 │ 10000.1234 │ 2025-01-28 10:42:56 ││ 00480505-5d3e-40f3… │ OEYHUYADGB │ 12955781898 │ 男 │ 2001-03-05 │ 身份证 │ 897826304664568897 │ … │ GXVSEAORZA │ SLDXCJYWYFZFHEHERQPV │ 6997.88 │ 8810.8585 │ 10000.1234 │ 2025-01-28 10:42:56 ││ 0048050a-8c4f-4b68… │ QLCWZCMS │ 16300152778 │ 女 │ 2007-07-11 │ 身份证 │ 177321359793083575 │ … │ FSPAC │ MZKB │ 9225.89 │ 628.1277 │ 10000.1234 │ 2025-01-28 10:42:56 ││ 0048050f-c7e2-4d9d… │ YLQALLEVP │ 10781833261 │ 男 │ 1981-06-23 │ 身份证 │ 620054938335154911 │ … │ CPFGWPHW │ NUMQXOZGAJODSEJYL │ 7555.27 │ 5465.2121 │ 10000.1234 │ 2025-01-28 10:42:56 ││ 00480534-030c-4caa… │ HAWMDVNXMVR │ 17899716473 │ 女 │ 2001-12-26 │ 身份证 │ 599799321632177923 │ … │ DFCXDQQRLCQL │ PFNZ │ 2019.48 │ 8455.7013 │ 10000.1234 │ 2025-01-28 10:42:56 ││ 004805c2-78c6-4d1b… │ DRGFKEKBPP │ 16538556539 │ 女 │ 2016-09-08 │ 身份证 │ 698176678748396854 │ … │ MZNAQX │ UCOBKOKZBQGWKTGBSSXF │ 2793.68 │ 3975.8125 │ 10000.1234 │ 2025-01-28 10:42:56 ││ 004805d5-d97d-49db… │ Y │ 10521349859 │ 男 │ 1998-12-10 │ 身份证 │ 309479851036137986 │ … │ DMONNXPUFSXMVPR │ NRHQXIXNTHMU │ 9898.58 │ 9541.5137 │ 10000.1234 │ 2025-01-28 10:42:56 ││ 004805de-0e71-425f… │ SGFAWOYJ │ 19009647824 │ 男 │ 1973-02-10 │ 身份证 │ 453108689618730419 │ … │ CU │ SKBELQNXTITOWSTLE │ 1370.42 │ 7448.9362 │ 10000.1234 │ 2025-01-28 10:42:56 ││ 004805de-562d-46f1… │ NPTFMCP │ 15951915861 │ 男 │ 2005-02-02 │ 身份证 │ 140249578270439645 │ … │ ALVM │ OIOBZ │ 6013.69 │ 8486.551 │ 10000.1234 │ 2025-01-28 10:42:56 │├──────────────────────┴──────────────────────┴─────────────┴─────────┴────────────┴───────────┴────────────────────┴───┴─────────────────┴──────────────────────┴──────────┴───────────┴────────────┴─────────────────────┤│ 10 rows 38 columns (13 shown) │└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘Run Time (s): real 0.007 user 0.007134 sys 0.000018
四、总结
备份的数据具有较高的业务价值,可以支持历史快照的即席查询,进一步分析业务的历史变化,但是由于平面文件使用不方便,在日常运营过程中,备份数据没有体现出业务价值。在 MySQL 生态中,通过 dumpling 备份成 CSV,结合 S3 兼容的对象存储,再通过 DuckDB 实现数据备份的快速打开和查询,其他的数据库也可以参考。在具体的实践过程中,可以通过自动化平台,实现按需的自助查询操作,具有较高的业务意义。
版权声明: 本文为 InfoQ 作者【TiDB 社区干货传送门】的原创文章。
原文链接:【http://xie.infoq.cn/article/19c56d6bbb27a09ae85bd470c】。文章转载请联系作者。
TiDB 社区干货传送门
TiDB 社区官网:https://tidb.net/ 2021-12-15 加入
TiDB 社区干货传送门是由 TiDB 社区中布道师组委会自发组织的 TiDB 社区优质内容对外宣布的栏目,旨在加深 TiDBer 之间的交流和学习。一起构建有爱、互助、共创共建的 TiDB 社区 https://tidb.net/









评论