
DuckDB x TiDB -- Waking Up CSV Files with DuckDB

  • 2025-02-14 · Beijing

Author: pepezzzz  |  Original source: https://tidb.net/blog/5c2153ac

1. Background

Enterprises often design logical database backup mechanisms to meet business protection-level compliance requirements; the logical backups are written as flat files and placed on S3-compatible storage. When the business needs to query data from a historical period, the backup files have to be re-imported into a database before they can be queried, and because the time range of interest is often fuzzy, this import tends to be repeated many times.


Dumpling is TiDB's logical backup tool. It is MySQL-compatible, lightweight, and easy to use; it can back up production data into CSV flat files, supports features such as custom separators, and works with S3 storage. DuckDB is a lightweight embedded database for analytical (AP) workloads that can read CSV files on S3 as external tables.


This experiment combines Dumpling and DuckDB: Dumpling's backup files are handed to DuckDB so that they can be opened and queried quickly.

2. Environment Preparation


DuckDB environment


https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip


# unzip duckdb_cli-linux-amd64.zip
Archive:  duckdb_cli-linux-amd64.zip
  inflating: duckdb

# vi init.sql
SET extension_directory = '/data/duckdb_ext';
.prompt 'duckdb> '
.timer on

# ./duckdb bak.duckdb --init init.sql
duckdb> SELECT current_database();
┌────────────────────┐
│ current_database() │
│      varchar       │
├────────────────────┤
│ bak                │
└────────────────────┘
Run Time (s): real 0.001 user 0.000429 sys 0.000501
duckdb> .exit


MinIO single-node test environment


Note: download the MinIO server and client binaries:


wget https://dl.min.io/server/minio/release/linux-amd64/minio


wget https://dl.min.io/client/mc/release/linux-amd64/mc


# MINIO_ROOT_USER=admin MINIO_ROOT_PASSWORD=password ./minio server /data/miniodata/ --console-address ":9001" &
[1] 9664
[root@vm172-16-201-85 data]# MinIO Object Storage Server
Copyright: 2015-2025 MinIO, Inc.
License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
Version: RELEASE.2024-12-18T13-15-44Z (go1.23.4 linux/amd64)

API: http://172.16.201.85:9000 http://172.17.0.1:9000 http://127.0.0.1:9000
RootUser: admin
RootPass: password

WebUI: http://172.16.201.85:9001 http://172.17.0.1:9001 http://127.0.0.1:9001
RootUser: admin
RootPass: password

CLI: https://min.io/docs/minio/linux/reference/minio-mc.html#quickstart
   $ mc alias set 'myminio' 'http://172.16.201.85:9000' 'admin' 'password'

Docs: https://docs.min.io

WARN: Detected Linux kernel version older than 4.0 release, there are some known potential performance problems with this kernel version. MinIO recommends a minimum of 4.x linux kernel version for best performance

# s3cmd --no-ssl --access_key=admin --secret_key=password --host=172.16.201.85:9000 --host-bucket= mb s3://tidbbak


Data backup


mysql> select count(1) from users10m;
+----------+
| count(1) |
+----------+
| 10000000 |
+----------+
1 row in set (6 min 36.33 sec)

# export AWS_ACCESS_KEY_ID=admin
# export AWS_SECRET_ACCESS_KEY=password
# CURDATE=$(date +%Y%m%d%H%M%S)
# /root/.tiup/components/dumpling/v8.5.0/dumpling -B bakdb -u "${BAKUSER}" -P ${TIDBPORT} -h ${TIDBIP} -p "${BAKPW}" --filetype csv -t 8 -o "s3://${Bucket}/dumpling/${CURDATE}" --s3.endpoint="http://${Endpoint}" -r 200000 -F 256MiB

# s3cmd --no-ssl --access_key=admin --secret_key=password --host=172.16.201.85:9000 --host-bucket= get s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv
download: 's3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv' -> './bakdb.users10m.0000000010000.csv'  [1 of 1]
 51980222 of 51980222   100% in    0s   113.40 MB/s  done

# more bakdb.users10m.0000000010000.csv
"userid","name","phone","sex","birth","card_type","card_id","reg_date","reg_time","modify_date","unit_id","vip_level","corp_id","job_number","col00","col01","col02","col03","col04","col05","col06","col07","col08","col09","col10","col11","col12","col13","col14","col15","col16","col17","col18","col19","balance1","balance2","balance3","modify_datetime"
"004804a7-e96e-4cf8-8622-891e0ae9d87d","LKQKM","10240894028","女","1979-10-29","身份证","588802761557381279","2001-05-12","02:41:58","2016-07-09","hDY","4","22100","130353","EDRROXGRCYJ","TENOFL","MDBPTICNPYQKYDTRHSGC","YPUJVK","JSKWBZYHXR","IXBPDPDPEARSP","QDSLHMNZ","DJBRX","DYCCQDHJIMA","AFISVLDMXLQPZHVZZDQA","FYHYBEYQGYYDVIHG","LZ","NITVMZGXRTYKQFVPFTHB","NCFESYSN","WFSVITJTYPYBEOXYLXU","YZYYG","PZYQPLZ","GJIRATCSLPTFHCSVDH","STOBWVHFSQFIDP","KPRVIGNNFIDCIBL",2071.61,9169.1201,10000.1234,"2025-01-28 10:42:56"
"004804c4-3494-47de-a7ca-f5ab7ba8e135","OKTICUOYOJNNCULUCGKS","15002789114","男","1988-02-13","身份证","970410241703370828","2000-11-11","09:46:21","2013-01-15","dgjhS","2","6900","107337","SINTVGVWCUQROU","U","LRHRMWTGCKN","T","SADSQUTZD","OBJMWNAHSUERZJPMRG","TAPMJTRLQJIAYXSCAVUQ","VRJCSHSGUAOWAVBDST","OKOUTEUMWVHJYBTP","EYMJMMBWWJW","PALWUHU","NWKMJADTBYPJZ","OVZSQOMBUBXUT","XDFMQHJFIWXN","JFIA","FDD","XVQUOOODC","P","OVRLI","QKH",9722.42,5618.2924,10000.1234,"2025-01-28 10:42:56"

3. Accessing the CSV Backups on S3

Configuring the S3 access credentials

Install the httpfs extension and configure an S3 secret.


duckdb> INSTALL 'httpfs';
duckdb> SELECT * FROM duckdb_extensions();
│  extension_name  │ loaded  │ installed │     install_path     │                                  description                                  │      aliases      │ extension_version │   install_mode    │ installed_from │
│     varchar      │ boolean │  boolean  │       varchar        │                                    varchar                                    │     varchar[]     │      varchar      │      varchar      │    varchar     │
│ arrow            │ false   │ false     │                      │ A zero-copy data integration between Apache Arrow and DuckDB                  │ []                │                   │                   │                │
│ autocomplete     │ true    │ true      │ (BUILT-IN)           │ Adds support for autocomplete in the shell                                    │ []                │                   │ STATICALLY_LINKED │                │
│ aws              │ false   │ false     │                      │ Provides features that depend on the AWS SDK                                  │ []                │                   │                   │                │
│ azure            │ false   │ false     │                      │ Adds a filesystem abstraction for Azure blob storage to DuckDB                │ []                │                   │                   │                │
│ delta            │ false   │ false     │                      │ Adds support for Delta Lake                                                   │ []                │                   │                   │                │
│ excel            │ false   │ false     │                      │ Adds support for Excel-like format strings                                    │ []                │                   │                   │                │
│ fts              │ true    │ true      │ (BUILT-IN)           │ Adds support for Full-Text Search Indexes                                     │ []                │ v1.1.3            │ STATICALLY_LINKED │                │
│ httpfs           │ true    │ true      │ /data/duckdb_ext/v…  │ Adds support for reading and writing files over a HTTP(S) connection          │ [http, https, s3] │ v1.1.3            │ REPOSITORY        │ core           │
...
│ vss              │ false   │ false     │                      │ Adds indexing support to accelerate Vector Similarity Search                  │ []                │                   │                   │                │
│ 24 rows                                                                                                                                                                                                          9 columns │
Run Time (s): real 0.005 user 0.000519 sys 0.001037

duckdb> CREATE PERSISTENT SECRET miniosecret ( TYPE S3, KEY_ID 'admin', SECRET 'password', ENDPOINT '172.16.201.85:9000', URL_STYLE 'path', USE_SSL false );
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true    │
└─────────┘

duckdb> SELECT * FROM duckdb_secrets();
│    name     │  type   │ provider │ persistent │  storage   │        scope         │                      secret_string                       │
│   varchar   │ varchar │ varchar  │  boolean   │  varchar   │      varchar[]       │                          varchar                          │
│ miniosecret │ s3      │ config   │ true       │ local_file │ [s3://, s3n://, s3…  │ name=miniosecret;type=s3;provider=config;serializable=…  │


Note: PERSISTENT SECRETs are stored in the ${HOME}/.duckdb/stored_secrets directory.
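For reference, the related secret-management statements look like this (a minimal sketch assuming the DuckDB v1.1 Secrets Manager syntax; the temporary secret name tidbbak_tmp and the bucket scope are illustrative):

duckdb> CREATE SECRET tidbbak_tmp ( TYPE S3, KEY_ID 'admin', SECRET 'password', ENDPOINT '172.16.201.85:9000', URL_STYLE 'path', USE_SSL false, SCOPE 's3://tidbbak' );  -- session-only secret, scoped to one bucket
duckdb> DROP PERSISTENT SECRET miniosecret;  -- deletes the secret file stored under ${HOME}/.duckdb/stored_secrets

A session-scoped secret avoids leaving credentials on disk when the query is a one-off.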

Accessing the CSV backup files on S3 from DuckDB

The data is accessed through DuckDB's S3 extension combined with read_csv.


duckdb> SELECT count(*) FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true);
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│     10000000 │
└──────────────┘
Run Time (s): real 18.073 user 17.706364 sys 6.432855

duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true) limit 1;
│        userid        │  name   │    phone    │   sex   │   birth    │ card_type │      card_id       │ … │       col17        │     col18      │      col19      │ balance1 │ balance2  │  balance3  │   modify_datetime   │
│       varchar        │ varchar │    int64    │ varchar │    date    │  varchar  │      varchar       │   │      varchar       │    varchar     │     varchar     │ varchar  │  varchar  │   double   │      timestamp      │
│ 004804a7-e96e-4cf8…  │ LKQKM   │ 10240894028 │ 女      │ 1979-10-29 │ 身份证    │ 588802761557381279 │ … │ GJIRATCSLPTFHCSVDH │ STOBWVHFSQFIDP │ KPRVIGNNFIDCIBL │ 2071.61  │ 9169.1201 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 1 rows                                                                                                                                                                                      38 columns (14 shown) │
Run Time (s): real 3.981 user 6.215305 sys 1.525878

duckdb> .mode line
duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true) limit 1;
userid = 004804a7-e96e-4cf8-8622-891e0ae9d87d
name = LKQKM
phone = 10240894028
sex = 女
birth = 1979-10-29
card_type = 身份证
card_id = 588802761557381279
reg_date = 2001-05-12
reg_time = 02:41:58
modify_date = 2016-07-09
unit_id = hDY
vip_level = 4
corp_id = 22100
job_number = 130353
col00 = EDRROXGRCYJ
col01 = TENOFL
col02 = MDBPTICNPYQKYDTRHSGC
col03 = YPUJVK
col04 = JSKWBZYHXR
col05 = IXBPDPDPEARSP
col06 = QDSLHMNZ
col07 = DJBRX
col08 = DYCCQDHJIMA
col09 = AFISVLDMXLQPZHVZZDQA
col10 = FYHYBEYQGYYDVIHG
col11 = LZ
col12 = NITVMZGXRTYKQFVPFTHB
col13 = NCFESYSN
col14 = WFSVITJTYPYBEOXYLXU
col15 = YZYYG
col16 = PZYQPLZ
col17 = GJIRATCSLPTFHCSVDH
col18 = STOBWVHFSQFIDP
col19 = KPRVIGNNFIDCIBL
balance1 = 2071.61
balance2 = 9169.1201
balance3 = 10000.1234
modify_datetime = 2025-01-28 10:42:56
Run Time (s): real 2.791 user 6.307715 sys 1.495764
duckdb> .mode duckbox


Note: .mode line resembles MySQL's \G terminator output, .mode table resembles MySQL's default output, and .mode duckbox is DuckDB's default: a condensed output that adapts to the screen width.

Importing a standard backup into DuckDB

By default, reading the data with auto_detect does not always pick the correct column types, so first use sniff_csv to generate a configuration template.


duckdb> .mode line
duckdb> SELECT Prompt FROM sniff_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv', sample_size = 10000);
Prompt = FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv', auto_detect=false, delim=',', quote='"', escape='"', new_line='\r\n', skip=0, comment='', header=true, columns={'userid': 'VARCHAR', 'name': 'VARCHAR', 'phone': 'BIGINT', 'sex': 'VARCHAR', 'birth': 'DATE', 'card_type': 'VARCHAR', 'card_id': 'VARCHAR', 'reg_date': 'DATE', 'reg_time': 'TIME', 'modify_date': 'DATE', 'unit_id': 'VARCHAR', 'vip_level': 'BIGINT', 'corp_id': 'BIGINT', 'job_number': 'BIGINT', 'col00': 'VARCHAR', 'col01': 'VARCHAR', 'col02': 'VARCHAR', 'col03': 'VARCHAR', 'col04': 'VARCHAR', 'col05': 'VARCHAR', 'col06': 'VARCHAR', 'col07': 'VARCHAR', 'col08': 'VARCHAR', 'col09': 'VARCHAR', 'col10': 'VARCHAR', 'col11': 'VARCHAR', 'col12': 'VARCHAR', 'col13': 'VARCHAR', 'col14': 'VARCHAR', 'col15': 'VARCHAR', 'col16': 'VARCHAR', 'col17': 'VARCHAR', 'col18': 'VARCHAR', 'col19': 'VARCHAR', 'balance1': 'DOUBLE', 'balance2': 'DOUBLE', 'balance3': 'DOUBLE', 'modify_datetime': 'TIMESTAMP'}, dateformat='%Y-%m-%d', sample_size=10000);
Run Time (s): real 0.174 user 0.277318 sys 0.060880


If read_csv's auto_detect fails to adapt, or the rows within sniff_csv's sample_size never exercise a particular format, an import that is missing the relevant options can still fail and has to be adjusted by hand.
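When only one or two columns are mis-detected, an alternative to writing out the whole columns list is to pin just those columns. A minimal sketch, assuming read_csv's types option, which overrides the detected type only for the named columns (the choice of columns below is illustrative):

duckdb> SELECT count(*)
      · FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv',
      ·     union_by_name = true, nullstr = '\N',              -- Dumpling writes NULL as \N (see below)
      ·     types = {'reg_time': 'TIME', 'phone': 'VARCHAR'});  -- pin only the problematic columns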


Adjust the configuration template before importing the data; the column definitions in the schema file produced with the backup can also be used as a reference.
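For example, the original DDL can be read straight out of the dump directory; a minimal sketch, assuming Dumpling's usual {db}.{table}-schema.sql naming and DuckDB's read_text() table function:

duckdb> SELECT content
      · FROM read_text('s3://tidbbak/dumpling/20250128161903/bakdb.users10m-schema.sql');  -- shows the CREATE TABLE statement captured at backup time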


duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000120000.csv');
Run Time (s): real 1.472 user 3.583090 sys 0.183995
Conversion Error: CSV Error on Line: 154129
Original Line: "04e471fa-0a6b-4181-b2a6-e4bc53af5fac","UVJXFMBEC","12308418577","女","2003-12-17","身份证","520850486289953447","2009-09-13",\N,"2012-04-30","8","0","4500","131043","EXDBHQFLZMAJZH","RVWDBE","UBLXVMPIC","CIHYSHCEQZZFBQZ","HPOARFSFFUQSPPN","PQKLEZ","DJFPVM","RBAKOUZZTN","V","PMVYO","V","BCIMNBFA","UOCMGLHVWU","JCUQJYT","KRFFJNMDIYCUBAJMFYK","OKLIMZU","VNSMFYDLH","NWJJLQZP","GKVL","UDUVEYSEBVZMQRKTLF",1606.15,9866.7268,10000.1234,"2025-01-27 09:39:44"
Error when converting column "reg_time". Could not convert string "\N" to 'TIME'

Column reg_time is being converted as type TIME
This type was auto-detected from the CSV file.


By default, Dumpling writes NULL values as '\N' in CSV backups, so the nullstr='\N' option must be added. After this change, import again.


duckdb> SELECT *
      · FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true,
      ·     delim=',', quote='"', new_line='\r\n', skip=0, comment='', nullstr='\N', header=true,
      ·     columns = {'userid': 'VARCHAR','name': 'VARCHAR','phone': 'BIGINT','sex': 'VARCHAR','birth': 'DATE','card_type': 'VARCHAR','card_id': 'VARCHAR','reg_date': 'DATE','reg_time': 'TIME','modify_date': 'DATE','unit_id': 'VARCHAR','vip_level': 'BIGINT','corp_id': 'BIGINT','job_number': 'BIGINT','col00': 'VARCHAR','col01': 'VARCHAR','col02': 'VARCHAR','col03': 'VARCHAR','col04': 'VARCHAR','col05': 'VARCHAR','col06': 'VARCHAR','col07': 'VARCHAR','col08': 'VARCHAR','col09': 'VARCHAR','col10': 'VARCHAR','col11': 'VARCHAR','col12': 'VARCHAR','col13': 'VARCHAR','col14': 'VARCHAR','col15': 'VARCHAR','col16': 'VARCHAR','col17': 'VARCHAR','col18': 'VARCHAR','col19': 'VARCHAR','balance1': 'DOUBLE','balance2': 'DOUBLE','balance3': 'DOUBLE','modify_datetime': 'TIMESTAMP'});
-- output omitted.

duckdb> create table users10m_ctas as SELECT *
      · FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true,
      ·     delim=',', quote='"', new_line='\r\n', skip=0, comment='', nullstr='\N', header=true,
      ·     columns = {'userid': 'VARCHAR','name': 'VARCHAR','phone': 'BIGINT','sex': 'VARCHAR','birth': 'DATE','card_type': 'VARCHAR','card_id': 'VARCHAR','reg_date': 'DATE','reg_time': 'TIME','modify_date': 'DATE','unit_id': 'VARCHAR','vip_level': 'BIGINT','corp_id': 'BIGINT','job_number': 'BIGINT','col00': 'VARCHAR','col01': 'VARCHAR','col02': 'VARCHAR','col03': 'VARCHAR','col04': 'VARCHAR','col05': 'VARCHAR','col06': 'VARCHAR','col07': 'VARCHAR','col08': 'VARCHAR','col09': 'VARCHAR','col10': 'VARCHAR','col11': 'VARCHAR','col12': 'VARCHAR','col13': 'VARCHAR','col14': 'VARCHAR','col15': 'VARCHAR','col16': 'VARCHAR','col17': 'VARCHAR','col18': 'VARCHAR','col19': 'VARCHAR','balance1': 'DOUBLE','balance2': 'DOUBLE','balance3': 'DOUBLE','modify_datetime': 'TIMESTAMP'});
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 33.892 user 167.084411 sys 14.806282

# s3cmd --no-ssl --access_key=admin --secret_key=password --host=172.16.201.85:9000 --host-bucket= du s3://tidbbak/dumpling/20250128161903/
4913936499   42 objects s3://tidbbak/dumpling/20250128161903/

# ls -lh /data/bak.duckdb
-rw-r--r-- 1 root root 2.9G Jan 28 17:12 /data/bak.duckdb


Inspect the structure and DDL of the CTAS table: the columns have been created with the declared types.


duckdb> desc users10m_ctas;
┌─────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│   column_name   │ column_type │  null   │   key   │ default │  extra  │
│     varchar     │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ userid          │ VARCHAR     │ YES     │         │         │         │
│ name            │ VARCHAR     │ YES     │         │         │         │
│ phone           │ BIGINT      │ YES     │         │         │         │
│ sex             │ VARCHAR     │ YES     │         │         │         │
│ birth           │ DATE        │ YES     │         │         │         │
│ card_type       │ VARCHAR     │ YES     │         │         │         │
│ card_id         │ VARCHAR     │ YES     │         │         │         │
│ reg_date        │ DATE        │ YES     │         │         │         │
│ reg_time        │ TIME        │ YES     │         │         │         │
│ modify_date     │ DATE        │ YES     │         │         │         │
│ unit_id         │ VARCHAR     │ YES     │         │         │         │
│ vip_level       │ BIGINT      │ YES     │         │         │         │
│ corp_id         │ BIGINT      │ YES     │         │         │         │
│ job_number      │ BIGINT      │ YES     │         │         │         │
│ col00           │ VARCHAR     │ YES     │         │         │         │
│ col01           │ VARCHAR     │ YES     │         │         │         │
│ col02           │ VARCHAR     │ YES     │         │         │         │
│ col03           │ VARCHAR     │ YES     │         │         │         │
│ col04           │ VARCHAR     │ YES     │         │         │         │
│ col05           │ VARCHAR     │ YES     │         │         │         │
│ col06           │ VARCHAR     │ YES     │         │         │         │
│ col07           │ VARCHAR     │ YES     │         │         │         │
│ col08           │ VARCHAR     │ YES     │         │         │         │
│ col09           │ VARCHAR     │ YES     │         │         │         │
│ col10           │ VARCHAR     │ YES     │         │         │         │
│ col11           │ VARCHAR     │ YES     │         │         │         │
│ col12           │ VARCHAR     │ YES     │         │         │         │
│ col13           │ VARCHAR     │ YES     │         │         │         │
│ col14           │ VARCHAR     │ YES     │         │         │         │
│ col15           │ VARCHAR     │ YES     │         │         │         │
│ col16           │ VARCHAR     │ YES     │         │         │         │
│ col17           │ VARCHAR     │ YES     │         │         │         │
│ col18           │ VARCHAR     │ YES     │         │         │         │
│ col19           │ VARCHAR     │ YES     │         │         │         │
│ balance1        │ DOUBLE      │ YES     │         │         │         │
│ balance2        │ DOUBLE      │ YES     │         │         │         │
│ balance3        │ DOUBLE      │ YES     │         │         │         │
│ modify_datetime │ TIMESTAMP   │ YES     │         │         │         │
├─────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 38 rows                                                     6 columns │
└───────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.001 user 0.001159 sys 0.000000

duckdb> .mode line
duckdb> WITH table_columns AS (
      ·     SELECT
      ·         name,
      ·         type,
      ·         CASE WHEN "notnull" = 1 THEN 'NOT NULL' ELSE '' END AS not_null
      ·     FROM pragma_table_info('users10m_ctas'))
      · SELECT
      ·     'CREATE TABLE users10m_ctas (' ||
      ·     string_agg(name || ' ' || type || ' ' || not_null, ', ') ||
      ·     ');' AS create_table_sql
      · FROM table_columns;
create_table_sql = CREATE TABLE users10m_ctas (userid VARCHAR , name VARCHAR , phone BIGINT , sex VARCHAR , birth DATE , card_type VARCHAR , card_id VARCHAR , reg_date DATE , reg_time TIME , modify_date DATE , unit_id VARCHAR , vip_level BIGINT , corp_id BIGINT , job_number BIGINT , col00 VARCHAR , col01 VARCHAR , col02 VARCHAR , col03 VARCHAR , col04 VARCHAR , col05 VARCHAR , col06 VARCHAR , col07 VARCHAR , col08 VARCHAR , col09 VARCHAR , col10 VARCHAR , col11 VARCHAR , col12 VARCHAR , col13 VARCHAR , col14 VARCHAR , col15 VARCHAR , col16 VARCHAR , col17 VARCHAR , col18 VARCHAR , col19 VARCHAR , balance1 DOUBLE , balance2 DOUBLE , balance3 DOUBLE , modify_datetime TIMESTAMP );
Run Time (s): real 0.002 user 0.002692 sys 0.000146


Note: DuckDB's VARCHAR has no explicit length limit and can store strings of arbitrary length. In practice, string length is bounded by available memory and system performance.
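For example, the actual widths stored in the table can be checked after the load (a minimal sketch; the two columns below are just examples):

duckdb> SELECT max(length(name)) AS max_name_len, max(length(col09)) AS max_col09_len FROM users10m_ctas;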

Comparing the performance of native and external tables

duckdb> SELECT max(reg_date),min(balance1)
      · FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true,
      ·     delim=',', quote='"', new_line='\r\n', skip=0, comment='', nullstr='\N', header=true,
      ·     columns = {… column list omitted, same as above …});
max(reg_date) = 2009-12-30
min(balance1) = 0.02
Run Time (s): real 18.868 user 19.898231 sys 6.671171

duckdb> SELECT max(reg_date),min(balance1) FROM users10m_ctas;
max(reg_date) = 2009-12-30
min(balance1) = 0.02
Run Time (s): real 0.011 user 0.051131 sys 0.001120

duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true,
      ·     delim=',', quote='"', new_line='\r\n', skip=0, comment='', nullstr='\N', header=true,
      ·     columns = {… column list omitted, same as above …})
      · where userid='04e471fa-0a6b-4181-b2a6-e4bc53af5fac';
100%
userid = 04e471fa-0a6b-4181-b2a6-e4bc53af5fac
name = UVJXFMBEC
phone = 12308418577
sex = 女
birth = 2003-12-17
card_type = 身份证
card_id = 520850486289953447
reg_date = 2009-09-13
reg_time =
modify_date = 2012-04-30
unit_id = 8
vip_level = 0
corp_id = 4500
job_number = 131043
col00 = EXDBHQFLZMAJZH
col01 = RVWDBE
col02 = UBLXVMPIC
col03 = CIHYSHCEQZZFBQZ
col04 = HPOARFSFFUQSPPN
col05 = PQKLEZ
col06 = DJFPVM
col07 = RBAKOUZZTN
col08 = V
col09 = PMVYO
col10 = V
col11 = BCIMNBFA
col12 = UOCMGLHVWU
col13 = JCUQJYT
col14 = KRFFJNMDIYCUBAJMFYK
col15 = OKLIMZU
col16 = VNSMFYDLH
col17 = NWJJLQZP
col18 = GKVL
col19 = UDUVEYSEBVZMQRKTLF
balance1 = 1606.15
balance2 = 9866.7268
balance3 = 10000.1234
modify_datetime = 2025-01-27 09:39:44
Run Time (s): real 15.471 user 25.564100 sys 7.191754

duckdb> SELECT * FROM users10m_ctas where userid='04e471fa-0a6b-4181-b2a6-e4bc53af5fac';
userid = 04e471fa-0a6b-4181-b2a6-e4bc53af5fac
name = UVJXFMBEC
phone = 12308418577
sex = 女
birth = 2003-12-17
card_type = 身份证
card_id = 520850486289953447
reg_date = 2009-09-13
reg_time =
modify_date = 2012-04-30
unit_id = 8
vip_level = 0
corp_id = 4500
job_number = 131043
col00 = EXDBHQFLZMAJZH
col01 = RVWDBE
col02 = UBLXVMPIC
col03 = CIHYSHCEQZZFBQZ
col04 = HPOARFSFFUQSPPN
col05 = PQKLEZ
col06 = DJFPVM
col07 = RBAKOUZZTN
col08 = V
col09 = PMVYO
col10 = V
col11 = BCIMNBFA
col12 = UOCMGLHVWU
col13 = JCUQJYT
col14 = KRFFJNMDIYCUBAJMFYK
col15 = OKLIMZU
col16 = VNSMFYDLH
col17 = NWJJLQZP
col18 = GKVL
col19 = UDUVEYSEBVZMQRKTLF
balance1 = 1606.15
balance2 = 9866.7268
balance3 = 10000.1234
modify_datetime = 2025-01-27 09:39:44
Run Time (s): real 0.006 user 0.009642 sys 0.002340


Results of the comparison scenarios, compiled from the run times above:

Scenario                                     External CSV on S3    Native DuckDB table
max(reg_date), min(balance1) aggregation     18.868 s              0.011 s
Point query on userid                        15.471 s              0.006 s

The one-time CTAS import took 33.892 s and turned about 4.9 GB of CSV (42 objects on S3) into a 2.9 GB DuckDB database file.


Using a custom CSV format

Use the control character 0x01 (SOH) as the separator. This avoids the case where a field contains the default separator and prevents column-splitting errors when the data is read.


Note: if a multi-character delimiter is used, DuckDB reports the error "Invalid Input Error: The delimiter option cannot exceed a size of 1 byte."


export AWS_ACCESS_KEY_ID=admin; export AWS_SECRET_ACCESS_KEY=password; dumpling --filter 'bakdb.users10m' -u root -P 4500 -h 172.16.201.122 -p root --filetype csv -t 8 -o "s3://tidbbak/dumpling/${CURDATE}" --s3.endpoint="http://172.16.201.85:9000" -r 200000 -F 256MiB --csv-separator=$(echo -e "\x01") --escape-backslash=false --csv-null-value="NULL" --csv-line-terminator=$'\r\n' --csv-delimiter=""

# cat -e bakdb.users10m.0000000390000.csv|more
userid^Aname^Aphone^Asex^Abirth^Acard_type^Acard_id^Areg_date^Areg_time^Amodify_date^Aunit_id^Avip_level^Acorp_id^Ajob_number^Acol00^Acol01^Acol02^Acol03^Acol04^Acol05^Acol06^Acol07^Acol08^Acol09^Acol10^Acol11^Acol12^Acol13^Acol14^Acol15^Acol16^Acol17^Acol18^Acol19^Abalance1^Abalance2^Abalance3^Amodify_datetime^M$
1dbf161b-87e9-49e4-bd43-83e8179b4292^ACQAZEBAOQL^A14737946971^AM-gM-^TM-7^A2019-06-15^AM-hM-:M-+M-dM-;M-=M-hM-/M-^A^A599781278347424629^A2003-04-16^ANULL^A2018-04-03^Aeg^A1^A22200^A27564^ARLAWIETOMOH^ALQDFFCOHMXVOR^AYPBIIHQNIVDM^AXUUKIEXPJRVGQ^AZQMEPKMNHBACDFOYVT^AVFSWFI^ACPSIYHSUBFCI^AYGRFPCEGHAISOGAXIG^AUSAOTBNP^AWDUS^AUTYZOOMCHECUHYAB^ABS^AWJRVSSQVVUSBVZGL^AZ^AOCSGJZYQEIYIKIXVCEX^APYAXMFDV^AORLWDLPQHPMEMOB^AIZZVERY^AGCRJOF^AXKQDXOUXFXRJJYORJ^ANULL^ANULL^A10000.1234^A2025-01-27 09:39:44^M$


The corresponding read_csv options are delim=e'\x01' and nullstr='NULL'.


duckdb> create table users10m_ctas2 as SELECT *
      · FROM read_csv('s3://tidbbak/dumpling/20250129143910/bakdb.users10m.*.csv', union_by_name = true,
      ·     delim=e'\x01', quote='', new_line='\r\n', skip=0, comment='', nullstr='NULL', header=true,
      ·     columns = {… column list omitted, same as above …});

duckdb> select * from users10m_ctas2 limit 10 ;
│        userid        │         name         │    phone    │   sex   │   birth    │ card_type │      card_id       │ … │      col18      │        col19         │ balance1 │ balance2  │  balance3  │   modify_datetime   │
│       varchar        │       varchar        │    int64    │ varchar │    date    │  varchar  │      varchar       │   │     varchar     │       varchar        │  double  │  double   │   double   │      timestamp      │
│ 004804a7-e96e-4cf8…  │ LKQKM                │ 10240894028 │ 女      │ 1979-10-29 │ 身份证    │ 588802761557381279 │ … │ STOBWVHFSQFIDP  │ KPRVIGNNFIDCIBL      │  2071.61 │ 9169.1201 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004804c4-3494-47de…  │ OKTICUOYOJNNCULUCGKS │ 15002789114 │ 男      │ 1988-02-13 │ 身份证    │ 970410241703370828 │ … │ OVRLI           │ QKH                  │  9722.42 │ 5618.2924 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 00480505-5d3e-40f3…  │ OEYHUYADGB           │ 12955781898 │ 男      │ 2001-03-05 │ 身份证    │ 897826304664568897 │ … │ GXVSEAORZA      │ SLDXCJYWYFZFHEHERQPV │  6997.88 │ 8810.8585 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 0048050a-8c4f-4b68…  │ QLCWZCMS             │ 16300152778 │ 女      │ 2007-07-11 │ 身份证    │ 177321359793083575 │ … │ FSPAC           │ MZKB                 │  9225.89 │  628.1277 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 0048050f-c7e2-4d9d…  │ YLQALLEVP            │ 10781833261 │ 男      │ 1981-06-23 │ 身份证    │ 620054938335154911 │ … │ CPFGWPHW        │ NUMQXOZGAJODSEJYL    │  7555.27 │ 5465.2121 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 00480534-030c-4caa…  │ HAWMDVNXMVR          │ 17899716473 │ 女      │ 2001-12-26 │ 身份证    │ 599799321632177923 │ … │ DFCXDQQRLCQL    │ PFNZ                 │  2019.48 │ 8455.7013 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805c2-78c6-4d1b…  │ DRGFKEKBPP           │ 16538556539 │ 女      │ 2016-09-08 │ 身份证    │ 698176678748396854 │ … │ MZNAQX          │ UCOBKOKZBQGWKTGBSSXF │  2793.68 │ 3975.8125 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805d5-d97d-49db…  │ Y                    │ 10521349859 │ 男      │ 1998-12-10 │ 身份证    │ 309479851036137986 │ … │ DMONNXPUFSXMVPR │ NRHQXIXNTHMU         │  9898.58 │ 9541.5137 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805de-0e71-425f…  │ SGFAWOYJ             │ 19009647824 │ 男      │ 1973-02-10 │ 身份证    │ 453108689618730419 │ … │ CU              │ SKBELQNXTITOWSTLE    │  1370.42 │ 7448.9362 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805de-562d-46f1…  │ NPTFMCP              │ 15951915861 │ 男      │ 2005-02-02 │ 身份证    │ 140249578270439645 │ … │ ALVM            │ OIOBZ                │  6013.69 │ 8486.551  │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 10 rows                                                                                                                                                                                     38 columns (13 shown) │
Run Time (s): real 0.007 user 0.007134 sys 0.000018

4. Summary

Backup data has considerable business value: it can support ad-hoc queries against historical snapshots and further analysis of how the business has changed over time. Because flat files are inconvenient to work with, however, backups rarely demonstrate that value in day-to-day operations. In the MySQL ecosystem, backing up to CSV with Dumpling, storing the files on S3-compatible object storage, and then opening and querying them quickly with DuckDB addresses this; other databases can follow the same approach. In practice, an automation platform can offer this as an on-demand self-service query capability, which is of real business significance.
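As a starting point for such automation, here is a minimal sketch of a reusable DuckDB script (the file name restore_snapshot.sql and the view name are hypothetical; the read_csv options are the ones worked out above, and the S3 secret is assumed to already exist):

-- restore_snapshot.sql: expose one Dumpling snapshot as a queryable view
INSTALL httpfs;
LOAD httpfs;
-- the persistent S3 secret created earlier is picked up automatically
CREATE OR REPLACE VIEW users10m_20250128 AS
SELECT *
FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv',
              union_by_name = true, nullstr = '\N', header = true);
-- example self-service query against the snapshot
SELECT count(*) FROM users10m_20250128 WHERE reg_date >= DATE '2009-01-01';

Launching the CLI with ./duckdb bak.duckdb --init restore_snapshot.sql, as in the environment setup above, drops the user straight into a session where the snapshot is ready to query.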



