
Generating Apache Iceberg Files from a Spark Test Case

聚变
Published: April 13, 2021

Modify the test code

org.apache.iceberg.examples.ReadAndWriteTablesTest#createUnpartitionedTable


Run the test case

org.apache.iceberg.examples.ReadAndWriteTablesTest#createUnpartitionedTable


@Test
public void createUnpartitionedTable() {
  table = tables.create(schema, pathToTable.toString());

  List<SimpleRecord> expected = Lists.newArrayList(
      new SimpleRecord(1, "a"),
      new SimpleRecord(2, "b"),
      new SimpleRecord(3, "c"));

  Dataset<Row> df = spark.createDataFrame(expected, SimpleRecord.class);

  df.select("id", "data").write()
      .format("iceberg")
      .mode("append")
      .save(pathToTable.toString());

  table.refresh();
}


The following files are generated:


v1.metadata.json: table metadata written when the table was created

v2.metadata.json: table metadata written when the data was appended

version-hint.text: the current version number
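version-hint.text is what drives metadata resolution for a file-system table: it holds a single version number N, and the current metadata file is then vN.metadata.json. A minimal sketch of that lookup (plain Java, not Iceberg's actual implementation; the class name is made up):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class VersionHint {
    // Mimic how the current metadata file of a Hadoop table is located:
    // read the version number N from version-hint.text, then open vN.metadata.json.
    static String currentMetadataFile(Path metadataDir) throws IOException {
        String version = Files.readString(metadataDir.resolve("version-hint.text")).trim();
        return "v" + version + ".metadata.json";
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("metadata");
        // After the test above, version-hint.text contains "2".
        Files.writeString(dir.resolve("version-hint.text"), "2");
        System.out.println(currentMetadataFile(dir)); // v2.metadata.json
    }
}
```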

v2.metadata.json points to the manifest list snap-126031204022468765-1-91b91a3e-8dda-43fb-8553-8ba9bf2a55f8.avro
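The manifest-list file name itself encodes the snapshot: it follows the pattern snap-<snapshot-id>-<attempt>-<commit-uuid>.avro, so the snapshot id 126031204022468765 can be read straight off the name. A small parsing sketch (it assumes only this naming convention; the class is hypothetical):

```java
public class SnapshotFileName {
    // Parse snap-<snapshotId>-<attempt>-<commitUUID>.avro.
    // The UUID part contains '-' itself, so split off only the first two fields.
    static long snapshotId(String fileName) {
        String[] parts = fileName.split("-", 3); // ["snap", snapshotId, rest]
        return Long.parseLong(parts[1]);
    }

    public static void main(String[] args) {
        String name = "snap-126031204022468765-1-91b91a3e-8dda-43fb-8553-8ba9bf2a55f8.avro";
        System.out.println(snapshotId(name)); // 126031204022468765
    }
}
```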


snap-126031204022468765-1-91b91a3e-8dda-43fb-8553-8ba9bf2a55f8.avro


{
  "manifest_path": "/var/folders/9_/wwhnv4dx3h724r6y8678sk_r0000gp/T/temp577147697930163376/metadata/91b91a3e-8dda-43fb-8553-8ba9bf2a55f8-m0.avro",
  "manifest_length": 5839,
  "partition_spec_id": 0,
  "added_snapshot_id": { "long": 126031204022468765 },
  "added_data_files_count": { "int": 2 },
  "existing_data_files_count": { "int": 0 },
  "deleted_data_files_count": { "int": 0 },
  "partitions": { "array": [] },
  "added_rows_count": { "long": 3 },
  "existing_rows_count": { "long": 0 },
  "deleted_rows_count": { "long": 0 }
}
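The counts in this entry aggregate over the data files recorded in the manifest it points to: the two Parquet files hold 1 and 2 records, which is where added_data_files_count = 2 and added_rows_count = 3 come from. A toy model of that bookkeeping (class and field names are illustrative, not Iceberg's API):

```java
import java.util.List;

public class ManifestCounts {
    record DataFile(String path, long recordCount) {}

    // Sum per-file record counts into the manifest-list entry's added_rows_count.
    static long addedRowsCount(List<DataFile> added) {
        return added.stream().mapToLong(DataFile::recordCount).sum();
    }

    public static void main(String[] args) {
        List<DataFile> added = List.of(
            new DataFile("00000-0-...-00001.parquet", 1),  // row (1, "a")
            new DataFile("00001-1-...-00001.parquet", 2)); // rows (2, "b"), (3, "c")
        System.out.println(added.size());          // added_data_files_count = 2
        System.out.println(addedRowsCount(added)); // added_rows_count = 3
    }
}
```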


91b91a3e-8dda-43fb-8553-8ba9bf2a55f8-m0.avro


{
  "status": 1,
  "snapshot_id": { "long": 126031204022468765 },
  "data_file": {
    "file_path": "/var/folders/9_/wwhnv4dx3h724r6y8678sk_r0000gp/T/temp577147697930163376/data/00000-0-eb15c143-1182-4666-9274-40fd3a4a4339-00001.parquet",
    "file_format": "PARQUET",
    "partition": {},
    "record_count": 1,
    "file_size_in_bytes": 611,
    "block_size_in_bytes": 67108864,
    "column_sizes": { "array": [{ "key": 1, "value": 51 }, { "key": 2, "value": 52 }] },
    "value_counts": { "array": [{ "key": 1, "value": 1 }, { "key": 2, "value": 1 }] },
    "null_value_counts": { "array": [{ "key": 1, "value": 0 }, { "key": 2, "value": 0 }] },
    "nan_value_counts": { "array": [] },
    "lower_bounds": { "array": [{ "key": 1, "value": "\u0001\u0000\u0000\u0000" }, { "key": 2, "value": "a" }] },
    "upper_bounds": { "array": [{ "key": 1, "value": "\u0001\u0000\u0000\u0000" }, { "key": 2, "value": "a" }] },
    "key_metadata": null,
    "split_offsets": { "array": [4] },
    "sort_order_id": { "int": 0 }
  }
}

{
  "status": 1,
  "snapshot_id": { "long": 126031204022468765 },
  "data_file": {
    "file_path": "/var/folders/9_/wwhnv4dx3h724r6y8678sk_r0000gp/T/temp577147697930163376/data/00001-1-47b21804-8617-4f6b-aad3-a10280e7324a-00001.parquet",
    "file_format": "PARQUET",
    "partition": {},
    "record_count": 2,
    "file_size_in_bytes": 610,
    "block_size_in_bytes": 67108864,
    "column_sizes": { "array": [{ "key": 1, "value": 53 }, { "key": 2, "value": 55 }] },
    "value_counts": { "array": [{ "key": 1, "value": 2 }, { "key": 2, "value": 2 }] },
    "null_value_counts": { "array": [{ "key": 1, "value": 0 }, { "key": 2, "value": 0 }] },
    "nan_value_counts": { "array": [] },
    "lower_bounds": { "array": [{ "key": 1, "value": "\u0002\u0000\u0000\u0000" }, { "key": 2, "value": "b" }] },
    "upper_bounds": { "array": [{ "key": 1, "value": "\u0003\u0000\u0000\u0000" }, { "key": 2, "value": "c" }] },
    "key_metadata": null,
    "split_offsets": { "array": [4] },
    "sort_order_id": { "int": 0 }
  }
}
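The lower_bounds and upper_bounds values are Iceberg's single-value binary serialization of the column min/max: an int column is stored as 4 little-endian bytes (so "\u0002\u0000\u0000\u0000" decodes to 2) and a string column as its UTF-8 bytes. Decoding the int bound with a ByteBuffer:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BoundDecode {
    // Iceberg stores an int bound as 4 little-endian bytes.
    static int decodeIntBound(byte[] raw) {
        return ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        // "\u0002\u0000\u0000\u0000" from lower_bounds of column 1 in the second file
        byte[] lower = {0x02, 0x00, 0x00, 0x00};
        System.out.println(decodeIntBound(lower)); // 2
    }
}
```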


Summary

How the concepts in the official documentation map to the generated files


Other notes

Converting Avro to JSON

java -jar avro-tools-1.19.2.jar tojson ~/study/study-code/iceberg/tmp/temp577147697930163376/metadata/91b91a3e-8dda-43fb-8553-8ba9bf2a55f8-m0.avro
