写点什么

orc file format

作者:Downal
  • 2023-09-18
    北京
  • 本文字数:5625 字

    阅读完需:约 18 分钟

参考官方文档:https://orc.apache.org/specification/ORCv1/

ORC 文件可以大致分为三个部分:Header、Body、Tail。

  • Header 由字节 "ORC" 组成,以表明是 ORC 文件

  • Body 包含 rows 和 indexes

  • Tail 包含 file level information

File Tail

Postscript

Postscript包含一些必要的信息,

  • Footer的长度

  • metadata 部分

  • 文件的 version

  • 使用的压缩方式(eg. none, zlib or snappy)


Postscript不会被压缩,处于文件末尾的1 byte之前的位置,这个1 byte存储Posscript的长度,因此Postscript的大小不会超过28 bytes

解析了Postscript,被压缩的Footer的长度就知道了,那Footer也就可以被解压缩并解析了。

message PostScript { // the length of the footer section in bytes optional uint64 footerLength = 1; // the kind of generic compression used optional CompressionKind compression = 2; // the maximum size of each compression chunk optional uint64 compressionBlockSize = 3; // the version of the writer repeated uint32 version = 4 [packed = true]; // the length of the metadata section in bytes optional uint64 metadataLength = 5; // the fixed string "ORC" optional string magic = 8000;}
复制代码


enum CompressionKind { NONE = 0; ZLIB = 1; SNAPPY = 2; LZO = 3; LZ4 = 4; ZSTD = 5;}
复制代码

Footer

Footer包括:

  • ORC 文件的 Body 部分的 layout

  • schema 信息的 type

  • 行数

  • 每一 column 的统计信息

message Footer { // the length of the file header in bytes (always 3) optional uint64 headerLength = 1; // the length of the file header and body in bytes optional uint64 contentLength = 2; // the information about the stripes repeated StripeInformation stripes = 3; // the schema information repeated Type types = 4; // the user metadata that was added repeated UserMetadataItem metadata = 5; // the total number of rows in the file optional uint64 numberOfRows = 6; // the statistics of each column across the file repeated ColumnStatistics statistics = 7; // the maximum number of rows in each index entry optional uint32 rowIndexStride = 8; // Each implementation that writes ORC files should register for a code // 0 = ORC Java // 1 = ORC C++ // 2 = Presto // 3 = Scritchley Go from https://github.com/scritchley/orc // 4 = Trino optional uint32 writer = 9; // information about the encryption in this file optional Encryption encryption = 10; // the number of bytes in the encrypted stripe statistics optional uint64 stripeStatisticsLength = 11;}
复制代码

Stripe Information

ORC 文件的 Body 部分被分成 stripes,每个stripe都是自包含的,并且可以只使用它自己的字节加上文件的FooterPostscript来读取。每个stripe只包含完整的 rows,这样 rows 就不会跨越stripe的边界。

stripe有三个部分:

  • indexes for rows

  • data

  • stripe footer

indexdata都按 column 划分,因此只需要读取所需 column 的数据。

message StripeInformation { // the start of the stripe within the file optional uint64 offset = 1; // the length of the indexes in bytes optional uint64 indexLength = 2; // the length of the data in bytes optional uint64 dataLength = 3; // the length of the footer in bytes optional uint64 footerLength = 4; // the number of rows in the stripe optional uint64 numberOfRows = 5; // If this is present, the reader should use this value for the encryption // stripe id for setting the encryption IV. Otherwise, the reader should // use one larger than the previous stripe's encryptStripeId. // For unmerged ORC files, the first stripe will use 1 and the rest of the // stripes won't have it set. For merged files, the stripe information // will be copied from their original files and thus the first stripe of // each of the input files will reset it to 1. // Note that 1 was choosen, because protobuf v3 doesn't serialize // primitive types that are the default (eg. 0). optional uint64 encryptStripeId = 6; // For each encryption variant, the new encrypted local key to use until we // find a replacement. repeated bytes encryptedLocalKeys = 7;}
复制代码

Type Information

ORC 文件中的所有行必须具有相同的 schema。逻辑上,schema 表示为如下图所示的树,其中 compound types(复合类型)下面有 子列(subcolumns)。



等价的 Hive DDL:

create table Foobar ( myInt int, myMap map<string, struct<myString : string, myDouble: double>>, myTime timestamp);
复制代码

通过 先序遍历,类型树被平铺成一个列表,其中每个 type 被分配下一个id。显然,type tree 的 root 总是 type id 为 0。复合类型有一个名为 subtypes 的字段,其中包含其子类型 id 的列表。

message Type { enum Kind { BOOLEAN = 0; BYTE = 1; SHORT = 2; INT = 3; LONG = 4; FLOAT = 5; DOUBLE = 6; STRING = 7; BINARY = 8; TIMESTAMP = 9; LIST = 10; MAP = 11; STRUCT = 12; UNION = 13; DECIMAL = 14; DATE = 15; VARCHAR = 16; CHAR = 17; TIMESTAMP_INSTANT = 18; } // the kind of this type required Kind kind = 1; // the type ids of any subcolumns for list, map, struct, or union repeated uint32 subtypes = 2 [packed=true]; // the list of field names for struct repeated string fieldNames = 3; // the maximum length of the type for varchar or char in UTF-8 characters optional uint32 maximumLength = 4; // the precision and scale for decimal optional uint32 precision = 5; optional uint32 scale = 6;}
复制代码

Column Statistics

对于每个 column,writer 记录计数(count),并根据类型记录其他有用的字段。

  • 对于大多数基本类型,它记录 min、max

  • 对于数字类型,它记录 min、max、sum

从 Hive 1.1.0 开始,通过设置hasNull标志,列统计信息还将记录行组中是否有空值。ORC 的谓词下推使用hasNull标志来更好地解决'IS NULL'查询。

message ColumnStatistics { // the number of values optional uint64 numberOfValues = 1; // At most one of these has a value for any column optional IntegerStatistics intStatistics = 2; optional DoubleStatistics doubleStatistics = 3; optional StringStatistics stringStatistics = 4; optional BucketStatistics bucketStatistics = 5; optional DecimalStatistics decimalStatistics = 6; optional DateStatistics dateStatistics = 7; optional BinaryStatistics binaryStatistics = 8; optional TimestampStatistics timestampStatistics = 9; optional bool hasNull = 10;}
复制代码

For integer types

对于整数类型(tinyint、smallint、int、bigint),列统计信息包括 min、max 和 sum。如果 sum 在计算过程中的任何一点长时间溢出,则不记录 sum。

message IntegerStatistics { optional sint64 minimum = 1; optional sint64 maximum = 2; optional sint64 sum = 3;}
复制代码

For floating point types

对于浮点类型(float、double),列统计信息包括 min、max 和 sum。如果 sum 超过 double,则不记录 sum。

message DoubleStatistics { optional double minimum = 1; optional double maximum = 2; optional double sum = 3;}
复制代码

For string type

记录 min、max 和 值的长度之和。

message StringStatistics { optional string minimum = 1; optional string maximum = 2; // sum will store the total length of all strings optional sint64 sum = 3;}
复制代码

For boolean type

。。。


For decimal type

。。。


For decimal type


For Date type


For Timestamp type


For Binary type


。。。

File Metadata

文件元数据部分包含 stripe level 粒度的列统计信息。这些统计信息,基于每个 stripe 计算的谓词下推,来消除 input split。

message StripeStatistics { repeated ColumnStatistics colStats = 1;}
复制代码


message Metadata { repeated StripeStatistics stripeStats = 1;}
复制代码

User Metadata


Column Encryption

列加密🔐


Data Masks


Encryption Keys


Encryption Variants


Stream Encryption


Compression


Run Length Encoding


Stripes

ORC 文件的 Body 部分由一系列的stripes组成。stripe很大(通常为 200MB),彼此独立,通常由不同的任务处理。列式存储格式的定义特征是,每列的数据都是单独存储的,从文件中读取的数据应该与读取的列数成比例。


在 ORC 文件中,每个 column 存储在文件中相邻存储的几个stream中。

例如,一个整数列被表示为两个streams

  • PRESENT:如果值是非空的,它使用一个记录每个值的 bit

  • DATA:则记录非空值。

如果stripe中所有列的值都是非空的,则从条带中省略PRESENTstream。对于二进制数据,ORC 使用PRESENTDATALENGTH三个 stream,它们存储每个值的长度。每种类型的详细信息将在以下小节中介绍。


每个stripe的 layout 大致为:

  • indexes streams

  • unencrypted

  • encryption variant 1..N

  • data streams

  • unencrypted

  • encryption variant 1..N

  • stripe footer

There is a general order for index and data streams:

  • Index streams are always placed together in the beginning of the stripe.

  • Data streams are placed together after index streams (if any).

  • Inside index streams or data streams, the unencrypted streams should be placed first and then followed by streams grouped by each encryption variant.

There is no fixed order within each unencrypted or encryption variant in the index and data streams:

  • Different stream kinds of the same column can be placed in any order.

  • Streams from different columns can even be placed in any order. To get the precise information (a.k.a stream kind, column id and location) of a stream within a stripe, the streams field in the StripeFooter described below is the single source of truth.

In the example of the integer column mentioned above, the order of the PRESENT stream and the DATA stream cannot be determined in advance. We need to get the precise information by StripeFooter.

Stripe Footer

包括:

  • 每一个 column 的编码

  • streams 的目录、位置

message StripeFooter { // the location of each stream repeated Stream streams = 1; // the encoding of each column repeated ColumnEncoding columns = 2; optional string writerTimezone = 3; // one for each column encryption variant repeated StripeEncryptionVariant encryption = 4;}
复制代码


如果文件包含加密的列,那么这些 stream 和 clomun encodings 将根据 encryption variant 单独存储在StripeEncryptionVariant中。此外,StripeFooter将包含两个额外的虚拟 stream:ENCRYPTED_INDEXENCRYPTED_DATA,它们分配加密变量用于存储加密索引和数据流的空间。

message StripeEncryptionVariant {  repeated Stream streams = 1;  repeated ColumnEncoding encoding = 2;}
复制代码


为了描述每个 stream,ORC 存储了

  • stream 的种类

  • column id

  • stream 的字节数大小

对于每个 stream 中存储的内容,取决于 column 的类型和编码。

message Stream { enum Kind {   // boolean stream of whether the next value is non-null   PRESENT = 0;   // the primary data stream   DATA = 1;   // the length of each value for variable length data   LENGTH = 2;   // the dictionary blob   DICTIONARY_DATA = 3;   // deprecated prior to Hive 0.11   // It was used to store the number of instances of each value in the   // dictionary   DICTIONARY_COUNT = 4;   // a secondary data stream   SECONDARY = 5;   // the index for seeking to particular row groups   ROW_INDEX = 6;   // original bloom filters used before ORC-101   BLOOM_FILTER = 7;   // bloom filters that consistently use utf8   BLOOM_FILTER_UTF8 = 8;
// Virtual stream kinds to allocate space for encrypted index and data. ENCRYPTED_INDEX = 9; ENCRYPTED_DATA = 10;
// stripe statistics streams STRIPE_STATISTICS = 100; // A virtual stream kind that is used for setting the encryption IV. FILE_STATISTICS = 101; } required Kind kind = 1; // the column id optional uint32 column = 2; // the number of bytes in the file optional uint64 length = 3;}
复制代码


根据它们的类型,可以有多种编码选项。编码被分为直接类别或基于字典的类别,并进一步细化它们是否使用 RLE v1 还是 v2。

message ColumnEncoding { enum Kind { // the encoding is mapped directly to the stream using RLE v1 DIRECT = 0; // the encoding uses a dictionary of unique values using RLE v1 DICTIONARY = 1; // the encoding is direct using RLE v2 DIRECT_V2 = 2; // the encoding is dictionary-based using RLE v2 DICTIONARY_V2 = 3; } required Kind kind = 1; // for dictionary encodings, record the size of the dictionary optional uint32 dictionarySize = 2;}
复制代码


Column Encodings


Indexes

Row Group Index


Bloom Filter Index

发布于: 刚刚阅读数: 6
用户头像

Downal

关注

还未添加个人签名 2020-05-18 加入

还未添加个人简介

评论

发布
暂无评论
orc file format_orc_Downal_InfoQ写作社区