参考官方文档:https://orc.apache.org/specification/ORCv1/
ORC 文件可以大致分为三个部分:Header、Body、Tail。
File Tail
Postscript
Postscript
包含一些必要的信息,
Postscript
不会被压缩,处于文件末尾的1 byte
之前的位置,这个1 byte
存储Posscript
的长度,因此Postscript
的大小不会超过28 bytes
解析了Postscript
,被压缩的Footer
的长度就知道了,那Footer
也就可以被解压缩并解析了。
message PostScript {
// the length of the footer section in bytes
optional uint64 footerLength = 1;
// the kind of generic compression used
optional CompressionKind compression = 2;
// the maximum size of each compression chunk
optional uint64 compressionBlockSize = 3;
// the version of the writer
repeated uint32 version = 4 [packed = true];
// the length of the metadata section in bytes
optional uint64 metadataLength = 5;
// the fixed string "ORC"
optional string magic = 8000;
}
复制代码
enum CompressionKind {
NONE = 0;
ZLIB = 1;
SNAPPY = 2;
LZO = 3;
LZ4 = 4;
ZSTD = 5;
}
复制代码
Footer
Footer
包括:
ORC 文件的 Body 部分的 layout
schema 信息的 type
行数
每一 column 的统计信息
message Footer {
// the length of the file header in bytes (always 3)
optional uint64 headerLength = 1;
// the length of the file header and body in bytes
optional uint64 contentLength = 2;
// the information about the stripes
repeated StripeInformation stripes = 3;
// the schema information
repeated Type types = 4;
// the user metadata that was added
repeated UserMetadataItem metadata = 5;
// the total number of rows in the file
optional uint64 numberOfRows = 6;
// the statistics of each column across the file
repeated ColumnStatistics statistics = 7;
// the maximum number of rows in each index entry
optional uint32 rowIndexStride = 8;
// Each implementation that writes ORC files should register for a code
// 0 = ORC Java
// 1 = ORC C++
// 2 = Presto
// 3 = Scritchley Go from https://github.com/scritchley/orc
// 4 = Trino
optional uint32 writer = 9;
// information about the encryption in this file
optional Encryption encryption = 10;
// the number of bytes in the encrypted stripe statistics
optional uint64 stripeStatisticsLength = 11;
}
复制代码
Stripe Information
ORC 文件的 Body 部分被分成 stripes,每个stripe
都是自包含的,并且可以只使用它自己的字节加上文件的Footer
和Postscript
来读取。每个stripe
只包含完整的 rows,这样 rows 就不会跨越stripe
的边界。
stripe
有三个部分:
indexes
for rows
data
stripe footer
index
和data
都按 column 划分,因此只需要读取所需 column 的数据。
message StripeInformation {
// the start of the stripe within the file
optional uint64 offset = 1;
// the length of the indexes in bytes
optional uint64 indexLength = 2;
// the length of the data in bytes
optional uint64 dataLength = 3;
// the length of the footer in bytes
optional uint64 footerLength = 4;
// the number of rows in the stripe
optional uint64 numberOfRows = 5;
// If this is present, the reader should use this value for the encryption
// stripe id for setting the encryption IV. Otherwise, the reader should
// use one larger than the previous stripe's encryptStripeId.
// For unmerged ORC files, the first stripe will use 1 and the rest of the
// stripes won't have it set. For merged files, the stripe information
// will be copied from their original files and thus the first stripe of
// each of the input files will reset it to 1.
// Note that 1 was choosen, because protobuf v3 doesn't serialize
// primitive types that are the default (eg. 0).
optional uint64 encryptStripeId = 6;
// For each encryption variant, the new encrypted local key to use until we
// find a replacement.
repeated bytes encryptedLocalKeys = 7;
}
复制代码
Type Information
ORC 文件中的所有行必须具有相同的 schema。逻辑上,schema 表示为如下图所示的树,其中 compound types(复合类型)下面有 子列(subcolumns)。
等价的 Hive DDL:
create table Foobar (
myInt int,
myMap map<string,
struct<myString : string,
myDouble: double>>,
myTime timestamp
);
复制代码
通过 先序遍历,类型树被平铺成一个列表,其中每个 type 被分配下一个id
。显然,type tree 的 root 总是 type id 为 0。复合类型有一个名为 subtypes 的字段,其中包含其子类型 id 的列表。
message Type {
enum Kind {
BOOLEAN = 0;
BYTE = 1;
SHORT = 2;
INT = 3;
LONG = 4;
FLOAT = 5;
DOUBLE = 6;
STRING = 7;
BINARY = 8;
TIMESTAMP = 9;
LIST = 10;
MAP = 11;
STRUCT = 12;
UNION = 13;
DECIMAL = 14;
DATE = 15;
VARCHAR = 16;
CHAR = 17;
TIMESTAMP_INSTANT = 18;
}
// the kind of this type
required Kind kind = 1;
// the type ids of any subcolumns for list, map, struct, or union
repeated uint32 subtypes = 2 [packed=true];
// the list of field names for struct
repeated string fieldNames = 3;
// the maximum length of the type for varchar or char in UTF-8 characters
optional uint32 maximumLength = 4;
// the precision and scale for decimal
optional uint32 precision = 5;
optional uint32 scale = 6;
}
复制代码
Column Statistics
对于每个 column,writer 记录计数(count),并根据类型记录其他有用的字段。
对于大多数基本类型,它记录 min、max
对于数字类型,它记录 min、max、sum
从 Hive 1.1.0 开始,通过设置hasNull
标志,列统计信息还将记录行组中是否有空值。ORC 的谓词下推使用hasNull
标志来更好地解决'IS NULL'
查询。
message ColumnStatistics {
// the number of values
optional uint64 numberOfValues = 1;
// At most one of these has a value for any column
optional IntegerStatistics intStatistics = 2;
optional DoubleStatistics doubleStatistics = 3;
optional StringStatistics stringStatistics = 4;
optional BucketStatistics bucketStatistics = 5;
optional DecimalStatistics decimalStatistics = 6;
optional DateStatistics dateStatistics = 7;
optional BinaryStatistics binaryStatistics = 8;
optional TimestampStatistics timestampStatistics = 9;
optional bool hasNull = 10;
}
复制代码
For integer
types
对于整数类型(tinyint、smallint、int、bigint),列统计信息包括 min、max 和 sum。如果 sum 在计算过程中的任何一点长时间溢出,则不记录 sum。
message IntegerStatistics {
optional sint64 minimum = 1;
optional sint64 maximum = 2;
optional sint64 sum = 3;
}
复制代码
For floating point
types
对于浮点类型(float、double),列统计信息包括 min、max 和 sum。如果 sum 超过 double,则不记录 sum。
message DoubleStatistics {
optional double minimum = 1;
optional double maximum = 2;
optional double sum = 3;
}
复制代码
For string
type
记录 min、max 和 值的长度之和。
message StringStatistics {
optional string minimum = 1;
optional string maximum = 2;
// sum will store the total length of all strings
optional sint64 sum = 3;
}
复制代码
For boolean
type
。。。
For decimal
type
。。。
For decimal
type
For Date
type
For Timestamp
type
For Binary
type
。。。
File Metadata
文件元数据部分包含 stripe level 粒度的列统计信息。这些统计信息,基于每个 stripe 计算的谓词下推,来消除 input split。
message StripeStatistics {
repeated ColumnStatistics colStats = 1;
}
复制代码
message Metadata {
repeated StripeStatistics stripeStats = 1;
}
复制代码
User Metadata
Column Encryption
列加密🔐
Data Masks
Encryption Keys
Encryption Variants
Stream Encryption
Compression
Run Length Encoding
Stripes
ORC 文件的 Body 部分由一系列的stripes
组成。stripe
很大(通常为 200MB),彼此独立,通常由不同的任务处理。列式存储格式的定义特征是,每列的数据都是单独存储的,从文件中读取的数据应该与读取的列数成比例。
在 ORC 文件中,每个 column 存储在文件中相邻存储的几个stream
中。
例如,一个整数列被表示为两个streams
如果stripe
中所有列的值都是非空的,则从条带中省略PRESENT
stream。对于二进制数据,ORC 使用PRESENT
、DATA
和LENGTH
三个 stream,它们存储每个值的长度。每种类型的详细信息将在以下小节中介绍。
每个stripe
的 layout 大致为:
indexes streams
unencrypted
encryption variant 1..N
data streams
unencrypted
encryption variant 1..N
stripe footer
There is a general order for index and data streams:
Index streams are always placed together in the beginning of the stripe.
Data streams are placed together after index streams (if any).
Inside index streams or data streams, the unencrypted streams should be placed first and then followed by streams grouped by each encryption variant.
There is no fixed order within each unencrypted or encryption variant in the index and data streams:
Different stream kinds of the same column can be placed in any order.
Streams from different columns can even be placed in any order. To get the precise information (a.k.a stream kind, column id and location) of a stream within a stripe, the streams field in the StripeFooter described below is the single source of truth.
In the example of the integer column mentioned above, the order of the PRESENT stream and the DATA stream cannot be determined in advance. We need to get the precise information by StripeFooter.
Stripe Footer
包括:
每一个 column 的编码
streams 的目录、位置
message StripeFooter {
// the location of each stream
repeated Stream streams = 1;
// the encoding of each column
repeated ColumnEncoding columns = 2;
optional string writerTimezone = 3;
// one for each column encryption variant
repeated StripeEncryptionVariant encryption = 4;
}
复制代码
如果文件包含加密的列,那么这些 stream 和 clomun encodings 将根据 encryption variant 单独存储在StripeEncryptionVariant
中。此外,StripeFooter
将包含两个额外的虚拟 stream:ENCRYPTED_INDEX
和ENCRYPTED_DATA
,它们分配加密变量用于存储加密索引和数据流的空间。
message StripeEncryptionVariant {
repeated Stream streams = 1;
repeated ColumnEncoding encoding = 2;
}
复制代码
为了描述每个 stream,ORC 存储了
stream 的种类
column id
stream 的字节数大小
对于每个 stream 中存储的内容,取决于 column 的类型和编码。
message Stream {
enum Kind {
// boolean stream of whether the next value is non-null
PRESENT = 0;
// the primary data stream
DATA = 1;
// the length of each value for variable length data
LENGTH = 2;
// the dictionary blob
DICTIONARY_DATA = 3;
// deprecated prior to Hive 0.11
// It was used to store the number of instances of each value in the
// dictionary
DICTIONARY_COUNT = 4;
// a secondary data stream
SECONDARY = 5;
// the index for seeking to particular row groups
ROW_INDEX = 6;
// original bloom filters used before ORC-101
BLOOM_FILTER = 7;
// bloom filters that consistently use utf8
BLOOM_FILTER_UTF8 = 8;
// Virtual stream kinds to allocate space for encrypted index and data.
ENCRYPTED_INDEX = 9;
ENCRYPTED_DATA = 10;
// stripe statistics streams
STRIPE_STATISTICS = 100;
// A virtual stream kind that is used for setting the encryption IV.
FILE_STATISTICS = 101;
}
required Kind kind = 1;
// the column id
optional uint32 column = 2;
// the number of bytes in the file
optional uint64 length = 3;
}
复制代码
根据它们的类型,可以有多种编码选项。编码被分为直接类别或基于字典的类别,并进一步细化它们是否使用 RLE v1 还是 v2。
message ColumnEncoding {
enum Kind {
// the encoding is mapped directly to the stream using RLE v1
DIRECT = 0;
// the encoding uses a dictionary of unique values using RLE v1
DICTIONARY = 1;
// the encoding is direct using RLE v2
DIRECT_V2 = 2;
// the encoding is dictionary-based using RLE v2
DICTIONARY_V2 = 3;
}
required Kind kind = 1;
// for dictionary encodings, record the size of the dictionary
optional uint32 dictionarySize = 2;
}
复制代码
Column Encodings
Indexes
Row Group Index
Bloom Filter Index
评论