How to Estimate the Space Occupied by a Java String

Published: October 22, 2020
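Question
When using strings in Java, how do you accurately estimate how much space they occupy?

Conclusion
Estimating the space a String occupies depends on the JDK version and on where the string is stored (in the JVM heap or on disk).
On JDK 8 and earlier, strings in the JVM heap are stored as UTF-16, so each character takes 2 bytes (BMP characters) or 4 bytes (supplementary characters). When a string is serialized to a file on disk, its size depends on the encoding in use. That is usually UTF-8, where a character takes 1 to 4 bytes.
Our typical key/value strings contain only ASCII characters, so each character takes 2 bytes in the JVM heap but only 1 byte on disk, a 50% saving.
JDK 9 and later enable compact strings by default: a string whose characters all fall within ISO-8859-1 (0x0000-0x00FF) is stored with 1 byte per character, and any other string is still stored as UTF-16.
Java is also considering making UTF-8 the default charset in a future release (the internal representation would remain UTF-16).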
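The rules in this conclusion can be turned into a rough estimator. The sketch below is illustrative only: the class and method names are made up for this example, it counts only the character payload (object and array headers are ignored), and the JDK 9+ rule assumes compact strings are left enabled.

```java
import java.nio.charset.StandardCharsets;

class StringSizeEstimate {
    // JDK 8 and earlier: every char (UTF-16 code unit) takes 2 bytes.
    static long heapPayloadJdk8(String s) {
        return 2L * s.length();
    }

    // JDK 9+ with compact strings: 1 byte per char if every char fits in
    // ISO-8859-1 (<= 0xFF), otherwise the whole string takes 2 bytes per char.
    static long heapPayloadJdk9(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return 2L * s.length();
            }
        }
        return s.length();
    }

    // Bytes produced when the text is encoded as UTF-8, e.g. for a file.
    static long utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+6728 is a CJK character, written as an escape to stay encoding-neutral.
        String[] samples = {"user_id", "\u6728", "user_\u6728"};
        for (String s : samples) {
            System.out.println(s.length() + " chars: jdk8=" + heapPayloadJdk8(s)
                    + " jdk9=" + heapPayloadJdk9(s) + " utf8=" + utf8Bytes(s));
        }
    }
}
```

For an ASCII key such as "user_id" this gives 14 bytes on JDK 8, 7 bytes on JDK 9+, and 7 bytes in UTF-8. A single CJK character anywhere in the string puts the whole string back at 2 bytes per char in the heap.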
﻿
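Further notes - Unicode
When Unicode published its first version in 1991, 16 bits were assumed to be enough for every character in the world, so it allowed for 2^16 code points (0x0000 to 0xFFFF). That turned out to be insufficient, and the second version, published in 1996, dropped the 16-bit limit and extended the range to 0x10FFFF.
Unicode is a character set that assigns a number, called a [code point], to each character. Every code point fits in an int (only the low 21 bits are used and the high 11 bits are always 0), which is why Java represents code points as int and why a char can be widened to an int.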
wiki：Unicode takes the role of providing a unique code point—a number, not a glyph—for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font, or style) to other software, such as a web browser or word processor
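Unicode divides the code points into 17 code planes, numbered #0 to #16, each holding 65,536 (2^16) code points. Plane #0 is called the Basic Multilingual Plane (BMP) and the others are called supplementary planes. Most of the letters and Chinese characters we use are in the BMP (0x0000 to 0xFFFF).
    - 0x0000 to 0x007F: ASCII, 128 characters, occupying the first 128 code points of the BMP
    - 0x0000 to 0x00FF: ISO-8859-1, 256 characters, occupying the first 256 code points of the BMP (0x0080 to 0x00FF is the part beyond ASCII)
    - 0xD800 to 0xDBFF are the 1,024 high-surrogate code points and 0xDC00 to 0xDFFF are the 1,024 low-surrogate code points. These 2,048 code points are not valid characters; they are reserved for UTF-16. A high surrogate followed by a low surrogate forms a surrogate pair, which UTF-16 uses to encode a code point outside the BMP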
javadoc : The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters.
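Unicode also has the notion of a code unit, the smallest unit of representation in an encoding form. In UTF-16 one code unit corresponds to one Java char, so a supplementary character in Java is a single code point but two chars, that is, two code units.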
wiki : The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form
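To make the relationship between planes, code points, and UTF-16 code units concrete, here is a small sketch of the surrogate-pair arithmetic. The class and method names are invented for this example; the JDK already ships equivalents (Character.highSurrogate, Character.lowSurrogate, Character.toCodePoint), which the main method uses as a cross-check.

```java
class SurrogateMath {
    // Each plane holds 0x10000 code points, so the plane is the part above the low 16 bits.
    static int plane(int codePoint) {
        return codePoint >>> 16;
    }

    // High surrogate: 0xD800 plus the top 10 bits of (codePoint - 0x10000).
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low surrogate: 0xDC00 plus the bottom 10 bits of (codePoint - 0x10000).
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x2F81A; // a supplementary CJK compatibility ideograph
        System.out.println(plane(cp));                     // 2
        System.out.println(Integer.toHexString(high(cp))); // d87e
        System.out.println(Integer.toHexString(low(cp)));  // dc1a
        // Cross-check against the JDK's own helpers.
        System.out.println(high(cp) == Character.highSurrogate(cp)); // true
        System.out.println(low(cp) == Character.lowSurrogate(cp));   // true
        System.out.println(Character.toCodePoint(high(cp), low(cp)) == cp); // true
    }
}
```

The high() and low() helpers are only meaningful for code points at or above 0x10000; BMP code points need no surrogates.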
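Before Unicode, a character set was tied to a specific encoding scheme. ASCII, for example, specifies that the ASCII character set is encoded in 7 bits. Unicode was designed to separate the character set from the character encoding scheme: every character has a unique code point in the Unicode character set, but the final byte stream is determined by the specific character encoding. Encoding the Unicode character "A", for example, gives the byte 0x41 in UTF-8 and the bytes 0x00 0x41 in UTF-16 (big-endian).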
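// These statements can be pasted into jshell as-is
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// The code point of the Chinese character "木" is 0x6728; UTF-8 encodes it as the 3 bytes E6 9C A8
char woodChar = '木';
int codePoint = woodChar; // a char widens to an int code point
System.out.println(codePoint); // 26408
System.out.println(Character.charCount(codePoint)); // 1
System.out.println(Integer.toHexString(woodChar)); // 6728
// The byte stream depends on the encoding scheme
System.out.println(Arrays.toString(String.valueOf(woodChar).getBytes(StandardCharsets.UTF_8))); // [-26, -100, -88]
System.out.println(String.valueOf(woodChar).charAt(0)); // 木; charAt returns the JVM's internal representation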
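A character encoding has to balance storage space, compatibility with other character sets, and so on. Common encodings of the Unicode character set are UCS-2, UTF-8 and UTF-16. Each is an algorithm that maps Unicode code points to byte sequences of a defined size for transmission and storage.
In practice the line between a character set and a character encoding is often blurred. java.nio.charset.StandardCharsets, for example, lists UTF-8 and UTF-16 as charsets alongside US-ASCII and ISO-8859-1. ASCII and GBK each come with their own encoding scheme, so they can be regarded as both a character set and an encoding scheme. Unicode is the exception: it cannot be used to refer to an encoding scheme.
java.nio.charset.StandardCharsets offers three UTF-16 variants: UTF_16BE, UTF_16LE and UTF_16. They differ in byte order (endianness), which a BOM (byte order mark) can signal. The terms big-endian and little-endian come from Gulliver's Travels: an egg has a big end and a small end, and the Lilliputians disagreed about which end to crack first. In the same way, the computing world disagrees about whether a multi-byte word (several bytes that together represent one value) should be sent high-order byte first (big-endian) or low-order byte first (little-endian). Network protocols generally use big-endian order.
BOM conventions are rather messy. As a rule of thumb, files created on Windows often start with a BOM, and UTF-16 data is more likely to carry one than UTF-8.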
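System.out.println("a".getBytes(StandardCharsets.UTF_16).length); // 4: the first two bytes are the BOM
// [-2, -1, 0, 97]: -2 is 0xFE and -1 is 0xFF, the default (big-endian) BOM
// 0xFF is 255, which becomes -1 as a signed byte; likewise 0xFE becomes -2
System.out.println(Arrays.toString("a".getBytes(StandardCharsets.UTF_16)));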
According to the Unicode Standard (v6.2, p. 30), a BOM is neither required nor recommended for UTF-8.
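To see the byte-order variants side by side, the following sketch encodes the same one-letter string with each charset constant and prints the bytes in hex. The class name and the encodeHex helper are made up for this example; only the StandardCharsets constants come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Encodes the text with the given charset and renders the bytes as upper-case hex.
    static String encodeHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeHex("a", StandardCharsets.UTF_16BE)); // 00 61
        System.out.println(encodeHex("a", StandardCharsets.UTF_16LE)); // 61 00
        System.out.println(encodeHex("a", StandardCharsets.UTF_16));   // FE FF 00 61
        System.out.println(encodeHex("a", StandardCharsets.UTF_8));    // 61
    }
}
```

Only the plain UTF_16 encoder writes a BOM. The BE and LE variants fix the byte order in the charset name, so they write none, and Java's UTF_8 encoder writes none either.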
﻿
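Further notes - java.lang.Character
The first version of Java, released in 1996, was based on the original Unicode specification and defined char as a fixed-width 16-bit type. Later JDKs tracked the evolution of Unicode, but that original decision could not be changed. Java 8 is based on Unicode 6.2 and Java 11 on Unicode 10, while Unicode itself has moved on to version 12 and beyond.
What constrains the evolution of Java's char is not the encoding but the bit width. Java originally used UCS-2 and later switched to UTF-16. The range a single char can represent is still what UCS-2 originally covered: the encoding changed, but one 16-bit char can only express a subset of UTF-16. Because the code unit stayed 16 bits wide, the switch had little impact on the internal implementation or the public API. Switching to UTF-8, by contrast, would change the behavior of a large part of the JDK and of third-party text-processing libraries, with very wide impact. So the JDK can only improve performance by optimizing how String is represented; it cannot undo the original decision.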
javadoc : The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value
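Because char is fixed at 16 bits, it cannot represent Unicode supplementary characters. JDK 5 therefore changed how char sequences are interpreted and added support for supplementary characters.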
blog : 
1, In the end, the decision was for a tiered approach: Use the primitive type int to represent code points in low-level APIs, such as the static methods of the Character class. Interpret char sequences in all forms as UTF-16 sequences, and promote their use in higher-level APIs. Provide APIs to easily convert between various char and code point-based representations.
2, The Java Language Specification specifies that all Unicode letters and digits can be used in identifiers. So the Java Language Specification was updated to refer to new code point-based methods to define the legal characters in identifiers. The javac compiler and other tools that need to detect identifiers were changed to use these new methods.
3, Only where applications interpret individual characters themselves, pass individual characters to Java platform APIs, or call methods that return individual characters, and these character can be supplementary characters, does the application have to be changed
4, You might wonder whether it's better to convert all text into code point representation (say, an int[]) and process it in that representation, or whether it's better to stick with char sequences most of the time and only convert to code points when needed. Well, the Java platform APIs in general certainly have a preference for char sequences, and using them will also save memory space.
Java uses two chars to represent a non-BMP character. Both chars come from ranges reserved in the BMP specifically for representing non-BMP characters.
javadoc : The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF)
﻿
java.lang.Character provides many methods for working with code points and code units, which are the core concepts in handling the Unicode character set.
  javadoc : In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding
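As a minimal sketch of those java.lang.Character helpers, the example below converts code points to code units and back. The class and the two wrapper methods are invented for illustration; the Character calls themselves (toChars, isHighSurrogate, isLowSurrogate, toCodePoint, isSupplementaryCodePoint) are standard JDK methods.

```java
class CharacterApiDemo {
    // Splits a code point into UTF-16 code units: one char for BMP, two for supplementary.
    static char[] toUnits(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Rebuilds the code point from the char array produced by toUnits.
    static int fromUnits(char[] units) {
        if (units.length == 2
                && Character.isHighSurrogate(units[0])
                && Character.isLowSurrogate(units[1])) {
            return Character.toCodePoint(units[0], units[1]);
        }
        return units[0];
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x6728, 0x2F81A};
        for (int cp : samples) {
            char[] units = toUnits(cp);
            System.out.printf("U+%04X: %d code unit(s), supplementary=%b, round trip ok=%b%n",
                    cp, units.length, Character.isSupplementaryCodePoint(cp), fromUnits(units) == cp);
        }
    }
}
```

Only U+2F81A needs two code units; the ASCII letter and the BMP CJK character each fit in a single char.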
 
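Further notes - java.lang.String
String is implemented on top of a char array (a byte array since JDK 9, with the same API), so its internal encoding and representation are shaped by char. For example, every index in String counts code units, not characters; length() returns the number of code units, not code points; and codePointCount() returns the number of characters (code points).
length() javadoc: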
Returns the length of this string. The length is equal to the number of Unicode code units in the string.
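Java uses UTF-16 internally, so for BMP characters one code point is exactly one code unit, and only supplementary characters need two. In other encodings, such as UTF-8, the number of code units per code point varies much more.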
 A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs. Index values refer to char code units, so a supplementary character uses two positions in a String.
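The difference between counting code units and counting code points shows up as soon as a string contains a supplementary character. The sketch below uses a made-up class name and a sample string containing U+2F81A; every method called on String is part of the standard API.

```java
class StringIndexDemo {
    // "a", then U+2F81A written as a surrogate pair, then "b"
    static final String TEXT = "a\uD87E\uDC1Ab";

    public static void main(String[] args) {
        System.out.println(TEXT.length());                            // 4 code units
        System.out.println(TEXT.codePointCount(0, TEXT.length()));    // 3 code points
        System.out.println(Integer.toHexString(TEXT.charAt(1)));      // d87e, only the high surrogate
        System.out.println(Integer.toHexString(TEXT.codePointAt(1))); // 2f81a, the whole character
        // Char index of the third character (skip 2 code points from index 0).
        System.out.println(TEXT.offsetByCodePoints(0, 2));            // 3
    }
}
```

Indexes passed to charAt, codePointAt and substring are always char positions, so offsetByCodePoints is the way to translate "the n-th character" into an index.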
﻿
// indexOf as implemented in the JDK (JDK 8 source)
// The search branches on whether ch is a BMP code point or a supplementary one
// This only works because UTF-16 encodes a supplementary character as two code units from the
// reserved surrogate range 0xD800 to 0xDFFF, which can never be confused with a BMP character
public int indexOf(int ch, int fromIndex) {
    final int max = value.length;
    if (fromIndex < 0) {
        fromIndex = 0;
    } else if (fromIndex >= max) {
        // Note: fromIndex might be near -1>>>1.
        return -1;
    }

    if (ch < Character.MIN_SUPPLEMENTARY_CODE_POINT) {
        // handle most cases here (ch is a BMP code point or a
        // negative value (invalid code point))
        final char[] value = this.value;
        for (int i = fromIndex; i < max; i++) {
            if (value[i] == ch) {
                return i;
            }
        }
        return -1;
    } else {
        return indexOfSupplementary(ch, fromIndex);
    }
}
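From the caller's side, the two branches of indexOf behave as follows. This is a small usage sketch with an invented class name and sample string; it relies only on the documented behavior of String.indexOf(int).

```java
class IndexOfDemo {
    // "a", then U+2F81A written as a surrogate pair, then "b"
    static final String TEXT = "a\uD87E\uDC1Ab";

    public static void main(String[] args) {
        // Supplementary code point: matched as a high/low surrogate pair, char index returned.
        System.out.println(TEXT.indexOf(0x2F81A)); // 1
        // BMP code point: plain char comparison; the result is still a char index.
        System.out.println(TEXT.indexOf('b'));     // 3
        // A surrogate value is below 0x10000, so it is searched like any other char
        // and matches the low half of the pair.
        System.out.println(TEXT.indexOf(0xDC1A));  // 2
    }
}
```

The returned positions are char indexes in all three cases, which is why 'b' is found at 3 even though it is the third character.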
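String.getBytes() converts the String's internal encoding into the requested external encoding, so unless you ask for UTF-16 the result differs in size from what is actually stored in JVM memory.
To find out how much space a String takes in a file, call String.getBytes("xx") with the file's encoding.
String constants and symbol names in Java class files are also required to be encoded in (modified) UTF-8. The designers made this trade-off to balance run-time speed (UTF-16, with its mostly fixed-width code units) against external storage space (variable-width UTF-8). For the sake of consistency, the JDK may later standardize on UTF-8 as the default external encoding too.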
﻿
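Further notes - other APIs
Java has many APIs that deal with character encodings, such as java.io, java.nio, java.text, java.util.regex and the most important classes in java.lang. As a rule, each offers one method that uses the default charset and another that takes an explicit charset.
The default charset is the value of the Java system property file.encoding. You can inspect it with java.nio.charset.Charset.defaultCharset(), or set it on the command line at startup (for example -Dfile.encoding=UTF-8).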
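A quick way to see which default applies on a given machine, and why passing an explicit charset is safer, is sketched below. The class and method names are made up for this example, and the printed default will differ from one environment to another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Explicit charset: the result is the same on every machine.
    static int explicitUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // No-argument getBytes(): uses the JVM default charset, so the result may vary.
    static int defaultLength(String s) {
        return s.getBytes().length;
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        String wood = "\u6728"; // a CJK character in the BMP
        System.out.println("explicit UTF-8 bytes: " + explicitUtf8Length(wood)); // 3
        System.out.println("default bytes:        " + defaultLength(wood));      // depends on the default
    }
}
```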
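Avoid methods that take a char parameter: a char cannot hold a supplementary character, so prefer the overloads that take an int code point.
// Both calls compile, but they return different results
Character.isLetter('\uD840'); // false: a char cannot hold anything above 0xFFFF, and this value is only a high surrogate, half of a supplementary character
Character.isLetter(0x2F81A); // true: the int overload sees the whole code point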
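Mind the difference between String.length() and codePointCount(): the former returns the number of code units (chars), the latter the number of characters (code points).
Be careful when deleting characters: check whether you are deleting a char or a character. StringBuilder.deleteCharAt deletes a single char, so it removes only half of a supplementary character.
Be careful when reversing text. StringBuilder.reverse treats a surrogate pair as a single character and keeps it intact, but a hand-written loop that reverses char by char swaps the high and low surrogates, and the result is no longer a recognizable character.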
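The deletion and reversal pitfalls can be reproduced in a few lines. In this sketch the class, the helper methods and the sample string are all invented for illustration. Note that StringBuilder.reverse itself is documented to keep surrogate pairs together, so the breakage shown here comes from the naive char-by-char loop and from deleteCharAt.

```java
class SurrogateEditDemo {
    // "a", then U+2F81A written as a surrogate pair, then "b"
    static final String TEXT = "a\uD87E\uDC1Ab";

    // True if the string contains a surrogate that is not part of a valid pair.
    static boolean hasLoneSurrogate(String s) {
        return s.codePoints().anyMatch(cp -> cp >= 0xD800 && cp <= 0xDFFF);
    }

    // Naive reversal, one char at a time: splits and swaps surrogate pairs.
    static String reverseByChar(String s) {
        char[] out = new char[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[s.length() - 1 - i] = s.charAt(i);
        }
        return new String(out);
    }

    // Deletes the whole character (1 or 2 chars) that starts at the given char index.
    static String deleteCodePointAt(String s, int index) {
        int end = index + Character.charCount(s.codePointAt(index));
        return new StringBuilder(s).delete(index, end).toString();
    }

    public static void main(String[] args) {
        String deletedChar = new StringBuilder(TEXT).deleteCharAt(1).toString();
        String builtInReverse = new StringBuilder(TEXT).reverse().toString();
        System.out.println(hasLoneSurrogate(deletedChar));                // true: half a pair is left
        System.out.println(hasLoneSurrogate(deleteCodePointAt(TEXT, 1))); // false
        System.out.println(hasLoneSurrogate(builtInReverse));             // false: pair kept intact
        System.out.println(hasLoneSurrogate(reverseByChar(TEXT)));        // true: pair swapped
    }
}
```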
When using Java's character and string APIs, always check whether a method works in code units or code points:
    - String.charAt(): code unit (char)
    - String.length(): code unit (char)
    - String.substring(): code unit (char)
    - String.codePointAt(): code point
The CharSequence interface provides two very useful default methods:
    - public default IntStream chars()
    - public default IntStream codePoints()
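The two streams differ only when the text contains supplementary characters. The sketch below, with an invented class name and a hypothetical hexList helper, prints both views of the same string.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class CharSequenceStreamsDemo {
    // "a", then U+2F81A written as a surrogate pair, then "b"
    static final String TEXT = "a\uD87E\uDC1Ab";

    // Renders each value of the stream as lower-case hex, separated by spaces.
    static String hexList(IntStream values) {
        return values.mapToObj(Integer::toHexString).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(hexList(TEXT.chars()));      // 61 d87e dc1a 62
        System.out.println(hexList(TEXT.codePoints())); // 61 2f81a 62
    }
}
```

chars() yields one value per code unit, so the surrogate pair appears as two entries, while codePoints() merges the pair into the single value 0x2f81a.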
﻿
References
Unicode
Unicode FAQ
JEP 254: Compact Strings
JEP 327: Unicode 10
JEP draft: Use UTF-8 as default Charset
JSR 204: Unicode Supplementary Character Support
the char type in java is broken
java.lang.Character
java.lang.String
Supplementary Characters in the Java Platform
Unicode surrogate programming with the Java language
every software developer must know about unicode
utf-8 everywhere
﻿
Original link: [http://xie.infoq.cn/article/5df6572b758734ea4cc907e90]. Please contact the author before republishing.
陈德伟
