Strings, bytes, runes and characters in Go

Introduction

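This article discusses strings in Go. The central question is: "When I index a Go string at position n, why don't I get the nth character?" Answering it requires distinguishing between bytes, characters, runes, Unicode, and UTF-8.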

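Recommended reading: Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".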

What is a string?

In Go, a string is in effect a read-only slice of bytes.

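A string can hold arbitrary bytes; it is not required to hold Unicode text, UTF-8 text, or any other predefined format.
As far as its content is concerned, a string is exactly equivalent to a slice of bytes.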
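
The equivalence with byte slices can be sketched as follows. This is a minimal illustration, not part of the original article; the helper name roundTrip is hypothetical. It assumes only the built-in conversions between string and []byte, both of which copy the data.

```go
package main

import "fmt"

// roundTrip (hypothetical helper) converts a string to a byte slice and back.
// Both conversions copy the data, so no bytes are altered or validated.
func roundTrip(s string) string {
    b := []byte(s)
    return string(b)
}

func main() {
    // A string whose first two bytes are not valid UTF-8.
    s := "\xbd\xb2\x3d"

    b := []byte(s)              // a mutable copy of the string's bytes
    fmt.Println(len(s), len(b)) // 3 3

    b[2] = '!'             // changing the copy leaves s untouched
    fmt.Printf("% x\n", s) // bd b2 3d
    fmt.Printf("% x\n", b) // bd b2 21

    fmt.Println(roundTrip(s) == s) // true

    // s[2] = '!' would not compile: the bytes of a string are read-only.
}
```

The round trip preserves the invalid bytes unchanged, which is what "arbitrary bytes" means in practice.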

Here is a string literal that uses the \xNN notation to define a string constant holding some unusual byte values:

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

Printing strings

Because some of the bytes in the sample string are not valid UTF-8, printing the string directly produces garbage:

fmt.Println(sample)

This produces the following mess (the exact appearance depends on the environment):

��=� ⌘

To find out what the string really holds, we can take it apart in several ways.

Looping over the bytes:

for i := 0; i < len(sample); i++ {
    fmt.Printf("%x ", sample[i])
}

Output: bd b2 3d bc 20 e2 8c 98

%x (hexadecimal):
```
fmt.Printf("%x\n", sample)
```
Output: bdb23dbc20e28c98
% x (hexadecimal with spaces between bytes):
```
fmt.Printf("% x\n", sample)
```
Output: bd b2 3d bc 20 e2 8c 98
%q (quoted): escapes any non-printable byte sequences.
```
fmt.Printf("%q\n", sample)
```
Output: "\xbd\xb2=\xbc ⌘"
%+q (ASCII-only output): interprets UTF-8 and additionally escapes every non-ASCII character.
```
fmt.Printf("%+q\n", sample)
```
Output: "\xbd\xb2=\xbc \u2318" (this exposes the Unicode value U+2318 of the Swedish "Place of Interest" symbol).

Complete example program

package main
 
import "fmt"
 
func main() {
    const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
 
    fmt.Println("Println:")
    fmt.Println(sample)
 
    fmt.Println("Byte loop:")
    for i := 0; i < len(sample); i++ {
        fmt.Printf("%x ", sample[i])
    }
    fmt.Printf("\n")
 
    fmt.Println("Printf with %x:")
    fmt.Printf("%x\n", sample)
 
    fmt.Println("Printf with % x:")
    fmt.Printf("% x\n", sample)
 
    fmt.Println("Printf with %q:")
    fmt.Printf("%q\n", sample)
 
    fmt.Println("Printf with %+q:")
    fmt.Printf("%+q\n", sample)
}

UTF-8 and string literals

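Indexing a string yields its bytes, not its characters. The UTF-8 representation of a string is usually created when the source code is written.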

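Go source code is defined to be UTF-8 text.
A string literal therefore holds UTF-8 text, unless it contains byte-level escapes that break the encoding.
A raw string (delimited by backquotes) cannot contain escape sequences, so it always holds a valid UTF-8 representation of its contents.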
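
As a hedged illustration of these three rules, the sketch below checks a few literals with utf8.ValidString from the standard unicode/utf8 package. The wrapper name isUTF8 is hypothetical and exists only to keep the example short.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// isUTF8 (hypothetical wrapper) reports whether s consists entirely of
// valid UTF-8 byte sequences.
func isUTF8(s string) bool {
    return utf8.ValidString(s)
}

func main() {
    fmt.Println(isUTF8("⌘"))        // true: typed directly in UTF-8 source
    fmt.Println(isUTF8(`⌘`))        // true: raw string, same bytes
    fmt.Println(isUTF8("\u2318"))   // true: \u escapes are encoded as UTF-8
    fmt.Println(isUTF8("\xbd\xb2")) // false: byte-level escapes break the encoding

    // A raw string does not interpret escapes, so `\xbd` is four
    // ordinary ASCII characters rather than one byte.
    fmt.Println(`\xbd` == "\\xbd") // true
}
```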

The following example program shows how the character, its quoted form, and its hexadecimal bytes relate to one another:

package main
 
import "fmt"
 
func main() {
    const placeOfInterest = `⌘`
 
    fmt.Printf("plain string: ")
    fmt.Printf("%s", placeOfInterest)
    fmt.Printf("\n")
 
    fmt.Printf("quoted string: ")
    fmt.Printf("%+q", placeOfInterest)
    fmt.Printf("\n")
 
    fmt.Printf("hex bytes: ")
    for i := 0; i < len(placeOfInterest); i++ {
        fmt.Printf("%x ", placeOfInterest[i])
    }
    fmt.Printf("\n")
}

Output:

plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98

The bytes e2 8c 98 are the UTF-8 encoding of the hexadecimal value 2318, that is, the code point U+2318.
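
To make the relationship between U+2318 and e2 8c 98 concrete, here is a small sketch that builds the three-byte form by hand and compares it with utf8.EncodeRune. The function encode3 is a hypothetical helper written for this illustration; it assumes a code point in the range U+0800 to U+FFFF and does not reject surrogates.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// encode3 (hypothetical helper) encodes a code point in the range
// U+0800..U+FFFF using the three-byte UTF-8 layout
// 1110xxxx 10xxxxxx 10xxxxxx. It performs no validation.
func encode3(r rune) [3]byte {
    return [3]byte{
        0xE0 | byte(r>>12),         // top 4 bits
        0x80 | (byte(r>>6) & 0x3F), // middle 6 bits
        0x80 | (byte(r) & 0x3F),    // low 6 bits
    }
}

func main() {
    const r = '⌘' // U+2318

    manual := encode3(r)
    fmt.Printf("% x\n", manual[:]) // e2 8c 98

    // The standard library produces the same bytes.
    buf := make([]byte, utf8.UTFMax)
    n := utf8.EncodeRune(buf, r)
    fmt.Printf("% x\n", buf[:n]) // e2 8c 98
}
```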

In summary: strings can contain arbitrary bytes, but when constructed from string literals, those bytes are almost always UTF-8.

Code points, characters, and runes

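Code point: the Unicode standard's term for the item represented by a single value (for example, U+2318).
Character: a fuzzy concept. A single character may correspond to several code points; an accented letter, for instance, can be written as a base letter followed by a combining accent.
Rune: Go's term for a code point. The type rune is an alias for int32.
Rune constant: an expression such as '⌘', whose type is rune and whose integer value is 0x2318.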
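
The definitions of rune and rune constant can be checked with a few lines of code. This is only a sketch; the function codePoint is a hypothetical helper whose sole purpose is to show that a rune can be returned as an int32 without conversion.

```go
package main

import "fmt"

// codePoint (hypothetical helper) returns the numeric value of a rune.
// Because rune is an alias for int32, no conversion is required.
func codePoint(r rune) int32 {
    return r
}

func main() {
    const cmd = '⌘' // untyped rune constant with value 0x2318

    var r rune = cmd
    i := codePoint(r)

    // %T reports the underlying type name; %U and %c format the code point.
    fmt.Printf("%T %d %#x %U %c\n", r, i, i, r, r)
    // int32 8984 0x2318 U+2318 ⌘
}
```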

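Key points:

Go source code is always UTF-8.
A string holds arbitrary bytes.
A string literal, absent byte-level escapes, always holds valid UTF-8 sequences.
Those sequences represent Unicode code points, called runes.
Go makes no guarantee that the characters in a string are normalized.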
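
The last point is easy to overlook, so here is a hedged sketch of its consequence: two strings that render as the same character can differ in bytes and in rune count. The helper name runeCount is hypothetical; it wraps utf8.RuneCountInString from the standard library, which itself offers no normalization.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// runeCount (hypothetical wrapper) returns the number of code points in s.
func runeCount(s string) int {
    return utf8.RuneCountInString(s)
}

func main() {
    precomposed := "\u00e9" // é as the single code point U+00E9
    decomposed := "e\u0301" // e followed by U+0301 COMBINING ACUTE ACCENT

    // Both display as é, but Go compares the bytes as they are.
    fmt.Println(precomposed == decomposed) // false

    fmt.Println(len(precomposed), runeCount(precomposed)) // 2 1
    fmt.Println(len(decomposed), runeCount(decomposed))   // 3 2
}
```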

Range loops

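A for range loop is the one place where Go treats UTF-8 specially:

The loop decodes one UTF-8-encoded rune on each iteration.
The index is the byte position at which the current rune starts.
The value is the rune's code point.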

package main
 
import "fmt"
 
func main() {
    const nihongo = "日本語"
    for index, runeValue := range nihongo {
        fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
    }
}

The output shows that each code point occupies multiple bytes:

U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6

Libraries

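The unicode/utf8 package provides tools to validate, disassemble, and reassemble UTF-8 strings. Its DecodeRuneInString function decodes text the same way a for range loop does, returning the rune and its width in bytes.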

package main
 
import (
    "fmt"
    "unicode/utf8"
)
 
func main() {
    const nihongo = "日本語"
    for i, w := 0, 0; i < len(nihongo); i += w {
        runeValue, width := utf8.DecodeRuneInString(nihongo[i:])
        fmt.Printf("%#U starts at byte position %d\n", runeValue, i)
        w = width
    }
}
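
Building on DecodeRuneInString, the question from the introduction (how to get the nth character rather than the nth byte) can be approached as sketched below. The function nthRune is a hypothetical helper, not a library function, and it counts code points, which is not always the same as user-perceived characters.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// nthRune (hypothetical helper) returns the n-th rune of s, counting from 0,
// and true. It returns utf8.RuneError and false if s has too few runes.
func nthRune(s string, n int) (rune, bool) {
    for i := 0; i < len(s); {
        r, width := utf8.DecodeRuneInString(s[i:])
        if n == 0 {
            return r, true
        }
        n--
        i += width // advance by the encoded width of the rune just decoded
    }
    return utf8.RuneError, false
}

func main() {
    const nihongo = "日本語"

    // Indexing yields a byte: the second byte of the encoding of 日.
    fmt.Printf("%x\n", nihongo[1]) // 97

    // Walking the runes yields the second code point instead.
    r, ok := nthRune(nihongo, 1)
    fmt.Printf("%c %v\n", r, ok) // 本 true

    fmt.Println(utf8.RuneCountInString(nihongo), len(nihongo)) // 3 9
}
```

Each call walks the string from the start, so repeated lookups cost time proportional to the string length; converting once to []rune may be a reasonable alternative when many lookups are needed.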

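Conclusion

Strings are built from bytes, so indexing a string yields bytes, not characters. Although a string may hold arbitrary bytes, UTF-8 is central to the design of Go strings.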
