Stack Epiphany Trilogy (3): Tracing the Origin of the goroutine Stack

2022-05-05

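From the first two articles, "Starting from the CPU's Perspective" and "Through the Fog of Virtual Memory", we learned that a so-called process stack is nothing more than a contiguous block of memory that the application requests from the kernel and then, by setting the relevant registers, uses as a stack. Its typical use is function calls.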

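The previous article discussed process and thread stacks; this one goes on to explore the coroutine stack in Go. To be pedantic, calling them "Go coroutines" is not rigorous. Go's coroutines are stackful: each one has its own stack. That is why the Go project coined a new word, goroutine, to set them apart from ordinary coroutines. Before we turn to the goroutine stack, let us review the previous article's summary of where process and thread stacks live.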

This article is based on Linux on x86-64, uses the Go 1.18 source code, and has cgo disabled.

1. Process and thread stacks

Figure 3-1: Thread stacks located in different regions

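Figure 3-1 shows the layout of the 64-bit virtual address space. The pink markers indicate where thread stacks may be located, which comes down to three cases: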

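The main thread's stack sits at the top of user space, although after clone the stack actually used by the child's main thread is not necessarily there.
A stack may be allocated in the mmap region.
A stack may be allocated in the heap region through the C library's malloc.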

2. The goroutine stack

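You may already know that goroutine stacks are allocated from the heap, but if you are curious enough, the question of where that heap sits in the virtual address space will drive you mad.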

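Go implements its own runtime. Without cgo, a compiled Go program is statically linked and depends on no C library, which makes it quite portable: a program built on a newer kernel still runs on an operating system with an older one. Rust has no particular advantage on this point, while the newcomer Hare does well here.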

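Not depending on a C library means that Go manages the heap in its own way. Whether the heap Go manages is in the same place as the heap in the earlier address-space diagrams is therefore very much an open question. To settle it we have to dig into the runtime source, all the way down to the interface between Go and the kernel, and find out how Go requests memory.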

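This article does not analyze Go's memory allocator or the stack allocation algorithm; it only aims to clear up where goroutine stacks sit in the virtual address space. Readers interested in memory management and stack allocation can refer to the articles "详解Go中内存分配源码实现" and "一文教你搞懂 Go 中栈操作".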

Let's start with the creation of an ordinary goroutine.

In Go, every time a goroutine is started with go func(){}, the compiler turns the statement into a call to runtime.newproc:

// Create a new g running fn.
// Put it on the queue of g's waiting to run.
// The compiler turns a go statement into a call to this.
func newproc(fn *funcval) {
  gp := getg()
  pc := getcallerpc()
  // Switch to the thread stack to create the g.
  systemstack(func() {
    newg := newproc1(fn, gp, pc)

    _p_ := getg().m.p.ptr()
    runqput(_p_, newg, true)

    if mainStarted {
      wakep()
    }
  })
}

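newproc is only a wrapper around newproc1. Creating the new g must not happen on the user stack, so the code switches to the stack of the underlying thread to do it.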

// Create a new g in state _Grunnable, starting at fn. callerpc is the
// address of the go statement that created this. The caller is responsible
// for adding the new g to the scheduler.
func newproc1(fn *funcval, callergp *g, callerpc uintptr) *g {
  _g_ := getg()

  if fn == nil {
    _g_.m.throwing = -1 // do not dump full stacks
    throw("go of nil func value")
  }
  acquirem() // disable preemption because it can be holding p in a local var

  _p_ := _g_.m.p.ptr()
  // Try to take a free G from P's free list.
  newg := gfget(_p_)
  // If none is available, create one with malg.
  if newg == nil {
    newg = malg(_StackMin)
    casgstatus(newg, _Gidle, _Gdead)
    allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
  }
  // ...... (remaining code omitted)
}

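newproc1 is long. It mainly obtains a G and then initializes it. When a G is needed, the runtime first looks in the cached free list and creates a new one only if no free G is available. So the only call we look at here is malg.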

The call to malg passes the minimum stack size, _StackMin (2048 bytes on Linux).

// Allocate a new g, with a stack big enough for stacksize bytes.
func malg(stacksize int32) *g {
  newg := new(g)
  if stacksize >= 0 {
    stacksize = round2(_StackSystem + stacksize)
    systemstack(func() {
      newg.stack = stackalloc(uint32(stacksize))
    })
    newg.stackguard0 = newg.stack.lo + _StackGuard
    newg.stackguard1 = ^uintptr(0)
    // Clear the bottom word of the stack. We record g
    // there on gsignal stack during VDSO on ARM and ARM64.
    *(*uintptr)(unsafe.Pointer(newg.stack.lo)) = 0
  }
  return newg
}

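malg creates a new G and sets up its stack together with the stack bounds that later stack growth relies on. The function to focus on is stackalloc: it allocates the stack memory, and its return value is assigned to the stack field of the new G.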
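
The arithmetic malg performs can be tried in isolation. The following is a minimal sketch, not runtime code: the constant values are what I assume for linux/amd64 in Go 1.18, newStack is a hypothetical helper, and the base address is made up rather than obtained from stackalloc.

```go
package main

import "fmt"

// Values assumed for linux/amd64 in Go 1.18; treat them as illustrative.
const (
	stackSystem = 0    // extra bytes reserved for the OS (_StackSystem)
	stackMin    = 2048 // minimum goroutine stack (_StackMin)
	stackGuard  = 928  // guard area above stack.lo (_StackGuard)
)

// stack mirrors runtime.stack: the usable range is [lo, hi).
type stack struct {
	lo, hi uintptr
}

// round2 rounds x up to the next power of two, like the runtime helper.
func round2(x int32) int32 {
	s := uint(0)
	for int32(1)<<s < x {
		s++
	}
	return 1 << s
}

// newStack (hypothetical) models what malg records once a block starting at
// base has been allocated: the stack bounds and the overflow-check threshold.
func newStack(base uintptr, requested int32) (st stack, guard0 uintptr) {
	size := uintptr(round2(stackSystem + requested))
	st = stack{lo: base, hi: base + size}
	return st, st.lo + stackGuard
}

func main() {
	// 0xc000040000 is a made-up base address, not a real allocation.
	st, guard := newStack(0xc000040000, stackMin)
	fmt.Printf("stack [%#x, %#x) size=%d stackguard0=%#x\n",
		st.lo, st.hi, st.hi-st.lo, guard)
}
```

The guard value sits a fixed distance above the low end of the stack, which is what lets a function prologue detect that the stack is about to overflow and needs to grow.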

The stack field of a G is a stack struct that records the low and high addresses of the stack:

// Stack describes a Go execution stack.
// The bounds of the stack are exactly [lo, hi),
// with no implicit data structures on either side.
type stack struct {
  lo uintptr
  hi uintptr
}

Next, let's see how this stack is created.

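stackalloc is fairly long and contains the allocation logic for both large and small stacks, so the code is not reproduced here. Whether it takes memory from the cache or from the pool, when memory runs out it ends up calling mheap's allocManual to allocate new memory: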

mheap_.allocManual(_StackCacheSize>>_PageShift, spanAllocStack)

Here we meet the heap that Go manages. We will come back to where this heap is located; for now we keep digging into allocManual until we reach a system call.

func (h *mheap) allocManual(npages uintptr, typ spanAllocType) *mspan {
  if !typ.manual() {
    throw("manual span allocation called with non-manually-managed type")
  }
  return h.allocSpan(npages, typ, 0)
}

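allocManual is a thin wrapper around allocSpan. As a brief aside, the smallest unit of Go's memory management is the mspan, which consists of a number of contiguous pages.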

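allocSpan has a lot of logic; its main job is to allocate npages pages from the heap to back a span. As a program runs and keeps requesting memory, the heap usually holds plenty of free pages to serve later requests. What we want to see is the case where they do not suffice: when the heap does not have enough pages, it has to grow, and allocSpan does this by calling mheap.grow.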

// Try to add at least npage pages of memory to the heap,
// returning how much the heap grew by and whether it worked.
func (h *mheap) grow(npage uintptr) (uintptr, bool) {
  assertLockHeld(&h.lock)
  ask := alignUp(npage, pallocChunkPages) * pageSize
  totalGrowth := uintptr(0)
  // This may overflow because ask could be very large
  // and is otherwise unrelated to h.curArena.base.
  // Note: curArena needs no initialization; the check below decides
  // whether the request still fits inside the current arena.
  end := h.curArena.base + ask
  nBase := alignUp(end, physPageSize)
  if nBase > h.curArena.end || /* overflow */ end < h.curArena.base {
    // Note: try to obtain a new arena; it may not continue the current
    // hint region, so the full amount is requested.
    // Not enough room in the current arena. Allocate more
    // arena space. This may not be contiguous with the
    // current arena, so we have to request the full ask.
    av, asize := h.sysAlloc(ask)
    // Note: on success the required memory has now been reserved.
    if av == nil {
      print("runtime: out of memory: cannot allocate ", ask, "-byte block (", memstats.heap_sys, " in use)\n")
      return 0, false
    }

    if uintptr(av) == h.curArena.end {
      // Note: the new space is contiguous, so extend the bounds of curArena.
      // The new space is contiguous with the old
      // space, so just extend the current space.
      h.curArena.end = uintptr(av) + asize
    } else {
      // Note: what is left of the old arena is too small for this request,
      // but it is not wasted: it is marked as prepared and handed to the
      // page allocator for later use.
      // The new space is discontiguous. Track what
      // remains of the current space and switch to
      // the new space. This should be rare.
      if size := h.curArena.end - h.curArena.base; size != 0 {
        // Transition this space from Reserved to Prepared and mark it
        // as released since we'll be able to start using it after updating
        // the page allocator and releasing the lock at any time.
        sysMap(unsafe.Pointer(h.curArena.base), size, &memstats.heap_sys)
        // Update stats.
        atomic.Xadd64(&memstats.heap_released, int64(size))
        stats := memstats.heapStats.acquire()
        atomic.Xaddint64(&stats.released, int64(size))
        memstats.heapStats.release()
        // Update the page allocator's structures to make this
        // space ready for allocation.
        h.pages.grow(h.curArena.base, size)
        totalGrowth += size
      }
      // Switch to the new space.
      // Note: curArena now points at the new address range.
      h.curArena.base = uintptr(av)
      h.curArena.end = uintptr(av) + asize
    }

    // Recalculate nBase.
    // We know this won't overflow, because sysAlloc returned
    // a valid region starting at h.curArena.base which is at
    // least ask bytes in size.
    nBase = alignUp(h.curArena.base+ask, physPageSize)
  }

  // Note: advance base.
  // Grow into the current arena.
  v := h.curArena.base
  h.curArena.base = nBase

  // Note: mark the block being handed out as Prepared.
  // Transition the space we're going to use from Reserved to Prepared.
  sysMap(unsafe.Pointer(v), nBase-v, &memstats.heap_sys)

  // ...... (some code omitted)

  // Update the page allocator's structures to make this
  // space ready for allocation.
  h.pages.grow(v, nBase-v)
  totalGrowth += nBase - v
  return totalGrowth, true
}

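When the free space in curArena is not enough to satisfy the allocation, grow calls mheap.sysAlloc to request more. The region obtained is usually larger than what was asked for, because sysAlloc rounds the request up to a multiple of heapArenaBytes.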
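
To get a feel for the sizes involved, here is a small sketch of the two rounding steps. The constants are the values I assume for linux/amd64 in Go 1.18 (8 KiB runtime pages, 512 pages per chunk, 64 MiB arenas), and growRequest is a hypothetical helper, not a runtime function.

```go
package main

import "fmt"

// Constants assumed for linux/amd64 in Go 1.18.
const (
	pageSize         = 8 << 10  // runtime page size: 8 KiB
	pallocChunkPages = 512      // pages per page-allocator chunk
	heapArenaBytes   = 64 << 20 // size of one heap arena: 64 MiB
)

// alignUp rounds n up to a multiple of a (a must be a power of two).
func alignUp(n, a uintptr) uintptr {
	return (n + a - 1) &^ (a - 1)
}

// growRequest (hypothetical) returns the bytes grow asks for and the bytes
// sysAlloc would reserve when npages pages are needed and the current arena
// has no room left.
func growRequest(npages uintptr) (ask, reserved uintptr) {
	ask = alignUp(npages, pallocChunkPages) * pageSize
	reserved = alignUp(ask, heapArenaBytes)
	return ask, reserved
}

func main() {
	// A stack cache refill of 32 KiB corresponds to 4 runtime pages
	// (assuming _StackCacheSize is 32768).
	ask, reserved := growRequest(4)
	fmt.Printf("ask=%d MiB reserved=%d MiB\n", ask>>20, reserved>>20)
}
```

Under these assumptions even a tiny request makes grow ask for a whole 4 MiB chunk, and a fresh reservation covers a full 64 MiB arena, most of which stays reserved but unused until later allocations need it.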

func (h *mheap) sysAlloc(n uintptr) (v unsafe.Pointer, size uintptr) {
  assertLockHeld(&h.lock)

  n = alignUp(n, heapArenaBytes)

  // First, try the arena pre-reservation.
  v = h.arena.alloc(n, heapArenaBytes, &memstats.heap_sys)
  if v != nil {
    size = n
    goto mapped
  }

  // Try to grow the heap at a hint address.
  for h.arenaHints != nil {
    hint := h.arenaHints
    p := hint.addr
    if hint.down {
      p -= n
    }
    if p+n < p {
      // We can't use this, so don't ask.
      v = nil
    } else if arenaIndex(p+n-1) >= 1<<arenaBits {
      // Outside addressable heap. Can't use.
      v = nil
    } else {
      v = sysReserve(unsafe.Pointer(p), n)
    }
    // Note: if the two differ, mmap could not place the mapping at the hinted address.
    if p == uintptr(v) {
      // Success. Update the hint.
      if !hint.down {
        p += n
      }
      // Note: on success the hint address moves along as well.
      hint.addr = p
      size = n
      break
    }
    // Note: give back what was just reserved and try the next arenaHint,
    // that is, the next 1 TiB range.
    // Failed. Discard this hint and try the next.
    //
    // TODO: This would be cleaner if sysReserve could be
    // told to only return the requested address. In
    // particular, this is already how Windows behaves, so
    // it would simplify things there.
    if v != nil {
      sysFree(v, n, nil)
    }
    h.arenaHints = hint.next
    h.arenaHintAlloc.free(unsafe.Pointer(hint))
  }

  if size == 0 {
    if raceenabled {
      // The race detector assumes the heap lives in
      // [0x00c000000000, 0x00e000000000), but we
      // just ran out of hints in this region. Give
      // a nice failure.
      throw("too many address space collisions for -race mode")
    }

    // All of the hints failed, so we'll take any
    // (sufficiently aligned) address the kernel will give
    // us.
    // Note: every hint failed, so let the kernel pick the address.
    v, size = sysReserveAligned(nil, n, heapArenaBytes)
    if v == nil {
      return nil, 0
    }

    // Create new hints for extending this region.
    hint := (*arenaHint)(h.arenaHintAlloc.alloc())
    hint.addr, hint.down = uintptr(v), true
    hint.next, mheap_.arenaHints = mheap_.arenaHints, hint
    hint = (*arenaHint)(h.arenaHintAlloc.alloc())
    hint.addr = uintptr(v) + size
    hint.next, mheap_.arenaHints = mheap_.arenaHints, hint
  }
  // ...... (a large part of the code omitted)
  return
}

The operation that actually requests the memory is sysReserve. Let's take a look:

func sysReserve(v unsafe.Pointer, n uintptr) unsafe.Pointer {
  p, err := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
  if err != 0 {
    return nil
  }
  return p
}

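The familiar mmap comes into view! We have reached the kernel's front door. Its definition shows that it wraps a function named sysMmap, which is where the mmap system call is issued. It is written in assembly; on Linux the function body is in sys_linux_amd64.s: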

// sysMmap calls the mmap system call. It is implemented in assembly.
func sysMmap(addr unsafe.Pointer, n uintptr, prot, flags, fd int32, off uint32) (p unsafe.Pointer, err int)

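In this mmap call, the protection _PROT_NONE means the pages cannot be accessed yet, so the address range is only reserved, and the flags _MAP_ANON|_MAP_PRIVATE request a private anonymous mapping with no backing file. The call also passes a hint address, asking the kernel to place the mapping at that address if possible.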

The kernel cannot guarantee that, of course, but Go is stubborn enough: if contiguous growth is not possible, it starts over in another range:

// Note: if the two differ, mmap could not place the mapping at the hinted address.
if p == uintptr(v) {
  // Success. Update the hint.
  if !hint.down {
    p += n
  }
  // Note: on success the hint address moves along as well.
  hint.addr = p
  size = n
  break
}
// Note: give back what was just reserved and try the next arenaHint,
// that is, the next 1 TiB range.
// Failed. Discard this hint and try the next.
//
// TODO: This would be cleaner if sysReserve could be
// told to only return the requested address. In
// particular, this is already how Windows behaves, so
// it would simplify things there.
if v != nil {
  sysFree(v, n, nil)
}
h.arenaHints = hint.next
h.arenaHintAlloc.free(unsafe.Pointer(hint))

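Returning from sysAlloc means a range of address space has been obtained from the kernel. Back in mheap.grow, sysMap is then called to make a second request to the kernel, this time for the part of that range that is about to be used. Here is sysMap: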

func sysMap(v unsafe.Pointer, n uintptr, sysStat *sysMemStat) {
  sysStat.add(int64(n))

  p, err := mmap(v, n, _PROT_READ|_PROT_WRITE, _MAP_ANON|_MAP_FIXED|_MAP_PRIVATE, -1, 0)
  if err == _ENOMEM {
    throw("runtime: out of memory")
  }
  if p != v || err != 0 {
    print("runtime: mmap(", v, ", ", n, ") returned ", p, ", ", err, "\n")
    throw("runtime: cannot map pages in arena address space")
  }
}

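As we can see, this is also an mmap system call, but with different arguments: the protection is now _PROT_READ|_PROT_WRITE, and the flags include an additional _MAP_FIXED.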

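The mmap manual explains the difference. Without _MAP_FIXED, the kernel tries to place the mapping at the given address, but avoiding conflicts comes first, so the result is not always what was asked for. _MAP_FIXED forces the placement: the mapping is created at exactly the requested address, replacing any existing mapping that overlaps it.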

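The mmap documentation warns about using _MAP_FIXED, but Go's use of it here is perfectly safe: the range was reserved from the kernel beforehand, so carving a piece out of it can be done with eyes closed.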
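
The reserve-then-map pattern can be reproduced from user code. The following is a Linux-only sketch for 64-bit systems, not the runtime's implementation: it issues the system call through syscall.Syscall6 because the standard syscall.Mmap wrapper takes no address argument, and rawMmap, reserveThenMap, poke, and the hint 0x00d000000000 are all names and values made up for this illustration.

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// rawMmap issues the mmap system call directly (Linux, 64-bit).
// fd is -1 and the offset is 0 because every mapping here is anonymous.
func rawMmap(addr, n uintptr, prot, flags int) (uintptr, error) {
	p, _, errno := syscall.Syscall6(syscall.SYS_MMAP,
		addr, n, uintptr(prot), uintptr(flags), ^uintptr(0), 0)
	if errno != 0 {
		return 0, errno
	}
	return p, nil
}

// reserveThenMap reserves n bytes near hint with PROT_NONE, then makes the
// first use bytes readable and writable with MAP_FIXED at the same address.
func reserveThenMap(hint, n, use uintptr) (reserved, mapped uintptr, err error) {
	const anonPrivate = syscall.MAP_ANON | syscall.MAP_PRIVATE
	reserved, err = rawMmap(hint, n, syscall.PROT_NONE, anonPrivate)
	if err != nil {
		return 0, 0, err
	}
	mapped, err = rawMmap(reserved, use,
		syscall.PROT_READ|syscall.PROT_WRITE, anonPrivate|syscall.MAP_FIXED)
	return reserved, mapped, err
}

// poke writes v to addr and reads it back; addr must be mapped read/write.
func poke(addr uintptr, v byte) byte {
	p := (*byte)(unsafe.Pointer(addr))
	*p = v
	return *p
}

func main() {
	const hint = 0x00d000000000 // arbitrary hint chosen for this sketch
	reserved, mapped, err := reserveThenMap(hint, 1<<20, 4096)
	if err != nil {
		fmt.Println("mmap failed:", err)
		return
	}
	fmt.Printf("hint %#x, reserved at %#x (honoured: %v)\n",
		uintptr(hint), reserved, reserved == hint)
	fmt.Printf("fixed mapping at %#x, readback %d\n", mapped, poke(mapped, 42))
}
```

The first call may or may not land on the hint, which is exactly why the runtime compares the returned address with the one it asked for. The second call always lands where requested, and it is safe only because the range already belongs to the caller.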

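We now have a contiguous block of memory, and it is time to return from allocSpan. With that, stackalloc has obtained a contiguous block of memory for the new G to use as its stack.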

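Following the creation of a goroutine all the way to the kernel's door, we found that memory is requested with mmap. But where in the process's virtual address space does mmap allocate from? And where do the hint addresses in the runtime source come from?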

3. Where mmap allocates memory

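mmap is both a system call and the name of a region in a process's virtual address space. Let me again borrow a figure from Professional Linux Kernel Architecture (《深入 Linux 内核架构》):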

Figure 3-2: The mmap region grows from the top down

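The book describes the memory layout of the 2.6 kernel, in which the mmap region and the heap grow toward each other. The kernel leaves enough room for the main thread's stack, which makes the best use of the address space; fortunately, stacks are usually not very large.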

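mmap is not limited to the region that bears its name, however. It can map memory almost anywhere in user space, and that includes the traditional heap region. Remember _MAP_FIXED? I bet that misusing it can crash your program!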

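The heap is used for a process's dynamic memory allocation. The traditional definition is: the heap is a variable-length, contiguous range of virtual memory that begins right after the end of the process's uninitialized data segment and grows and shrinks as memory is allocated and freed: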

Figure 3-3: Virtual memory layout of a Linux process

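The traditional heap is resized with brk and sbrk, whereas Go maintains its heap mainly with mmap. This suggests that the Go heap is not located where the traditional heap is. The location differs, but the job is the same. Let's look at the memory layout of a Go program: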

00400000-004bd000 r-xp 00000000 103:02 8916313      playground/helloworld/hello/hello
004bd000-00574000 r--p 000bd000 103:02 8916313      playground/helloworld/hello/hello
00574000-0058f000 rw-p 00174000 103:02 8916313      playground/helloworld/hello/hello
0058f000-005c4000 rw-p 00000000 00:00 0
c000000000-c000200000 rw-p 00000000 00:00 0
c000200000-c017e00000 rw-p 00000000 00:00 0
c017e00000-c018000000 rw-p 00000000 00:00 0
c018000000-c018400000 rw-p 00000000 00:00 0
c018400000-c01c000000 ---p 00000000 00:00 0
7fef44906000-7fef449ba000 rw-p 00000000 00:00 0
7fef449d2000-7fef47c19000 rw-p 00000000 00:00 0
7fef47c19000-7fef57d99000 ---p 00000000 00:00 0
7fef57d99000-7fef57d9a000 rw-p 00000000 00:00 0
7fef57d9a000-7fef69c49000 ---p 00000000 00:00 0
7fef69c49000-7fef69c4a000 rw-p 00000000 00:00 0
7fef69c4a000-7fef6c01f000 ---p 00000000 00:00 0
7fef6c01f000-7fef6c020000 rw-p 00000000 00:00 0
7fef6c020000-7fef6c499000 ---p 00000000 00:00 0
7fef6c499000-7fef6c49a000 rw-p 00000000 00:00 0
7fef6c49a000-7fef6c519000 ---p 00000000 00:00 0
7fef6c519000-7fef6c579000 rw-p 00000000 00:00 0
7ffc335d5000-7ffc335f7000 rw-p 00000000 00:00 0                          [stack]
7ffc335f8000-7ffc335fc000 r--p 00000000 00:00 0                          [vvar]
7ffc335fc000-7ffc335fe000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

Table 3-1: Memory mappings of a Go process
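
A listing like this can be checked against a live goroutine. The sketch below is Linux-only and rests on an assumption: that converting the address of a local variable to uintptr does not make the variable escape to the heap, so the address really is on the goroutine's stack. findMapping is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"unsafe"
)

// findMapping returns the line of a /proc/<pid>/maps dump whose address
// range contains addr, or "" if no range does.
func findMapping(maps string, addr uintptr) string {
	for _, line := range strings.Split(maps, "\n") {
		var lo, hi uintptr
		if n, _ := fmt.Sscanf(line, "%x-%x", &lo, &hi); n != 2 {
			continue // blank or malformed line
		}
		if lo <= addr && addr < hi {
			return line
		}
	}
	return ""
}

func main() {
	addrCh := make(chan uintptr)
	go func() {
		var local int
		// Assumption: the uintptr conversion keeps local on this
		// goroutine's stack instead of moving it to the heap.
		addrCh <- uintptr(unsafe.Pointer(&local))
	}()
	addr := <-addrCh

	data, err := os.ReadFile("/proc/self/maps")
	if err != nil {
		fmt.Println("cannot read /proc/self/maps:", err)
		return
	}
	fmt.Printf("goroutine stack address %#x lies in:\n%s\n",
		addr, findMapping(string(data), addr))
}
```

On a typical linux/amd64 run one would expect the reported mapping to be one of the anonymous ranges starting at c000000000, though the exact range depends on the system.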

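Apart from the program image, which occupies less than 2 MiB, the region at c000000000 looks the most suspicious, and no [heap] entry appears in these mappings at all, which directly supports the guess above. To find out about c000000000 we turn to the source, starting with the initialization of the memory allocator: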

func mallocinit() {
  // ...... (some code omitted)

  // Note: only the initialization for 64-bit systems is shown.
  // Create initial arena growth hints.
  if goarch.PtrSize == 8 {
    // On a 64-bit machine, we pick the following hints
    // because:
    //
    // 1. Starting from the middle of the address space
    // makes it easier to grow out a contiguous range
    // without running in to some other mapping.
    //
    // 2. This makes Go heap addresses more easily
    // recognizable when debugging.
    //
    // 3. Stack scanning in gccgo is still conservative,
    // so it's important that addresses be distinguishable
    // from other data.
    //
    // Starting at 0x00c0 means that the valid memory addresses
    // will begin 0x00c0, 0x00c1, ...
    // In little-endian, that's c0 00, c1 00, ... None of those are valid
    // UTF-8 sequences, and they are otherwise as far away from
    // ff (likely a common byte) as possible. If that fails, we try other 0xXXc0
    // addresses. An earlier attempt to use 0x11f8 caused out of memory errors
    // on OS X during thread allocations.  0x00c0 causes conflicts with
    // AddressSanitizer which reserves all memory up to 0x0100.
    // These choices reduce the odds of a conservative garbage collector
    // not collecting memory because some non-pointer block of memory
    // had a bit pattern that matched a memory address.
    //
    // However, on arm64, we ignore all this advice above and slam the
    // allocation at 0x40 << 32 because when using 4k pages with 3-level
    // translation buffers, the user address space is limited to 39 bits
    // On ios/arm64, the address space is even smaller.
    //
    // On AIX, mmaps starts at 0x0A00000000000000 for 64-bit.
    // processes.
    for i := 0x7f; i >= 0; i-- {
      var p uintptr
      switch {
      case raceenabled:
        // The TSAN runtime requires the heap
        // to be in the range [0x00c000000000,
        // 0x00e000000000).
        p = uintptr(i)<<32 | uintptrMask&(0x00c0<<32)
        if p >= uintptrMask&0x00e000000000 {
          continue
        }
      case GOARCH == "arm64" && GOOS == "ios":
        p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
      case GOARCH == "arm64":
        p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
      case GOOS == "aix":
        if i == 0 {
          // We don't use addresses directly after 0x0A00000000000000
          // to avoid collisions with others mmaps done by non-go programs.
          continue
        }
        p = uintptr(i)<<40 | uintptrMask&(0xa0<<52)
      default:
        p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
      }
      hint := (*arenaHint)(mheap_.arenaHintAlloc.alloc())
      hint.addr = p
      hint.next, mheap_.arenaHints = mheap_.arenaHints, hint
    }
  }
}

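The first point in the comment says that starting from the middle of the address space makes it easier to grow a contiguous range without running into other mappings.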

Go therefore starts at 0x00c0, and a for loop generates 128 hint addresses that are linked into the list mheap_.arenaHints:

0x7fc000000000
......
0x10c000000000
0x0fc000000000
0x0ec000000000
0x0dc000000000
0x0cc000000000
0x0bc000000000
0x0ac000000000
0x09c000000000
0x08c000000000
0x07c000000000
0x06c000000000
0x05c000000000
0x04c000000000
0x03c000000000
0x02c000000000
0x01c000000000
0x00c000000000

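Each of these 128 starting addresses except the highest one, 0x7fc000000000, has 1 TiB of room to grow upward; the highest is only 256 GiB below the top of user space.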

0x00c000000000 lies 768 GiB above the start of user space, which is also why it does not collide with other mappings.
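
These figures are easy to verify. The sketch below reproduces only the default branch of the loop (the one taken on linux/amd64 without the race detector) and assumes a 128 TiB user address space, that is, 47-bit user addresses; arenaHints is a hypothetical helper that returns the addresses in the order the loop generates them.

```go
package main

import "fmt"

// arenaHints (hypothetical) reproduces the default branch of the hint loop
// in mallocinit and returns the addresses in generation order, i = 0x7f .. 0.
func arenaHints() []uintptr {
	hints := make([]uintptr, 0, 128)
	for i := 0x7f; i >= 0; i-- {
		hints = append(hints, uintptr(i)<<40|uintptr(0x00c0)<<32)
	}
	return hints
}

func main() {
	// Assumption: user space ends at 128 TiB (47-bit user addresses).
	const userSpaceTop = uintptr(1) << 47

	h := arenaHints()
	lowest, highest := h[len(h)-1], h[0]
	fmt.Printf("number of hints: %d\n", len(h))
	fmt.Printf("lowest  %#x is %d GiB above address 0\n", lowest, lowest>>30)
	fmt.Printf("highest %#x is %d GiB below the top of user space\n",
		highest, (userSpaceTop-highest)>>30)
	fmt.Printf("spacing between neighbours: %d GiB\n", (h[0]-h[1])>>30)
}
```

Because the runtime prepends each new hint to the list, the last address generated, 0x00c000000000, ends up at the head and is the first one tried.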

mallocinit initializes mheap_.arenaHints. Remember how mheap requests memory when it grows the heap?

// Try to grow the heap at a hint address.
for h.arenaHints != nil {
  hint := h.arenaHints
  p := hint.addr
  if hint.down {
    p -= n
  }
  if p+n < p {
    // We can't use this, so don't ask.
    v = nil
  } else if arenaIndex(p+n-1) >= 1<<arenaBits {
    // Outside addressable heap. Can't use.
    v = nil
  } else {
    v = sysReserve(unsafe.Pointer(p), n)
  }
  // Note: if the two differ, mmap could not place the mapping at the hinted address.
  if p == uintptr(v) {
    // Success. Update the hint.
    if !hint.down {
      p += n
    }
    // Note: on success the hint address moves along as well.
    hint.addr = p
    size = n
    break
  }
  // Note: give back what was just reserved and try the next arenaHint,
  // that is, the next 1 TiB range.
  // Failed. Discard this hint and try the next.
  //
  // TODO: This would be cleaner if sysReserve could be
  // told to only return the requested address. In
  // particular, this is already how Windows behaves, so
  // it would simplify things there.
  if v != nil {
    sysFree(v, n, nil)
  }
  h.arenaHints = hint.next
  h.arenaHintAlloc.free(unsafe.Pointer(hint))
}

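The mmap calls all revolve around arenaHints. After each successful reservation the hint's addr is updated, which yields contiguous growth until a reservation fails. When that happens, the runtime starts again from the next 1 TiB range.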
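
The behaviour of the hint list can be simulated without touching the kernel. In this sketch arenaHint is a simplified stand-in for the runtime type (upward growth only), reserveFunc is a hypothetical replacement for sysReserve, and the "kernel" is a closure that pretends part of the first 1 TiB range is already occupied.

```go
package main

import "fmt"

// arenaHint is a simplified stand-in for runtime.arenaHint.
type arenaHint struct {
	addr uintptr
	next *arenaHint
}

// reserveFunc is a hypothetical stand-in for sysReserve: given a hint and a
// size it returns the address the "kernel" actually chose.
type reserveFunc func(hint, n uintptr) uintptr

// allocFromHints walks the list the way the loop in sysAlloc does. On
// success it advances the hint so the next reservation is contiguous; on
// failure it drops the hint and moves on. (The real code also frees the
// misplaced reservation with sysFree.)
func allocFromHints(head **arenaHint, n uintptr, reserve reserveFunc) (uintptr, bool) {
	for *head != nil {
		hint := *head
		p := hint.addr
		if reserve(p, n) == p {
			hint.addr = p + n
			return p, true
		}
		*head = hint.next
	}
	return 0, false
}

func main() {
	const tib = uintptr(1) << 40
	const base = uintptr(0x00c0) << 32
	hints := &arenaHint{addr: base, next: &arenaHint{addr: base + tib}}

	// Fake kernel: the first range is occupied from base+128 MiB upward.
	occupied := base + 128<<20
	reserve := func(hint, n uintptr) uintptr {
		if hint < base+tib && hint+n > occupied {
			return 0x7f0000000000 // placed elsewhere: hint not honoured
		}
		return hint
	}

	for i := 0; i < 4; i++ {
		p, ok := allocFromHints(&hints, 64<<20, reserve)
		fmt.Printf("request %d: %#x ok=%v\n", i, p, ok)
	}
}
```

With these made-up numbers the first two 64 MiB requests are contiguous, the third collides with the occupied area, and the allocator jumps to the next range 1 TiB higher and keeps growing from there.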

4. The g0 stack

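Having looked at how ordinary goroutine stacks are allocated, let's briefly cover the g0 stack. g0 is a special goroutine: it only helps the runtime do its work and does not run any user function, which sets it apart from ordinary user goroutines. To some extent it can be compared to the per-thread kernel stack in an operating system: whenever the runtime takes control, execution switches to the stack that g0 represents.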

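Go's GPM model is not covered here; the article Scheduling In Go : Part II - Go Scheduler is a good introduction to the concurrency model. We only talk about M. Every M has a g0 stack for running runtime code. M0 is special (it is the main thread of the Go process, and each Go program has exactly one): its g0 stack is initialized in assembly.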

Let's first look at the entry point of a Go program:

richard@Richard-Manjaro:~ » readelf -h carefree
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x463f20
  Start of program headers:          64 (bytes into file)
  Start of section headers:          456 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         7
  Size of section headers:           64 (bytes)
  Number of section headers:         23
  Section header string table index: 3

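The ELF header shows that the entry point is 0x463f20. Because cgo is disabled and there are no dynamic libraries, the address given by Entry point is the program's actual entry. Let's see what code lives at that address: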

richard@Richard-Manjaro:~ » lldb ./carefree
(lldb) target create "./carefree"
Current executable set to '/home/richard/carefree' (x86_64).
(lldb) image lookup --address 0x463f20
      Address: carefree[0x0000000000463f20] (carefree.PT_LOAD[0]..text + 405280)
      Summary: carefree`_rt0_amd64_linux
(lldb)

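_rt0_amd64_linux is the program's entry. When the program is run, the shell forks a child process, which then calls execve() to load the Go executable. Once the kernel has finished loading it, it sets the CPU's program counter to this entry point, and the Go program starts executing.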

_rt0_amd64_linux ends up jumping to runtime·rt0_go in asm_amd64.s. Here is the beginning of runtime·rt0_go:

TEXT runtime·rt0_go(SB),NOSPLIT|TOPFRAME,$0
  // copy arguments forward on an even stack
  MOVQ  DI, AX    // argc
  MOVQ  SI, BX    // argv
  SUBQ  $(5*8), SP    // 3args 2auto
  ANDQ  $~15, SP
  MOVQ  AX, 24(SP)
  MOVQ  BX, 32(SP)

  // create istack out of the given (operating system) stack.
  // _cgo_init may update stackguard.
  // Note: initialize g0.
  MOVQ  $runtime·g0(SB), DI
  LEAQ  (-64*1024+104)(SP), BX
  MOVQ  BX, g_stackguard0(DI)
  MOVQ  BX, g_stackguard1(DI)
  MOVQ  BX, (g_stack+stack_lo)(DI)
  MOVQ  SP, (g_stack+stack_hi)(DI)

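This code sets up the g0 stack by taking the thread's current stack pointer as the high end and SP - 64 KiB + 104 bytes as the low end. When it finishes, the g0 stack has been initialized to roughly 64 KiB. Surprisingly, it is carved out of the system thread's 8 MiB stack (the default thread stack size on Linux is 8 MiB).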
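
The effect of the LEAQ instruction is plain pointer arithmetic, which the following sketch spells out. g0Bounds is a hypothetical helper and the stack pointer value is made up; only the offset comes from the assembly above.

```go
package main

import "fmt"

// g0Bounds (hypothetical) applies the arithmetic of
// LEAQ (-64*1024+104)(SP), BX: the low end is 64 KiB minus 104 bytes
// below sp, and sp itself becomes the high end.
func g0Bounds(sp uintptr) (lo, hi uintptr) {
	return sp - 64*1024 + 104, sp
}

func main() {
	// 0x7ffc335f6000 is a made-up stack pointer near the top of user space.
	lo, hi := g0Bounds(0x7ffc335f6000)
	fmt.Printf("g0 stack [%#x, %#x), %d bytes\n", lo, hi, hi-lo)
}
```

The usable size is therefore 104 bytes short of a full 64 KiB.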

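Now for the g0 of every other M. Go creates operating system threads through runtime.newm; following the call chain shows that the system call it eventually makes is clone: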

func newosproc(mp *m) {
  stk := unsafe.Pointer(mp.g0.stack.hi)
  /*
   * note: strace gets confused if we use CLONE_PTRACE here.
   */
  if false {
    print("newosproc stk=", stk, " m=", mp, " g=", mp.g0, " clone=", abi.FuncPCABI0(clone), " id=", mp.id, " ostk=", &mp, "\n")
  }

  // Disable signals during clone, so that the new thread starts
  // with signals disabled. It will enable them in minit.
  var oset sigset
  sigprocmask(_SIG_SETMASK, &sigset_all, &oset)
  ret := clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(abi.FuncPCABI0(mstart)))
  sigprocmask(_SIG_SETMASK, &oset, nil)

  if ret < 0 {
    print("runtime: failed to create new OS thread (have ", mcount(), " already; errno=", -ret, ")\n")
    if ret == -_EAGAIN {
      println("runtime: may need to increase max user processes (ulimit -u)")
    }
    throw("newosproc")
  }
}

The stack address passed to clone is mp.g0.stack.hi, the high end of that M's g0 stack. The g0 itself is initialized in runtime.allocm:

if iscgo || mStackIsSystemAllocated() {
  mp.g0 = malg(-1)
} else {
  mp.g0 = malg(8192 * sys.StackGuardMultiplier)
}

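So every later g0 is allocated through malg, the function covered earlier; all we need to note here is that the stack size is 8 KiB. It follows that, apart from m0's g0, which lives in the traditional main-thread stack region, the g0 stacks of all later Ms are allocated from the Go heap. Where that can be was discussed in the previous section.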

5. Switching goroutine stacks

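When the runtime schedules a goroutine onto a CPU, it has to do more than set the program counter to the goroutine's code: it also has to switch to the goroutine's stack before execution continues. This section looks at how that switch happens. Stack switching is closely tied to scheduling, but we only discuss the stack-related parts here and do not go into scheduling details.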

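After m0 has set up a series of preconditions, it calls runtime·mstart, which is what really gets M0 running. The start function passed to clone for later Ms is runtime·mstart as well. runtime·mstart eventually enters the scheduling function runtime.schedule, whose job is to find a runnable G by any means possible and put it on the CPU. Once it has found one, it calls a function written in assembly, runtime·gogo(buf *gobuf):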

// func gogo(buf *gobuf)
// restore state from Gobuf; longjmp
TEXT runtime·gogo(SB), NOSPLIT, $0-8
  MOVQ  buf+0(FP), BX    // gobuf
  MOVQ  gobuf_g(BX), DX
  MOVQ  0(DX), CX    // make sure g != nil
  JMP  gogo<>(SB)

TEXT gogo<>(SB), NOSPLIT, $0
  get_tls(CX)
  MOVQ  DX, g(CX)
  MOVQ  DX, R14    // set the g register
  MOVQ  gobuf_sp(BX), SP  // restore SP
  MOVQ  gobuf_ret(BX), AX
  MOVQ  gobuf_ctxt(BX), DX
  MOVQ  gobuf_bp(BX), BP
  MOVQ  $0, gobuf_sp(BX)  // clear to help garbage collector
  MOVQ  $0, gobuf_ret(BX)
  MOVQ  $0, gobuf_ctxt(BX)
  MOVQ  $0, gobuf_bp(BX)
  MOVQ  gobuf_pc(BX), BX
  JMP  BX

runtime·gogo jumps to gogo<>. The argument is gobuf, the field of the g struct that holds its scheduling state:

type gobuf struct {
  sp   uintptr
  pc   uintptr
  g    guintptr
  ctxt unsafe.Pointer
  ret  uintptr
  lr   uintptr
  bp   uintptr // for framepointer-enabled architectures
}

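It holds the program counter, the stack pointer, and other important values. They are saved when the goroutine is scheduled off the CPU and form the goroutine's execution context. gogo restores that context, including the program counter and the stack pointer, after which the goroutine resumes from where it was last interrupted.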
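
The save-and-restore idea can be illustrated with a toy model. This is emphatically not how the runtime works: the real gogo restores machine registers such as SP and BP in assembly, whereas here a "goroutine" is just a list of steps and the saved context is an index into it. The types task and gobuf and the functions runSlice and schedule are all invented for this sketch.

```go
package main

import "fmt"

// gobuf is a toy analogue of runtime.gobuf: pc is the index of the next
// step to run. The real struct also saves SP, BP and more.
type gobuf struct {
	pc int
}

// task is a toy goroutine: a list of steps plus its saved context.
type task struct {
	name  string
	steps []string
	sched gobuf
}

// runSlice restores the saved context, runs at most quantum steps, and
// saves the context again. It reports whether the task has finished.
func runSlice(t *task, quantum int, out *[]string) bool {
	pc := t.sched.pc // restore: continue where the task left off
	for n := 0; n < quantum && pc < len(t.steps); n++ {
		*out = append(*out, t.name+":"+t.steps[pc])
		pc++
	}
	t.sched.pc = pc // save: remember where to resume
	return pc == len(t.steps)
}

// schedule runs the tasks round-robin until all have finished and returns
// the steps in the order they were executed.
func schedule(tasks []*task, quantum int) []string {
	var out []string
	for len(tasks) > 0 {
		t := tasks[0]
		tasks = tasks[1:]
		if !runSlice(t, quantum, &out) {
			tasks = append(tasks, t) // not done: back to the run queue
		}
	}
	return out
}

func main() {
	a := &task{name: "a", steps: []string{"1", "2", "3"}}
	b := &task{name: "b", steps: []string{"1"}}
	fmt.Println(schedule([]*task{a, b}, 2))
}
```

Task a is interrupted after two steps, b runs, and a then resumes at its third step, because the only state needed to continue is what was saved in its gobuf.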

6. Summary

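This article set out to find where goroutine stacks are located in a process's virtual address space, walked through the source code with that goal in mind, and finally arrived at mmap, the kernel interface used for memory allocation.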

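mmap is so flexible that pinning it to a fixed spot in the virtual memory layout is awkward, because the Go heap in effect has the whole user portion of the address space at its disposal. Still, the design of its memory allocation gives us a good idea of where things end up.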

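The Go heap starts in the "middle" of user space, or more precisely 768 GiB above its start. Measured against the 128 TiB of user space this is nowhere near the middle; it only is so relative to the traditional heap. I like to think of this as Go's way of respecting history. Fortunately the virtual address space in 64-bit mode is large enough to allow very flexible designs.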

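The Go heap divides the space from there on into 128 parts, almost all of them 1 TiB in size, and then quietly grows upward from address 0x00c000000000, because c0 00 is not a valid UTF-8 sequence and is easy to recognize.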

Original link: http://xie.infoq.cn/article/eea2bb14aa566043556cd328b. Please contact the author before reposting.

黑客不够黑
