[Translation] Scalable Go Scheduler Design Doc
Dmitry Vyukov
May 2, 2012
The document assumes some prior knowledge of the Go language and the current goroutine scheduler implementation.
[Translator's note: "current" here means May 2012, when this design doc was written. The scheduler at that time was the old first-generation scheduler, so the author designed this new scheduler and wrote this document to explain the new scheduler model and its design rationale.]
Problems with current scheduler
The current goroutine scheduler limits scalability of concurrent programs written in Go, in particular high-throughput servers and parallel computational programs. The Vtocc server maxes out at 70% CPU on an 8-core box, while a profile shows 14% is spent in runtime.futex(). In general, the scheduler may inhibit users from using idiomatic fine-grained concurrency where performance is critical.
What's wrong with the current implementation:
1. Single global mutex (Sched.Lock) and centralized state. The mutex protects all goroutine-related operations (creation, completion, rescheduling, etc).
2. Goroutine (G) hand-off (G.nextg). Worker threads (M's) frequently hand off runnable goroutines between each other, which may lead to increased latencies and additional overheads. Every M must be able to execute any runnable G, in particular the M that just created the G.
3. Per-M memory cache (M.mcache). The memory cache and other caches (stack alloc) are associated with all M's, while they need to be associated only with M's running Go code (an M blocked inside a syscall does not need an mcache). The ratio between M's running Go code and all M's can be as high as 1:100. This leads to excessive resource consumption (each MCache can consume up to 2M) and poor data locality.
4. Aggressive thread blocking/unblocking. In presence of syscalls worker threads are frequently blocked and unblocked. This adds a lot of overhead.
Design
Processors
The general idea is to introduce a notion of P (Processors) into the runtime and implement a work-stealing scheduler on top of Processors.
M represents an OS thread (as it is now). P represents a resource that is required to execute Go code. When M executes Go code, it has an associated P. When M is idle or in a syscall, it does not need a P.
There are exactly GOMAXPROCS P's. All P's are organized into an array; that is a requirement of work-stealing. A GOMAXPROCS change involves stopping/starting the world to resize the array of P's.
Some variables from sched are de-centralized and moved to P. Some variables from M are moved to P (the ones that relate to active execution of Go code).
There is also a lock-free list of idle P's.
When an M is willing to start executing Go code, it must pop a P from the list. When an M ends executing Go code, it pushes the P to the list. So, when M executes Go code, it necessarily has an associated P. This mechanism replaces sched.atomic (mcpu/mcpumax).
Scheduling
When a new G is created or an existing G becomes runnable, it is pushed onto a list of runnable goroutines of the current P. When P finishes executing a G, it first tries to pop a G from its own list of runnable goroutines; if the list is empty, P chooses a random victim (another P) and tries to steal half of the runnable goroutines from it.
Syscalls/M Parking and Unparking
When an M creates a new G, it must ensure that there is another M to execute the G (if not all M’s are already busy). Similarly, when an M enters syscall, it must ensure that there is another M to execute Go code.
There are two options: we can either promptly block and unblock M's, or employ some spinning. There is an inherent conflict between performance and burning unnecessary CPU cycles. The idea is to use spinning and burn CPU cycles. However, it should not affect programs running with GOMAXPROCS=1 (command-line utilities, appengine, etc).
Spinning is two-level: (1) an idle M with an associated P spins looking for new G's, (2) an M without an associated P spins waiting for available P's. There are at most GOMAXPROCS spinning M's (both (1) and (2)). Idle M's of type (1) do not block while there are idle M's of type (2).
[Translator's note: understanding the concept of spinning here is key!]
When a new G is spawned, or an M enters a syscall, or an M transitions from idle to busy, it ensures that there is at least 1 spinning M (or that all P's are busy). This ensures that there are no runnable G's that could otherwise be running, and avoids excessive M blocking/unblocking at the same time.
Spinning is mostly passive (yield to the OS, sched_yield()), but may include a little bit of active spinning (a loop burning CPU) (requires investigation and tuning).
Termination/Deadlock Detection
Termination/deadlock detection is more problematic in a distributed system. The general idea is to do the checks only when all P's are idle (a global atomic counter of idle P's); this allows more expensive checks that involve aggregation of per-P state.
No details yet.
LockOSThread
This functionality is not performance-critical.
1. Locked G becomes non-runnable (Gwaiting). M instantly returns P to the idle list, wakes up another M, and blocks.
2. Locked G becomes runnable (and reaches the head of the runq). Current M hands off its own P and the locked G to the M associated with the locked G, and unblocks it. Current M becomes idle.
Idle G
This functionality is not performance-critical.
There is a global queue of (or a single?) idle G. An M that looks for work checks the queue after several unsuccessful steal attempts.
Implementation Plan
The goal is to split the whole thing into minimal parts that can be independently reviewed and submitted.
1. Introduce the P struct (empty for now); implement allp/idlep containers (idlep is mutex-protected for starters); associate a P with M running Go code. Global mutex and atomic state is still preserved.
2. Move G freelist to P.
3. Move mcache to P.
4. Move stackalloc to P.
5. Move ncgocall/gcstats to P.
6. Decentralize run queue, implement work-stealing. Eliminate G hand off. Still under global mutex.
7. Remove global mutex, implement distributed termination detection, LockOSThread.
8. Implement spinning instead of prompt blocking/unblocking.
The plan may turn out to not work, there are a lot of unexplored details.
Potential Further Improvements
1. Try out LIFO scheduling; this will improve locality. However, it still must provide some degree of fairness and gracefully handle yielding goroutines.
2. Do not allocate G and stack until the goroutine first runs. For a newly created goroutine we need just callerpc, fn, narg, nret and args, that is, about 6 words. This will allow creating a lot of running-to-completion goroutines with significantly lower memory overhead.
3. Better locality of G-to-P. Try to enqueue an unblocked G to a P on which it was last running.
4. Better locality of P-to-M. Try to execute P on the same M it was last running on.
5. Throttling of M creation. The scheduler can be easily forced to create thousands of M's per second until the OS refuses to create more threads. M's must be created promptly up to k*GOMAXPROCS; after that, new M's may be added by a timer.
Random Notes
- GOMAXPROCS won’t go away as a result of this work.
[Translator's addendum] Appendix: https://yizhi.ren/2019/06/03/goscheduler/
Copyright notice: this article is an original work by InfoQ author 卓丁. Original link: http://xie.infoq.cn/article/a6948402be688dba530094e9b. Please contact the author before reposting.