[Translation] The Go scheduler

卓丁
Published: September 8, 2020

Original post link

Introduction

One of the big features for Go 1.1 is the new scheduler, contributed by Dmitry Vyukov. The new scheduler has given a dramatic increase in performance for parallel Go programs and with nothing better to do, I figured I'd write something about it.



Most of what's written in this blog post is already described in the original design doc. It's a fairly comprehensive document, but pretty technical.



[Translator's note: the author is telling the honest truth here; the original design doc really does read as quite technical. Don't take my word for it, give it a try and compare.]

All you need to know about the new scheduler is in that design document but this post has pictures, so it's clearly superior.



What does the Go runtime need with a scheduler?

But before we look at the new scheduler, we need to understand why it's needed. Why create a userspace scheduler when the operating system can schedule threads for you?



The POSIX thread API is very much a logical extension to the existing Unix process model and as such, threads get a lot of the same controls as processes. Threads have their own signal mask, can be assigned CPU affinity, can be put into cgroups and can be queried for which resources they use. All these controls add overhead for features that are simply not needed for how Go programs use goroutines and they quickly add up when you have 100,000 threads in your program.



Another problem is that the OS can't make informed scheduling decisions, based on the Go model. For example, the Go garbage collector requires that all threads are stopped when running a collection and that memory must be in a consistent state. This involves waiting for running threads to reach a point where we know that the memory is consistent.



[Translator's note: in short, on one hand the thread controls the OS provides are largely superfluous for goroutines and simply go unused; on the other hand, the scheduling support the Go model needs for garbage collection is something the OS cannot realistically provide.]



When you have many threads scheduled out at random points, chances are that you're going to have to wait for a lot of them to reach a consistent state. The Go scheduler can make the decision of only scheduling at points where it knows that memory is consistent. This means that when we stop for garbage collection, we only have to wait for the threads that are being actively run on a CPU core.



Our Cast of Characters

There are 3 usual models for threading. One is N:1 where several userspace threads are run on one OS thread. This has the advantage of being very quick to context switch but cannot take advantage of multi-core systems. Another is 1:1 where one thread of execution matches one OS thread. It takes advantage of all of the cores on the machine, but context switching is slow because it has to trap through the OS.



Go tries to get the best of both worlds by using a M:N scheduler. It schedules an arbitrary number of goroutines onto an arbitrary number of OS threads. You get quick context switches and you take advantage of all the cores in your system. The main disadvantage of this approach is the complexity it adds to the scheduler.



To accomplish the task of scheduling, the Go Scheduler uses 3 main entities:



The triangle represents an OS thread. It's the thread of execution managed by the OS and works pretty much like your standard POSIX thread. In the runtime code, it's called M for machine.

The circle represents a goroutine. It includes the stack, the instruction pointer and other information important for scheduling goroutines, like any channel it might be blocked on. In the runtime code, it's called a G.



The rectangle represents a context for scheduling. You can look at it as a localized version of the scheduler which runs Go code on a single thread. It's the important part that lets us go from a N:1 scheduler to a M:N scheduler. In the runtime code, it's called P for processor. More on this part in a bit.



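To make the three entities concrete, here is a heavily simplified sketch of what each one carries. This is illustrative only; the field names are mine, and the real definitions are the g, m and p structs in the runtime source (runtime2.go), which hold far more state.

```go
package sched

// G: a goroutine. The scheduler keeps its stack and a saved
// instruction pointer so it can resume the goroutine later, plus
// whatever the goroutine may be blocked on.
type G struct {
	stack    []byte        // the goroutine's stack
	instrPtr uintptr       // saved instruction pointer, restored on resume
	blocked  chan struct{} // a channel this goroutine may be waiting on
}

// M: an OS thread ("machine"). It runs at most one goroutine at a
// time and must hold a context to run Go code.
type M struct {
	curG *G // goroutine currently executing on this thread
	p    *P // context held by this thread; nil while blocked in a syscall
}

// P: a scheduling context ("processor"). There are GOMAXPROCS of
// these, each with its own local runqueue.
type P struct {
	runq []*G   // local queue of runnable goroutines
	m    *M     // thread currently bound to this context
	tick uint64 // counts scheduling decisions (used in a later sketch)
}
```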



Here we see 2 threads (M), each holding a context (P), each running a goroutine (G). In order to run goroutines, a thread must hold a context.



The number of contexts is set on startup to the value of the GOMAXPROCS environment variable or through the runtime function GOMAXPROCS(). Normally this doesn't change during execution of your program. The fact that the number of contexts is fixed means that only GOMAXPROCS are running Go code at any point. We can use that to tune the invocation of the Go process to the individual computer, such that a 4 core PC is running Go code on 4 threads.



[Translator's note: the principle described here is crucial to grasp; it indirectly answers questions like "how many goroutines should a Go process start?" Also, P sits between M and G: it is not just a bridge linking the two, it also decouples M and G flexibly, as discussed below.]
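For illustration, a minimal program that queries and sets the number of contexts. (One caveat: in current Go releases the default is the number of CPUs visible to the process; when this post was written, it defaulted to 1.)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// An argument of 0 queries the current setting without changing it.
	// The GOMAXPROCS environment variable, if set, provides the default.
	fmt.Println("contexts (P):", runtime.GOMAXPROCS(0))
	fmt.Println("CPU cores:   ", runtime.NumCPU())

	// A positive argument sets the number of contexts and returns the
	// previous setting. This normally happens once, at startup, if at all.
	prev := runtime.GOMAXPROCS(4)
	fmt.Println("previous setting:", prev)
}
```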



The greyed out goroutines are not running, but ready to be scheduled. They're arranged in lists called runqueues. Goroutines are added to the end of a runqueue whenever a goroutine executes a go statement. Once a context has run a goroutine until a scheduling point, it pops a goroutine off its runqueue, sets stack and instruction pointer and begins running the goroutine.



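A small runnable illustration of both halves of that paragraph: each go statement enqueues a goroutine, and an explicit scheduling point lets the context pop and run them. (runtime.Gosched is used here only to force a visible scheduling point; the output order of the goroutines may vary.)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(1) // a single context, to make the effect easy to see

	for i := 0; i < 3; i++ {
		// Each go statement adds a new G to the context's runqueue.
		go fmt.Println("goroutine", i)
	}

	// A scheduling point: yield the context, so the queued goroutines
	// are popped off the runqueue and run.
	runtime.Gosched()
	fmt.Println("main runs again")
}
```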



To bring down mutex contention, each context has its own local runqueue. A previous version of the Go scheduler only had a global runqueue with a mutex protecting it. Threads were often blocked waiting for the mutex to be unlocked. This got really bad when you had 32 core machines that you wanted to squeeze as much performance out of as possible.



[Translator's note: this also answers, in passing, why each context P needs its own local runqueue.]



The scheduler keeps on scheduling in this steady state as long as all contexts have goroutines to run. However, there are a couple of scenarios that can change that.



Who you gonna (sys)call?

You might wonder now, why have contexts at all? Can't we just put the runqueues on the threads and get rid of contexts? Not really. The reason we have contexts is so that we can hand them off to other threads if the running thread needs to block for some reason.



An example of when we need to block, is when we call into a syscall. Since a thread cannot both be executing code and be blocked on a syscall, we need to hand off the context so it can keep scheduling.



[Translator's note: M is what actually executes code, and G is the code actually being executed; P is just a porter, fetching runnable Gs from the runqueue and "feeding" them to an M to execute.]



Here we see a thread giving up its context so that another thread can run it. The scheduler makes sure there are enough threads to run all contexts. M1 in the illustration above might be created just for the purpose of handling this syscall or it could come from a thread cache. The syscalling thread will hold on to the goroutine that made the syscall since it's technically still executing, albeit blocked in the OS.



When the syscall returns, the thread must try and get a context in order to run the returning goroutine. The normal mode of operation is to steal a context from one of the other threads. If it can't steal one, it will put the goroutine on a global runqueue, put itself on the thread cache and go to sleep.



[Translator's note: a syscall is like a special errand; once it's done, M must return to the scheduling floor and stand by. Back on the floor, its strategy is to tentatively "steal" a P from another thread, though "lending a hand" might be a fairer description. If there turns out to be nothing to take over (the "steal" fails), the thread M takes a break ♨️ and naps for a while.]
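Sketched in code, the handoff around a syscall looks roughly like this, reusing the G/M/P types from the earlier sketch. The names are illustrative stand-ins for the runtime's entersyscall/exitsyscall (runtime/proc.go), and the helpers are stubs so the sketch compiles; this is an outline of the behavior described above, not the real implementation.

```go
package sched

// entersyscall: the thread is about to block in the OS, so it gives
// up its context for another thread to run. It keeps its G, which is
// technically still executing, albeit blocked in the OS.
func entersyscall(m *M) {
	p := m.p
	m.p = nil
	handoffP(p) // wake or create another M to keep running this P
	// m now blocks in the OS on the syscall
}

// exitsyscall: the syscall returned; the thread needs a context
// before the returning goroutine can run again.
func exitsyscall(m *M) {
	if p := takeIdleP(); p != nil {
		m.p = p // got a context: resume running m.curG
		return
	}
	globalRunqPut(m.curG) // no context available: queue the G globally,
	threadCachePut(m)     // park this thread in the cache, and sleep
}

// Stubs so the sketch compiles; the real versions deal with the
// scheduler lock, idle lists and thread wakeups.
func handoffP(p *P)       {}
func takeIdleP() *P       { return nil }
func globalRunqPut(g *G)  {}
func threadCachePut(m *M) {}
```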



The global runqueue is a runqueue that contexts pull from when they run out of their local runqueue. Contexts also periodically check the global runqueue for goroutines. Otherwise the goroutines on the global runqueue could end up never running because of starvation.



[Translator's note: this explains, in effect, why a global runqueue is needed.]
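That lookup order can be sketched as follows, continuing the illustrative sketch package from above; stealWork is sketched in the next section. The real version is findrunnable in runtime/proc.go, where the periodic global-queue check happens once every 61 scheduler ticks.

```go
package sched

// findRunnable: where a context looks for the next goroutine to run.
func findRunnable(p *P) *G {
	// Periodically poll the global runqueue even when local work
	// exists, so globally queued goroutines cannot starve.
	p.tick++
	if p.tick%61 == 0 {
		if g := globalRunqGet(); g != nil {
			return g
		}
	}
	// Usual order: local runqueue first, then the global runqueue.
	if g := localRunqGet(p); g != nil {
		return g
	}
	if g := globalRunqGet(); g != nil {
		return g
	}
	// Last resort: steal from another context (next section).
	return stealWork(p)
}

func localRunqGet(p *P) *G {
	if len(p.runq) == 0 {
		return nil
	}
	g := p.runq[0]
	p.runq = p.runq[1:]
	return g
}

func globalRunqGet() *G { return nil } // stub; pulls from the global queue
```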



This handling of syscalls is why Go programs run with multiple threads, even when GOMAXPROCS is 1. The runtime uses goroutines that call syscalls, leaving threads behind.



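This is easy to observe. The sketch below (Unix-only, since it uses raw pipe file descriptors that bypass the netpoller) parks some goroutines in a blocking read syscall with GOMAXPROCS set to 1, then prints how many OS threads the runtime has created:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1)

	// Park 10 goroutines in a genuinely blocking read. Each blocked
	// syscall pins an OS thread while its context is handed off, so
	// the runtime spins up more threads to keep Go code running.
	for i := 0; i < 10; i++ {
		var fds [2]int
		syscall.Pipe(fds[:]) // nothing is ever written, so reads block
		go func(fd int) {
			buf := make([]byte, 1)
			syscall.Read(fd, buf) // blocks in the OS
		}(fds[0])
	}

	time.Sleep(100 * time.Millisecond) // let the goroutines enter their syscalls
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
}
```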



Stealing work

Another way that the steady state of the system can change is when a context runs out of goroutines to schedule. This can happen if the amount of work on the contexts' runqueues is unbalanced. This can cause a context to end up exhausting its runqueue while there is still work to be done in the system. To keep running Go code, a context can take goroutines out of the global runqueue but if there are no goroutines in it, it'll have to get them from somewhere else.



[Translator's note: so the runtime's scheduling involves two kinds of "stealing" at different levels: an M may steal a P, and a P may steal Gs.]



That somewhere is the other contexts. When a context runs out, it will try to steal about half of the runqueue from another context. This makes sure there is always work to do on each of the contexts, which in turn makes sure that all threads are working at their maximum capacity.



[Translator's note: the key to understanding M, P and G is that P acts as a middleman, constantly hauling goods G (units of concurrent work) and delivering them to M, the worker that actually does the job.]
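Assuming the slice-backed runqueues from the earlier sketch, the steal can be outlined like this. The real runqsteal in runtime/proc.go works on a fixed-size ring buffer with atomic operations and picks its victims in a randomized order; this is only the shape of the idea.

```go
package sched

var allP []*P // every context in the system

// stealWork takes about half of another context's runqueue and
// returns one goroutine to run immediately.
func stealWork(thief *P) *G {
	for _, victim := range allP {
		if victim == thief || len(victim.runq) == 0 {
			continue
		}
		n := (len(victim.runq) + 1) / 2 // about half, at least one
		stolen := victim.runq[:n]
		victim.runq = victim.runq[n:]
		thief.runq = append(thief.runq, stolen[1:]...)
		return stolen[0] // run one now, queue the rest locally
	}
	return nil // nothing to steal anywhere
}
```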



Where to go?

There are many more details to the scheduler, like cgo threads, the LockOSThread() function and integration with the network poller. These are outside the scope of this post, but still merit study. I might write about these later. There are certainly plenty of interesting constructions to be found in the Go runtime library.



By Daniel Morsing
