Why Does a Linux Parent Process Insist on Knowing How Its Child Died?

Published: December 7, 2020
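The old bury the young

It is common knowledge that in Linux it is always "the old burying the young": when a child process dies, the parent waits for it with wait(), reaps the zombie it leaves behind, and in doing so can also learn why the child died.

As the saying goes, "Talk is cheap. Show me the code." So let's look at some actual code.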
﻿
﻿
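In this program, the child waits for a signal in pause() on line 18, and the parent waits for the child to terminate in waitpid() on line 22. The status argument is an output parameter through which the parent learns why the child died.

For example, let's run the program:

$ ./a.out

child process id: 3320

Then kill child process 3320 with signal 2 (SIGINT):

$ kill -2 3320

waitpid() returns in the parent, which reads the cause from status and prints:

child process is killed by signal 2

Now suppose we remove pause() from the child and have it exit directly with _exit(1).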
﻿
﻿
The parent then detects the child's death and prints its exit status:

$ ./a.out

child process id: 3362

child process exits, status=1

Clearly, the parent knows both that the child has died and exactly why.
﻿
﻿
The kernel source shows the same thing. In wait_task_zombie(), the parent reads the combined exit_code recorded in the child's zombie and assembles status from it.
﻿
﻿
﻿
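Everything happens for a reason

So why must the parent know that its child has died? And why does it insist on knowing the cause of death?

The first question is easy to answer. Suppose the init process starts an httpd service that serves our website, and httpd dies in the middle of the night. First, the company's system administrator has no way of knowing that httpd is dead. Second, even if he did know, he is not going to drive to the office at midnight to type the httpd command again. So this should be handled automatically by some Linux mechanism: if init learns that httpd has died, for example, it can start a new httpd process on its own.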
﻿
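The second question is a bit more involved, and we will answer it with a real init project: systemd. systemd is the init used by today's mainstream Linux distributions, my Ubuntu 18.10 included:

$ ls -l /sbin/init

... /sbin/init -> /lib/systemd/systemd

/sbin/init is a symbolic link to systemd.

To add a background service that starts at boot under systemd, we can add a service file to the /lib/systemd/system/ directory. Here, for example, I added a very simple service file: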
﻿
/lib/systemd/system/simple-server.service
﻿
Its contents are as follows:
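As a rough sketch, a unit file for such a service could look like the following. The Description= text and the ExecStart= path are assumptions made for illustration; WantedBy=multi-user.target is consistent with the symlink that systemctl enable creates later in this article.

```ini
[Unit]
Description=simple server

[Service]
# Assumption: hypothetical install path of the simple-server binary.
ExecStart=/usr/local/bin/simple-server

[Install]
WantedBy=multi-user.target
```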
﻿
﻿
simple-server is an extremely simple service I wrote that prints hello world:
﻿
﻿
We enable the service on Ubuntu:

$ sudo systemctl enable simple-server

Created symlink /etc/systemd/system/multi-user.target.wants/simple-server.service → /lib/systemd/system/simple-server.service.
﻿
Start the service right away:

$ sudo systemctl start simple-server

Next we query its status and find that it is active.
﻿
﻿
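At this point we can see the simple-server process in the system. It is a child of systemd, the top-level init process (PID 1).

Now let's kill the simple-server process:

$ sudo killall simple-server

Checking the status again, we see that systemd has detected that the process behind simple-server was killed by the TERM signal, and the service state is now inactive.

The simple-server process itself is gone as well: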
﻿
﻿
pidof prints nothing at all!

Didn't you just say that when init detects a dead service it "can" restart it automatically, the way init would restart httpd? So now that I have killed simple-server, why didn't systemd restart it?

Note that I said "can", not "must".
﻿
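Tailored to the situation

In fact, under systemd, whether a dead service is restarted, and under what circumstances, can all be customized by the user.

In the [Service] section of the .service file, the Restart= setting states under which circumstances the dead child should be restarted. For example, we can add one line to the .service file:

The Restart=always on line 6 means that simple-server is restarted unconditionally, no matter why it died.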
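With that line added, the unit file might look like the sketch below, laid out so that Restart=always lands on line 6. As before, the Description= text and the ExecStart= path are assumptions made for illustration.

```ini
[Unit]
Description=simple server

[Service]
ExecStart=/usr/local/bin/simple-server
Restart=always
# Assumption: the Description= text and ExecStart= path above are hypothetical.

[Install]
WantedBy=multi-user.target
```

After editing a unit file, systemd has to re-read it (sudo systemctl daemon-reload) before the new setting takes effect.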
﻿
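The systemd documentation has a table that spells out whether systemd restarts the service when Restart= is set to no, always, on-success, on-failure and so on. systemd therefore distinguishes five different causes of death (a clean exit code or signal, an unclean exit code, an unclean signal, a timeout, and a watchdog expiry). For further reading:

Whether the service is restarted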
﻿
If set to no (the default), the service will not be restarted. If set to on-success, it will be restarted only when the service process exits cleanly. In this context, a clean exit means an exit code of 0, or one of the signals SIGHUP, SIGINT, SIGTERM or SIGPIPE, and additionally, exit statuses and signals specified in SuccessExitStatus=. If set to on-failure, the service will be restarted when the process exits with a non-zero exit code, is terminated by a signal (including on core dump, but excluding the aforementioned four signals), when an operation (such as service reload) times out, and when the configured watchdog timeout is triggered. If set to on-abnormal, the service will be restarted when the process is terminated by a signal (including on core dump, excluding the aforementioned four signals), when an operation times out, or when the watchdog timeout is triggered. If set to on-abort, the service will be restarted only if the service process exits due to an uncaught signal not specified as a clean exit status. If set to on-watchdog, the service will be restarted only if the watchdog timeout for the service expires. If set to always, the service will be restarted regardless of whether it exited cleanly or not, got terminated abnormally by a signal, or hit a timeout.
﻿
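As a parent process, systemd can decide what to do next based on how its child died. If we set on-failure, for example, the service is restarted only when the process dies abnormally: its exit code is non-zero, it is killed by a signal other than the four "clean" ones listed above (this includes signals that dump core, such as SIGSEGV on a segmentation fault), an operation times out, or the watchdog fires.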
﻿
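All of this can be tailored to the nature of the actual service. Take a oneshot service, one that only needs to run once at boot, such as a process that applies some setting or runs a file system check and then exits by itself. For such a service we would not set:

Restart=always

or

Restart=on-success

because once the oneshot service has run successfully, there is no need to start it again.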
﻿
﻿