fork(): execution order of parent and child processes

July 14, 2012

Investigation: execution order of parent and child after fork()
1       Background
2       Terminology
3       Findings
4       How fork works
4.1        Fork() -> copy_process()
5       2.6.9-2.6.23: child runs first
5.1        Advantages of running the child first
6       2.6.23-2.6.32: child runs first
6.1        The new wake_up_new_task
6.2        wake_up_new_task -> activate_task()
6.3        wake_up_new_task -> task_new()
6.4        sysctl_sched_child_runs_first
7       2.6.32 to present: parent runs first
8       Conclusion
9       Attachments and references

1         Background

On Linux, fork() creates a new process that is an almost exact copy of the caller.
#include <unistd.h>
pid_t fork(void);
I had read that on operating systems that use a copy-on-write mechanism for fork(), it is better to deliberately let the CHILD run first. This is better because in 99% of cases the child will call exec() and a new address space will be allocated. If instead the parent executes first, an unnecessary copy of the pages is made (if the parent writes), and later, when the child executes, a fresh address space is set up anyway.
So in Linux, does the child run first or the parent? Can we rely on this information?



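2         Terminology

  - COW: copy-on-write. After fork(), parent and child share physical pages, and a page is copied only when one of them writes to it.
  - Completely Fair Scheduler (CFS)
The process scheduling mechanism introduced starting with Linux 2.6.23. New features include modular scheduler classes, the Completely Fair Scheduler (CFS) itself, and CFS group scheduling. For details see: http://www.ibm.com/developerworks/linux/library/l-cfs/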

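3         Findings

  1. Which process runs first after fork() depends on the Linux kernel version.
  2. The order is not completely random or unpredictable.
  3. Kernel 2.6.9 up to but not including 2.6.23: the child runs first.
  4. Kernel 2.6.23 up to but not including 2.6.32: the child runs first; kernel.sched_child_runs_first defaults to 1.
  5. Kernel 2.6.32 and later, as of this writing: the parent runs first; kernel.sched_child_runs_first defaults to 0.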


Don't do child-runs-first for CLONE_VM processes, as there is obviously no COW benefit to be had. This is a big one, it enables Andi's workload to run well without clone balancing, because the OpenMP child threads can get balanced off to other nodes *before* they start running and allocating memory.

4         How fork works

Abridged excerpt of do_fork(); "/* ... */" marks omitted lines:

    long do_fork(unsigned long clone_flags,
                 unsigned long stack_start,
                 struct pt_regs *regs,
                 unsigned long stack_size,
                 int __user *parent_tidptr,
                 int __user *child_tidptr)
    {
        struct task_struct *p;
        int trace = 0;
        long pid = alloc_pidmap();    /* allocate a new pid */

        /* ... */
        p = copy_process(clone_flags, stack_start, regs, stack_size,
                         parent_tidptr, child_tidptr, pid);
        /*
         * Do this prior waking up the new thread - the thread pointer
         * might get invalid after that point, if the thread exits quickly.
         */
        if (!IS_ERR(p)) {
            struct completion vfork;

            /*
             * CLONE_VFORK means the caller used vfork(): the child must
             * run first, and the parent is woken only after the child
             * calls exec (or exits).
             */
            if (clone_flags & CLONE_VFORK) {
                p->vfork_done = &vfork;
                /* ... */
            }

            if ((p->ptrace & PT_PTRACED) || (clone_flags & CLONE_STOPPED)) {
                /*
                 * We'll start up with an immediate SIGSTOP.
                 */
                sigaddset(&p->pending.signal, SIGSTOP);
                /* flag the pending signal on the child so it stops at once */
                set_tsk_thread_flag(p, TIF_SIGPENDING);
            }

            if (!(clone_flags & CLONE_STOPPED))
                wake_up_new_task(p, clone_flags);
            else
                p->state = TASK_STOPPED;
            /* ... */

            if (clone_flags & CLONE_VFORK) {
                /* ... */
                if (unlikely (current->ptrace & PT_TRACE_VFORK_DONE))
                    ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP);
            }
        } else {
            /* ... */
            pid = PTR_ERR(p);
        }
        return pid;
    }



4.1   Fork() -> copy_process()


    /*
     * This creates a new process as a copy of the old one,
     * but does not actually start it yet.
     *
     * It copies the registers, and all the appropriate
     * parts of the process environment (as per the clone
     * flags). The actual kick-off is left to the caller.
     */
    static task_t *copy_process(unsigned long clone_flags,
                                unsigned long stack_start,
                                struct pt_regs *regs,
                                unsigned long stack_size,
                                int __user *parent_tidptr,
                                int __user *child_tidptr,
                                int pid)
    {
        int retval;
        struct task_struct *p = NULL;

        /* ... */
        p = dup_task_struct(current);
        /* ... */
        copy_flags(clone_flags, p);
        p->pid = pid;    /* record the pid of the new child */
        /* ... */
        if ((retval = copy_files(clone_flags, p)))
            goto bad_fork_cleanup_semundo;
        if ((retval = copy_fs(clone_flags, p)))
            goto bad_fork_cleanup_files;
        /* ... */
        if ((retval = copy_signal(clone_flags, p)))
            goto bad_fork_cleanup_sighand;
        if ((retval = copy_mm(clone_flags, p)))
            goto bad_fork_cleanup_signal;
        /* ... */
        retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
        if (retval)
            goto bad_fork_cleanup_namespace;
        /* ... */
        /* Perform scheduler related setup */
        /* ... */
        /*
         * Ok, make it visible to the rest of the system.
         * We dont wake it up yet.
         */
        p->real_parent = current;
        p->parent = p->real_parent;
        /* ... */
        retval = 0;
        /* ... */
    }

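5         2.6.9-2.6.23: child runs first

To avoid the unnecessary COW that results when the parent runs first (the child very often calls exec soon after the fork), Linux used to try to make the child run before the parent when creating a new process. See the section "The do_fork( ) function" in "Understanding the Linux Kernel":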

  1. If the CLONE_STOPPED flag is not set, it invokes the wake_up_new_task( ) function, which performs the following operations:
    1. Adjusts the scheduling parameters of both the parent and the child (see "The Scheduling Algorithm" in Chapter 7).
    2. If the child will run on the same CPU as the parent,[*] and parent and child do not share the same set of page tables (CLONE_VM flag cleared), it then forces the child to run before the parent by inserting it into the parent's runqueue right before the parent. This simple step yields better performance if the child flushes its address space and executes a new program right after the forking. If we let the parent run first, the Copy On Write mechanism would give rise to a series of unnecessary page duplications.

[*] The parent process might be moved on to another CPU while the kernel forks the new process.

  1. Otherwise, if the child will not be run on the same CPU as the parent, or if parent and child share the same set of page tables (CLONE_VM flag set), it inserts the child in the last position of the parent's runqueue.


Looking at the process-creation source of Linux 2.6.9 through 2.6.23, versions before 2.6.23 (exclusive) contain the following comment, in wake_up_new_task() in kernel/sched.c:
Fork() -> wake_up_new_task():

    /*
     * wake_up_new_task - wake up a newly created task for the first time.
     *
     * This function will do some initial scheduler statistics housekeeping
     * that must be done for every newly created context, then puts the task
     * on the runqueue and wakes it.
     */
    void fastcall wake_up_new_task(task_t * p, unsigned long clone_flags)
    {
        unsigned long flags;
        int this_cpu, cpu;
        runqueue_t *rq, *this_rq;

        rq = task_rq_lock(p, &flags);    /* lock the child's runqueue */
        cpu = task_cpu(p);               /* CPU the new task will run on */
        this_cpu = smp_processor_id();   /* CPU the parent is running on now */
        /* ... */

        if (likely(cpu == this_cpu)) {   /* child stays on the parent's CPU */
            if (!(clone_flags & CLONE_VM)) {
                /*
                 * The VM isn't cloned, so we're in a good position to
                 * do child-runs-first in anticipation of an exec. This
                 * usually avoids a lot of COW overhead.
                 */
                if (unlikely(!current->array))
                    __activate_task(p, rq);
                else {
                    /* queue the child right in front of the parent */
                    p->prio = current->prio;
                    list_add_tail(&p->run_list, &current->run_list);
                    p->array = current->array;
                    /* ... */
                }
                /* ... */
            } else
                /*
                 * Run child last: append it to the tail of the runqueue,
                 * so the parent is more likely to run first.
                 */
                __activate_task(p, rq);
            /*
             * We skip the following code due to cpu == this_cpu
             *
             *   task_rq_unlock(rq, &flags);
             *   this_rq = task_rq_lock(current, &flags);
             */
            this_rq = rq;
        } else {
            this_rq = cpu_rq(this_cpu);
            /* ... */
            __activate_task(p, rq);
            /* ... */
        }
        /* ... */
        task_rq_unlock(this_rq, &flags);    /* unlock */
    }


5.1      Advantages of running the child first


From:  "Adam J. Richter"

To:    torvalds@transmeta.com, linux-kernel@vger.kernel.org

Subject: PATCH(?): linux-2.4.4-pre2: fork should run child first

Date:  Thu, 12 Apr 2001 01:55:16 -0700

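The full message is at http://lwn.net/2001/0419/a/children-first.php3 . Its main points are quoted below:

most of the time, the child process from a fork will do just a few things and then do an exec(), releasing its copy-on-write references to the parent's pages.

Linux-2.4.3's fork() does not run the child first.

I have attached the patch below.  I have also adjusted the comment describing the code.

In other words, the author proposes running the child first in order to make better use of copy-on-write: a forked child almost always calls exec to load another program, so there is no point in doing extra virtual-memory work for it. If the parent runs first and writes to memory, for example, the written pages have to be copied.

Other material:

  1. A patch for an older kernel that makes the child run first:
  2. The 2.6.9 wake_up_new_task code: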


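6         2.6.23-2.6.32: child runs first

Starting with 2.6.23 the Linux process scheduler changed substantially: the earlier priority queues were replaced by the Completely Fair Scheduler (CFS), which tries to give every process a fair share of the CPU. CFS itself is not the focus here; for details see the introduction published when CFS was merged in Linux 2.6.23, which gives a rough idea of how it works:
http://kernelnewbies.org/Linux_2_6_23#head-f3a847a5aace97932f838027c93121321a6499e7

6.1      The new wake_up_new_task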



    /*
     * wake_up_new_task - wake up a newly created task for the first time.
     *
     * This function will do some initial scheduler statistics housekeeping
     * that must be done for every newly created context, then puts the task
     * on the runqueue and wakes it.
     */
    void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
    {
        unsigned long flags;
        struct rq *rq;

        rq = task_rq_lock(p, &flags);
        BUG_ON(p->state != TASK_RUNNING);
        /* ... */
        p->prio = effective_prio(p);

        if (!p->sched_class->task_new || !current->se.on_rq) {
            activate_task(rq, p, 0);    /* call activate_task directly */
        } else {
            /*
             * Let the scheduling class do new task startup
             * management (if any):
             */
            /*
             * task_new is a hook supplied by the scheduling class, which
             * shows how modular the new scheduler is: a class can plug in
             * its own policy. As written here, this branch is taken
             * whenever the class provides task_new and the parent is on
             * the runqueue; sysctl_sched_child_runs_first is not tested
             * at this point but inside task_new_fair() (section 6.3).
             * Otherwise activate_task() is called (section 6.2).
             */
            p->sched_class->task_new(rq, p);
            inc_nr_running(p, rq);
        }
        check_preempt_curr(rq, p);
        task_rq_unlock(rq, &flags);
    }


6.2      wake_up_new_task-> activate_task()

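wake_up_new_task reaches activate_task(rq, p, 0); in one of the following cases:

  1. the scheduling class does not provide a task_new hook;
  2. the current process (the parent) is not on the runqueue, i.e. its scheduling entity is not in the red-black tree (current->se.on_rq is 0).


    static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
    {
        /* ... */
        enqueue_task(rq, p, wakeup);    /* put the task into the class's red-black tree */
        inc_nr_running(p, rq);
    }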


6.3      wake_up_new_task-> task_new ()


    struct sched_class fair_sched_class __read_mostly = {
        .enqueue_task           = enqueue_task_fair,
        .dequeue_task           = dequeue_task_fair,
        .yield_task             = yield_task_fair,
        .check_preempt_curr     = check_preempt_curr_fair,
        .pick_next_task         = pick_next_task_fair,
        .put_prev_task          = put_prev_task_fair,
        .load_balance           = load_balance_fair,
        .set_curr_task          = set_curr_task_fair,
        .task_tick              = task_tick_fair,
        .task_new               = task_new_fair,    /* hook called for a newly created task */
    };


    /*
     * Share the fairness runtime between parent and child, thus the
     * total amount of pressure for CPU stays equal - new tasks
     * get a chance to run but frequent forkers are not allowed to
     * monopolize the CPU. Note: the parent runqueue is locked,
     * the child is not running yet.
     */
    static void task_new_fair(struct rq *rq, struct task_struct *p)
    {
        struct cfs_rq *cfs_rq = task_cfs_rq(p);
        struct sched_entity *se = &p->se, *curr = cfs_rq->curr;
        int this_cpu = smp_processor_id();

        /* ... */
        place_entity(cfs_rq, se, 1);    /* set the vruntime of the new task's entity */

        /* 'curr' will be NULL if the child belongs to a different group */
        if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
                curr && curr->vruntime < se->vruntime) {
            /*
             * Upon rescheduling, sched_class::put_prev_task() will place
             * 'current' within the tree based on its new key value.
             */
            swap(curr->vruntime, se->vruntime);
        }

        /* insert the child; vruntime decides its position in the tree */
        enqueue_task_fair(rq, p, 0);
        /* ... */
    }

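The swap(curr->vruntime, se->vruntime); above is the key step. It happens only when all of the following hold:

  1. sysctl_sched_child_runs_first is 1, meaning the child should run first;
  2. the new task's CPU is the current CPU, i.e. the parent's CPU; otherwise the swap is pointless;
  3. parent and child belong to the same group (curr is not NULL);
  4. the vruntime of the parent, i.e. the current process, is smaller than the vruntime of the new child.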

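The following enqueue_task_fair(rq, p, 0); then inserts the child into the red-black tree. The position it gets there decides which process runs first. The call chain is:

  1. do_fork()
  2. -> wake_up_new_task()   wakes the child
  3. -> task_new_fair()   the CFS hook for a newly created task
  4. -> enqueue_task_fair()   puts the task into the red-black tree
  5. -> enqueue_entity()   updates the statistics of the current task and calls __enqueue_entity
  6. -> __enqueue_entity()   inserts the entity at the right node according to its vruntime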


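    /*
     * Enqueue an entity into the rb-tree:
     */
    static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
    {
        struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
        struct rb_node *parent = NULL;
        struct sched_entity *entry;
        /*
         * key is the entity's vruntime relative to the smallest vruntime
         * of the queue: se->vruntime - cfs_rq->min_vruntime
         */
        s64 key = entity_key(cfs_rq, se);
        int leftmost = 1;

        /*
         * Find the right place in the rbtree:
         */
        while (*link) {
            parent = *link;
            entry = rb_entry(parent, struct sched_entity, run_node);
            /*
             * We dont care about collisions. Nodes with
             * the same key stay together.
             */
            if (key < entity_key(cfs_rq, entry)) {
                link = &parent->rb_left;     /* smaller key: go left */
            } else {
                link = &parent->rb_right;    /* equal or larger key: go right */
                leftmost = 0;
            }
        }

        /*
         * Maintain a cache of leftmost tree entries (it is frequently
         * used):
         */
        if (leftmost)    /* smallest key so far: remember it */
            cfs_rq->rb_leftmost = &se->run_node;

        rb_link_node(&se->run_node, parent, link);    /* link in at the slot found */
        rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);    /* rebalance the tree */
    }

Because ties go to the right, a new task whose vruntime equals that of an existing task becomes its right descendant, so the older task is scheduled first. One thing I do not understand: if parent and child have equal vruntime, the child ends up on the right and is scheduled later. Even with sysctl_sched_child_runs_first set, the test curr->vruntime < se->vruntime in task_new_fair fails in that case, so no swap happens and the child does not run first, which contradicts the expected behavior of sysctl_sched_child_runs_first = 1. I do not know whether this is a bug; if you do, please let me know.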


6.4      sysctl_sched_child_runs_first


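Whether the child actually runs first depends on three things:

  1. whether parent and child run on the same CPU;
  2. whether parent and child share memory, i.e. the CLONE_VM flag is set;
  3. whether sysctl_sched_child_runs_first is 1.

The default for the last one can be found at linux+v2.6.23/kernel/sched.c#L1663, in the following code:
    /*
     * After fork, child runs first. (default) If set to 0 then
     * parent will (try to) run first.
     */
    const_debug unsigned int sysctl_sched_child_runs_first = 1;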

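7         2.6.32 to present: parent runs first

Everything else is basically the same as in 2.6.23-2.6.32; only one variable changed, and that variable is sysctl_sched_child_runs_first. This time its default is 0, which means the parent runs first.
    /*
     * After fork, child runs first. If set to 0 (default) then
     * parent will (try to) run first.
     */
    unsigned int sysctl_sched_child_runs_first __read_mostly;

So the conclusion is that from 2.6.32 until now the parent runs first. Whether this has changed again after 2.6.32 I have not checked by hand.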


From: Ingo Molnar

Subject: [GIT PULL] sched/core for v2.6.32

Date: Friday, September 11, 2009 - 12:25 pm


Please pull the sched-core-for-linus git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git sched-core-for-linus


 - Child-runs-first is now off - i.e. we run parent first.

   [ Warning: this might trigger races in user-space. ]






From: Jesper Juhl

Subject: Re: [GIT PULL] sched/core for v2.6.32

Date: Friday, September 11, 2009 - 3:40 pm


Ouch. Do we dare do that?




From: Linus Torvalds

Subject: Re: [GIT PULL] sched/core for v2.6.32

Date: Friday, September 11, 2009 - 3:58 pm

We would want to at least try.

There are various reasons why we'd like to run the child first, ranging from just pure latency (quite often, the child is the one that is critical) to getting rid of page sharing for COW early thanks to execve etc.

But similarly, there are various reasons to run the parent first, like just the fact that we already have the state active in the TLB's and caches.

Finally, we've never made any guarantees, because the timeslice for the parent might be just about to end, so child-first vs parent-first is never a guarantee, it's always just a preference.

[ And we _have_ had that preference expose user-level bugs. Long long ago we hit some problem with child-runs-first and 'bash' being unhappy about a really low-cost and quick child process exiting even _before_ bash itself had had time to fill in the process tables, and then when the SIGCHLD handler ran bash said "I got a SIGCHLD for something I don't even know about".

 That was very much a bash bug, but it was a bash bug that forced us to [...]

vfork() has always run the child first, since the parent won't even be runnable. The parent will get stuck in [...]

so the "child-runs-first" is just an issue for regular fork or clone, not [...]

It really hasn't been that way in Linux. We've done it both ways.




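8         Conclusion

The analysis above shows that the question of which process runs first is a trade-off between the copy-on-write (COW) mechanism, the state already held in the TLB (translation lookaside buffer) and the caches, and which of the two processes is more important. This article only tries to explain the reasons in detail. It is not meant to give programmers a theoretical basis for relying on the parent or the child running first: we should never let our code depend on which process runs first, and should instead enforce the order with a proper inter-process communication mechanism, or use vfork. To close, a quote from Linus's email: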
Finally, we've never made any guarantees, because the timeslice for the parent might be just about to end, so child-first vs parent-first is never a guarantee, it's always just a preference.

9         Attachments and references

  1. CFS: http://www.ibm.com/developerworks/linux/library/l-cfs/
  2. openSUSE System Analysis and Tuning Guide:
  3. Various scheduler-related topics: http://lwn.net/Articles/352863/
  4. Linux 2.6.9 ChangeLog:
  5. Kernel mailing list archives:
  6. Release notes for major Linux versions:
  7. Linux source code: http://www.kernel.org/pub/linux/kernel/v2.6/
  8. A convenient Linux source browser: http://lxr.linux.no/linux+v2.6.32

