
Second child process doesn't receive data when using the stdout of first child process as stdin of second #9413

Closed
aalexgabi opened this issue Nov 2, 2016 · 20 comments · Fixed by #21209
Labels
child_process - Issues and PRs related to the child_process subsystem.
feature request - Issues that request new features to be added to Node.js.

Comments

@aalexgabi

  • Version: v0.12.9 - v7.0.0
  • Platform: Linux ${edited hostname} 4.4.0-45-generic #66~14.04.1-Ubuntu SMP Wed Oct 19 15:05:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Also posted here: http://stackoverflow.com/questions/40306385/missing-lines-when-using-the-stdout-of-a-child-process-as-stdin-of-another

When using the stdout of one child process as stdin for another, it seems that sometimes data is not passed to the next child:

var spawn = require('child_process').spawn;

var pipeId = 0;
var pipeSlots = 6;

var launchProcess = function(cmd, args, stdin, stdout){
  return spawn(cmd, args, {
    stdio: [stdin, stdout, 'ignore']
  });
};

var launch = function(){
  var task0 = launchProcess('echo', ['how\nare\nyou\ndear\nstranger'], 'ignore', 'pipe');
  var task1 = launchProcess('tee', ['/tmp/body-pipeline-' + pipeId], task0.stdout, 'ignore');

  task0.on('error', function(err){
    console.log('Error while processing task0:' + err.stack);
  });
  task1.on('error', function(err){
    console.log('Error while processing task1:' + err.stack);
  });

  pipeId++;
};

// Simulating message queue
setInterval(function(){
  // Simulating how many messages we get from the messaging queue
  var mqMessageCount = Math.floor(Math.random() * (pipeSlots + 1));

  for(var i = 0; i < mqMessageCount; i++){
    launch();
  }
}, 250); // For this test we assume that pipes finish under 250ms

Some files are empty:

ls -lhS /tmp/body-pipeline-*

FYI: task0.stdout.pipe(task1.stdin) solves the issue but the script uses 50% CPU (compared to 0% when passing stdout of task0 as stdin of task1) for the equivalent of yes | tee /tmp/body-pipeline-x

@mscdex mscdex added the child_process Issues and PRs related to the child_process subsystem. label Nov 2, 2016
@bnoordhuis
Member

You are handing off task0.stdout to the second child process, but because task0.stdout starts in flowing mode and also exists in the parent process, data can end up in either process.

It might be possible to work around it if you create the first child's stdout pipe manually in paused mode, but that's probably difficult to do without groping around in node internals.

Perhaps spawn() needs to be taught a pauseOnCreate option, similar to new net.Socket({ handle: handle, pauseOnCreate: true }). Note that it may not be the best possible fix, just the first one that comes to mind.
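For illustration, a rough sketch of what such an option could look like from userland; note that pauseOnCreate is only the proposal above and not an existing child_process option:

var spawn = require('child_process').spawn;

// Hypothetical option, mirroring new net.Socket({ handle: handle, pauseOnCreate: true }):
var task0 = spawn('echo', ['hello'], {
  stdio: ['ignore', 'pipe', 'ignore'],
  pauseOnCreate: true // would keep task0.stdout paused until it is handed to the next child
});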

@addaleax
Member

addaleax commented Nov 2, 2016

/cc @targos fyi this seems to be the right answer to nodejs/help#324, too

@mscdex
Contributor

mscdex commented Nov 2, 2016

The problem has to do with task0's stdout stream buffering all of the output before task1 has a chance to start and hook up to task0's stdout. It's racy due to timing/scheduling of the child processes, which is why it does not happen all of the time.

EDIT: Oops, didn't see the comments before this. I was just about to say what @bnoordhuis said. It works if pauseOnCreate: true is passed (internally) and you manually set task0.stdout.readable = false in the userland code in order for flushStdio() to not call stream.resume() which starts buffering data on the next tick.

@mscdex mscdex added the feature request Issues that request new features to be added to Node.js. label Nov 2, 2016
@aalexgabi
Author

Is there a workaround that doesn't involve task0.stdout.pipe(task1.stdin)? I tried task0.stdout.readable = false but I don't know how to pass pauseOnCreate: true.

@bnoordhuis How do I manually create a pipe in non-flowing mode?

@bnoordhuis
Member

Is there a workaround that doesn't involve task0.stdout.pipe(task1.stdin)?

Not at the moment, no. Not without monkeying around with node's internals.

@aalexgabi
Author

Is there a way to use task0.stdout.pipe(task1.stdin) without passing all the data through NodeJS VM? I'm thinking of something like:

  var task0 = launchProcess('echo', ['how\nare\nyou\ndear\nstranger'], 'ignore', 'pipe');
  var fakePipe = new net.Socket({fd: task0.stdout._handle.fd})
  var task1 = launchProcess('tee', ['/tmp/body-pipeline-' + pipeId], fakePipe, 'ignore');

How would you go about monkey patching node's internals? Temporarily replacing net.createSocket with a version forcing readable to false?

I'm using NodeJS for an export runner that launches pipelines of scripts to process lots of data and upload it at the end, so I would really like to find a way of doing this without the runner having very high CPU usage.

@targos
Member

targos commented Nov 5, 2016

Thanks @addaleax, it seems to be the same issue indeed!

What I don't understand here is that it happens even when the two calls to spawn are made in the same tick. Is task0.stdout somehow receiving data synchronously?

@bnoordhuis
Member

The file descriptor of task0.stdout exists simultaneously in the parent and the child. Calling task0.stdout.destroy() or task0.stdout.pause() in the parent right after spawning task1 might work around the issue for now.
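A minimal sketch of that workaround applied to the launch() function from the original report (destroy() is the variant later confirmed to work; pause() turned out not to be enough):

var task0 = launchProcess('echo', ['how\nare\nyou\ndear\nstranger'], 'ignore', 'pipe');
var task1 = launchProcess('tee', ['/tmp/body-pipeline-' + pipeId], task0.stdout, 'ignore');

// Drop the parent's copy of the read end right after spawning task1,
// so the parent no longer competes with task1 for the data.
task0.stdout.destroy();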

@addaleax
Member

addaleax commented Nov 5, 2016

Calling task0.stdout.destroy() or task0.stdout.pause() in the parent right after spawning task1 might work around the issue for now.

I haven’t looked at it in detail, so ignore me if this doesn’t make sense, but is there any reason why the second spawn call shouldn’t do that itself when receiving task0.stdout?

@targos
Member

targos commented Nov 6, 2016

Calling task0.stdout.destroy() or task0.stdout.pause() in the parent right after spawning task1 might work around the issue for now.

Just tried both and it works with destroy but not with pause.

@aalexgabi
Author

A colleague of mine had a great idea that fixed this:

  var task1 = launchProcess('tee', ['/tmp/body-pipeline-' + pipeId], 'pipe', 'ignore');
  var task0 = launchProcess('echo', ['how\nare\nyou\ndear\nstranger'], 'ignore', task1.stdin);

The idea is to launch the receiving task before launching the sending task!
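For completeness, a sketch of the reordered launch() with the error handlers from the original snippet:

var launch = function(){
  // Spawn the consumer first so its stdin pipe already exists...
  var task1 = launchProcess('tee', ['/tmp/body-pipeline-' + pipeId], 'pipe', 'ignore');
  // ...then hand that pipe to the producer as its stdout.
  var task0 = launchProcess('echo', ['how\nare\nyou\ndear\nstranger'], 'ignore', task1.stdin);

  task0.on('error', function(err){
    console.log('Error while processing task0:' + err.stack);
  });
  task1.on('error', function(err){
    console.log('Error while processing task1:' + err.stack);
  });

  pipeId++;
};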

@masaeedu

masaeedu commented Jul 13, 2017

@bnoordhuis Is there a way to create a process with a predefined WritableStream for stdout? I mean the API itself seemed racy when I started using it, because I don't want to start a process and then hook up something to receive stdout. I can't see how to provide a stdout receiver before starting the process with the existing streams API, though.

@gireeshpunathil
Member

@aalexgabi - is this still an outstanding issue for you?

@aalexgabi
Author

@gireeshpunathil Yes. A lot of CPU time is wasted because I had to pass all data through the NodeJS process instead of letting the two sub-processes pipe it directly. Changing the algorithm so that the receiving task is launched before the sending task, as pointed out in my last comment, was too complicated for a workaround in the end.

@gireeshpunathil
Member

cross link #18016 - both cases are root caused by same issue.

@gireeshpunathil
Member

gireeshpunathil commented May 20, 2018

It looks like this was suggested earlier, but there is no evidence whether it was attempted: can you please try with:

task0.stdout.pipe(task1.stdin);

It is reliable and simple.
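For reference, a sketch of the pipe()-based wiring with the launchProcess helper from the original report; both stdio slots have to be 'pipe' so the streams exist in the parent:

var task0 = launchProcess('echo', ['how\nare\nyou\ndear\nstranger'], 'ignore', 'pipe');
var task1 = launchProcess('tee', ['/tmp/body-pipeline-' + pipeId], 'pipe', 'ignore');

// Every chunk is read into the parent and written back out,
// which is where the extra CPU usage mentioned below comes from.
task0.stdout.pipe(task1.stdin);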

@aalexgabi
Author

@gireeshpunathil The problem does not arise when using task0.stdout.pipe(task1.stdin), but it has the major downside of passing all data through the NodeJS VM, which causes very high CPU usage.

@gireeshpunathil
Member

@aalexgabi - if your main motivation is to keep node from being a middleman between task0 and task1, why don't you combine them and run them under a single child process?
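A sketch of that suggestion, assuming the steps can be joined into a single shell command line (the file names here are made up for illustration):

var spawn = require('child_process').spawn;

// Run the whole pipeline inside one child; node only supervises it
// and never touches the data.
var pipeline = spawn('sh', ['-c', 'cat myFile | grep -v myFilter | tee backupFile'], {
  stdio: ['ignore', 'ignore', 'ignore']
});

pipeline.on('exit', function(code){
  console.log('pipeline exited with code ' + code);
});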

I was under the impression that you want node to have fine-grained control over both processes, and want them not to be dependent on each other - in which case the data comes through Node anyway, and the additional burden of piping does not add too much load.

Thinking further on it (between yesterday and today) I believe the issue is neither in your code nor in Node's code. There is an unspecified design at work: when we pass stdin for task1 we specify the stdout of task0, with the expectation that all of the stdin data will be routed to task1, but in reality we are cloning stdin and passing it to task1. What is unspecified is what to do with the original stdin - should it still function, or should it be closed?

If we force-close the original stdin, any other parts of the program that use it (you don't have any) may malfunction.

If we leave it as it is, the behavior you found will occur - data will branch out into two destinations.

Under default conditions, I guess the data should be made available in both - following this example:

#echo hello |tee 1.txt
hello
#cat 1.txt 
hello
#

@aalexgabi
Author

@gireeshpunathil My application is a task runner which is agnostic to what tasks it executes. I have a RabbitMQ queue called task_run_requests. Each task_run_request has a type and some parameters. The task type has a corresponding entry in a configuration file which looks like this:

{
...
  "someFilteringTaskThatUploadsResult": {
    "steps": [
      {
        "command": "cat",
        "args": [ "${request.filename}" ]
      },
      {
        "command": "grep",
        "args": [ "-v", "${request.filter}" ]
      },
      {
        "command": "tee",
        "args": [ "${request.backupFile}" ]
      },
      {
        "command": "ftpupload",
        "args": [ "-p", "${request.password}", "-u", "${request.username}" ]
      }
    ]
  }
...
}

All the commands the runner executes are just Linux programs. Most of them are atomic functions (generate file, transform file, merge files, split files etc.). Only the task type assembles different tasks into one pipeline. I cannot write a script for each combination because that would defeat the purpose of having a generic runner in the first place. The idea is to have a system where you can mix and match system and custom commands by writing configuration and not code, except when it's necessary. Some of these tasks process gigabytes of data and can run for up to 12 hours. Passing all the data through the Node VM serves no purpose because the runner does not touch the data but only orchestrates the execution of the tasks.

There is a need for granular control of tasks in case of a failure. In the event of one task failing, we need to kill all other tasks, store the exit code of each task in the database, cancel a pipeline on user request etc. Some tasks demand lots of resources, so for example we may want to respawn only one child and not the entire pipeline in case of a failure.

I'm sorry but I don't think I understand the second part of your comment. You are talking about cloning stdin and I don't understand why. I will try to explain how I understand the process spawning and the stdin and stdout handling by analogy with bash.

When you want to execute a pipeline like cat myFile | tee backupFile | grep -v myFilter, bash will create three fds for stdin, stdout and stderr for the first process and spawn it. Because there is no consumer on the stdout of cat myFile, the write call inside this process will block once the fd buffer is full, until something reads from its stdout. Bash will only create two file descriptors for tee backupFile because it will pass it exactly the same file descriptor as the stdout of the first command. When the second command reads its stdin, it will unblock the write from the first command. The same will happen for the last command. With this design no data can be lost, because as long as nobody reads the stdout file descriptor, the write calls inside the process that outputs data will block once the write buffer is full.

My understanding may not be accurate but this is how I imagine it and how I expect NodeJS to handle it. I may not be correct, but it seems to me that for some reason the stdout of the first process is either a black hole like /dev/null until the second process is spawned, or it's being read from by NodeJS, or something.

Note that I don't need to be able to handle any data events on the file descriptors between spawned processes, so I don't expect them to be NodeJS streams but simple integers that identify file descriptors.
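As an illustration of that expectation, here is a sketch of wiring a configured pipeline so the children share the pipes directly, using the reverse-order trick from the earlier workaround (each step's stdin stream is handed to the previous step the same way task1.stdin was above); the step list is a made-up stand-in for one entry of the configuration example:

var spawn = require('child_process').spawn;

// Hypothetical steps, standing in for one entry of the configuration file.
var steps = [
  { command: 'cat',  args: ['input.txt'] },
  { command: 'grep', args: ['-v', 'filtered'] },
  { command: 'tee',  args: ['backup.txt'] }
];

// Spawn the last step first, then walk backwards so each step's stdout
// stdio slot is the stdin pipe of the step that follows it. The parent
// keeps the process handles for control, but the data never enters it.
var downstreamStdin = 'ignore'; // the last step's output is discarded here
var children = [];
for (var i = steps.length - 1; i >= 0; i--) {
  var child = spawn(steps[i].command, steps[i].args, {
    stdio: [i === 0 ? 'ignore' : 'pipe', downstreamStdin, 'ignore']
  });
  children.unshift(child);
  downstreamStdin = child.stdin; // becomes the stdout target of the previous step
}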

@gireeshpunathil
Member

@aalexgabi - thanks for the detailed explanation.

  • I understand your use case now, thanks.
  • I agree I fully messed up my second para. I am re-writing it like this with a better mindset:

There is an unspecified design at work: when we pass stdout of task0 as the stdin for task1, the expectation is that all of task0::stdout is passed to task1::stdin. But the reality is that we are making a new pipe that contains task0::stdout at one end and task1::stdin at the other. The original copy of task0::stdout is still open and available to the parent. What is unspecified is what to do with the original task0::stdout - should it still function, or should it be closed?

Of course, this explanation does not provide any relief to your scenario; it just explains the current design.

While I am not sure about your reasoning around how bash manages this, the outcome of it is what matters - once the piping is completed, bash is not involved in the data flow.

However, bash has information upfront about which processes are coming for execution and how they want to be connected, so it can do the connectivity precisely, in the optimal manner.

Our Node program does not have that info, so every child process spawn is independent and creates new pipes with one end at the child and the other at the parent, i.e., node.

It becomes manual work to close out / re-pipe the open ends of the pipe under the current design.

I will look into the code / engage others to see if there are any improvements possible here without causing side effects to the otherwise stable code.

gireeshpunathil added a commit to gireeshpunathil/node that referenced this issue Feb 6, 2019
when t0 and t1 are spawned with t0's outputstream [1, 2] is piped into
t1's input, a new pipe is created which uses a copy of the t0's fd.
This leaves the original copy in Node parent, unattended. Net result is
that when t0 produces data, it gets bifurcated into both the copies

Detect the passed handle to be of 'wrap' type and close after the
native spawn invocation by which time piping would have been over.

Fixes: nodejs#9413
Fixes: nodejs#18016

PR-URL: nodejs#21209
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
addaleax pushed a commit that referenced this issue Feb 6, 2019