js8 3 days ago

I am not sure why you would want to do this. If I take the example of the bottling pipeline, it seems to me that the best way to process the bottles on an SMP system is to have multiple threads where each thread does all the work for a single bottle, sequentially.

Maybe, if the processes at each stage are I/O-bound, it might make sense. But if they are CPU-bound, I am not sure this way of pipelining helps - you're moving data between different CPUs, destroying cache locality.
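
The two shapes, as a minimal Go sketch (the stage functions and types are placeholders of my own, not anything from the article):

    package bottling

    import "sync"

    type Bottle struct{ id int }

    // placeholder stage work
    func fill(b *Bottle)   {}
    func putCap(b *Bottle) {}
    func label(b *Bottle)  {}

    // Thread-per-bottle: each worker takes one bottle through every
    // stage, so the bottle's data never leaves that CPU's cache.
    func perBottle(bottles []Bottle, workers int) {
        in := make(chan *Bottle)
        var wg sync.WaitGroup
        for w := 0; w < workers; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for b := range in {
                    fill(b)
                    putCap(b)
                    label(b)
                }
            }()
        }
        for i := range bottles {
            in <- &bottles[i]
        }
        close(in)
        wg.Wait()
    }

    // Stage-per-goroutine pipeline: each bottle hops between CPUs
    // (and caches) at every stage boundary.
    func pipelined(bottles []Bottle) {
        filled := make(chan *Bottle)
        capped := make(chan *Bottle)
        done := make(chan struct{})
        go func() {
            for b := range filled {
                putCap(b)
                capped <- b
            }
            close(capped)
        }()
        go func() {
            for b := range capped {
                label(b)
            }
            close(done)
        }()
        for i := range bottles {
            fill(&bottles[i])
            filled <- &bottles[i]
        }
        close(filled)
        <-done
    }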

  • adrianmonk 3 days ago

    I was about to suggest the same thing, and I still think it makes a lot of sense.

    However, the cache locality thing is complicated. Each bottle has data associated with it, but each processing stage might also have data. For example, maybe a particular stage uses a lookup table. Or maybe stages keep statistics as they process bottles.

    If you have one CPU doing all the work for a particular stage, then per-stage data stays in its cache. But if you have one CPU doing all the work for a particular bottle, then all the per-bottle data stays in its cache. So there's a trade-off that depends on the specifics.
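
    For instance (a toy sketch; the types and fields are invented for illustration):

        package labeling

        type Bottle struct{ sku, labelText string }

        // Per-stage state: pin this stage to one CPU and lookupTable
        // and processed stay hot in its cache. Pin each bottle to one
        // CPU instead, and the bottle's fields stay hot while this
        // table gets bounced between caches.
        type LabelStage struct {
            lookupTable map[string]string // shared across all bottles
            processed   int               // running per-stage stats
        }

        func (s *LabelStage) run(in <-chan *Bottle) {
            for b := range in {
                b.labelText = s.lookupTable[b.sku]
                s.processed++
            }
        }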

    • kitd 3 days ago

      More importantly, each processing stage may have external resources it needs to manage to do its function. In many cases, it makes more sense to let the stage own the resource for all bottles, than to have a per-bottle process having to contend for the resource at the point it's needed.
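
      A sketch of the difference (LabelPrinter is an invented stand-in for whatever external resource the stage manages):

          package labeler

          import "sync"

          type Bottle struct{ sku string }

          // LabelPrinter stands in for an external resource.
          type LabelPrinter struct{}

          func (p *LabelPrinter) Print(b *Bottle) {}

          // Stage owns the resource: one goroutine has exclusive use
          // of the printer by construction; bottles just queue for it.
          func stageOwns(printer *LabelPrinter, in <-chan *Bottle) {
              for b := range in {
                  printer.Print(b)
              }
          }

          // Per-bottle alternative: every worker contends for the
          // printer at exactly the moment it needs it.
          func perBottle(printer *LabelPrinter, mu *sync.Mutex, b *Bottle) {
              mu.Lock()
              defer mu.Unlock()
              printer.Print(b)
          }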

      • hinkley 3 days ago

        You get very different Little’s Law shapes from the two, for sure.

        Who here hasn’t bought themselves some time from vertically or horizontally scaling a database by hoisting a bunch of calculations up and out of the transaction? Do them before or after the constraint.

        The farther you have to reach to coordinate use of the constraint, the better it might be to hand the inputs to the constraint and let an expert handle things.
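
        For reference, Little’s Law itself (standard form):

            % Little's Law: mean number of items in the system equals
            % arrival rate times mean time spent in the system.
            L = \lambda W
            % Hoisting calculations out of the transaction shrinks W at
            % the constraint, so the queue there (L) shrinks for the
            % same arrival rate \lambda.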

  • anonymoushn 3 days ago

    You probably gain some ILP or avoid some pipeline stalls by handling a lot of bottles at a time rather than one.
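
    E.g. (a toy sketch; the weight check is invented):

        package inspect

        // inspectBatch sweeps a whole batch in one tight loop. The
        // iterations are independent, so the core can keep several in
        // flight at once, instead of paying per-bottle call overhead
        // and refilling its pipeline for each one.
        func inspectBatch(weights []float32, lo, hi float32) (bad int) {
            for _, w := range weights {
                if w < lo || w > hi {
                    bad++
                }
            }
            return bad
        }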

    • hinkley 3 days ago

      Particularly if you’re doing collision detection against static sets of data: filter this data for entries that look this way, or don’t look that way.

shermantanktop 3 days ago

I’ve done something like this in large service ecosystems. It has two very unfortunate properties:

- if the actual performance deviates from the predicted (scored) performance, the system easily enters a degenerate bottlenecked state.

- and if that happens, the many internal queues make diagnosis, root causing, and confidence in a fix all exponentially worse.

Now you might assert that this will be applied in situations where scores are accurate and brown failures do not occur. Those aren’t the situations I deal with.

  • Joker_vD 3 days ago

    The algorithm in TFA, as I understand it, requires the scheduler to constantly monitor the queue lengths and reschedule the workers appropriately, as opposed to manually estimating the required number of workers for each stage and then never changing it.
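
    Something in the spirit of (a toy version of that monitoring loop, not the article's actual code):

        package sched

        // rebalance hands a fixed worker budget out in proportion to
        // current queue lengths, instead of fixing per-stage worker
        // counts up front and never revisiting them.
        func rebalance(queues []int, workers int) []int {
            alloc := make([]int, len(queues))
            total := 0
            for _, q := range queues {
                total += q
            }
            if total == 0 {
                return alloc
            }
            used := 0
            for i, q := range queues {
                alloc[i] = workers * q / total // floor of fair share
                used += alloc[i]
            }
            longest := 0 // rounding leftovers go to the longest queue
            for i, q := range queues {
                if q > queues[longest] {
                    longest = i
                }
            }
            alloc[longest] += workers - used
            return alloc
        }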

    • shermantanktop 3 days ago

      This is what happens with PID controllers. A PID controller is by definition a trailing view, and a single catastrophically bad request can throw the whole system off before the queue monitoring can "catch up."
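
      For reference, the shape of the thing (a textbook PID update, nothing scheduler-specific):

          package control

          // PID computes a correction from error that has already
          // shown up in the measurement - which is exactly why it is
          // a trailing view of the system.
          type PID struct {
              Kp, Ki, Kd float64
              integral   float64
              prevErr    float64
          }

          func (c *PID) Update(setpoint, measured, dt float64) float64 {
              err := setpoint - measured
              c.integral += err * dt
              deriv := (err - c.prevErr) / dt
              c.prevErr = err
              return c.Kp*err + c.Ki*c.integral + c.Kd*deriv
          }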

      • hinkley 3 days ago

        It’s possible to write a scheduler that looks like work stealing and can route unordered tasks around a poison pill.

        But we’ve generally found it works better for the worker to pull when its queue is empty, and of course if you have partially ordered events nobody can process (start or reject) those ordered tasks until the borked one finishes or times out.

        So the situation you describe can absolutely happen, but is not a given. It’ll depend on other decisions.

mindcrime 3 days ago

> As I came up with this idea of scheduling described above, I bounced it off my friend Daniel Gustafsson who immediately replied “this reminds me a bit of Jefferson’s method” (of allocating seats in parliaments).

Sounds like the author of this would be interested in Queueing Theory[1] (in the sense of being interested in mathematical formalisms to explore this stuff). Apportionment[2] is also studied as a very specific thing unto itself.

There's a huge mass of published research "out there" dealing with queueing and scheduling. Not all of it pertains to "thread scheduling" but there's quite a bit of conceptual overlap between something like thread scheduling and job floor scheduling. And some of the stuff on apportionment likewise probably relates at least by analogy.

[1]: https://en.wikipedia.org/wiki/Queueing_theory

[2]: https://en.wikipedia.org/wiki/Mathematics_of_apportionment
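
For the curious, Jefferson's method (equivalently, the D'Hondt method) is short enough to sketch; read votes as queue lengths and seats as workers and you get an allocation rule of the kind the quote describes:

    package apportion

    // jefferson awards seats one at a time, each to the party with
    // the highest quotient votes/(seats already won + 1).
    func jefferson(votes []int, seats int) []int {
        alloc := make([]int, len(votes))
        for s := 0; s < seats; s++ {
            best, bestQ := 0, -1.0
            for i, v := range votes {
                if q := float64(v) / float64(alloc[i]+1); q > bestQ {
                    best, bestQ = i, q
                }
            }
            alloc[best]++
        }
        return alloc
    }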

jerf 3 days ago

This also contains a hidden assumption that reallocation from one task to another is very, very expensive, so it's necessary to plan the distribution of resources well in advance. This is clearly true for a physical production line. But if you are in the very common circumstance in the computer world where the switching time is significantly less than the amount of progress that can be made on a given task in some time, then the optimum scenario is just to do work as it becomes available, and in many cases almost any scheduling algorithm that isn't deliberately constructed for maximum pathology will ensure that significant progress is made.

That's why in the vast majority of circumstances you'll be running many things on many CPUs, you just throw all the work at the CPUs and let the chips fall where they may. Deliberate scheduling is a tool, but an unusual one, especially as many times the correct solution to tight scheduling situations is to throw more resources at it anyhow. (Trying to eke out wins by changing your scheduling implies that you're also in a situation where slight increases in workloads will back the entire system up no matter what you do.)
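
In code, the let-the-chips-fall strategy is just a shared queue (a minimal Go sketch):

    package pool

    import "sync"

    // runAll throws every task at a fixed set of workers and lets
    // each worker pull more work as it becomes free - no per-stage
    // planning at all.
    func runAll(tasks []func(), workers int) {
        work := make(chan func())
        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for t := range work {
                    t()
                }
            }()
        }
        for _, t := range tasks {
            work <- t
        }
        close(work)
        wg.Wait()
    }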

  • hinkley 3 days ago

    Even with lambdas you want a bit of server affinity because setup is nontrivial. If the job absolutely needs to get done we can get it done. But the opportunity cost is high and we can’t build a business around it. You want to prefer the warm path over the cold one.

    • jerf 2 days ago

      "if you are in the very common circumstance in the computer world where the switching time is significantly less than the amount of progress than can be made on a given task in some time, then the optimum scenario is just to do work as it becomes available"

      ... and if you aren't, then clearly, by the way "if-then" statements work, the "then" clause doesn't apply.

limit499karma 3 days ago

I do not believe 'pipelining' and parallelism are interchangeable models, and conflating them is a mistake. For example, consider a parallel processing system that works strictly with a 'pipeline' of length 0: there is a hand-off from the input to a processing stage, then processing of that input. You can have n such parallel processing stages, and voila, 'parallelism'.

Pipelines are strictly processing stages where the 'production of the input' and the processing of the inputs are not synchronized. For example, one sends n requests via a pipeline protocol to a remote server without waiting for an ack for each input from the server. There may be only one such processing pipeline (and thus no parallelism) while there is pipelining.
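
Concretely, something like this (a sketch of protocol pipelining over one connection; the newline-delimited wire format is invented):

    package wire

    import (
        "bufio"
        "fmt"
        "net"
    )

    // pipelineRequests sends every request back-to-back without
    // waiting for individual acks, then reads the replies in order.
    // One connection, one server-side processor: pipelining, but no
    // parallelism.
    func pipelineRequests(conn net.Conn, reqs []string) ([]string, error) {
        for _, r := range reqs {
            if _, err := fmt.Fprintln(conn, r); err != nil {
                return nil, err
            }
        }
        replies := make([]string, 0, len(reqs))
        sc := bufio.NewScanner(conn)
        for range reqs {
            if !sc.Scan() {
                return nil, sc.Err()
            }
            replies = append(replies, sc.Text())
        }
        return replies, nil
    }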

  • bee_rider 3 days ago

    I don’t 100% follow your comment, so sorry if this is not quite right.

    But, I would consider pipelining to be a form of parallelism. It breaks up a task so that parts of it can be run simultaneously (in different stages of the pipeline). There are other ways to do parallelism of course, but it is one way.

    In your example, if there are multiple pipeline stages in this server, then the tasks should be worked on simultaneously, and so parallelism is occurring.

    Multi-core, SIMD, and pipelining. Parallelism has multiple dimensions.

    • spc476 3 days ago

      Not quite. Going back to the bottling example, a bottle has to be filled, capped, then labeled. At time 1, bottle 1 is filling up; the pipeline is advanced. At time 2, bottle 2 is filling up and bottle 1 is being capped; the pipeline is advanced. At time 3, bottle 3 is filling up, bottle 2 is being capped, and bottle 1 is being labeled. At time 4, bottle 1 is done. Each bottle has to go through the pipeline in sequence, and it takes three units of time for a bottle to go through the pipeline. Yes, once the pipeline is filled up, you have the three operations going on at the same time, but it's still a sequence of steps that need to be performed in order for any given bottle.

      To make it faster, you either have to decrease the time for a step (but you will always be capped by the slowest step), or go for parallelism---a separate pipeline (or pipelines). For example, with one pipeline, each step taking 1 unit of time, once the pipeline is filled, it will take six units of time to make a six-pack (the first will take longer due to the latency in filling the pipeline). You can make five other pipelines, and then get a six-pack per unit of time (again, after filling all the pipelines).

      A single pipeline just makes the output have a predictable latency and time; multiple pipelines give you parallelism.

      • Jtsummers 3 days ago

        > To make it faster, you either have to decrease the time for a step (but you will always be capped with the slowest step), or go for parallelism---a separate pipeline (or pipelines).

        > A single pipeline just makes the output have a predictable latency and time; multiple pipelines give you parallelism.

        No, the pipeline does give you parallelism because you're doing three (in the bottling example) pieces of work simultaneously, that is: in parallel. Filling, capping, labeling are each being done on different bottles at the same time. How is that not parallelism?

        Let's use some numbers:

        Filling, capping, and labeling take 30s, 15s, and 15s respectively (arbitrary, chosen for easy math). Without a pipeline you will process 60 bottles per hour (say, one station that does all 3 tasks and then kicks out the bottle; let's ignore transition times). With a pipeline you can get 120 per hour. You've improved throughput. The latency per bottle is still 60s.

        BTW, you can double throughput again without even needing a full second pipeline just by adding a second filling station (with my example numbers) and feeding filled bottles to the same capping and labeling stages, getting you to 240 per hour.
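
        The general rule behind those numbers (steady state, ignoring transfer times):

            % latency is the sum of the stage times; throughput is set
            % by the slowest stage
            \text{latency} = \textstyle\sum_i t_i = 30 + 15 + 15 = 60\,\text{s}
            \text{throughput} = 1 / \max_i t_i = 1 / 30\,\text{s} = 120\ \text{bottles/hr}
            % replicate the slowest stage k times and its effective
            % time drops to t_i / k: two fillers give 1 / 15s = 240/hr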

      • hinkley 3 days ago

        In modern bottling they fill a bunch of bottles at the same time, and in some cases (How It’s Made) the bottling or pastry pouring or cookie cutting equipment moves with the conveyor so it never has to stop.

        But that’s new equipment; in some past version of the bottling plant, filling the bottle is still the slow step. You’re still going to fill three, four, six bottles at a time, then cap them, but the labeling machine might just be serial.

        The way those machines work is also pipelined - add some space between the bottles, slap a label on, use rollers to make the label stick.

        Then you load the cases n bottles at a time, not via pick and place.

        Several of those steps could feed in from one machine to four and back to two and then to one.

        More equipment means more maintenance, but it also lets you take part of the system offline while keeping output from cratering.

      • yazzku 3 days ago

        Instruction pipelining is literally a form of instruction-level parallelism. To argue it's not parallelism is a weird take.

        https://en.wikipedia.org/wiki/Instruction_pipelining

        Your argument basically amounts to forming an arbitrary definition of "unit" that consists of the whole thing so that then you can state that there is no parallelism. By that token, an 8-core processor has no parallelism because it can only execute a single package of 8 threads at any given time.

      • bee_rider 3 days ago

        I think you are using an unconventional definition of the word parallelism, if you want to exclude

        > Yes, once the pipeline is filled up, you have the three operations going on at the same time, but it's still a sequence of steps that need to be performed in order for any given bottle.

        as not parallel.

        It seems to me that what you are calling parallelism, most people would instead call homogeneous parallelism, which is a subset of parallelism.

      • Joker_vD 3 days ago

        > For example, with one pipeline, each step taking 1 unit of time, once the pipeline is filled, will take six units of time to make a six-pack

        Compared to 18 units of time needed to make a six-pack without pipelining. Gee, what a wondrous invention this "pipeline" is: having three workers means the work is accomplished thrice as fast, yet there is (according to you) no parallelism at all! So naturally, if we could introduce parallelism inside this single pipeline, we would be able to make another triple reduction in time, and get the production of a six-pack down to only 2 units of time.

  • Joker_vD 3 days ago

    This is your own personal definition of "pipelining", which most other people wouldn't subscribe to.

    And to pick a nit, there is always some synchronization between the submission of an input and the processing of that input: the submission must happen before the processing, unfortunately.

pwdisswordfishz 3 days ago

"Those who sacrifice freedom for memory safety deserve neither" or something?

bdcravens 3 days ago

Jefferson said we should rewrite the constitution every 19 years, so I assumed that means if threads haven't completed in some amount of time, to just blow them all away and start over lol