{{Short description|Data processing chain}}
{{Multiple issues|
{{More citations needed|date=September 2019}}
{{Lead too short|date=July 2024}}
}}
In [[computing]], a '''pipeline''', also known as a '''data pipeline''', is a set of [[data]] processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of [[buffer (computer science)|buffer storage]] is often inserted between elements.
== Concept and motivation ==
Pipelining is a commonly used concept in everyday life. For example, in the [[assembly line]] of a car factory, each specific task, such as installing the engine, installing the hood, and installing the wheels, is often done by a separate work station. The stations carry out their tasks in parallel, each on a different car. Once a car has had one task performed, it moves to the next station. Variations in the time needed to complete the tasks can be accommodated by "buffering" (holding one or more cars in a space between the stations) and/or by "stalling" (temporarily halting the upstream stations), until the next station becomes available.

Suppose that assembling one car requires three tasks that take 20, 10, and 15 minutes, respectively. Then, if all three tasks were performed by a single station, the factory would output one car every 45 minutes. By using a pipeline of three stations, the factory would output the first car in 45 minutes, and then a new one every 20 minutes.

As this example shows, pipelining does not decrease the [[latency (engineering)|latency]], that is, the total time for one item to go through the whole system. It does however increase the system's [[throughput]], that is, the rate at which new items are processed after the first one.

=== In computing ===
Computer-related pipelines include:
* [[Instruction pipeline]]s, such as the [[classic RISC pipeline]], which are used in [[central processing unit]]s (CPUs) and other [[Microprocessor|microprocessors]] to allow overlapping execution of multiple instructions with the same [[digital electronics|circuitry]]. The circuitry is usually divided up into stages and each stage processes a specific part of one instruction at a time, passing the partial results to the next stage. Examples of stages are instruction decode, arithmetic/logic and register fetch. They are related to the technologies of [[superscalar execution]], [[operand forwarding]], [[speculative execution]] and [[out-of-order execution]].
* [[Graphics pipeline]]s, found in most [[graphics processing unit]]s (GPUs), which consist of multiple [[arithmetic and logical unit|arithmetic unit]]s, or complete [[central processing unit|CPU]]s, that implement the various stages of common rendering operations ([[perspective projection]], window [[clipping (computer graphics)|clipping]], [[color]] and [[light]] calculation, rendering, etc.).
* [[pipeline (software)|Software pipeline]]s, which consist of a sequence of computing [[process (computing)|processes]] (commands, program runs, tasks, threads, procedures, etc.), conceptually executed in parallel, with the output stream of one process being automatically fed as the input stream of the next one. The [[Unix]] system call [[pipeline (Unix)|pipe]] is a classic example of this concept; a minimal code sketch follows this list.
* [[HTTP pipelining]], the technique of issuing multiple [[HTTP]] requests through the same [[TCP connection]], without waiting for the previous one to finish before issuing a new one.
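A software pipeline of this kind can be sketched in Python with [[generator (computer programming)|generators]], one per stage; this is a minimal illustration (the file name <code>example.txt</code> is hypothetical), loosely analogous to the shell pipeline <code>cat example.txt | grep .</code>:

<syntaxhighlight lang="python">
def read_lines(path):
    # Stage 1: lazily produce one line at a time.
    with open(path) as f:
        yield from f

def strip(lines):
    # Stage 2: remove surrounding whitespace from each item.
    for line in lines:
        yield line.strip()

def non_empty(lines):
    # Stage 3: drop empty items, so a stage may also "destroy" items.
    for line in lines:
        if line:
            yield line

# The output stream of each stage feeds the next. Note that generators are
# pull-based and single-threaded: the stages interleave rather than run on
# truly parallel processing elements.
for line in non_empty(strip(read_lines("example.txt"))):
    print(line)
</syntaxhighlight>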
== Design considerations ==
=== Balancing the stages ===
Since the throughput of a pipeline cannot be better than that of its slowest element, the designer should try to divide the work and resources among the stages so that they all take the same time to complete their tasks. In the car assembly example above, if the three tasks took 15 minutes each, instead of 20, 10, and 15 minutes, the latency would still be 45 minutes, but a new car would be finished every 15 minutes, instead of every 20.
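The effect of balancing can be checked with a short Python fragment, using the stage times from the car example above:

<syntaxhighlight lang="python">
stage_times = [20, 10, 15]      # minutes needed by each stage
latency = sum(stage_times)      # 45: total time for one item to pass through
interval = max(stage_times)     # 20: a new item is finished this often

balanced = [15, 15, 15]         # same total work, evenly divided
assert sum(balanced) == latency # latency is unchanged (45 minutes)
assert max(balanced) == 15      # but a new item now finishes every 15 minutes
</syntaxhighlight>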
=== Buffering ===
Under ideal circumstances, if all processing elements are synchronized and take the same amount of time to process, then each item can be received by each element just as it is released by the previous one, in a single [[clock cycle]].
More generally, buffering between the pipeline stages is necessary when the processing times are irregular, or when items may be created or destroyed along the pipeline.
The buffer between two stages may be simply a [[hardware register]] with suitable synchronization and signalling logic between the two stages. When a stage A stores a data item in the register, it sends a "data available" signal to the next stage B; once B has used that data, it responds with a "data received" signal, and only then does A store the next item into the register.
If the processing times of an element are variable, the whole pipeline may often have to stop, waiting for that element and all the previous ones to consume the items in their input buffers.
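The frequency of such stalls decreases as the buffer is given more slots. The behaviour can be sketched in Python with a bounded FIFO queue between two threads (a minimal illustration, not a model of hardware buffering):

<syntaxhighlight lang="python">
import queue
import threading
import time

buf = queue.Queue(maxsize=2)     # a two-slot buffer between the stages

def stage_a():                   # fast upstream stage
    for i in range(6):
        buf.put(i)               # blocks ("stalls") while the buffer is full
        print("A produced", i)

def stage_b():                   # slow downstream stage
    for _ in range(6):
        time.sleep(0.1)          # irregular, slow processing time
        print("B consumed", buf.get())

threading.Thread(target=stage_a).start()
stage_b()
</syntaxhighlight>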
=== Nonlinear pipelines ===
If some stage takes (or may take) much longer than the others, and cannot be sped up, the designer can provide two or more processing elements to carry out that task in parallel, with a single input buffer and a single output buffer.
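As a sketch of this arrangement, the following Python fragment runs two copies of a slow stage that share a single input buffer and a single output buffer (the function <code>expensive</code> is a hypothetical stand-in for the slow task):

<syntaxhighlight lang="python">
import queue
import threading

tasks = queue.Queue()      # single input buffer for the duplicated stage
results = queue.Queue()    # single output buffer

def expensive(x):
    return x * x           # hypothetical stand-in for the slow task

def slow_stage_worker():
    while True:
        item = tasks.get()
        if item is None:   # sentinel: no more work
            break
        results.put(expensive(item))

# Two parallel copies of the slow stage, roughly doubling its throughput.
workers = [threading.Thread(target=slow_stage_worker) for _ in range(2)]
for w in workers:
    w.start()
for i in range(10):
    tasks.put(i)
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()
</syntaxhighlight>

With parallel copies, items may leave the stage in a different order than they arrived, so a later stage may need to restore the original order if it matters.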
=== Dependencies between items ===
In some applications, the processing of an item Y by a stage A may depend on the results or effect of processing a previous item X by some later stage B of the pipeline.
This situation occurs very often in instruction pipelines. For example, Y may be an instruction that reads the contents of a register that an earlier instruction X has not yet finished writing; if Y reads the register before X's result arrives, it obtains the old value and produces an incorrect result.

In order to handle such conflicts correctly, the pipeline must be provided with extra circuitry or logic that detects them and takes the appropriate action. Strategies for doing so include:
* '''Stalling:''' Every affected stage, such as A, is halted until the dependency is resolved, that is, until the required information is available and/or the required state has been achieved.
* '''Reordering items:''' Instead of stalling, stage A may put item Y aside and look for any subsequent item Z in its input stream that does not have any dependencies pending with any earlier item. In instruction pipelines, this technique is called [[out-of-order execution]]; a sketch of this strategy follows the list.
* '''Guess and backtrack:''' One important example of item-to-item dependency is the handling of a [[conditional branch]] instruction X by an instruction pipeline. The first stage A of the pipeline, which fetches the next instruction Y to be executed, cannot perform its task until X has fetched its operand and determined whether the branch is to be taken or not. That may take many clock cycles, since the operand of X may in turn depend on previous instructions that fetch data from main memory.
: Rather than halt while waiting for X to be finished, stage A may guess whether the branch will be taken or not, and fetch the next instruction Y based on that guess. If the guess later turns out to be incorrect (hopefully rarely), the system would have to backtrack and resume with the correct choice. Namely, all the changes that were made to the machine's state by stage A and subsequent stages based on that guess would have to be undone, the instructions following X already in the pipeline would have to be flushed, and stage A would have to restart with the correct [[instruction pointer]]. This [[branch prediction]] strategy is a special case of [[speculative execution]].
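The reordering strategy, for instance, amounts to scanning the stage's input buffer for the first item whose dependencies are all resolved. A minimal Python sketch (a simplified illustration, not a description of real issue logic):

<syntaxhighlight lang="python">
def next_ready(buffer, completed, deps):
    # deps maps an item to the set of earlier items it depends on.
    # Return the first buffered item whose dependencies have all completed,
    # skipping over items that would otherwise stall the stage.
    for item in buffer:
        if all(d in completed for d in deps.get(item, ())):
            return item
    return None  # every buffered item is blocked: the stage must stall
</syntaxhighlight>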
== Typical software implementations ==
To be effectively implemented, data pipelines need a CPU [[scheduling]] strategy to dispatch work to the available CPU cores, and the usage of [[data structures]] on which the pipeline stages will operate. For example, [[UNIX]] derivatives may pipeline commands, connecting the standard I/O of various processes, using the pipes implemented by the operating system. Some [[operating systems]]{{Such as?|date=July 2020}} may provide [[Unix-like|UNIX-like]] syntax to string several program runs in a pipeline, but implement the latter as simple serial execution, rather than true pipelining; namely, they wait for each program to finish before starting the next one.{{Citation needed|date=July 2020}}
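From within a program, such an operating-system pipe can be set up explicitly. For example, the following Python fragment is equivalent to the shell pipeline <code>ls | wc -l</code>; the pipe provided by the operating system buffers the output of <code>ls</code> while <code>wc</code> consumes it, so the two processes run concurrently:

<syntaxhighlight lang="python">
import subprocess

ls = subprocess.Popen(["ls"], stdout=subprocess.PIPE)
wc = subprocess.Popen(["wc", "-l"], stdin=ls.stdout, stdout=subprocess.PIPE)
ls.stdout.close()   # allow ls to receive SIGPIPE if wc exits first
print(wc.communicate()[0].decode().strip())
</syntaxhighlight>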
Lower-level approaches may rely on the threads provided by the operating system to schedule work on the stages: both [[thread pool]]-based and one-thread-per-stage implementations are viable, and exist.<ref>{{cite web | url=https://github.com/dteod/mtdp.git/ | title=MTDP | website=[[GitHub]] | date=September 2022 }}</ref>
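A one-thread-per-stage design can be sketched in Python by connecting each pair of adjacent stages with a bounded queue (a minimal illustration; the helper names are hypothetical, not part of any library):

<syntaxhighlight lang="python">
import queue
import threading

SENTINEL = object()  # marks the end of the stream

def run_stage(func, inbox, outbox):
    # One pipeline stage: consume from inbox, produce into outbox.
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(func(item))

def run_pipeline(funcs, items, buffer_size=8):
    # One bounded queue between each pair of stages, one thread per stage.
    qs = [queue.Queue(maxsize=buffer_size) for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=run_stage, args=(f, qs[i], qs[i + 1]))
               for i, f in enumerate(funcs)]

    def feed():  # feed from a separate thread so output can be drained below
        for item in items:
            qs[0].put(item)
        qs[0].put(SENTINEL)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()
    out = []
    while (item := qs[-1].get()) is not SENTINEL:
        out.append(item)
    for t in threads:
        t.join()
    return out

# run_pipeline([lambda x: x + 1, lambda x: x * 2], range(5)) == [2, 4, 6, 8, 10]
</syntaxhighlight>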
Other strategies relying on [[cooperative multitasking]] exist that do not need multiple threads of execution, and hence additional CPU cores, such as using a round-robin scheduler with a coroutine-based framework. In this context, each stage may be instantiated with its own coroutine, yielding control back to the scheduler after finishing its round of work. This approach may need careful control over the stages to prevent them from abusing their time slice.
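Such a round-robin scheduler over generator-based coroutines can be sketched in Python as follows; a stage that looped without yielding would starve the others, which is the time-slice abuse mentioned above:

<syntaxhighlight lang="python">
def square_stage(inbox, outbox):
    while True:
        yield                        # hand control back to the scheduler
        while inbox:                 # one bounded round of work
            outbox.append(inbox.pop(0) ** 2)

def print_stage(inbox):
    while True:
        yield
        while inbox:
            print(inbox.pop(0))

source = [1, 2, 3]                   # input buffer of the first stage
mid = []                             # buffer between the two stages
stages = [square_stage(source, mid), print_stage(mid)]
for _ in range(3):                   # a few scheduling rounds
    for stage in stages:
        next(stage)                  # run the stage until its next yield
</syntaxhighlight>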
== Costs and drawbacks ==
A pipelined system typically requires more resources (circuit elements, processing units, computer memory, etc.) than one that executes one batch at a time, because its stages cannot share those resources, and because buffering and additional synchronization logic may be needed between the elements.
Moreover, the transfer of items between separate processing elements may increase the latency, especially for long pipelines.
The additional complexity cost of pipelining may be considerable if there are dependencies between the processing of different items, especially if a guess-and-backtrack strategy is used to handle them.
== New technologies ==
In recent years, the demands placed on applications and their underlying hardware have grown significantly. For example, building pipelines from single-node applications that trawl through the data row by row is no longer feasible with the volume and variety of [[big data]]. However, with the advent of data analytics engines such as [[Hadoop]], or more recently [[Apache Spark]], it has become possible to distribute large datasets across multiple processing nodes, allowing applications to reach efficiencies several hundred times greater than was previously thought possible. The effect of this today is that even a mid-level PC using distributed processing in this fashion can handle the building and running of big data pipelines.<ref>[https://www.datapipelines.com/blog/what-is-a-data-pipeline/ What is a Data Pipeline?] Published by Data Pipelines, retrieved 11 March 2021</ref>
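As an illustration, a distributed pipeline can be expressed in [[Apache Spark]] as a chain of transformations over a partitioned dataset. The following PySpark sketch assumes hypothetical input and output paths:

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Each transformation is a pipeline stage; Spark schedules the stages over
# the partitions of the dataset across the available cores or cluster nodes.
events = spark.read.csv("events.csv", header=True, inferSchema=True)
counts = (events
          .filter(events["value"] > 0)   # stage 1: discard invalid rows
          .groupBy("key")                # stage 2: shuffle by key
          .count())                      # stage 3: aggregate per key
counts.write.csv("counts_out")
spark.stop()
</syntaxhighlight>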
== See also ==
* [[Dataflow]]
* [[Throughput]]
* [[Parallel computing|Parallelism]]
* [[Instruction pipeline]]
** [[Classic RISC pipeline]]
* [[Graphics pipeline]]
* [[Pipeline (software)]]
** [[Pipeline (Unix)]]
** [[Hartmann pipeline]] for VM
** [[BatchPipes]] for MVS
* [[Geometry pipelines]]
* [[XML pipeline]]
* [[Staged event-driven architecture]]
== References ==
<references/>
== Bibliography ==
* {{cite book |last1=Perez Garcia |first1=Pablo |title=Pipeline DSL: a DSL to create a CI/CD pipeline for your projects}}
* For a standard discussion on pipelining in parallel computing see {{cite book |title=Parallel Programming in C with MPI and openMP |first=Michael J. |last=Quinn |publisher=McGraw-Hill Professional |year=2004 |isbn=0072822562 |___location=Dubuque, Iowa |url-access=registration |url=https://archive.org/details/parallelprogramm0000quin }}
* {{cite web |url=https://www.datapipelines.com/blog/what-is-a-data-pipeline/ |title=What is a Data Pipeline? |last=Pogonyi |first=Roland |date=February 2021 |access-date=March 11, 2021}}
[[Category:Instruction processing]]