{{For|threads in software|Thread (computing)}}
{{Multiple issues|
{{Refimprove|date=October 2009}}
{{Original research|date=October 2010}}
}}
{{short description|Ability of a CPU to provide multiple threads of execution concurrently}}
 
[[File:Multithreaded process.svg|thumb|A process with two threads of execution, running on a single processor |alt=A process with two threads of execution, running on a single processor. Thread #1 is executed first, eventually starts Thread #2, and waits for a response. When Thread #2 finishes, it signals Thread #1 to resume execution to completion and then finishes.]]
 
In [[computer architecture]], '''multithreading''' is the ability of a [[central processing unit]] (CPU) (or a single core in a [[multi-core processor]]) to provide multiple [[thread (computer science)|threads of execution]].
 
==Overview==
The multithreading [[paradigm]] has become more popular as efforts to further exploit [[instruction-level parallelism]] have stalled since the late 1990s. This allowed the concept of [[throughput computing]] to re-emerge from the more specialized field of [[transaction processing]]. Even though it is very difficult to further speed up a single thread or single program, most computer systems are actually multitasking among multiple threads or programs. Thus, techniques that improve the throughput of all tasks result in overall performance gains.
 
Two major techniques for throughput computing are ''multithreading'' and ''[[multiprocessing]]''.
 
===Advantages===
If a thread gets a lot of [[CPU cache#Cache miss|cache misses]], the other threads can continue taking advantage of the unused computing resources, which may lead to faster overall execution, as these resources would have been idle if only a single thread were executed. Also, if a thread cannot use all the computing resources of the CPU (because instructions depend on each other's result), running another thread may prevent those resources from becoming idle.
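
The size of this effect can be illustrated with a toy utilization model, sketched below in C++. The per-thread stall fraction, the assumption that threads stall independently, and the thread counts are hypothetical values chosen for illustration, not measurements of any real processor.
<syntaxhighlight lang="cpp">
#include <cmath>
#include <cstdio>

// Toy model: each hardware thread is stalled (e.g., waiting on memory) for a
// fraction `stall` of its cycles, independently of the others. The core does
// useful work whenever at least one resident thread is ready, so the expected
// utilization with n threads is 1 - stall^n. All numbers are assumptions.
int main() {
    const double stall = 0.6;  // assumed per-thread stall fraction
    for (int n = 1; n <= 4; ++n) {
        double utilization = 1.0 - std::pow(stall, n);
        std::printf("%d thread(s): ~%.0f%% of cycles do useful work\n",
                    n, 100.0 * utilization);
    }
}
</syntaxhighlight>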
 
===Disadvantages===
Multiple threads can interfere with each other when sharing hardware resources such as caches or [[translation lookaside buffer]]s (TLBs). As a result, execution times of a single thread are not improved and can be degraded, even when only one thread is executing, due to lower frequencies or additional pipeline stages that are necessary to accommodate thread-switching hardware.
 
Overall efficiency varies; [[Intel]] claims up to 30% improvement with its [[Hyper-Threading Technology]],<ref>{{cite web|url=http://cache-www.intel.com/cd/00/00/01/77/17705_htt_user_guide.pdf|title=Intel Hyper-Threading Technology, Technical User's Guide|page=13|archive-url=https://web.archive.org/web/20100821074918/http://cache-www.intel.com/cd/00/00/01/77/17705_htt_user_guide.pdf|archive-date=2010-08-21}}</ref> while a synthetic program just performing a loop of non-optimized dependent floating-point operations actually gains a 100% speed improvement when run in parallel. On the other hand, hand-tuned [[assembly language]] programs using [[MMX (instruction set)|MMX]] or [[AltiVec]] extensions and performing data prefetches (as a good video encoder might) do not suffer from cache misses or idle computing resources. Such programs therefore do not benefit from hardware multithreading and can indeed see degraded performance due to contention for shared resources.
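
One way such contention can be observed from software is ''false sharing'': two threads repeatedly write to logically independent counters that happen to occupy the same cache line, so the line bounces between the hardware contexts. The sketch below is a minimal C++ illustration; the 64-byte alignment and the iteration count are assumptions about a typical machine, and removing the <code>alignas</code> specifiers forces both counters into one cache line so the slowdown can be timed.
<syntaxhighlight lang="cpp">
#include <chrono>
#include <cstdio>
#include <thread>

struct Counters {
    // With alignas(64), each counter gets its own cache line; remove the
    // alignment to pack both counters into one line and provoke false sharing.
    alignas(64) long a = 0;
    alignas(64) long b = 0;
};

int main() {
    Counters c;
    const long iters = 100'000'000;        // assumed workload size
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < iters; ++i) ++c.a; });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) ++c.b; });
    t1.join();
    t2.join();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
    std::printf("sum=%ld, elapsed %.2f s\n", c.a + c.b, dt.count());
}
</syntaxhighlight>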
 
From the software standpoint, hardware support for multithreading is more visible to software than multiprocessing is, requiring more changes to both application programs and operating systems. Hardware techniques used to support [[thread (computer science)|multithreading]] often parallel the software techniques used for [[multitasking of computer programs|computer multitasking]]. Thread scheduling is also a major problem in multithreading.
 
Merging data from two processes can often incur significantly higher costs compared to processing the same data on a single thread, potentially by two or more orders of magnitude due to overheads such as inter-process communication and synchronization. <ref>{{Cite book |title=Operating System Concepts |isbn=978-0470128725 |last1=Silberschatz |first1=Abraham |last2=Galvin |first2=Peter B. |last3=Gagne |first3=Greg |date=29 July 2008 |publisher=Wiley }}</ref><ref>{{Cite book |title=Computer Organization and Design MIPS Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design) |date=2013 |publisher=Morgan Kaufmann |isbn=978-0124077263}}</ref><ref>{{Cite book |title=Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers |date=2005 |publisher=Pearson |isbn=978-0131405639}}</ref>
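
The cost of fine-grained synchronization can be made visible with a deliberately poor parallelization, sketched below in C++. The array size and the per-element locking are artificial choices made to expose the overhead; a realistic parallel sum would let each thread accumulate privately and combine the partial sums once at the end.
<syntaxhighlight lang="cpp">
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(5'000'000, 1);   // assumed workload

    auto t0 = std::chrono::steady_clock::now();
    long serial = std::accumulate(data.begin(), data.end(), 0L);
    auto t1 = std::chrono::steady_clock::now();

    // Two threads sharing one accumulator and taking a lock per element:
    // the synchronization cost dwarfs the work being divided.
    long shared = 0;
    std::mutex m;
    auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) {
            std::lock_guard<std::mutex> lock(m);
            shared += data[i];
        }
    };
    std::thread a(worker, std::size_t{0}, data.size() / 2);
    std::thread b(worker, data.size() / 2, data.size());
    a.join();
    b.join();
    auto t2 = std::chrono::steady_clock::now();

    std::chrono::duration<double> ds = t1 - t0, dp = t2 - t1;
    std::printf("serial sum=%ld in %.3f s, locked two-thread sum=%ld in %.3f s\n",
                serial, ds.count(), shared, dp.count());
}
</syntaxhighlight>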
 
==Types==

===Coarse-grained multithreading===
{{main|Temporal multithreading}}
 
The simplest type of multithreading occurs when one thread runs until it is blocked by an event that normally would create a long-latency stall. Such a stall might be a cache miss that has to access off-chip memory, which might take hundreds of CPU cycles for the data to return. Instead of waiting for the stall to resolve, a threaded processor would switch execution to another thread that was ready to run. Only when the data for the previous thread had arrived, would the previous thread be placed back on the list of [[process state#Ready|ready-to-run]] threads.
 
For example:
# Cycle {{mvar|i}}: instruction {{mvar|j}} from thread {{mvar|A}} is issued.
# Cycle {{math|''i'' + 1}}: instruction {{math|''j'' + 1}} from thread {{mvar|A}} is issued.
# Cycle {{math|''i'' + 2}}: instruction {{math|''j'' + 2}} from thread {{mvar|A}} is issued, which is a load instruction that misses in all caches.
# Cycle {{math|''i'' + 3}}: the thread scheduler is invoked and switches to thread {{mvar|B}}.
# Cycle {{math|''i'' + 4}}: instruction {{mvar|k}} from thread {{mvar|B}} is issued.
# Cycle {{math|''i'' + 5}}: instruction {{math|''k'' + 1}} from thread {{mvar|B}} is issued.
 
Conceptually, it is similar to cooperative multi-tasking used in [[real-time operating system]]s, in which tasks voluntarily give up execution time when they need to wait upon some type of event. This type of multithreading is known as block, cooperative or coarse-grained multithreading.
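
The switching policy in the cycle-by-cycle example above can be sketched as a small simulator, shown below in C++. The instruction counts, the rule for when a miss occurs, and the 100-cycle miss penalty are all invented for illustration and do not describe any particular processor.
<syntaxhighlight lang="cpp">
#include <cstdio>
#include <vector>

struct HwThread {
    int remaining = 20;   // instructions left to issue (assumed)
    int stallUntil = 0;   // cycle at which an outstanding miss resolves
};

int main() {
    const int missPenalty = 100;  // assumed off-chip memory latency in cycles
    std::vector<HwThread> threads(2);
    int current = 0, issued = 0, idle = 0;
    for (int cycle = 0; issued < 40; ++cycle) {
        // Block multithreading: keep running the current thread until it is
        // blocked, then switch to the other thread if that one is ready.
        if (threads[current].stallUntil > cycle || threads[current].remaining == 0) {
            int other = 1 - current;
            if (threads[other].stallUntil > cycle || threads[other].remaining == 0) {
                ++idle;          // no thread ready: the pipeline sits idle
                continue;
            }
            current = other;     // hardware swaps to the other register set
        }
        HwThread &t = threads[current];
        --t.remaining;
        ++issued;
        if (t.remaining % 7 == 0)               // pretend this was a load that
            t.stallUntil = cycle + missPenalty; // misses in all caches
    }
    std::printf("issued %d instructions with %d idle cycles\n", issued, idle);
}
</syntaxhighlight>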
 
The goal of multithreading hardware support is to allow quick switching between a blocked thread and another thread ready to run. Switching from one thread to another means the hardware switches from using one register set to another; to achieve this, the hardware for the program-visible registers, as well as some processor control registers (such as the program counter), is replicated. For example, to quickly switch between two threads, the processor is built with two sets of registers.
 
Additional hardware support for multithreading allows thread switching to be done in one CPU cycle, bringing performance improvements. Also, additional hardware allows each thread to behave as if it were executing alone and not sharing any hardware resources with other threads, minimizing the amount of software changes needed within the application and the operating system to support multithreading.
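
A minimal sketch of this replication is shown below in C++; the two register sets and the 32 general-purpose registers per thread are illustrative assumptions. Each hardware thread owns a complete architectural register set, so a thread switch is nothing more than changing which set the pipeline reads and writes.
<syntaxhighlight lang="cpp">
#include <array>
#include <cstdint>

// Each hardware thread owns a full copy of the software-visible state:
// general-purpose registers plus control registers such as the program counter.
struct RegisterSet {
    std::array<std::uint64_t, 32> gpr{};   // general-purpose registers (assumed count)
    std::uint64_t pc = 0;                  // program counter
};

class Core {
    std::array<RegisterSet, 2> regs;  // one complete set per resident thread
    int active = 0;                   // which set the pipeline is currently using
public:
    // Switching threads only changes an index; no state is saved or restored,
    // which is what allows the switch to complete in a single cycle.
    void switchThread() { active = 1 - active; }
    RegisterSet &current() { return regs[active]; }
};

int main() {
    Core core;
    core.current().pc = 0x1000;       // thread 0 starts here (made-up address)
    core.switchThread();
    core.current().pc = 0x2000;       // thread 1 has its own, independent PC
    core.switchThread();              // back to thread 0: pc is still 0x1000
    return core.current().pc == 0x1000 ? 0 : 1;
}
</syntaxhighlight>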
 
Many families of [[microcontroller]]s and embedded processors have multiple register banks to allow quick [[context switch]]ing for interrupts. Such schemes can be considered a type of block multithreading among the user program thread and the interrupt threads.{{citation needed|date=October 2010}}
 
=== Fine-grained multithreading ===
{{main|Barrel processor}}
 
The purpose of fine-grained multithreading is to remove all [[data dependency]] stalls from the execution [[pipeline (computing)|pipeline]]. Since one thread is relatively independent from other threads, there is less chance of one instruction in one pipelining stage needing an output from an older instruction in the pipeline. Conceptually, it is similar to [[preemption (computing)|preemptive]] multitasking used in operating systems; an analogy would be that the time slice given to each active thread is one CPU cycle.
 
For example:
# Cycle {{mvar|i}}: an instruction from thread {{mvar|A}} is issued.
# Cycle {{math|''i'' + 1}}: an instruction from thread {{mvar|B}} is issued.
# Cycle {{math|''i'' + 2}}: an instruction from thread {{mvar|C}} is issued.
 
This type of multithreading was first called barrel processing, in which the [[stave (wood)|staves]] of a barrel represent the pipeline stages and their executing threads. Interleaved, preemptive, fine-grained or time-sliced multithreading are more modern terminology.
 
In addition to the hardware costs discussed in the block type of multithreading, interleaved multithreading has an additional cost of each pipeline stage tracking the thread ID of the instruction it is processing. Also, since there are more threads being executed concurrently in the pipeline, shared resources such as caches and TLBs need to be larger to avoid thrashing between the different threads.
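
This issue pattern can be sketched as a small round-robin simulator, shown below in C++. The number of resident threads, the dependency latency, and the rule for when a thread stalls are invented for illustration.
<syntaxhighlight lang="cpp">
#include <cstdio>
#include <vector>

// Toy model of fine-grained (interleaved) multithreading: every cycle the
// core issues from a different thread, skipping threads that are stalled,
// so back-to-back instructions in the pipeline come from independent threads.
int main() {
    const int numThreads = 4;                  // assumed number of hardware threads
    std::vector<int> stallUntil(numThreads, 0);
    int next = 0;                              // rotation pointer
    for (int cycle = 0; cycle < 24; ++cycle) {
        int chosen = -1;
        for (int i = 0; i < numThreads; ++i) { // pick the next ready thread in rotation
            int t = (next + i) % numThreads;
            if (stallUntil[t] <= cycle) { chosen = t; break; }
        }
        if (chosen < 0) {                      // every thread is waiting on something
            std::printf("cycle %2d: no thread ready, slot wasted\n", cycle);
            continue;
        }
        std::printf("cycle %2d: issue from thread %c\n", cycle, 'A' + chosen);
        if (cycle % 5 == 4)                    // pretend the instruction just issued
            stallUntil[chosen] = cycle + 8;    // starts an 8-cycle dependency stall
        next = (chosen + 1) % numThreads;
    }
}
</syntaxhighlight>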
 
===Simultaneous multithreading===
{{main|Simultaneous multithreading}}
 
The most advanced type of multithreading applies to [[superscalar processor]]s. Whereas a normal superscalar processor issues multiple instructions from a single thread every CPU cycle, in simultaneous multithreading (SMT) a superscalar processor can issue instructions from multiple threads every CPU cycle. Recognizing that any single thread has a limited amount of [[instruction-level parallelism]], this type of multithreading tries to exploit parallelism available across multiple threads to decrease the waste associated with unused issue slots.
 
For example:
# Cycle {{mvar|i}}: instructions {{mvar|j}} and {{math|''j'' + 1}} from thread {{mvar|A}} and instruction {{mvar|k}} from thread {{mvar|B}} are simultaneously issued.
# Cycle {{math|''i'' + 1}}: instruction {{math|''j'' + 2}} from thread {{mvar|A}}, instruction {{math|''k'' + 1}} from thread {{mvar|B}}, and instruction {{mvar|m}} from thread {{mvar|C}} are all simultaneously issued.
# Cycle {{math|''i'' + 2}}: instruction {{math|''j'' + 3}} from thread {{mvar|A}} and instructions {{math|''m'' + 1}} and {{math|''m'' + 2}} from thread {{mvar|C}} are all simultaneously issued.
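
A sketch of this issue policy is shown below in C++; the issue width, the number of threads, and each thread's instruction count and available instruction-level parallelism per cycle are invented for illustration.
<syntaxhighlight lang="cpp">
#include <algorithm>
#include <array>
#include <cstdio>

// Toy model of simultaneous multithreading on a 3-wide superscalar core:
// every cycle, up to `width` issue slots are filled with ready instructions
// drawn from several threads, rather than from a single thread only.
int main() {
    const int width = 3;                          // assumed issue width
    std::array<int, 3> readyPerCycle = {2, 1, 2}; // per-thread ILP per cycle (assumed)
    std::array<int, 3> remaining = {8, 6, 10};    // instructions left per thread (assumed)
    for (int cycle = 0; remaining[0] + remaining[1] + remaining[2] > 0; ++cycle) {
        int slots = width;
        std::printf("cycle %d:", cycle);
        for (int t = 0; t < 3 && slots > 0; ++t) {
            // Each thread contributes as many ready instructions as fit.
            int issue = std::min({readyPerCycle[t], remaining[t], slots});
            for (int k = 0; k < issue; ++k)
                std::printf(" %c", 'A' + t);
            remaining[t] -= issue;
            slots -= issue;
        }
        std::printf("\n");
    }
}
</syntaxhighlight>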
 
To distinguish the other types of multithreading from SMT, the term "[[temporal multithreading]]" is used to denote when instructions from only one thread can be issued at a time.
 
In addition to the hardware costs discussed for interleaved multithreading, SMT has the additional cost of each pipeline stage tracking the thread ID of each instruction being processed. Again, shared resources such as caches and TLBs have to be sized for the large number of active threads being processed.
 
Implementations include [[Digital Equipment Corporation|DEC]] (later [[Compaq]]) [[Alpha 21464|EV8]] (not completed), [[Intel]] [[Hyper-Threading Technology]], [[IBM]] [[POWER5]]/[[POWER6]]/[[POWER7]]/[[POWER8]]/[[POWER9]], IBM [[IBM z13 (microprocessor)|z13]]/[[IBM z14 (microprocessor)|z14]]/[[IBM z15 (microprocessor)|z15]], [[Sun Microsystems]] [[UltraSPARC T2]], [[Cray]] [[Cray XMT|XMT]], and [[AMD]] [[Bulldozer (microarchitecture)|Bulldozer]] and [[Zen (microarchitecture)|Zen]] microarchitectures.
 
==Implementation specifics==
A major area of research is the thread scheduler that must quickly choose from among the list of ready-to-run threads to execute next, as well as maintain the ready-to-run and stalled thread lists. An important subtopic is the different thread priority schemes that can be used by the scheduler. The thread scheduler might be implemented totally in software, totally in hardware, or as a hardware/software combination.
 
Another area of research is what type of events should cause a thread switch: cache misses, inter-thread communication, [[direct memory access|DMA]] completion, etc.
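
A minimal sketch of such a scheduler is shown below in C++. The priority scheme, the event name, and the data structures are hypothetical choices made for illustration, not a description of any particular hardware or operating system.
<syntaxhighlight lang="cpp">
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

// Hypothetical thread scheduler sketch: ready threads are picked by priority;
// a thread that blocks on a long-latency event moves to a stalled list and
// returns to the ready list when the matching completion event arrives.
struct ThreadCtx {
    int id;
    int priority;              // larger value runs first (assumed convention)
    std::string waitingOn;     // empty while the thread is runnable
};

struct ByPriority {
    bool operator()(const ThreadCtx &a, const ThreadCtx &b) const {
        return a.priority < b.priority;   // max-heap keyed on priority
    }
};

int main() {
    std::priority_queue<ThreadCtx, std::vector<ThreadCtx>, ByPriority> ready;
    std::vector<ThreadCtx> stalled;

    ready.push({0, 1, ""});
    ready.push({1, 3, ""});

    // Dispatch: choose the highest-priority ready thread.
    ThreadCtx running = ready.top();
    ready.pop();
    std::printf("running thread %d\n", running.id);

    // The running thread blocks (e.g., on a cache miss): park it.
    running.waitingOn = "cache-miss";
    stalled.push_back(running);

    // Completion event arrives: move the waiting thread back to ready.
    for (auto it = stalled.begin(); it != stalled.end(); ++it) {
        if (it->waitingOn == "cache-miss") {
            ThreadCtx woken = *it;
            woken.waitingOn.clear();
            ready.push(woken);
            stalled.erase(it);
            break;
        }
    }
    std::printf("next thread to run: %d\n", ready.top().id);
}
</syntaxhighlight>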
 
If the multithreading scheme replicates all of the software-visible state, including privileged control registers and TLBs, then it enables [[virtual machine]]s to be created for each thread. This allows each thread to run its own operating system on the same processor. On the other hand, if only user-mode state is saved, then less hardware is required, which would allow more threads to be active at one time for the same die area or cost.
 
==See also==
*[[Async/await]]
*[[Super-threading]]
*[[Speculative multithreading]]
 
==References==
{{reflist}}
 
==External links==
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.9105&rep=rep1&type=pdf A Survey of Processors with Explicit Multithreading], [[Association for Computing Machinery|ACM]], March 2003, by Theo Ungerer, Borut Robič, and Jurij Šilc
*[https://www.geeksforgeeks.org/operating-system-difference-multitasking-multithreading-multiprocessing/ Operating System | Difference between Multitasking, Multithreading and Multiprocessing] GeeksforGeeks, 6 Sept. 2018.
 
{{Computer science}}
{{CPU technologies}}
{{Parallel computing}}
 
 
 
 
 
[[Category:Computer architecture]]
[[Category:Central processing unit]]
[[Category:Instruction processing]]
[[Category:Microprocessors]]
[[Category:Parallel computing]]
[[Category:Threads (computing)]]
 