Predication (computer architecture): Difference between revisions

Content deleted Content added
SIMD, SIMT and vector predication: clarify the vector predication section, it was a bit of a mess
Tags: Mobile edit Mobile web edit Advanced mobile edit
SIMD, SIMT and vector predication: ILLIAC IV had masked Predicated SWAR! only 2 bits (2x32 or 1x64) but still!
Tags: Mobile edit Mobile web edit Advanced mobile edit
 
(9 intermediate revisions by the same user not shown)
Line 40:
Besides eliminating branches, less code is needed in total, provided the architecture provides predicated instructions. While this does not guarantee faster execution in general, it will if the <code>do_something</code> and <code>do_something_else</code> blocks of code are short enough.
 
Predication's simplest form is ''partial predication'', where the architecture has ''conditional move'' or ''conditional select'' instructions. Conditional move instructions write the contents of one register over another only if the predicate's value is true, whereas conditional select instructions choose which of two registers has its contents written to a third based on the predicate's value. A more generalized and capable form is ''full predication''. Full predication has a set of predicate registers for storing predicates (which allows multiple nested or sequential branches to be simultaneously eliminated) and most instructions in the architecture have aan (optional) register specifier field to specify which predicate register supplies the predicate.<ref>{{cite conference |last1=Mahlke|first1=Scott A.|last2=Hank|first2=Richard E.|last3=McCormick|first3=James E.|last4=August|first4=David I.|last5=Hwn|first5=Wen-mei W.|title=A Comparison of Full and Partial Predicated Execution Support for ILP Processors |conference=The 22nd International Symposium on Computer Architecture, 22–24 June 1995 |year=1995 |doi=10.1145/223982.225965 |isbn=0-89791-698-0 |citeseerx=10.1.1.19.3187}}</ref>
 
==Advantages==
Line 54:
*Predication complicates the hardware by adding levels of [[control unit|logic]] to critical [[datapath|paths]] and potentially degrades clock speed.
*A predicated block includes cycles for all operations, so shorter [[control-flow graph|paths]] may take longer and be penalized.
* An extra register read is required. A non-predicated ADD would read two registers from a register file, where a Predicated ADD would need to also read the predicate register file. This increases Hazards in [[Out-of-order execution]].
*Predication is not usually speculated and causes a longer dependency chain. For ordered data this translates to a performance loss compared to a predictable branch.<ref>{{cite web |last1=Cordes |first1=Peter |title=assembly - How does Out of Order execution work with conditional instructions, Ex: CMOVcc in Intel or ADDNE (Add not equal) in ARM |url=https://stackoverflow.com/a/50960323 |website=Stack Overflow |quote=Unlike with control dependencies (branches), they don't predict or speculate what the flags will be, so a cmovcc instead of a jcc can create a loop-carried dependency chain and end up being worse than a predictable branch. [https://stackoverflow.com/questions/50959808 gcc optimization flag -O3 makes code slower than -O2] is an example of that.}}</ref>
 
Line 72 ⟶ 73:
{{see also|Single instruction, multiple threads}}
 
Some [[SIMD within a register]] instruction sets, like AVX2, have the ability to use a logical [[Mask (computing)|mask]] to conditionally load/store values to memory, in a parallel form of the conditional move,. andThey may also apply individual mask bits to individual arithmetic units executing a parallel operation. One predicate mask bit is available for each sub-word of the SWAR register or each load/store value.
This form of multi-bit predication is also used in [[vector processors]] at the element level (synonymous with SWAR sub-words):
 
<syntaxhighlight lang="c">
This form of predication is also used in [[vector processors]] (at the sub-word, or element level) and [[single instruction, multiple threads]] computing used in modern [[GPUs]], both to enable/disable individual Processing Elements ''and'', separately and further, to mask-out sub-words within any given PE's SWAR ALU. All the techniques, advantages and disadvantages of single scalar predication apply just as well to the parallel processing case.
for each (sub-word i) of SWAR (or Vector) register
(condition-maskbit i) do_something(sub-word i)
(not condition-maskbit i) do_something_else(sub-word i)
</syntaxhighlight>
 
Masking is an integral part of [[Flynn's taxonomy|Array Processors]] such as the [[ILLIAC IV]]. Array Processors are known today as [[single instruction, multiple threads]] (SIMT), and a predicate bit ''per PE'' used to activate or de-activate each Processing Element. When the PE has no [[SIMD within a register]] instructions, each PE may be individually Predicated:
 
<syntaxhighlight lang="c">
for each (PE j) // of non-SWAR synchronously-concurrent array
(active-maskbit j) broadcast_scalar_instruction_to(PE j)
</syntaxhighlight>
 
Modern SIMT [[GPUs]] use (or used, but ILLIAC IV documentation termed it [[ILLIAC IV#Branches|"branching"]]) predication to enable/disable individual Processing Elements ''and'', separately and furthermore, to ''also'' mask-out sub-words within any given PE's SWAR ALU.
 
<syntaxhighlight lang="c">
for each (PE j) of SIMT synchronously-concurrent array
(active-maskbit j) { // broadcast only to active SWAR PEs
for each (sub-word i) of SWAR register in (PE j)
(condition-maskbit i) do_something(sub-word i)
(not condition-maskbit i) do_something_else(sub-word i)
}
</syntaxhighlight>
 
All the techniques, advantages and disadvantages of single scalar predication apply just as well to the parallel processing case, where the issues associated with branching are made far more complex.<ref>https://gpgpuarch.org/en/basic/simt/</ref>
 
==See also==