Revision as of 11:17, 12 August 2025 by Lkcl (→Description: explain GPU memory strategies better)		Revision as of 11:19, 12 August 2025 by Lkcl (→Description: add tiled rendering as it contains all the citations needed)
Line 34: The [[ILLIAC IV]], the world's first known SIMT processor, had its [[ILLIAC_IV#Branches|"branching"]] mechanism extensively documented; in modern terminology, however, that mechanism turns out to be [[Predication_(computer_architecture)#SIMD,_SIMT_and_vector_predication|"predicate masking"]]. Because the access time of all widespread [[random-access memory|RAM]] types (e.g. [[DDR SDRAM]], [[GDDR SDRAM]], [[XDR DRAM]]) is still relatively high, engineers came up with the idea of hiding the latency that inevitably comes with each memory access. As the design of the ILLIAC IV shows, the individual Processing Elements (PEs) run at a slower clock rate than a standard CPU, but make up for the lower clock rate by running a far larger number of PEs in parallel. The upshot is that each PE's (slower) speed is better matched to the speed of RAM. The strategy works because GPU workloads are inherently parallel; [[tiled rendering]] is one example. SIMT is intended to limit [[instruction fetching]] overhead,<ref>{{cite conference |first1=Sean |last1=Rul |first2=Hans |last2=Vandierendonck |first3=Joris |last3=D’Haene |first4=Koen |last4=De Bosschere |title=An experimental study on performance portability of OpenCL kernels |year=2010 |conference=Symp. Application Accelerators in High Performance Computing (SAAHPC) |hdl=1854/LU-1016024 |hdl-access=free }}</ref> i.e. the latency that comes with memory access, and is used in modern GPUs (such as those of [[Nvidia|NVIDIA]] and [[AMD]]) in combination with 'latency hiding' to enable high-performance execution despite considerable latency in memory-access operations. As with SIMD, another major benefit is that many data lanes share the control logic, which increases computational density: one block of control logic can manage N data lanes, instead of the control logic being replicated N times.
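The idea of predicate masking with shared control logic can be illustrated with a small software model. The following is only a sketch under stated assumptions: the function name <code>simt_branch</code>, the lane representation and the example branch are hypothetical, and the code models neither the ILLIAC IV's actual mechanism nor any particular GPU. It shows a per-lane <code>if/else</code> executed as a single instruction stream, where every lane sees both sides of the branch and a mask decides which lanes commit a result.

```python
def simt_branch(values, threshold):
    """Model of: if x > threshold: x -= threshold  else: x *= 2
    executed for all lanes by one shared instruction stream.

    Hypothetical illustration only; not a model of real hardware.
    """
    lanes = list(values)  # one data lane per "thread"
    issued = 0            # instructions issued by the single control unit

    # Instruction 1: compare. Issued once, produces one predicate bit per lane.
    mask = [x > threshold for x in lanes]
    issued += 1

    # Instruction 2: the "then" side. Issued once to every lane;
    # only lanes whose predicate bit is set commit the result.
    lanes = [x - threshold if m else x for x, m in zip(lanes, mask)]
    issued += 1

    # Instruction 3: the "else" side. Issued once with the mask inverted;
    # lanes that took the "then" side are left untouched.
    lanes = [x if m else x * 2 for x, m in zip(lanes, mask)]
    issued += 1

    return lanes, mask, issued


if __name__ == "__main__":
    result, mask, issued = simt_branch([1, 5, 8, 2], threshold=4)
    print(result, mask, issued)
```

In this model the number of issued instructions is the same for 4 lanes as for 4,000, which is the sense in which one block of control logic serves N data lanes. The cost is that lanes on the inactive side of a branch do no useful work while that side is being executed.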

Single instruction, multiple threads: Difference between revisions