Memory access pattern
In [[computing]], a '''memory access pattern''' or '''IO access pattern''' is the pattern with which a system or program reads and writes [[Memory (computing)|memory]] on [[Computer data storage#Secondary storage|secondary storage]]. These patterns differ in the level of [[locality of reference]] and drastically affect [[Cache (computing)|cache]] performance,<ref name=":0">{{cite web |title=Introduction to Data-Oriented Design |url=http://www.dice.se/wp-content/uploads/2014/12/Introduction_to_Data-Oriented_Design.pdf |url-status=dead |archive-url=https://web.archive.org/web/20191116014412/http://www.dice.se/wp-content/uploads/2014/12/Introduction_to_Data-Oriented_Design.pdf |archive-date=2019-11-16}}</ref> and also have implications for the approach to [[parallelism (computing)|parallelism]]<ref>{{cite journal|last1=Jang|first1=Byunghyun|last2=Schaa|first2=Dana|last3=Mistry|first3=Perhaad|last4=Kaeli|first4=David|name-list-style=amp|date=2010-05-27|title=Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures|url=http://ieeexplore.ieee.org/document/5473222/|url-access=subscription|journal=IEEE Transactions on Parallel and Distributed Systems|publisher=[[Institute of Electrical and Electronics Engineers|IEEE]]|___location=New York|volume=22|issue=1|pages=105–118|eissn=1558-2183|doi=10.1109/TPDS.2010.107|s2cid=15997131|issn=1045-9219|id=NLM unique id 101212014}}</ref><ref>{{cite book |last1=Jeffers |first1=James |last2=Reinders |first2=James |last3=Sodani |first3=Avinash |url=https://books.google.com/books?id=DDpUCwAAQBAJ&q=scatter+memory+access+pattern&pg=PA231 |title=Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition |date=2016-05-31 |publisher=Morgan Kaufmann |isbn=9780128091951 |edition=2nd}}</ref> and distribution of workload in [[shared memory system]]s.<ref>{{Cite book |last1=Jana |first1=Siddhartha |last2=Schuchart |first2=Joseph |last3=Chapman |first3=Barbara |author3-link=Barbara Chapman |chapter=Analysis of Energy and Performance of Code Transformations for PGAS-based Data Access Patterns |date=2014-10-06 |title=Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models |chapter-url=https://nic.uoregon.edu/pgas14/papers/pgas14_submission_17.pdf |series=PGAS '14 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=1–10 |doi=10.1145/2676870.2676882 |isbn=978-1-4503-3247-7}}</ref> Further, [[cache coherency]] issues can affect [[multiprocessor]] performance,<ref>{{Cite book |last1=Marandola |first1=Jussara |last2=Louise |first2=Stéphane |last3=Cudennec |first3=Loïc |last4=Acquaviva |first4=Jean-Thomas |last5=Bader |first5=David |chapter=Enhancing Cache Coherent Architectures with access patterns for embedded manycore systems |date=2012-10-11 |title=2012 International Symposium on System on Chip (SoC) |chapter-url=https://inria.hal.science/hal-00741947v1 |pages=1–7 |language=en |publisher=IEEE |doi=10.1109/ISSoC.2012.6376369 |isbn=978-1-4673-2896-8 |url=https://hal.inria.fr/hal-00741947/file/SoC_2012.pdf}}</ref> which means that certain memory access patterns place a ceiling on parallelism (which [[Manycore processor|manycore]] approaches seek to break).<ref>{{cite web|title=Intel Terascale|url=https://cseweb.ucsd.edu/classes/fa12/cse291-c/talks/SCC-80-core-cern.pdf}}</ref>
 
[[Computer memory]] is usually described as "[[random access memory|random access]]", but traversals by software will still exhibit patterns that can be exploited for efficiency. Various tools exist to help system designers<ref>{{cite book |last1=Brown |first1=Mary |url=http://dl.acm.org/citation.cfm?id=838115 |title=Memory Access Pattern Analysis |last2=Jenevein |first2=Roy M. |last3=Ullah |first3=Nasr |date=29 November 1998 |isbn=9780769504506 |series=WWC '98: Proceedings of the Workload Characterization: Methodology and Case Studies |publication-date=1998-11-29 |page=105}}</ref> and programmers understand, analyse and improve the memory access pattern, including [[VTune]] and [[Intel Advisor|Vectorization Advisor]],<ref>{{Cite book |last1=Ostadzadeh |first1=S. Arash |last2=Meeuws |first2=Roel J. |last3=Galuzzi |first3=Carlo |last4=Bertels |first4=Koen |chapter=QUAD – A Memory Access Pattern Analyser |series=Lecture Notes in Computer Science |date=2010 |volume=5992 |editor-last=Sirisuk |editor-first=Phaophak |editor2-last=Morgan |editor2-first=Fearghal |editor3-last=El-Ghazawi |editor3-first=Tarek |editor4-last=Amano |editor4-first=Hideharu |title=Reconfigurable Computing: Architectures, Tools and Applications |chapter-url=http://ce-publications.et.tudelft.nl/publications/207_quad__a_memory_access_pattern_analyser.pdf |language=en |___location=Berlin, Heidelberg |publisher=Springer |pages=269–281 |doi=10.1007/978-3-642-12133-3_25 |isbn=978-3-642-12133-3}}</ref><ref>{{Cite book |last1=Che |first1=Shuai |last2=Sheaffer |first2=Jeremy W. 
|last3=Skadron |first3=Kevin |chapter=Dymaxion: Optimizing memory access patterns for heterogeneous systems |date=2011-11-12 |title=Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis |chapter-url=https://www.cs.virginia.edu/~skadron/Papers/sc11_dymaxion_dist.pdf |series=SC '11 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=1–11 |doi=10.1145/2063384.2063401 |isbn=978-1-4503-0771-0}}</ref><ref>{{Cite book |last=Harrison |first=Luddy |chapter=Examination of a memory access classification scheme for pointer-intensive and numeric programs |date=1996-01-01 |title=Proceedings of the 10th international conference on Supercomputing - ICS '96 |chapter-url=https://dl.acm.org/doi/10.1145/237578.237595 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=133–140 |doi=10.1145/237578.237595 |isbn=978-0-89791-803-9}}</ref><ref>{{cite book|chapter=Online Memory Access Pattern Analysis on an Application Profiling Tool|doi=10.1109/CANDAR.2014.86|isbn=978-1-4799-4152-0|title=2014 Second International Symposium on Computing and Networking|year=2014|last1=Matsubara|first1=Yuki|last2=Sato|first2=Yukinori|pages=602–604|s2cid=16476418}}</ref><ref>{{cite web|title=Putting Your Data and Code in Order: Data and layout|url=https://software.intel.com/en-us/articles/putting-your-data-and-code-in-order-data-and-layout-part-2}}</ref> including tools to address [[GPU]] memory access patterns.<ref>{{Cite book |last1=Kim |first1=Yooseong |last2=Shrivastava |first2=Aviral |chapter=CuMAPz: A tool to analyze memory access patterns in CUDA |date=2011-06-05 |title=Proceedings of the 48th Design Automation Conference |chapter-url=https://dl.acm.org/doi/10.1145/2024724.2024754 |series=DAC '11 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=128–133 |doi=10.1145/2024724.2024754 |isbn=978-1-4503-0636-2}}</ref>
 
Memory access patterns also have implications for [[security (computing)|security]],<ref>{{Cite book |last1=Kim |first1=Yooseong |last2=Shrivastava |first2=Aviral |chapter=CuMAPz: A tool to analyze memory access patterns in CUDA |date=2011-06-05 |title=Proceedings of the 48th Design Automation Conference |chapter-url=https://dl.acm.org/doi/10.1145/2024724.2024754 |series=DAC '11 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=128–133 |doi=10.1145/2024724.2024754 |isbn=978-1-4503-0636-2}}</ref><ref>{{Cite thesis |last1=Canteaut |first1=Anne |last2=Lauradoux |first2=Cédric |last3=Seznec |first3=André |title=Understanding cache attacks |date=2006 |degree=report |publisher=INRIA |url=https://inria.hal.science/inria-00071387/en/ |issn=0249-6399 |language=en}}</ref> which motivates some to try and disguise a program's activity for [[Privacy (computing)|privacy]] reasons.<ref>{{Cite web |last=Hardesty |first=Larry |date=2013-07-02 |title=Protecting data in the cloud |url=https://news.mit.edu/2013/protecting-data-in-the-cloud-0702 |access-date= |website=MIT News |language=en}}</ref><ref>{{Cite web |last=Rossi |first=Ben |date=2013-09-24 |title=Boosting cloud security with oblivious RAM |url=https://www.information-age.com/boosting-cloud-security-with-oblivious-ram-28698/ |access-date= |website=Information Age |language=en-US}}</ref>
 
== Examples ==
[[File:Random vs sequential access.svg|thumb|right|Some publications incorrectly draw ''sequential'' and ''linear'' patterns as counterparts to each other, while real-world [[workloads]] contain almost innumerable patterns.<ref>{{cite web|author1=Chuck Paridon|title=Storage Performance Benchmarking Guidelines - Part I: Workload Design|url=http://www.snia.org/sites/default/files/PerformanceBenchmarking.Nov2010.pdf|quote=In practice, IO access patterns are as numerous as the stars}}</ref>]]
 
=== Sequential ===
{{main|Sequential access}}
The simplest extreme is the [[sequential access]] pattern, where data is read, processed, and written out with straightforward incremented/decremented addressing. These access patterns are highly amenable to [[prefetching (computing)|prefetching]].
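A minimal sketch in Python (illustrative only; real prefetch behaviour depends on the hardware) of a purely sequential read pattern:

```python
def sequential_sum(data):
    """Read each element exactly once, in order (stride of 1).

    Addresses are visited as i, i+1, i+2, ..., which hardware
    prefetchers detect easily.
    """
    total = 0
    for i in range(len(data)):
        total += data[i]
    return total
```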
 
=== Strided ===
[[Stride (computing)|Strided]] or simple 2D or 3D access patterns (e.g., stepping through [[multi-dimensional array]]s) are similarly easy to predict, and are found in implementations of [[linear algebra]] algorithms and [[image processing]]. [[Loop tiling]] is an effective approach.<ref>{{Cite book |last1=Kennedy |first1=Ken |last2=McKinley |first2=Kathryn S. |chapter=Optimizing for parallelism and data locality |date=1992-08-01 |title=Proceedings of the 6th international conference on Supercomputing - ICS '92 |chapter-url=https://www.cs.utexas.edu/~mckinley/papers/par-mem-ics-92.pdf |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=323–334 |doi=10.1145/143369.143427 |isbn=978-0-89791-485-7}}</ref> Some systems with [[Direct memory access|DMA]] provided a strided mode for transferring data between subtiles of larger [[2D array]]s and [[scratchpad memory]].<ref>{{Cite book |last1=Saidi |first1=Selma |last2=Tendulkar |first2=P. |last3=Lepley |first3=Thierry |last4=Maler |first4=O. |chapter=Optimal 2D Data Partitioning for DMA Transfers on MPSoCs |date=2012 |title=2012 15th Euromicro Conference on Digital System Design |chapter-url=https://www-verimag.imag.fr/~maler/Papers/dma2dim.pdf |pages=584–591 |language=en |publisher=IEEE|doi=10.1109/DSD.2012.99 |isbn=978-0-7695-4798-5 }}</ref>
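For illustration (a hypothetical sketch, not from the cited papers), a strided pattern arises when a row-major 2D array is traversed column by column:

```python
def column_sums(matrix, rows, cols):
    """Column-major traversal of a row-major flat array.

    Each successive read jumps ahead by `cols` elements: a classic
    strided access pattern, still easy for hardware to predict.
    """
    sums = [0] * cols
    for c in range(cols):
        for r in range(rows):
            sums[c] += matrix[r * cols + c]  # stride = cols
    return sums
```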
 
=== Linear ===
A linear access pattern is closely related to "strided", where a [[memory address]] may be computed from a linear combination of some index. Stepping through indices sequentially with a linear pattern yields [[Stride of an array|strided access]]. A linear access pattern for writes (with any access pattern for non-overlapping reads) may guarantee that an algorithm can be parallelised, which is exploited in systems supporting [[compute kernel]]s.
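The address computation behind a linear pattern can be sketched as follows (illustrative Python; `base` and `strides` are hypothetical names, mirroring how a compiler lays out a multi-dimensional array):

```python
def address_of(base, strides, indices):
    """A linear access pattern: the address is a linear combination
    of the indices, e.g. base + i*stride_i + j*stride_j.
    """
    return base + sum(s * i for s, i in zip(strides, indices))
```

Stepping any one index while holding the others fixed yields a strided sequence of addresses.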
 
=== Nearest neighbor ===
Nearest neighbor memory access patterns appear in simulation, and are related to sequential or strided patterns. An algorithm may traverse a data structure using information from the nearest neighbors of a data element (in one or more dimensions) to perform a calculation. These are common in physics simulations operating on grids.<ref name="PGAS programming"/> Nearest neighbor can also refer to inter-node communication in a cluster; physics simulations which rely on such local access patterns can be parallelized with the data partitioned into cluster nodes, with purely nearest-neighbor communication between them, which may have advantages for latency and communication bandwidth. This use case maps well onto [[torus network topology]].<ref>{{Cite book |last1=Weinberg |first1=Jonathan |last2=McCracken |first2=Michael O. |last3=Snavely |first3=Allan |last4=Strohmaier |first4=Erich |chapter=Quantifying Locality in the Memory Access Patterns of HPC Applications |date=12–18 November 2005 |page=50 |title=ACM/IEEE SC 2005 Conference (SC'05) |chapter-url=https://www.sdsc.edu/~allans/sc05_locality.pdf |url-status=dead |publisher=IEEE |publication-place=Seattle, WA, USA |doi=10.1109/SC.2005.59 |isbn=1-59593-061-2 |url=https://digital.library.unt.edu/ark:/67531/metadc878129/ |archive-url=https://web.archive.org/web/20160803221218/http://www.sdsc.edu/~allans/sc05_locality.pdf |archive-date=2016-08-03}} mentions nearest neighbor access patterns in clusters</ref>
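As an illustrative sketch (not from any cited source), a one-dimensional stencil in Python shows the pattern: each output cell reads only its immediate neighbors, so accesses stay local without being purely sequential:

```python
def smooth(grid):
    """Three-point averaging stencil over a 1D grid.

    out[i] depends only on grid[i-1], grid[i], grid[i+1]: a nearest
    neighbor access pattern; boundary cells are left unchanged.
    """
    n = len(grid)
    out = grid[:]
    for i in range(1, n - 1):
        out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out
```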
 
=== 2D spatially coherent ===
In [[3D rendering]], access patterns for [[texture mapping]] and [[rasterization]] of small primitives (with arbitrary distortions of complex surfaces) are far from linear, but can still exhibit spatial locality (e.g., in [[screen space]] or [[texture space]]). This can be turned into good ''memory'' locality via some combination of [[morton order]]<ref>{{Cite book |last1=Hakura |first1=Ziyad S. |last2=Gupta |first2=Anoop |chapter=The design and analysis of a cache architecture for texture mapping |date=1997-05-01 |title=Proceedings of the 24th annual international symposium on Computer architecture |chapter-url=https://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/readings/hakura97_texcaching.pdf |series=ISCA '97 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=108–120 |doi=10.1145/264107.264152 |isbn=978-0-89791-901-2}}</ref> and [[Tiling (computer graphics)|tiling]] for [[texture map]]s and [[frame buffer]] data (mapping spatial regions onto cache lines), or by sorting primitives via [[tile based deferred rendering]].<ref>{{Cite book |last1=Nocentino |first1=Anthony E. |last2=Rhodes |first2=Philip J. |chapter=Optimizing memory access on GPUs using morton order indexing |date=2010-04-15 |title=Proceedings of the 48th Annual Southeast Regional Conference |chapter-url=https://john.cs.olemiss.edu/~rhodes/papers/Nocentino10.pdf |url-status=dead |series=ACMSE '10 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=1–4 |doi=10.1145/1900008.1900035 |isbn=978-1-4503-0064-3 |archive-url=https://web.archive.org/web/20221208063103/https://john.cs.olemiss.edu/~rhodes/papers/Nocentino10.pdf |archive-date=2022-12-08}}</ref> It can also be advantageous to store matrices in morton order in [[linear algebra libraries]].<ref>{{Cite journal |last1=Wise |first1=David S. |last2=Frens |first2=Jeremy D. 
|date=1999 |title=Morton-order Matrices Deserve Compilers' Support Technical Report 533 |s2cid=17192354 }}</ref>
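As an illustrative sketch (a standard bit-interleaving routine, not taken from the cited papers), a 2D Morton index can be computed by spreading the bits of each coordinate apart and interleaving them, so that texels adjacent in 2D land near each other in the 1D address space:

```python
def part1by1(n):
    """Spread the low 16 bits of n apart by one position each."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton2d(x, y):
    """Interleave the bits of x and y into a single Morton index."""
    return part1by1(x) | (part1by1(y) << 1)
```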
 
=== Scatter ===
 
A [[Scatter (vector addressing)|scatter]] memory access pattern combines sequential reads with indexed/random addressing for writes.<ref name="gpu gems2">{{cite web |last=Harris |first=Mark |date=April 2005 |title=GPU Gems 2 |url=http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter31.html |url-status=dead |archive-url=https://web.archive.org/web/20160614195224/http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter31.html |archive-date=2016-06-14 |access-date=2016-06-13 |at=31.1.3 Stream Communication: Gather vs. Scatter}}</ref> Compared to gather, it may place less load on a cache hierarchy since a [[processing element]] may dispatch writes in a "fire and forget" manner (bypassing a cache altogether), whilst using predictable prefetching (or even DMA) for its source data.
 
However, it may be harder to parallelise since there is no guarantee the writes do not interact,<ref name="gpu gems"/> and many systems are still designed assuming that a hardware cache will coalesce many small writes into larger ones.
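The scatter pattern itself can be sketched in a few lines of Python (illustrative; the function name is hypothetical):

```python
def scatter(values, indices, out_size):
    """Scatter: sequential reads, indexed (potentially random) writes.

    out[indices[i]] = values[i] -- the source stream is consumed in
    order, while the destination addresses may be arbitrary.
    """
    out = [0] * out_size
    for i in range(len(values)):      # source is read sequentially
        out[indices[i]] = values[i]   # destination is scattered
    return out
```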
 
In the past, [[forward texture mapping]] attempted to handle the randomness with "writes", whilst sequentially reading source texture information.
 
The [[PlayStation 2]] console used conventional inverse texture mapping, but handled any scatter/gather processing "on-chip" using EDRAM, whilst 3D models (and a lot of texture data) from main memory were fed sequentially by DMA. This is why it lacked support for indexed primitives, and sometimes needed to manage textures "up front" in the [[display list]].
 
=== Gather ===
In a [[gather (vector addressing)|gather]] memory access pattern, reads are randomly addressed or indexed, whilst the writes are sequential (or linear).<ref name="gpu gems2"/> An example is found in [[inverse texture mapping]], where data can be written out linearly across [[scan line]]s, whilst random access texture addresses are calculated per [[pixel]].
 
Compared to scatter, the disadvantage is that caching (and bypassing latencies) is now essential for efficient reads of small elements, however it is easier to parallelise since the writes are guaranteed to not overlap. As such the gather approach is more common for [[GPGPU]] programming,<ref name="gpu gems"/> where the massive threading (enabled by parallelism) is used to hide read latencies.<ref name="gpu gems">{{cite book|title=GPU Gems|url=https://books.google.com/books?id=lGMzmbUhpiAC&q=scatter+memory+access+pattern&pg=PA51|isbn=9780123849892|date=2011-01-13|publisher=Elsevier}} deals with "scatter memory access patterns" and "gather memory access patterns" in the text</ref>
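An illustrative Python sketch of the gather pattern (the function name is hypothetical), the mirror image of scatter:

```python
def gather(values, indices):
    """Gather: indexed (potentially random) reads, sequential writes.

    out[i] = values[indices[i]] -- writes land in order, so they are
    guaranteed not to overlap and are trivially parallelisable.
    """
    return [values[j] for j in indices]
```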
 
=== Combined gather and scatter ===
 
An algorithm may gather data from one source, perform some computation in local or on-chip memory, and scatter results elsewhere. This is essentially the full operation of a [[GPU]] pipeline when performing [[3D rendering]]: gathering indexed vertices and textures, and scattering shaded pixels in [[screen space]]. Rasterization of opaque primitives using a depth buffer is "commutative", allowing reordering, which facilitates parallel execution. In the general case synchronisation primitives would be needed.
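A rough sketch, in Python, of this gather-compute-scatter shape (all names are hypothetical; a real pipeline would run the loop body across many parallel threads):

```python
def gather_compute_scatter(src, in_idx, out_idx, out_size, f):
    """Gather inputs via in_idx, transform them with f, and scatter
    the results via out_idx -- the shape of a full processing pass.
    """
    out = [0] * out_size
    for i in range(len(in_idx)):
        out[out_idx[i]] = f(src[in_idx[i]])
    return out
```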
 
=== Random ===
{{main|Random access}}
 
At the opposite extreme is a truly random memory access pattern. A few multiprocessor systems are specialised to deal with these.<ref>{{Cite conference |last=Wichmann |first=Nathan |date=2005 |title=Cray and HPCC : Benchmark Developments and Results from the Past Year |url=https://cug.org/5-publications/proceedings_attendee_lists/2005CD/S05_Proceedings/pages/Authors/Wichmann/Wichmann_paper.pdf |conference=CUG 2005 Proceedings}} see global random access results for Cray X1. vector architecture for hiding latencies, not so sensitive to cache coherency</ref> The [[Partitioned global address space|PGAS]] approach may help by sorting operations by data on the fly (useful when the problem ''is'' figuring out the locality of unsorted data).<ref name="PGAS programming">{{Cite AV media |url=https://www.youtube.com/watch?v=NU4VfjISk2M |title=Partitioned Global Address Space Programming - Kathy Yelick |date=2013-09-05 |last=CITRIS and the Banatao Institute |access-date=2024-11-02 |via=YouTube}} covers cases where PGAS is a win, where data may not be already sorted, e.g., dealing with complex graphs - see "science across the irregularity spectrum".</ref> Data structures which rely heavily on [[pointer chasing]] can often produce poor [[locality of reference]], although sorting can sometimes help. Given a truly random memory access pattern, it may be possible to break it down (including scatter or gather stages, or other intermediate sorting) which may improve the locality overall; this is often a prerequisite for [[Parallel computing|parallelizing]].
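One illustrative mitigation (a sketch, not taken from the cited sources): pre-sorting the requested indices turns a random gather into a near-sequential sweep over memory, at the cost of restoring request order afterwards:

```python
def gather_sorted(values, indices):
    """Gather with an intermediate sort to improve locality.

    The source array is visited in ascending address order; results
    are then returned in the original request order.
    """
    order = sorted(range(len(indices)), key=lambda i: indices[i])
    picked = {}
    for i in order:                     # near-sequential sweep of `values`
        picked[i] = values[indices[i]]
    return [picked[i] for i in range(len(indices))]
```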
 
== Approaches ==
 
=== Data-oriented design ===
[[Data-oriented design]] is an approach intended to maximise the locality of reference, by organising data according to how it is traversed in various stages of a program, contrasting with the more common [[Object-oriented programming|object oriented]] approach (i.e., organising such that data layout explicitly mirrors the access pattern).<ref name=":0" />
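For illustration, the contrast between an array-of-structures layout and the structure-of-arrays layout favoured by data-oriented design can be sketched in Python (field names are hypothetical):

```python
# Array-of-structures: the fields of one object sit together, so a
# pass that touches only "x" strides over the unwanted fields.
particles_aos = [{"x": 1.0, "y": 2.0, "mass": 5.0},
                 {"x": 3.0, "y": 4.0, "mass": 6.0}]

# Structure-of-arrays: each field is contiguous, so the same pass
# becomes a dense sequential traversal.
particles_soa = {"x": [1.0, 3.0], "y": [2.0, 4.0], "mass": [5.0, 6.0]}

def sum_x_soa(p):
    """Sum one field: in the SoA layout this reads one dense array."""
    return sum(p["x"])
```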
 
== Contrast with locality of reference ==
 
[[Locality of reference]] refers to a property exhibited by memory access patterns. A programmer will change the memory access pattern (by reworking algorithms) to improve the locality of reference,<ref>{{cite web|title=optimize-data-structures-and-memory-access-patterns-to-improve-data-locality|url=https://software.intel.com/en-us/articles/optimize-data-structures-and-memory-access-patterns-to-improve-data-locality}}</ref> and/or to increase potential for parallelism.<ref name="gpu gems2"/> A programmer or system designer may create frameworks or abstractions (e.g., [[Template (C++)|C++ templates]] or [[higher-order function]]s) that [[Encapsulation (computer programming)|encapsulate]] a specific memory access pattern.<ref>{{cite web|title=Template-based Memory Access Engine for Accelerators in SoCs|url=http://www.cs.utah.edu/~zfang/asp_dac11.pdf}}</ref><ref>{{cite web|title=Multi-Target Vectorization With MTPS C++ Generic Library|url=http://www.metz.supelec.fr/metz/recherche/publis_pdf/Supelec753.pdf}}a C++ template library for producing optimised memory access patterns</ref>
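As a hedged sketch of such an abstraction (a Python generator standing in for a C++ template or higher-order function; all names are hypothetical), a blocked 2D traversal can be packaged behind a single interface so callers never see the blocking scheme:

```python
def tiled_2d(rows, cols, tile_rows, tile_cols):
    """Yield (r, c) pairs tile by tile, encapsulating a blocked
    (cache-friendly) 2D access pattern behind one iterator.
    """
    for r0 in range(0, rows, tile_rows):
        for c0 in range(0, cols, tile_cols):
            for r in range(r0, min(r0 + tile_rows, rows)):
                for c in range(c0, min(c0 + tile_cols, cols)):
                    yield r, c
```

Changing the tile dimensions reorders the traversal without touching any calling code.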
 
Different considerations for memory access patterns appear in parallelism beyond locality of reference, namely the separation of reads and writes. E.g., even if the reads and writes are "perfectly" local, it can be impossible to parallelise due to [[Data dependency|dependencies]]; separating the reads and writes into separate areas yields a ''different'' memory access pattern, which may initially appear worse in pure ''locality'' terms, but is desirable to leverage modern parallel hardware.<ref name="gpu gems2"/>
 
Locality of reference may also refer to individual variables (e.g., the ability of a [[compiler]] to cache them in [[Processor register|registers]]), whilst the term memory access pattern only refers to data held in an indexable memory (especially [[main memory]]).
 
== See also ==
 
* [[Working set]]
* [[Gather-scatter (vector addressing)]]
* [[Locality of reference]]
* [[Parallel computing]]
 
== References ==
{{Reflist|32em}}
 
[[Category:Computer performance]]
[[Category:Software optimization]]
[[Category:Computer data storage]]