Domain-specific architecture

{{Short description|Computer architecture designed for a specific task}}
A '''___domain-specific architecture (DSA)''' is a programmable [[computer architecture]] specifically tailored to operate very efficiently within the confines of a given application ___domain. The term is often used in contrast to general-purpose architectures, such as [[CPU]]s, which are designed to run any [[computer program]].<ref name=":0" />
 
== History ==
In conjunction with the [[Semiconductor device|semiconductor]] boom that started in the 1960s, computer architects were tasked with finding new ways to exploit the increasingly large number of transistors available. [[Moore's law]] and [[Dennard scaling]] enabled architects to focus on improving the performance of general-purpose [[Microprocessor|microprocessors]] on general-purpose programs.<ref>{{Cite journal |last=Moore |first=G.E. |date=January 1998 |title=Cramming More Components Onto Integrated Circuits |url=http://dx.doi.org/10.1109/jproc.1998.658762 |journal=Proceedings of the IEEE |volume=86 |issue=1 |pages=82–85 |doi=10.1109/jproc.1998.658762 |issn=0018-9219 |url-access=subscription}}</ref><ref>{{Cite journal |last1=Dennard |first1=R.H. |last2=Gaensslen |first2=F.H. |last3=Yu |first3=Hwa-Nien |last4=Rideout |first4=V.L. |last5=Bassous |first5=E. |last6=LeBlanc |first6=A.R. |date=October 1974 |title=Design of ion-implanted MOSFET's with very small physical dimensions |url=http://dx.doi.org/10.1109/jssc.1974.1050511 |journal=IEEE Journal of Solid-State Circuits |volume=9 |issue=5 |pages=256–268 |doi=10.1109/jssc.1974.1050511 |bibcode=1974IJSSC...9..256D |s2cid=283984 |issn=0018-9200 |url-access=subscription}}</ref>
 
These efforts yielded several technological innovations, such as [[Multi-level cache|multi-level caches]], [[out-of-order execution]], deep [[Instruction pipelining|instruction pipelines]], [[Multithreading (computer architecture)|multithreading]], and [[multiprocessing]]. The impact of these innovations was measured on general-purpose [[Benchmarks in computation|benchmarks]] such as [[SPEC]], and architects were not concerned with the internal structure or specific characteristics of the benchmarked programs.<ref name=":0">{{Cite book |last1=Hennessy |first1=John L. |last2=Patterson |first2=David A. |title=Computer architecture: a quantitative approach |date=2019 |publisher=Morgan Kaufmann Publishers, an imprint of Elsevier |others=[[Krste Asanović]] |page=540 |isbn=978-0-12-811905-1 |edition=6 |___location=Cambridge, Mass}}</ref>
 
The end of Dennard scaling pushed computer architects to switch from a single, very fast processor to several [[Multi-core processor|processor cores]], since performance could no longer be improved by simply increasing the operating frequency of a single core.<ref>{{Cite web |last=Schauer |first=Bryan |title=Multicore Processors – A Necessity |url=http://www.csa.com/discoveryguides/multicore/review.pdf |archive-url=https://web.archive.org/web/20111125035151/http://www.csa.com/discoveryguides/multicore/review.pdf |archive-date=2011-11-25 |access-date=2023-07-06}}</ref>
 
The end of Moore's law shifted the focus away from general-purpose architectures towards more specialized hardware. Although general-purpose CPUs will likely have a place in any computer system, [[Heterogeneous System Architecture|heterogeneous systems]] composed of general-purpose and ___domain-specific components are the most recent trend for achieving high performance.<ref>{{Cite journal |last1=Gajendra |first1=Sharma |last2=Prashant |first2=Poudel |date=2022-11-24 |title=Current trends in heterogeneous systems: A review |url=https://doi.org/10.17352/tcsit.000055 |journal=Trends in Computer Science and Information Technology |volume=7 |issue=3 |pages=086–090 |doi=10.17352/tcsit.000055 |issn=2641-3086}}</ref>
 
While [[Hardware acceleration|hardware accelerators]] and [[Application-specific integrated circuit|ASICs]] have been used in very specialized application domains since the inception of the semiconductor industry, they generally implement a specific function with very limited flexibility. In contrast, the shift towards ___domain-specific architectures aims to strike a better balance between flexibility and specialization.<ref>{{Cite book |last=Barr |first=Keith Elliott |title=ASIC design in the silicon sandbox: a complete guide to building mixed-signal integrated circuits |date=2007 |publisher=McGraw-Hill |isbn=978-0-07-148161-8 |___location=New York}}</ref>
 
== Guidelines for DSA design ==
[[John L. Hennessy|John Hennessy]] and [[David Patterson (computer scientist)|David Patterson]] outlined five guidelines for DSA design that lead to better area efficiency and energy savings. A frequent additional objective of these architectures is to reduce non-recurring engineering (NRE) costs, so that the investment in a specialized solution can be more easily amortized.<ref name=":0" />
 
# Minimize the distance over which data is moved: moving data through general-purpose [[Memory hierarchy|memory hierarchies]] consumes a remarkable amount of energy in the attempt to minimize data-access latency. In a ___domain-specific architecture, hardware and [[compiler]] designers are expected to understand the application ___domain well enough to build simpler, specialized memory hierarchies, in which data movement is largely managed in software, with tailor-made memories for specific functions within the ___domain.<ref name=":0" />
# Invest the saved resources into arithmetic units or bigger memories: a remarkable amount of hardware resources can be saved by dropping general-purpose architectural optimizations such as out-of-order execution, [[Prefetching (computing)|prefetching]], address [[Coalescing (computer science)|coalescing]], and hardware speculation. The saved resources should be re-invested to maximally exploit the available [[Parallelism (computing)|parallelism]], for example by adding more arithmetic units, or to relieve [[memory bandwidth]] bottlenecks by adding bigger memories.<ref name=":0" />
# Use the easiest form of parallelism that matches the ___domain: since the target application domains almost always exhibit an inherent form of parallelism, it is important to decide how to exploit that parallelism and expose it to the software. If, for example, a [[Simd|SIMD]] architecture suits the ___domain, it is easier for the programmer to use than a [[MIMD]] architecture (see the sketch after this list).<ref name=":0" />
# Reduce data size and type to the simplest needed for the ___domain: whenever possible, using narrower and simpler [[Data type|data types]] yields several advantages. For example, it reduces the cost of moving data in [[Memory-bound function|memory-bound]] applications, and it can reduce the amount of resources required to implement the corresponding arithmetic units.<ref name=":0" />
# Use a ___domain-specific programming language to port code to the DSA: one of the challenges for DSAs is ease of use, specifically being able to program the architecture and run applications on it effectively. Whenever possible, it is advisable to use existing [[Domain-specific language|___domain-specific languages]] (DSLs) such as [[Halide (programming language)|Halide]]<ref>{{Cite web |last=Ragan-Kelley |first=Jonathan |title=Halide |url=https://halide-lang.org/ |access-date=2023-07-06 |website=halide-lang.org |language=en}}</ref> and [[TensorFlow]]<ref>{{Cite web |title=TensorFlow |url=https://www.tensorflow.org/ |access-date=2023-07-06 |website=TensorFlow |language=en}}</ref> to program a DSA more easily. Reusing existing compiler toolchains and software frameworks makes a new DSA significantly more accessible.<ref name=":0" />
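As an illustration of the third guideline, the following Python sketch (a software analogy only, not code for any particular DSA) contrasts a SIMD-style data-parallel expression with an explicitly coordinated, MIMD-style version of the same computation. The data-parallel form is shorter and leaves scheduling to the underlying hardware or runtime:

<syntaxhighlight lang="python">
import numpy as np
from concurrent.futures import ThreadPoolExecutor

a = np.arange(1_000_000, dtype=np.float32)
b = np.arange(1_000_000, dtype=np.float32)

# SIMD-style (data-parallel): a single operation over whole arrays, which
# maps naturally onto SIMD lanes and is easy to write and reason about.
c_simd = a * b + 1.0

# MIMD-style: independent workers must partition the data and merge the
# results explicitly -- more code and more opportunities for error.
def work(bounds):
    lo, hi = bounds
    return a[lo:hi] * b[lo:hi] + 1.0

chunks = [(i, i + 250_000) for i in range(0, 1_000_000, 250_000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    c_mimd = np.concatenate(list(pool.map(work, chunks)))

assert np.array_equal(c_simd, c_mimd)  # same result, very different effort
</syntaxhighlight>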
 
== DSA for deep neural networks ==
One of the application domains where DSAs have found the most success is [[artificial intelligence]]. In particular, several architectures have been developed to accelerate [[Deep neural networks|deep neural networks]] (DNNs).<ref>{{Citation |last=Ghayoumi |first=Mehdi |title=Deep Neural Networks (DNNs) Fundamentals and Architectures |date=2021-10-12 |url=http://dx.doi.org/10.1201/9781003025818-5 |work=Deep Learning in Practice |pages=77–107 |access-date=2023-07-06 |place=Boca Raton |publisher=Chapman and Hall/CRC |doi=10.1201/9781003025818-5 |isbn=9781003025818 |s2cid=241427658 |url-access=subscription}}</ref> The following sections describe some examples.
 
=== TPU ===
{{See also|Tensor Processing Unit}}
Google's [[Tensor Processing Unit|TPU]] was developed in 2015 to accelerate DNN inference, since the company projected that the use of voice search would require doubling the computational resources then allocated to neural network inference.<ref>{{Cite book |last1=Hennessy |first1=John L. |last2=Patterson |first2=David A. |title=Computer architecture: a quantitative approach |date=2019 |publisher=Morgan Kaufmann Publishers, an imprint of Elsevier |others=[[Krste Asanović]] |isbn=978-0-12-811905-1 |edition=6 |___location=Cambridge, Mass |pages=557}}</ref>
 
The TPU was designed as a [[Coprocessor|co-processor]] communicating over a [[PCI Express|PCIe]] bus, so that it could be easily incorporated into existing servers. It is primarily a [[Matrix multiplication|matrix-multiplication]] engine following a complex instruction set computer (CISC) [[Instruction set architecture|ISA]]. The multiplication engine uses [[Systolic array|systolic execution]] to save energy, reducing the number of writes to [[Static random-access memory|SRAM]].<ref name=":2">{{Cite book |last1=Hennessy |first1=John L. |last2=Patterson |first2=David A. |title=Computer architecture: a quantitative approach |date=2019 |publisher=Morgan Kaufmann Publishers, an imprint of Elsevier |others=[[Krste Asanović]] |isbn=978-0-12-811905-1 |edition=6 |___location=Cambridge, Mass |pages=560}}</ref>
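The energy argument behind systolic execution can be sketched in software. The following Python model (a schematic illustration under simplifying assumptions, not the TPU's actual microarchitecture) mimics a weight-stationary systolic multiply: each weight stays resident in its cell, and each activation value is fetched from memory once per wavefront and then reused, rather than being re-read for every multiply-accumulate:

<syntaxhighlight lang="python">
import numpy as np

def systolic_matmul(a, b):
    """Schematic model of a weight-stationary systolic matrix multiply.

    In hardware, cell (t, j) keeps weight b[t, j] resident; activations
    stream in from one edge and partial sums flow from row to row, so
    operands are handed cell-to-cell instead of re-fetched from SRAM
    for every multiply-accumulate.
    """
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    acc = np.zeros((n, m))            # partial sums flowing "downward"
    for t in range(k):                # one step per row of cells
        activations = a[:, t]         # fetched from memory once per step
        for j in range(m):            # all cells in a row act in parallel
            acc[:, j] += activations * b[t, j]   # resident weight b[t, j]
    return acc

x, w = np.random.rand(4, 6), np.random.rand(6, 3)
assert np.allclose(systolic_matmul(x, w), x @ w)
</syntaxhighlight>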
 
The TPU was fabricated with a 28 nm process and clocked at 700 MHz. The portion of the application that runs on the TPU is implemented in TensorFlow.<ref name=":2" />
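Programming the TPU through TensorFlow means expressing the computation as a graph of high-level ___domain operations rather than device-specific code; the runtime can then lower those operations onto whatever accelerator is available. A minimal, generic TensorFlow snippet (purely illustrative; it is not tied to the TPU toolchain) looks like this:

<syntaxhighlight lang="python">
import tensorflow as tf

# The program is a composition of ___domain-level ops (matmul, relu, ...),
# not loops over scalars; the runtime maps each op onto the CPU, GPU,
# or TPU back end that is present.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.constant([[0.5], [0.25]])
y = tf.nn.relu(tf.matmul(x, w))
print(y.numpy())   # [[1.0], [2.5]]
</syntaxhighlight>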
 
The TPU primarily computes with reduced-precision integers, which further contributes to energy savings and increased performance.<ref name=":2" />
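The pattern behind reduced-precision inference can be shown with a short sketch. The example below (a minimal illustration that assumes a simple symmetric quantization scheme; it is not Google's exact recipe) multiplies 8-bit integer operands with 32-bit accumulation, then rescales the result to floating point:

<syntaxhighlight lang="python">
import numpy as np

def quantize(x, scale):
    """Symmetric per-tensor quantization of floats to signed 8-bit ints."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
act = rng.standard_normal((4, 8)).astype(np.float32)  # activations
wts = rng.standard_normal((8, 3)).astype(np.float32)  # weights

s_a = float(np.abs(act).max()) / 127.0
s_w = float(np.abs(wts).max()) / 127.0

a8, w8 = quantize(act, s_a), quantize(wts, s_w)

# 8-bit multiplies with 32-bit accumulation: narrow datapaths and 4x less
# memory traffic than float32, at the cost of a small quantization error.
y = a8.astype(np.int32) @ w8.astype(np.int32) * (s_a * s_w)

print(np.max(np.abs(y - act @ wts)))  # small quantization error
</syntaxhighlight>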
 
=== Microsoft Catapult ===
[[Microsoft]]'s Project Catapult<ref>{{Cite web |title=Project Catapult |url=https://www.microsoft.com/en-us/research/project/project-catapult/ |access-date=2023-07-06 |website=Microsoft Research |language=en-US}}</ref> put an [[Field-programmable gate array|FPGA]] connected through a PCIe bus into data center servers, leveraging the FPGA's reconfigurability to accelerate many different applications running on those servers.
 
Unlike Google's TPU, the Catapult FPGA had to be programmed via [[Hardware description language|hardware-description languages]] such as [[Verilog]] and [[VHDL]]. For this reason, limited programmability was a major concern for the authors of the framework.<ref>{{Cite journal |last1=Putnam |first1=Andrew |last2=Caulfield |first2=Adrian M. |last3=Chung |first3=Eric S. |last4=Chiou |first4=Derek |last5=Constantinides |first5=Kypros |last6=Demme |first6=John |last7=Esmaeilzadeh |first7=Hadi |last8=Fowers |first8=Jeremy |last9=Gopal |first9=Gopi Prashanth |last10=Gray |first10=Jan |last11=Haselman |first11=Michael |last12=Hauck |first12=Scott |last13=Heil |first13=Stephen |last14=Hormati |first14=Amir |last15=Kim |first15=Joo-Young |date=2016-10-28 |title=A reconfigurable fabric for accelerating large-scale datacenter services |url=http://dx.doi.org/10.1145/2996868 |journal=Communications of the ACM |volume=59 |issue=11 |pages=114–122 |doi=10.1145/2996868 |s2cid=3826382 |issn=0001-0782}}</ref>
 
Microsoft designed a [[Convolutional neural network|CNN]] accelerator for the Catapult framework, primarily intended to accelerate the ranking function of the [[Microsoft Bing|Bing]] search engine. The proposed architecture provided a runtime-reconfigurable design based on a two-dimensional systolic array.<ref>{{Cite book |last1=Hennessy |first1=John L. |last2=Patterson |first2=David A. |title=Computer architecture: a quantitative approach |date=2019 |publisher=Morgan Kaufmann Publishers, an imprint of Elsevier |others=[[Krste Asanović]] |isbn=978-0-12-811905-1 |edition=6 |___location=Cambridge, Mass |pages=573}}</ref>
 
=== NVDLA ===
=== Pixel Visual Core ===
{{See also|Pixel Visual Core}}
The Pixel Visual Core (PVC) is a series of [[ARM architecture|ARM-based]] [[Image processor|image processors]] designed by [[Google]]. The PVC is a fully programmable [[Image processor|image]], [[Vision processing unit|vision]], and [[AI accelerator|AI]] multi-core ___domain-specific architecture for mobile devices and, in the future, [[Internet of things|IoT]] devices. It first appeared in the [[Google Pixel 2|Google Pixel 2 and 2 XL]], introduced on October 19, 2017, and also appeared in the [[Google Pixel 3|Google Pixel 3 and 3 XL]]. Starting with the [[Pixel 4]], it was replaced with the [[Pixel Neural Core]].<ref>{{Cite web |last=Cutress |first=Ian |title=Hot Chips 2018: The Google Pixel Visual Core Live Blog (10am PT, 5pm UTC) |url=https://www.anandtech.com/show/13241/hot-chips-2018-the-google-pixel-visual-core-live-blog |archive-url=https://web.archive.org/web/20180820204207/https://www.anandtech.com/show/13241/hot-chips-2018-the-google-pixel-visual-core-live-blog |url-status=dead |archive-date=August 20, 2018 |access-date=2023-07-07 |website=www.anandtech.com}}</ref>
 
=== Anton3 ===
[[File:Anton3 CoreTiles and EdgeTile.svg|thumb|upright=1.5|The architecture of the Anton3 specialized cores. Geometry cores carry out general-purpose computation, while specialized hardware accelerates force-field computation.]]
Anton3 is a DSA designed to efficiently compute [[Molecular dynamics|molecular-dynamics]] simulations. It uses a specialized 3D [[torus]] interconnection network to connect several computing nodes.<ref name=":1">{{Cite book |last1=Shaw |first1=David E. |last2=Adams |first2=Peter J. |last3=Azaria |first3=Asaph |last4=Bank |first4=Joseph A. |last5=Batson |first5=Brannon |last6=Bell |first6=Alistair |last7=Bergdorf |first7=Michael |last8=Bhatt |first8=Jhanvi |last9=Butts |first9=J. Adam |last10=Correia |first10=Timothy |last11=Dirks |first11=Robert M. |last12=Dror |first12=Ron O. |last13=Eastwood |first13=Michael P. |last14=Edwards |first14=Bruce |last15=Even |first15=Amos |title=Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis |chapter=Anton 3: Twenty microseconds of molecular dynamics simulation before lunch |date=2021-11-14 |url=https://dl.acm.org/doi/10.1145/3458817.3487397 |language=en |publisher=ACM |pages=1–11 |doi=10.1145/3458817.3487397 |isbn=978-1-4503-8442-1 |s2cid=239036976}}</ref> Each computing node contains a set of 64 cores interconnected through a [[Mesh topology|mesh]]. The cores implement a specialized deep pipeline to efficiently compute the [[Force field (chemistry)|force field]] between molecules. This heterogeneous system combines general-purpose hardware and ___domain-specific components to achieve record-breaking simulation speed.<ref>{{Cite web |last=Russell |first=John |date=2021-09-02 |title=Anton 3 Is a 'Fire-Breathing' Molecular Simulation Beast |url=https://www.hpcwire.com/2021/09/01/anton-3-is-a-fire-breathing-molecular-simulation-beast/ |access-date=2023-07-06 |website=HPCwire |language=en-US}}</ref>
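The kernel that such force pipelines accelerate is the evaluation of pairwise forces between particles. As a rough software illustration (a textbook Lennard-Jones force written in Python; Anton3's actual force field and pipeline are far more elaborate), each pair of particles contributes a force that depends only on their separation:

<syntaxhighlight lang="python">
import numpy as np

def lj_forces(pos, epsilon=1.0, sigma=1.0):
    """All-pairs Lennard-Jones forces: the O(n^2), multiply-heavy kernel
    that molecular-dynamics accelerators implement as deep fixed-function
    pipelines. pos is an (n, 3) array of particle coordinates."""
    forces = np.zeros_like(pos)
    n = pos.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            rij = pos[i] - pos[j]
            r2 = float(rij @ rij)
            sr6 = (sigma * sigma / r2) ** 3        # (sigma / r)**6
            # F_ij = 24*eps/r^2 * (2*(sigma/r)^12 - (sigma/r)^6) * r_vec
            f = 24.0 * epsilon / r2 * (2.0 * sr6 * sr6 - sr6) * rij
            forces[i] += f                         # Newton's third law:
            forces[j] -= f                         # equal and opposite
    return forces

print(lj_forces(np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])))
</syntaxhighlight>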
 
== References ==
<references />
 
== Further reading ==
* Hennessy, John L.; Patterson, David A. (2019). ''Computer Architecture: A Quantitative Approach'' (6th ed.). Morgan Kaufmann. ISBN 978-0-12-811905-1.
 
== See also ==
* [[Hardware acceleration]]
* [[AI accelerator]]
* [[Application-specific integrated circuit]] (ASIC)
* [[Field-programmable gate array]] (FPGA)
 
[[Category:Computer architecture]]