Direct memory access: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 21:06, 22 October 2018 edit AutisticCatnip (talk \| contribs) 306 edits →Transfer types Tag: 2017 wikitext editor ← Previous edit		Latest revision as of 22:51, 20 August 2025 edit undo Kvng (talk \| contribs) Extended confirmed users, New page reviewers 116,180 edits Reverted good faith edits by Lkcl (talk): This is still under discussion Tags: Twinkle Undo Disambiguation links added
(164 intermediate revisions by 92 users not shown)
Line 1: {{Short description\|Feature of computer systems}} [[File:AMD DirectGMA.svg\|thumb\|AMD DirectGMA is a form of DMA. It enables low-latency peer-to-peer data transfers between devices on the [[PCI Express\|PCIe bus]] and [[AMD FirePro]]-branded products. [[Serial digital interface]] (SDI) devices supporting DirectGMA can write directly into the graphics memory of the GPU and vice versa the GPU can directly access the memory of a peer device.]] {{more footnotes needed\|date=May 2023}} '''Direct memory access''' ('''DMA''') is a feature of computer systems that allows certain hardware subsystems to access main system [[computer ~~storage~~memory\|memory]] ~~([[random-access memory]]), independent~~independently of the [[central processing unit]] (CPU).<ref name="DMAfundamentals" /> Without DMA, when the CPU is using [[programmed input/output]], it is typically fully occupied for the entire duration of the read or write operation, and is thus unavailable to perform other work. With DMA, the CPU first initiates the transfer, then it does other operations while the transfer is in progress, and it finally receives an [[interrupt]] from the DMA controller (DMAC) when the operation is done. This feature is useful at any time that the CPU cannot keep up with the rate of data transfer, or when the CPU needs to perform work while waiting for a relatively slow I/O data transfer. Many hardware systems use DMA, including [[Disk storage\|disk drive]] controllers, [[Video card\|graphics cards]], [[Network interface controller\|network cards]] and [[sound card]]s. DMA is also used for intra-chip data transfer in [[multi-core processor]]s. Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without DMA channels. Similarly, a [[processing element]] inside a multi-core processor can transfer data to and from its local memory without occupying its processor time, allowing computation and data transfer to proceed in parallel. Many hardware systems use DMA, including [[disk drive]] controllers, [[graphics card]]s, [[network card]]s and [[sound card]]s. DMA is also used for intra-chip data transfer in some [[multi-core processor]]s. Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without DMA channels. Similarly, a [[processing element\|processing circuitry]] inside a multi-core processor can transfer data to and from its local memory without occupying its processor time, allowing computation and data transfer to proceed in parallel. DMA can also be used for "memory to memory" copying or moving of data within memory. DMA can offload expensive memory operations, such as large copies or [[Vectored I/O\|scatter-gather]] operations, from the CPU to a dedicated DMA engine. An implementation example is the [[I/O Acceleration Technology]]. DMA is of interest in [[Network on a chip\|network-on-chip]] and [[In-memory processing\|in-memory computing]] architectures.▼ ▲DMA can also be used for "memory to memory" copying or moving of data within memory. DMA can offload expensive memory operations, such as large copies or [[~~Vectored I/O\|~~scatter-gather]] operations, from the CPU to a dedicated DMA engine. An implementation example is the [[I/O Acceleration Technology]]. DMA is of interest in [[~~Network on a chip\|~~network-on-chip]] and [[~~In-memory processing\|~~in-memory computing]] architectures. == Principles== === Third-party=== [[File:NeXTcube motherboard.jpg\|thumb\|[[Motherboard]] of a [[NeXTcube]] computer (1990). The two large [[integrated circuit]]s below the middle of the image are the DMA controller (l.) and - unusual - an extra dedicated DMA controller (r.) for the [[magneto-optical disc]] used instead of a [[hard disk drive]] in the first series of this computer model.]] Standard DMA, also called third-party DMA, uses a DMA controller. A DMA controller can generate [[memory address]]es and initiate memory read or write cycles. It contains several [[hardware register]]s that can be written and read by the CPU. These include a memory address register, a byte count register, and one or more control registers. Depending on what features the DMA controller provides, these control registers might specify some combination of the source, the destination, the direction of the transfer (reading from the I/O device or writing to the I/O device), the size of the transfer unit, and/or the number of bytes to transfer in one burst.<ref name=Osborne80>{{cite book \|first=Adam \|last=Osborne \|title=An Introduction to Microcomputers: Volume 1: Basic Concepts \|edition=2nd \|publisher=Osborne McGraw Hill \|year=1980 \|isbn=0931988349 \|pages=[https://archive.org/details/introductiontomi00adam/page/5 5–64 through 5–93] \|url=https://archive.org/details/introductiontomi00adam/page/5 }}</ref> To carry out an input, output or memory-to-memory operation, the host processor initializes the DMA controller with a count of the number of [[word (computer architecture)\|words]] to transfer, and the memory address to use. The CPU then commands the peripheral device to initiate a data transfer. The DMA controller then provides addresses and read/write control lines to the system memory. Each time a byte of data is ready to be transferred between the peripheral device and memory, the DMA controller increments its internal address register until the full block of data is transferred. Some examples of buses using third-party DMA are [[Parallel ATA\|PATA]], [[USB]] (before [[USB4]]), and [[SATA]]; however, their [[host controller]]s use [[bus mastering]].{{cn\|date=July 2024}} ~~=== Bus mastering ===~~ In a [[bus mastering]] system, also known as a first-party DMA system, the CPU and peripherals can each be granted control of the memory bus. Where a peripheral can become bus master, it can directly write to system memory without involvement of the CPU, providing memory address and control signals as required. Some measure must be provided to put the processor into a hold condition so that bus contention does not occur.▼ === ~~Transfer~~Bus ~~types~~mastering === ▲In a [[bus mastering]] system, also known as a first-party DMA system, the CPU and peripherals can each be granted control of the memory bus. Where a peripheral can become a bus master, it can directly write to system memory without the involvement of the CPU, providing memory address and control signals as required. Some ~~measure~~measures must be provided to put the processor into a hold condition so that [[bus contention]] does not occur. DMA transfers can transfer either one byte at a time or all at once in burst mode. If they transfer a byte at a time, this can allow the CPU to access memory on alternating bus cycles – this is called [[cycle stealing]] since the CPU and either the DMA controller or the bus master contend for memory access. In ''burst mode DMA'', the CPU can be put on hold while the DMA transfer occurs and a full block of possibly hundreds or thousands of bytes can be moved.<ref name=Art89>{{cite book \|first=Paul \|last=Horowitz \|first2=Winfield \|last2=Hill \|title=The Art of Electronics \|edition=Second \|publisher=Cambridge University Press \|year=1989 \|isbn=0521370957 \|page=702 }}</ref> When memory cycles are much faster than processor cycles, an ''interleaved'' DMA cycle is possible, where the DMA controller uses memory while the CPU cannot. == Modes of operation== === Burst mode === AnIn ''burst mode'', an entire block of data is transferred in one contiguous sequence. Once the DMA controller is granted access to the system bus by the CPU, it transfers all bytes of data in the data block before releasing control of the system buses back to the CPU, but renders the CPU inactive for relatively long periods of time. The mode is also called "Block Transfer Mode". === Cycle stealing mode === The ''[[cycle stealing]] mode'' is used in systems in which the CPU should not be disabled for the length of time needed for burst transfer modes. In the cycle stealing mode, the DMA controller obtains access to the system bus the same way as in burst mode, using ''BR ([[Bus Request]])'' and ''BG ([[Bus Grant]])'' signals, which are the two signals controlling the interface between the CPU and the DMA controller. However, in cycle stealing mode, after one ~~byte~~unit of data transfer, the control of the system bus is deasserted to the CPU via BG. It is then continually requested again via BR, transferring one ~~byte~~unit of data per request, until the entire block of data has been transferred.<ref>{{cite book \|title=Computer Architecture and Organization \|last=Hayes \|first=John.P \|isbn=0-07-027363-4 \|year=1978 \|page=426-427 \|publisher=McGraw-Hill International Book Company }}</ref> By continually obtaining and releasing the control of the system bus, the DMA controller essentially interleaves instruction and data transfers. The CPU processes an instruction, then the DMA controller transfers one data value, and so on. ~~On the one hand, the data block~~Data is not transferred as quickly ~~in cycle stealing mode as in burst mode~~, but ~~on the other hand the~~ CPU is not idled for as long as in burst mode. Cycle stealing mode is useful for controllers that monitor data in real time. === Transparent mode === Transparent mode takes the most time to transfer a block of data, yet it is also the most efficient mode in terms of overall system performance. In transparent mode, the DMA controller transfers data only when the CPU is performing operations that do not use the system buses. The primary advantage of transparent mode is that the CPU never stops executing its programs and the DMA transfer is free in terms of time, while the disadvantage is that the hardware needs to determine when the CPU is not using the system buses, which can be complex. This is also called "''Hidden DMA data transfer mode''". == Cache coherency == [[File:Cache incoherence write.svg\|~~frame~~upright=1\|Cache incoherence due to DMA]] DMA can lead to [[cache coherency]] problems. Imagine a CPU equipped with a cache and an external memory that can be accessed directly by devices using DMA. When the CPU accesses ___location X in the memory, the current value will be stored in the cache. Subsequent operations on X will update the cached copy of X, but not the external memory version of X, assuming a [[~~Write back cache\|~~write-back cache]]. If the cache is not flushed to the memory before the next time a device tries to access X, the device will receive a stale value of X. Similarly, if the cached copy of X is not invalidated when a device writes a new value to the memory, then the CPU will operate on a stale value of X. This issue can be addressed in one of two ways in system design: Cache-coherent systems implement a method in hardware, called [[bus snooping]], whereby external writes are signaled to the cache controller which then performs a [[cache invalidation]] for DMA writes or cache flush for DMA reads. Non-coherent systems leave this to software, where the OS must then ensure that the cache lines are flushed before an outgoing DMA transfer is started and invalidated before a memory range affected by an incoming DMA transfer is accessed. The OS must make sure that the memory range is not accessed by any running threads in the meantime. The latter approach introduces some overhead to the DMA operation, as most hardware requires a loop to invalidate each cache line individually. Hybrids also exist, where the secondary L2 cache is coherent while the L1 cache (typically on-CPU) is managed by software. Line 46 ⟶ 49: === ISA === In the original [[IBM PC]] (and the follow-up [[PC/XT]]), there was only one [[Intel 8237]] DMA controller capable of providing four DMA channels (numbered 0–3),. asThese ~~part~~DMA ofchannels ~~the~~performed so8-~~called~~bit ~~[[Industry~~transfers ~~Standard~~(as ~~Architecture]],~~the or8237 ~~ISA.~~was ~~These DMA channels performed~~an 8-bit ~~transfers~~device, ~~and~~ideally matched to the PC's [[i8088]] CPU/bus architecture), could only address the first ([[i8086]]/8088-standard) megabyte of RAM., ~~With~~and ~~the~~were limited to addressing single 64 [[~~IBM PC/AT~~kilobyte\|kB]], asegments ~~second~~within ~~8237~~that ~~DMA~~space ~~controller~~(although ~~was~~the ~~added~~source and destination (channels ~~5–7;~~could ~~channel~~address 4different issegments). ~~dedicated~~Additionally, asthe acontroller ~~cascade~~could ~~channel~~only ~~for~~be ~~the~~used ~~first~~for ~~8237~~transfers ~~controller)~~to, ~~and~~from ~~the~~or ~~page~~between ~~register~~expansion ~~was~~bus ~~rewired~~I/O todevices, ~~address~~as the ~~full~~8237 16could only MBperform memory-to-memory ~~address~~transfers ~~space~~using channels 0 & 1, of which channel 0 in the ~~80286~~PC ~~CPU~~(& XT) was dedicated to [[dynamic memory]] [[memory refresh\|refresh]]. This ~~second~~prevented ~~controller~~it ~~performed~~from being used as a 16general-~~bit~~purpose ~~transfers~~"[[Blitter]]", and consequently block memory moves in the PC, limited by the general PIO speed of the CPU, were very slow. With the [[IBM PC/AT]], the enhanced [[AT bus]] (more familiarly retronymed as the [[Industry Standard Architecture]] (ISA)) added a second 8237 DMA controller to provide three additional, and as highlighted by resource clashes with the XT's additional expandability over the original PC, much-needed channels (5–7; channel 4 is used as a cascade to the first 8237). ISA DMA's extended 24-bit address bus width allows it to access up to 16 MB lower memory.<ref>{{Cite web \|title=ISA DMA - OSDev Wiki \|url=https://wiki.osdev.org/ISA_DMA \|access-date=2025-04-20 \|website=wiki.osdev.org}}</ref> The page register was also rewired to address the full 16 MB memory address space of the 80286 CPU. This second controller was also integrated in a way capable of performing 16-bit transfers when an I/O device is used as the data source and/or destination (as it actually only processes data itself for memory-to-memory transfers, otherwise simply ''controlling'' the data flow between other parts of the 16-bit system, making its own data bus width relatively immaterial), doubling data throughput when the upper three channels are used. For compatibility, the lower four DMA channels were still limited to 8-bit transfers only, and whilst memory-to-memory transfers were now technically possible due to the freeing up of channel 0 from having to handle DRAM refresh, from a practical standpoint they were of limited value because of the controller's consequent low throughput compared to what the CPU could now achieve (i.e., a 16-bit, more optimised [[80286]] running at a minimum of 6 MHz, vs an 8-bit controller locked at 4.77 MHz). In both cases, the 64 kB [[x86 memory segmentation\|segment boundary]] issue remained, with individual transfers unable to cross segments (instead "wrapping around" to the start of the same segment) even in 16-bit mode, although this was in practice more a problem of programming complexity than performance as the continued need for DRAM refresh (however handled) to monopolise the bus approximately every 15 [[μs]] prevented use of large (and fast, but uninterruptible) block transfers. Due to their lagging performance (2.5 Mbit/s<ref>Intel publication 03040, Aug 1989</ref>), these devices have been largely obsolete since the advent of the [[80386]] processor in 1985 and its capacity for 32-bit transfers. They are still supported to the extent they are required to support built-in legacy PC hardware on modern machines. The only pieces of legacy hardware that use ISA DMA and are still fairly common are [[Super I/O]] devices on motherboards that often integrate a built-in [[floppy disk]] controller, an [[Infrared Data Association\|IrDA]] infrared controller when FIR (fast infrared) mode is selected, and a [[IEEE 1284]] parallel port controller when ECP mode is selected. Due to their lagging performance (1.6 [[megabyte\|MB]]/s maximum 8-bit transfer capability at 5 MHz,<ref name="i8237sheet">{{cite web \|title=Intel 8237 & 8237-2 Datasheet \|url=http://www.jbox.dk/rc702/hardware/intel-8237.pdf \|website=JKbox RC702 subsite \|access-date=20 April 2019}}</ref> but no more than 0.9 MB/s in the PC/XT and 1.6 MB/s for 16-bit transfers in the AT due to ISA bus overheads and other interference such as memory refresh interruptions<ref name="DMAfundamentals">{{cite web \|title=DMA Fundamentals on various PC platforms, National Instruments, pages 6 & 7 \|url=https://cires1.colorado.edu/jimenez-group/QAMSResources/Docs/DMAFundamentals.pdf \|access-date=26 April 2025 \|website=University of Colorado Boulder}}</ref>) and unavailability of any speed grades that would allow installation of direct replacements operating at speeds higher than the original PC's standard 4.77 MHz clock, these devices have been effectively obsolete since the late 1980s. Particularly, the advent of the [[80386]] processor in 1985 and its capacity for 32-bit transfers (although great improvements in the efficiency of address calculation and block memory moves in Intel CPUs after the [[80186]] meant that PIO transfers even by the 16-bit-bus [[80286\|286]] and [[80386SX\|386SX]] could still easily outstrip the 8237), as well as the development of further evolutions to ([[Extended Industry Standard Architecture\|EISA]]) or replacements for ([[Micro Channel architecture\|MCA]], [[VESA local bus\|VLB]] and [[Peripheral Component Interconnect\|PCI]]) the "ISA" bus with their own much higher-performance DMA subsystems (up to a maximum of 33 MB/s for EISA, 40 MB/s MCA, typically 133 MB/s VLB/PCI) made the original DMA controllers seem more of a performance millstone than a booster. They were supported to the extent they are required to support built-in legacy PC hardware on later machines. The pieces of legacy hardware that continued to use ISA DMA after 32-bit expansion buses became common were [[Sound Blaster]] cards that needed to maintain full hardware compatibility with the [[Sound Blaster standard]]; and [[Super I/O]] devices on motherboards that often integrated a built-in [[floppy disk]] controller, an [[IrDA]] infrared controller when FIR (fast infrared) mode is selected, and an [[IEEE 1284]] parallel port controller when ECP mode is selected. In cases where an original 8237s or direct compatibles were still used, transfer to or from these devices may still be limited to the first 16 MB of main [[RAM]] regardless of the system's actual address space or amount of installed memory. Each DMA channel has a 16-bit address register and a 16-bit count register associated with it. To initiate a data transfer the device driver sets up the DMA channel's address and count registers together with the direction of the data transfer, read or write. It then instructs the DMA hardware to begin the transfer. When the transfer is complete, the device [[interrupt]]s the CPU. Line 56 ⟶ 61: DRQ stands for ''Data request''; DACK for ''Data acknowledge''. These symbols, seen on hardware [[schematic]]s of computer systems with DMA functionality, represent electronic signaling lines between the CPU and DMA controller. Each DMA channel has one Request and one Acknowledge line. A device that uses DMA must be configured to use both lines of the assigned DMA channel. 16-bit ISA permitted bus mastering.<ref>{{Citation \|title=PC Architecture for Technicians: Level 1 \|contribution=Chapter 12: ISA Bus \|contribution-url=http://faculty.chemeketa.edu/csekafet/elt256/pcarch-full_isa-bus.pdf \|author=Intel Corp. \|date=2003-04-25 \|access-date=2015-01-27}}</ref> Standard ISA DMA assignments:▼ ▲Standard ISA DMA assignments:{{cn\|date=November 2024}} {{ordered list\|start=0 \| [[DRAM]] ~~Refresh~~refresh (obsolete), \| User hardware, usually ISA sound card ~~8-bit DMA~~ \| [[Floppy disk]] controller, \| [[~~Hard~~WDMA ~~disk~~(computer)\|WDMA]] ~~(obsoleted by~~for [[~~programmed~~hard ~~input/output\|PIO~~disk]] ~~modes, and~~controller (replaced by [[UDMA]] modes), ~~Parallel~~parallel ~~Port~~port (ECP capable port), or certain SoundBlaster Clones like the OPTi 928. \| ~~Cascade from XT~~[[8237]] DMA controller, \| Hard ~~Disk~~disk controller ([[PS/2]] only), or user hardware ~~for~~usually ~~all others, usually~~ISA sound card ~~16-bit DMA~~ \| User hardware. \| User hardware. }} === PCI === A [[Peripheral Component Interconnect\|PCI]] architecture has no central DMA controller, unlike ISA. Instead, ~~any~~A PCI ~~component~~device can request control of the bus ("become the [[bus master]]") and request to read from and write to system memory. More precisely, a PCI component requests bus ownership from the PCI bus controller (usually ~~the~~PCI ~~[[Southbridge~~host bridge, ~~(computing)~~and PCI to PCI bridge<ref>{{Cite web\|~~southbridge]]~~title=Bus inSpecifics a- ~~modern~~Writing PCDevice ~~design~~Drivers for Oracle® Solaris 11.3\|url=https://docs.oracle.com/cd/E53394_01/html/E54850/hwovr-25520.html\|access-date=2020-12-18\|website=docs.oracle.com}}</ref>), which will [[Arbiter (electronics)\|arbitrate]] if several devices request bus ownership simultaneously, since there can only be one bus master at one time. When the component is granted ownership, it will issue normal read and write commands on the PCI bus, which will be claimed by the PCI bus controller ~~and will be forwarded to the memory controller using a scheme which is specific to every chipset~~. As an example, on ~~a modern [[AMD]]~~an [[~~Socket~~Intel ~~AM2~~Core]]-based PC, the southbridge will forward the transactions to the [[~~Northbridge~~memory ~~(computing)\|northbridge~~controller]] (which is [[Integrated circuit design\|integrated]] on the CPU die) using [[~~HyperTransport~~Direct Media Interface\|DMI]], which will in turn convert them to ~~[[DDR2 SDRAM\|DDR2]]~~DDR operations and send them out on the ~~DDR2~~ memory bus. As ~~can be~~a ~~seen~~result, there are quite a number of steps involved in a PCI DMA transfer; however, that poses little problem, since the PCI device or PCI bus itself are an order of magnitude slower than the rest of the components (see [[list of device bandwidths]]). A modern x86 CPU may use more than 4 GB of memory, either utilizing the native 64-bit mode of [[x86-64]] CPU, or the [[Physical Address Extension]] (PAE), a 36-bit addressing mode~~, or the native 64-bit mode of [[x86-64]] CPUs~~. In such a case, a device using DMA with a 32-bit address bus is unable to address memory above the 4 GB line. The new [[Double Address Cycle]] (DAC) mechanism, if implemented on both the PCI bus and the device itself,<ref>{{cite web\|url=http://www.microsoft.com/whdc/system/platform/server/PAE/PAEdrv.mspx#E2D\|title=Physical Address Extension — PAE Memory and Windows\|publisher=Microsoft Windows Hardware Development Central\|year=2005\|~~accessdate~~access-date=2008-04-07}}</ref> enables 64-bit DMA addressing. Otherwise, the operating system would need to work around the problem by either using costly [[double buffering (DMA)\|double buffer]]s (DOS/Windows nomenclature) also known as [[bounce buffer]]s ([[FreeBSD]]/Linux), or it could use an [[IOMMU]] to provide address translation services if one is present. === I/OAT === As an example of DMA engine incorporated in a general-purpose CPU, ~~newer~~some Intel [[Xeon]] chipsets include a DMA engine called [[I/O Acceleration Technology]] (I/OAT), which can offload memory copying from the main CPU, freeing it to do other work.<ref>{{cite web \| last = Corbet \| first = Jonathan \| title = Memory copies in hardware \| work = [[LWN.net]] \| date = December 8, 2005 \| url = https://lwn.net/Articles/162966/ }}</ref> In 2006, Intel's [[Linux kernel]] developer Andrew Grover performed benchmarks using I/OAT to offload network traffic copies and found no more than 10% improvement in CPU utilization with receiving workloads~~, and no improvement when transmitting data~~.<ref name="linuxnet-ioat">{{cite web \|first=Andrew \|last=Grover \|title=I/OAT on LinuxNet wiki \|work=Overview of I/OAT on Linux, with links to several benchmarks \|date=2006-06-01 \|url=http://www.linuxfoundation.org/collaborate/workgroups/networking/i/oat \|~~accessdate~~access-date=2006-12-12 \|archive-date=2016-05-05 \|archive-url=https://web.archive.org/web/20160505034410/http://www.linuxfoundation.org/collaborate/workgroups/networking/i/oat \|url-status=dead }}</ref> === DDIO === Further performance-oriented enhancements to the DMA mechanism have been introduced in Intel [[Xeon E5]] processors with their '''Data Direct I/O''' ('''DDIO''') feature, allowing the DMA "windows" to reside within [[CPU cache]]s instead of system RAM. As a result, CPU caches are used as the primary source and destination for I/O, allowing [[network interface controller]]s (NICs) to ~~talk~~DMA directly to the ~~caches~~Last level cache (L3 cache) of local CPUs and avoid costly fetching of the I/O data from system RAM. As a result, DDIO reduces the overall I/O processing latency, allows processing of the I/O to be performed entirely in-cache, prevents the available RAM bandwidth/latency from becoming a performance bottleneck, and ~~lowers~~may lower the power consumption by allowing RAM to remain longer in low-powered state.<ref>{{cite web \| url = http://www.intel.com/content/dam/www/public/us/en/documents/faqs/data-direct-i-o-faq.pdf \| title = Intel Data Direct I/O (Intel DDIO): Frequently Asked Questions \| date = March 2012 \| ~~accessdate~~access-date = 2015-10-11 \| publisher = [[Intel]] \|}}</ref><ref>{{cite ~~format = PDF~~web ~~}}</ref><ref>{{cite web~~ \| url = http://rhelblog.redhat.com/2015/09/29/pushing-the-limits-of-kernel-networking/ \| title = Pushing the Limits of Kernel Networking \| date = 2015-09-29 \| ~~accessdate~~access-date = 2015-10-11 \| author = Rashid Khan \| website = redhat.com }}</ref><ref>{{cite web \| url = http://www.solarflare.com/content/userfiles/documents/intel_solarflare_webinar_paper.pdf \| title = Achieving Lowest Latencies at Highest Message Rates with Intel Xeon Processor E5-2600 and Solarflare SFN6122F 10 GbE Server Adapter \| date = 2012-06-07 \| ~~accessdate~~access-date = 2015-10-11 \| website = solarflare.com \|}}</ref><ref>{{cite ~~format = PDF~~web \| url = ~~http~~https://events.~~linuxfoundation~~static.linuxfound.org/sites/events/files/slides/pushing-kernel-networking.pdf~~#page=5~~▼ ~~}}</ref><ref>{{cite web~~ ▲ \| url = http://events.linuxfoundation.org/sites/events/files/slides/pushing-kernel-networking.pdf#page=5 \| title = Pushing the Limits of Kernel Networking \| date = 2015-08-19 \| ~~accessdate~~access-date = 2015-10-11 \| author = Alexander Duyck \| website = linuxfoundation.org ~~\| format = PDF~~ \| page = 5 }}</ref> Line 110 ⟶ 115: Therefore, high bandwidth devices such as network controllers that need to transfer huge amounts of data to/from system memory will have two interface adapters to the AHB: a master and a slave interface. This is because on-chip buses like AHB do not support [[Three-state logic\|tri-stating]] the bus or alternating the direction of any line on the bus. Like PCI, no central DMA controller is required since the DMA is bus-mastering, but an [[Arbiter (electronics)\|arbiter]] is required in case of multiple masters present on the system. Internally, a multichannel DMA engine is usually present in the device to perform multiple concurrent [[~~Vectored I/O\|~~scatter-gather]] operations as programmed by the software. === Cell === {{main\|Cell (microprocessor)}} As an example usage of DMA in a [[multiprocessor-system-on-chip]], IBM/Sony/Toshiba's [[~~Cell (microprocessor)\|~~Cell processor]] incorporates a DMA engine for each of its 9 processing elements including one Power processor element (PPE) and eight synergistic processor elements (SPEs). Since the SPE's load/store instructions can read/write only its own local memory, an SPE entirely depends on DMAs to transfer data to and from the main memory and local memories of other SPEs. Thus the DMA acts as a primary means of data transfer among cores inside this [[~~central processing unit\|~~CPU]] (in contrast to cache-coherent CMP architectures such as Intel's cancelled [[GPGPU\|general-purpose GPU]], [[Larrabee (microarchitecture)\|Larrabee]]). DMA in Cell is fully [[#Cache coherency\|cache coherent]] (note however local stores of SPEs operated upon by DMA do not act as globally coherent cache in the [[CPU cache\|standard sense]]). In both read ("get") and write ("put"), a DMA command can transfer either a single block area of size up to 16 KB, or a list of 2 to 2048 such blocks. The DMA command is issued by specifying a pair of a local address and a remote address: for example when a SPE program issues a put DMA command, it specifies an address of its own local memory as the source and a virtual memory address (pointing to either the main memory or the local memory of another SPE) as the target, together with a block size. According to an experiment, an effective peak performance of DMA in Cell (3 GHz, under uniform traffic) reaches 200 GB per second.<ref name="petrini-cell">{{cite ~~web~~journal \|first=Michael \|last=Kistler \|title=Cell Multiprocessor Communication Network: ~~\|work=Extensive~~Built ~~benchmarks~~for ofSpeed ~~DMA~~\|journal=[[IEEE ~~performance~~Micro]] in\|date=May ~~Cell~~2006\|volume=26 ~~Broadband Engine~~\|issue=3 \|~~date~~pages=~~May~~10–23 \|doi=10.1109/MM.2006.49 \|s2cid=7735690 \|url=http://portal.acm.org/citation.cfm?id=1158825.1159067 \|url-access=subscription }}</ref> == DMA controllers == * [[Intel 8257]] * Am9517<ref>{{Cite web \|url=http://www.bitsavers.org/components/amd/_dataSheets/Am9517A.pdf \|title=Am9517A Multimode DMA Controller \|accessdate=2024-01-06}}</ref> * [[Intel 8237]] * Z80 DMA<ref>{{Cite web \|url=http://www.bitsavers.org/components/zilog/z80/Z80_DMA_Product_Specification_Feb80.pdf \|title=Z80® DMA Direct Memory Access Controller \|accessdate=2024-01-07}}</ref> * LH0083,<ref>{{Cite web \|url=http://www.bitsavers.org/components/sharp/_dataBooks/1986_Sharp_MOS_Semiconductor_Data_Book.pdf#page=262 \|title=Sharp 1986 Semiconductor Data Book \|accessdate=2024-01-13 \|page=255-269}}</ref> compatible to Z80 DMA * μPD71037,<ref>{{Cite web \|url=http://bitsavers.informatik.uni-stuttgart.de/components/nec/_dataBooks/1990_NEC_16-bit_V-Series_Microprocessor_Data_Book.pdf#page=832 \|title=pPD71037 Direct Memory Access (DMA) Controller \|page=832(5b1) \|accessdate=2024-01-06}}</ref> capable of addressing a 64K-byte of memory * μPD71071,<ref>{{Cite web \|url=http://bitsavers.informatik.uni-stuttgart.de/components/nec/_dataBooks/1990_NEC_16-bit_V-Series_Microprocessor_Data_Book.pdf#page=940 \|title=µPD71071 DMA Controller \|page=940(5g1)\|accessdate=2024-01-05}}</ref> capable of addressing a 16M-byte of memory == Pipelining == Line 123 ⟶ 137: == See also == * [[{{Annotated link\|AT Attachment]]}}▼ ~~{{Portal\|Computing}}~~ * [[{{Annotated link\|Autonomous peripheral operation]]}}▼ * {{Annotated link\|Blitter}} * [[{{Annotated link\|Channel I/O]]}}▼ * {{Annotated link\|DMA attack}} * {{Annotated link\|Memory-mapped I/O}} [[ {{Annotated link\|Hardware acceleration]]}}▼ [[ {{Annotated link\|In-memory processing]]}}▼ * {{Annotated link\|Memory management}} [[ {{Annotated link\|Network on a chip]]}}▼ [[{{Annotated link\|Polling (computer science)]]}}▼ * [[{{Annotated link\|Remote direct memory access]]}}▼ * {{Annotated link\|UDMA}} * {{Annotated link\|Virtual DMA Services}} == References == ▲* [[AT Attachment]] * [[Blitter]] ▲* [[Channel I/O]] * [[DMA attack]] ▲* [[Polling (computer science)]] ▲* [[Remote direct memory access]] * [[UDMA]] ▲* [[Autonomous peripheral operation]] ▲[[Network on a chip]] ▲[[In-memory processing]] ▲[[Hardware acceleration]] ~~== Notes ==~~ {{Reflist\|30em}} == ~~References~~Sources == {{refbegin}} [http://cires.colorado.edu/jimenez-group/QAMSResources/Docs/DMAFundamentals.pdf DMA Fundamentals on Various PC Platforms], from A. F. Harvey and Data Acquisition Division Staff NATIONAL INSTRUMENTS * [http://www.xml.com/ldd/chapter/book/ch13.html mmap() and DMA], from ''Linux Device Drivers, 2nd Edition'', Alessandro Rubini & [[Jonathan Corbet]] * [http://www.oreilly.com/catalog/linuxdrive3/book/ch15.pdf Memory Mapping and DMA], from ''Linux Device Drivers, 3rd Edition'', [[Jonathan Corbet]], Alessandro Rubini, [[Greg Kroah-Hartman]] * [http://www.eventhelix.com/RealtimeMantra/FaultHandling/dma_interrupt_handling.htm DMA and Interrupt Handling] * [http://www.pcguide.com/ref/hdd/if/ide/modesDMA-c.html DMA Modes & Bus Mastering] Line 158 ⟶ 173: [[Category:Computer storage buses]] [[Category:Hardware acceleration]] [[Category: Input/output]]