Neural processing unit

As of 2016, AI accelerators are an emerging class of microprocessor designed to accelerate artificial neural networks, machine vision and other machine learning algorithms for robotics, the Internet of things and other data-intensive or sensor-driven tasks. They are frequently manycore designs (mirroring the massively parallel nature of biological neural networks). They are targeted at practical narrow AI applications, rather than AGI research.

They are distinct from GPUs, which are commonly used for the same role, in that they lack any fixed-function units for graphics and generally focus on lower-precision arithmetic.

History

DSPs have been used as neural network accelerators^[1]. Other architectures, such as the Cell microprocessor, have exhibited features that significantly overlap with those of AI accelerators (support for packed low-precision arithmetic, dataflow architecture, prioritising throughput over latency). The physics processing unit was yet another attempt to fill the gap between CPU and GPU in PC hardware; however, physics tends to require 32-bit precision and up, whilst much lower precision can be a better tradeoff for AI.^[2]
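
The precision tradeoff can be illustrated with a minimal sketch in plain Python. It assumes a simple symmetric quantisation scheme with one scale factor per vector; the function names `quantize` and `quantized_dot` and the sample values are hypothetical and are not taken from any of the devices discussed here. The dot product is computed with 8-bit integer operands and a wide integer accumulator, then rescaled once at the end.

```python
def quantize(values, bits=8):
    """Map floats to signed integers using one shared (symmetric) scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0       # avoid dividing by zero
    return [round(v / scale) for v in values], scale


def quantized_dot(a, b, bits=8):
    """Dot product using integer multiply-accumulate, rescaled once at the end."""
    qa, scale_a = quantize(a, bits)
    qb, scale_b = quantize(b, bits)
    acc = sum(x * y for x, y in zip(qa, qb))   # wide integer accumulator
    return acc * scale_a * scale_b


# Illustrative values only (assumed, not measured data).
weights = [0.12, -0.53, 0.98, -0.07]
inputs = [0.80, 0.25, -0.41, 0.66]

exact = sum(w * x for w, x in zip(weights, inputs))
approx = quantized_dot(weights, inputs)
print("float result:", exact)
print("8-bit result:", approx)
```

For these values the two results should agree to within about 0.01. The multiply-accumulate loop itself only touches small integers, which is the part a low-precision accelerator would run in bulk.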

Vendors of graphics processing units saw the opportunity and generalised their pipelines with specific support for GPGPU^[3] (which killed off the market for a dedicated physics accelerator and superseded Cell in video game consoles). This led to their use in running convolutional neural networks such as AlexNet, and as of 2016 most AI work is done on GPUs. However, at least a factor of 10 in efficiency^[4] can still be gained with a more specific design. The memory access pattern of AI calculations differs from that of graphics, with a more predictable but deeper dataflow, rather than 'gather' from texture maps and 'scatter' to frame buffers.
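
The difference in access pattern can be sketched as follows. This is only an illustration under simplifying assumptions: a 1-D convolution stands in for a neural network layer, a list lookup stands in for a texture fetch, and the helper names `conv1d` and `gather` are hypothetical. Each function records the input indices it reads, so the two patterns can be compared.

```python
def conv1d(signal, kernel, trace):
    """Valid 1-D convolution (correlation form); logs every input index read."""
    out = []
    for i in range(len(signal) - len(kernel) + 1):
        acc = 0.0
        for k, w in enumerate(kernel):
            trace.append(i + k)        # address depends only on loop counters
            acc += w * signal[i + k]
        out.append(acc)
    return out


def gather(texture, coords, trace):
    """Texture-style lookup; each address comes from the coordinate data."""
    out = []
    for c in coords:
        trace.append(c)                # address is known only once c is read
        out.append(texture[c])
    return out


# Two different inputs to the convolution produce the same read sequence.
trace_a, trace_b = [], []
conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5], trace_a)
conv1d([9.0, 0.0, 7.0, 5.0], [0.5, 0.5], trace_b)
print("convolution reads identical:", trace_a == trace_b)

# Two different coordinate sets produce different read sequences.
trace_c, trace_d = [], []
gather([10, 20, 30, 40], [3, 0, 2], trace_c)
gather([10, 20, 30, 40], [1, 1, 3], trace_d)
print("gather reads identical:", trace_c == trace_d)
```

Because the convolution's reads are fixed by the layer shape alone, they can in principle be scheduled ahead of time, whereas the gather's reads cannot be known before the coordinates arrive.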

As of 2016, vendors are pushing their own terms, in the hope that their designs and APIs will dominate. In the past, after graphics accelerators emerged, the industry eventually adopted Nvidia's self-assigned term "GPU" as the collective noun for "graphics accelerators", which had settled on an overall pipeline patterned around Direct3D. There is no consensus on the boundary between these devices, nor on the exact form they will take; however, several examples clearly aim to fill this new space.

Examples

Vision processing units
- e.g. Movidius Myriad 2, which is a many-core VLIW AI accelerator at its heart, complemented with video fixed-function units.

Tensor processing unit - presented as an accelerator for Google's TensorFlow framework, which is extensively used for convolutional neural networks. Focusses on a high volume of 8-bit precision arithmetic.

SpiNNaker, a many-core design combining traditional ARM cores with an enhanced network fabric design specialised for simulating a large neural network.

TrueNorth, the most unconventional example, a manycore design based on spiking neurons rather than traditional arithmetic. Frequency of pulses represents signal intensity. As of 2016 there is no consensus amongst AI researchers on whether this is the right way to go,^[5] but some results are promising, with large energy savings demonstrated for vision tasks.

Zeroth NPU, a design by Qualcomm aimed squarely at bringing speech and image recognition capabilities to mobile devices.

Adapteva Epiphany is targeted as a coprocessor, featuring a network-on-a-chip scratchpad memory model suited to a dataflow programming model, which in turn suits many machine learning tasks.
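
The idea mentioned in the TrueNorth entry, that pulse frequency represents signal intensity, can be sketched as a simple rate code. This is an illustrative toy and not a model of the TrueNorth hardware: the accumulate-and-fire scheme and the names `rate_encode` and `rate_decode` are assumptions made for the example.

```python
def rate_encode(intensity, ticks):
    """Turn an intensity in [0, 1] into a 0/1 spike train of length `ticks`."""
    spikes, charge = [], 0.0
    for _ in range(ticks):
        charge += intensity            # integrate the input on every tick
        if charge >= 1.0:              # threshold reached: emit a pulse
            spikes.append(1)
            charge -= 1.0
        else:
            spikes.append(0)
    return spikes


def rate_decode(spikes):
    """Recover the intensity as the fraction of ticks that carried a pulse."""
    return sum(spikes) / len(spikes)


dim = rate_encode(0.25, 100)
bright = rate_encode(0.75, 100)
print("dim pulses:", sum(dim), "decoded:", rate_decode(dim))
print("bright pulses:", sum(bright), "decoded:", rate_decode(bright))
```

A stronger input produces more pulses in the same time window, and counting pulses over the window recovers an estimate of the original intensity.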

References

[1] "Convolutional neural network demo from 1993 featuring DSP32 accelerator".
[2] "Deep Learning with Limited Numerical Precision" (PDF).
[3] "Nvidia Tesla microarchitecture" (PDF).
[4] "Google boosts machine learning with TPU" (mentions 10x efficiency).
[5] "Yann LeCun on IBM TrueNorth".
