March 14, 2019

Google’s Edge TPU. What? How? Why?


As a premium partner of Google, we at ML6 were able to get early access to the newest machine learning toy: the Edge TPU! The Edge TPU is basically the Raspberry Pi of machine learning: a small device that performs inference at the edge with its TPU.

The Edge TPU (mounted on the Coral dev board). Credits for this image go to Google.

Wait, what?! That last sentence contained a lot of vague and very technical words. Don't worry, all of these words will become clear throughout this blog post.

Cloud vs Edge

The Edge TPU obviously runs on the edge, but what is the edge and why wouldn't we want to run everything in the cloud?

Cloud vs Edge

Running code in the cloud means that you use the CPUs, GPUs and TPUs of a company that makes them available to you over the internet. The main advantage of running code in the cloud is that you can assign exactly the amount of computing power your code needs (training large models can take a lot of computation).

The edge is the opposite of the cloud: it means that you run your code on premise (which basically means that you can physically touch the device the code is running on). The main advantage of running code on the edge is that there is no network latency. Since IoT devices usually generate data at a high frequency, running code on the edge is a perfect fit for IoT-based solutions.

CPU vs GPU vs TPU

A TPU (Tensor Processing Unit) is another kind of processing unit, like a CPU or a GPU. There are, however, some big differences between them. The biggest one is that a TPU is an ASIC (Application-Specific Integrated Circuit). An ASIC is optimized to perform one specific kind of task; for a TPU, that task is the multiply-add operation that neural networks are built on. CPUs and GPUs, as you probably know, are not optimized for one specific kind of task, so they are not ASICs.
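To make "multiply-add" concrete, here is a minimal Python sketch (purely illustrative, this is of course not how a TPU is programmed) of the operation a single neuron performs: multiply every input by its weight and add the results together.

```python
# The multiply-add (MAC) operation at the heart of a neural network layer:
# a single neuron's output is just a sum of input * weight products.
inputs = [0.5, -1.2, 3.0]    # activations coming into the neuron
weights = [0.8, 0.1, -0.4]   # the neuron's learned weights

accumulator = 0.0
for x, w in zip(inputs, weights):
    accumulator += x * w     # one multiply-add
print(accumulator)           # ≈ -0.92, the neuron's pre-activation output
```

A whole layer repeats this for every neuron, and a whole network repeats it for every layer, which is why this one operation dominates the workload.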

Since we are comparing CPUs, GPUs and TPUs, let’s quickly look at how they respectively perform multiply-add operations with their architecture:

Multiply-add operation on CPU. Credits for this image go to this Google Cloud article.

A CPU performs the multiply-add operation by reading each input and weight from memory, multiplying them with its ALU (the calculator in the figure above), writing the result back to memory and finally adding up all the multiplied values.

Modern CPUs are strengthened by a massive cache, branch prediction and a high clock rate on each of their cores, which all contribute to a lower latency.

A GPU does the same thing, but has thousands of ALUs to perform its calculations. A calculation can be parallelised over all of these ALUs. This is called SIMD (Single Instruction, Multiple Data), and the multiply-add operation in neural networks is a perfect example of it.
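To give a rough feel for that kind of data parallelism, here is a small NumPy sketch (illustrative only, this is not actual GPU code): instead of computing one neuron at a time with a loop, the whole layer is expressed as a single matrix-vector product whose multiply-adds can all be spread over many ALUs at once.

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0, 0.7])   # input activations for a layer with 4 inputs
W = np.random.randn(3, 4)             # one row of weights per output neuron (3 neurons)

# Neuron by neuron, one multiply-add at a time (the sequential mental model).
out_scalar = np.zeros(3)
for i in range(3):
    for j in range(4):
        out_scalar[i] += W[i, j] * x[j]

# The same multiply-adds as one matrix-vector product: this is the kind of
# uniform, data-parallel work a GPU spreads over thousands of ALUs.
out_parallel = W @ x

assert np.allclose(out_scalar, out_parallel)
```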

A GPU, however, does not have the fancy latency-lowering features mentioned above. It also needs to orchestrate its thousands of ALUs, which further increases its latency.

In short, a GPU drastically increases its throughput by parallelising its computation in exchange for an increase in its latency. Or in other words:

A CPU is like a Spartan warrior which is strong and well-trained, while the GPU is like a giant army of peasants which can defeat the Spartan because there are so many of them.

A TPU, on the other hand, operates very differently. Its ALUs are directly connected to each other without going through memory: they can pass information to each other directly, which drastically decreases latency.

In the figure above you can see that all weights of the neural network are loaded into the ALUs. Once this is done, the inputs of the neural network are loaded into these ALUs to perform the multiply-add operations. This process can be observed in the figure below.

Multiply-add operation on TPU. Credits for this image go to this Google Cloud article.

As you can see in the figure above, the inputs of the neural network are not all inserted into the ALUs at the same time, but rather step by step, from left to right. This is done to avoid memory access, as the output of each ALU propagates directly to the next one. This follows the principle of a systolic array, which is graphically shown in the figure below.

Using a systolic array for the multiply-add operation. Credits for this image go to this Google Cloud article.

Each grey cell in the figure above represents an ALU in the TPU (each containing a weight). In these ALUs, a multiply-add operation is performed by taking the input the ALU receives from the top, multiplying it by its weight and adding it to the value it receives from the left. The result is propagated to the right to further complete the multiply-add operation. The input the ALU received from the top is propagated to the bottom, to perform the multiply-add operation for the next neuron in the neural network layer.

At the end of each row, the result of the multiply-add operation for one neuron of the layer comes out, without any memory access in between operations.
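To summarise that dataflow, here is a small Python sketch (a simplification that leaves out the cycle-by-cycle timing of a real systolic array): each cell keeps its weight, inputs enter from the top, partial sums are handed from cell to cell towards the right, and each row delivers the value of one neuron without touching memory in between.

```python
def systolic_matvec(weights, inputs):
    """Sketch of a weight-stationary systolic array computing one layer:
    every cell holds one weight, inputs flow in from the top, partial sums
    flow from left to right, and the result for one neuron drops out at
    the end of each row."""
    outputs = []
    for row in weights:                     # one row of cells per output neuron
        partial_sum = 0.0                   # value entering the row from the left
        for weight, x in zip(row, inputs):  # cells from left to right
            partial_sum += weight * x       # multiply-add, hand the sum to the next cell
        outputs.append(partial_sum)         # the finished neuron value leaves the row
    return outputs

weights = [[1.0, 2.0],
           [3.0, 4.0]]
inputs = [5.0, 6.0]
print(systolic_matvec(weights, inputs))     # [17.0, 39.0], no memory traffic between cells
```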

Using this systolic array significantly increases the performance of the Edge TPU. If you want to know by how much exactly, you can check out our benchmark blog post.

Quantization

One last important note on TPUs is quantization. Google's Edge TPU uses 8-bit weights to do its calculations, whereas weights are typically stored as 32-bit numbers, so we need to be able to convert weights from 32 bits to 8 bits. This process is called quantization.

Quantization basically rounds the more accurate 32-bit number to the nearest 8-bit number. This process is visually shown in the figure below.

Quantization
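To give an idea of what this rounding looks like in practice, here is a simplified NumPy sketch of symmetric 8-bit quantization (illustrative only; the actual scheme used by the Edge TPU toolchain also involves zero-points, per-channel scales and quantized activations).

```python
import numpy as np

def quantize_int8(weights_fp32):
    """Map 32-bit float weights onto 255 evenly spaced 8-bit levels
    (simplified symmetric quantization)."""
    scale = max(np.abs(weights_fp32).max(), 1e-8) / 127.0  # float value of one int8 step
    q = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the 8-bit representation."""
    return q.astype(np.float32) * scale

w = np.random.randn(5).astype(np.float32)
q, scale = quantize_int8(w)
print(w)                      # original 32-bit weights
print(dequantize(q, scale))   # close to w, but rounded to the nearest 8-bit level
```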

By rounding numbers, accuracy decreases. However, neural networks are very good at generalizing (helped by techniques such as dropout) and therefore do not take a big hit from quantization, as shown in the figure below.

Accuracy of non-quantized models vs quantized models.

The advantages of quantization outweigh this small loss in accuracy: it reduces the computation and memory requirements, which leads to more energy-efficient computation.

In conclusion

The Edge TPU performs inference faster than any other processing unit architecture. It is not only faster, it is also more eco-friendly thanks to quantization and fewer memory operations.

We, at ML6, are fans!

