
Les Kohn: ‘L4 Will Need Multiple Big Chips’



SANTA CLARA, Calif.—“When we first started talking to customers, every one of them said: your chip has too much AI performance,” Ambarella CTO Les Kohn recalled in a recent, exclusive interview with EE Times. “Now [demand for AI performance] is starting to increase a lot.”

Ambarella’s automotive customers are thirsty for AI performance, it seems. The Santa Clara company’s CV3-AD domain controller family targets perception, multi-sensor fusion and path planning in L2+ to L4 vehicles. These domain controllers with built-in proprietary AI acceleration can process up to 20 streams of image data at once.

The industry is moving toward domain controllers, and away from AI processing at the sensor edge, as the number of cameras in a car climbs above 10, on top of radar and other sensor modalities.

Ambarella CTO Les Kohn giving a demonstration at CES 2023. (Source: Ambarella)

“There’s a lot of processing that can potentially be applied to each one of those sensors, and if you do it all at the edge then you have to fix what processing is going to be done in the sensor into a fixed allocation,” Kohn said. “As a result, you typically cannot make it powerful enough for certain challenging scenarios, and it may be too powerful for most of the cases you’re dealing with.”

With a domain controller, it is easier to balance typical against peak processing requirements. It also makes more advanced sensor fusion possible, by combining raw sensor data without per-sensor pre-processing.

“This can give you a better result than doing so sensor by sensor because when you fuse it after you’ve done all the sensor-perception processing, you’ve already lost a lot of information,” he said.
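To make the trade-off concrete, here is a minimal sketch, in illustrative Python rather than anything from Ambarella’s stack, contrasting late fusion of per-sensor perception outputs with early fusion of raw sensor data; the detector, merge and joint-model callables are hypothetical placeholders.

```python
import numpy as np

# Illustrative sketch (not Ambarella's pipeline): late vs. early fusion.

def late_fusion(camera_frames, radar_cubes, detect_cam, detect_radar, merge):
    """Late fusion: each sensor is reduced to an object list first, so any
    information a per-sensor detector throws away is gone before fusion."""
    cam_objects = [detect_cam(f) for f in camera_frames]    # per-camera perception
    radar_objects = [detect_radar(c) for c in radar_cubes]  # per-radar perception
    return merge(cam_objects, radar_objects)                # fuse reduced outputs

def early_fusion(camera_frames, radar_cubes, joint_model):
    """Early fusion: raw sensor tensors reach one joint model, so weak cues
    from different sensors can reinforce each other before any thresholding."""
    raw = np.concatenate([f.ravel() for f in camera_frames] +
                         [c.ravel() for c in radar_cubes])
    return joint_model(raw)
```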

Domain controllers

There are several reasons why demands for AI capabilities in domain controllers are growing.

While a traditional autonomous vehicle (AV) stack relied on a lot of classical algorithms running on Arm CPUs for perception, fusion and path planning, AI-based approaches are creeping in, starting with perception, and will eventually cover the entire L3 and L4 stack.

Kohn said customers also need to allow headroom for future software improvements, including features added after deployment. Processing AI efficiently in the domain controller can also be a way of keeping power use in check: While it may not make a huge difference in a single-camera system, for a bigger L3 computer, its power use may directly impact an electric vehicle’s range.

More complex L3 and L4 systems also “for sure need some type of redundancy” in order to meet functional safety requirements—and that pushes up the amount of AI processing needed, he said. But how do we square strict functional safety criteria with an algorithm that is by definition less than 100% accurate?

“The way I look at it, any L3 or L4 type of algorithm, whether it is classical or deep-learning based, is going to make mistakes,” Kohn said. “These classical algorithms, from everything we’ve seen, they make more mistakes than a good deep-learning algorithm. That’s why people migrated toward deep learning. What that means is, if you really want to aim for ASIL-D reliability, you still need to implement a diverse stack.”

A diverse stack might mean having certain checks that are implemented classically. But Kohn said he believes two different implementations are ultimately necessary—both deep learning-based but independent of each other.

“As long as they’re truly independent, they don’t make the same mistake at the same time, then you get the same kind of ASIL-D reliability that you get with classical [algorithms],” he said.
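As a rough illustration of that idea (our hypothetical sketch, not Ambarella’s safety architecture), the two deep-learning paths would be developed and trained independently, and an output is only trusted when they agree:

```python
# Hypothetical sketch of the "diverse stack" idea described above.

def cross_checked_perception(frame, primary_net, shadow_net, agree, fallback):
    """primary_net and shadow_net are assumed to be independently designed and
    trained, so a common-mode failure on the same input is unlikely."""
    a = primary_net(frame)
    b = shadow_net(frame)
    if agree(a, b):           # e.g., detected objects overlap above a threshold
        return a
    return fallback(frame)    # disagreement: degrade to a safe fallback behavior
```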

Neural vector processor

Ambarella’s CV3-AD family can handle data streams from up to 20 cameras. (Source: Ambarella)

Ambarella’s CV3-AD family has a dedicated on-chip accelerator for AI, the homegrown neural vector processing (NVP) engine, alongside a set of other specialized engines: a general vector processor (GVP), an image signal processor (ISP), engines for stereo and optical flow processing, and encoder engines. Is there scope to split off more chunks of the AI workload onto additional engines?

“You run the risk of not finding the right balance between the different types of AI processing that you’re doing, especially right now when the workload is changing character a lot. It’s a bit premature,” Kohn said.

The relevance of transformer networks in vision continues to grow. The CV3-AD family already supports transformers, making it one of the first domain-specific edge accelerators to do so.

Transformers have become more important over the last year, “particularly related to deep fusion where transformers are definitely the best way to combine all the sensors together, or are a key component of that,” he said. “Everybody wants a transformer now.”
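A minimal single-head cross-attention sketch, written by us in NumPy and not taken from Ambarella’s networks, shows why transformers suit this job: shared scene queries attend over camera and radar tokens together, so the weighting between modalities is learned end to end (learned projection matrices are omitted for brevity).

```python
import numpy as np

# Minimal illustration, not Ambarella's network.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(queries, cam_tokens, radar_tokens):
    """Scene queries (e.g., bird's-eye-view grid cells) attend over the
    concatenated camera and radar feature tokens in one step."""
    kv = np.vstack([cam_tokens, radar_tokens])     # (N_cam + N_radar, d)
    d = queries.shape[-1]
    attn = softmax(queries @ kv.T / np.sqrt(d))    # (N_q, N_cam + N_radar)
    return attn @ kv                               # fused features, (N_q, d)

# Toy shapes: 16 scene queries over 64 camera and 32 radar tokens, d = 8.
rng = np.random.default_rng(0)
fused = cross_attention_fuse(rng.standard_normal((16, 8)),
                             rng.standard_normal((64, 8)),
                             rng.standard_normal((32, 8)))
```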

Ambarella’s NVP brings together a number of elements that, when combined, improve latency and power efficiency.

Key to the NVP’s efficiency is its data-flow programming model. Instead of lists of low-level instructions, higher-level operators for convolution or matrix multiplication are combined into graphs that describe the connections between the operators and how data flows through the processor. All communication between those operators happens in on-chip memory, unlike in a GPU, where for every layer data is read in from DRAM and the results are stored back to DRAM. This can be more than 10× more efficient, Kohn said.
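A toy data-flow graph, our illustration rather than Ambarella’s toolchain, shows the execution model: nodes are high-level operators, edges say where each output flows, and results are handed straight to consumers, modeling on-chip buffers instead of a per-layer DRAM round trip.

```python
# Toy data-flow graph (our illustration, not Ambarella's toolchain).

class Op:
    def __init__(self, name, fn):
        self.name, self.fn, self.consumers = name, fn, []

    def feeds(self, other):
        self.consumers.append(other)   # add an edge: self's output -> other
        return other                   # returned so calls can be chained

def execute(op, x):
    """Walk the graph: each operator's result is handed directly to its
    consumers (modeling on-chip buffers), never spilled 'to DRAM'."""
    y = op.fn(x)
    for c in op.consumers:
        y = execute(c, y)
    return y

conv = Op("conv3x3", lambda t: [v * 2 for v in t])     # stand-in kernels
act  = Op("relu",    lambda t: [max(v, 0) for v in t])
head = Op("matmul",  lambda t: sum(t))
conv.feeds(act).feeds(head)
print(execute(conv, [1.0, -2.0, 3.0]))                 # -> 8.0
```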

Ambarella has worked hard on the NVP’s operator set: The company’s “algorithm-first” approach means studying customers’ neural networks and classical algorithms to optimize the operators for them, then designing optimized datapaths for those operators.

Ambarella’s CV3-AD family has accelerator engines for AI, vector processing, image processing, stereo and optical flow, and encoder engines. (Source: Ambarella)

Sparse processing

Another contributor to performance is support for sparse processing, which Kohn said is important for both matrix multiplication and convolution.

“Many people say they support sparse processing, but it usually means they’re doing what’s called structured pruning, which basically means just chopping channels out of the network—changing the network,” he said. “Another form is to say, within every four coefficients you can zero out two of them, but it’s still quite a restricted form of sparsification. This has a much heavier impact on the accuracy, when you constrain the way you sparsify it so much.”

Ambarella’s design supports random sparsity: Any weight in any location can be zero, and every zero weight is skipped. If all four weights in a group are zero, none of them are processed, whereas a 2-of-4 scheme would still have to process two of them.
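The three flavors of sparsity described above can be illustrated with a few lines of NumPy (our example, not Ambarella’s tools):

```python
import numpy as np

# Comparing the three sparsity styles described above (illustrative only).
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))

# 1) Structured pruning: whole channels (rows) removed -- the network changes.
w_structured = np.delete(w, [1, 3], axis=0)

# 2) 2-of-4 sparsity: in every group of 4 weights, the 2 smallest are zeroed,
#    regardless of whether they were the least important overall.
w_24 = w.copy().reshape(-1, 4)
idx = np.argsort(np.abs(w_24), axis=1)[:, :2]
np.put_along_axis(w_24, idx, 0.0, axis=1)
w_24 = w_24.reshape(w.shape)

# 3) Random (unstructured) sparsity: zero any weight anywhere, chosen purely
#    by magnitude -- e.g., the smallest 70% of weights across the tensor.
thresh = np.quantile(np.abs(w), 0.7)
w_random = np.where(np.abs(w) < thresh, 0.0, w)
```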

This flexibility means networks can be sparsified (reduced in size) to a greater extent than in competing schemes, which makes networks run faster as less processing is required. However, it requires a retraining process, which gradually sparsifies until accuracy limits are reached; retraining at every step means the accuracy loss is minimized. This process is handled by Ambarella’s toolchain.
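That loop might look something like the following sketch, where `train_epochs`, `evaluate` and `prune_smallest_weights` are hypothetical stand-ins for a real training setup (Ambarella’s toolchain automates the equivalent steps):

```python
# Sketch of the gradual sparsify-and-retrain loop described above; the
# model/training helpers are hypothetical placeholders.

def gradually_sparsify(model, train_epochs, evaluate, max_accuracy_drop=0.01):
    baseline = evaluate(model)
    sparsity = 0.0
    while sparsity < 0.95:
        trial = sparsity + 0.05                 # raise sparsity a small step
        model.prune_smallest_weights(to_sparsity=trial)
        train_epochs(model, n=2)                # retrain so remaining weights
                                                # compensate for the pruning
        if baseline - evaluate(model) > max_accuracy_drop:
            break                               # accuracy limit reached
        sparsity = trial
    return model, sparsity
```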

The GVP, separate from the NVP, mainly targets radar processing algorithms, though Kohn said workloads that make little use of convolution or matrix multiplication can run on the GVP at similar speed to the NVP, with better power efficiency, since it is a smaller block of silicon.

EE Times got a live demo of Ambarella’s radar technology. (Source: EE Times)

Lower precision

The NVP accelerator in the CV3-AD supports 16-, 8- and 4-bit precision. Kohn previously told EE Times that mixed precision would probably be the most realistic solution, though since then few edge applications have progressed below 8 bits.

“It gets a lot harder to go beyond 8 bits for applications more complex than very low power embedded applications,” he said. “The thing that is particularly challenging is the activation data. The weights are more easily compressed beyond 8 bits, in fact, we’re already doing that in some cases, but going beyond 8-bit activations in a complex network means it’s not very easy to maintain accuracy.”

Four-bit weights can definitely help in terms of memory bandwidth, which could mean a performance improvement in some cases, and some layers can even run in pure 4-bit, he said. But some layers may need 16-bit activations.

Ambarella’s tools handle mixed-precision quantization automatically.

“It all comes down to having a good training data set,” he said. “We will have a version of quantization that doesn’t require any retraining but still requires some calibration data, which is even quicker. But if you really want to push the limit of what’s possible, you still need quantization-aware retraining.”
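A toy symmetric quantizer (our illustration, not Ambarella’s scheme) shows why weights tolerate low bit widths better than activations: activation tensors often carry large outliers, which stretch the quantization scale and crush small values.

```python
import numpy as np

# Toy symmetric quantizer (not Ambarella's scheme), for illustration only.

def quantize(x, bits):
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(x / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale

rng = np.random.default_rng(1)
weights = rng.normal(0, 0.05, 4096)               # tight, well-behaved range
acts = np.concatenate([rng.normal(0, 1.0, 4090),  # typical activations ...
                       np.full(6, 40.0)])         # ... plus a few big outliers

for bits in (8, 4):
    for name, t in (("weights", weights), ("activations", acts)):
        err = np.abs(quantize(t, bits) - t).mean()
        print(f"{bits}-bit {name}: mean abs error {err:.4f}")
```

At 4 bits, the weights in this toy keep a tiny reconstruction error, while the outlier-laden activations degrade sharply, matching Kohn’s point that weights compress beyond 8 bits more easily than activations.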

RISC architectures

Kohn is a longtime RISC evangelist, having been chief architect on Intel’s first RISC chip, the i860, in the late 1980s. The CV3-AD family features Arm cores; can Kohn see a day when the company looks at RISC-V cores for Ambarella products?

“It’s definitely something we’ve looked at,” he said. “The biggest challenge for us is getting something that competes with high-end Arm processor performance and meets the functional safety requirements… It’s not quite there yet. Another problem is actually whether our customers would accept it.”

Automotive customers tend to be more conservative about adopting new architectures, he said. Ambarella has core designs internally based on OpenRISC (which pre-dates RISC-V), which could potentially be switched to RISC-V. “The real win would come if we could have a common architecture for the main processor and [others on the chip],” he said.

On Ambarella’s roadmap are bigger, faster, more powerful chips, Kohn said, to keep up with customers’ increasing demands. While Ambarella will also add smaller, more cost-effective chips for L2 and L2+, for wide operational design domain (ODD) L4, “it’s going to be multiple chips, and multiple big chips,” he said.


