
Nvidia Leads, Habana Challenges on MLPerf GPT-3 Benchmark



The latest round of MLPerf training benchmarks includes GPT-3, the model ChatGPT is based on, for the first time. The GPT-3 training crown was claimed by cloud provider CoreWeave using more than 3,000 Nvidia H100 GPUs. More surprising is that there were no entries from previous training submitters such as Google and Graphcore, nor from other would-be competitors such as AMD. It was left to Intel’s Habana Labs to be the only challenger to Nvidia on GPT-3, with its Gaudi2 accelerator.

CoreWeave used 3,584 Nvidia H100 GPUs (in HGX H100 systems) to train a representative portion of GPT-3 in 10.94 minutes (this is the largest number of GPUs the cloud provider could make available at one time, not the full size of its cluster). A portion of GPT-3 is used for the benchmark because it would be impractical to insist submitters train the entirety of GPT-3, which could take months and cost millions of dollars. Submitters instead train an already partially trained GPT-3 from a particular checkpoint until it converges to a certain accuracy. The portion used is about 0.4% of the total training workload for GPT-3; based on CoreWeave’s 10.94-minute score, 3,584 GPUs would take almost two days to train the whole thing.
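As a rough illustration of that extrapolation (a minimal sketch; the 0.4% figure is the approximate fraction quoted above, not an exact MLPerf constant):

```python
# Rough sketch of the back-of-the-envelope arithmetic described above,
# assuming the MLPerf GPT-3 slice is ~0.4% of the full training workload.

BENCHMARK_FRACTION = 0.004  # ~0.4% of full GPT-3 training

def full_training_days(benchmark_minutes: float) -> float:
    """Extrapolate a full GPT-3 training time from a benchmark result."""
    full_minutes = benchmark_minutes / BENCHMARK_FRACTION
    return full_minutes / (60 * 24)

# CoreWeave: 3,584 H100s, 10.94 min on the benchmark slice
print(f"{full_training_days(10.94):.1f} days")  # ~1.9 days, i.e. "almost two days"
```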

Nvidia’s graph shows per-accelerator performance for its H100 versus Intel Xeon CPUs and Habana Labs Gaudi2 (normalized to H100 result—taller is better). (Source: Nvidia)

Nvidia H100s, currently the leading AI training hardware on the market, were used for the bulk of the GPT-3 submissions. Nvidia’s software stack includes the Transformer Engine, designed specifically to speed up training and inference of transformer-based networks like GPT-3 by lowering precision to FP8 wherever possible to improve throughput.
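For context, here is a minimal sketch of how FP8 is typically enabled through Transformer Engine’s PyTorch API; the layer sizes and recipe settings are illustrative placeholders, not the configuration used in the MLPerf submissions:

```python
# Minimal sketch: enabling FP8 with Nvidia Transformer Engine's PyTorch API.
# Layer sizes and recipe settings are illustrative, not the MLPerf configuration.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

model = te.Linear(768, 3072, bias=True)   # drop-in replacement for torch.nn.Linear
inp = torch.randn(2048, 768, device="cuda")

# HYBRID = E4M3 for the forward pass, E5M2 for gradients in the backward pass.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)  # the matmul runs in FP8 where the hardware supports it

out.sum().backward()
```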

CoreWeave’s training time increased to 23.611 minutes using 1,536 GPUs and 45.606 minutes using 768 GPUs. This represents 89% performance scaling efficiency going from hundreds to thousands of GPUs.
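That efficiency figure can be reproduced from the published times with a quick calculation, taking linear scaling as the ideal:

```python
# Sketch: reproducing the ~89% scaling-efficiency figure from the published times,
# taking linear scaling as the ideal.
def scaling_efficiency(gpus_small, minutes_small, gpus_large, minutes_large):
    ideal_speedup = gpus_large / gpus_small
    actual_speedup = minutes_small / minutes_large
    return actual_speedup / ideal_speedup

# CoreWeave H100 results: 768 GPUs at 45.606 min vs. 3,584 GPUs at 10.94 min
print(f"{scaling_efficiency(768, 45.606, 3584, 10.94):.0%}")  # ~89%
```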

Nvidia’s own scores came in at 44.816 min for 768 H100s—fractionally faster than CoreWeave’s score for the same size system—and 64.264 min for 512 GPUs. Nvidia used its work-in-progress Eos cluster for this benchmark, which is so large that 512 GPUs was the smallest system for which Nvidia submitted results.

While the MLPerf training benchmarks have no power metric, the Nvidia H100’s 700-W thermal design power (TDP) is often cited as a proxy. That should be done with caution, advised Dave Salvator, director of AI, benchmarking and cloud at Nvidia.

“There is a tendency to look at TDP and say if the TDP is high, the power is high, but that’s not necessarily true,” he said. Moving from previous-generation, A100-based hardware to the current-generation H100, he added, delivers the same performance across a mix of training and inference workloads with 3.5× better energy efficiency, largely because the number of nodes (and the accompanying networking) drops by a factor of five.

There were no GPT-3 scores from previous-generation Nvidia A100 hardware from Nvidia or its partners. However, scores for Grace Hopper—Nvidia’s CPU/GPU combination superchip that boosts the total memory available to the GPU to 576 GB—are “coming to future rounds of MLPerf,” Salvator said.

Salvator also showed a slide marking 2024 for the release of the generation succeeding Hopper, the architecture the H100 is based on. The message to companies claiming their chips can beat the H100 (like AMD) or soon will (Habana Labs, see below) was clear: any lead they gain won’t last long.

Nvidia’s timeline shows Hopper’s successor coming in 2024 (Source: Nvidia)

Habana Labs

Habana Labs’ Gaudi2 training chips were the only challenger for Nvidia’s H100 on GPT-3.

“The market needs an alternative,” Jordan Plawner, senior director of Intel’s AI products, told EE Times. “We see [Gaudi2] as the only viable alternative to [Nvidia] H100 for training large language models, based on being the only company or product that’s submitted for GPT-3 in this MLPerf round.”

Habana compares its training results to Nvidia A100 and H100, noting that Habana hasn’t enabled software support for FP8 yet (smaller is better). (Source: Intel/Habana Labs)

A system of 384 Gaudi2s can train the GPT-3 benchmark in 311.945 minutes (a little over five hours); a back-of-the-envelope calculation suggests such a system would take around 54 days to train GPT-3 from start to finish. A 256-chip system can train the benchmark in a little over seven hours. This represents 95% performance scaling efficiency, albeit only from 256 to 384 chips (an order of magnitude fewer chips than Nvidia’s largest system above).

“We don’t need InfiniBand to scale perfectly,” Plawner said. “What’s the difference between InfiniBand and Ethernet? Nvidia owns InfiniBand and they can monetize it. We don’t see it’s needed even for high-performance accelerators.”

Habana used Microsoft’s DeepSpeed optimization library for its scores. The library supports data, tensor and pipeline parallelism concurrently, which is useful for very large models like GPT-3.
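As a rough sketch of how that kind of parallelism is expressed with DeepSpeed (the layer stack, stage count, batch sizes and optimizer settings below are placeholders, and tensor parallelism for GPT-3-scale models typically comes from Megatron-DeepSpeed layered on top of this):

```python
# Hedged sketch: data + pipeline parallelism with DeepSpeed's pipeline engine.
# Assumes the script is launched with the `deepspeed` launcher so that ranks and
# world size are already set; all sizes here are placeholders, not Habana's setup.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()

# Stand-in for a stack of transformer blocks, split across 4 pipeline stages.
layers = [nn.Linear(1024, 1024) for _ in range(8)]
net = PipelineModule(layers=layers, num_stages=4)

ds_config = {
    # e.g. 8 GPUs total: 4 pipeline stages x 2 data-parallel replicas
    "train_batch_size": 256,
    "train_micro_batch_size_per_gpu": 4,  # micro-batches keep the pipeline stages busy
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
}

engine, _, _, _ = deepspeed.initialize(
    model=net, model_parameters=net.parameters(), config=ds_config
)
# engine.train_batch(data_iter) then drives the pipelined forward/backward/step loop.
```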

Habana’s performance is based on the software setup customers get out-of-the-box; its Gaudi2 scores improved 10% for BERT and 4% for ResNet since the last round.

“Gaudi2’s performance is fast enough, it’s cost-efficient, and it’s better than the A100,” Plawner said.

Habana’s scores were achieved using BF16 precision. Plawner said Habana expects to have software support for FP8 in place by September, which should dramatically improve performance; he expects favorable price/performance comparisons for Gaudi2 versus H100 when that happens. Next-gen Habana hardware (Gaudi3) will have a similar architecture with integrated Ethernet, he noted.

Intel CPU scores

Intel showed training scores for its fourth-gen Xeon (Sapphire Rapids) CPUs, the first to include Intel’s Advanced Matrix Extensions (AMX), which are dedicated to speeding up AI workloads. A cluster of 32 Xeon CPUs can train ResNet in 88.173 minutes, RetinaNet in 232.405 minutes and BERT in 47.929 minutes.
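As a rough sketch of how AMX typically gets exercised from PyTorch on these CPUs (bfloat16 autocast plus Intel Extension for PyTorch, with oneDNN dispatching the matrix math to the AMX tiles; the model and data below are placeholders, not Intel’s MLPerf setup):

```python
# Hedged sketch: bf16 training on a 4th-gen Xeon, where oneDNN uses AMX for the
# matrix math. Model, data and hyperparameters are placeholders.
import torch
import torchvision
import intel_extension_for_pytorch as ipex

model = torchvision.models.resnet50()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model.train()
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

data = torch.randn(32, 3, 224, 224)
target = torch.randint(0, 1000, (32,))

optimizer.zero_grad()
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = torch.nn.functional.cross_entropy(model(data), target)
loss.backward()
optimizer.step()
```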

“We are not competing against GPUs, and we’re not competing against [Habana] Gaudi,” Plawner said, pointing out that there were no other training results for CPUs. “If you happen to be out of GPUs, and you want to train intermittently, 232 minutes doesn’t sound like a lot…. If you’re training one model, this is just fine.”

Intel is seeing increasing market demand for fine-tuning, Plawner said, and this is where CPUs can play a significant role: fine-tuning and inference for models up to tens of billions of parameters.

“People are fine-tuning these models down to smaller and smaller sizes,” he said. “It kind of makes sense when you realize we’re going to eventually have these models on our phones. Clearly, we need to go from 100 billion or 200 billion down to sub-1 billion [parameters].”

As an example, Plawner showed non-MLPerf DistilBERT fine-tuning in fewer than five minutes on a single Xeon CPU node.
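The general shape of such a run, as a hedged sketch using the Hugging Face Trainer (the dataset, hyperparameters and runtime are illustrative, not the demo Intel showed):

```python
# Hedged sketch: fine-tuning DistilBERT for text classification on a CPU node.
# Dataset and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb", split="train[:2000]")  # small slice keeps the run short
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="distilbert-finetune",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    bf16=True,  # on a 4th-gen Xeon, bf16 exercises AMX; drop this flag elsewhere
)
Trainer(model=model, args=args, train_dataset=dataset).train()
```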

“Some people will say, ‘Hey, can I just run this on the Xeon that’s in front of me? Can I take a 10-20 billion parameter model that’s already been tuned and compressed, and fine tune it with my data in 15 minutes on a Xeon node?’ And the answer is, ‘Yes’,” he said. “We think this is a really great one-two punch for getting in the market—and that we’re stronger together with the two products [Xeon and Gaudi2].”

MLPerf Tiny results

Released at the same time as the training results, the MLPerf Tiny benchmarks showcase the opposite end of the scale: inference on microcontrollers (MCUs) and tiny accelerators.

Syntiant showed image and audio workloads on the NDP120, which uses its second-gen core (Syntiant Core 2). The part is designed for ultra-low-power AI inference; unlike the first-gen core, which targeted audio keyword spotting specifically, the second-gen core can also handle image data.

Syntiant submitted benchmark scores for its NDP120 across image and audio processing for the first time (Source: Syntiant)

The NDP120 can perform keyword spotting in 1.5 ms using 43.8 µJ of energy. For energy-sensitive applications, Syntiant can clock the device slower (30 MHz versus 98 MHz) and reduce the supply voltage from 1.1 to 0.9 V; this cuts the energy per inference to 31.5 µJ but slows it to 4.4 ms. The next-lowest-power score among the MCU entries was over 1,000 µJ.
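Converting those figures into average power during an inference is straightforward (energy divided by latency):

```python
# Quick sketch: converting the published energy-per-inference and latency figures
# into average active power for Syntiant's two keyword-spotting operating points.
def avg_power_mw(energy_uj: float, latency_ms: float) -> float:
    return energy_uj / latency_ms  # µJ / ms == mW

print(f"98 MHz, 1.1 V: {avg_power_mw(43.8, 1.5):.1f} mW")  # ~29 mW during inference
print(f"30 MHz, 0.9 V: {avg_power_mw(31.5, 4.4):.1f} mW")  # ~7 mW during inference
```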

For the visual wake words benchmark, Syntiant’s NDP120 can perform the inference in 4.1 ms using 97.2 µJ of energy. The only better scores among commercially available parts came from an Arm Cortex-A9 implemented on an FPGA. For comparison, STMicroelectronics’ (STMicro) Cortex-M7 can do it in 29.6 ms but needs 3,669 µJ.

In the preview category (for systems not yet commercially available), Bosch showed off its hardware-aware lowering engine (HALE), a code-generation engine that can emit generic C code for any MCU or target-specific code optimized for a particular piece of hardware. Currently, the optimized version supports Cortex-M MCUs, but Bosch plans to expand this, as well as to add support for more layers and data types. Bosch is already using HALE for embedded AI projects.

Many software-differentiated entries picked STMicro’s Cortex-M4-based STM32L4R5ZI part running at 120 MHz as a comparison point. STMicro’s own X-CUBE-AI software stack can execute visual wake words inference in 118.7 ms, image classification in 214.0 ms, keyword spotting in 62.9 ms and anomaly detection in 6.9 ms. (STMicro points out its scores are around 20% faster than in the last round as it continues to work on X-CUBE-AI; recently added features include support for quantized ONNX models.)

Taiwanese microcontroller maker Nuvoton submitted latency scores for its Cortex-M4F-based M467HJHAN. (Source: Nuvoton)

Using Plumerai’s inference engine improved on STMicro’s current scores by a further 20% (40% for anomaly detection). Bosch’s HALE engine was considerably slower than X-CUBE-AI for visual wake words, and its keyword-spotting score was similar, but it beat STMicro’s scores by 16% and 17% for image classification and anomaly detection, respectively. Taiwanese AI software company Skymizer showed off TinyONNC, its implementation of ONNC, which uses Arm’s CMSIS-NN library and a proprietary post-training quantization engine. Its scores were no match for STMicro’s, but the company said it plans to improve beyond what CMSIS-NN can offer in the future.

Taiwanese MCU maker Nuvoton made its MLPerf debut with scores for its M467HJHAN MCU (based on the Arm Cortex-M4F). Its latency results came in very close to those of leader Plumerai among the M4 submissions, though Nuvoton picked a faster operating point (200 MHz versus the others’ 120 MHz) and didn’t submit power results. Nuvoton uses Skymizer’s version of ONNC.


