AI Inference Competition Heats Up

Claudio Ctin | 3 weeks ago | 16 mins

While the dominance of Nvidia GPUs for AI training remains undisputed, we may be seeing early signs that, for AI inference, the competition is gaining on the tech giant, particularly in terms of power efficiency. The sheer performance of Nvidia’s new Blackwell chip, however, may be hard to beat.

This morning, MLCommons released the results of its latest AI inferencing competition, MLPerf Inference v4.1. This round included first-time submissions from teams using AMD Instinct accelerators, the latest Google Trillium accelerators, and chips from Toronto-based startup Untether AI, as well as a first trial for Nvidia’s new Blackwell chip. Two other companies, Cerebras and FuriosaAI, announced new inference chips but did not submit to MLPerf.

Much like an Olympic sport, MLPerf has many categories and subcategories. The one that saw the biggest number of submissions was the “datacenter-closed” category. The closed category (as opposed to open) requires submitters to run inference on a given model as-is, without significant software modification. The data center category tests submitters on bulk processing of queries, as opposed to the edge category, where minimizing latency is the focus.

Within each category, there are 9 different benchmarks, for different types of AI tasks. These include popular use cases such as image generation (think Midjourney) and LLM Q&A (think ChatGPT), as well as equally important but less heralded tasks such as image classification, object detection, and recommendation engines.

This round of the competition included a new benchmark, called Mixture of Experts. This is a growing trend in LLM deployment, where a language model is broken up into several smaller, independent language models, each fine-tuned for a particular task, such as regular conversation, solving math problems, and assisting with coding. The model can direct each query to an appropriate subset of the smaller models, or “experts”. This approach allows for less resource use per query, enabling lower cost and higher throughput, says Miroslav Hodak, MLPerf Inference Workgroup Chair and senior member of technical staff at AMD.
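The routing idea can be illustrated with a toy sketch in Python. Everything here is a hypothetical stand-in: the four "experts" are trivial functions, the gate scores are made up, and `top_k=2` is an arbitrary choice. It is not the MLPerf benchmark model, and in production models the gate is typically learned and operates inside the network rather than on whole queries.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of gate scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_scores, top_k=2):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(gate_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in ranked)
    # Renormalize so the selected experts' weights sum to 1.
    return [(i, probs[i] / norm) for i in ranked]

# Hypothetical experts, each represented by a simple function.
experts = [
    lambda x: x + 1.0,
    lambda x: x * 2.0,
    lambda x: x - 3.0,
    lambda x: x * x,
]

def moe_forward(x, gate_scores, top_k=2):
    # Only the selected experts run; the others cost nothing for this input.
    return sum(w * experts[i](x) for i, w in route(gate_scores, top_k))

selected = route([0.1, 2.0, -1.0, 1.5], top_k=2)
output = moe_forward(3.0, [0.1, 2.0, -1.0, 1.5], top_k=2)
```

With these made-up scores, only two of the four experts are evaluated, which is where the savings in compute per query come from.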

The winners on each benchmark within the popular datacenter-closed category were still submissions based on Nvidia’s H200 GPUs and GH200 superchips, which combine GPUs and CPUs in the same package. However, a closer look at the performance results paints a more complex picture. Some of the submitters used many accelerator chips while others used just one. If we normalize the number of queries per second each submitter was able to handle by the number of accelerators used, and keep only the best performing submissions for each accelerator type, some interesting details emerge. (It’s important to note that this approach ignores the role of CPUs and interconnects.)
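As a rough sketch of that normalization, the Python below divides throughput by accelerator count and keeps the best per-chip result for each accelerator type. The vendor names, chip names, and numbers are invented for illustration and are not MLPerf results.

```python
# Hypothetical submissions: (submitter, accelerator type, accelerator count, queries/s)
submissions = [
    ("Vendor A", "chip-x", 8, 80000.0),
    ("Vendor B", "chip-x", 4, 44000.0),
    ("Vendor C", "chip-y", 1, 9000.0),
    ("Vendor D", "chip-y", 2, 17000.0),
]

def best_per_accelerator(rows):
    """Map each accelerator type to its best (submitter, queries/s per chip)."""
    best = {}
    for submitter, accel, count, qps in rows:
        per_chip = qps / count  # normalize by number of accelerators
        if accel not in best or per_chip > best[accel][1]:
            best[accel] = (submitter, per_chip)
    return best

result = best_per_accelerator(submissions)
for accel, (submitter, per_chip) in result.items():
    print(f"{accel}: {per_chip:.0f} queries/s per chip ({submitter})")
```

Because the division treats each chip as an independent unit, it inherits the caveat noted above: host CPUs and interconnects are invisible to it.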

On a per-accelerator basis, Nvidia’s Blackwell outperforms all previous chip iterations by 2.5x on the LLM Q&A task, the only benchmark it was submitted to. Untether AI’s speedAI240 Preview chip performed almost on par with H200s in its only submission task, image recognition. Google’s Trillium performed just over half as well as the H100s and H200s on image generation, and AMD’s Instinct performed about on par with H100s on the LLM Q&A task.

The power of Blackwell

One of the reasons for Nvidia Blackwell’s success is its ability to run the LLM using 4-bit floating-point precision. Nvidia and its rivals have been driving down the number of bits used to represent data in portions of transformer models like ChatGPT to speed computation. Nvidia introduced 8-bit math with the H100, and this submission marks the first demonstration of 4-bit math on MLPerf benchmarks.
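To give a feel for how coarse 4-bit floating point is, here is a minimal Python sketch that rounds values to the nearest number a 4-bit float can represent. It assumes an E2M1 layout (one sign bit, two exponent bits, one mantissa bit), whose magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, plus a single scale factor per list of values. The article does not describe Nvidia's actual format or scaling scheme, so treat both as assumptions.

```python
import math

# Magnitudes representable in an assumed E2M1 4-bit float (sign handled separately).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(values):
    """Round each value to the nearest scaled FP4 number (illustrative only)."""
    max_abs = max(abs(v) for v in values)
    # One shared scale maps the largest magnitude onto the largest FP4 value.
    scale = max_abs / FP4_MAGNITUDES[-1] if max_abs > 0 else 1.0
    out = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        out.append(math.copysign(nearest * scale, v))
    return out

weights = [0.02, -0.75, 0.31, 1.2, -0.08]  # made-up example values
quantized = quantize_fp4(weights)
errors = [abs(w - q) for w, q in zip(weights, quantized)]
```

In this example the smallest weight collapses to zero, which hints at why keeping accuracy at this precision takes additional software work.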

The greatest challenge with using such low-precision numbers is maintaining accuracy, says Nvidia’s product marketing director Dave Salvator. To maintain the high accuracy required for MLPerf submissions, the Nvidia team had to innovate significantly on software, he says.

Another important contribution to Blackwell’s success is its almost doubled memory bandwidth, 8 terabytes per second, compared to the H200’s 4.8 terabytes per second.

Image: Nvidia GB200 Grace Blackwell Superchip (credit: Nvidia)

Nvidia’s Blackwell submission used a single chip, but Salvator says it’s built to network and scale, and will perform best when combined with Nvidia’s NVLink interconnects. Blackwell GPUs support up to 18 NVLink 100 gigabyte-per-second connections for a total bandwidth of 1.8 terabytes per second, roughly double the interconnect bandwidth of H100s.

Salvator argues that with the increasing size of large language models, even inferencing will require multi-GPU platforms to keep up with demand, and Blackwell is built for this eventuality. “Blackwell is a platform,” Salvator says.

Nvidia submitted its Blackwell chip-based system in the preview subcategory, meaning it is not for sale yet but is expected to be available before the next MLPerf release, six months from now.

Untether AI shines in power use and at the edge

For each benchmark, MLPerf also includes an energy measurement counterpart, which systematically tests the wall plug power that each of the systems draws while performing a task. The main event (the datacenter-closed energy category) saw only two submitters this round: Nvidia and Untether AI. While Nvidia competed in all the benchmarks, Untether only submitted for image recognition.

| Submitter | Accelerator | Number of accelerators | Queries per second | Watts | Queries per second per watt |
| --- | --- | --- | --- | --- | --- |
| NVIDIA | NVIDIA H200-SXM-141GB | (not given) | 480,131.00 | 5,013.79 | 95.76 |
| UntetherAI | UntetherAI speedAI240 Slim | (not given) | 309,752.00 | 985.52 | 314.30 |
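The efficiency column is simply throughput divided by measured power. A short Python check using the throughput and wattage figures reported above (the variable names are my own):

```python
# (queries per second, watts) as reported in the energy results above
results = {
    "NVIDIA H200-SXM-141GB": (480131.00, 5013.79),
    "UntetherAI speedAI240 Slim": (309752.00, 985.52),
}

# Queries per second per watt for each system
efficiency = {name: qps / watts for name, (qps, watts) in results.items()}

# How many times more queries per watt the Untether system delivers
ratio = efficiency["UntetherAI speedAI240 Slim"] / efficiency["NVIDIA H200-SXM-141GB"]

for name, value in efficiency.items():
    print(f"{name}: {value:.2f} queries/s per watt")
print(f"Efficiency ratio: {ratio:.1f}x")
```

On these figures, the Untether AI system handles roughly 3.3 times as many queries per watt, while the Nvidia system delivers about 1.5 times the raw throughput.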

The startup was able to achieve this impressive efficiency by building chips with an approach it calls at-memory computing. UntetherAI’s chips are built as a grid of memory elements with small processors interspersed directly adjacent to them. The processors are parallelized, each working simultaneously with the data in the nearby memory units, thus greatly decreasing the amount of time and energy spent shuttling model data between memory and compute cores.

“What we saw was that 90 percent of the energy to do an AI workload is just moving the data from DRAM onto the cache to the processing element,” says Untether AI vice president of product Robert Beachler. “So what Untether did was turn that around … Rather than moving the data to the compute, I’m going to move the compute to the data.”

This approach proved particularly successful in another subcategory of MLPerf: edge-closed. This category is geared towards more on-the-ground use cases, such as machine inspection on the factory floor, guided vision robotics, and autonomous vehicles—applications where low energy use and fast processing are paramount, Beachler says.

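| Submitter | GPU type | Number of GPUs | Single-stream latency (ms) | Multi-stream latency (ms) | Samples/s |
| --- | --- | --- | --- | --- | --- |
| Lenovo | NVIDIA L4 | (not given) | 0.39 | 0.75 | 25,600.00 |
| Lenovo | NVIDIA L40S | (not given) | 0.33 | 0.53 | 86,304.60 |
| UntetherAI | UntetherAI speedAI240 Preview | (not given) | 0.12 | 0.21 | 140,625.00 |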

On the image recognition task, again the only one Untether AI reported results for, the speedAI240 Preview chip beat the Nvidia L40S’s latency performance by 2.8x and its throughput (samples per second) by 1.6x. The startup also submitted power results in this category, but its Nvidia-accelerated competitors did not, so it is hard to make a direct comparison. However, the nominal power draw per chip for Untether AI’s speedAI240 Preview chip is 150 watts, while for Nvidia’s L40S it is 350 W, leading to a nominal 2.3x power reduction with improved latency.
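These ratios can be reproduced from the edge-closed figures and nominal power ratings quoted above. A small Python sketch (the dictionary keys are my own naming):

```python
# Edge-closed image recognition results and nominal power figures quoted above
l40s = {"single_ms": 0.33, "multi_ms": 0.53, "samples_per_s": 86304.60, "nominal_watts": 350}
untether = {"single_ms": 0.12, "multi_ms": 0.21, "samples_per_s": 140625.00, "nominal_watts": 150}

# Lower latency is better, so divide the L40S latency by the speedAI240 latency.
single_stream_speedup = l40s["single_ms"] / untether["single_ms"]
multi_stream_speedup = l40s["multi_ms"] / untether["multi_ms"]

# Higher throughput is better, so divide the other way around.
throughput_gain = untether["samples_per_s"] / l40s["samples_per_s"]

# Nominal (rated, not measured) power reduction per chip.
power_reduction = l40s["nominal_watts"] / untether["nominal_watts"]

print(f"single-stream latency: {single_stream_speedup:.2f}x")
print(f"multi-stream latency:  {multi_stream_speedup:.2f}x")
print(f"throughput:            {throughput_gain:.2f}x")
print(f"nominal power:         {power_reduction:.2f}x")
```

The 2.8x latency figure appears to correspond to the single-stream column (about 2.75x); the multi-stream column works out closer to 2.5x.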

Cerebras, Furiosa skip MLPerf but announce new chips

Image: Furiosa’s new chip implements the basic mathematical function of AI inference, matrix multiplication, in a different, more efficient way. (credit: Furiosa)

Yesterday at the IEEE Hot Chips conference at Stanford, Cerebras unveiled its own inference service. The Sunnyvale, Calif., company makes giant chips, as big as a silicon wafer will allow, thereby avoiding interconnects between chips and vastly increasing the memory bandwidth of its devices, which are mostly used to train massive neural networks. Now it has upgraded its software stack to use its latest computer, the CS-3, for inference.

Although Cerebras did not submit to MLPerf, the company claims its platform beats an H100 by 7x and competing AI startup Groq’s chip by 2x in LLM tokens generated per second. “Today we’re in the dial up era of Gen AI,” says Cerebras CEO and cofounder Andrew Feldman. “And this is because there’s a memory bandwidth barrier. Whether it’s an H100 from Nvidia or MI 300 or TPU, they all use the same off chip memory, and it produces the same limitation. We break through this, and we do it because we’re wafer-scale.”

Hot Chips also saw an announcement from Seoul-based Furiosa, which presented its second-generation chip, RNGD (pronounced “renegade”). What differentiates Furiosa’s chip is its Tensor Contraction Processor (TCP) architecture. The basic operation in AI workloads is matrix multiplication, normally implemented as a primitive in hardware. However, the size and shape of the matrices, more generally known as tensors, can vary widely. RNGD implements multiplication of this more generalized version, tensors, as a primitive instead. “During inference, batch sizes vary widely, so it’s important to utilize the inherent parallelism and data re-use from a given tensor shape,” Furiosa founder and CEO June Paik said at Hot Chips.
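To make the distinction concrete, the Python sketch below computes one batched tensor contraction, Y[n][i][j] = sum over k of X[n][i][k] * W[k][j], in two equivalent ways: as a sequence of per-batch matrix multiplications, and as a single multiplication over a flattened batch. This is only an illustration of how the same contraction admits different schedules; it says nothing about how RNGD is actually implemented, and the function names and data are my own.

```python
def matmul(a, b):
    # C[i][j] = sum_k A[i][k] * B[k][j]
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def contract_per_batch(x, w):
    # Schedule 1: one matrix multiplication per batch element.
    return [matmul(x_n, w) for x_n in x]

def contract_flattened(x, w):
    # Schedule 2: merge the batch and row indices, multiply once, then split again.
    rows = [row for x_n in x for row in x_n]
    flat = matmul(rows, w)
    per_batch = len(x[0])
    return [flat[n * per_batch:(n + 1) * per_batch] for n in range(len(x))]

# Made-up data: a batch of two 2x2 inputs and one shared 2x3 weight matrix.
x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
w = [[1, 0, 2], [0, 1, 3]]

y_batched = contract_per_batch(x, w)
y_flat = contract_flattened(x, w)
```

Both schedules give the same result, but they reuse the weight matrix differently, which is the kind of choice that depends on batch size and tensor shape.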

Although it didn’t submit to MLPerf, Furiosa compared the performance of its RNGD chip on MLPerf’s LLM summarization benchmark in-house. It performed on par with Nvidia’s edge-oriented L40S chip while using only 185 watts of power, compared to the L40S’s 320 W. And, Paik says, the performance will improve with further software optimizations.

IBM also announced its new Spyre chip, designed for enterprise generative AI workloads, to become available in the first quarter of 2025.

At the very least, shoppers in the AI inference chip market won’t be bored for the foreseeable future.
