
Microsoft’s Tutel optimizes AI model training

View of a Microsoft logo on March 10, 2021, in New York.

Microsoft this week announced Tutel, a library to support the development of mixture of experts (MoE) models, a type of large-scale AI model. Tutel, which is open source and has been integrated into fairseq, Facebook's PyTorch-based sequence modeling toolkit, is designed to enable developers across AI disciplines to “execute MoE more easily and efficiently,” a statement from Microsoft explained.

MoE models are made up of small clusters of “neurons” that are active only under specific circumstances. Lower “layers” of the model extract features, and expert clusters are called upon to evaluate those features. For example, an MoE can be used to build a translation system, with each expert cluster learning to handle a separate part of speech or grammatical rule.

Compared with other model architectures, MoEs have distinct advantages. Because individual experts specialize, the model can display a greater range of behaviors. The experts can receive a mix of data, and because only a few experts are active at any one time when the model runs, even a huge model needs only a small amount of processing power.
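
To make that idea concrete, here is a minimal, illustrative sketch of a top-2 gated MoE layer written in plain PyTorch. The class name, layer sizes, and routing loop are assumptions for demonstration purposes and are not Tutel's or fairseq's implementation; real systems batch the routing rather than looping over experts.

```python
# Minimal sketch of a top-k gated mixture-of-experts layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, model_dim=512, hidden_dim=2048, num_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(model_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, model_dim))
            for _ in range(num_experts)
        )
        # The gate scores every expert for every token.
        self.gate = nn.Linear(model_dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, model_dim)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, indices = scores.topk(self.top_k, dim=-1)  # only k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = indices[:, slot] == e               # tokens sent to expert e
                if routed.any():
                    out[routed] += weights[routed, slot].unsqueeze(-1) * expert(x[routed])
        return out

x = torch.randn(16, 512)
y = SimpleMoE()(x)   # each token touches only 2 of the 8 experts
```

Each token's output is a weighted mix of just two expert networks, which is where the compute savings described above come from.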

In fact, MoE is one of the few approaches demonstrated to scale to more than a trillion parameters, paving the way for models capable of powering computer vision, speech recognition, natural language processing, and machine translation systems, among others. In machine learning, parameters are the part of the model that’s learned from historical training data. Generally speaking, especially in the language domain, the correlation between the number of parameters and sophistication has held up well.
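
The arithmetic behind that scaling is simple to sketch. With illustrative, made-up layer sizes, the total parameter count of an MoE layer grows with the number of experts, while the parameters actually exercised per token stay fixed by the top-k routing:

```python
# Back-of-the-envelope arithmetic with assumed sizes (not a specific published model).
model_dim, hidden_dim = 4096, 16384
num_experts, top_k = 128, 2

dense_ffn_params = 2 * model_dim * hidden_dim       # one feed-forward block
moe_total_params = num_experts * dense_ffn_params   # parameters that exist in the layer
moe_active_params = top_k * dense_ffn_params        # parameters used for any one token

print(f"dense FFN:        {dense_ffn_params:,} parameters")
print(f"MoE layer total:  {moe_total_params:,} parameters")
print(f"active per token: {moe_active_params:,} parameters")
```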

Tutel focuses mainly on optimizing MoE-specific computation. In particular, the library is optimized for Microsoft's new Azure NDm A100 v4 series instances, which provide a sliding scale of Nvidia A100 80GB GPUs. Tutel has a “concise” interface intended to make it easy to integrate into other MoE solutions, Microsoft says. Alternatively, developers can use the Tutel interface to incorporate standalone MoE layers into their own DNN models from scratch, as in the sketch below.
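
The following sketch constructs a standalone Tutel MoE layer inside a PyTorch model. The keyword arguments follow the examples in Tutel's public repository, but exact names and defaults may differ between releases, so treat this as an approximation rather than a definitive usage.

```python
# Sketch: dropping a standalone Tutel MoE layer into a PyTorch model.
# Argument names follow Tutel's published examples but may vary by release;
# consult the repository's README for the version you install.
import torch
import torch.nn.functional as F
from tutel import moe as tutel_moe

model_dim = 1024
moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},            # top-2 routing
    model_dim=model_dim,
    experts={
        'type': 'ffn',                            # feed-forward experts
        'count_per_node': 2,                      # experts hosted on this device
        'hidden_size_per_expert': 4096,
        'activation_fn': lambda x: F.relu(x),
    },
)

x = torch.randn(4, 128, model_dim)                # assumed (batch, sequence, model_dim) input
y = moe(x)                                        # output has the same shape as the input
```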

[Chart: End-to-end throughput (K tokens/s) of Meta's MoE language model on Azure NDm A100 v4 nodes, from 8 to 512 A100 80GB GPUs, with and without Tutel; Tutel consistently achieves higher throughput than fairseq alone.]

Above: For a single MoE layer, Tutel achieves an 8.49 times speedup on an NDm A100 v4 node with 8 GPUs and a 2.75 times speedup on 64 NDm A100 v4 nodes with 512 A100 GPUs, Microsoft claims.

“Because of the lack of efficient implementations, MoE-based models rely on a naive combination of multiple off-the-shelf operators provided by deep learning frameworks such as PyTorch and TensorFlow to compose the MoE computation. Such a practice incurs significant performance overheads thanks to redundant computation,” Microsoft wrote in a blog post. (Operators are the individual computational building blocks, such as matrix multiplications and data-shuffling routines, that a framework provides for assembling models.) “Tutel designs and implements multiple highly optimized GPU kernels to provide operators for MoE-specific calculation.”
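
For context, here is roughly what that operator-level composition looks like when expressed with generic PyTorch operators (an illustrative sketch, not Tutel's or fairseq's actual code). The dense routing mask and einsums spend most of their arithmetic on zero entries, which is the redundant computation that MoE-specific fused kernels avoid:

```python
# Naive MoE dispatch built only from generic framework ops (illustrative).
import torch
import torch.nn.functional as F

tokens, num_experts, model_dim = 1024, 8, 512
x = torch.randn(tokens, model_dim)
expert_ids = torch.randint(num_experts, (tokens,))       # stand-in for the gate's decisions

# A dense (token, expert) routing mask; nearly all entries are zero.
mask = F.one_hot(expert_ids, num_experts).float()        # (tokens, experts)

# Every expert "processes" every token slot, even the zeroed-out ones,
# so most of the multiply-adds below are wasted work.
expert_inputs = torch.einsum('te,tm->etm', mask, x)      # (experts, tokens, model_dim)
expert_weights = torch.randn(num_experts, model_dim, model_dim)
expert_outputs = torch.einsum('etm,emn->etn', expert_inputs, expert_weights)
combined = torch.einsum('te,etn->tn', mask, expert_outputs)   # gather results per token
```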

Tutel is available in open source on GitHub. Microsoft says that the Tutel development team will “be actively integrating” various emerging MoE algorithms from the community into future releases.

“MoE is a promising technology. It enables holistic training based on techniques from many areas, such as systematic routing and network balancing with massive nodes, and can even benefit from GPU-based acceleration. We demonstrate an efficient MoE implementation, Tutel, that resulted in significant gain over the fairseq framework. Tutel has been integrated [with our] DeepSpeed framework, as well, and we believe that Tutel and related integrations will benefit Azure services, especially for those who want to scale their large models efficiently,” Microsoft added.
