MAX FAQ

We tried to anticipate your questions about MAX Engine on this page. If this page doesn't answer all your questions, please ask us on our Discord channel.

Motivation

Do we really need yet another inference engine?

We believe so, because the AI deployment landscape is littered with tools that provide only a small portion of what AI developers need. They support only a single framework, a single hardware target, a single OS and platform, a small subset of models, or just small- to medium-sized model execution. The result is that developers end up dealing with flaky model converters, constantly rewriting and re-optimizing their models to work with different tools, and working across different test and deployment environments. This is why we built MAX Engine: the only inference engine in the world that doesn’t require you to compromise, because it is both general purpose and incredibly fast.

Distribution

When will it be generally available to the public?

You can get the MAX SDK right now.

This is still a developer preview and we'll soon release tools for production deployment. Sign up for updates.

Will it be open-sourced?

We want to contribute a lot to open source, but we also want to do it right. Our team has decades of experience building open-source projects, and we have learned that the important thing is to create an inclusive and vibrant community – and that takes a lot of work. We will need to figure out the details, but as we do so, we will share more. Please stay tuned.

Does the MAX SDK collect telemetry?

Yes, the MAX SDK collects basic system information, session durations, compiler events, and crash reports that enable us to identify, analyze, and prioritize issues.

This telemetry is crucial to help us quickly identify problems and improve our products. Without this telemetry, we would have to rely on user-submitted bug reports, and in our decades of experience building developer products, we know that most people don’t do that. The telemetry provides us the insights we need to build better products for you.

You can opt out of some telemetry, such as compiler events and crash reports. However, package install/update/uninstall events, basic system information, and session durations (the amount of time spent running MAX Engine) cannot be disabled (see the MAX SDK terms).

To disable crash reports, use this command:

modular config-set crash_reporting.enabled=false

To reduce other telemetry to only the required telemetry events, use this command:

modular config-set telemetry.level=0

There are 3 telemetry levels: 0 records only the required details such as hardware information and session durations; 1 records high-level events such as when the compiler is invoked; and 2 records more specific events such as when the compiler uses a framework fallback op.
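
If you provision machines with a script, the two settings above could be generated programmatically. The following is a minimal sketch that only builds the command lines shown in this section and does not run them. The helper name `telemetry_commands` and its parameters are hypothetical and are not part of the MAX SDK.

```python
from typing import List

# Descriptions paraphrase the three telemetry levels listed in this FAQ.
TELEMETRY_LEVELS = {
    0: "required details only (hardware information, session durations)",
    1: "high-level events (for example, when the compiler is invoked)",
    2: "more specific events (for example, framework fallback ops)",
}


def telemetry_commands(level: int, disable_crash_reports: bool = True) -> List[List[str]]:
    """Build `modular config-set` argument lists (hypothetical helper).

    Nothing is executed here; each returned list could be passed to
    subprocess.run() by the caller.
    """
    if level not in TELEMETRY_LEVELS:
        raise ValueError(f"telemetry level must be 0, 1, or 2, got {level}")
    commands = [["modular", "config-set", f"telemetry.level={level}"]]
    if disable_crash_reports:
        commands.append(["modular", "config-set", "crash_reporting.enabled=false"])
    return commands


if __name__ == "__main__":
    # Print the commands for the most restrictive configuration.
    for cmd in telemetry_commands(level=0):
        print(" ".join(cmd))
```

Keeping the commands as argument lists rather than one shell string avoids quoting problems if you later hand them to `subprocess.run()`.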

Functionality

How do I use MAX Engine?

You can execute a trained model using our Python, C, and Mojo API libraries. Learn more about our APIs and view sample code in the MAX Engine docs.

Why does MAX Engine take so long to "load" a model?

The first time you load your model (such as with the Python load() function), MAX Engine must compile it.

This might seem strange if you're used to "eager execution" in PyTorch or TensorFlow, but this compilation step is how MAX Engine optimizes the graph to deliver more performance. This is an up-front cost that occurs only when you first load the model, and it pays dividends with major latency savings provided by our next-generation graph compiler. So it's worth the wait. :)
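
To make the trade-off concrete, here is a back-of-the-envelope sketch of when a one-time compile cost is recovered by lower per-request latency. All numbers are hypothetical placeholders, not measured MAX Engine results, and `break_even_requests` is an illustrative helper rather than part of any MAX API.

```python
import math


def break_even_requests(compile_s: float, baseline_ms: float, compiled_ms: float) -> int:
    """Number of requests after which the one-time compile cost is recovered.

    compile_s   -- one-time compile cost paid at first load, in seconds
    baseline_ms -- per-request latency without the compile step, in milliseconds
    compiled_ms -- per-request latency after compilation, in milliseconds
    """
    saved_ms = baseline_ms - compiled_ms
    if saved_ms <= 0:
        raise ValueError("compiled latency must be lower than baseline latency")
    # Total time saved after n requests is n * saved_ms; find the smallest n
    # for which that is at least the compile cost.
    return math.ceil(compile_s * 1000 / saved_ms)


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    n = break_even_requests(compile_s=30.0, baseline_ms=20.0, compiled_ms=12.0)
    print(f"compile cost is recovered after {n} requests")
```

With these made-up figures, a 30-second compile that saves 8 ms per request is recovered after 3,750 requests, which a production service would typically reach quickly.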

What types of models does MAX Engine support and from which training frameworks?

MAX Engine supports most models from PyTorch (in TorchScript format), ONNX, and TensorFlow (in SavedModel format).

However, TensorFlow support is available for enterprise customers only and is no longer included in the MAX SDK. We removed TensorFlow support from the MAX SDK because industry-wide TensorFlow usage has declined significantly, especially for the latest AI innovations. Removing it also cuts our package size by over 50% and accelerates the development of other customer-requested features. If you have a production use case for a TensorFlow model, please contact us.

What hardware is currently supported by MAX Engine?

MAX Engine currently supports CPUs from all major vendors (Intel, AMD, and AWS Graviton).

Support for NVIDIA GPUs is in the works.

Does MAX Engine support generic ARM architectures?

Yes, but we officially support only Graviton because it’s the most commonly used ARM chip for server deployments, and our benchmarks are designed to match what users use most often in production.

We also support Apple Silicon (M1/M2) for local development with MAX.

Which operating systems does MAX Engine support?

MAX Engine currently supports Ubuntu Linux and macOS. In a future release, we'll add support for Windows. For details, see the system requirements.

Which programming languages does it support? Can I use Mojo?

We currently provide MAX Engine API bindings in Python, C, and—yes—Mojo. If there are other languages you’d like to see us support, please share in our Discord.

How quickly will it support new model architectures as they become available in ML frameworks?

One of the design principles of MAX Engine is full compatibility with models from supported ML frameworks. Hence, we will quickly support new operators as soon as a stable version of the training framework is available.

Will I need different packages depending on the platform or hardware where my code needs to be deployed?

No. We believe forcing developers to deploy and manage different packages for different deployment targets is a major friction factor that shouldn’t exist. The same engine package works regardless of the hardware available in your deployment platform.

Will I need to change my code if I want to switch deployment to hardware from a different manufacturer?

No. With MAX Engine, you write your code once and deploy anywhere. For example, if you are currently running on an Intel instance in AWS but want to experiment with a Graviton instance, that’s no problem. Just redeploy the same engine package to the Graviton instance, and you’re good to go.

Can I use Modular’s MAX Engine on existing cloud platforms?

Yes. You can deploy MAX Engine to infrastructure provided by any major cloud platform (AWS, GCP, Azure) via traditional container solutions. For more information, read about MAX Serve.

Does MAX Engine support quantization and sparsity?

It supports some quantized models today (models with Int data types), and we are working to add support for more. We will also be adding support for sparsity soon.

If you're building a model with MAX Graph, you can quantize your model with several different quantization encodings.

Will MAX Engine support distributed inference of large models?

Yes, it will support executing large models that do not fit into the memory of a single device. This isn't available yet, so stay tuned!

Will MAX Engine support mobile?

We are currently focused on server deployment, but we plan to support deployment to many different platforms, including mobile. We will share more about our mobile support in the future, so stay tuned!

Can I extend MAX Engine with a new hardware backend?

Yes, you can. MAX Engine can support other hardware backends, including specialized accelerators, via Mojo, which can talk directly to MLIR and LLVM abstractions (for an example of Mojo talking directly to MLIR, see our low-level IR in Mojo notebook). The exact details about how to add additional hardware support are still being ironed out, but you can read our vision for pluggable hardware.

Can I integrate ops written in Mojo into MAX Engine?

Yes, Mojo works natively with MAX Engine! In fact, all of MAX Engine’s in-house operations are written in Mojo. In an upcoming release, you'll also be able to write your own custom ops in Mojo.

Can the runtime be separated from model compilation (for edge deployment)?

Yes, our runtime (and our entire stack) is designed to be modular. It scales down very well, supports heterogeneous configs, and scales up to distributed settings as well. That being said, this isn't available yet, but we'll share details about more deployment scenarios we support over time.

Performance

How can I see some real performance numbers?

Just install the MAX SDK and then use our benchmark tool!

We've also created an interactive dashboard where you can select from a number of industry-standard models and production-grade compute instances to see our real inference performance compared to other frameworks. Take a look at our performance.

Why does MAX Engine perform slowly on my computer compared to PyTorch?

So far, we've been focused on building optimizations for data center CPUs. In many cases, these optimizations carry over to desktop x86 CPUs. However, modern desktop/laptop CPUs often use an asymmetric core design not found in data center CPUs, pairing performance cores ("P cores") with efficiency cores ("E cores"). If your CPU includes these P/E cores, MAX Engine doesn't yet account for the differences between them. We have plans to fix this.

Additionally, we haven't optimized for systems with 32+ cores yet. This will also be fixed soon.

If you're seeing slow results that aren't explained here, please let us know.

Why do you only show performance numbers on the CPU?

Our stack is completely extensible to any type of hardware architecture, from commodity silicon to newer types of AI-specific accelerators. We are starting with CPUs because many real-world inference workloads still heavily depend on CPUs. Our stack currently supports x86-64 CPUs from all major hardware vendors, such as Intel and AMD. It also supports the Graviton CPUs available in AWS. We are actively bringing up support for GPU execution, starting with NVIDIA GPUs. We will share more about our GPU support soon.

Why test with batch size 1?

We started our benchmarking with batch size 1 for a couple of reasons: 1) it’s a common batch size for production inference scenarios and 2) it puts runtime efficiencies front and center, which helps ensure we are building the most performant stack possible. We have also tested larger batch sizes and seen the same relative performance improvements. We’ll be releasing those results on our performance dashboard in the near future.

Do you have any benchmarks on GPUs?

We are currently working on adding GPU support. Stay tuned for benchmarks in the near future.

Why are you benchmarking across so many different sequence lengths?

Production NLP deployment scenarios typically involve variable sequence lengths. One of the defining features of MAX Engine is that it supports full dynamic shapes, meaning that it doesn’t pad shorter sequences or recompile when sequence lengths change. Therefore, we benchmark on a variety of sequence lengths to show the relative speedups you should expect depending on the distribution of sequence lengths in your data.
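
As a rough illustration of why padding matters, the sketch below computes how much of the work is spent on padding tokens when every sequence is padded to one fixed length. It is a generic calculation with hypothetical sequence lengths, not a description of how MAX Engine handles shapes internally, and `padded_fraction` is an illustrative helper name.

```python
from typing import List


def padded_fraction(seq_lens: List[int], pad_to: int) -> float:
    """Fraction of token slots that are padding when all sequences are padded to pad_to."""
    if not seq_lens:
        raise ValueError("seq_lens must not be empty")
    if pad_to < max(seq_lens):
        raise ValueError("pad_to must be at least the longest sequence length")
    total_slots = pad_to * len(seq_lens)  # slots processed with fixed-length padding
    real_tokens = sum(seq_lens)           # slots that carry actual input
    return (total_slots - real_tokens) / total_slots


if __name__ == "__main__":
    # Hypothetical request mix; real traffic will have its own distribution.
    lengths = [16, 32, 64, 128]
    waste = padded_fraction(lengths, pad_to=128)
    print(f"{waste:.1%} of token slots would be padding")
```

For this made-up mix, more than half of the token slots would be padding, which is why the distribution of sequence lengths in your data affects the speedup you can expect.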

Future work

Your launch keynote also mentioned a “cloud serving platform” – what’s the deal with that?

We are excited to bring state-of-the-art innovations across many layers of the AI lifecycle, including layers that are necessary to serve increasingly large AI models on cloud infrastructure. To that end, as mentioned in our launch keynote, we are on a journey to build a next-generation cloud compute platform that will significantly improve server utilization by distributing inference across many nodes. It will effectively scale out and down to meet dynamic changes in traffic volume, and significantly reduce the overall time and effort required to bring up and maintain large models in production. Stay tuned for more information about our plans here.

Do you also intend to support training?

Right now we are focused on inference because it’s a more fragmented landscape than training and because it is where organizations have a majority of their AI operating expenses. That being said, there’s no reason why the technology we’ve built for MAX Engine can’t scale to support training workloads with similar performance improvements. Stay tuned!