MAX FAQ

We tried to anticipate your questions about MAX Engine on this page. If this page doesn't answer all your questions, please ask us on our Discord channel.

Distribution

What are the system requirements?

Mac

- Apple silicon (M1/M2/M3 processor)
- macOS Ventura (13) or later
- Python 3.9 - 3.12
- Xcode or Xcode Command Line Tools
- Homebrew
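As a quick sanity check before installing, the Mac requirements above can be sketched in a few lines of Python. This is only an illustrative helper, not part of MAX: the function name `meets_mac_requirements` is made up for this example, and it checks only the processor, macOS version, and Python version (it does not look for Xcode or Homebrew).

```python
import platform
import sys


def meets_mac_requirements(machine, macos_version, python_version):
    """Check the Mac requirements listed above against explicit values.

    machine:        CPU architecture string, e.g. "arm64" on Apple silicon
    macos_version:  macOS version string, e.g. "14.5" (empty if not on macOS)
    python_version: tuple whose first two items are (major, minor)
    """
    is_apple_silicon = machine == "arm64"
    # Treat a missing macOS version as 0 so the check fails on other systems.
    macos_major = int(macos_version.split(".")[0]) if macos_version else 0
    python_ok = (3, 9) <= tuple(python_version[:2]) <= (3, 12)
    return is_apple_silicon and macos_major >= 13 and python_ok


if __name__ == "__main__":
    ok = meets_mac_requirements(
        platform.machine(),
        platform.mac_ver()[0],  # empty string on non-Mac systems
        sys.version_info,
    )
    print("Requirements met" if ok else "Requirements not met")
```

Note that a Python interpreter running under Rosetta may report `x86_64` even on Apple silicon, so treat a negative result as a prompt to check manually.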

What versions of torch are supported?

MAX supports torch from version 2.2.2 through 2.4.0.

MAX currently defaults to CPU-only torch, but torch builds with CUDA or ROCm support should also work as intended.
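If you want to verify an installed torch against the supported range, a minimal sketch might look like the following. The helper names `parse_release` and `torch_is_supported` are hypothetical and not part of MAX or torch; the range is taken from the answer above.

```python
import re

# Supported range stated above: 2.2.2 through 2.4.0 (inclusive).
MIN_TORCH = (2, 2, 2)
MAX_TORCH = (2, 4, 0)


def parse_release(version):
    """Extract (major, minor, patch) from a string such as '2.3.1+cpu'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def torch_is_supported(version):
    """Return True if the release number falls within the supported range."""
    return MIN_TORCH <= parse_release(version) <= MAX_TORCH
```

With torch installed, you could call `torch_is_supported(torch.__version__)`. The local build suffix (such as `+cpu` or `+cu121`) is ignored, and so are pre-release tags, so a development build of 2.4.0 would be reported as supported.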

Will it be open-sourced?

We want to contribute a lot to open source, but we also want to do it right. Our team has decades of experience building open-source projects, and we have learned that the important thing is to create an inclusive and vibrant community – and that takes a lot of work. We will need to figure out the details, but as we do so, we will share more. Please stay tuned.

Why bundle Mojo with MAX?

Integrating Mojo and MAX into a single package is the best way to ensure interoperability between Mojo and MAX for all users, and avoid version conflicts that happen when installing them separately.

We built Mojo as a core technology for MAX, and you can use it to extend MAX Engine, so MAX clearly depends on Mojo. On the other hand, writing Mojo code that runs on both CPUs and GPUs (and other accelerators) requires runtime components and orchestration logic that fall outside the domain of Mojo and into the domain of MAX. That is, MAX isn't just a framework for AI development; it's also a framework for general heterogeneous compute. As such, writing Mojo programs that can execute across heterogeneous hardware depends on MAX.

Nothing has changed for Mojo developers—you can still build and develop in Mojo like you always have. The only difference is that you're now able to seamlessly step into general-purpose GPU programming (coming soon).

Does the MAX SDK collect telemetry?

Yes, the MAX SDK collects basic system information, session durations, compiler events, and crash reports that enable us to identify, analyze, and prioritize issues.

This telemetry is crucial to help us quickly identify problems and improve our products. Without this telemetry, we would have to rely on user-submitted bug reports, and in our decades of experience building developer products, we know that most people don’t do that. The telemetry provides us the insights we need to build better products for you.

You can opt out of some telemetry, such as compiler events and crash reports. However, package install/update/uninstall events, basic system information, and session durations (the amount of time spent running MAX Engine) cannot be disabled (see the MAX SDK terms).

If you installed with `magic`

When using Magic, telemetry is configured separately for each project. To disable telemetry for compiler events and crash reports, run this command in your project environment:

magic telemetry --disable

For details, see the magic telemetry docs.

If you installed with `modular` (deprecated)

To disable crash reports, use this command:

modular config-set crash_reporting.enabled=false

To reduce other telemetry to only the required telemetry events, use this command:

modular config-set telemetry.level=0

There are 3 telemetry levels: 0 records only the required details such as hardware information and session durations; 1 records high-level events such as when the compiler is invoked; and 2 records more specific events such as when the compiler uses a framework fallback op.

Installation

Can I install both stable and nightly builds?

Yes, when you use the Magic CLI, it's safe and easy to use the stable and nightly builds for different projects, each with their own virtual environment and package dependencies.

To install nightlies, see how to create a project with MAX or get started with Mojo.

How do I uninstall everything?

It depends on whether you installed using the modular CLI (deprecated) or the magic CLI.

If you installed with `magic`

Just delete any project paths that you created with magic init or that contain a pixi.toml or mojoproject.toml file. Then delete the magic binary:

rm ~/.modular/bin/magic

That's it.

If you installed with `modular` (deprecated)

You can use the modular uninstall command to remove just the max or mojo packages and configurations:

modular uninstall max

That removes the executable tools and configurations, but it leaves the MAX Python package, the Modular CLI tool, and some other files. To remove everything else, use these commands:

Uninstall the MAX Python package. You can do this in one of two ways:
- Delete your virtual environment completely, which removes the max-engine Python packages along with any other dependencies. For example:
  - venv: `rm -rf ~/max-venv`
  - conda: `conda env remove --name max`
- Or, keep the virtual environment and delete just the max-engine packages inside. First activate the environment:
  - venv: `source ~/max-venv/bin/activate`
  - conda: `conda activate max`

  Then uninstall the packages:
```
pip uninstall max-engine max-engine-libs -y
```

Uninstall the modular CLI:

- Mac: `brew uninstall modular`
- Linux: `sudo apt update && sudo apt remove modular -y`

Delete the Modular home directory:
```
rm -rf ~/.modular
```

That should do it for situations in which you're troubleshooting your installation and want to reinstall from scratch.

If you want to remove absolutely all traces of MAX, here are a few more actions:

Delete the lines that mention MODULAR_HOME and .modular in ~/.bashrc, ~/.bash_profile, or ~/.zshrc.
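If you would rather not edit the shell startup files by hand, the cleanup can be sketched in Python. This is a hedged example, not an official tool: the function names are invented here, and it defaults to a dry run that only reports matching lines. When `dry_run=False`, it writes a `.bak` copy next to the file before changing it.

```python
from pathlib import Path

# Substrings that identify lines added by the Modular installer.
MARKERS = ("MODULAR_HOME", ".modular")


def strip_modular_lines(text):
    """Return text without any line that mentions one of the markers."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not any(marker in line for marker in MARKERS)
    ]
    return "".join(kept)


def clean_rc_file(path, dry_run=True):
    """Remove matching lines from one startup file; return the removed lines."""
    path = Path(path).expanduser()
    if not path.is_file():
        return []
    original = path.read_text()
    removed = [
        line
        for line in original.splitlines()
        if any(marker in line for marker in MARKERS)
    ]
    if removed and not dry_run:
        # Keep a backup next to the file before rewriting it.
        path.with_name(path.name + ".bak").write_text(original)
        path.write_text(strip_modular_lines(original))
    return removed
```

For example, `clean_rc_file("~/.zshrc")` lists the lines that would be removed. Review that list before running again with `dry_run=False`, because the substring match will also catch unrelated lines that happen to contain `.modular`.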

Remove the Modular repository name:

- Mac: `brew untap modularml/packages/modular`
- Linux: `sudo rm -rf /etc/apt/sources.list.d/modular-installer.list`

If you used Jupyter notebooks with Mojo, delete the Mojo kernels:
```
jupyter kernelspec list
```
```
jupyter kernelspec remove <KERNEL_NAME>
```
Uninstall the Mojo VS Code extension via the VS Code extensions tab.

Functionality

How do I use MAX Engine?

You can execute a trained model using our Python, C, and Mojo API libraries. Learn more about our APIs and view sample code in the MAX Engine docs.

Why does MAX Engine take so long to "load" a model?

The first time you load your model (such as with the Python load() function), MAX Engine must first compile the model.

This might seem strange if you're used to "eager execution" in PyTorch or TensorFlow, but this compilation step is how MAX Engine optimizes the graph to deliver more performance. This is an up-front cost that occurs only when you first load the model, and it pays dividends with major latency savings provided by our next-generation graph compiler. So it's worth the wait. :)
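To see how the one-time load cost compares with the per-inference cost in your own setup, you can time the two separately. The sketch below is deliberately framework-agnostic: `load_fn` and `run_fn` are hypothetical placeholders for whatever you use to load and execute your model, and the `fake_load`/`fake_run` stand-ins exist only so the example runs anywhere. It does not call the MAX Engine API.

```python
import time


def measure_load_and_run(load_fn, run_fn, n_runs=10):
    """Time one call to load_fn, then the average of n_runs calls to run_fn."""
    start = time.perf_counter()
    model = load_fn()
    load_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(n_runs):
        run_fn(model)
    run_seconds = (time.perf_counter() - start) / n_runs
    return load_seconds, run_seconds


def fake_load():
    """Stand-in loader: sleeps briefly to imitate an up-front compile step."""
    time.sleep(0.05)
    return lambda x: x * 2


def fake_run(model):
    """Stand-in inference call on the loaded 'model'."""
    return model(21)


if __name__ == "__main__":
    load_s, run_s = measure_load_and_run(fake_load, fake_run)
    print(f"load: {load_s:.4f}s, average run: {run_s:.6f}s")
```

Replace the stand-ins with your own load and execute calls to measure a real model. Because the load cost is paid once, it is amortized over every later inference in the same process.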

What types of models does MAX Engine support and from which training frameworks?

MAX Engine supports most models from PyTorch (in TorchScript format), ONNX, and TensorFlow (in SavedModel format).

However, TensorFlow support is available for enterprise customers only and no longer included in the MAX SDK. We removed TensorFlow support in the MAX SDK because industry-wide TensorFlow usage has declined significantly, especially for the latest AI innovations. Also, it cuts our package size by over 50% and accelerates the development of other customer-requested features. If you have a production use-case for a TensorFlow model, please contact us.

What hardware is currently supported by MAX Engine?

MAX Engine currently supports all CPU types from major vendors (Intel, AMD, and AWS Graviton).

Support for NVIDIA GPUs is in the works.

Does MAX Engine support generic ARM architectures?

Yes, but we officially support only Graviton because it’s the most commonly used ARM chip for server deployments, and our benchmarks are designed to match what users use most often in production.

We also support Apple silicon (M1/M2/M3) for local development with MAX.

Which programming languages does it support? Can I use Mojo?

We currently provide MAX Engine API bindings in Python, C, and—yes—Mojo. If there are other languages you’d like to see us support, please share in our Discord.

How quickly will it support new model architectures as they become available in ML frameworks?

One of the design principles of MAX Engine is full compatibility with models from supported ML frameworks. Hence, we will quickly support new operators as soon as a stable version of the training framework is available.

Will I need different packages depending on the platform or hardware where my code needs to be deployed?

No. We believe forcing developers to deploy and manage different packages for different deployment targets is a major friction factor that shouldn’t exist. The same engine package works regardless of the hardware available in your deployment platform.

Will I need to change my code if I want to switch deployment to hardware from a different manufacturer?

No. With MAX Engine, you write your code once and deploy anywhere. For example, if you are currently running on an Intel instance in AWS but want to experiment with a Graviton instance, that’s no problem. Just redeploy the same engine package to the Graviton instance, and you’re good to go.

Can I use Modular’s MAX Engine on existing cloud platforms?

Yes. You can deploy MAX Engine to infrastructure provided by any major cloud platform (AWS, GCP, Azure) via traditional container solutions. For more information, read about MAX Serve.

Does MAX Engine support quantization and sparsity?

It supports some quantized models today (models with Int data types), and we are working to add support for more. We will be adding support for sparsity soon.

If you're building a model with MAX Graph, you can quantize your model with several different quantization encodings.

Will MAX Engine support distributed inference of large models?

Yes, it will support executing large models that do not fit into the memory of a single device. This isn't available yet, so stay tuned!

Will MAX Engine support mobile?

We are currently focused on server deployment, but we plan to support deployment to many different platforms, including mobile. We will share more about our mobile support in the future, so stay tuned!

Can I extend MAX Engine with a new hardware backend?

Yes, you can. MAX Engine can support other hardware backends, including specialized accelerators, via Mojo, which can talk directly to MLIR and LLVM abstractions (for an example of Mojo talking directly to MLIR, see our low-level IR in Mojo notebook). The exact details about how to add additional hardware support are still being ironed out, but you can read our vision for pluggable hardware.

Can I integrate ops written in Mojo into MAX Engine?

We're working on it. Learn more about our extensibility API for MAX Engine.

Can the runtime be separated from model compilation (for edge deployment)?

Yes, our runtime (and our entire stack) is designed to be modular. It scales down very well, supports heterogeneous configs, and scales up to distributed settings as well. That being said, this isn't available yet, but we'll share details about more deployment scenarios we support over time.

Performance

Why does MAX Engine perform slowly on my computer compared to PyTorch?

So far, we've been focused on building optimizations for data center CPUs. In many cases, these optimizations carry over to desktop x86 CPUs. However, modern desktop/laptop CPUs include specialty compute cores not found in data center CPUs, such as performance cores ("P cores"), which are often paired with efficiency cores ("E cores") in an asymmetric core design. If your CPU includes these P/E cores, MAX Engine doesn't yet use them effectively. We have plans to fix this.

Additionally, we haven't optimized for systems with 32+ cores yet. This will also be fixed soon.

If you're seeing slow results that aren't explained here, please let us know.

Why do you only show performance numbers on the CPU?

Our stack is completely extensible to any type of hardware architecture, from commodity silicon to newer types of AI-specific accelerators. We are starting with CPUs because many real-world inference workloads still heavily depend on CPUs. Our stack currently supports x86-64 CPUs from all major hardware vendors, such as Intel and AMD. It also supports the Graviton CPUs available in AWS. We are actively bringing up support for GPU execution, starting with NVIDIA GPUs. We will share more about our GPU support soon.

Why test with batch size 1?

We started our benchmarking with batch size 1 for a couple of reasons: 1) it’s a common batch size for production inference scenarios and 2) it puts runtime efficiencies front and center, which helps ensure we are building the most performant stack possible. We have also tested and seen the same relative performance improvements with larger batch sizes. We’ll be releasing those results on our performance dashboard in the near future.

Do you have any benchmarks on GPUs?

We are currently working on adding GPU support. Stay tuned for benchmarks in the near future.

Why are you benchmarking across so many different sequence lengths?

Production NLP deployment scenarios typically involve variable sequence lengths. One of the defining features of MAX Engine is that it supports full dynamic shapes, meaning that it’s not padding shorter sequence lengths or having to recompile when sequence lengths change. Therefore, we benchmark on a variety of sequence lengths to show the relative speedups you should expect depending on the distribution of sequence lengths in your data.
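To make the cost of padding concrete, here is a small arithmetic sketch. It is a general illustration of fixed-shape padding, not a measurement of MAX Engine; the function name `padding_overhead` and the example lengths are made up for this example.

```python
def padding_overhead(sequence_lengths, padded_length=None):
    """Fraction of processed tokens that are padding when every sequence
    is padded to one fixed length (by default, the longest sequence)."""
    if not sequence_lengths:
        raise ValueError("sequence_lengths must not be empty")
    longest = max(sequence_lengths)
    if padded_length is None:
        padded_length = longest
    if padded_length < longest:
        raise ValueError("padded_length is shorter than the longest sequence")
    real_tokens = sum(sequence_lengths)
    total_tokens = padded_length * len(sequence_lengths)
    return (total_tokens - real_tokens) / total_tokens


if __name__ == "__main__":
    lengths = [16, 32, 64, 128]
    # 4 sequences padded to 128 tokens = 512 slots, of which 240 are real.
    print(f"{padding_overhead(lengths):.1%} of tokens are padding")
```

In this example, more than half of the processed tokens (272 of 512) are padding. A runtime that handles each sequence at its actual length avoids that work, which is why the speedup you see depends on the distribution of sequence lengths in your data.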

Future work

Your launch keynote also mentioned a “cloud serving platform” – what’s the deal with that?

We are excited to bring state-of-the-art innovations across many layers of the AI lifecycle, including layers that are necessary to serve increasingly large AI models on cloud infrastructure. To that end, as mentioned in our launch keynote, we are on a journey to build a next-generation cloud compute platform that will significantly improve server utilization by distributing inference across many nodes. It will effectively scale out and down to meet dynamic changes in traffic volume and significantly reduce the time and effort required to bring up and maintain large models in production. Stay tuned for more information about our plans here.

Do you also intend to support training?

Right now we are focused on inference because it’s a more fragmented landscape than training and because it is where organizations have a majority of their AI operating expenses. That being said, there’s no reason why the technology we’ve built for MAX Engine can’t scale to support training workloads with similar performance improvements. Stay tuned!

