Get started with MAX

Welcome to the MAX quickstart guide!

Within a matter of minutes, you’ll install MAX and run inference with some code examples. Along the way, you'll see how MAX Engine performance compares to the stock PyTorch and ONNX runtimes, and learn how to benchmark your own models from the command line.

Preview release

We're excited to share this preview version of MAX! For details about what's included, see the MAX changelog, and for details about what's yet to come, see the roadmap and known issues.

1. Install MAX

See the MAX install guide.

2. Run your first model

Let's start with something simple, similar to a "Hello world," just to make sure MAX is installed and working.

First, clone the code examples:

    git clone
Nightly branch

If you installed the nightly build, make sure to checkout the nightly branch:

    (cd max && git checkout nightly)

Now let's run inference using a TorchScript model and our Python API. We'll start with a version of BERT that's trained to predict the masked words in a sentence.

  1. Starting from where you cloned the repo, go into the example and install the Python requirements:

    cd max/examples/inference/bert-python-torchscript
    python3 -m pip install -r requirements.txt
  2. Download and run the model with this script:


    This script downloads the BERT model and runs it with some input text.

You should see results like this:

input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.

Cool, it works! (If it didn't work, let us know.)
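Conceptually, "filling the mask" means the model emits a score for every vocabulary token at the masked position, and the highest-scoring token wins. Here's a toy sketch of that final step (the tiny vocabulary and logits are invented for illustration, not real BERT outputs):

```python
import numpy as np

# Hypothetical miniature vocabulary and model logits for the [MASK] slot.
vocab = ["paris", "capital", "city", "country", "france"]
mask_logits = np.array([0.1, 4.2, 2.7, 0.8, 1.3])

# The predicted token is simply the argmax over the vocabulary scores.
predicted = vocab[int(np.argmax(mask_logits))]

template = "Paris is the [MASK] of France."
print(template.replace("[MASK]", predicted))
# → Paris is the capital of France.
```

The real model scores tens of thousands of tokens, but the selection logic is the same.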

Compile time

The first time you run an example, it will take some time to compile the model. This might seem strange if you're used to "eager execution" in ML frameworks, but this is where MAX Engine optimizes the graph to deliver more performance. This happens only when you load the model, and it's an up-front cost that pays dividends with major latency savings at run time.
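The trade-off is easy to see with back-of-the-envelope arithmetic: a one-time compile cost is quickly amortized once per-request latency drops. A sketch with purely hypothetical numbers (not measured results):

```python
# Hypothetical numbers, purely for illustration.
compile_s = 60.0     # one-time compilation cost, in seconds
eager_ms = 54.0      # per-request latency without compilation
compiled_ms = 30.0   # per-request latency after compilation

# Number of requests before the compile cost pays for itself.
break_even = compile_s * 1000 / (eager_ms - compiled_ms)
print(f"break-even after ~{break_even:.0f} requests")

# Total time for 1 million requests under each approach.
n = 1_000_000
eager_total_s = n * eager_ms / 1000
compiled_total_s = compile_s + n * compiled_ms / 1000
print(f"eager: {eager_total_s:.0f}s  compiled: {compiled_total_s:.0f}s")
```

With these made-up numbers, the compile cost is repaid after a few thousand requests, and everything after that is pure savings.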

Okay, so we executed a PyTorch model. It wasn't meant to blow your mind. It's just an API example that shows how to use our Python API to load and run a model, and make sure it all works.

However, once compilation is done, MAX Engine executes models very fast, without any changes to the models themselves. To see how MAX Engine compares when executing different models on different CPU architectures, see our performance dashboard.

Figure 1. MAX Engine latency speed-up when running Mistral-7B vs PyTorch (MAX Engine is 2.5x faster).

But seeing is believing. So, we created a program that compares our performance to PyTorch.

3. Run the performance showcase

The premise for this program is simple: It runs the same model (downloaded from HuggingFace) in PyTorch and MAX Engine, and measures the average execution time over several inferences.
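Measuring QPS this way is straightforward: time a fixed number of inference calls and divide. A minimal sketch with a stand-in workload (the sleep simulates a model's execute call; the real showcase times actual inference):

```python
import time

def fake_inference():
    # Stand-in for a real model execution; sleeps ~10 ms.
    time.sleep(0.01)

n = 20
start = time.perf_counter()
for _ in range(n):
    fake_inference()
elapsed = time.perf_counter() - start

qps = n / elapsed  # queries per second; higher is better
print(f"QPS: {qps:.2f}")
```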

Let's go!

  1. Starting again from where you cloned the repo, change directories and install the requirements:

    cd max/examples/performance-showcase
    python3 -m pip install -r requirements.txt
  2. Now start the showcase by specifying the model to run:

    python3 -m roberta

Again, this might take a few minutes to compile the first time.

When it's done, you'll see the inference queries per second (QPS; higher is better) listed for each runtime, like this (results vary based on hardware):

Running with PyTorch
.............................................................. QPS: 18.41

Running with MAX Engine
Compiling model.
.............................................................. QPS: 33.11
MAX Performance

There are no tricks here! (See the code for yourself.) MAX Engine wins because our compiler uses next-generation technology to optimize the graph and extract more performance, without any loss of accuracy. And performance will only improve in future versions! If you got slow results, see this answer.

To start using MAX Engine in your own project, just drop in the MAX Engine API and start calling it for each inference request. For details, see how to run inference with Python or with C.

But, maybe you're thinking we're showing only the models that make us look good here. Well, see for yourself by benchmarking any model!

4. Benchmark any model

With the benchmark tool, you can benchmark any compatible model with an MLPerf scenario. It runs the model several times with generated inputs (or inputs you provide), and prints the performance results.

For example, here’s how to benchmark an example model from HuggingFace:

  1. Download the model with this script in our GitHub repo:

    cd max/examples/tools/common/resnet50-pytorch
    bash --output resnet50.torchscript
  2. Then benchmark the model:

    max benchmark resnet50.torchscript --input-data-schema=input-spec.yaml

This compiles the model, runs it several times, and prints the benchmark results. (Again, it might take a few minutes to compile the model before benchmarking it.)

The output is rather long, so this is just part of what you should see (your results will differ based on hardware):

Additional Stats
QPS w/ loadgen overhead : 44.024
QPS w/o loadgen overhead : 44.048

Min latency (ns) : 21909338
Max latency (ns) : 24319980
Mean latency (ns) : 22702682
50.00 percentile latency (ns) : 22698762
90.00 percentile latency (ns) : 23095239
95.00 percentile latency (ns) : 23212431
97.00 percentile latency (ns) : 23325674
99.00 percentile latency (ns) : 23489326
99.90 percentile latency (ns) : 24319980
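Stats like these can be reproduced from raw latency samples with NumPy. A sketch using made-up nanosecond measurements (not output from the tool):

```python
import numpy as np

# Made-up per-query latencies in nanoseconds.
samples_ns = np.array([21_909_338, 22_698_762, 22_702_682,
                       23_095_239, 23_489_326, 24_319_980])

print(f"Min latency (ns)  : {samples_ns.min()}")
print(f"Max latency (ns)  : {samples_ns.max()}")
print(f"Mean latency (ns) : {samples_ns.mean():.0f}")
for p in (50, 90, 95, 99):
    print(f"{p:.2f} percentile latency (ns) : {np.percentile(samples_ns, p):.0f}")
```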

Now try benchmarking your own model! Just be sure it's in one of our supported model formats.

Also be aware that the benchmark tool needs to know the model's input shapes so it can generate inputs, and not all models provide input shape metadata. If your model doesn't include that metadata, then you need to specify the input shapes. Or, you can provide your own input data in a NumPy file. Learn more in the benchmark guide.
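Creating such a NumPy file is just a matter of saving an array with the shape and dtype your model expects. A sketch using a shape typical of ResNet-50 image input (check the benchmark guide for the exact file layout the tool expects):

```python
import numpy as np

# One batch of random image-like data: NCHW, float32 — a common ResNet-50 input shape.
inputs = np.random.rand(1, 3, 224, 224).astype(np.float32)
np.save("inputs.npy", inputs)

# Sanity-check what we wrote.
loaded = np.load("inputs.npy")
print(loaded.shape, loaded.dtype)
# → (1, 3, 224, 224) float32
```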

Share Feedback

We’d love to hear about your experience benchmarking other models. If you run into any issues, let us know. For details about known issues and the features we're working on, see the roadmap and known issues.

Next steps

There's still much more you can do with MAX. Check out some of these other docs to learn more:

And this is just the beginning! In the coming months, we'll add support for GPU hardware, more extensibility APIs, and more solutions for production deployment with MAX.

Join the discussion

Get in touch with other MAX developers, ask questions, and share feedback on Discord and GitHub.