Get started with MAX

Welcome to the MAX quickstart guide!

Within a matter of minutes, you’ll install MAX and run inference with some code examples. Along the way, you'll see how MAX Engine performance compares to the stock PyTorch and ONNX runtimes, and learn how to benchmark your own models from the command line.

Preview release

We're excited to share this preview version of MAX! For details about what's included, see the MAX changelog, and for details about what's yet to come, see the roadmap and known issues.

1. Install MAX

See the MAX install guide.

2. Run your first model

Let's start with something simple, similar to a "Hello world," just to make sure MAX is installed and working.

First, clone the code examples:

    git clone
Nightly branch

If you installed the nightly build, make sure to checkout the nightly branch:

    (cd max && git checkout nightly)

Now let's run inference using a TorchScript model and our Python API. We'll start with a version of BERT that's trained to predict the masked words in a sentence.

  1. Starting from where you cloned the repo, go into the example and install the Python requirements:

    cd max/examples/inference/bert-python-torchscript
    python3 -m pip install -r requirements.txt
  2. Download and run the model with this script:


    This script downloads the BERT model and runs it with some input text.

You should see results like this:

input text: Paris is the [MASK] of France.
filled mask: Paris is the capital of France.

Cool, it works! (If it didn't work, let us know.)
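Conceptually, "filling the mask" means the model emits a score for every vocabulary token at the masked position, and the highest-scoring token wins. Here's a toy sketch of that final step (the tiny vocabulary and logits are invented for illustration, not real BERT outputs):

```python
import numpy as np

# Hypothetical miniature vocabulary and model logits for the [MASK] slot.
vocab = ["paris", "capital", "city", "country", "france"]
mask_logits = np.array([0.1, 4.2, 2.7, 0.8, 1.3])

# The predicted token is simply the argmax over the vocabulary scores.
predicted = vocab[int(np.argmax(mask_logits))]

template = "Paris is the [MASK] of France."
print(template.replace("[MASK]", predicted))
# → Paris is the capital of France.
```

The real model scores tens of thousands of tokens, but the selection logic is the same.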

Compile time

The first time you run an example, it will take some time to compile the model. This might seem strange if you're used to "eager execution" in ML frameworks, but this is where MAX Engine optimizes the graph to deliver more performance. This happens only when you load the model, and it's an up-front cost that pays dividends with major latency savings at run time.
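The trade-off is easy to see with back-of-the-envelope arithmetic: a one-time compile cost is quickly amortized once per-request latency drops. A sketch with purely hypothetical numbers (not measured results):

```python
# Hypothetical numbers, purely for illustration.
compile_s = 60.0     # one-time compilation cost, in seconds
eager_ms = 54.0      # per-request latency without compilation
compiled_ms = 30.0   # per-request latency after compilation

# Number of requests before the compile cost pays for itself.
break_even = compile_s * 1000 / (eager_ms - compiled_ms)
print(f"break-even after ~{break_even:.0f} requests")

# Total time for 1 million requests under each approach.
n = 1_000_000
eager_total_s = n * eager_ms / 1000
compiled_total_s = compile_s + n * compiled_ms / 1000
print(f"eager: {eager_total_s:.0f}s  compiled: {compiled_total_s:.0f}s")
```

With these made-up numbers, the compile cost is repaid after a few thousand requests, and everything after that is pure savings.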

Okay, so we executed a PyTorch model. It wasn't meant to blow your mind. It's just an API example that shows how to use our Python API to load and run a model, and make sure it all works.

However, once compilation is done, MAX Engine executes models very fast, without any changes to the models themselves. To see how MAX Engine compares when executing different models on different CPU architectures, see our performance dashboard.

Figure 1. MAX Engine latency speed-up when running Mistral-7B vs PyTorch (MAX Engine is 2.5x faster).

But seeing is believing. So, we created a program that compares our performance to PyTorch.

3. Run the performance showcase

The premise for this program is simple: It runs the same model (downloaded from HuggingFace) in PyTorch and MAX Engine, and measures the average execution time over several inferences.
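Measuring QPS this way is straightforward: time a fixed number of inference calls and divide. A minimal sketch with a stand-in workload (the sleep simulates a model's execute call; the real showcase times actual inference):

```python
import time

def fake_inference():
    # Stand-in for a real model execution; sleeps ~10 ms.
    time.sleep(0.01)

n = 20
start = time.perf_counter()
for _ in range(n):
    fake_inference()
elapsed = time.perf_counter() - start

qps = n / elapsed  # queries per second; higher is better
print(f"QPS: {qps:.2f}")
```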

Let's go!

  1. Starting again from where you cloned the repo, change directories and install the requirements:

    cd max/examples/performance-showcase
    python3 -m pip install -r requirements.txt
  2. Now start the showcase by specifying the model to run:

    python3 -m roberta

Again, this might take a few minutes to compile the first time.

When it's done, you'll see the inference queries per second (QPS; higher is better) listed for each runtime, like this (results vary based on hardware):

Running with PyTorch
.............................................................. QPS: 18.41

Running with MAX Engine
Compiling model.
.............................................................. QPS: 33.11
MAX Performance

There are no tricks here! (See the code for yourself.) MAX Engine wins because our compiler uses next-generation technology to optimize the graph and extract more performance, without any loss of accuracy. And performance will only improve in future versions! If you got slow results, see this answer.

To start using MAX Engine in your own project, just drop in the MAX Engine API and start calling it for each inference request. For details, see how to run inference with Python or with C.

But, maybe you're thinking we're showing only the models that make us look good here. Well, see for yourself by benchmarking any model!

4. Benchmark any model

With the benchmark tool, you can benchmark any compatible model with an MLPerf scenario. It runs the model several times with generated inputs (or inputs you provide), and prints the performance results.

For example, here’s how to benchmark an example model from HuggingFace:

  1. Download the model with this script in our GitHub repo:

    cd max/examples/tools/common/resnet50-pytorch
    bash --output resnet50.torchscript
  2. Then benchmark the model:

    max benchmark resnet50.torchscript --input-data-schema=input-spec.yaml

This compiles the model, runs it several times, and prints the benchmark results. (Again, it might take a few minutes to compile the model before benchmarking it.)

The output is rather long, so this is just part of what you should see (your results will differ based on hardware):

Additional Stats
QPS w/ loadgen overhead : 44.024
QPS w/o loadgen overhead : 44.048

Min latency (ns) : 21909338
Max latency (ns) : 24319980
Mean latency (ns) : 22702682
50.00 percentile latency (ns) : 22698762
90.00 percentile latency (ns) : 23095239
95.00 percentile latency (ns) : 23212431
97.00 percentile latency (ns) : 23325674
99.00 percentile latency (ns) : 23489326
99.90 percentile latency (ns) : 24319980
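Stats like these can be reproduced from raw latency samples with NumPy. A sketch using made-up nanosecond measurements (not output from the tool):

```python
import numpy as np

# Made-up per-query latencies in nanoseconds.
samples_ns = np.array([21_909_338, 22_698_762, 22_702_682,
                       23_095_239, 23_489_326, 24_319_980])

print(f"Min latency (ns)  : {samples_ns.min()}")
print(f"Max latency (ns)  : {samples_ns.max()}")
print(f"Mean latency (ns) : {samples_ns.mean():.0f}")
for p in (50, 90, 95, 99):
    print(f"{p:.2f} percentile latency (ns) : {np.percentile(samples_ns, p):.0f}")
```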

Now try benchmarking your own model! Just be sure it's in one of our supported model formats.

Also be aware that the benchmark tool needs to know the model's input shapes so it can generate inputs, and not all models provide input shape metadata. If your model doesn't include that metadata, then you need to specify the input shapes. Or, you can provide your own input data in a NumPy file. Learn more in the benchmark guide.
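Creating such a NumPy file is just a matter of saving an array with the shape and dtype your model expects. A sketch using a shape typical of ResNet-50 image input (check the benchmark guide for the exact file layout the tool expects):

```python
import numpy as np

# One batch of random image-like data: NCHW, float32 — a common ResNet-50 input shape.
inputs = np.random.rand(1, 3, 224, 224).astype(np.float32)
np.save("inputs.npy", inputs)

# Sanity-check what we wrote.
loaded = np.load("inputs.npy")
print(loaded.shape, loaded.dtype)
# → (1, 3, 224, 224) float32
```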

Share Feedback

We’d love to hear about your experience benchmarking other models. If you run into any issues, let us know. For details about known issues and the features we're working on, see the roadmap and known issues.

Next steps

There's still much more you can do with MAX. Check out some of these other docs to learn more:

And this is just the beginning! In the coming months, we'll add support for GPU hardware, more extensibility APIs, and more solutions for production deployment with MAX.

Join the discussion

Get in touch with other MAX developers, ask questions, and share feedback on Discord and GitHub.