Run inference with C

Our C API lets you integrate MAX Engine into your high-performance application code and run inference with models from PyTorch and ONNX.

This page shows how to use the MAX Engine C API to load a model and execute it with MAX Engine.

Try it

For a complete code example, check out this TorchScript example using C.

Create a runtime context

The first thing you need is an M_RuntimeContext, which is an application-level object that sets up resources such as the thread pool and allocators used during inference. We recommend you create one context and use it throughout your application.

To create an M_RuntimeContext, you need two other objects:

M_RuntimeConfig: This configures details about the runtime context such as the number of threads to use and the logging level.
M_Status: This is the object through which MAX Engine passes all error messages.

Here's how you can create both of these objects and then create the M_RuntimeContext:

M_Status *status = M_newStatus();
M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Notice that this code uses M_isError() to check whether the M_Status object holds an error, and exits if it does.

Compile the model

Now you can compile your PyTorch or ONNX model.

Note: PyTorch models must be in TorchScript format.

Generally, you do that by passing your model path to M_setModelPath(), along with an M_CompileConfig object, and then calling M_compileModel().

However, the MAX Engine compiler needs to know the model input shapes, which are not specified in a TorchScript file (they are specified in TF SavedModel and ONNX files). So, you need some extra code if you're loading a TorchScript model, as shown in the following PyTorch tab.

PyTorch
ONNX

If you're using a PyTorch model (it must be in TorchScript format), the M_CompileConfig needs the model path, via M_setModelPath(), and the input specs (shape, rank, and types), via M_setTorchInputSpecs().

Note: You must specify all input shapes, but a shape can be dynamic. Use M_getDynamicDimensionValue() for any dimension whose size is dynamic. For more detail, see M_newTorchInputSpec().

Here's an abbreviated example:

// Set the model path
M_CompileConfig *compileConfig = M_newCompileConfig();
M_setModelPath(compileConfig, /*path=*/modelPath);

// Create torch input specs
int64_t *inputIdsShape =
    (int64_t *)readFileOrExit("inputs/input_ids_shape.bin");
M_TorchInputSpec *inputIdsInputSpec =
    M_newTorchInputSpec(inputIdsShape, /*dimNames=*/NULL, /*rankSize=*/2,
                        /*dtype=*/M_INT32, status);

// ... Similar code here to also create M_TorchInputSpec for
//     attentionMaskInputSpec and tokenTypeIdsInputSpec

// Set the input specs
M_TorchInputSpec *inputSpecs[3] = {inputIdsInputSpec, attentionMaskInputSpec,
                               tokenTypeIdsInputSpec};
M_setTorchInputSpecs(compileConfig, inputSpecs, 3);

// Compile the model
M_AsyncCompiledModel *compiledModel =
    M_compileModel(context, &compileConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Because the TorchScript model does not include metadata about the input specs, this code loads the input shapes from .bin files that were generated earlier. You can see an example of how to generate these files in our download-model.py script for bert-c-torchscript on GitHub.
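The snippets on this page call readFileOrExit() without defining it. As a rough sketch of what such a helper might look like (this is an assumption for illustration, not necessarily the implementation used in the GitHub example), the following reads a whole binary file into a heap buffer. The readFile() variant and its sizeOut parameter are hypothetical additions that make the byte count available to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: reads a whole binary file into a malloc'd buffer.
// Returns NULL on failure. On success, stores the number of bytes read in
// *sizeOut (when sizeOut is not NULL). The caller must free() the buffer.
char *readFile(const char *path, size_t *sizeOut) {
  assert(path != NULL);
  FILE *file = fopen(path, "rb");
  if (file == NULL)
    return NULL;

  // Determine the file size by seeking to the end, then rewind.
  long size = -1;
  if (fseek(file, 0, SEEK_END) == 0)
    size = ftell(file);
  if (size < 0 || fseek(file, 0, SEEK_SET) != 0) {
    fclose(file);
    return NULL;
  }

  // Allocate one extra byte so an empty file still gets a valid pointer.
  char *buffer = (char *)malloc((size_t)size + 1);
  if (buffer == NULL) {
    fclose(file);
    return NULL;
  }

  size_t bytesRead = fread(buffer, 1, (size_t)size, file);
  fclose(file);
  if (bytesRead != (size_t)size) {
    free(buffer);
    return NULL;
  }

  if (sizeOut != NULL)
    *sizeOut = bytesRead;
  return buffer;
}

// Same as readFile(), but prints a message and exits if the read fails.
char *readFileOrExit(const char *path) {
  char *buffer = readFile(path, NULL);
  if (buffer == NULL) {
    fprintf(stderr, "failed to read file: %s\n", path);
    exit(EXIT_FAILURE);
  }
  return buffer;
}
```

The returned buffer is owned by the caller, which is why the snippets on this page cast the result to the element type they expect and are responsible for calling free() on it later.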

If you're using an ONNX model, the M_CompileConfig needs just the model path, via M_setModelPath(). Then, you can call M_compileModel():

// Set the model path
M_CompileConfig *compileConfig = M_newCompileConfig();
M_setModelPath(compileConfig, /*path=*/modelPath);

// Compile the model
M_AsyncCompiledModel *compiledModel =
    M_compileModel(context, &compileConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

MAX Engine now begins compiling the model asynchronously; M_compileModel() returns immediately. Note that an M_CompileConfig can only be used for a single compilation call. Any subsequent calls require a new M_CompileConfig.

Initialize the model

The M_AsyncCompiledModel returned by M_compileModel() is not ready for inference yet. You now need to initialize the model by calling M_initModel(), which returns an instance of M_AsyncModel.

This step prepares the compiled model for fast execution by running and initializing some of the graph operations that are input-independent.

M_AsyncModel *model = M_initModel(context, compiledModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

You don't need to wait for compilation to finish before calling M_initModel(), because M_initModel() waits for it internally. If you want to wait explicitly, add a call to M_waitForCompilation() before you call M_initModel(). This is the general pattern followed by all MAX Engine APIs that accept an asynchronous value as an argument.

M_initModel() is also asynchronous and returns immediately. If you want to wait for it to finish, add a call to M_waitForModel().

Prepare input tensors

The last step before you run an inference is to add each input tensor to a single M_AsyncTensorMap. You add each input by calling M_borrowTensorInto(), passing it the input tensor and the corresponding tensor specification (shape, type, and so on) as an M_TensorSpec.

// Define the tensor spec
int64_t *inputIdsShape =
    (int64_t *)readFileOrExit("inputs/input_ids_shape.bin");
M_TensorSpec *inputIdsSpec =
    M_newTensorSpec(inputIdsShape, /*rankSize=*/2, /*dtype=*/M_INT32,
                    /*tensorName=*/"input_ids");
free(inputIdsShape);

// Create the tensor map
M_AsyncTensorMap *inputToModel = M_newAsyncTensorMap(context);
// Add an input to the tensor map
int32_t *inputIdsTensor = (int32_t *)readFileOrExit("inputs/input_ids.bin");
M_borrowTensorInto(inputToModel, inputIdsTensor, inputIdsSpec, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
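In the call above, M_borrowTensorInto() receives only a raw pointer and an M_TensorSpec, so it can be worth confirming that the buffer you loaded really holds as many bytes as the shape implies before handing it over. The following is a minimal sketch of such a check in plain C. The helper names shapeNumElements() and bufferMatchesShape() are hypothetical and not part of the MAX Engine API, and the sketch assumes you know the buffer's size in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: number of elements implied by a fully static shape.
// Returns 0 if any dimension is not a positive size, in which case the
// element count cannot be determined from the shape alone.
size_t shapeNumElements(const int64_t *shape, size_t rank) {
  assert(shape != NULL || rank == 0);
  size_t count = 1;
  for (size_t i = 0; i < rank; ++i) {
    if (shape[i] <= 0)
      return 0;
    count *= (size_t)shape[i];
  }
  return count;
}

// Hypothetical helper: returns 1 if a buffer of numBytes bytes holds exactly
// one tensor of the given shape and element size, and 0 otherwise.
int bufferMatchesShape(size_t numBytes, const int64_t *shape, size_t rank,
                       size_t elementSize) {
  size_t count = shapeNumElements(shape, rank);
  return count != 0 && numBytes == count * elementSize;
}
```

For the input_ids example, you would pass the loaded shape, a rank of 2, and sizeof(int32_t) as the element size to match the M_INT32 dtype in the spec.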

Run an inference

Now you're ready to run an inference with M_executeModelSync():

M_AsyncTensorMap *outputs =
    M_executeModelSync(context, model, inputToModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Process the output

The output is returned in an M_AsyncTensorMap, and you can get individual outputs from it with M_getTensorByNameFrom().

M_AsyncTensor *logits =
    M_getTensorByNameFrom(outputs,
                          /*tensorName=*/"logits", status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

If you don't know the tensor name, you can get it from M_getTensorNameAt().
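What you do with an output tensor depends on your model. As one illustration, a classification model's logits are often reduced to a predicted class by taking the index of the largest value. The sketch below shows that step on a plain float array. It assumes the logits are 32-bit floats and that you have already obtained a data pointer and element count from the M_AsyncTensor (we believe the tensor API provides M_getTensorData() and M_getTensorNumElements() for this, but check the tensor reference). The argmax() helper itself is hypothetical and not part of the MAX Engine API.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: index of the largest value in a float array.
// If several values tie for the maximum, the first one wins.
size_t argmax(const float *values, size_t count) {
  assert(values != NULL && count > 0);
  size_t best = 0;
  for (size_t i = 1; i < count; ++i) {
    if (values[i] > values[best])
      best = i;
  }
  return best;
}
```

If your model produces one row of logits per input, you would call this once per row, offsetting the pointer by the row length each time.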

Clean up

That's it! Don't forget to free every object you created. See the types reference to find the free function for each type.

For more example code, see our GitHub repo.
