Azure AI Foundry Models allows customers to consume the most powerful models from flagship model providers using a single endpoint and credentials. This means that you can switch between models and consume them from your application without changing a single line of code.
This article explains how models are organized inside the service and how to use the inference endpoint to invoke them.
Deployments
Azure AI Foundry makes models available through the concept of deployments. A deployment gives a model a name under a specific configuration. You can then invoke that model configuration by indicating its name in your requests.
Deployments capture:
- A model name
- A model version
- A provisioning/capacity type¹
- A content filtering configuration¹
- A rate limiting configuration¹

¹ Configurations may vary depending on the selected model.
An Azure AI Foundry resource can have as many model deployments as needed, and they don't incur cost unless inference is performed against them. Deployments are Azure resources, and hence they're subject to Azure policies.
To learn more about how to create deployments, see Add and configure model deployments.
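For illustration only, the following sketch creates a deployment programmatically with the `azure-mgmt-cognitiveservices` management SDK; the model format, version, and SKU values are placeholders rather than values confirmed by this article:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment,
    DeploymentModel,
    DeploymentProperties,
    Sku,
)

# Management-plane client, scoped to your subscription.
client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# Create (or update) a deployment named "Mistral-large" on the resource.
# The format, version, and SKU below are illustrative placeholders; check
# the model catalog for the values that apply to your model.
poller = client.deployments.begin_create_or_update(
    resource_group_name="<resource-group>",
    account_name="<resource-name>",
    deployment_name="Mistral-large",
    deployment=Deployment(
        properties=DeploymentProperties(
            model=DeploymentModel(
                format="Mistral AI",
                name="Mistral-large",
                version="<model-version>",
            ),
        ),
        sku=Sku(name="GlobalStandard", capacity=1),
    ),
)
poller.result()  # blocks until provisioning completes
```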
Endpoints
Azure AI Foundry Services (formerly known as Azure AI Services) expose multiple endpoints depending on the type of work you're looking for:
- Azure AI inference endpoint (usually with the form `https://<resource-name>.services.ai.azure.com/models`)
- Azure OpenAI endpoint (usually with the form `https://<resource-name>.openai.azure.com`)
The Azure AI inference endpoint allows customers to use a single endpoint with the same authentication and schema to generate inference for the deployed models in the resource. All the models support this capability. This endpoint follows the Azure AI Model Inference API.
The Azure OpenAI API exposes the full capabilities of OpenAI models and supports more features like assistants, threads, files, and batch inference. Non-OpenAI models might also be exposed through this route.
To learn more about how to use the Azure OpenAI endpoint, see the Azure OpenAI in Azure AI Foundry Models documentation.
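As a rough sketch of the difference, an OpenAI model deployment can be consumed through this endpoint with the `openai` Python package; the API version and deployment name below are placeholders:

```python
import os
from openai import AzureOpenAI

# Targets the Azure OpenAI endpoint of the resource,
# not the /models inference endpoint.
client = AzureOpenAI(
    azure_endpoint="https://<resource-name>.openai.azure.com",
    api_key=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    api_version="2024-10-21",  # placeholder; use a version supported by your resource
)

response = client.chat.completions.create(
    model="<deployment-name>",  # an Azure OpenAI model deployment
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```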
Using Azure AI inference endpoint
The inference endpoint routes requests to a given deployment by matching the parameter `model` inside the request to the name of the deployment. This means that deployments work as an alias of a given model under certain configurations. This flexibility allows you to deploy a given model multiple times in the service, but under different configurations if needed.
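To illustrate, suppose the resource has two hypothetical deployments named `mistral-large` and `mistral-small`. The same client and credentials can target either one by changing only the `model` value per request; a minimal sketch using the Python package shown later in this article:

```python
import os
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]),
)

# Both deployments are served from the same endpoint;
# only the "model" value selects which one handles the request.
for deployment_name in ["mistral-large", "mistral-small"]:
    response = client.complete(
        messages=[UserMessage(content="Say hello in one word.")],
        model=deployment_name,
    )
    print(deployment_name, "->", response.choices[0].message.content)
```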
For example, if you create a deployment named `Mistral-large`, you can invoke it as follows:
Python

Install the package `azure-ai-inference` using your package manager, like pip:

```bash
pip install azure-ai-inference
```
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions:
```python
import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]),
)
```
Explore our samples and read the API reference documentation to get yourself started.
JavaScript

Install the package `@azure-rest/ai-inference` using npm:

```bash
npm install @azure-rest/ai-inference
```
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions:
```javascript
import ModelClient from "@azure-rest/ai-inference";
import { isUnexpected } from "@azure-rest/ai-inference";
import { AzureKeyCredential } from "@azure/core-auth";

const client = new ModelClient(
    "https://<resource>.services.ai.azure.com/models",
    new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL)
);
```
Explore our samples and read the API reference documentation to get yourself started.
C#

Install the Azure AI inference library with the following command:

```bash
dotnet add package Azure.AI.Inference --prerelease
```
Import the following namespaces:
```csharp
using Azure;
using Azure.Identity;
using Azure.AI.Inference;
```
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions:
```csharp
ChatCompletionsClient client = new ChatCompletionsClient(
    new Uri("https://<resource>.services.ai.azure.com/models"),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_INFERENCE_CREDENTIAL"))
);
```
Explore our samples and read the API reference documentation to get yourself started.
Java

Add the package to your project:

```xml
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-ai-inference</artifactId>
    <version>1.0.0-beta.1</version>
</dependency>
```
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions:
```java
ChatCompletionsClient client = new ChatCompletionsClientBuilder()
    .credential(new AzureKeyCredential("{key}"))
    .endpoint("https://<resource>.services.ai.azure.com/models")
    .buildClient();
```
Explore our samples and read the API reference documentation to get yourself started.
REST

Use the reference section to explore the API design and which parameters are available. For example, the reference section for Chat completions details how to use the route `/chat/completions` to generate predictions based on chat-formatted instructions. Notice that the path `/models` is appended to the root of the URL:

Request

```http
POST https://<resource>.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview
api-key: <api-key>
Content-Type: application/json
```
For a chat model, you can create a request as follows:
Python

```python
from azure.ai.inference.models import SystemMessage, UserMessage

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Explain Riemann's conjecture in 1 paragraph"),
    ],
    model="mistral-large"
)

print(response.choices[0].message.content)
```
JavaScript

```javascript
var messages = [
    { role: "system", content: "You are a helpful assistant" },
    { role: "user", content: "Explain Riemann's conjecture in 1 paragraph" },
];

var response = await client.path("/chat/completions").post({
    body: {
        messages: messages,
        model: "mistral-large"
    }
});

// Surface service errors instead of reading an error payload as a result.
if (isUnexpected(response)) {
    throw response.body.error;
}

console.log(response.body.choices[0].message.content)
```
C#

```csharp
var requestOptions = new ChatCompletionsOptions()
{
    Messages = {
        new ChatRequestSystemMessage("You are a helpful assistant."),
        new ChatRequestUserMessage("Explain Riemann's conjecture in 1 paragraph")
    },
    Model = "mistral-large"
};

Response<ChatCompletions> response = client.Complete(requestOptions);
Console.WriteLine($"Response: {response.Value.Content}");
```
Java

```java
List<ChatRequestMessage> chatMessages = new ArrayList<>();
chatMessages.add(new ChatRequestSystemMessage("You are a helpful assistant"));
chatMessages.add(new ChatRequestUserMessage("Explain Riemann's conjecture in 1 paragraph"));

ChatCompletions chatCompletions = client.complete(
    new ChatCompletionsOptions(chatMessages).setModel("mistral-large"));

for (ChatChoice choice : chatCompletions.getChoices()) {
    ChatResponseMessage message = choice.getMessage();
    System.out.println("Response:" + message.getContent());
}
```
REST

Request

```http
POST https://<resource>.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview
api-key: <api-key>
Content-Type: application/json
```

```json
{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": "Explain Riemann's conjecture in 1 paragraph"
        }
    ],
    "model": "mistral-large"
}
```
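If you'd rather call the endpoint without an SDK, the same request can be sent with any HTTP client. A minimal sketch using Python's `requests` library, mirroring the request above:

```python
import os
import requests

url = "https://<resource>.services.ai.azure.com/models/chat/completions"

response = requests.post(
    url,
    params={"api-version": "2024-05-01-preview"},
    headers={
        "api-key": os.environ["AZURE_INFERENCE_CREDENTIAL"],
        "Content-Type": "application/json",
    },
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "Explain Riemann's conjecture in 1 paragraph"},
        ],
        "model": "mistral-large",
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```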
If you specify a model name that doesn't match any model deployment, you get an error indicating that the model doesn't exist. You can control which models are available to users by creating model deployments, as explained in Add and configure model deployments.
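As a sketch of what that looks like from the Python client created earlier (`not-deployed` is a hypothetical name with no matching deployment):

```python
from azure.core.exceptions import HttpResponseError
from azure.ai.inference.models import UserMessage

try:
    response = client.complete(
        messages=[UserMessage(content="Hello")],
        model="not-deployed",  # hypothetical: no deployment has this name
    )
except HttpResponseError as exc:
    # The service rejects the request because no deployment matches the name.
    print(f"Request failed ({exc.status_code}): {exc.message}")
```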
Key-less authentication
Models deployed to Azure AI Foundry Models in Azure AI Services support key-less authorization using Microsoft Entra ID. Key-less authorization enhances security, simplifies the user experience, reduces operational complexity, and provides robust compliance support for modern development. These qualities make it a strong choice for organizations adopting secure and scalable identity management solutions.
To use key-less authentication, configure your resource and grant access to users to perform inference. Once configured, you can authenticate as follows:
Python

Install the package `azure-ai-inference` using your package manager, like pip:

```bash
pip install azure-ai-inference
```
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions with Entra ID:
```python
import os
from azure.ai.inference import ChatCompletionsClient
from azure.identity import DefaultAzureCredential

client = ChatCompletionsClient(
    endpoint="https://<resource>.services.ai.azure.com/models",
    credential=DefaultAzureCredential(),
    credential_scopes=["https://cognitiveservices.azure.com/.default"],
)
```
JavaScript

Install the package `@azure-rest/ai-inference` using npm:

```bash
npm install @azure-rest/ai-inference
```
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions with Entra ID:
```javascript
import ModelClient from "@azure-rest/ai-inference";
import { isUnexpected } from "@azure-rest/ai-inference";
import { DefaultAzureCredential } from "@azure/identity";

const clientOptions = {
    credentials: { scopes: ["https://cognitiveservices.azure.com/.default"] }
};

const client = new ModelClient(
    "https://<resource>.services.ai.azure.com/models",
    new DefaultAzureCredential(),
    clientOptions,
);
```
C#

Install the Azure AI inference library with the following command:

```bash
dotnet add package Azure.AI.Inference --prerelease
```
Install the `Azure.Identity` package:

```bash
dotnet add package Azure.Identity
```
Import the following namespaces:
```csharp
using Azure;
using Azure.Identity;
using Azure.AI.Inference;
```
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions with Entra ID:
```csharp
TokenCredential credential = new DefaultAzureCredential();
AzureAIInferenceClientOptions clientOptions = new AzureAIInferenceClientOptions();
BearerTokenAuthenticationPolicy tokenPolicy = new BearerTokenAuthenticationPolicy(
    credential, new string[] { "https://cognitiveservices.azure.com/.default" });
clientOptions.AddPolicy(tokenPolicy, HttpPipelinePosition.PerRetry);

ChatCompletionsClient client = new ChatCompletionsClient(
    new Uri("https://<resource>.services.ai.azure.com/models"),
    credential,
    clientOptions
);
```
Java

Add the packages to your project:

```xml
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-ai-inference</artifactId>
    <version>1.0.0-beta.4</version>
</dependency>
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-identity</artifactId>
    <version>1.15.3</version>
</dependency>
```
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions with Entra ID:
```java
TokenCredential defaultCredential = new DefaultAzureCredentialBuilder().build();

ChatCompletionsClient client = new ChatCompletionsClientBuilder()
    .credential(defaultCredential)
    .endpoint("https://<resource>.services.ai.azure.com/models")
    .buildClient();
```
Explore our samples and read the API reference documentation to get yourself started.
REST

Use the reference section to explore the API design and which parameters are available, and indicate the authentication token in the `Authorization` header. For example, the reference section for Chat completions details how to use the route `/chat/completions` to generate predictions based on chat-formatted instructions. Notice that the path `/models` is appended to the root of the URL:

Request

```http
POST https://<resource>.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview
Authorization: Bearer <bearer-token>
Content-Type: application/json
```
Tokens have to be issued with the scope `https://cognitiveservices.azure.com/.default`.
For testing purposes, the easiest way to get a valid token for your user account is to use the Azure CLI. In a console, run the following Azure CLI command:
```bash
az account get-access-token --resource https://cognitiveservices.azure.com --query "accessToken" --output tsv
```
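Equivalently in code, you can obtain the token with the `azure-identity` package and attach it to requests yourself; a minimal sketch using Python's `requests` library:

```python
import requests
from azure.identity import DefaultAzureCredential

# Acquire a token for the required scope.
token = DefaultAzureCredential().get_token(
    "https://cognitiveservices.azure.com/.default"
)

response = requests.post(
    "https://<resource>.services.ai.azure.com/models/chat/completions",
    params={"api-version": "2024-05-01-preview"},
    headers={
        "Authorization": f"Bearer {token.token}",
        "Content-Type": "application/json",
    },
    json={
        "messages": [{"role": "user", "content": "Hello"}],
        "model": "mistral-large",
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```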
Limitations
- Azure OpenAI Batch can't be used with the Foundry Models endpoint. You have to use the dedicated deployment URL, as explained in Batch API support in the Azure OpenAI documentation.
- The real-time API isn't supported in the inference endpoint. Use the dedicated deployment URL.
Next steps