This tutorial guides you through deploying a FastAPI-based chatbot application integrated with the Phi-4 sidecar extension on Azure App Service. By following the steps, you'll learn how to set up a scalable web app, add an AI-powered sidecar for enhanced conversational capabilities, and test the chatbot's functionality.
Hosting your own small language model (SLM) offers several advantages:
- Full control over your data. Sensitive information isn't exposed to external services, which is critical for industries with strict compliance requirements.
- Self-hosted models can be fine-tuned to meet specific use cases or ___domain-specific requirements.
- Minimized network latency and faster response times for a better user experience.
- Full control over resource allocation, ensuring optimal performance for your application.
Prerequisites
- An Azure account with an active subscription.
- A GitHub account.
Deploy the sample application
In the browser, navigate to the sample application repository.
Start a new Codespace from the repository.
Open the terminal in the Codespace and sign in with your Azure account:
az login
Run the following commands to deploy the application:
cd use_sidecar_extension/fastapiapp
az webapp up --sku P3MV3
az webapp config set --startup-file "gunicorn -w 4 -k uvicorn.workers.UvicornWorker app.main:app"
This startup command is a common setup for deploying FastAPI applications to Azure App Service. For more information, see Quickstart: Deploy a Python (Django, Flask, or FastAPI) web app to Azure App Service.
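The app.main:app argument in the Gunicorn command implies that the FastAPI instance is exposed as app inside the app/main.py module. The following is a minimal sketch of such a module; the health route is illustrative and not part of the sample:
# app/main.py -- minimal sketch; the sample's actual module defines more routes and wiring.
from fastapi import FastAPI

# Gunicorn resolves "app.main:app" to this object and serves it with UvicornWorker.
app = FastAPI()

@app.get("/health")
async def health():
    # Simple liveness probe, handy for verifying the App Service deployment.
    return {"status": "ok"}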
Add the Phi-4 sidecar extension
In this section, you add the Phi-4 sidecar extension to your FastAPI application hosted on Azure App Service.
- Navigate to the Azure portal and go to your app's management page.
- In the left-hand menu, select Deployment > Deployment Center.
- On the Containers tab, select Add > Sidecar extension.
- In the sidecar extension options, select AI: phi-4-q4-gguf (Experimental).
- Provide a name for the sidecar extension.
- Select Save to apply the changes.
- Wait a few minutes for the sidecar extension to deploy. Keep selecting Refresh until the Status column shows Running.
This Phi-4 sidecar extension exposes an OpenAI-compatible chat completions API at http://localhost:11434/v1/chat/completions. For more information on how to interact with the API, see the OpenAI API documentation.
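For example, you can call the endpoint directly from the web app container. The following is a minimal sketch using httpx; the payload follows the OpenAI chat completions convention, and the non-streamed request ("stream": False) is an assumption made for brevity:
# Quick manual test against the sidecar's OpenAI-style endpoint (sketch only).
import httpx

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."}
    ],
    "stream": False  # assumes the sidecar also supports non-streamed responses
}

response = httpx.post(
    "http://localhost:11434/v1/chat/completions",
    json=payload,
    timeout=60.0
)
print(response.json()["choices"][0]["message"]["content"])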
Test the chatbot
In your app's management page, in the left-hand menu, select Overview.
Under Default ___domain, select the URL to open your web app in a browser.
Verify that the chatbot application is running and responding to user inputs.
How the sample application works
The sample application demonstrates how to integrate a FastAPI-based service with the SLM sidecar extension. The SLMService class encapsulates the logic for sending requests to the SLM API and processing the streamed responses. This integration enables the application to generate conversational responses dynamically.
Looking in use_sidecar_extension/fastapiapp/app/services/slm_service.py, you see that:
The service sends a POST request to the SLM endpoint http://localhost:11434/v1/chat/completions:
self.api_url = 'http://localhost:11434/v1/chat/completions'
The POST payload includes the system message and the prompt that's built from the selected product and the user query.
request_payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
    "stream": True,
    "cache_prompt": False,
    "n_predict": 2048  # Increased token limit to allow longer responses
}
The POST request streams the response line by line. Each line is parsed to extract the generated content (or token).
async with httpx.AsyncClient() as client:
    async with client.stream(
        "POST",
        self.api_url,
        json=request_payload,
        headers={"Content-Type": "application/json"},
        timeout=30.0
    ) as response:
        async for line in response.aiter_lines():
            if not line or line == "[DONE]":
                continue
            if line.startswith("data: "):
                line = line.replace("data: ", "").strip()
            try:
                json_obj = json.loads(line)
                if "choices" in json_obj and len(json_obj["choices"]) > 0:
                    delta = json_obj["choices"][0].get("delta", {})
                    content = delta.get("content")
                    if content:
                        yield content
            except json.JSONDecodeError:
                # Ignore lines that aren't valid JSON
                continue
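On the FastAPI side, an async generator like this can be relayed to the browser with StreamingResponse. The following is a minimal sketch rather than the sample's exact route; the route path, the chat_completion method name, and the no-argument SLMService constructor are assumptions:
# Sketch of a route that relays SLM tokens to the client as they are generated.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

from app.services.slm_service import SLMService

app = FastAPI()
slm_service = SLMService()

@app.post("/chat")
async def chat(prompt: str):
    # StreamingResponse writes each yielded token to the HTTP response as it arrives.
    return StreamingResponse(slm_service.chat_completion(prompt), media_type="text/plain")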
Frequently asked questions
How does the pricing tier affect the performance of the SLM sidecar?
Since AI models consume considerable resources, choose the pricing tier that gives you sufficient vCPUs and memory to run your specific model. For this reason, the built-in AI sidecar extensions only appear when the app is in a suitable pricing tier. If you build your own SLM sidecar container, you should also use a CPU-optimized model, since the App Service pricing tiers are CPU-only tiers.
For example, the Phi-3 mini model with a 4K context length from Hugging Face is designed to run with limited resources and provides strong math and logical reasoning for many common scenarios. It also comes with a CPU-optimized version. In App Service, we tested the model on all premium tiers and found it to perform well in the P2mv3 tier or higher. If your requirements allow, you can run it on a lower tier.
How do I use my own SLM sidecar?
The sample repository contains a sample SLM container that you can use as a sidecar. It runs a FastAPI application that listens on port 8000, as specified in its Dockerfile. The application uses ONNX Runtime to load the Phi-3 model, then forwards the HTTP POST data to the model and streams the response from the model back to the client. For more information, see model_api.py.
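Conceptually, the container's API resembles the following sketch. The generate_tokens helper is a hypothetical stand-in for the ONNX Runtime generation loop, and the route path is illustrative; see model_api.py in the repository for the actual implementation:
# Conceptual sketch of an SLM sidecar API listening on port 8000.
from typing import Iterator

import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

def generate_tokens(prompt: str) -> Iterator[str]:
    # Placeholder: the real model_api.py loads the Phi-3 ONNX model and yields
    # generated tokens one at a time.
    yield from ["Example ", "streamed ", "response."]

@app.post("/")
async def complete(request: Request):
    body = await request.json()
    prompt = body.get("prompt", "")
    # Stream the generated tokens back to the caller as they are produced.
    return StreamingResponse(generate_tokens(prompt), media_type="text/plain")

if __name__ == "__main__":
    # Listen on port 8000, matching the port exposed in the Dockerfile.
    uvicorn.run(app, host="0.0.0.0", port=8000)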
To build the sidecar image yourself, you need to install Docker Desktop locally on your machine.
Clone the repository locally.
git clone https://github.com/Azure-Samples/ai-slm-in-app-service-sidecar
cd ai-slm-in-app-service-sidecar
Change into the Phi-3 image's source directory and download the model locally using the Hugging Face CLI.
cd bring_your_own_slm/src/phi-3-sidecar
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --local-dir ./Phi-3-mini-4k-instruct-onnx
The Dockerfile is configured to copy the model from ./Phi-3-mini-4k-instruct-onnx.
Build the Docker image. For example:
docker build --tag phi-3 .
Upload the built image to Azure Container Registry. For instructions, see Push your first image to your Azure container registry using the Docker CLI.
In the Deployment Center > Containers (new) tab, select Add > Custom container and configure the new container as follows:
- Name: phi-3
- Image source: Azure Container Registry
- Registry: your registry
- Image: the uploaded image
- Tag: the image tag you want
- Port: 8000
Select Apply.
See bring_your_own_slm/src/webapp for a sample application that interacts with this custom sidecar container.
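Because a sidecar shares the same network namespace as the main container, the web app reaches the custom sidecar at localhost on the port you configured (8000 here). The following is a minimal sketch of such a call; the request payload and route are assumptions and must match your container's actual API:
# Sketch: call the custom SLM sidecar from the main web app and relay its streamed output.
import httpx

async def ask_custom_sidecar(prompt: str):
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            "http://localhost:8000/",  # port configured for the custom sidecar container
            json={"prompt": prompt},   # assumed payload shape
            timeout=60.0
        ) as response:
            async for chunk in response.aiter_text():
                yield chunk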
Next steps
Tutorial: Configure a sidecar container for a Linux app in Azure App Service