This tutorial guides you through deploying a FastAPI-based chatbot application integrated with the Phi-4 sidecar extension on Azure App Service. By following the steps, you'll learn how to set up a scalable web app, add an AI-powered sidecar for enhanced conversational capabilities, and test the chatbot's functionality.
Hosting your own small language model (SLM) offers several advantages:
- Full control over your data. Sensitive information isn't exposed to external services, which is critical for industries with strict compliance requirements.
- Self-hosted models can be fine-tuned to meet specific use cases or ___domain-specific requirements.
- Minimized network latency and faster response times for a better user experience.
- Full control over resource allocation, ensuring optimal performance for your application.
Prerequisites
- An Azure account with an active subscription.
- A GitHub account.
Deploy the sample application
In the browser, navigate to the sample application repository.
Start a new Codespace from the repository.
Open the terminal in the Codespace and sign in with your Azure account:
az login
Run the following commands to deploy the application:
cd use_sidecar_extension/fastapiapp
az webapp up --sku P3MV3
az webapp config set --startup-file "gunicorn -w 4 -k uvicorn.workers.UvicornWorker app.main:app"
This startup command is a common setup for deploying FastAPI applications to Azure App Service. For more information, see Quickstart: Deploy a Python (Django, Flask, or FastAPI) web app to Azure App Service.
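The app.main:app argument in the Gunicorn command implies that the FastAPI instance is exposed as app inside the app/main.py module. The following is a minimal sketch of such a module; the health route is illustrative and not part of the sample:
# app/main.py -- minimal sketch; the sample's actual module defines more routes and wiring.
from fastapi import FastAPI

# Gunicorn resolves "app.main:app" to this object and serves it with UvicornWorker.
app = FastAPI()

@app.get("/health")
async def health():
    # Simple liveness probe, handy for verifying the App Service deployment.
    return {"status": "ok"}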
Add the Phi-4 sidecar extension
In this section, you add the Phi-4 sidecar extension to your FastAPI application hosted on Azure App Service.
- Navigate to the Azure portal and go to your app's management page.
- In the left-hand menu, select Deployment > Deployment Center.
- On the Containers tab, select Add > Sidecar extension.
- In the sidecar extension options, select AI: phi-4-q4-gguf (Experimental).
- Provide a name for the sidecar extension.
- Select Save to apply the changes.
- Wait a few minutes for the sidecar extension to deploy. Keep selecting Refresh until the Status column shows Running.
This Phi-4 sidecar extension exposes an OpenAI-compatible chat completions API at http://localhost:11434/v1/chat/completions. For more information on how to interact with the API, see the OpenAI API documentation.
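For example, you can call the endpoint directly from the web app container. The following is a minimal sketch using httpx; the payload follows the OpenAI chat completions convention, and the non-streamed request ("stream": False) is an assumption made for brevity:
# Quick manual test against the sidecar's OpenAI-style endpoint (sketch only).
import httpx

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."}
    ],
    "stream": False  # assumes the sidecar also supports non-streamed responses
}

response = httpx.post(
    "http://localhost:11434/v1/chat/completions",
    json=payload,
    timeout=60.0
)
print(response.json()["choices"][0]["message"]["content"])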
Test the chatbot
In your app's management page, in the left-hand menu, select Overview.
Under Default ___domain, select the URL to open your web app in a browser.
Verify that the chatbot application is running and responding to user inputs.
How the sample application works
The sample application demonstrates how to integrate a FastAPI-based service with the SLM sidecar extension. The SLMService class encapsulates the logic for sending requests to the SLM API and processing the streamed responses. This integration enables the application to generate conversational responses dynamically.
Looking in use_sidecar_extension/fastapiapp/app/services/slm_service.py, you see that:
The service sends a POST request to the SLM endpoint http://localhost:11434/v1/chat/completions:
self.api_url = 'http://localhost:11434/v1/chat/completions'
The POST payload includes the system message and the prompt that's built from the selected product and the user query.
request_payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
    "stream": True,
    "cache_prompt": False,
    "n_predict": 2048  # Increased token limit to allow longer responses
}
The POST request streams the response line by line. Each line is parsed to extract the generated content (or token).
async with httpx.AsyncClient() as client:
    async with client.stream(
        "POST",
        self.api_url,
        json=request_payload,
        headers={"Content-Type": "application/json"},
        timeout=30.0
    ) as response:
        async for line in response.aiter_lines():
            if not line or line == "[DONE]":
                continue
            if line.startswith("data: "):
                line = line.replace("data: ", "").strip()
            try:
                json_obj = json.loads(line)
                if "choices" in json_obj and len(json_obj["choices"]) > 0:
                    delta = json_obj["choices"][0].get("delta", {})
                    content = delta.get("content")
                    if content:
                        yield content
            except json.JSONDecodeError:
                # Ignore lines that aren't valid JSON
                continue
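On the FastAPI side, an async generator like this can be relayed to the browser with StreamingResponse. The following is a minimal sketch rather than the sample's exact route; the route path, the chat_completion method name, and the no-argument SLMService constructor are assumptions:
# Sketch of a route that relays SLM tokens to the client as they are generated.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

from app.services.slm_service import SLMService

app = FastAPI()
slm_service = SLMService()

@app.post("/chat")
async def chat(prompt: str):
    # StreamingResponse writes each yielded token to the HTTP response as it arrives.
    return StreamingResponse(slm_service.chat_completion(prompt), media_type="text/plain")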
Frequently asked questions
How does the pricing tier affect the performance of the SLM sidecar?
Since AI models consume considerable resources, choose the pricing tier that gives you sufficient vCPUs and memory to run your specific model. For this reason, the built-in AI sidecar extensions only appear when the app is in a suitable pricing tier. If you build your own SLM sidecar container, you should also use a CPU-optimized model, since the App Service pricing tiers are CPU-only tiers.
For example, the Phi-3 mini model with a 4K context length from Hugging Face is designed to run with limited resources and provides strong math and logical reasoning for many common scenarios. It also comes with a CPU-optimized version. In App Service, we tested the model on all premium tiers and found it to perform well in the P2mv3 tier or higher. If your requirements allow, you can run it on a lower tier.
How do I use my own SLM sidecar?
The sample repository contains a sample SLM container that you can use as a sidecar. It runs a FastAPI application that listens on port 8000, as specified in its Dockerfile. The application uses ONNX Runtime to load the Phi-3 model, then forwards the HTTP POST data to the model and streams the response from the model back to the client. For more information, see model_api.py.
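Conceptually, the container's API resembles the following sketch. The generate_tokens helper is a hypothetical stand-in for the ONNX Runtime generation loop, and the route path is illustrative; see model_api.py in the repository for the actual implementation:
# Conceptual sketch of an SLM sidecar API listening on port 8000.
from typing import Iterator

import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

def generate_tokens(prompt: str) -> Iterator[str]:
    # Placeholder: the real model_api.py loads the Phi-3 ONNX model and yields
    # generated tokens one at a time.
    yield from ["Example ", "streamed ", "response."]

@app.post("/")
async def complete(request: Request):
    body = await request.json()
    prompt = body.get("prompt", "")
    # Stream the generated tokens back to the caller as they are produced.
    return StreamingResponse(generate_tokens(prompt), media_type="text/plain")

if __name__ == "__main__":
    # Listen on port 8000, matching the port exposed in the Dockerfile.
    uvicorn.run(app, host="0.0.0.0", port=8000)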
To build the sidecar image yourself, you need to install Docker Desktop locally on your machine.
Clone the repository locally.
git clone https://github.com/Azure-Samples/ai-slm-in-app-service-sidecar
cd ai-slm-in-app-service-sidecar
Change into the Phi-3 image's source directory and download the model locally using the Hugging Face CLI.
cd bring_your_own_slm/src/phi-3-sidecar
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --local-dir ./Phi-3-mini-4k-instruct-onnx
The Dockerfile is configured to copy the model from ./Phi-3-mini-4k-instruct-onnx.
Build the Docker image. For example:
docker build --tag phi-3 .
Upload the built image to Azure Container Registry. For instructions, see Push your first image to your Azure container registry using the Docker CLI.
In the Deployment Center > Containers (new) tab, select Add > Custom container and configure the new container as follows:
- Name: phi-3
- Image source: Azure Container Registry
- Registry: your registry
- Image: the uploaded image
- Tag: the image tag you want
- Port: 8000
Select Apply.
See bring_your_own_slm/src/webapp for a sample application that interacts with this custom sidecar container.
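Because a sidecar shares the same network namespace as the main container, the web app reaches the custom sidecar at localhost on the port you configured (8000 here). The following is a minimal sketch of such a call; the request payload and route are assumptions and must match your container's actual API:
# Sketch: call the custom SLM sidecar from the main web app and relay its streamed output.
import httpx

async def ask_custom_sidecar(prompt: str):
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            "http://localhost:8000/",  # port configured for the custom sidecar container
            json={"prompt": prompt},   # assumed payload shape
            timeout=60.0
        ) as response:
            async for chunk in response.aiter_text():
                yield chunk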
Next steps
Tutorial: Configure a sidecar container for a Linux app in Azure App Service