Building a Production-Grade RAG Chatbot on Azure

There's a specific moment that trips up a lot of "AI chatbot" tutorials: the demo works beautifully in a Jupyter notebook, and then someone asks how to actually ship it — with real users, real documents, and a security team asking uncomfortable questions. This guide is written for that moment. It walks through building a Retrieval-Augmented Generation (RAG) chatbot from an empty resource group all the way to a monitored, autoscaled, CI/CD-deployed service on Azure Kubernetes Service.

By the end, you'll have a chatbot that does one specific, useful thing: you upload a PDF, DOCX, or TXT file, ask a question about it, and GPT-4o answers using only the content of that file — not whatever it happened to memorize during training. That constraint is what makes RAG useful for real work, and it's also what makes the architecture more interesting than a plain chat wrapper.

Why RAG, and why this stack

The one-paragraph version of how this works: when someone uploads a document, the backend splits it into chunks of roughly 512 characters with a small overlap, sends each chunk to an embeddings model to get a vector representation, and stores both the text and the vector in a search index. When someone asks a question, the backend embeds the question the same way, asks the search index for the most similar chunks, and stuffs those chunks into a prompt as context. GPT-4o is then instructed to answer only from that context and to say so when the answer isn't in the documents, rather than guess. That's the whole trick, and most of the engineering effort in this guide goes into making that trick reliable, secure, and cheap to run at scale.
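To make that loop concrete before any Azure service enters the picture, here is a minimal, dependency-free sketch. It is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embeddings model, and `retrieve` and `build_prompt` are hypothetical helper names, not the functions this guide builds later.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embeddings model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q_vec = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, toy_embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Stuff the retrieved chunks into the prompt as the only allowed context."""
    context = "\n---\n".join(context_chunks)
    return f"Answer ONLY from this context:\n{context}\n\nQuestion: {question}"


chunks = [
    "The refund window is 30 days from delivery.",
    "Shipping takes five business days.",
    "Support is open on weekdays.",
]
question = "How many days is the refund window?"
print(build_prompt(question, retrieve(question, chunks, top_k=1)))
```

The real pipeline swaps `toy_embed` for text-embedding-3-small and the in-memory sort for an Azure AI Search query, but the shape of the loop stays the same.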

The stack settles out like this: FastAPI for the backend, Next.js for the frontend, Azure AI Search as the vector store, Azure OpenAI for both embeddings (text-embedding-3-small) and completions (gpt-4o), Blob Storage for the raw files, AKS for orchestration, Key Vault for secrets, and GitHub Actions for CI/CD. Nothing here is exotic — the interesting part is how the pieces are wired together, particularly around identity and security, which is where most home-grown RAG projects quietly cut corners.

Picture the full request path: a browser talks to the Next.js frontend running on AKS, which calls the FastAPI backend, also on AKS. The backend calls Azure OpenAI for embeddings and chat, and Azure AI Search for retrieval, while Blob Storage holds the original files. Everything sits behind private endpoints inside a VNet, secrets live in Key Vault and are fetched through Workload Identity rather than environment variables, and a GitHub Actions pipeline builds images, pushes them to a container registry, and rolls them out to the cluster.

Before touching any of that, you'll need a few things installed and configured: an Azure subscription with Owner or Contributor access, Azure CLI 2.60+, kubectl 1.29+, Helm 3.14+, Docker 24+, Python 3.11+, Node 20 LTS, and a GitHub account (the free tier is fine). You'll also need to register a handful of resource providers once per subscription — AKS, ACR, Key Vault, Storage, Cognitive Services, Search, Networking, and the monitoring providers.

az provider register --namespace Microsoft.ContainerService # AKS
az provider register --namespace Microsoft.ContainerRegistry # ACR
az provider register --namespace Microsoft.KeyVault
az provider register --namespace Microsoft.Storage
az provider register --namespace Microsoft.CognitiveServices # Azure OpenAI
az provider register --namespace Microsoft.Search # AI Search
az provider register --namespace Microsoft.Network
az provider register --namespace Microsoft.Insights
az provider register --namespace Microsoft.OperationalInsights

Phases 1–3 Laying the foundation: resource group, storage, and search

Everything in this project lives inside one resource group, which makes teardown trivial later — a single az group delete removes the whole thing. East US is the region of choice here mainly because GPT-4o and text-embedding-3-small are both available there; if you pick a different region, check Azure OpenAI's model availability first, since coverage isn't uniform.

LOCATION="eastus"
RG="rg-rag-chatbot"
PREFIX="ragbot"
SUBSCRIPTION=$(az account show --query id -o tsv)

az group create \
--name $RG \
--location $LOCATION \
--tags project=rag-chatbot environment=dev

With the resource group in place, the first real piece of infrastructure is Blob Storage, which holds the raw uploaded documents before anything gets chunked or indexed.

STORAGE_NAME="${PREFIX}docs$(openssl rand -hex 4)"

az storage account create \
--name $STORAGE_NAME \
--resource-group $RG \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--allow-blob-public-access false \
--min-tls-version TLS1_2

az storage container create \
--name docs \
--account-name $STORAGE_NAME \
--auth-mode login

That --allow-blob-public-access false flag isn't decorative. It prevents any container in the account from being switched to anonymous read access, so a later misconfiguration can't expose a user's uploaded documents to anyone who happens to have a blob URL. Changing the setting on an account that already serves anonymous readers breaks those callers, so set it correctly the first time.

Next comes the vector store. Azure AI Search is where document chunks and their embeddings actually live, and it's what finds the semantically closest chunks when someone asks a question.

SEARCH_NAME="${PREFIX}-search"

az search service create \
--name $SEARCH_NAME \
--resource-group $RG \
--location $LOCATION \
--sku basic \
--partition-count 1 \
--replica-count 1

SEARCH_ADMIN_KEY=$(az search admin-key show \
--service-name $SEARCH_NAME \
--resource-group $RG \
--query primaryKey -o tsv)

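The SKU choice matters more than it looks. Basic supports the vector fields this index depends on and can also run the semantic ranker, which this guide does not use. The Free tier is limited to a small amount of shared storage and offers no scaling, so it only suits throwaway experiments. For anything beyond a personal prototype, plan on Standard S1 or higher.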

Once the service is up, it needs an index schema — the shape that every document chunk will take, including its embedding vector.

# scripts/create_index.py
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchFieldDataType,
    SearchableField, SearchField, VectorSearch,
    HnswAlgorithmConfiguration, VectorSearchProfile,
)

endpoint = os.environ["AZURE_SEARCH_ENDPOINT"]
key = os.environ["AZURE_SEARCH_KEY"]

client = SearchIndexClient(endpoint, AzureKeyCredential(key))

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SimpleField(name="source_file", type=SearchFieldDataType.String, filterable=True),
    SimpleField(name="page_number", type=SearchFieldDataType.Int32),
    SimpleField(name="chunk_index", type=SearchFieldDataType.Int32),
    SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,  # vector fields must be searchable to be queried
        vector_search_dimensions=1536,  # output size of text-embedding-3-small
        vector_search_profile_name="hnsw-profile",
    ),
]

# One HNSW algorithm configuration, exposed to fields through a named profile
vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="hnsw-algo")],
    profiles=[VectorSearchProfile(name="hnsw-profile", algorithm_configuration_name="hnsw-algo")],
)

index = SearchIndex(name="documents", fields=fields, vector_search=vector_search)
client.create_or_update_index(index)
print("Index created: documents")

AZURE_SEARCH_ENDPOINT="https://${SEARCH_NAME}.search.windows.net" \
AZURE_SEARCH_KEY="$SEARCH_ADMIN_KEY" \
python scripts/create_index.py

Phases 4–5 Bringing in the model: Azure OpenAI and Key Vault

With storage and search ready, the next stop is Azure OpenAI, where GPT-4o and the embeddings model actually get deployed.

OPENAI_NAME="${PREFIX}-openai"

az cognitiveservices account create \
--name $OPENAI_NAME \
--resource-group $RG \
--location $LOCATION \
--kind OpenAI \
--sku S0 \
--yes

az cognitiveservices account deployment create \
--name $OPENAI_NAME \
--resource-group $RG \
--deployment-name gpt-4o \
--model-name gpt-4o \
--model-version "2024-11-20" \
--model-format OpenAI \
--sku-capacity 10 \
--sku-name "Standard"

az cognitiveservices account deployment create \
--name $OPENAI_NAME \
--resource-group $RG \
--deployment-name text-embedding-3-small \
--model-name text-embedding-3-small \
--model-version "1" \
--model-format OpenAI \
--sku-capacity 10 \
--sku-name "Standard"

--sku-capacity 10 caps throughput at 10K tokens per minute, which is plenty for a single-developer test environment but worth bumping via a quota increase request before any real production traffic hits it.
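With a cap that low, a large upload can run into rate-limit responses partway through embedding. One common way to absorb that is exponential backoff with jitter. The sketch below is illustrative only: `RateLimited` is a hypothetical stand-in for whatever exception your client raises on HTTP 429, and `call_with_backoff` is a helper name invented for this example, not part of any SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 raised by the model client."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimited with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # 1s, 2s, 4s, ... capped at max_delay, plus up to 50% random jitter
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the behavior can be tested without waiting. The openai Python client also has its own `max_retries` setting, so check how many layers of retry you are stacking before adding another.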

Every credential this project generates — the OpenAI key, the search admin key, the storage connection string — goes into Key Vault rather than an .env file or a Kubernetes secret. The backend never reads secrets from environment variables at all; it fetches them at runtime through the Azure SDK using Managed Identity, which is a theme that carries through the rest of this build.

KV_NAME="${PREFIX}-kv-$(openssl rand -hex 3)"

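az keyvault create \
  --name $KV_NAME \
  --resource-group $RG \
  --location $LOCATION \
  --sku standard \
  --enable-rbac-authorization true \
  --retention-days 90

# RBAC-mode vaults grant no data-plane access by default, so give yourself write access to secrets
az role assignment create \
  --role "Key Vault Secrets Officer" \
  --assignee-object-id "$(az ad signed-in-user show --query id -o tsv)" \
  --assignee-principal-type User \
  --scope "$(az keyvault show --name $KV_NAME --resource-group $RG --query id -o tsv)"

OPENAI_KEY=$(az cognitiveservices account keys list \
  --name $OPENAI_NAME --resource-group $RG --query key1 -o tsv)
OPENAI_ENDPOINT=$(az cognitiveservices account show \
  --name $OPENAI_NAME --resource-group $RG --query properties.endpoint -o tsv)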

az keyvault secret set --vault-name $KV_NAME --name "openai-api-key" --value "$OPENAI_KEY"
az keyvault secret set --vault-name $KV_NAME --name "openai-endpoint" --value "$OPENAI_ENDPOINT"
az keyvault secret set --vault-name $KV_NAME --name "search-endpoint" --value "https://${SEARCH_NAME}.search.windows.net"
az keyvault secret set --vault-name $KV_NAME --name "search-admin-key" --value "$SEARCH_ADMIN_KEY"

STORAGE_CONN=$(az storage account show-connection-string \
--name $STORAGE_NAME --resource-group $RG --query connectionString -o tsv)
az keyvault secret set --vault-name $KV_NAME --name "storage-connection-string" --value "$STORAGE_CONN"

Phases 6–7 Writing the application: a shape worth sketching first

Before writing any code, it helps to lay out the project structure, since the backend and frontend end up split cleanly along the lines you'd expect — ingestion and chat logic on one side, chat UI and upload widget on the other, with Helm and GitHub Actions wrapping the whole thing.

rag-chatbot/
├── backend/
│ ├── app/
│ │ ├── main.py, ingest.py, chat.py, config.py, models.py
│ ├── Dockerfile
│ └── requirements.txt
├── frontend/
│ ├── app/ (page.tsx, components/, api/chat/route.ts)
│ ├── Dockerfile
│ └── package.json
├── helm/rag-chatbot/ (Chart.yaml, values.yaml, templates/)
├── .github/workflows/ (ci.yml, deploy.yml)
└── scripts/ (create_index.py, bootstrap.sh)

Secrets, loaded lazily. config.py is intentionally tiny — its whole job is to fetch a named secret from Key Vault using DefaultAzureCredential, cached so it's only fetched once per process.

import os
from functools import lru_cache
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

@lru_cache(maxsize=None)
def get_secret(name: str) -> str:
    kv_uri = os.environ["KEY_VAULT_URI"]  # only env var the app needs
    client = SecretClient(vault_url=kv_uri, credential=DefaultAzureCredential())
    return client.get_secret(name).value
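One consequence of caching a secret for the life of the process is that a rotated key is not picked up until the pod restarts. If that matters for your rotation policy, a time-bounded cache is one option. The sketch below is an assumption-laden illustration: `TTLSecretCache` is a hypothetical class, and `fetch` stands in for whatever function actually reads from Key Vault.

```python
import time
from typing import Callable


class TTLSecretCache:
    """Cache secret values, refetching each one after ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # secret name -> (time fetched, value)
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, name: str) -> str:
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no network call
        value = self._fetch(name)
        self._entries[name] = (now, value)
        return value
```

In the backend this could wrap the Key Vault lookup, for example `cache = TTLSecretCache(fetch=lambda n: client.get_secret(n).value)`, trading a little extra Key Vault traffic for faster pickup of rotated secrets.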

Ingestion: from upload to vector. This is the pipeline that does the actual RAG groundwork — store the raw file, extract its text, chunk it, embed each chunk, and push everything into the search index.

import hashlib
import io
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import AzureOpenAI
from azure.storage.blob import BlobServiceClient
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
from .config import get_secret

def get_openai_client() -> AzureOpenAI:
    return AzureOpenAI(
        azure_endpoint=get_secret("openai-endpoint"),
        api_key=get_secret("openai-api-key"),
        api_version="2024-10-21",
    )

def embed(texts: list[str]) -> list[list[float]]:
    # `model` is the Azure deployment name, which matches the model name in this guide
    client = get_openai_client()
    response = client.embeddings.create(input=texts, model="text-embedding-3-small")
    return [item.embedding for item in response.data]

def get_search_client() -> SearchClient:
    return SearchClient(
        endpoint=get_secret("search-endpoint"),
        index_name="documents",
        credential=AzureKeyCredential(get_secret("search-admin-key")),
    )

def upload_to_blob(filename: str, content: bytes) -> str:
    conn_str = get_secret("storage-connection-string")
    blob_client = BlobServiceClient.from_connection_string(conn_str)
    container = blob_client.get_container_client("docs")
    container.upload_blob(name=filename, data=content, overwrite=True)
    return filename

def extract_text(filename: str, content: bytes) -> str:
    # Compare case-insensitively so "REPORT.PDF" is handled like "report.pdf"
    name = filename.lower()
    if name.endswith(".pdf"):
        import pypdf
        reader = pypdf.PdfReader(io.BytesIO(content))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    elif name.endswith(".docx"):
        import docx
        doc = docx.Document(io.BytesIO(content))
        return "\n".join(para.text for para in doc.paragraphs)
    elif name.endswith(".txt"):
        return content.decode("utf-8", errors="replace")
    else:
        raise ValueError(f"Unsupported file type: {filename}")

def ingest_document(filename: str, content: bytes) -> dict:
    upload_to_blob(filename, content)
    text = extract_text(filename, content)

    # chunk_size and chunk_overlap are measured in characters, not tokens
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_text(text)

    # Embed in batches of 100 to keep each request small
    all_embeddings = []
    for i in range(0, len(chunks), 100):
        all_embeddings.extend(embed(chunks[i : i + 100]))

    documents = []
    # A content hash keeps IDs stable, so re-uploading the same file overwrites its chunks
    file_hash = hashlib.md5(content).hexdigest()[:8]
    for idx, (chunk, embedding) in enumerate(zip(chunks, all_embeddings)):
        documents.append({
            "id": f"{file_hash}-{idx}",
            "content": chunk,
            "source_file": filename,
            "page_number": 0,  # placeholder: page boundaries are lost in extract_text
            "chunk_index": idx,
            "embedding": embedding,
        })

    search = get_search_client()
    results = search.upload_documents(documents=documents)
    succeeded = sum(1 for r in results if r.succeeded)
    return {"filename": filename, "chunks": len(chunks), "indexed": succeeded}
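The pipeline above joins all pages into one string before splitting, which is why every chunk ends up with page_number 0. If page-level citations matter, one option is to chunk each page separately and carry the page number along. The following is a sketch under assumptions: it uses a plain sliding character window rather than RecursiveCharacterTextSplitter, and `PageChunk` and `chunk_pages` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageChunk:
    page_number: int  # 1-based page the chunk came from
    chunk_index: int  # running index across the whole document
    content: str


def chunk_pages(pages: list[str], chunk_size: int = 512, overlap: int = 64) -> list[PageChunk]:
    """Split each page into overlapping character windows, remembering the page."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: list[PageChunk] = []
    for page_number, text in enumerate(pages, start=1):
        text = text.strip()
        start = 0
        while start < len(text):
            chunks.append(PageChunk(page_number, len(chunks), text[start:start + chunk_size]))
            if start + chunk_size >= len(text):
                break  # this window reached the end of the page
            start += step
    return chunks
```

For a PDF, the `pages` list could come from `[page.extract_text() or "" for page in reader.pages]`. Empty pages simply produce no chunks, and the trade-off is that no chunk spans a page break.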

Chat: retrieval, then grounded generation. This is the other half of the RAG loop — embed the question, run a hybrid vector-plus-keyword search, build a context block from the results, and stream a completion that's instructed to stick to that context:

import asyncio
from typing import AsyncGenerator
from openai import AsyncAzureOpenAI, AzureOpenAI
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from azure.core.credentials import AzureKeyCredential
from .config import get_secret

SYSTEM_PROMPT = """You are a helpful assistant. Answer questions based ONLY on the context provided below.
If the answer is not in the context, say "I don't know based on the provided documents."
Do not use any knowledge outside the provided context. Cite the source file when possible."""

def retrieve_chunks(query: str, top_k: int = 5) -> list[dict]:
    openai_client = AzureOpenAI(
        azure_endpoint=get_secret("openai-endpoint"),
        api_key=get_secret("openai-api-key"),
        api_version="2024-10-21",
    )
    query_embedding = openai_client.embeddings.create(
        input=[query], model="text-embedding-3-small"
    ).data[0].embedding

    search_client = SearchClient(
        endpoint=get_secret("search-endpoint"),
        index_name="documents",
        credential=AzureKeyCredential(get_secret("search-admin-key")),
    )
    # Hybrid query: keyword and vector results are fused by the service
    results = search_client.search(
        search_text=query,  # BM25 keyword search
        vector_queries=[VectorizedQuery(vector=query_embedding, k_nearest_neighbors=top_k, fields="embedding")],
        top=top_k,
        select=["content", "source_file", "chunk_index"],
    )
    return [{"content": r["content"], "source": r["source_file"]} for r in results]

def build_context(chunks: list[dict]) -> str:
    parts = [f"[Source: {c['source']}]\n{c['content']}" for c in chunks]
    return "\n\n---\n\n".join(parts)

async def stream_chat(question: str, history: list[dict]) -> AsyncGenerator[str, None]:
    # retrieve_chunks makes blocking network calls, so run it off the event loop
    chunks = await asyncio.to_thread(retrieve_chunks, question)
    context = build_context(chunks)

    # Only pass through user/assistant turns so a client cannot inject its own system message
    prior_turns = [
        {"role": m["role"], "content": str(m.get("content", ""))}
        for m in history
        if m.get("role") in ("user", "assistant")
    ]
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *prior_turns,
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

    # The async client keeps the event loop free while tokens stream in
    openai_client = AsyncAzureOpenAI(
        azure_endpoint=get_secret("openai-endpoint"),
        api_key=get_secret("openai-api-key"),
        api_version="2024-10-21",
    )
    stream = await openai_client.chat.completions.create(
        model="gpt-4o", messages=messages, stream=True, temperature=0, max_tokens=1024,
    )
    async for event in stream:
        if event.choices and event.choices[0].delta.content:
            yield event.choices[0].delta.content

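Notice the temperature is pinned to 0. For a grounded question-answering task, you want repeatable, boring answers, not creative ones. Even at 0 the output is not guaranteed to be identical between runs, but it is about as stable as the model allows.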
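The hybrid query in retrieve_chunks returns a single ranked list even though it runs a keyword search and a vector search. Azure AI Search documents Reciprocal Rank Fusion (RRF) as the way it merges such result sets. The sketch below shows the general idea only; it is not the service's implementation, the function name is hypothetical, and the constant k=60 is the conventional value from the RRF literature, assumed here for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


keyword_hits = ["a", "b", "c"]  # e.g. BM25 order
vector_hits = ["b", "d", "a"]   # e.g. nearest-neighbor order
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print([doc_id for doc_id, _ in fused])
```

Documents "a" and "b" appear in both lists and so outrank "c" and "d", which each appear in only one. That is the practical benefit of hybrid retrieval: an exact keyword match and a semantic match reinforce each other.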

Wiring it into FastAPI. The app itself is small: a health check, an upload endpoint that validates file size and type before handing off to ingest_document, and a chat endpoint that streams tokens back over server-sent events.

from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.concurrency import run_in_threadpool
from fastapi.responses import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from .ingest import ingest_document
from .chat import stream_chat

app = FastAPI(title="RAG Chatbot API", version="1.0.0")
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend-domain.com"],  # tighten before prod
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)

class ChatRequest(BaseModel):
    question: str
    history: list[dict] = []

@app.get("/health")
def health():
    return {"status": "healthy"}

@app.post("/upload")
async def upload(file: UploadFile = File(...)):
    if file.size and file.size > 50 * 1024 * 1024:
        raise HTTPException(status_code=413, detail="File too large (max 50MB)")
    allowed = {".pdf", ".docx", ".txt"}
    filename = file.filename or ""  # filename can be missing on malformed uploads
    suffix = "." + filename.rsplit(".", 1)[-1].lower()
    if suffix not in allowed:
        raise HTTPException(status_code=400, detail=f"Unsupported type. Allowed: {sorted(allowed)}")
    content = await file.read()
    # ingest_document does blocking I/O, so keep it off the event loop
    return await run_in_threadpool(ingest_document, filename, content)

@app.post("/chat")
async def chat(req: ChatRequest):
    async def generate():
        async for token in stream_chat(req.question, req.history):
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
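One subtlety with streaming raw tokens over server-sent events: a token that itself contains a newline will split the data line and confuse the client, and network chunks do not necessarily line up with event boundaries. A common remedy is to JSON-encode each token and to buffer on the receiving side until a full event has arrived. The sketch below illustrates both halves in Python; `sse_event` and `parse_sse` are hypothetical helpers, and the `{"token": ...}` payload shape is an assumption, not something the endpoint above emits.

```python
import json
from typing import Iterable, Iterator


def sse_event(token: str) -> str:
    """Frame one token as an SSE event; JSON escaping keeps it on a single data line."""
    payload = json.dumps({"token": token})
    return f"data: {payload}\n\n"


def parse_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Reassemble events from arbitrarily split network chunks and yield tokens."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Only consume complete events; a partial one waits for more data
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            if not event.startswith("data: "):
                continue
            payload = event[len("data: "):]
            if payload == "[DONE]":
                return
            yield json.loads(payload)["token"]


tokens = ["Hello", "\n\n", "- item", ' "quoted"']
wire = "".join(sse_event(t) for t in tokens) + "data: [DONE]\n\n"
pieces = [wire[i:i + 7] for i in range(0, len(wire), 7)]  # simulate awkward chunking
print(list(parse_sse(pieces)) == tokens)
```

The same buffering logic translates directly to the browser: accumulate decoded text, split on blank lines, and parse each data payload as JSON.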

fastapi==0.111.0
uvicorn[standard]==0.29.0
python-multipart==0.0.9
openai==1.35.3
azure-identity==1.17.1
azure-keyvault-secrets==4.8.0
azure-search-documents==11.6.0
azure-storage-blob==12.20.0
langchain-text-splitters==0.2.1
pypdf==4.2.0
python-docx==1.1.2
pydantic==2.7.3

Phase 8 The frontend: a chat window that streams

The frontend doesn't need to be elaborate — a scaffolded Next.js app with Tailwind covers it.

cd frontend
npx create-next-app@latest . --typescript --tailwind --eslint --app --no-src-dir --import-alias "@/*"

The chat component is where most of the interesting client-side logic lives: it posts the question, then reads the response body as a stream of server-sent events, appending tokens to the last message as they arrive so the reply appears to type itself out:

'use client'
import { useState, useRef, useEffect } from 'react'

interface Message { role: 'user' | 'assistant'; content: string }

export default function ChatInterface() {
  const [messages, setMessages] = useState<Message[]>([])
  const [input, setInput] = useState('')
  const [loading, setLoading] = useState(false)
  const bottomRef = useRef<HTMLDivElement>(null)

  useEffect(() => { bottomRef.current?.scrollIntoView({ behavior: 'smooth' }) }, [messages])

  async function sendMessage() {
    if (!input.trim() || loading) return
    const question = input.trim()
    setInput('')
    setMessages(prev => [...prev, { role: 'user', content: question }])
    setLoading(true)

    const response = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ question, history: messages }),
    })

    if (!response.ok || !response.body) {
      setMessages(prev => [...prev, { role: 'assistant', content: 'Error: could not reach backend.' }])
      setLoading(false)
      return
    }

    const reader = response.body.getReader()
    const decoder = new TextDecoder()
    let assistantMsg = ''
    setMessages(prev => [...prev, { role: 'assistant', content: '' }])

    while (true) {
      const { done, value } = await reader.read()
      if (done) break
      // stream: true keeps multi-byte characters intact across chunk boundaries
      const text = decoder.decode(value, { stream: true })
      for (const line of text.split('\n')) {
        if (line.startsWith('data: ') && line !== 'data: [DONE]') {
          assistantMsg += line.slice(6)
          setMessages(prev => [...prev.slice(0, -1), { role: 'assistant', content: assistantMsg }])
        }
      }
    }
    setLoading(false)
  }

  // Wrapper markup is minimal; only the input's classes come from the original design
  return (
    <div className="flex flex-col h-full">
      <div className="flex-1 overflow-y-auto">
        {messages.map((msg, i) => (
          <div key={i} className={msg.role === 'user' ? 'text-right' : 'text-left'}>
            {msg.content}
          </div>
        ))}
        {loading && <div>Thinking…</div>}
        <div ref={bottomRef} />
      </div>
      <input
        className="flex-1 bg-gray-800 text-white rounded-xl px-4 py-2 text-sm outline-none focus:ring-2 focus:ring-violet-500"
        value={input} onChange={e => setInput(e.target.value)}
        onKeyDown={e => e.key === 'Enter' && sendMessage()}
        placeholder="Ask a question about your documents…"
      />
    </div>
  )
}

A thin API route proxies chat requests to the FastAPI backend rather than exposing the backend URL directly to the browser.

import { NextRequest } from 'next/server'
const BACKEND_URL = process.env.BACKEND_URL ?? 'http://localhost:8000'

export async function POST(req: NextRequest) {
  const body = await req.json()
  const upstream = await fetch(`${BACKEND_URL}/chat`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  })
  if (!upstream.ok) return new Response('Backend error', { status: upstream.status })
  // Pass the upstream body through untouched so tokens stream to the browser
  return new Response(upstream.body, {
    headers: { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache', 'Connection': 'keep-alive' },
  })
}

And the upload widget rounds things out, giving users a drag-and-drop-style file input and a status line that reports how many chunks got indexed.

'use client'
import { useState } from 'react'

export default function FileUpload() {
  const [status, setStatus] = useState('')
  const [uploading, setUploading] = useState(false)

  async function handleFile(e: React.ChangeEvent<HTMLInputElement>) {
    const file = e.target.files?.[0]
    if (!file) return
    setUploading(true)
    setStatus(`Uploading ${file.name}…`)

    const form = new FormData()
    form.append('file', file)
    const res = await fetch(`${process.env.NEXT_PUBLIC_BACKEND_URL}/upload`, { method: 'POST', body: form })

    if (res.ok) {
      const data = await res.json()
      setStatus(`✓ Indexed ${data.chunks} chunks from ${data.filename}`)
    } else {
      setStatus(`✗ Upload failed: ${res.statusText}`)
    }
    setUploading(false)
    e.target.value = ''
  }

  // Wrapper markup is minimal; style it to match the chat window
  return (
    <div>
      <label>
        {uploading ? 'Processing…' : '+ Upload a PDF, DOCX, or TXT file'}
        <input type="file" accept=".pdf,.docx,.txt" className="hidden" onChange={handleFile} disabled={uploading} />
      </label>
      {status && <p>{status}</p>}
    </div>
  )
}

Phases 9–11 Packaging it: Docker, ACR, and AKS

Both services get multi-stage Dockerfiles, mostly to keep final image sizes down and to avoid shipping build tools into production containers.

# backend/Dockerfile
FROM python:3.11-slim AS builder
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends build-essential && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY app/ ./app/
ENV PATH=/root/.local/bin:$PATH
ENV PYTHONPATH=/app
EXPOSE 8000
# The slim image ships without curl, so probe the health endpoint with Python itself
HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)" || exit 1
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]

# frontend/Dockerfile
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
# Install devDependencies too: the build stage needs TypeScript and Tailwind
RUN npm ci

FROM node:20-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
ARG NEXT_PUBLIC_BACKEND_URL
ENV NEXT_PUBLIC_BACKEND_URL=$NEXT_PUBLIC_BACKEND_URL
RUN npm run build

FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
COPY --from=builder /app/public ./public
# .next/standalone only exists when next.config sets output: 'standalone'
COPY --from=builder /app/.next/standalone ./
COPY --from=builder /app/.next/static ./.next/static
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=5s CMD wget -qO- http://localhost:3000 || exit 1
CMD ["node", "server.js"]

Images need somewhere to live before AKS can pull them, which is what Azure Container Registry is for. Note admin-enabled false — nothing here authenticates with a username and password; Managed Identity handles image pulls instead:

ACR_NAME="${PREFIX}acr$(openssl rand -hex 3)"

az acr create --name $ACR_NAME --resource-group $RG --location $LOCATION --sku Basic --admin-enabled false

az acr build --registry $ACR_NAME --image "rag-backend:dev-$(git rev-parse --short HEAD)" ./backend
az acr build --registry $ACR_NAME --image "rag-frontend:dev-$(git rev-parse --short HEAD)" ./frontend

Then comes the cluster itself, provisioned with OIDC issuance and Workload Identity turned on from the start — retrofitting identity federation onto an existing cluster is possible but much more annoying than enabling it up front.

AKS_NAME="${PREFIX}-aks"

az aks create \
--name $AKS_NAME \
--resource-group $RG \
--location $LOCATION \
--node-count 2 \
--node-vm-size Standard_D2s_v3 \
--enable-oidc-issuer \
--enable-workload-identity \
--enable-managed-identity \
--attach-acr $ACR_NAME \
--network-plugin azure \
--enable-cluster-autoscaler \
--min-count 2 \
--max-count 5 \
--generate-ssh-keys

az aks get-credentials --resource-group $RG --name $AKS_NAME --overwrite-existing
kubectl get nodes

--attach-acr quietly does a lot of work here: it grants the cluster's managed identity the AcrPull role, so pods can pull images without any image-pull secret sitting in a manifest.

Phase 12 Identity without secrets: Workload Identity

This is the part of the architecture that's easy to skip and expensive to regret skipping. A client secret (an API key, a connection string, a service principal password) is a string that lives somewhere: a Key Vault entry, a CI variable, a .env file. Every one of those places is a place it can leak. Workload Identity sidesteps the problem for the pod's own credentials: a pod's Kubernetes service account token gets exchanged for a Microsoft Entra ID (formerly Azure AD) token via OIDC federation, and Key Vault authorizes that identity through RBAC. The pod holds no long-lived secret that could leak and nothing that needs rotating on a schedule.

IDENTITY_NAME="mi-rag-backend"
az identity create --name $IDENTITY_NAME --resource-group $RG --location $LOCATION

IDENTITY_CLIENT_ID=$(az identity show --name $IDENTITY_NAME --resource-group $RG --query clientId -o tsv)
IDENTITY_OBJECT_ID=$(az identity show --name $IDENTITY_NAME --resource-group $RG --query principalId -o tsv)

KV_ID=$(az keyvault show --name $KV_NAME --resource-group $RG --query id -o tsv)
az role assignment create --role "Key Vault Secrets User" --assignee-object-id $IDENTITY_OBJECT_ID --assignee-principal-type ServicePrincipal --scope $KV_ID

AKS_OIDC_ISSUER=$(az aks show --name $AKS_NAME --resource-group $RG --query "oidcIssuerProfile.issuerUrl" -o tsv)

az identity federated-credential create \
--name "rag-backend-fedcred" \
--identity-name $IDENTITY_NAME \
--resource-group $RG \
--issuer "$AKS_OIDC_ISSUER" \
--subject "system:serviceaccount:default:rag-backend-sa" \
--audience "api://AzureADTokenExchange"

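You can confirm it's actually working once the rag-backend-sa service account exists in the cluster (the Helm chart in the next phase creates it). Workload Identity does not go through the instance metadata endpoint. Instead, a mutating webhook injects AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_FEDERATED_TOKEN_FILE into any pod that carries the azure.workload.identity/use label and runs under the annotated service account. A throwaway pod should list all three.

kubectl run debug --rm -it --restart=Never \
  --image=curlimages/curl \
  --labels="azure.workload.identity/use=true" \
  --overrides='{"spec":{"serviceAccountName":"rag-backend-sa"}}' \
  -- sh -c 'env | grep AZURE_'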

Phases 13–14 Deploying to Kubernetes

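The Helm chart's service account is what links the federated credential to a running pod. The client-id annotation on the service account tells the webhook which managed identity to use, and the azure.workload.identity/use: "true" label on the pod template (set in the deployment that follows) is what opts a pod in. Without both, Workload Identity never engages.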

# helm/rag-chatbot/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rag-backend-sa
  namespace: {{ .Values.namespace }}
  annotations:
    azure.workload.identity/client-id: {{ .Values.identity.clientId }}
  labels:
    azure.workload.identity/use: "true"

The backend deployment itself is fairly ordinary — resource requests and limits, liveness and readiness probes against /health, and the one environment variable the app actually needs.

# helm/rag-chatbot/templates/deployment-backend.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-backend
  namespace: {{ .Values.namespace }}
spec:
  replicas: {{ .Values.backend.replicas }}
  selector:
    matchLabels:
      app: rag-backend
  template:
    metadata:
      labels:
        app: rag-backend
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: rag-backend-sa
      containers:
        - name: backend
          image: {{ .Values.acr }}/rag-backend:{{ .Values.imageTag }}
          ports:
            - containerPort: 8000
          env:
            - name: KEY_VAULT_URI
              value: {{ .Values.keyVaultUri }}
          resources:
            requests: { cpu: 250m, memory: 512Mi }
            limits: { cpu: 1000m, memory: 1Gi }
          livenessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 10
            periodSeconds: 5

# helm/rag-chatbot/values.yaml
namespace: default
acr: yourregistry.azurecr.io
imageTag: latest
keyVaultUri: https://your-kv.vault.azure.net/
identity:
  clientId: "" # set via --set during deploy
backend:
  replicas: 2
frontend:
  replicas: 2
  backendUrl: http://rag-backend-svc:8000
ingress:
  enabled: true
  host: chat.yourdomain.com
  tlsSecretName: chat-tls

# imageTag must match the tag used by az acr build earlier (dev-<short SHA>)
helm upgrade --install rag-chatbot ./helm/rag-chatbot \
  --namespace default \
  --set acr="${ACR_NAME}.azurecr.io" \
  --set imageTag="dev-$(git rev-parse --short HEAD)" \
  --set keyVaultUri="$(az keyvault show --name $KV_NAME --resource-group $RG --query properties.vaultUri -o tsv)" \
  --set identity.clientId="$IDENTITY_CLIENT_ID" \
  --set ingress.host="chat.yourdomain.com" \
  --wait --timeout 5m
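Once the release is up, a pod can check from the inside whether Workload Identity took effect. The Workload Identity webhook injects a small set of environment variables into opted-in pods, and their absence is usually the first clue that the label or service account is wrong. The helper below is a hypothetical diagnostic, not part of the backend; it could be called at startup to fail fast with a readable message.

```python
import os
from typing import Mapping

# Variables the Azure Workload Identity webhook injects into opted-in pods
REQUIRED_VARS = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_FEDERATED_TOKEN_FILE")


def missing_workload_identity_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of expected variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_workload_identity_vars()
    if missing:
        print("Workload Identity not configured; missing:", ", ".join(missing))
    else:
        print("Workload Identity variables present")
```

A non-empty result typically points at a missing azure.workload.identity/use label on the pod template or a serviceAccountName that does not match the federated credential's subject.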

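Getting traffic into the cluster means installing an ingress controller and a certificate manager for automatic TLS. The Ingress further down references a cert-manager ClusterIssuer named letsencrypt-prod, which these commands do not create; define it separately before expecting certificates to be issued.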

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx --create-namespace \
--set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz \
--wait

helm repo add jetstack https://charts.jetstack.io
helm upgrade --install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace --set crds.enabled=true --wait

# helm/rag-chatbot/templates/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rag-chatbot-ingress
  namespace: {{ .Values.namespace }}
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts: [{{ .Values.ingress.host }}]
      secretName: {{ .Values.ingress.tlsSecretName }}
  rules:
    - host: {{ .Values.ingress.host }}
      http:
        paths:
          # Requests matched here reach FastAPI with the /api prefix intact, while the
          # backend serves /chat and /upload. Reconcile the two (a path rewrite, or
          # routing everything through the frontend proxy) before relying on this rule.
          - path: /api
            pathType: Prefix
            backend: { service: { name: rag-backend-svc, port: { number: 8000 } } }
          - path: /
            pathType: Prefix
            backend: { service: { name: rag-frontend-svc, port: { number: 3000 } } }

Phase 15 Defense in depth

It's worth stepping back at this point and looking at the security posture as a whole, because it isn't one control — it's four independent layers, and an attacker has to get through all of them to reach anything sensitive.

The network layer puts everything inside a 10.0.0.0/16 VNet behind deny-all NSGs, keeps AKS nodes off the public internet entirely, routes ingress only through the controller, terminates TLS 1.3 at the edge, and resolves internal names through private DNS zones. The identity layer is everything just covered — Managed and Workload Identity throughout, no service passwords anywhere, least-privilege RBAC, Entra ID OIDC, and no routine use of cluster-admin. The data layer is Key Vault plus encryption at rest and in transit, private endpoints in front of both Blob Storage and AI Search, 90-day soft-delete on the vault, with customer-managed keys as an option if you need it. And the application layer is where the chatbot-specific risks get handled: rate limiting at 100 requests per minute, JWT validation, a 50KB cap on input size, a prompt-injection guard, the Content Safety API, a strict CORS allowlist, and HSTS/CSP headers.

That last layer is worth a concrete example, since "prompt injection guard" can otherwise sound hand-wavy. In practice it's rate limiting plus some basic input screening sitting directly in front of the chat endpoint.

from fastapi import HTTPException, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

# `app` and `ChatRequest` are assumed to be defined earlier in the application module.
# Requests are counted per client IP address.
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/chat")
@limiter.limit("100/minute")
async def chat(req: ChatRequest, request: Request):
    # Enforce the 50KB cap in bytes rather than characters
    if len(req.question.encode("utf-8")) > 50_000:
        raise HTTPException(status_code=400, detail="Question too long")
    # Reject questions that contain a known injection phrase (case-insensitive)
    forbidden = ["ignore previous instructions", "system prompt", "jailbreak"]
    lower_q = req.question.lower()
    if any(phrase in lower_q for phrase in forbidden):
        raise HTTPException(status_code=400, detail="Invalid input")
    ...

Treat that phrase list as a starting point, not a finished defense — expand it based on your actual threat model rather than trusting a short blocklist to hold on its own.
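As one direction that expansion could take, here is a minimal sketch of a screening function that normalizes the input before matching, so extra whitespace, mixed case, and full-width Unicode look-alikes don't slip past a plain substring check. The function names and the regex patterns are hypothetical, chosen for illustration; they are not part of the original backend and are no more complete than the blocklist they replace.

```python
import re
import unicodedata
from typing import Optional

MAX_QUESTION_BYTES = 50 * 1024  # the 50KB input cap, measured in bytes

# HYPOTHETICAL patterns for illustration; tune them against your own threat model.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions"),
    re.compile(r"(reveal|print|show|repeat)\s+(your\s+|the\s+)?system\s+prompt"),
    re.compile(r"\bjailbreak\b"),
]


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms, lowercase, and collapse whitespace runs."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", folded).strip()


def screen_question(question: str) -> Optional[str]:
    """Return a short rejection reason, or None if the question passes screening."""
    if len(question.encode("utf-8")) > MAX_QUESTION_BYTES:
        return "too_long"
    cleaned = normalize(question)
    if any(pattern.search(cleaned) for pattern in SUSPICIOUS_PATTERNS):
        return "suspicious_phrase"
    return None
```

Matching on a verb plus "system prompt" rather than the bare phrase means a legitimate question such as "what is a system prompt?" still goes through, which a bare substring match would reject. Pattern matching of this kind only raises the bar; the Content Safety API and the answer-only-from-context instruction remain the stronger controls.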

Phase 16 Shipping changes: GitHub Actions end to end

The CI/CD pipeline is a fairly standard ten-stage flow: a push or PR to main triggers linting with ruff and tests with pytest, then a multi-stage Docker build, a Trivy scan that fails the build on HIGH or CRITICAL CVEs, a push of the SHA-tagged image to ACR, a deploy to a staging AKS namespace, Playwright smoke tests against staging, a manual approval gate, a rolling deploy to production, and finally a Slack notification either way. SHA-tagging every image means you can always trace a running container back to the exact commit that produced it. The two workflow listings that follow cover the lint, test, build, scan, push, and deploy stages; the Playwright smoke-test step is not shown in them.

# .github/workflows/ci.yml
name: CI
on:
  push: { branches: [main] }
  pull_request: { branches: [main] }
env:
  ACR_NAME: ${{ secrets.ACR_NAME }}
  REGISTRY: ${{ secrets.ACR_NAME }}.azurecr.io
jobs:
  test-backend:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11', cache: pip }
      - run: pip install -r backend/requirements.txt pytest ruff
      - run: ruff check backend/
      - run: pytest backend/ -v

  test-frontend:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: npm, cache-dependency-path: frontend/package-lock.json }
      - run: npm ci --prefix frontend
      - run: npm run build --prefix frontend

  build-and-push:
    needs: [test-backend, test-frontend]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    permissions: { id-token: write, contents: read }
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az acr login --name ${{ env.ACR_NAME }}
      # Build and scan before pushing, so a failing image never reaches the registry
      - run: docker build -t $REGISTRY/rag-backend:${{ github.sha }} ./backend
      # Consider pinning this action to a release tag or commit SHA instead of master
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/rag-backend:${{ github.sha }}
          exit-code: '1'
          severity: HIGH,CRITICAL
      - run: docker push $REGISTRY/rag-backend:${{ github.sha }}
      - run: |
          docker build --build-arg NEXT_PUBLIC_BACKEND_URL=${{ secrets.BACKEND_URL }} \
            -t $REGISTRY/rag-frontend:${{ github.sha }} ./frontend
          docker push $REGISTRY/rag-frontend:${{ github.sha }}

# .github/workflows/deploy.yml
# In a workflow_run trigger, github.sha is the tip of the default branch, not
# necessarily the commit CI built, so the image tag comes from workflow_run.head_sha.
name: Deploy
on:
  workflow_run:
    workflows: [CI]
    types: [completed]
    branches: [main]
jobs:
  deploy-staging:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    environment: staging
    permissions: { id-token: write, contents: read }
    steps:
      - uses: actions/checkout@v4
        with: { ref: '${{ github.event.workflow_run.head_sha }}' }
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az aks get-credentials -g ${{ secrets.RG }} -n ${{ secrets.AKS_NAME }}
      - run: |
          helm upgrade --install rag-chatbot ./helm/rag-chatbot \
            --namespace staging --create-namespace \
            --set imageTag=${{ github.event.workflow_run.head_sha }} \
            --set acr=${{ secrets.ACR_NAME }}.azurecr.io \
            --set keyVaultUri=${{ secrets.KV_URI }} \
            --set identity.clientId=${{ secrets.MI_CLIENT_ID }} \
            --set ingress.host=staging-chat.yourdomain.com \
            --wait --timeout 5m

  deploy-prod:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production # requires manual approval in GitHub Environments
    permissions: { id-token: write, contents: read }
    steps:
      - uses: actions/checkout@v4
        with: { ref: '${{ github.event.workflow_run.head_sha }}' }
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az aks get-credentials -g ${{ secrets.RG }} -n ${{ secrets.AKS_NAME }}
      - run: |
          helm upgrade --install rag-chatbot ./helm/rag-chatbot \
            --namespace default \
            --set imageTag=${{ github.event.workflow_run.head_sha }} \
            --set acr=${{ secrets.ACR_NAME }}.azurecr.io \
            --set keyVaultUri=${{ secrets.KV_URI }} \
            --set identity.clientId=${{ secrets.MI_CLIENT_ID }} \
            --set ingress.host=chat.yourdomain.com \
            --wait --timeout 5m
      - if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: '{"text":"${{ job.status == ''success'' && ''✓'' || ''✗'' }} RAG Chatbot deploy to prod: ${{ job.status }} (${{ github.event.workflow_run.head_sha }})"}'
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

The pipeline expects ten secrets configured in the repo: AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, ACR_NAME, AKS_NAME, RG, KV_URI, MI_CLIENT_ID, BACKEND_URL, and SLACK_WEBHOOK.

Phase 17 Watching it run

Once the app is live, Container Insights gives you the operational picture — pod health, error rates, latency — without bolting on a separate observability stack.

az aks enable-addons \
  --name $AKS_NAME \
  --resource-group $RG \
  --addons monitoring \
  --workspace-resource-id $(az monitor log-analytics workspace create \
    --resource-group $RG --workspace-name "${PREFIX}-logs" --query id -o tsv)

A handful of KQL queries cover most day-to-day questions — which pods are restarting, whether the backend is throwing 5xx errors, and how request latency is trending.

// Average response latency (from FastAPI logs)
ContainerLog
| where Name contains "rag-backend"
| where LogEntry matches regex @'"duration":\d+'
| extend duration = toint(extract('"duration":(\\d+)', 1, LogEntry))
| summarize avg(duration), percentile(duration, 95) by bin(TimeGenerated, 5m)
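The latency query only returns rows if the backend actually writes a `"duration":<integer>` field into each log line, with no space after the colon. A minimal sketch of a log formatter that satisfies that regex follows, assuming JSON access logs and a duration in milliseconds. The function name and the other field names (`method`, `path`, `status`) are hypothetical and not taken from the original backend code.

```python
import json
import re

# The same pattern the KQL query filters on.
DURATION_PATTERN = re.compile(r'"duration":\d+')


def access_log_line(method: str, path: str, status: int, duration_ms: float) -> str:
    """Build one compact JSON access-log line with an integer duration."""
    record = {
        "method": method,
        "path": path,
        "status": status,
        "duration": int(duration_ms),  # the query extracts digits only
    }
    # Compact separators matter: json.dumps defaults to '"duration": 412'
    # (with a space after the colon), which the query's regex does not match.
    return json.dumps(record, separators=(",", ":"))


line = access_log_line("POST", "/chat", 200, 412.7)
print(line)  # {"method":"POST","path":"/chat","status":200,"duration":412}
```

Printing one such line per request to stdout is enough for Container Insights to collect it. If the existing logger uses the default `json.dumps` spacing, the alternative is to loosen the regex in the query to allow optional whitespace.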

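And an alert rule means you find out about restart loops from Azure Monitor instead of from a user's bug report. The metric named in --condition has to be one the cluster actually publishes to Azure Monitor; kube_pod_container_status_restarts_total is the kube-state-metrics (Prometheus) name, so confirm it shows up in az monitor metrics list-definitions for the cluster before relying on the rule.

az monitor metrics alert create \
  --resource-group $RG \
  --name "rag-backend-pod-restarts" \
  --scopes $(az aks show -g $RG -n $AKS_NAME --query id -o tsv) \
  --condition "avg kube_pod_container_status_restarts_total > 3" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2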

Phase 18 Proving it works: tests and load

A few layers of testing are worth having before you trust this with real traffic. Functional tests confirm the extraction logic handles each file type correctly and rejects the ones it shouldn't.

# tests/test_ingest.py
import pytest
from app.ingest import extract_text

def test_extract_pdf():
    with open("tests/fixtures/sample.pdf", "rb") as f:
        text = extract_text("sample.pdf", f.read())
    assert len(text) > 100
    assert isinstance(text, str)

def test_extract_docx():
    with open("tests/fixtures/sample.docx", "rb") as f:
        text = extract_text("sample.docx", f.read())
    assert len(text) > 10

def test_unsupported_type():
    with pytest.raises(ValueError, match="Unsupported file type"):
        extract_text("file.xlsx", b"data")

RAG quality tests check that retrieval is actually returning something sensible against a document you've already ingested.

# tests/test_rag_quality.py
import pytest
from app.chat import retrieve_chunks

@pytest.mark.integration
def test_retrieval_finds_relevant_chunk():
    chunks = retrieve_chunks("What is the main topic of the document?", top_k=3)
    assert len(chunks) == 3
    assert all("content" in c for c in chunks)

@pytest.mark.integration
def test_retrieval_returns_source():
    chunks = retrieve_chunks("any question", top_k=1)
    assert "source" in chunks[0]
    assert chunks[0]["source"].endswith((".pdf", ".docx", ".txt"))
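Those two tests check the shape of what comes back, not whether it is relevant. One way to get a rough relevance signal is a hit rate over a handful of hand-written question and expected-phrase pairs. The helper below is a sketch under that assumption: `hit_rate_at_k` is a hypothetical name, and the retriever is passed in as a callable so the function makes no claims about the real `retrieve_chunks` signature beyond chunks being dicts with a "content" key.

```python
from typing import Callable, Dict, List, Tuple

Chunk = Dict[str, str]
Retriever = Callable[[str, int], List[Chunk]]


def hit_rate_at_k(retrieve: Retriever, cases: List[Tuple[str, str]], k: int = 3) -> float:
    """Fraction of questions whose expected phrase appears in any of the top-k chunks."""
    if not cases:
        raise ValueError("cases must not be empty")
    hits = 0
    for question, expected_phrase in cases:
        chunks = retrieve(question, k)
        # Case-insensitive substring match against each returned chunk's text.
        if any(expected_phrase.lower() in c["content"].lower() for c in chunks):
            hits += 1
    return hits / len(cases)
```

In an integration test this could be called as `hit_rate_at_k(lambda q, k: retrieve_chunks(q, top_k=k), cases)` and asserted against a threshold you pick from a baseline run, so a change to chunking or index settings that hurts retrieval shows up as a failing test.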

And a Locust script answers the question that matters before launch: how does this hold up under concurrent users?

# tests/locustfile.py
from locust import HttpUser, task, between

class ChatUser(HttpUser):
    wait_time = between(2, 5)

    @task(3)
    def ask_question(self):
        self.client.post("/chat", json={"question": "What are the key points in the document?", "history": []}, timeout=30)

    @task(1)
    def health_check(self):
        self.client.get("/health")

locust -f tests/locustfile.py \
  --host=https://chat.yourdomain.com \
  --users=50 --spawn-rate=5 --run-time=5m --headless
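Before reading too much into the results, it helps to estimate how much traffic this run generates, because all 50 simulated users share the load generator's IP address and the chat endpoint is limited to 100 requests per minute per address. The back-of-the-envelope calculation below is a sketch: the function name is hypothetical and the 2-second mean response time is an assumption, so substitute whatever latency you actually observe.

```python
def expected_requests_per_minute(
    users: int,
    wait_min_s: float,
    wait_max_s: float,
    mean_latency_s: float,
    task_weight: int,
    total_weight: int,
) -> float:
    """Estimate steady-state requests per minute for one weighted Locust task."""
    # Each simulated user loops: send a request, wait for the response, then pause.
    cycle_s = (wait_min_s + wait_max_s) / 2 + mean_latency_s
    all_tasks_per_minute = users / cycle_s * 60
    return all_tasks_per_minute * task_weight / total_weight


# 50 users, between(2, 5) wait, /chat weighted 3 out of 4, ASSUMED 2 s mean latency.
chat_rpm = expected_requests_per_minute(50, 2, 5, 2.0, task_weight=3, total_weight=4)
print(f"~{chat_rpm:.0f} /chat requests per minute")  # ~409 /chat requests per minute
```

At roughly four times the per-address limit, most /chat calls in a single-machine run would come back as 429 responses, so the test would measure the rate limiter more than the backend. Running against staging with a raised limit, or spreading the load across several source addresses, gives a more meaningful number.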

Phase 19 What it costs, and how to tear it down

For moderate usage in East US, the running cost lands somewhere in the neighborhood of $270–330 a month: about $140 for two D2s_v3 AKS nodes, $30–80 for GPT-4o usage, $5–15 for embeddings, $75 for AI Search on the Basic tier, and small change for Blob Storage, ACR, Key Vault, and Log Analytics. Even a light prototype, say 100 questions a day at roughly 1,000 output tokens each (about 3 million output tokens a month), works out to around $45 a month in GPT-4o output charges alone, so the OpenAI line item scales with usage far more than the infrastructure does. Capping the tokens-per-minute quota on the Azure OpenAI deployment is the cheapest insurance against a runaway bill.
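The GPT-4o estimate is simple enough to recompute for your own traffic. The sketch below reproduces the $45 figure under the assumption of $15 per million output tokens, which is the rate that figure implies rather than a quoted price; check the current Azure OpenAI price sheet before budgeting. The function name and the optional input-token parameters are hypothetical additions.

```python
def monthly_gpt_cost(
    questions_per_day: int,
    output_tokens_per_answer: int,
    usd_per_million_output: float,
    input_tokens_per_question: int = 0,
    usd_per_million_input: float = 0.0,
    days: int = 30,
) -> float:
    """Estimate monthly completion spend in USD from per-question token counts."""
    questions = questions_per_day * days
    output_cost = questions * output_tokens_per_answer / 1_000_000 * usd_per_million_output
    input_cost = questions * input_tokens_per_question / 1_000_000 * usd_per_million_input
    return output_cost + input_cost


# Output tokens only, at an ASSUMED $15 per million output tokens.
print(monthly_gpt_cost(100, 1_000, 15.0))  # 45.0
```

In a RAG setup every question also carries the retrieved chunks as input tokens, so passing realistic values for `input_tokens_per_question` and `usd_per_million_input` gives a fuller picture than the output-only number.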

Phase 20 — Putting it on a resume

If this project is meant to demonstrate skill rather than just run in the background, it's worth writing up plainly what it actually shows: Azure OpenAI integration with both a chat and an embeddings model, hybrid vector-plus-BM25 search through Azure AI Search, a private AKS deployment with autoscaling and OIDC-based Workload Identity, a four-layer security model spanning network, identity, data, and application controls, a full lint-test-scan-deploy CI/CD pipeline with a manual production gate, and operational visibility through Container Insights and KQL. A short local quick-start rounds it out.

# Backend
cd backend
pip install -r requirements.txt
export KEY_VAULT_URI=https://your-kv.vault.azure.net/
uvicorn app.main:app --reload

# Frontend (in a second terminal, from the repository root)
cd frontend
npm install
NEXT_PUBLIC_BACKEND_URL=http://localhost:8000 npm run dev

None of the individual pieces here are unusual on their own — Azure OpenAI, AI Search, AKS, Key Vault are all well-documented services. What makes the project worth showing off is that they're wired together the way a real production system would be: no secrets in code, no public storage, a real CI/CD gate before anything reaches users, and a monitoring setup that tells you when something breaks before a user does.

Induranga Lokuliyanage
