Phase 9 · Chapter 9.02
Chatbot Deployment System
Building a ChatGPT clone is easy; serving 1,000 users simultaneously in production is hard. There are three challenges: streaming, context, and cost.
Components
The 6 pieces of a chatbot stack
- LLM provider: OpenAI, Anthropic, self-hosted (Llama, Mistral).
- Prompt layer: system prompt, few-shot, persona.
- Context manager: conversation history, summarization.
- RAG: retrieval over knowledge base.
- Tool use: function calling, agents.
- Safety: moderation, guardrails, rate limit.
Streaming
SSE endpoint, FastAPI
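```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment


class ChatRequest(BaseModel):
    session_id: str
    message: str


# load_history() and save_turn() are the app's own session-store helpers (not shown).
@app.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    history = await load_history(req.session_id)
    history.append({"role": "user", "content": req.message})

    def gen():
        full = []
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=history,
            stream=True,
        )
        for chunk in stream:
            if not chunk.choices:  # some chunks carry no choices
                continue
            tok = chunk.choices[0].delta.content or ""
            if not tok:
                continue
            full.append(tok)
            # JSON-encode so a newline inside a token cannot break SSE framing
            yield f"data: {json.dumps(tok)}\n\n"
        # Persist the full reply once the stream has finished
        save_turn(req.session_id, req.message, "".join(full))
        yield "data: [DONE]\n\n"

    return StreamingResponse(gen(), media_type="text/event-stream")
```
Context Window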
Manage the token budget
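The functions in this section call `count_tokens`, which the chapter leaves undefined. The following is a minimal sketch, assuming roughly four characters per token and a small fixed overhead per message. Both constants are illustrative guesses, not provider figures. For billing-grade counts, use the provider's own tokenizer (for OpenAI models, the `tiktoken` package).

```python
import math

CHARS_PER_TOKEN = 4        # assumption: rough average for English text
PER_MESSAGE_OVERHEAD = 4   # assumption: role and formatting tokens per message


def count_tokens(messages):
    """Estimate the token count of a list of chat messages."""
    total = 0
    for m in messages:
        total += PER_MESSAGE_OVERHEAD + math.ceil(len(m["content"]) / CHARS_PER_TOKEN)
    return total
```

An estimate like this is good enough for deciding when to trim, but it will drift for code, non-English text, and long numbers, so leave headroom in the budget.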
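```python
MAX_TOKENS = 8000
RESERVED_REPLY = 1000

# count_tokens() and summarize() are helpers defined elsewhere (not shown here).


def trim_history(messages):
    """Drop the oldest non-system messages until the prompt fits the budget."""
    while count_tokens(messages) > MAX_TOKENS - RESERVED_REPLY:
        # 1) drop oldest non-system turn
        for i, m in enumerate(messages):
            if m["role"] != "system":
                messages.pop(i)
                break
        else:
            # Only system messages remain; stop instead of looping forever
            break
    return messages


# Better: summarize older turns
def compress(messages):
    if count_tokens(messages) < MAX_TOKENS * 0.7:
        return messages
    # messages[0] is assumed to be the system prompt; the last 6 messages stay verbatim
    old, recent = messages[1:-6], messages[-6:]
    if not old:  # nothing older than the recent window, so nothing to summarize
        return messages
    summary = summarize(old)
    return [
        messages[0],
        {"role": "system", "content": f"Earlier summary: {summary}"},
        *recent,
    ]
```
RAG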
Retrieval-Augmented Generation
```python
from sentence_transformers import SentenceTransformer
import chromadb

embedder = SentenceTransformer("all-MiniLM-L6-v2")
# The "docs" collection must already exist and be populated
db = chromadb.PersistentClient(path="./vdb").get_collection("docs")


def augment(question: str, k: int = 4) -> list[dict]:
    # Embed the question and fetch the k nearest document chunks
    qv = embedder.encode([question]).tolist()
    hits = db.query(query_embeddings=qv, n_results=k)
    context = "\n---\n".join(hits["documents"][0])
    return [
        {"role": "system", "content":
            "Answer ONLY from the context. If unknown, say so.\n\n" + context},
        {"role": "user", "content": question},
    ]
```
Safety Layer
Production needs guardrails
- Input moderation: OpenAI moderation or Llama Guard to flag toxic content and PII.
- Output filter: regex plus an LLM judge to check for leakage and hallucination.
- Rate limit: per-user TPM (tokens per minute) and a daily budget.
- Prompt injection defense: isolate the system prompt and tag user content.
- Audit log: session, prompt, and response, for debugging and compliance.
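The per-user TPM limit from the list above can be sketched as a sliding-window counter. This is an illustration only: the class name `TokenRateLimiter`, the 60-second window, and the in-memory storage are assumptions, and a multi-pod deployment would need a shared store instead of a process-local dictionary.

```python
import time
from collections import defaultdict, deque


class TokenRateLimiter:
    """Sliding-window limiter: at most `tpm` tokens per user per window."""

    def __init__(self, tpm: int, window_s: float = 60.0, clock=time.monotonic):
        self.tpm = tpm
        self.window_s = window_s
        self.clock = clock  # injectable so tests can control time
        self.events = defaultdict(deque)  # user_id -> deque of (timestamp, tokens)

    def allow(self, user_id: str, tokens: int) -> bool:
        now = self.clock()
        q = self.events[user_id]
        # Forget usage that has slid out of the window
        while q and now - q[0][0] >= self.window_s:
            q.popleft()
        used = sum(t for _, t in q)
        if used + tokens > self.tpm:
            return False  # over budget: reject without recording usage
        q.append((now, tokens))
        return True
```

Call `allow(user_id, estimated_tokens)` before sending a request to the model and return HTTP 429 when it is `False`. A daily budget can reuse the same class with a larger window.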
Cost Control
Keeping the LLM bill under control
- Semantic cache (GPTCache): return the cached answer for a similar query instead of calling the model again.
- Cheaper model fallback: route easy queries to gpt-4o-mini and hard ones to gpt-4o.
- Prompt compression: shrink the context 2–3x with LLMLingua.
- Asynchronous batch processing for non-real-time work: many providers offer a 50% discount.
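To make the semantic cache idea concrete, here is a hand-rolled sketch. It does not use the GPTCache API; it only shows the principle of matching a new query against stored ones by cosine similarity. The `embed` callable, the `SemanticCache` class, and the 0.9 threshold are all assumptions to be replaced with a real embedding model and a threshold tuned on actual traffic.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # callable: str -> list of floats
        self.threshold = threshold  # assumption: tune on real traffic
        self.entries = []           # list of (vector, answer)

    def get(self, query: str):
        """Return the cached answer for the most similar query, or None."""
        qv = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(qv, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```

Check `get` before calling the model and call `put` after a successful reply. The linear scan is fine for a few thousand entries; beyond that, a vector index is the usual choice.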
Pitfalls
Production reality
- Proxy buffering left on while streaming: tokens reach the client in bursts. In NGINX, set proxy_buffering off.
- Session storage in memory: chats are lost when a pod restarts. Use Redis.
- No timeout: if the LLM hangs, the user waits 5 minutes.
- Wrong token counts: the bill comes as a shock.
- No hallucination testing: 90% of cases are fine, 10% are confidently wrong.
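For the missing-timeout pitfall, one option is to wrap the model call in `asyncio.wait_for`. The sketch below assumes the call is an awaitable; the helper name `with_timeout` and the fallback message are hypothetical. The OpenAI Python client also accepts a `timeout` argument when the client is constructed, which is worth setting as well.

```python
import asyncio


async def with_timeout(coro, seconds: float, fallback: str):
    """Await `coro`, but return `fallback` if it takes longer than `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the pending call at this point
        return fallback


async def slow_llm_call():
    await asyncio.sleep(10)  # stands in for a hung provider request
    return "answer"


async def main():
    reply = await with_timeout(slow_llm_call(), 0.1, "Sorry, please try again.")
    print(reply)


if __name__ == "__main__":
    asyncio.run(main())
```

The user gets a clear failure after a bounded wait instead of a spinner that never resolves.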
Mini Project
Streaming RAG bot
- Index 10 markdown docs into Chroma.
- Build a FastAPI /chat/stream endpoint that uses SSE.
- Keep session history in Redis, with a 10-turn limit.
- Test it with a simple HTML chat UI.
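As a starting point for the Redis step, the 10-turn limit can be kept with a list that is trimmed on every write. This sketch assumes `r` is a redis-py style client (it only uses `rpush`, `ltrim`, and `lrange`). The function names `append_turn` and `read_history` and the `chat:` key prefix are hypothetical, and these helpers are synchronous, so they would need adapting to match the calls in the streaming endpoint.

```python
import json

MAX_TURNS = 10                # from the project spec: keep the last 10 turns
MAX_MESSAGES = MAX_TURNS * 2  # one user and one assistant message per turn


def _key(session_id: str) -> str:
    return f"chat:{session_id}"  # hypothetical key naming scheme


def append_turn(r, session_id: str, user_msg: str, assistant_msg: str) -> None:
    """Append one turn, then trim the list to the newest MAX_MESSAGES entries."""
    key = _key(session_id)
    r.rpush(
        key,
        json.dumps({"role": "user", "content": user_msg}),
        json.dumps({"role": "assistant", "content": assistant_msg}),
    )
    r.ltrim(key, -MAX_MESSAGES, -1)


def read_history(r, session_id: str) -> list:
    """Return the stored messages, oldest first, as role/content dicts."""
    return [json.loads(item) for item in r.lrange(_key(session_id), 0, -1)]
```

Trimming on write keeps reads cheap and bounds memory per session. Setting an expiry on the key is a sensible addition so abandoned sessions do not accumulate.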
Takeaway
Remember
Chatbot = streaming + context + guardrails + cost. Build a wrapper around the model; don't build the model yourself.