Phase 9 · Chapter 9.02

Chatbot Deployment System

Building a ChatGPT clone is easy; serving 1,000 simultaneous users in production is hard. There are three challenges: streaming, context, and cost.

Components

The 6 pieces of a chatbot stack

LLM provider: OpenAI, Anthropic, or self-hosted (Llama, Mistral).
Prompt layer: system prompt, few-shot examples, persona.
Context manager: conversation history, summarization.
RAG: retrieval over a knowledge base.
Tool use: function calling, agents.
Safety: moderation, guardrails, rate limits.

Streaming

SSE endpoint, FastAPI

python · production

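import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

class ChatRequest(BaseModel):
    session_id: str
    message: str

# load_history (async) and save_turn (sync) are storage helpers defined elsewhere.

@app.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    history = await load_history(req.session_id)
    history.append({"role": "user", "content": req.message})

    def gen():
        full = []
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=history,
            stream=True,
        )
        for chunk in stream:
            if not chunk.choices:
                continue  # some chunks carry no choices
            tok = chunk.choices[0].delta.content
            if not tok:
                continue  # role-only or final chunk with no text
            full.append(tok)
            # JSON-encode so a newline inside a token cannot break SSE framing;
            # the client must json-decode each data payload
            yield f"data: {json.dumps(tok)}\n\n"
        # Persist the turn only after the full reply has been streamed
        save_turn(req.session_id, req.message, "".join(full))
        yield "data: [DONE]\n\n"

    return StreamingResponse(gen(), media_type="text/event-stream")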
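
On the receiving side, a client has to undo the `data:` framing and stop at the `[DONE]` sentinel. The following is a minimal sketch of that parsing step, not a full SSE client: `iter_sse_data` and its `decode` parameter are hypothetical names, and the input is assumed to be an iterable of already-decoded text lines. If the server JSON-encodes each token, pass `json.loads` as `decode`.

```python
import json
from typing import Callable, Iterable, Iterator


def iter_sse_data(
    lines: Iterable[str],
    decode: Callable[[str], str] = lambda s: s,
) -> Iterator[str]:
    """Yield the payload of each `data:` line until the [DONE] sentinel."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separator lines, comments, other SSE fields
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # SSE drops one space after the colon
        if payload == "[DONE]":
            return
        yield decode(payload)


raw = ["data: Hel\n", "\n", "data: lo\n", "\n", "data: [DONE]\n", "\n"]
print("".join(iter_sse_data(raw)))  # Hello
```

Keeping the parser independent of the HTTP library means the same function can be fed lines from any streaming client.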

Context Window

Manage the token budget

python · production

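MAX_TOKENS = 8000
RESERVED_REPLY = 1000
KEEP_RECENT = 6  # most recent messages kept verbatim

# count_tokens(messages) and summarize(messages) are helpers defined elsewhere.

def trim_history(messages):
    budget = MAX_TOKENS - RESERVED_REPLY
    while count_tokens(messages) > budget:
        # Drop the oldest non-system turn
        for i, m in enumerate(messages):
            if m["role"] != "system":
                messages.pop(i)
                break
        else:
            break  # only system messages left; avoid looping forever
    return messages

# Better: summarize older turns instead of dropping them
def compress(messages):
    if count_tokens(messages) < MAX_TOKENS * 0.7:
        return messages
    # Assumes messages[0] is the system prompt
    if len(messages) <= KEEP_RECENT + 1:
        return messages  # too few turns to split into old and recent
    old, recent = messages[1:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = summarize(old)
    return [messages[0],
            {"role": "system", "content": f"Earlier summary: {summary}"},
            *recent]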
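
Both functions above rely on `count_tokens`, which the chapter leaves undefined. A rough stand-in could look like the sketch below. It assumes about 4 characters per token and a fixed per-message overhead; both numbers are rules of thumb rather than exact values, and `approx_tokens` and `PER_MESSAGE_OVERHEAD` are hypothetical names. For billing-grade counts, swap in the provider's own tokenizer through the `tokens_in` parameter.

```python
from typing import Callable

PER_MESSAGE_OVERHEAD = 4  # assumed allowance for role and formatting tokens


def approx_tokens(text: str) -> int:
    # Heuristic only: roughly 4 characters per token for English text
    return max(1, len(text) // 4)


def count_tokens(
    messages: list[dict],
    tokens_in: Callable[[str], int] = approx_tokens,
) -> int:
    """Estimate the prompt size of a chat history in tokens."""
    return sum(tokens_in(m["content"]) + PER_MESSAGE_OVERHEAD for m in messages)


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "a" * 40},
]
print(count_tokens(history))  # 7 + 4 + 10 + 4 = 25
```

With `MAX_TOKENS = 8000` and `RESERVED_REPLY = 1000`, the history may use at most 7000 tokens by this estimate before trimming starts.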

RAG

Retrieval-Augmented Generation

python · production

from sentence_transformers import SentenceTransformer
import chromadb

embedder = SentenceTransformer("all-MiniLM-L6-v2")
db = chromadb.PersistentClient("./vdb").get_collection("docs")

def augment(question: str, k: int = 4) -> list[dict]:
    qv = embedder.encode([question]).tolist()
    hits = db.query(query_embeddings=qv, n_results=k)
    context = "\n---\n".join(hits["documents"][0])
    return [
        {"role": "system", "content":
            "Answer ONLY from the context. If unknown, say so.\n\n" + context},
        {"role": "user", "content": question},
    ]
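
Retrieval only works if the documents were split into chunks before they were embedded and stored. As a sketch of that preparation step, the function below cuts text into fixed-size character windows with overlap. `chunk_text` and its default sizes are assumptions for illustration; real pipelines often split on headings or sentences instead.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size].strip()
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


doc = "a" * 1000
print([len(c) for c in chunk_text(doc, size=500, overlap=100)])  # [500, 500, 200]
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.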

Safety Layer

Production needs guardrails

Input moderation: OpenAI moderation or Llama Guard to flag toxic content and PII.
Output filter: regex plus an LLM judge to check for leakage and hallucination.
Rate limit: per-user TPM (tokens per minute) and a daily budget.
Prompt injection defense: isolate the system prompt and tag user content.
Audit log: session, prompt, and response, for debugging and compliance.
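
The per-user TPM limit from the list above can be sketched as a sliding window over the last 60 seconds. This is an in-memory illustration only: `TokenRateLimiter` is a hypothetical class, and a multi-pod deployment would need shared state (for example in Redis) rather than a local dictionary.

```python
import time
from collections import defaultdict, deque


class TokenRateLimiter:
    """Allow at most `tpm_limit` tokens per user within a sliding window."""

    def __init__(self, tpm_limit: int, window_s: float = 60.0, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.window_s = window_s
        self.clock = clock  # injectable so tests can control time
        self._events = defaultdict(deque)  # user_id -> deque of (timestamp, tokens)
        self._used = defaultdict(int)      # user_id -> tokens inside the window

    def allow(self, user_id: str, tokens: int) -> bool:
        now = self.clock()
        events = self._events[user_id]
        # Expire usage that has fallen out of the window
        while events and now - events[0][0] >= self.window_s:
            _, old_tokens = events.popleft()
            self._used[user_id] -= old_tokens
        if self._used[user_id] + tokens > self.tpm_limit:
            return False
        events.append((now, tokens))
        self._used[user_id] += tokens
        return True


limiter = TokenRateLimiter(tpm_limit=100)
print(limiter.allow("alice", 60))  # True
print(limiter.allow("alice", 50))  # False: 110 tokens would exceed the limit
```

A daily budget can reuse the same class with a longer `window_s` and a larger limit.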

Cost Control

Keep the LLM bill under control

Semantic cache (GPTCache): return a cached answer for similar queries instead of calling the model again.
Cheaper model fallback: route easy queries to gpt-4o-mini and hard ones to gpt-4o.
Prompt compression: shrink the context 2–3x with LLMLingua.
Batch async (non-real-time): many providers offer a 50% discount.
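
To make the semantic cache idea concrete, here is a minimal sketch that compares query embeddings by cosine similarity and returns a stored answer above a threshold. It does not use GPTCache's API: `SemanticCache`, the 0.9 threshold, and the toy word-count embedding are all assumptions for illustration, and a real setup would plug in a proper embedding model and a vector index instead of a linear scan.

```python
import math
from typing import Callable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold
        self._entries: list[tuple[Sequence[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the answer of the most similar cached query, or None on a miss."""
        qv = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = cosine(qv, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self.embed(query), answer))


# Toy embedding: counts of a few keywords (stand-in for a real model)
VOCAB = ["reset", "password", "refund", "order"]

def toy_embed(text: str) -> list[float]:
    return [float(text.lower().count(w)) for w in VOCAB]


cache = SemanticCache(toy_embed)
cache.put("How do I reset my password?", "Use the reset link.")
print(cache.get("reset password please"))  # Use the reset link.
print(cache.get("refund my order"))        # None
```

The threshold is the main tuning knob: too low and users get answers to a different question, too high and the cache rarely hits.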

Pitfalls

Production reality

Streaming with proxy buffering on: tokens reach the client in bursts. Set proxy_buffering off in NGINX.
Session storage in memory: a pod restart loses the chat. Use Redis.
No timeout: if the LLM hangs, the user waits 5 minutes.
Wrong token count: bill shock.
No hallucination testing: 90% of cases are fine, 10% are confidently wrong.
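
For the missing-timeout pitfall, one possible guard is to wrap the model call in `asyncio.wait_for` and return a fallback message when the deadline passes. The sketch below assumes an async call path; `with_timeout` and `slow_llm_call` are hypothetical names, with the latter standing in for a real LLM request.

```python
import asyncio


async def with_timeout(coro, seconds: float, fallback: str) -> str:
    """Await `coro`, but give up and return `fallback` after `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return fallback


async def slow_llm_call() -> str:
    # Stand-in for a hanging provider request
    await asyncio.sleep(1.0)
    return "real answer"


print(asyncio.run(with_timeout(slow_llm_call(), 0.05, "Sorry, that took too long.")))
# Sorry, that took too long.
```

`wait_for` cancels the wrapped coroutine on timeout, so the hung request does not keep running in the background.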

Mini Project

Streaming RAG bot

Index 10 markdown docs in Chroma.
Build a FastAPI /chat/stream endpoint that uses SSE.
Keep session history in Redis with a 10-turn limit.
Test it with a simple HTML chat UI.
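
The Redis step could be sketched with a list per session that is trimmed after every write. The functions below take the client as a parameter and use only `rpush`, `ltrim`, and `lrange`; `push_turn`, `fetch_history`, and the `chat:` key prefix are hypothetical names, and one turn is assumed to mean a user message plus the assistant reply.

```python
import json

MAX_TURNS = 10  # one turn = user message + assistant reply


def push_turn(r, session_id: str, user_msg: str, reply: str) -> None:
    """Append one turn and keep only the most recent MAX_TURNS turns."""
    key = f"chat:{session_id}"
    r.rpush(
        key,
        json.dumps({"role": "user", "content": user_msg}),
        json.dumps({"role": "assistant", "content": reply}),
    )
    # Negative indexes count from the end: keep the last 2 * MAX_TURNS entries
    r.ltrim(key, -2 * MAX_TURNS, -1)


def fetch_history(r, session_id: str) -> list[dict]:
    """Return the stored messages for a session, oldest first."""
    return [json.loads(item) for item in r.lrange(f"chat:{session_id}", 0, -1)]


# Usage with redis-py (requires a running Redis server):
#   import redis
#   r = redis.Redis(decode_responses=True)
#   push_turn(r, "session-1", "Hi", "Hello! How can I help?")
#   print(fetch_history(r, "session-1"))
```

Trimming on write keeps every read cheap and bounds memory per session without a separate cleanup job.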

Takeaway

Remember

Chatbot = streaming + context + guardrails + cost. Build a wrapper around the model; don't build the model yourself.
