Phase 9 · Chapter 9.02

Chatbot Deployment System

Building a ChatGPT clone is easy; serving 1,000 simultaneous users in production is hard. There are three challenges: streaming, context, and cost.

Components

The 6 pieces of a chatbot stack

  • LLM provider: OpenAI, Anthropic, or self-hosted (Llama, Mistral).
  • Prompt layer: system prompt, few-shot examples, persona.
  • Context manager: conversation history, summarization.
  • RAG: retrieval over a knowledge base.
  • Tool use: function calling, agents.
  • Safety: moderation, guardrails, rate limits.
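As an illustration of the prompt layer, the sketch below assembles one request from a system prompt, few-shot pairs, stored history, and the new user turn. The function name `build_messages` and the sample strings are hypothetical; only the role/content message shape comes from the code used elsewhere in this chapter.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def build_messages(
    system_prompt: str,
    few_shot: List[Tuple[str, str]],
    history: List[Message],
    user_message: str,
) -> List[Message]:
    """Assemble one request: system prompt, few-shot pairs, history, new turn."""
    messages: List[Message] = [{"role": "system", "content": system_prompt}]
    # Each few-shot example is replayed as a user/assistant exchange.
    for question, answer in few_shot:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})
    return messages


msgs = build_messages(
    system_prompt="You are a concise support assistant.",  # persona (example)
    few_shot=[("How do I reset my password?", "Go to Settings > Security.")],
    history=[],
    user_message="How do I change my email?",
)
```

Keeping this assembly in one function makes it easier to change the persona or the few-shot examples without touching the endpoint code.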
Streaming

SSE endpoint, FastAPI

python · production
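import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment


class ChatRequest(BaseModel):
    session_id: str
    message: str


# load_history() and save_turn() are the app's own session-store helpers (not shown).

@app.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    history = await load_history(req.session_id)
    history.append({"role": "user", "content": req.message})

    def gen():
        # Sync generator: StreamingResponse iterates it in a worker thread.
        full = []
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=history,
            stream=True,
        )
        for chunk in stream:
            if not chunk.choices:
                continue  # some chunks carry no choices
            tok = chunk.choices[0].delta.content
            if not tok:
                continue  # skip role-only or empty deltas
            full.append(tok)
            # JSON-encode so a newline inside a token cannot break SSE framing;
            # the client decodes each payload with JSON.parse / json.loads.
            yield f"data: {json.dumps(tok)}\n\n"
        save_turn(req.session_id, req.message, "".join(full))
        yield "data: [DONE]\n\n"

    return StreamingResponse(gen(), media_type="text/event-stream")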
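On the consuming side, a client has to split the stream back into events. The following is a minimal sketch of that parsing, assuming the framing used by the endpoint (one `data:` field per event, a blank line between events, and a `[DONE]` sentinel at the end). The helper name `iter_sse_data` is hypothetical, and the sample lines stand in for a real HTTP response.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str], done: str = "[DONE]") -> Iterator[str]:
    """Yield the data payload of each SSE event until the done sentinel."""
    buf: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the event; dispatch it if it carried data.
            if buf:
                data = "\n".join(buf)
                buf = []
                if data == done:
                    return
                yield data
            continue
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format strips one optional space after the colon.
            if value.startswith(" "):
                value = value[1:]
            buf.append(value)
        # Other fields (event:, id:, retry:) and comments are ignored here.


# Sample lines as they would arrive from the stream (illustrative data).
events = ["data: Hel\n", "\n", "data: lo\n", "\n", "data: [DONE]\n", "\n"]
reply = "".join(iter_sse_data(events))
```

Any iterable of text lines can be fed in, such as the line iterator of an HTTP client. If the server JSON-encodes each payload, decode every yielded string with `json.loads` before joining.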
Context Window

Manage the token budget

python · production
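MAX_TOKENS = 8000
RESERVED_REPLY = 1000  # head-room kept free for the model's reply

# count_tokens() and summarize() are app helpers (not shown).

def trim_history(messages):
    budget = MAX_TOKENS - RESERVED_REPLY
    while count_tokens(messages) > budget:
        # drop the oldest non-system turn
        for i, m in enumerate(messages):
            if m["role"] != "system":
                messages.pop(i)
                break
        else:
            break  # only system messages left; stop instead of looping forever
    return messages

# Better: summarize older turns
def compress(messages):
    # assumes messages[0] is the system prompt; with 7 or fewer messages
    # there is nothing older than the 6 most recent turns to summarize
    if count_tokens(messages) < MAX_TOKENS * 0.7 or len(messages) <= 7:
        return messages
    old, recent = messages[1:-6], messages[-6:]
    summary = summarize(old)
    return [messages[0],
            {"role": "system", "content": f"Earlier summary: {summary}"},
            *recent]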
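The trimming code relies on a `count_tokens` helper that the chapter does not show. The sketch below is one rough way to fill it in, assuming about 4 characters per token (a common rule of thumb for English text) plus a small per-message allowance. Both constants are assumptions, and the result is an estimate only; a real tokenizer for the target model should be used wherever counts feed billing or hard limits.

```python
import math
from typing import Dict, List

CHARS_PER_TOKEN = 4       # assumption: rough average for English text
PER_MESSAGE_OVERHEAD = 4  # assumption: allowance for role/formatting tokens


def count_tokens(messages: List[Dict[str, str]]) -> int:
    """Approximate token count for a chat history (an estimate, not exact)."""
    total = 0
    for m in messages:
        total += PER_MESSAGE_OVERHEAD
        total += math.ceil(len(m["content"]) / CHARS_PER_TOKEN)
    return total


history = [
    {"role": "system", "content": "You are helpful."},  # 16 characters
    {"role": "user", "content": "Hi"},                   # 2 characters
]
estimate = count_tokens(history)  # (4 + 4) + (4 + 1) = 13
```

Because the heuristic can undercount for code or non-English text, it is safer to leave extra head-room in the budget when using it.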
RAG

Retrieval-Augmented Generation

python · production
from sentence_transformers import SentenceTransformer
import chromadb

embedder = SentenceTransformer("all-MiniLM-L6-v2")
# the "docs" collection must already exist and be indexed with the same embedder
db = chromadb.PersistentClient(path="./vdb").get_collection("docs")

def augment(question: str, k: int = 4) -> list[dict]:
    """Build a chat prompt grounded in the k most similar documents."""
    qv = embedder.encode([question]).tolist()  # one query -> list with one vector
    hits = db.query(query_embeddings=qv, n_results=k)
    context = "\n---\n".join(hits["documents"][0])  # results for the first query
    return [
        {"role": "system", "content":
            "Answer ONLY from the context. If unknown, say so.\n\n" + context},
        {"role": "user", "content": question},
    ]
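Retrieval only works if the documents were split into pieces before indexing. The sketch below shows one simple option, fixed-size character windows with overlap. The function name `chunk_text` and the default sizes are assumptions for illustration; splitting on headings or sentences may suit markdown better.

```python
from typing import List


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece.strip():  # skip whitespace-only windows
            chunks.append(piece)
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", size=4, overlap=1)
```

Each chunk would then be embedded with the same model used at query time and stored in the collection alongside an id. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.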
Safety Layer

Production needs guardrails

  • Input moderation: OpenAI moderation or Llama Guard, to flag toxic content and PII.
  • Output filter: regex plus an LLM judge, to check for leakage and hallucination.
  • Rate limit: per-user TPM (tokens per minute) and a daily budget.
  • Prompt injection defense: isolate the system prompt and tag user content.
  • Audit log: session, prompt, and response, for debugging and compliance.
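The per-user TPM limit from the list above can be sketched as a sliding window over the last 60 seconds. This is an in-memory illustration only: the class name `TokenRateLimiter` is hypothetical, and with several server processes the counters would need to live in a shared store. The clock is injectable so the behavior can be checked without waiting.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class TokenRateLimiter:
    """Sliding-window limit on tokens per user per minute (in-memory sketch)."""

    def __init__(self, tpm_limit: int, clock: Callable[[], float] = time.monotonic):
        self.tpm_limit = tpm_limit
        self.clock = clock
        self._events: Dict[str, Deque[Tuple[float, int]]] = defaultdict(deque)

    def allow(self, user_id: str, tokens: int) -> bool:
        now = self.clock()
        window = self._events[user_id]
        # Forget usage older than 60 seconds.
        while window and now - window[0][0] >= 60:
            window.popleft()
        used = sum(n for _, n in window)
        if used + tokens > self.tpm_limit:
            return False  # rejected requests are not counted
        window.append((now, tokens))
        return True


fake_now = [0.0]  # controllable clock for the demo
limiter = TokenRateLimiter(tpm_limit=1000, clock=lambda: fake_now[0])
first = limiter.allow("u1", 800)   # within budget
second = limiter.allow("u1", 300)  # would exceed 1000 in this window
fake_now[0] = 61.0
third = limiter.allow("u1", 300)   # the earlier usage has expired
```

A daily budget can follow the same pattern with a longer window or a counter that resets at midnight.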
Cost Control

Keep the LLM bill under control

  • Semantic cache (GPTCache): return a cached answer for similar queries instead of calling the model again.
  • Cheaper model fallback: easy queries go to gpt-4o-mini, hard ones to gpt-4o.
  • Prompt compression: LLMLingua shrinks the context 2–3x.
  • Batch async (non-real-time): many providers give a 50% discount.
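To make the semantic cache idea concrete, here is a minimal sketch that stores an embedding with each answer and returns the cached answer when a new query is similar enough. It does not use GPTCache's API; the class `SemanticCache`, the 0.9 threshold, and the keyword-count `toy_embed` function are all assumptions for the demo. A real deployment would plug in a sentence-embedding model and a vector index.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the answer of the most similar cached query, if close enough."""
        qv = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = cosine(qv, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self.embed(query), answer))


# Toy embedding for the demo only: counts of a few keywords.
VOCAB = ["refund", "password", "reset", "order"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in VOCAB]


cache = SemanticCache(toy_embed, threshold=0.9)
cache.put("How do I reset my password?", "Use Settings > Security.")
hit = cache.get("reset password please")  # same keywords, so a cache hit
miss = cache.get("Where is my order?")    # unrelated, so no cached answer
```

The threshold is the main tuning knob: set it too low and users get answers to a different question, too high and the cache rarely hits.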
Pitfalls

Production reality

  • Proxy buffering left on: tokens reach the client in bursts. Set proxy_buffering off in NGINX.
  • In-memory session storage: a pod restart loses the chat. Use Redis.
  • No timeout: if the LLM hangs, the user waits 5 minutes.
  • Wrong token counts: bill shock.
  • No hallucination testing: 90% of cases are fine, 10% are confidently wrong.
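For the missing-timeout pitfall, one option is to wrap the upstream call so it gives up after a fixed deadline. The sketch below uses `asyncio.wait_for`; the helper name `with_timeout`, the fallback text, and the `slow_llm_call` stand-in are hypothetical. It applies to awaitable calls, so a blocking client would need its own timeout setting instead.

```python
import asyncio
from typing import Awaitable, TypeVar

T = TypeVar("T")


async def with_timeout(call: Awaitable[T], seconds: float, fallback: T) -> T:
    """Await `call`, but give up and return `fallback` after `seconds`."""
    try:
        return await asyncio.wait_for(call, timeout=seconds)
    except asyncio.TimeoutError:
        return fallback


async def slow_llm_call() -> str:
    # Stand-in for a hung upstream request.
    await asyncio.sleep(1.0)
    return "real answer"


result = asyncio.run(
    with_timeout(slow_llm_call(), seconds=0.05, fallback="Sorry, please retry.")
)
```

Returning a short apology quickly is usually better for the user than an open connection that never produces a token.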
Mini Project

Streaming RAG bot

  1. Index 10 markdown docs in Chroma.
  2. Build a FastAPI /chat/stream endpoint that streams over SSE.
  3. Keep session history in Redis, with a 10-turn limit.
  4. Test it with a simple HTML chat UI.
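Step 3 can be sketched as a Redis list per session that is trimmed after every write. The class below takes the client as a parameter and uses only `rpush`, `ltrim`, and `lrange`; the class name `SessionStore` and the `chat:` key prefix are assumptions. It is a synchronous sketch, so an async endpoint would need an async client with the same calls awaited.

```python
import json
from typing import Any, Dict, List

MAX_TURNS = 10  # one turn = one user message + one assistant reply


class SessionStore:
    """Chat history in a Redis list, capped at the last MAX_TURNS turns."""

    def __init__(self, client: Any):
        # `client` is expected to behave like redis.Redis(decode_responses=True)
        self.client = client

    def _key(self, session_id: str) -> str:
        return f"chat:{session_id}"  # hypothetical key naming

    def save_turn(self, session_id: str, user_msg: str, reply: str) -> None:
        key = self._key(session_id)
        self.client.rpush(
            key,
            json.dumps({"role": "user", "content": user_msg}),
            json.dumps({"role": "assistant", "content": reply}),
        )
        # Keep only the newest 2 * MAX_TURNS messages.
        self.client.ltrim(key, -2 * MAX_TURNS, -1)

    def load_history(self, session_id: str) -> List[Dict[str, str]]:
        raw = self.client.lrange(self._key(session_id), 0, -1)
        return [json.loads(item) for item in raw]
```

Trimming on write keeps the stored list bounded, so loading history stays cheap however long a session runs. Setting an expiry on the key is a natural next step for abandoned sessions.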
Takeaway

Remember

Chatbot = streaming + context + guardrails + cost. Build a wrapper around the model; don't build the model yourself.