Phase 9 · Chapter 9.04
NLP API Systems
Sentiment analysis, NER, summarization, embedding, translation: every NLP service shares the same backbone, tokenize → encode → decode → post-process.
Task Map
Common NLP API endpoints
- Classification: sentiment, topic, intent.
- Token tagging: NER, POS.
- Embedding: semantic search, clustering.
- Seq2seq: summarization, translation, paraphrase.
- Generation: LLMs (covered separately in Chapter 9-02).
HuggingFace + FastAPI
Embedding service
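python (production)
from fastapi import FastAPI
from pydantic import BaseModel, Field
from sentence_transformers import SentenceTransformer

app = FastAPI()
# Load once at import time so every request reuses the same GPU-resident model.
# E5 model cards recommend prefixing inputs with "query: " or "passage: ";
# callers are assumed to send text that already carries the prefix.
model = SentenceTransformer("intfloat/multilingual-e5-base", device="cuda")

class EmbedReq(BaseModel):
    # 1 to 128 texts per request (Pydantic v2 syntax; v1 uses min_items/max_items)
    texts: list[str] = Field(min_length=1, max_length=128)

@app.post("/embed")
def embed(req: EmbedReq):
    # Normalized vectors: cosine similarity reduces to a dot product downstream.
    vecs = model.encode(req.texts, batch_size=32, normalize_embeddings=True)
    return {"dim": vecs.shape[1], "vectors": vecs.tolist()}
Classification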
Sentiment service with batching
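python (production)
from pydantic import BaseModel, Field
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
    device=0,  # first GPU
    truncation=True,  # inputs longer than max_length are cut, not rejected
    max_length=256,
)

# Request schema; not shown in the original, assumed here to mirror EmbedReq.
class BatchReq(BaseModel):
    texts: list[str] = Field(min_length=1, max_length=128)

# `app` is the FastAPI instance created in the embedding service.
@app.post("/sentiment")
def sentiment(req: BatchReq):
    out = clf(req.texts, batch_size=32)
    return [{"label": o["label"], "score": round(o["score"], 4)} for o in out]
Server-Side Batching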
Triton-style micro-batching
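python (production)
# Collect requests for up to 5 ms (or until 32 are queued), then run one forward pass.
import asyncio

QUEUE: asyncio.Queue = asyncio.Queue()
MAX_BATCH = 32
MAX_WAIT_S = 0.005  # 5 ms batching window

# Single-text request schema; not shown in the original, assumed here.
class TextReq(BaseModel):
    text: str

async def batcher():
    loop = asyncio.get_running_loop()
    while True:
        items = [await QUEUE.get()]  # block until the first request arrives
        deadline = loop.time() + MAX_WAIT_S
        while len(items) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                items.append(await asyncio.wait_for(QUEUE.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        texts = [i["text"] for i in items]
        try:
            # The pipeline call blocks, so run it in a worker thread.
            results = await asyncio.to_thread(clf, texts, batch_size=len(texts))
        except Exception as exc:
            for it in items:
                if not it["fut"].done():
                    it["fut"].set_exception(exc)
            continue
        for it, r in zip(items, results):
            if not it["fut"].done():  # skip callers that already timed out
                it["fut"].set_result(r)

@app.on_event("startup")
async def start():
    # Keep a reference so the background task is not garbage-collected.
    app.state.batcher_task = asyncio.create_task(batcher())

# Replaces the synchronous /sentiment handler from the previous example.
@app.post("/sentiment")
async def sentiment(req: TextReq):
    fut = asyncio.get_running_loop().create_future()
    await QUEUE.put({"text": req.text, "fut": fut})
    return await asyncio.wait_for(fut, timeout=5)
Multilingual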
Language-specific challenges
- The tokenizer must be language-aware (XLM-R, mBERT, IndicBERT).
- Detect the language first, then route to the matching model.
- Right-to-left scripts (Arabic, Hebrew): handle both display and preprocessing.
- Mixed Bengali-Hindi-Tamil input: use a script detector plus transliteration.
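The detect-then-route step for mixed-script input can be sketched with the standard library alone. This is a minimal illustration, not a full language identifier: it relies on Unicode character names beginning with the script name, and the `MODEL_BY_SCRIPT` routing table and `route` helper are hypothetical names introduced here. Transliteration is left out.

```python
import unicodedata
from collections import Counter

# Unicode character names start with the script name, e.g. "BENGALI LETTER KA".
KNOWN_SCRIPTS = ("BENGALI", "DEVANAGARI", "TAMIL", "ARABIC", "HEBREW", "LATIN")

# Hypothetical routing table (script -> model family). The model names come
# from the list above; the mapping itself is an illustrative assumption.
MODEL_BY_SCRIPT = {
    "BENGALI": "IndicBERT",
    "DEVANAGARI": "IndicBERT",
    "TAMIL": "IndicBERT",
}
DEFAULT_MODEL = "XLM-R"


def script_counts(text: str) -> Counter:
    """Count letters per script, ignoring digits, punctuation, and spaces."""
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for script in KNOWN_SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
        else:
            counts["OTHER"] += 1
    return counts


def route(text: str) -> tuple[str, bool]:
    """Return (model name, is_mixed_script) for a piece of text."""
    counts = script_counts(text)
    if not counts:
        return DEFAULT_MODEL, False
    dominant, _ = counts.most_common(1)[0]
    return MODEL_BY_SCRIPT.get(dominant, DEFAULT_MODEL), len(counts) > 1
```

A caller could use the `is_mixed_script` flag to decide whether to transliterate before inference or to fall back to a multilingual model.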
Optimization
Enemies of NLP latency
- Slow tokenizer → use_fast=True (Rust-backed).
- Long text → chunk + aggregate; truncate only in some cases.
- Padding wastes compute → dynamic padding or bucketing.
- FP16 + ONNX → BERT-class models run 3–5x faster.
- Redis cache for sentence embeddings (hash(text) → vec).
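To make the padding point concrete, the sketch below counts how many token slots a model processes under static padding, dynamic padding, and dynamic padding after length bucketing. The token counts are made-up numbers, and `padded_tokens` and `bucket_by_length` are hypothetical helpers, not part of any library.

```python
from typing import Optional


def padded_tokens(lengths: list[int], batch_size: int,
                  fixed_len: Optional[int] = None) -> int:
    """Total token slots processed when batches are padded.

    With fixed_len set, every sequence is padded to that length (static).
    Otherwise each batch is padded to its own longest sequence (dynamic).
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        pad_to = fixed_len if fixed_len is not None else max(batch)
        total += pad_to * len(batch)
    return total


def bucket_by_length(lengths: list[int]) -> list[int]:
    """Return indices sorted by length so similar-length inputs share a batch."""
    return sorted(range(len(lengths)), key=lambda i: lengths[i])


# Hypothetical token counts for eight requests (illustrative numbers only).
lengths = [12, 240, 15, 200, 9, 256, 14, 220]

static = padded_tokens(lengths, batch_size=4, fixed_len=256)
dynamic = padded_tokens(lengths, batch_size=4)
order = bucket_by_length(lengths)  # keep this to restore the original order
bucketed = padded_tokens([lengths[i] for i in order], batch_size=4)

print(static, dynamic, bucketed)  # 2048 1984 1084
```

With these numbers, dynamic padding alone saves little because each batch still contains one long input; sorting by length first is what removes most of the wasted slots. Results have to be mapped back through `order` before they are returned to callers.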
Pitfalls
What breaks in production
- Tokenizer mismatch between training and serving: a silent accuracy drop.
- UTF-8 / encoding bugs: emoji, RTL characters.
- Max length truncates silently: long documents lose context.
- PII leak: raw text stored in logs.
- A 1.5 GB model: pod cold start takes 30 s and autoscaling breaks.
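One way to avoid the PII pitfall is to never log the request text itself and to log a short hash and a length instead. The sketch below assumes that approach; the regular expressions are deliberately simple illustrations rather than a complete PII detector, and `redact`, `log_fields`, and `log_request` are hypothetical helper names.

```python
import hashlib
import logging
import re

logger = logging.getLogger("nlp_api")

# Deliberately simple patterns for illustration; real PII detection needs more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact(text: str) -> str:
    """Mask e-mail addresses and phone-like digit runs."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def log_fields(text: str) -> dict:
    """Fields that can be logged in place of the raw request text."""
    return {
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest()[:16],
        "n_chars": len(text),
    }


def log_request(text: str, label: str) -> None:
    """Log a request without writing its text to the log."""
    logger.info("sentiment request %s label=%s", log_fields(text), label)
```

The truncated hash is enough to correlate repeated inputs across log lines. It is not a privacy guarantee for very short or guessable texts, which can be recovered by brute force.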
Mini Project
Multilingual sentiment API
- Wrap an XLM-R sentiment model in FastAPI.
- Add server-side micro-batching.
- Add a Redis cache: hash(text) → label.
- Measure p95 latency at 100 concurrent requests.
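For the p95 step, a small asyncio harness is enough to get started. The sketch below fires 100 coroutines at once and summarizes their latencies with `statistics.quantiles`. `fake_request` is a stand-in that only sleeps; in the real project it would be replaced by an HTTP call to the /sentiment endpoint.

```python
import asyncio
import random
import statistics
import time


async def fake_request(rng: random.Random) -> float:
    """Stand-in for one HTTP call; returns the observed latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(rng.uniform(0.001, 0.01))  # simulated service time
    return time.perf_counter() - start


async def measure(n: int = 100, seed: int = 0) -> dict:
    """Fire n requests concurrently and summarize their latencies."""
    rng = random.Random(seed)
    latencies = await asyncio.gather(*(fake_request(rng) for _ in range(n)))
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies, n=100)
    return {
        "count": len(latencies),
        "p50": cuts[49],
        "p95": cuts[94],
        "max": max(latencies),
    }


if __name__ == "__main__":
    print(asyncio.run(measure()))
```

Comparing p95 with and without micro-batching, and with a warm versus cold cache, shows which optimization matters most for a given workload.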
Phase 9 Complete
What you learned
Recsys, chatbot, CV, NLP: production blueprints for four domains. Next Phase: AI System Architecture, covering microservices, event-driven design, and low-latency design.