animesh01/README.md

Animesh Chowdhury

AI & Data Product Leader · Product, Evaluation & Quality (AI/ML) · Conversational & Agentic AI

AI/Data product leader with 10+ years building data and GenAI products end-to-end. Currently data product lead for Walmart's AI shopping assistant, owning the evaluation, experimentation, and quality systems that steer the roadmap.

LinkedIn · Streamlit · Tableau Public · RPubs · Email


👋 About

I take customer-facing AI products from ambiguous problem to launch, and own the evaluation, experimentation, and quality systems that decide what ships next. My edge is hands-on technical depth — LLM evaluation, RAG, observability, experimentation infrastructure — paired with the product judgment to weigh customer experience, safety, cost, and scale in one call.

Currently data product lead for Sparky, Walmart's AI shopping assistant (used by ~50% of Walmart app users; a publicly cited driver of ~35% larger orders), where I defined the platform's first standardized quality KPI and its greenfield evaluation standards from zero.


🧪 Featured projects

Five runnable apps spanning the AI-product lifecycle — build → evaluate → experiment → monitor → explain. Each is live, self-contained, and built on synthetic or real public data.

| Project | What it shows | Demo |
| --- | --- | --- |
| 🛰️ LLM Observability & Evals | Model-health monitoring across quality, safety, performance, cost & drift, with a SQL-backed pipeline, alerting, and PDF/PPTX export | Live ↗ |
| 💬 Chat Quality Score (CQS) | LLM-as-a-judge evaluation scoring conversations on a 4-dimension rubric, calibrated against human labels | Live ↗ |
| 🛒 Product Recommendation Quality | Tracks AI recommendation relevance week over week and surfaces the drivers behind any change | Live ↗ |
| 🧪 A/B Experimentation Framework | Hypothesis design, randomization, guardrail metrics, and ship / iterate / stop decisioning | Live ↗ |
| 🔎 LedgerIQ: Finance RAG Agent | Finance-ops RAG over two sources (real SEC EDGAR filings and FP&A planning documents): grounded, cited answers that refuse when out-of-corpus, with token-minimization controls and MCP retrieval servers | Live ↗ |

LedgerIQ runs on real public SEC EDGAR data (SEC source) plus synthetic FP&A documents (FP&A source); the other apps use fabricated or synthetic data — no proprietary, confidential, or employer-specific information.
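As a rough illustration of the "grounded, cited answers that refuse when out-of-corpus" behavior described for LedgerIQ, here is a minimal sketch. It is not LedgerIQ's implementation: the toy corpus, the `answer` function, the Jaccard word-overlap scoring, and the `min_overlap` threshold are all assumptions made for this example, and the passage text is invented placeholder content rather than real filing text.

```python
import re

# Hypothetical two-source corpus. The passages and section labels are invented
# placeholders for illustration, not real SEC EDGAR or FP&A content.
CORPUS = [
    {"source": "SEC", "section": "10-K Item 7",
     "text": "Revenue increased due to higher subscription sales."},
    {"source": "FP&A", "section": "FY plan, Opex",
     "text": "Planned operating expense growth is driven by hiring."},
]


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word set used for a crude similarity measure."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def answer(question: str, min_overlap: float = 0.2) -> dict:
    """Return the best-matching passage with a citation, or refuse.

    Similarity is Jaccard overlap of word sets, standing in for whatever
    retriever a real system would use. The threshold is an assumed value.
    """
    q = tokens(question)
    best, best_score = None, 0.0
    for passage in CORPUS:
        p = tokens(passage["text"])
        score = len(q & p) / len(q | p) if (q | p) else 0.0
        if score > best_score:
            best, best_score = passage, score
    if best is None or best_score < min_overlap:
        # Nothing in the corpus is close enough: refuse rather than guess.
        return {"refused": True, "answer": None, "citation": None}
    return {
        "refused": False,
        "answer": best["text"],
        "citation": f'{best["source"]}: {best["section"]}',
    }
```

Calling `answer("What drove higher subscription sales revenue?")` returns the SEC placeholder passage with its citation, while an unrelated question such as `answer("What is the weather in Paris today?")` falls below the threshold and is refused.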

Built with Streamlit · RAG & MCP · SQLite · LLM-as-a-judge · Python
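The Chat Quality Score project above rolls a 4-dimension rubric into a single score and calibrates the automated judge against human labels. A minimal sketch of that idea follows. The dimension names, the 1-5 rating scale, the equal weights, and the 10-point agreement tolerance are illustrative assumptions and are not taken from the project. Judge ratings are passed in as plain numbers, so no LLM call is made here.

```python
from statistics import mean

# Hypothetical rubric: the README only says "4-dimension rubric". These names,
# the 1-5 scale, and the equal weights are assumptions for illustration.
DIMENSIONS = ("relevance", "accuracy", "helpfulness", "safety")
WEIGHTS = {d: 0.25 for d in DIMENSIONS}


def quality_score(ratings: dict[str, int]) -> float:
    """Map per-dimension ratings on a 1-5 scale to a single 0-100 score."""
    weighted = sum(WEIGHTS[d] * ratings[d] for d in DIMENSIONS)  # lies in [1, 5]
    return round((weighted - 1) / 4 * 100, 1)


def agreement_rate(judge: list[float], human: list[float],
                   tolerance: float = 10.0) -> float:
    """Share of conversations where judge and human scores differ by <= tolerance."""
    pairs = list(zip(judge, human))
    return sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)


def mean_bias(judge: list[float], human: list[float]) -> float:
    """Average signed gap (judge minus human); positive means the judge is lenient."""
    return mean(j - h for j, h in zip(judge, human))
```

For example, ratings of 4, 5, 3, and 4 across the four dimensions give `quality_score(...) == 75.0`. Comparing judge scores `[80, 60, 90]` with human scores `[75, 85, 88]` yields an agreement rate of 2/3 and a mean bias of -6, which would suggest the judge is harsher than the human labelers on this made-up sample.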


🛠️ What I work with

Product: product strategy & roadmap · feature prioritization · MVP scoping · PRDs & requirements · experimentation & A/B testing · KPI ownership · stakeholder management

GenAI & AI/ML: LLM evaluation (LLM-as-a-judge) · RAG & grounding · agentic AI & tool use (MCP) · prompt evaluation · conversational & agentic AI · retrieval / recommendation relevance · human-in-the-loop governance · model observability · AI safety evaluation · token & cost–quality optimization

Data & Platform: SQL · Python · R · BigQuery · Snowflake · PostgreSQL · Kafka · telemetry & experimentation infrastructure

BI & Tools: Tableau · Power BI · Streamlit · Jira · Miro


🏆 Selected recognition

  • Bravo Award (×2) — for GenAI initiatives delivering ~$1M in annual savings, and for analytics spanning 30+ conversational-AI domains
  • Innovation Challenge Winner — top RPA solution selected from 218 ideas across 340 professionals, funded and rolled out across the US, Europe, and India

Open to AI/GenAI Product Management roles. Let's talk → chowdhuryanimesh1@gmail.com

Pinned

  1. llm-observability-dashboard (Public)

    SQL-backed LLM observability & evals dashboard for a conversational AI assistant — model-health monitoring across quality, safety, performance, cost, and drift, with executive summary, alerting, an…

    Python

  2. cqs-evaluation (Public)

    LLM-as-a-judge evaluation demo for conversational AI: scores chats on a 4-dimension rubric into a single 0–100 quality score and calibrates the automated judge against human labels. Synthetic demo …

    Python

  3. product-recommendation-quality (Public)

    A measurement framework and live report for the quality of an AI shopping assistant's product recommendations — graded relevance, weekly tracking, root-cause analysis, and judge calibration. Built …

    HTML

  4. ab-experimentation-framework (Public)

    Interactive Streamlit dashboard for A/B experiment design, statistical analysis, guardrails, and launch decisions. Worked example: a delivery-estimate chatbot. Fabricated demo data.

    Python

  5. sec-filings-rag-agent (Public)

    Finance-ops RAG agent over two sources — real SEC EDGAR filings and FP&A planning documents — with answers grounded in retrieved passages, citations to the exact section, and a refusal when out-of-…

    Python
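The ab-experimentation-framework entry in the list above mentions statistical analysis, guardrails, and ship / iterate / stop launch decisions. One way such a decision rule could look is sketched below, using only the Python standard library. The pooled two-proportion z-test is a common choice but is not confirmed as the method the repository uses, and the function names, the `alpha` level, and the guardrail tolerance are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (absolute lift of B over A, two-sided p-value) from a pooled z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, p_value


def decide(lift: float, p_value: float, guardrail_delta: float,
           alpha: float = 0.05, max_guardrail_drop: float = -0.01) -> str:
    """Hypothetical ship / iterate / stop rule.

    guardrail_delta is the change in a guardrail metric (treatment minus
    control); the thresholds are assumed values, not project settings.
    """
    if guardrail_delta < max_guardrail_drop:
        return "stop"      # guardrail regressed beyond the allowed tolerance
    if p_value < alpha and lift < 0:
        return "stop"      # primary metric is significantly worse
    if p_value < alpha and lift > 0:
        return "ship"      # significant win with guardrails intact
    return "iterate"       # inconclusive: refine the variant or collect more data
```

With made-up counts of 100 conversions out of 1,000 in control and 130 out of 1,000 in treatment, `two_proportion_z(100, 1000, 130, 1000)` gives a lift of about 0.03 with a p-value between 0.03 and 0.04, so `decide` returns "ship" when the guardrail metric is flat.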