AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solution. It provides detailed metrics using a command-line display as well as extensive benchmark performance reports.

    pip install aiperf

    aiperf profile \
      --model Qwen/Qwen3-0.6B \
      --url http://localhost:8000 \
      --endpoint-type chat \
      --concurrency 10 \
      --request-count 100 \
      --streaming

- Scalable multiprocess architecture with 9 services communicating via ZMQ
- 3 UI modes: `dashboard` (real-time TUI), `simple` (progress bars), `none` (headless)
- Multiple benchmarking modes: concurrency, request-rate, request-rate with max concurrency, trace replay
- Extensible plugin system for endpoints, datasets, transports, and metrics
- Public dataset support including ShareGPT and custom formats
- OpenAI chat completions, completions, embeddings, audio, images
- NIM embeddings, rankings
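Concurrency mode is often run as a sweep over several concurrency levels. The following is a minimal Python sketch of how such a sweep could be assembled. It uses only the flags shown in the `aiperf profile` example above; the helper names `build_profile_command` and `build_sweep` are hypothetical and are not part of AIPerf.

```python
import shlex


def build_profile_command(model, url, concurrency, request_count=100, streaming=True):
    """Return the argv list for one `aiperf profile` run (hypothetical helper)."""
    argv = [
        "aiperf", "profile",
        "--model", model,
        "--url", url,
        "--endpoint-type", "chat",
        "--concurrency", str(concurrency),
        "--request-count", str(request_count),
    ]
    if streaming:
        argv.append("--streaming")
    return argv


def build_sweep(model, url, concurrency_levels):
    """Return one argv list per concurrency level, in the order given."""
    return [build_profile_command(model, url, level) for level in concurrency_levels]


if __name__ == "__main__":
    for argv in build_sweep("Qwen/Qwen3-0.6B", "http://localhost:8000", [1, 10, 50]):
        # Print the command for review. To execute it instead, one option is
        # subprocess.run(argv, check=True) with AIPerf installed.
        print(shlex.join(argv))
```

Building each command as an argument list instead of a single string avoids shell quoting problems if a model name or URL contains special characters.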
- Basic Tutorial - Profile Qwen3-0.6B with vLLM
- Comprehensive Benchmarking Guide - 5 real-world use cases
- User Interface - Dashboard, simple, or headless
- Hugging Face TGI - Profile Hugging Face TGI models
- OpenAI Text Endpoints - Profile OpenAI-compatible text APIs
- Request Rate with Max Concurrency - Dual request control
- Arrival Patterns - Constant, Poisson, gamma traffic
- Prefill Concurrency - Memory-safe long-context benchmarking
- Gradual Ramping - Smooth ramp-up of concurrency and request rate
- Warmup Phase - Eliminate cold-start effects
- User-Centric Timing - Per-user rate limiting for KV cache benchmarking
- Request Cancellation - Timeout and resilience testing
- Multi-URL Load Balancing - Distribute across servers
- Trace Benchmarking - Deterministic workload replay
- Custom Prompt Benchmarking - Send exact prompts as-is
- Custom Dataset - Custom dataset formats
- ShareGPT Dataset - Profile with ShareGPT dataset
- Synthetic Dataset Generation - Generate synthetic datasets
- Fixed Schedule - Precise timestamp-based execution
- Time-based Benchmarking - Duration-based testing
- Sequence Distributions - Mixed ISL/OSL pairings
- Prefix Synthesis - Prefix data synthesis for KV cache testing
- Reproducibility - Deterministic datasets with `--random-seed`
- Template Endpoint - Custom Jinja2 request templates
- Multi-Turn Conversations - Multi-turn conversation benchmarking
- Local Tokenizer - Use local tokenizers without Hugging Face
- Embeddings - Profile embedding models
- Rankings - Profile ranking models
- Audio - Profile audio language models
- Vision - Profile vision language models
- SGLang Image Generation - Image generation benchmarking
- SGLang Video Generation - Video generation benchmarking
- Synthetic Video - Synthetic video generation
- Timeslice Metrics - Per-timeslice performance analysis
- Goodput - SLO-based throughput measurement
- HTTP Trace Metrics - DNS, TCP/TLS, TTFB timing
- Multi-Run Confidence - Confidence intervals across repeated runs
- Profile Exports - Post-processing with Pydantic models
- Visualization and Plotting - PNG charts and multi-run comparison
- GPU Telemetry - DCGM metrics collection
- Server Metrics - Prometheus-compatible metrics
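To make the constant, Poisson, and gamma arrival patterns listed above more concrete, here is a minimal Python sketch of how inter-arrival gaps for each pattern could be drawn at a target request rate. This illustrates the underlying distributions only and is not AIPerf's implementation. The function name, the `gamma_shape` parameter, and its default are assumptions made for this example. A fixed seed makes the schedule repeatable, in the same spirit as `--random-seed`.

```python
import random


def inter_arrival_times(pattern, rate, count, seed=0, gamma_shape=2.0):
    """Return `count` gaps (in seconds) between requests at `rate` requests/second.

    Illustrative only: `pattern` is "constant", "poisson", or "gamma".
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    rng = random.Random(seed)  # seeded so the same inputs give the same schedule
    mean_gap = 1.0 / rate
    if pattern == "constant":
        # Evenly spaced requests.
        return [mean_gap] * count
    if pattern == "poisson":
        # A Poisson process has exponentially distributed gaps with mean 1/rate.
        return [rng.expovariate(rate) for _ in range(count)]
    if pattern == "gamma":
        # Gamma gaps with mean shape * scale = 1/rate. With this parameterization,
        # shape < 1 gives burstier traffic than Poisson and shape > 1 gives smoother traffic.
        scale = mean_gap / gamma_shape
        return [rng.gammavariate(gamma_shape, scale) for _ in range(count)]
    raise ValueError(f"unknown pattern: {pattern!r}")


if __name__ == "__main__":
    for name in ("constant", "poisson", "gamma"):
        gaps = inter_arrival_times(name, rate=10.0, count=5, seed=42)
        print(name, [round(g, 3) for g in gaps])
```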
| Document | Purpose |
|---|---|
| Architecture | Three-plane architecture, core components, credit system, data flow |
| CLI Options | Complete command and option reference |
| Metrics Reference | All metric definitions, formulas, and requirements |
| Environment Variables | All AIPERF_* configuration variables |
| Plugin System | Plugin architecture, 25+ categories, creation guide |
| Creating Plugins | Step-by-step plugin tutorial |
| Accuracy Benchmarks | Accuracy evaluation stubs and datasets |
| Benchmark Modes | Trace replay and timing modes |
| Server Metrics | Prometheus-compatible server metrics collection |
| Tokenizer Auto-Detection | Pre-flight tokenizer detection |
| Dataset Synthesis API | Synthesis module API reference |
| Code Patterns | Code examples for services, models, messages, plugins |
| Migrating from GenAI-Perf | Migration guide and feature comparison |
| Design Proposals | Enhancement proposals and discussions |
See CONTRIBUTING.md for development setup, coding conventions, and contribution guidelines.
- Output sequence length constraints (`--output-tokens-mean`) cannot be guaranteed unless you pass `ignore_eos` and/or `min_tokens` via `--extra-inputs` to an inference server that supports them.
- Very high concurrency settings (typically >15,000) may lead to port exhaustion on some systems. Adjust system limits or reduce concurrency if connection failures occur.
- Startup errors caused by invalid configuration settings can cause AIPerf to hang indefinitely. Terminate the process and check configuration settings.
- Copying selected text may not work reliably in the dashboard UI. Use the `c` key to copy all logs.
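For the port exhaustion issue noted above, a quick pre-flight check of system limits can help before a very high concurrency run. The following Python sketch is not part of AIPerf. It assumes a Linux host (it reads `/proc/sys/net/ipv4/ip_local_port_range`), uses the Unix-only `resource` module, and assumes that each concurrent request holds one connection, which consumes one ephemeral port and one file descriptor. The function names are hypothetical.

```python
import resource
from pathlib import Path

# Linux-specific location of the ephemeral port range (assumption: Linux host).
PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def ephemeral_port_count(path=PORT_RANGE_FILE):
    """Return the number of ephemeral ports, or None if the range cannot be read."""
    try:
        low, high = map(int, path.read_text().split())
    except (OSError, ValueError):
        return None
    return high - low + 1


def open_file_soft_limit():
    """Return the soft limit on open file descriptors for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def concurrency_warnings(concurrency, port_count, fd_limit):
    """Return human-readable warnings if `concurrency` exceeds the given limits."""
    warnings = []
    if port_count is not None and concurrency > port_count:
        warnings.append(
            f"concurrency {concurrency} exceeds {port_count} ephemeral ports"
        )
    unlimited = fd_limit is None or fd_limit == resource.RLIM_INFINITY
    if not unlimited and concurrency > fd_limit:
        warnings.append(
            f"concurrency {concurrency} exceeds open-file soft limit {fd_limit}"
        )
    return warnings


if __name__ == "__main__":
    for message in concurrency_warnings(
        20000, ephemeral_port_count(), open_file_soft_limit()
    ):
        print("warning:", message)
```

If either warning appears, raising the limit (for example with `ulimit -n` or the `net.ipv4.ip_local_port_range` sysctl on Linux) or lowering `--concurrency` are the two options the note above describes.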