vLLM vs OpenAI API
Self-host swap-in for OpenAI API.
vLLM is one of the open-source self-host replacements for the OpenAI API. It is licensed Apache-2.0, stands up in about 30 minutes with a single `docker run --gpus` command, and costs roughly $200-1500/mo depending on GPU class; an A100 80GB runs Llama 3.1 70B comfortably. Compare that against the OpenAI API's GPT-4o at ~$2.50 input / $10 output per 1M tokens (embeddings ~$0.02 per 1M tokens) in the table below.
| | vLLM (open-source) | OpenAI API (paid SaaS) |
|---|---|---|
| Category | LLM inference API | LLM inference API |
| License / pricing | Apache-2.0 | GPT-4o ~$2.50 input / $10 output per 1M tokens; embeddings ~$0.02 per 1M tokens |
| Starting price | $0 self-host | $20/user/mo |
| GitHub | vllm-project/vllm | closed source |
| Setup time | 30min docker run with --gpus | SaaS — sign up + bill |
| Monthly cost | $200-1500/mo depending on GPU class; an A100 80GB runs Llama 3.1 70B comfortably with PagedAttention batching. | from $20/user/mo (GPT-4o ~$2.50 input / $10 output per 1M tokens; embeddings ~$0.02 per 1M tokens) |
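For a rough sense of when self-hosting pays off, here is a back-of-envelope comparison at the rates listed above; the monthly token volumes and GPU rental price are illustrative assumptions, not measurements.

```python
# Break-even sketch at the listed rates. Token volumes and GPU rental
# price are assumptions for illustration, not benchmarks.
input_mtok = 100   # assumed: 100M input tokens per month
output_mtok = 20   # assumed: 20M output tokens per month

gpt4o_cost = input_mtok * 2.50 + output_mtok * 10.00  # $250 + $200 = $450/mo
vllm_cost = 1200   # assumed mid-range A100-class rental, within the $200-1500 band

print(f"GPT-4o API: ~${gpt4o_cost:.0f}/mo vs self-hosted vLLM: ~${vllm_cost}/mo")
```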
Switching from OpenAI API to vLLM
Run `docker run --gpus all -p 8000:8000 vllm/vllm-openai --model meta-llama/Llama-3.1-70B-Instruct`. The container exposes `/v1/chat/completions` and `/v1/embeddings` matching the OpenAI schema; point your existing `openai` client's `base_url` at `http://your-host:8000/v1`. Use vLLM's `--api-key` flag to require a bearer token before exposing the endpoint to the internet.
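As a concrete sketch of the client-side swap, the snippet below points the official Python `openai` client at a self-hosted vLLM endpoint. The host name and API key are placeholders for your deployment, not values from this page.

```python
# Minimal client-side swap: same openai library, different base_url.
# "your-host" and "local-key" are placeholders; "local-key" must match
# the value passed to vLLM's --api-key flag.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-host:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="local-key",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # the model vLLM was launched with
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```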
- Good fit for
  - Production inference at scale: vLLM's continuous batching is what you want when 10+ concurrent users hit the endpoint.
- Weak at
  - Single-GPU model fit: large models (70B+) need multi-GPU tensor parallelism and careful VRAM budgeting (see the sketch after this list).
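For the multi-GPU case, here is a minimal sketch using vLLM's offline Python API: `tensor_parallel_size` shards the model weights across GPUs. The GPU count of 4 is illustrative; size it to your hardware.

```python
# Sharding a 70B model across 4 GPUs with tensor parallelism.
# tensor_parallel_size=4 is illustrative; match it to your GPU count/VRAM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,  # split weight matrices across 4 GPUs
)
outputs = llm.generate(
    ["Explain continuous batching in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```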
Other open-source self-host alternatives to OpenAI API
In a terminal? `npx os-alt openai-api` prints OpenAI API's self-host options. How the CLI works →
FAQ
Is vLLM a free alternative to OpenAI API?
Yes: vLLM is open source under Apache-2.0. Self-host cost is $200-1500/mo depending on GPU class; an A100 80GB runs Llama 3.1 70B comfortably with PagedAttention batching. OpenAI API starts at $20/user/mo (GPT-4o ~$2.50 input / $10 output per 1M tokens; embeddings ~$0.02 per 1M tokens).
How long does vLLM take to set up vs OpenAI API?
Self-hosting vLLM takes about 30 minutes: a single `docker run` with `--gpus`. OpenAI API is a hosted SaaS; sign up and you're in.
What is vLLM good at, and what is it weak at?
Good fit for: production inference at scale, where vLLM's continuous batching is what you want when 10+ concurrent users hit the endpoint. Weak at: single-GPU model fit, since large models (70B+) need multi-GPU tensor parallelism and careful VRAM budgeting.