Two years ago, the question of open versus closed AI models was simple: closed models from OpenAI and Google were clearly superior, and open alternatives were interesting but impractical for serious work. That calculus has changed. Meta’s Llama 3.1 405B matches GPT-4 on many benchmarks. DeepSeek V3 was trained for a fraction of the cost. Mistral and Qwen have carved out real niches. The gap between open and closed has collapsed from years to months.
But the conversation has also become muddied. “Open source” in AI often means something quite different from open source in traditional software. Models are released under restrictive licenses. Training data is withheld. Reproduction is impossible. The term has become a marketing label as much as a technical description.
Here is a clear-eyed look at where things actually stand.
Meta Llama 3.1 (July 2024): Available in 8B, 70B, and 405B parameter sizes. The 405B model is a dense transformer trained on 15.6 trillion tokens. Released under the Llama 3.1 Community License, which permits commercial use but requires any company with more than 700 million monthly active users to obtain a separate license from Meta (the “Meta clause”). Training data composition is disclosed at a high level, but the dataset itself is not released, so the training run cannot be reproduced.
Mistral / Mixtral (2023-2025): The French AI lab has released a series of models, from Mistral 7B (which punched well above its weight class) to Mixtral 8x22B (a MoE architecture). Licensed under Apache 2.0 — one of the most permissive licenses in the open AI ecosystem. Mistral also operates closed commercial models (Mistral Large), making it a hybrid player.
DeepSeek V3 (December 2024): 671B total parameters (MoE, 37B active). DeepSeek reported approximately $5.6 million in GPU compute for the final training run, a figure that excludes prior research and ablation experiments but still dramatically undercuts Western training budgets. Released under a permissive license. Performance competitive with GPT-4o and Claude 3.5 Sonnet on many benchmarks. DeepSeek also released DeepSeek-R1, a reasoning-focused model. The low training cost challenged industry assumptions and raised questions about whether frontier AI requires frontier budgets.
Qwen 2.5 (September 2024): Alibaba’s model family, available from 0.5B to 72B parameters. Particularly strong on multilingual benchmarks and code generation. Released under Apache 2.0. The 72B model competes with Llama 3.1 70B and Mistral Large on most evaluations.
OpenAI GPT-4o / GPT-4o mini (2024): The market leader. GPT-4o offers strong multimodal capabilities (text, image, audio). Architecture undisclosed. Available only via API and ChatGPT. Pricing: $2.50/$10.00 per million input/output tokens.
Anthropic Claude 3.5 Sonnet / Claude Opus 4 (2024-2025): Known for strong performance on long-context tasks, coding, and instruction-following. 200K token context window. Architecture undisclosed. Anthropic emphasizes safety and interpretability research. Pricing: $3.00/$15.00 per million input/output tokens (Sonnet).
Google Gemini 1.5 Pro / 2.0 Flash (2024-2025): Deeply integrated into Google’s ecosystem (Search, Workspace, Android). Gemini 1.5 Pro offers a 1M token context window, among the largest commercially available at the time. Pricing competitive with GPT-4o.
Benchmarks are imperfect. They can be gamed, contaminated by training data, and may not reflect real-world performance. But they are the best standardized measures we have. Here is where the major models stood on MMLU (Massive Multitask Language Understanding), a widely cited general knowledge benchmark, as of late 2024:
The story is clear: on MMLU, the gap between the best open and best closed models is roughly 1-2 percentage points. On other benchmarks the picture is more varied — closed models tend to lead on the hardest reasoning benchmarks (GPQA, MATH-500) by a larger margin — but the overall trend is unmistakable convergence.
For many practical applications — summarization, translation, content generation, basic code tasks, customer support — the performance difference between Llama 3.1 405B and GPT-4o is negligible.
| Dimension | Open Weights | Closed API |
|---|---|---|
| Performance | Within 1-5% of frontier on most tasks. Gap widens on hardest reasoning benchmarks. | Best-in-class on hardest tasks. First to ship new capabilities (multimodal, reasoning). |
| Cost at Scale | Major advantage. Self-hosting on 8xH100 ($15K/mo lease) handles ~500K tokens/min. At 1B tokens/day, 5-10x cheaper than API. | Simple to start. $2.50-$15/M tokens. Costs scale linearly with usage, no upfront investment. |
| Cost at Low Volume | Uneconomical to self-host for <100K requests/day. Use hosted open-weight APIs (Together, Fireworks) instead. | Clear winner for small teams. Pay only for what you use. No infrastructure to manage. |
| Data Privacy | Major advantage. Data never leaves your infrastructure. Full audit trail. Essential for regulated industries. | Data transits to provider servers. BAAs available from major providers but not all use cases are covered. |
| Customization | Full fine-tuning, LoRA, quantization, distillation. Complete control over model behavior. | Limited fine-tuning offered by some providers. No access to weights, architecture, or training pipeline. |
| Vendor Risk | Model weights are yours forever. No deprecation, price changes, or ToS shifts. | Provider can deprecate models (OpenAI deprecated GPT-4-32K), change pricing, or alter content policies. |
| Safety / Guardrails | You are responsible for all safety filtering, content moderation, and abuse prevention. | Provider handles baseline safety. Guardrails are built in but may be overly restrictive for some use cases. |
| Time to Production | Weeks to months for self-hosted deployment. Requires ML infrastructure expertise. | Hours to days. API key and a few lines of code. |
The economics deserve a deeper look, because cost is often the deciding factor.
Scenario: 10 million tokens per day (a mid-size application)
Closed API (GPT-4o): 10M input tokens/day at $2.50/M = $25/day. Assume output volume is half of input (5M tokens/day) at $10/M, which adds $50/day. Total: ~$75/day, or $27,375/year.
Self-hosted Llama 3.1 70B on 2xH100 (80GB): Server lease approximately $5,000/month through a cloud provider. At this volume, the GPUs are underutilized — you could handle 10-50x more traffic. Total: $60,000/year. At this volume, the API is cheaper.
Self-hosted at 100M tokens per day: API cost scales to $273,750/year. The self-hosted infrastructure might need 4-8 H100s ($10,000-$20,000/month), totaling $120,000-$240,000/year. The crossover happens around 50-100M tokens per day for most configurations.
Self-hosted at 1B tokens per day: API cost would be $2.7M/year. Self-hosted cost with a proper cluster: $400,000-$600,000/year. The savings are dramatic.
The exact crossover depends on your model size, quantization level, hardware choice, and whether you have ML engineers on staff. But the pattern holds: APIs win at low volume, self-hosting wins at high volume, and the crossover is lower than most people think.
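The arithmetic in these scenarios can be reproduced with a short script. This is a minimal sketch, not a pricing tool: the prices and lease figures are the ones quoted above, while the function names and the assumption that output volume is half of input volume are illustrative. It models lease cost only and ignores staffing, power, and operations.

```python
# Prices and lease figures are those quoted in the article.
# Function names and the 0.5 output ratio are illustrative assumptions.
GPT4O_INPUT_PER_M = 2.50    # USD per million input tokens
GPT4O_OUTPUT_PER_M = 10.00  # USD per million output tokens


def api_cost_per_year(input_tokens_per_day: float, output_ratio: float = 0.5) -> float:
    """Annual API spend when output volume is `output_ratio` times input volume."""
    input_m = input_tokens_per_day / 1_000_000
    output_m = input_m * output_ratio
    daily = input_m * GPT4O_INPUT_PER_M + output_m * GPT4O_OUTPUT_PER_M
    return daily * 365


def self_hosted_cost_per_year(monthly_lease_usd: float) -> float:
    """Annual lease cost only; staffing, power, and ops are not modeled."""
    return monthly_lease_usd * 12


def breakeven_input_tokens_per_day(monthly_lease_usd: float, output_ratio: float = 0.5) -> float:
    """Daily input volume at which annual API spend equals the annual lease."""
    api_cost_per_daily_token = api_cost_per_year(1.0, output_ratio)
    return self_hosted_cost_per_year(monthly_lease_usd) / api_cost_per_daily_token


if __name__ == "__main__":
    for volume in (10e6, 100e6, 1e9):
        print(f"{volume:>15,.0f} tokens/day: ${api_cost_per_year(volume):,.0f}/year via API")
    breakeven = breakeven_input_tokens_per_day(5_000)
    print(f"2xH100 at $5,000/month breaks even near {breakeven:,.0f} tokens/day")
```

On lease cost alone, the 2xH100 configuration breaks even at roughly 22 million tokens per day, provided the hardware can actually serve that volume. Because the sketch leaves out engineering time and operations, treat that number as a lower bound on the real crossover rather than a precise estimate.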
Not all “open” models are equally open. The AI community has increasingly called out “open washing” — using the language of open source to describe releases that are materially different from what the term means in software.
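The Open Source Initiative (OSI) published its Open Source AI Definition in October 2024. To qualify as open source under that definition, an AI system must make available its model parameters (weights), the complete source code used to train and run it, and sufficiently detailed information about its training data for a skilled person to build a substantially equivalent system.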
By this standard, most “open” models fail. Meta’s Llama releases weights and some training details but not the training data, and the license restricts certain commercial uses. Mistral’s Apache-licensed models come closest to the OSI definition but still do not include training data. DeepSeek releases weights under permissive licenses but the training data and much of the training infrastructure remain proprietary.
This matters because the benefits of open source — reproducibility, auditability, community improvement — depend on what exactly is being opened. Releasing weights without training data means you can use the model but cannot fully understand its biases, verify its training, or reproduce it. You are trusting the releasing organization in many of the same ways you trust a closed API provider.
The practical spectrum looks like this:
| Tier | What’s Released | Example |
|---|---|---|
| Fully Open | Weights + code + data + license | OLMo (AI2), some academic models |
| Open Weights (Permissive) | Weights + inference code, Apache/MIT license | Mistral 7B, Qwen 2.5 |
| Open Weights (Restricted) | Weights + inference code, custom license with commercial limits | Llama 3.1 |
| Closed with API | Nothing released, accessible via API | GPT-4o, Claude, Gemini |
When someone says “open source AI model,” ask which tier. The answer changes the risk calculus significantly.
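The tier table can be expressed as a small decision function, which makes the ordering of the checks explicit. This is a sketch under assumptions: the `Release` fields and the `openness_tier` name are hypothetical, and real licenses need legal review rather than a boolean flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """What a model publisher actually ships (hypothetical fields)."""
    weights: bool
    training_code: bool
    training_data: bool
    permissive_license: bool  # e.g. Apache 2.0 or MIT, no commercial limits


def openness_tier(release: Release) -> str:
    """Map a release onto the four tiers from the table above."""
    if not release.weights:
        return "Closed with API"
    if release.training_code and release.training_data and release.permissive_license:
        return "Fully Open"
    if release.permissive_license:
        return "Open Weights (Permissive)"
    return "Open Weights (Restricted)"


if __name__ == "__main__":
    # Weights under a custom license with commercial limits, no data released.
    restricted = Release(weights=True, training_code=False,
                         training_data=False, permissive_license=False)
    print(openness_tier(restricted))
```

The weights check comes first because nothing else matters without them; a permissive license on an API-only model still leaves you in the closed tier.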
The most sophisticated AI teams in 2026 are not choosing one side. They are running a portfolio:
Frontier closed models for the hardest tasks — complex reasoning, nuanced content generation, tasks where the last 2-3% of quality matters. These are typically low-volume, high-value calls.
Open-weight models (self-hosted) for high-volume production workloads where cost dominates — classification, extraction, summarization, embedding generation, RAG retrieval. At scale, the 5-10x cost savings justify the infrastructure investment.
Fine-tuned small open models for domain-specific tasks. A 7B parameter model fine-tuned on your data can outperform GPT-4 on your specific task while running on a single GPU. This is especially common in healthcare, legal, and financial services where domain-specific accuracy matters more than general capability.
Local models for privacy-critical workflows. Running a quantized Llama or Mistral model on-premise ensures data never leaves your network. This is a regulatory requirement in some industries, not a preference.
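One way to picture the portfolio is as a routing rule that sends each workload to one of the four options above. The sketch below is illustrative only: the `Workload` flags, the `Backend` names, and above all the precedence order (privacy first, then quality, then domain fit, then volume) are assumptions, not a description of how any particular team does it.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    FRONTIER_API = "frontier closed model (API)"
    SELF_HOSTED_OPEN = "self-hosted open-weight model"
    FINE_TUNED_SMALL = "fine-tuned small open model"
    LOCAL_ON_PREM = "local quantized model"


@dataclass(frozen=True)
class Workload:
    """Hypothetical flags describing a workload."""
    privacy_critical: bool = False
    needs_frontier_quality: bool = False
    domain_specific: bool = False
    high_volume: bool = False


def choose_backend(w: Workload) -> Backend:
    """Pick a backend; the precedence order here is an assumption."""
    if w.privacy_critical:
        # Regulated data must not leave the network, so this overrides everything.
        return Backend.LOCAL_ON_PREM
    if w.needs_frontier_quality:
        return Backend.FRONTIER_API
    if w.domain_specific:
        return Backend.FINE_TUNED_SMALL
    if w.high_volume:
        return Backend.SELF_HOSTED_OPEN
    # Low-volume general work: pay-per-use APIs are the cheapest way to start.
    return Backend.FRONTIER_API


if __name__ == "__main__":
    print(choose_backend(Workload(high_volume=True)).value)
    print(choose_backend(Workload(privacy_critical=True, high_volume=True)).value)
```

Putting the privacy check first reflects the point that on-premise deployment is sometimes a regulatory requirement; the remaining order is a judgment call that each team would tune to its own costs and quality bar.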
Training cost deflation. DeepSeek V3 demonstrated that frontier-adjacent performance does not require $100M budgets. If this trend continues, the barrier to producing competitive open models drops dramatically, potentially accelerating the convergence between open and closed.
Regulation. The EU AI Act, which entered into force in August 2024 with obligations phasing in through 2026 and 2027, treats open and closed models differently. Open models may receive exemptions from certain obligations, which could either incentivize genuine openness or create regulatory arbitrage where “open washing” becomes a compliance strategy.
Closed model differentiation. As open models close the capability gap, closed providers will compete on dimensions other than raw model quality: infrastructure, reliability, safety guarantees, tool ecosystems, and enterprise support. This is already happening — OpenAI’s competitive advantage is increasingly its product suite (ChatGPT, API platform, Codex) rather than model quality alone.
Sovereign AI. Governments are funding domestic AI model development (France with Mistral, UAE with Falcon/TII, China with multiple labs). Most of these efforts produce open-weight models. National AI sovereignty may become a significant driver of the open ecosystem.
The open-vs-closed debate is no longer about capability — it is about economics, control, and risk tolerance. Open-weight models are good enough for most production workloads. Closed models retain an edge on the hardest tasks and offer the easiest path to production.
The right question is not “which is better” but “what is the right mix for my specific use case, volume, regulatory environment, and engineering capacity?” Any team that dogmatically commits to one approach is either leaving money on the table or accepting unnecessary risk.
The models are converging. The strategies should too.