The Only LLM Cheat Sheet You’ll Need (June 2025 Edition)
How I choose between GPT-4o / GPT-4.1, Claude 4, Gemini 2.5, Copilot & the “classic” models
People keep asking, “Which model should I enable for my team?”
The answer flips depending on two axes:
Is the task code-heavy or not?
How much context (tokens, tooling, modality) do you actually need?
Below I walk through today’s main line-ups — including Gemini and GitHub Copilot — then finish with a copy-pastable cheat-sheet you can pin to Slack or Notion.
The Modern Coding Stack
Claude Opus 4
Anthropic’s flagship (May 2025). 200 k-token window, a plan → generate → run → fix loop baked in, and roughly 72 % on SWE-Bench Verified when tool use is allowed. The “maximum reasoning, maximum bill” option — pull it out for bugs that span ten layers of abstraction.
Claude Sonnet 4
Same 200 k context but cheaper. My default for repo-wide refactors, bulk scaffolding, and long migrations where sheer breadth beats depth.
GPT-4.1
OpenAI’s April-2025 release. 1 M-token window and 54 % on SWE-Bench Verified with the stock agent. Best OpenAI pick when “merge a CI-green PR” is the definition of done.
Gemini 2.5 Pro
2 M-token context — the current record — and about 64 % on SWE-Bench with Google’s reference agent. Ideal for monorepos or data rooms that dwarf even 4.1’s window.
GPT-4o
128 k tokens, near-real-time latency, full multimodal I/O (text + vision + voice). Feels like a senior engineer who types at 400 WPM.
Llama 3 / Code Llama (open source)
≈67 % on HumanEval and self-hostable. If your lawyers frown on cloud APIs, spin this up behind the firewall.
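If you want to compare the hosted models hands-on, each is one SDK call away. Here is a minimal Python sketch, assuming you have the provider API keys in your environment; the model-ID strings are assumptions based on current provider docs and may have rotated by the time you read this.

```python
# Hedged sketch: the same prompt against the three hosted coding flagships.
# Model IDs below are assumptions; check your provider's model list before use.
import os
from openai import OpenAI            # pip install openai
import anthropic                     # pip install anthropic
import google.generativeai as genai  # pip install google-generativeai

prompt = "Refactor this function to remove the nested loops: ..."

# OpenAI GPT-4.1 (1 M-token window)
openai_client = OpenAI()  # reads OPENAI_API_KEY
gpt_reply = openai_client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

# Anthropic Claude Sonnet 4 (200 k-token window)
claude_reply = anthropic.Anthropic().messages.create(  # reads ANTHROPIC_API_KEY
    model="claude-sonnet-4-20250514",  # assumed model ID
    max_tokens=2048,
    messages=[{"role": "user", "content": prompt}],
).content[0].text

# Google Gemini 2.5 Pro (2 M-token window)
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini_reply = genai.GenerativeModel("gemini-2.5-pro").generate_content(prompt).text
```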
Two “meta” layers you should know
Gemini Code Assist
180 k free completions / month on the individual tier — enough for weekend hacks or student projects. blog.google, devops.com
GitHub Copilot
It’s not one model: Copilot routes calls to GPT-4.1, Sonnet 4, or Gemini depending on prompt + cost, then adds its own IDE & PR automation. The new “Coding Agent” can spin up a VM, run tests, and open a PR by itself. github.blog, theverge.com
Everyday Knowledge-work & Multimodal
GPT-4o — Real-time voice/vision chat; feels like FaceTime with an expert. openai.com
Gemini 2.5 Pro — Drop a 1 GB PDF or a thousand-page contract and ask “Summarise every compliance risk.” blog.google
Claude Sonnet 4 — Most steerable tone; my default for strategy docs or investor memos. anthropic.com
Perplexity AI — Web-search-first assistant with inline citations — great for fact checking.
Llama 3 local — Offline summarisation and translation when data can’t leave the subnet.
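For that air-gapped case, here is a minimal sketch of offline summarisation, assuming Ollama is installed on the box and the model has already been pulled with `ollama pull llama3`; any llama.cpp- or vLLM-backed server works the same way.

```python
# Hedged sketch: summarise a document with a locally hosted Llama 3 via Ollama.
# Nothing leaves the machine; no cloud API key is involved.
import ollama  # pip install ollama


def summarise(text: str) -> str:
    """Ask the local Llama 3 instance for a five-bullet summary."""
    response = ollama.chat(
        model="llama3",
        messages=[{
            "role": "user",
            "content": f"Summarise the following in five bullet points:\n\n{text}",
        }],
    )
    return response["message"]["content"]


if __name__ == "__main__":
    with open("contract.txt", encoding="utf-8") as f:  # hypothetical input file
        print(summarise(f.read()))
```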
Model Lineage at a Glance
OpenAI — GPT-3.5 → GPT-4 (32 k) → GPT-4o (128 k, multimodal) → o3 → GPT-4.1 (1 M).
Anthropic — Claude Instant → Claude Sonnet 4 → Claude Opus 4 (200 k, highest agentic-coding score here at ~72 % SWE-Bench Verified).
Google DeepMind — Gemini 1.x → Gemini 1.5 Pro → Gemini 2.5 Pro (2 M tokens).
Meta OSS — Code Llama → Llama 3 (70 B).
GitHub Copilot — A service layer that orchestrates several of the above and now ships a hands-free “Coding Agent”.
Cheat-Sheet — “Use X When…”
🟢 Tiny bug or regex
→ GPT-4o (fastest chat)
🟢 Repo-wide refactor (hundreds of files)
→ Claude Sonnet 4 (200 k context)
🟢 Nasty multi-step bug, need max reasoning
→ Claude Opus 4 (~72 % SWE-Bench)
🟢 CI-gated, real-world bug fix
→ GPT-4.1 (best OpenAI pass rate)
🟢 Ultra-long spec (500 k+ tokens) or data room
→ Gemini 2.5 Pro (2 M context)
🟢 Live voice / screen-share Q&A
→ GPT-4o (sub-second multimodal)
🟢 Free weekend hack
→ Gemini Code Assist free tier
🟢 Air-gapped server or strict NDA
→ Self-host Llama 3 / Code Llama
🟢 “Open a branch, run tests, push PR for me”
→ GitHub Copilot Coding Agent
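If you prefer your heuristics executable, the table above collapses into a few lines of Python. The function name, task labels, and thresholds are my own shorthand, not an official taxonomy.

```python
# The cheat-sheet as a tiny routing function: pure heuristics, no API calls.
def pick_model(
    task: str = "chat",          # e.g. "refactor", "nightmare-bug", "voice", "tiny-bug"
    context_tokens: int = 0,
    needs_ci_green: bool = False,
    air_gapped: bool = False,
    hands_free_pr: bool = False,
) -> str:
    if air_gapped:
        return "Llama 3 / Code Llama (self-hosted)"
    if hands_free_pr:
        return "GitHub Copilot Coding Agent"
    if context_tokens > 500_000:
        return "Gemini 2.5 Pro"        # only 2 M-token window on the list
    if needs_ci_green:
        return "GPT-4.1"               # best OpenAI SWE-Bench pass rate
    if task in {"refactor", "migration", "scaffolding"}:
        return "Claude Sonnet 4"       # 200 k context, cheaper than Opus
    if task in {"nightmare-bug", "multi-step-debug"}:
        return "Claude Opus 4"         # maximum reasoning, maximum bill
    return "GPT-4o"                    # fast multimodal default for everything small


# pick_model("refactor", context_tokens=150_000)  ->  "Claude Sonnet 4"
```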
Final take
Proto-builders & refactor ninjas → Claude Sonnet 4
Spin up an MVP, wipe out boilerplate, or refactor hundreds of files in one prompt.
Bug-hunters on nightmare tickets → Claude Opus 4
Its deeper reasoning and “plan → run → fix” loop catch multi-layer, cross-repo defects.
For everything else, match the model to context length, latency, cost, and data policy — then switch the moment the leaderboard flips.