
There Are So Many AI Models. How Do I Pick the Right One?

Ramez Kouzy, MD · 10 min read

What you'll learn

  • The big three: OpenAI, Anthropic, Google -- strengths of each
  • How a radiation oncologist actually uses different models
  • Free tier vs paid: when it matters
  • Benchmarks and medical arenas: how to evaluate models
  • The frontier: reasoning models, agents, deep research

The Honest Answer: It Depends (But Let Me Help)

Every month there is a new "best" AI model. Every week there is a new benchmark. If you try to pick the objectively best model, you will be researching forever and using nothing.

So let me save you some time: there is no single best model. The right model depends on what you are trying to do.

That said, you need a starting point. Here is the current landscape as of early 2026, filtered through what actually matters for clinicians.


The Big Three (And What They Are Good At)

OpenAI: The GPT-5 Family

OpenAI started this whole wave with ChatGPT. Their model lineup has evolved from a single model into a family.

GPT-5.2 is their current flagship. It comes in three variants:

  • GPT-5.2 Instant - Fast and conversational for everyday use
  • GPT-5.2 Thinking - Adds adaptive reasoning, spending more time on complex problems
  • GPT-5.2 Pro - Maximum capability for the hardest tasks

This model currently tops the overall benchmark charts for complex, multi-step reasoning.

GPT-5 and GPT-5.1 are still available and excellent. GPT-5 is the workhorse for coding and agentic tool use. GPT-5.1 refined the conversational tone and added better personalization.

GPT-4o remains the free tier default. Fast, multimodal (handles images, voice, text), and still surprisingly capable for a model you do not pay for.

The older o-series reasoning models (o3, o4-mini) are still accessible but are being eclipsed by the unified GPT-5.2 Thinking approach.

Best for: Voice mode (the real-time voice interface is the best available), image generation (GPT-Image leads the field), and complex reasoning with GPT-5.2 Thinking. ChatGPT also has the largest ecosystem of plugins and integrations.

Access: Free tier gives you GPT-4o. ChatGPT Plus ($20/month) gives you GPT-5.2 and the full lineup. chat.openai.com

Anthropic: Claude Opus 4.5, Sonnet 4.5, Haiku

Claude is my daily driver for clinical thinking and complex work. Anthropic's models tend to be more careful and nuanced in their reasoning. Claude is particularly strong at following detailed instructions and working with long documents.

Claude Sonnet 4.5 is their newest release and the default starting point Anthropic recommends. According to Anthropic's own benchmarks, it is the best model in the world for real-world coding agents and computer use. It is the everyday workhorse: fast, capable, and excellent for most tasks.

Claude Opus 4.5 is the premium tier with extended thinking capabilities. When you need maximum reasoning depth - working through a complex clinical question, writing something nuanced, tackling a hard coding problem - Opus with thinking enabled is exceptional. It currently tops the coding and web development leaderboards on LM Arena.

Claude Haiku is the fast, lightweight option for simple tasks where speed matters more than depth.

Best for: Long-form writing, nuanced clinical reasoning, coding (leads the web development and coding benchmarks), and working with very long documents. Claude has an enormous context window that can process entire textbooks. The thinking mode is particularly good for step-by-step clinical reasoning.

Access: Free tier gives you Sonnet 4.5 with limits. Claude Pro ($20/month) gives you Opus access and higher limits. claude.ai
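If you are the kind of person who pokes at the API rather than (or in addition to) the chat app, extended thinking is just a request parameter. Here is a minimal sketch using Anthropic's Python SDK; the model string is a placeholder, so substitute whatever Opus identifier is current, and you will need your own API key.

```python
# Minimal sketch: extended thinking via Anthropic's Python SDK.
# Assumes `pip install anthropic` and an ANTHROPIC_API_KEY environment variable.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",  # placeholder; check Anthropic's docs for the current Opus name
    max_tokens=4000,
    # Extended thinking: give the model a private token budget to reason
    # step by step before it writes the visible answer.
    thinking={"type": "enabled", "budget_tokens": 2000},
    messages=[
        {"role": "user", "content": "Walk through the differential for painless jaundice step by step."}
    ],
)

# The response mixes "thinking" blocks with the final "text" blocks; print the answer.
for block in response.content:
    if block.type == "text":
        print(block.text)
```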

Google: Gemini 3 Pro, Gemini 3 Flash

Gemini is integrated into Google's ecosystem, which gives it a unique advantage: built-in access to Google Search. When you ask Gemini a factual question, it pulls from live web results rather than relying solely on training data.

Gemini 3 Pro is their flagship and currently the number one model on LM Arena's user preference rankings, meaning real humans in blind tests prefer its responses over those of every other model. It also supports thinking levels, letting it reason more deeply when needed.

Gemini 3 Flash is the speed-optimized version: remarkably capable for its speed, with its own thinking modes (including levels not available on Pro). Flash is an excellent daily driver when you want quick, grounded answers without waiting.

Both models support Grounding with Google Search, which means when you ask a factual question, the model can search the live web and cite its sources. Gemini 3 Pro with Grounding currently leads the Search Arena for citation quality and factuality.

Best for: Search-grounded answers with real citations (best in class), quick factual lookups, integration with Google Workspace (Docs, Gmail, etc.), multimodal tasks, and Deep Research (multi-step investigations that synthesize dozens of sources).

Access: Free tier available with solid capabilities. Gemini Advanced ($20/month) gives you the full Gemini 3 Pro with higher limits. gemini.google.com
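If you are curious what grounding looks like under the hood, here is a minimal sketch using Google's genai Python SDK. The model name is a placeholder and the query is just an example; the point is that attaching the search tool is a one-line change.

```python
# Minimal sketch: Grounding with Google Search via the google-genai SDK.
# Assumes `pip install google-genai` and a GEMINI_API_KEY environment variable.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3-pro",  # placeholder; substitute a Gemini model your key can access
    contents="What is the current first-line systemic therapy for metastatic HER2-positive breast cancer?",
    config=types.GenerateContentConfig(
        # Attaching the Google Search tool lets the model pull live web results
        # and return citations alongside its answer.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

print(response.text)
# Sources and search queries used for grounding travel with the candidate.
print(response.candidates[0].grounding_metadata)
```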


Apps Built on Top of AI: A Different Category Entirely

Not every AI tool is a chatbot. Many of the most useful tools for clinicians are applications built on top of AI models. They use an LLM under the hood, but they wrap it in a specific workflow designed for a particular task.

Think of it this way: ChatGPT and Claude are like general-purpose operating systems. Apps built on top of them are like the specialized software you install. You do not use Windows to do your taxes - you use TurboTax, which runs on Windows.

Tool | What It Does | When to Use It
Open Evidence | AI answers grounded in medical literature | Clinical questions that need evidence-based answers with real citations
Abridge | AI-powered clinical documentation from conversation | Automatic scribing during patient encounters
Elicit | AI-powered literature search and synthesis | Research questions, literature reviews, finding relevant papers
Perplexity | AI research with citations from the live web | Current information with sourced answers

The point: when someone says "AI gave me the wrong answer," the real question is often "were you using the right tool for the job?" A general-purpose chatbot and a specialized medical AI application are solving fundamentally different problems. We cover this in more detail in the next article.


How I Actually Use Them

I do not use one model for everything. My real workflow:

Claude Opus for complex reasoning, coding projects, and anything where I need the model to think deeply. If I am working through a difficult clinical question or writing something nuanced, Opus is my go-to.

Claude Sonnet for day-to-day clinical thinking, quick drafts, summarization, and routine tasks. It is fast, capable, and good enough for 80 percent of what I need.

Gemini for questions where I want grounded, current information. Because it can search the web natively, I trust it more for factual queries than models that rely on training data alone.

ChatGPT for voice mode (talking through ideas while commuting), image generation, and when I want a quick second opinion from a different model family.

Open Evidence when I need medical evidence with real citations. This is not a chatbot question - this is a clinical evidence question.

The Key Insight

I switch between models the way you might switch between imaging modalities. CT for anatomy, MRI for soft tissue, PET for metabolic activity. Each tool has its strength.


The Quick Reference

Model | Best For | Weakness
Claude (Anthropic) | Long documents, nuanced writing, clinical reasoning | No native web search
ChatGPT (OpenAI) | Code, voice mode, web browsing, image generation | Verbose, tends to agree with you
Gemini (Google) | Search grounding, Workspace integration | Less polished prose
Perplexity | Quick factual questions with cited sources | Not ideal for long-form writing

General-purpose chatbots (use for thinking, drafting, brainstorming):

  • Gemini 3 Pro - number one in user preference, best for search-grounded answers
  • GPT-5.2 Thinking - strongest benchmark reasoning, best for complex multi-step problems
  • Claude Opus 4.5 - best for coding, writing, and long documents
  • Claude Sonnet 4.5 - best everyday balance of speed and quality
  • Gemini 3 Flash - fast, capable, excellent value with thinking modes
  • GPT-5.2 Instant - fast, conversational, great for everyday ChatGPT use

Specialized apps (use for specific clinical/research tasks):

  • Open Evidence - clinical evidence with real citations
  • Abridge - AI-powered clinical documentation
  • Elicit - AI-powered literature search and synthesis
  • Perplexity - web research with sources

Benchmarks: Do They Matter?

You may have seen references to medical benchmarks like MedQA, or leaderboards like LM Arena and its medical sub-arena. These are standardized tests that compare models on various tasks.

Are they useful? Somewhat. Benchmarks tell you which models are generally capable. LM Arena is particularly valuable because it uses blind human preference voting - real people compare model outputs without knowing which model produced them.

But benchmarks test narrow slices of performance. A model that scores highest on multiple-choice medical questions might not be the best at drafting a nuanced manuscript discussion section.

My advice: use benchmarks to narrow the field, then try the top contenders yourself on your actual tasks. Your workflow is the only benchmark that truly matters.
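If you are comfortable with a few lines of Python, "try it yourself" can literally mean sending the same prompt to two models and reading the answers side by side. A rough sketch, assuming placeholder model names and API keys for both providers:

```python
# Minimal sketch: the only benchmark that matters is your own prompt.
# Assumes `pip install openai anthropic` and API keys for both providers.
import anthropic
from openai import OpenAI

PROMPT = "Summarize the key counseling points for a patient starting adjuvant tamoxifen."

gpt_answer = OpenAI().chat.completions.create(
    model="gpt-5.2",  # placeholder model name
    messages=[{"role": "user", "content": PROMPT}],
).choices[0].message.content

claude_answer = anthropic.Anthropic().messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1000,
    messages=[{"role": "user", "content": PROMPT}],
).content[0].text

for name, answer in [("GPT", gpt_answer), ("Claude", claude_answer)]:
    print(f"\n===== {name} =====\n{answer}")
```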


Free Tier vs. Paid: Does It Matter?

Yes, honestly, it does. Free tiers are fine for trying things out, but they typically give you weaker models, lower usage limits, or both.

For important work - clinical reasoning support, manuscript drafting, complex analysis - use the best model you can access. Twenty dollars a month for ChatGPT Plus or Claude Pro is less than most physicians spend on coffee. If you are going to integrate AI into your workflow seriously, pay for a subscription. The difference in output quality between free and paid tiers is noticeable.

That said, start free. Figure out which platform you prefer. Then upgrade when you are ready to use it for real work.


Open Source: Worth Knowing About

There is a parallel universe of open-source models - Llama (Meta), Qwen (Alibaba), DeepSeek, Mistral - that you can run locally on your own hardware.

The advantages are privacy (data never leaves your machine) and cost (free). The disadvantages are that they generally lag behind the frontier commercial models in capability and that setting them up requires some technical comfort.

DeepSeek deserves a special mention - their models have been remarkably competitive with commercial offerings at a fraction of the cost. Qwen3 is also emerging as a strong open-source option, particularly for vision tasks.

For most clinicians, the commercial options are the right starting point. But if data privacy is a critical concern - say, you want to experiment with clinical notes - open-source models running locally are worth investigating down the road.
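For a sense of what "running locally" involves, here is a minimal sketch using Ollama (ollama.com) and its Python package. It assumes you have installed Ollama and pulled a model first (for example, running `ollama pull llama3.1` at the command line); the model name is just an example.

```python
# Minimal sketch: chatting with a locally hosted open-source model via Ollama.
# Assumes Ollama is installed, a model has been pulled (e.g. `ollama pull llama3.1`),
# and `pip install ollama`. Everything runs on your machine; no data leaves it.
import ollama

response = ollama.chat(
    model="llama3.1",  # any locally pulled model works here
    messages=[
        {"role": "user", "content": "Explain the ALARA principle in one short paragraph."}
    ],
)

print(response["message"]["content"])
```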


The Frontier: What Is Coming

The model landscape is not just about chatbots anymore. A few concepts worth having on your radar:

Thinking and reasoning modes are now built into the flagship models. GPT-5.2 Thinking, Claude Opus with extended thinking, and Gemini 3 with thinking levels all let the model reason step by step before answering. They are slower but dramatically better at complex problems. Think of the difference between a snap clinical judgment and carefully working through a differential.

Projects and memory let models maintain context across conversations. Instead of starting fresh every time, you can build a persistent workspace around a research project or clinical question.

Agents are models that can take actions - not just generate text but actually do things. Search the web, run code, interact with tools, complete multi-step workflows. This is the direction everything is moving.
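To make that slightly less abstract, here is a toy sketch of tool use with Anthropic's Python SDK: instead of answering directly, the model can ask your code to run a function you have defined. The tool, model name, and scenario are illustrative placeholders, not a real clinical integration.

```python
# Toy sketch: giving a model a tool it can choose to call (the core "agent" idea).
# The tool here is a made-up placeholder; nothing connects to a real record system.
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "lookup_lab_value",
        "description": "Look up the most recent value of a lab test for the current patient.",
        "input_schema": {
            "type": "object",
            "properties": {"test_name": {"type": "string"}},
            "required": ["test_name"],
        },
    }
]

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1000,
    tools=tools,
    messages=[{"role": "user", "content": "Is this patient's creatinine trending up?"}],
)

# Rather than guessing, the model can emit a tool_use block asking your code to
# run lookup_lab_value; your code executes it and passes the result back in the
# next message, and the loop continues until the model answers in plain text.
for block in response.content:
    if block.type == "tool_use":
        print("Model wants to call:", block.name, block.input)
```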

Deep research tools (available in ChatGPT, Gemini, and others) can conduct multi-step investigations, reading and synthesizing dozens of sources to produce a comprehensive report on a topic. Think of it as a research assistant that can do a literature review in minutes.

These are not science fiction. They are available today across the major platforms. We cover them in more detail in the Tools collection.


The Bottom Line

Do not spend weeks choosing the perfect model. Pick one (I would start with Claude Sonnet or ChatGPT), use it for real tasks, and branch out as you develop a sense of what different models handle well.

The best model is the one you actually use.

And remember: for specific clinical tasks, the right answer is often not a chatbot at all - it is a specialized app built on top of AI. Match the tool to the task.

The model landscape will keep changing. What matters is building the skill of working with these tools effectively - and that skill transfers across every model.
