Guide · Getting Started · Beginner

I Tried ChatGPT in 2023 and It Made Up Citations. Why Should I Trust It Now?

Ramez Kouzy, MD · 6 min read

What you'll learn

  • Why early AI models hallucinated citations
  • How models have improved since 2023
  • The right mental model: thinking partner, not database
  • Good vs bad use cases for AI in clinical work
  • How to use AI for brainstorming, drafting, and synthesis

The Story I Hear Every Week

It goes something like this: "I asked ChatGPT to give me five references supporting hypofractionation in breast cancer. It gave me five papers. Three of them did not exist. I closed the tab and have not used it since."

If that is your story, you are not alone. It was many people's story. And honestly, it was a reasonable reaction.

If a tool confidently fabricates citations - complete with fake authors, fake journals, and fake DOIs - why would you trust it with anything?

My answer: you were right to be skeptical, and you were also using it wrong. Both things are true.


What Actually Happened (And Why)

In 2023, most people's first instinct with ChatGPT was to treat it like a search engine. You typed a question, expected accurate references, and got burned.

This is because LLMs do not retrieve information from a database. They generate text by predicting, word by word, what is most likely to come next based on patterns in their training data.

When you asked for a citation, the model did not look up a paper. It predicted what a citation should look like based on the patterns it learned. It knew that a citation about hypofractionation in breast cancer should probably have an author name that sounds like a breast cancer researcher, a journal that sounds right (maybe the Red Journal or JCO), and a year that seems plausible.

So it generated one. Sometimes it matched a real paper. Often it did not.

This is called hallucination, and it was a real and serious problem. It has not disappeared entirely - models can still fabricate details, especially when pushed for specifics they were not trained on. But it has improved dramatically.

Modern models hallucinate far less frequently, and many now have access to web search or retrieval tools that let them ground their responses in actual sources.
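
If it helps to see that distinction spelled out, here is a deliberately oversimplified sketch in Python. It is not how any real model works under the hood, and every name, journal, and database entry in it is an invented placeholder. The point is only the contrast: a lookup can return nothing but what is actually stored, while a pattern generator assembles something that merely looks like a citation, with no check that the paper exists.

```python
import random

# Toy illustration only: this is NOT how an LLM is implemented.
# All entries below are made-up placeholders.

STORED_PAPERS = {
    # A retrieval tool (PubMed, Elicit, etc.) can only return records
    # that actually exist in its database.
    "hypofractionation breast cancer": ["Placeholder Real Paper A", "Placeholder Real Paper B"],
}

def retrieve_citations(topic: str) -> list[str]:
    """Database-style lookup: real stored records, or nothing at all."""
    return STORED_PAPERS.get(topic, [])

def generate_citation(topic: str) -> str:
    """Pattern-style generation: stitches together plausible-looking pieces,
    with no guarantee the resulting paper exists."""
    author = random.choice(["Smith J", "Chen L", "Garcia M"])
    journal = random.choice(["Int J Radiat Oncol Biol Phys", "J Clin Oncol", "Radiother Oncol"])
    year = random.choice(range(2008, 2021))
    return f"{author}, et al. {topic.title()}. {journal}. {year}."

print(retrieve_citations("hypofractionation breast cancer"))  # only what is stored
print(generate_citation("hypofractionation breast cancer"))   # plausible, possibly fictional
```

The first function refuses to invent; the second cannot help but invent. Early chatbots behaved much more like the second, which is exactly why asking them for references went badly.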

The Core Lesson

The citation disaster taught us something valuable: LLMs are not search engines. They are pattern generators. Once you internalize that distinction, you stop making the wrong kinds of requests.


The Mental Model Shift

Stop thinking of AI as a reference manager. It is not PubMed. It is not Google Scholar. It is not UpToDate.

If you want to find specific papers with real citations, use tools built for that purpose - Elicit, Semantic Scholar, Consensus, PubMed. Those tools are designed to search actual databases and return real references. See our guide on AI for literature review for detailed workflows.

So what is AI good for?

Think of it as a sharp PhD student who has read broadly, thinks quickly, and is eager to help - but who needs your direction and your verification.

You would not let a PhD student compile your reference list and then submit it unchecked. But you would absolutely ask them to:

  • Brainstorm angles for a research question
  • Summarize a long paper in plain language
  • Draft an introduction for a manuscript
  • Walk through the pros and cons of two treatment approaches
  • Help you think through a confusing clinical scenario
  • Challenge your assumptions about a case

That is where AI excels. Not as a fact database, but as a thinking partner.


Good Use Cases vs. Bad Use Cases

Bad requests:

  • Bad request: "Give me 10 references supporting the use of proton therapy for pediatric medulloblastoma."
    Why it fails: You are asking the model to retrieve specific facts from a database it does not have.
    Better approach: Use Elicit or PubMed for the search, then ask AI to help organize and summarize what you found.

  • Bad request: "What is the five-year overall survival for nasopharyngeal carcinoma treated with concurrent chemoradiation?"
    Why it fails: You want a specific number with a specific source - the model might confidently fabricate it.
    Better approach: Look up the data in a clinical guideline or trial, then ask AI to help you contextualize it for a presentation.

  • Bad request: "Write my Methods section and include all the statistical tests with citations."
    Why it fails: You are asking AI to generate verifiable facts without your oversight.
    Better approach: Draft the Methods yourself or with AI, then YOU add the citations from your actual references.

Good requests:

  • Good request: "I'm writing a review on proton therapy for pediatric medulloblastoma. Help me outline the key arguments - what are the main clinical rationales, what does the dosimetric data suggest, and what are the main criticisms?"
    Why it works: You are using the model as a thinking partner to organize arguments - you will populate real citations later.

  • Good request: "I'm preparing a talk on nasopharyngeal carcinoma outcomes. Help me think through how to structure the survival data section. What are the key variables that affect outcomes, and what subgroups should I address?"
    Why it works: The model helps you think through structure and framing - you bring the specific data.

  • Good request: "Here is a draft of my Methods section. Can you review it for clarity, suggest improvements to the flow, and flag anything that seems unclear?"
    Why it works: You wrote the content with real citations - AI just helps with editing and clarity.

See the pattern? The best use cases involve synthesis, drafting, brainstorming, and critique. The worst use cases involve asking for specific factual retrieval without verification.


The Right Question Is Not "Is It Accurate?"

When clinicians evaluate AI tools, they default to asking: "Is the output accurate?" This is a natural instinct - accuracy is everything in medicine. But it is the wrong primary question for a thinking tool.

The better question is: "Is this useful for my thinking?"

A brainstorm does not need to be perfect. A first draft does not need to be publishable. An outline does not need to be comprehensive. These are starting points that save you time and expand your thinking.

You still bring the clinical expertise, the verification, and the final judgment.

If a model helps you generate a first draft in 20 minutes instead of 2 hours, and you spend 30 minutes editing it, you have still saved over an hour. The draft did not need to be flawless. It needed to be a useful starting point.

This is a fundamentally different relationship than you have with UpToDate or PubMed, where accuracy is the entire point. AI chatbots are not replacing those tools. They are complementing them by handling the parts of your work that benefit from speed, breadth, and creative synthesis.


Models Have Improved. Your Approach Should Too.

To be fair to the technology: the models available today are substantially better than what you tried in 2023.

GPT-4o, Claude Opus and Sonnet, Gemini - these are meaningfully more capable, more calibrated, and less prone to fabrication than early ChatGPT. Many now support web search, retrieval from documents, and citations grounded in actual sources.

Learn more about which model to use for different tasks.

But even with perfect models, the mental model shift matters. The most effective AI users in medicine are not the ones who found the most accurate model. They are the ones who learned to ask the right kinds of questions, verify the outputs, and integrate AI into a workflow that still has them firmly in the driver's seat.


The Bottom Line

If ChatGPT burned you in 2023, that experience taught you something important: never trust AI output without verification. Good. Keep that instinct. It will serve you well.

But do not let one bad experience with an early tool close you off to an entire category of technology that has matured significantly and can genuinely make your work faster, broader, and more creative.

The tool was not broken. Your mental model of what it was for needed updating.

Now that you have that update, you are ready to actually start using it.

Enjoyed this guide?

Subscribe to Beam Notes for more insights delivered to your inbox.
