Hi HN, submitting from a burner since I'm an applicant in the current medical residency admissions cycle. I thought it was worth showing the real-world implications of using LLMs to extract information from PDFs. For context, Thalamus is a company that handles the "backend" for residency programs and all the applications they receive (including handling who to invite for interviews, etc.). One of the more important factors in deciding applicant competitiveness is their medical school performance (their grades), but that information is buried in PDFs sent by schools (often not standardized). So this year, they decided to pilot a tool that would extract that info (using "GPT-5o-mini": https://www.thalamusgme.com/blogs/methodology-for-creation-a...). Some programs have noticed a discrepancy between extracted vs. reported grades (often in the direction of hallucinating "fails") and brought it to the attention of Thalamus. Unfortunately, it doesn't look like the company is discontinuing usage of the tool.
Regardless, given that there have been a number of posts looking into usage of LLMs for numerical extraction, I thought this story would be a useful cautionary tale.
EDIT: I put "GPT-5o-mini" in quotes since that was in their methodology...yes, I know the model doesn't exist
Hi wondering if you could message me at shane.shifflett@dowjones.com or via signal at 929 638 0009? https://www.wsj.com/news/author/shane-shifflett
You are so brave. I get like 8 spammers calling me daily about loans like I owe them money, and that's without blasting my phone number out to the internet.
If you're not using it yet, I recommend enabling the call screening feature on your phone; it has basically reduced my spam calls to zero. It's available on iPhones, Pixels, and Samsung phones (and probably others?)
I have it, but you still have junk in missed calls and voicemails.
It's amazing how much of "inter-organization information flow" still happens over PDFs and/or just FTP'ing files around.
A couple of jobs ago at a hedge fund, I owned the system that would take financial data from counterparties, process it, and send it to internal teams for reconciliation, etc.
The spectrum went from "receive updates via SWIFT (as in financial) protocol" to "small oil trading shop sending us PDFs that are different every month". As you can imagine, the time and effort to process those PDFs occasionally exceeded the revenue from the entire transaction with that counterparty.
As others have pointed out: yes, the overall thrust of the industry is to get to something standardized but 100% adoption will probably never happen.
I write more about the FTP side of things in the Twitter thread below: https://x.com/alexpotato/status/1809579426687983657
> the time and effort to process those PDFs occasionally exceeded the revenue from the entire transaction with that counterparty.
I'm interested in what the conditions were that didn't let you reject those kinds of transactions, or blacklist them for the future.
We hear about companies firing/banning unprofitable customers sometimes, surprised it doesn't happen more often honestly.
Thank you for sharing this.
It's astonishing that places like this will do almost anything rather than create a simple API to ingest data that could easily be pushed automatically.
I imagine they would love to create a simple API for this, but the problem is convincing thousands of schools to use that API.
If all you can get are PDFs, attempting to automatically extract information from those PDFs is a reasonable decision to make. The challenge is doing it well enough to avoid these kinds of show-stopper problems.
they're essentially an ATS SaaS for medical schools; if they have enough schools, or enough prestigious schools, they can ask for whatever they want and the applicant schools would oblige. cheeky way to make it happen overnight: give a slight advantage to transcripts that are submitted digitally - the conversion would be complete within months.
If you want to get sued, sure.
The trouble is getting people to use your API - in this case med schools, but it can be much, much worse (more and smaller organizations sending you data, and in some industries you have a legal obligation to accept it in any format they care to send).
Why don't they just email a form after/when you apply and have you fill in all the grades in a structured way? How many grades are we talking about here? Then the PDF would just be the proof that your grades were real.
Because you'd have to get thousands of schools to agree to using the same format.
Does the student not have access to the grades? As they are applying to medical school, a few hours of drudgery form filling will still be the easiest part of the process.
It's a bit complicated. Each school has their own grading system (some pass fail, others four tiered, others full letter grades). Additionally, there are reported distributions for each grade. Lastly, there's sometimes a summary statement at the end that usually says "X student was 'superlative'" and then a table at the end that says 'superlative' means top X% of class. On top of that, students may not get their full dean's letter that says all of this stuff. Basically, self reporting is very difficult to do given the amount of variability in grade reporting.
Maybe have the students verify their own grades extracted by the LLM?
How should medical residency work? Like how should admissions work, is the match doing what you would want it to do, is there a radical alternative, etc? You have our attention!
Don’t tell me the grades should be gathered accurately. Obviously. Tell me something bigger.
> this year, they decided to pilot a tool that would extract that info (using "GPT-5o-mini": https://www.thalamusgme.com/blogs/methodology-for-creation-a...).
Mind-boggling idea, because OCR and pulling info out of PDFs have been done better, and for longer, by many more mature methods than having an LLM do it.
Nit: I'd say, as someone who spent a fair amount of time doing it in the life insurance space, actually parsing arbitrary PDFs is very much not a solved problem without LLMs. Parsing a particular PDF is, at least until they change their table format or w/e.
I don’t think this idea is totally cursed, I think the implementation is. Instead of using it to shortcut filling in grades that the applicant could spot check, like a resume scraper, they are just taking the first pass from the LLM as gospel.
Right - the problem with PDF extraction is always the enormous variety of shapes that data might take in those PDFs.
If all the PDFs are the same format you can use plenty of existing techniques. If you have no control at all over that format you're in for a much harder time, and vLLMs look perilously close to being a great solution.
Just not the GPT-5 series! My experiments so far put Gemini 2.5 at the top of the pack, to the point where I'd almost trust it for some tasks - but definitely not for something as critical as extracting medical grades that influence people's ongoing careers!
-> put Gemini 2.5 at the top of the pack
I have come to the same conclusion, having built a workflow that has seen 10 million+ non-standardized PDFs (freight bills of lading) with running evaluations, as well as against the initial "ground-truth" dataset of 1,000 PDFs.
Humans: ~65% accurate
Gemini 1.5: ~72% accurate
Gemini 2.0: ~88% accurate
Gemini 2.5: ~92%* accurate
*Funny enough we were getting a consistent 2% improvement with 2.5 over 2.0 (90% versus 88%) until as a lark we decided to just copy the same prompt 10x. Squeezed 2% more out of that one :D
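For anyone curious what that looks like in practice, here's a rough sketch of the repeated-prompt hack, assuming the google-genai Python SDK; the prompt, filename, and field names are made up for illustration:

```python
# Rough sketch of the "copy the same prompt 10x" hack described above, assuming
# the google-genai Python SDK. Prompt, filename, and fields are illustrative.
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

prompt = (
    "Extract every line item from this bill of lading as JSON with keys "
    "'description', 'quantity', and 'weight_lbs'. Use null for unreadable values."
)

with open("bol_scan.pdf", "rb") as f:
    pdf_part = types.Part.from_bytes(data=f.read(), mime_type="application/pdf")

# Repeating the identical instruction was reported to squeeze out ~2% more
# accuracy; treat it as an empirical hack, not a documented feature.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[pdf_part] + [prompt] * 10,
)
print(response.text)
```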
Gemini 3.0 is rumored to drop any day now, will be very interesting to see the score that gets for your benchmark here.
As long as the ergonomics with the SDK stay the same. Jumping up to a new model this far in is something I don't want to contemplate wrestling with honestly. When we were forced off of 1.5 to 2.0 we found that our context strategy had to be completely reworked to recover and see better returns.
>Just not the GPT-5 series! My experiments so far put Gemini 2.5 at the top of the pack, to the point where I'd almost trust it for some tasks
Got it. The non-experts are holding it wrong!
The laymen are told "just use the app" or "just use the website". No need to worry about API keys or routers or wrapper scripts that way!
Sure.
Yet the laymen are expected to maintain a mental model of the failure modes and intended applications of Grok 4 vs Grok 4 Fast vs Gemini 2.5 Pro vs GPT-4.1 Mini vs GPT-5 vs Claude Sonnet 4.5...
It's a moving target. The laymen read the marketing puffery around each new model release and think the newest model is even more capable.
"This model sounds awesome. OpenAI does it again! Surely it can OCR my invoice PDFs this time!"
I mean, look at it:
> GPT‑5 not only outperforms previous models on benchmarks and answers questions more quickly, but—most importantly—is more useful for real-world queries.
> GPT‑5 is our best model yet for health-related questions, empowering users to be informed about and advocate for their health. The model scores significantly higher than any previous model on HealthBench, an evaluation we published earlier this year based on realistic scenarios and physician-defined criteria.
> GPT‑5 is much smarter across the board, as reflected by its performance on academic and human-evaluated benchmarks, particularly in math, coding, visual perception, and health. It sets a new state of the art across math (94.6% on AIME 2025 without tools), real-world coding (74.9% on SWE-bench Verified, 88% on Aider Polyglot), multimodal understanding (84.2% on MMMU), and health (46.2% on HealthBench Hard).
> The model excels across a range of multimodal benchmarks, spanning visual, video-based, spatial, and scientific reasoning. Stronger multimodal performance means ChatGPT can reason more accurately over images and other non-text inputs—whether that’s interpreting a chart, summarizing a photo of a presentation, or answering questions about a diagram.
And on and on it goes..."The non-experts are holding it wrong!"
We aren't talking about non-experts here. Go read https://www.thalamusgme.com/blogs/methodology-for-creation-a...
They're clearly competent developers (despite mis-identifying GPT-5-mini as GPT-5o-mini) - but they also don't appear to have evaluated the alternative models, presumably because of this bit:
"This solution was selected given Thalamus utilizes Microsoft Azure for cloud hosting and has an enterprise agreement with them, as well as with OpenAI, which improves overall data and model security"
I agree with your general point though. I've been a pretty consistent voice in saying that this stuff is extremely difficult to use.
> The laymen
The solution architect, leads, product managers, and engineers that were behind this feature are now laymen who shouldn't do their due diligence on a system to be used for an extremely important task? They shouldn't test this system across a wide range of input PDFs for accuracy and accept nothing below 100%?
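For what it's worth, the kind of check being demanded here is not much code. A rough sketch, where extract_grades() stands in for whatever pipeline is under test and the labeled-data format is invented:

```python
# Rough accuracy eval over hand-labeled transcripts. extract_grades() is a
# stand-in for the pipeline under test; the labeled-data format is invented.
import json

def evaluate(extract_grades, labels_path: str) -> float:
    with open(labels_path) as f:
        cases = json.load(f)  # [{"pdf": "...", "grades": {"Surgery": "Pass", ...}}, ...]

    correct = 0
    for case in cases:
        predicted = extract_grades(case["pdf"])   # returns {course: grade}
        if predicted == case["grades"]:           # strict per-document match
            correct += 1
        else:
            print("MISMATCH:", case["pdf"], predicted, case["grades"])
    return correct / len(cases)
```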
I've been doing PDF data extraction with LLMs at my day job, and my experience is that to get them sufficiently reliable for a document of even moderate complexity (say, has tables and such, form fields, that kind of thing) you end up writing prompts so tightly coupled to the format of the document that there's nothing but downside versus doing the same thing with traditional computer vision systems. Like, it works (ask me again in a couple of years, as the underlying LLMs have been switched out, whether it's turned into whack-a-mole and long-missed data corruption issues... I'd bet it will) but using an LLM isn't gaining us anything at all.
Like, this company could have done the same projects we've been doing but probably gotten them done faster (and certainly with better performance and lower operational costs) any time in the last 15 years or so. We're doing them now because "we gotta do 'AI'!" so there's funding for it, but they could have just spent less money doing it with OpenCV or whatever years and years ago.
Eh, I guess we've looked at different PDFs and models. Gemini 2.5 Flash is very good, and Gemini 2.0 and Claude 3.7 were passable at parsing out complicated tables in image chunks, and we did have a fairly small prompt that worked in >90% of cases. Where we had failures, they were almost always in asking the model to do something infeasible (like parse a table where the header was on a previous, not-provided page).
If you have a better way to parse PDFs using OpenCV or whatever, please provide this service and people will buy it for their RAG chat bots or to train VLMs.
Would it be helpful if an LLM creates bounding boxes for "traditional" OCR to work on? I.e., allowing extraction of information from an arbitrary PDF as if it were a "particular PDF".
The parent says
> that information is buried in PDFs sent by schools (often not standardized).
I don't think OCR will help you there.
An LLM can help, but _trusting_ it is irresponsible. Use it to help a human quickly find the grade in the PDF, don't expect it to always get it right.
Don't most jobs do OCR on the resumes sent in for employment? I get that a resume is a more standard format. Maybe that's the rub
The challenge here is that it's not just OCR for extracting text from a resume, this is about extracting grades from school transcripts. That's a LOT harder, see this excellent comment: https://news.ycombinator.com/item?id=45581480
I would assume they OCR first, then extract whatever info they need from the result using LLMs
Edit: Does sound like it - "Cortex uses automated extraction (optical character recognition (OCR) and natural language processing (NLP)) to parse clerkship grades from medical school transcripts."
It's a bit difficult to derive exactly what they're using here. There's quite a lot of detail in https://www.thalamusgme.com/blogs/methodology-for-creation-a... but it still mentions "OCR models" separately from LLMs, including a diagram that shows OCR models as a separate layer before the LLM layer.
But... that document also says:
"For machine-readable transcripts, text was directly parsed and normalized without modification. For non-machine-readable transcripts, advanced Optical Character Recognition (OCR) powered by a Large Language Model (LLM) was applied to convert unstructured image-based data into text"
Which makes it sound like they were using vision LLMs for that OCR step.
Using a separate OCR step before the LLMs is a lot harder when you are dealing with weird table layouts in the documents, which traditional OCR has usually had trouble with. Current vision LLMs are notably good at that kind of data extraction.
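As a concrete illustration, a vision-LLM "OCR" pass over a single transcript page might look roughly like this (assuming the OpenAI Python SDK; the prompt and output format are made up, and gpt-5-mini is the model the methodology document appears to actually mean):

```python
# Rough sketch of a vision-LLM "OCR" pass over one scanned transcript page,
# assuming the OpenAI Python SDK. Prompt and output format are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

with open("transcript_page_1.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-5-mini",  # the model the methodology doc appears to actually mean
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe every course name and grade on this page as a "
                     "JSON list of {\"course\": ..., \"grade\": ...}. Do not "
                     "guess; use null for anything you cannot read."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```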
Thanks, I didn't see that part!
I would love to hear more about the solutions you have in mind, if you're willing.
The particular challenge here I think is that the PDFs are coming in any flavor and format (including scans of paper) and so you can't know where the grades are going to be or what they'll look like ahead of time. For this I can't think of any mature solutions.
Welcome to the world of greybeards, baffled by everyone using AWS at 100s to 100000s of times the cost of your own servers.
spectre/meltdown, finding out your 6-month order of SSDs was stolen after opening empty boxes in the datacenter, and having to write RCAs for customers after your racks go over the PSUs' limit are things y'all greybeards seem to gloss over in your calculations, heh
Nothing new to see here. If you are still surprised by model hallucinations in 2025, it might be time for you to catch up or jump on the next hype bandwagon. Also, they reacted well:
> Once confirmed, we corrected the extracted grade immediately.
> Where the extracted grade was accurate, we provided feedback and guidance to the reporting program or school about its interpretation and the extraction methodology.
I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.
It's true, but I think people have a misunderstanding that if you add search / RAG to ground the LLM, the LLM won't hallucinate. When in reality the LLM can still hallucinate, just convincingly in the language of whatever PDF it retrieved.
RAG certainly doesn't reduce hallucinations to 0, but using RAG correctly in this instance would have solved the hallucinations they describe.
The problem with the system described in this post is OCR inaccuracies - it's convenient to use LLMs for OCR of PDFs because PDFs do not have standard layouts - just using the text strings extracted from the PDF's code results in incorrect paragraph/sentence sequencing.
The way they *should* have used RAG is to ensure that subsentence strings extracted via LLM appear in the PDF at all, but it appears they were just trusting the output without automated validation of the OCR.
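A minimal sketch of that validation step, using pypdf for the PDF's own text layer (this only works when the PDF has a text layer; a pure image scan defeats it, as noted below):

```python
# Minimal sketch of the validation described above: flag any extracted value
# that does not appear verbatim in the PDF's own text layer. Uses pypdf;
# image-only scans have no text layer, so those would still need human review.
from pypdf import PdfReader

def flag_unverified(pdf_path: str, extracted: dict) -> list:
    reader = PdfReader(pdf_path)
    raw = " ".join((page.extract_text() or "") for page in reader.pages)
    flat = " ".join(raw.split()).lower()

    flagged = []
    for course, grade in extracted.items():
        if course.lower() not in flat or str(grade).lower() not in flat:
            flagged.append(course)  # route to a human instead of trusting the model
    return flagged
```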
Is RAG the right tool for this? My understanding was that RAG uses vector similarity to compare queries (the extracted string) against the search corpus (the PDF file). The use case you describe is verification, which sounds like it would be better done with an exhaustive search via string comparison instead of vector similarities.
I could be totally wrong here.
Some people define RAG as having to use vector search, others (myself included) define RAG as any technique that retrieves additional relevant context to help generate the response, which can include triggering things like full-text search queries or even grep (increasingly common thanks to Claude Code et al).
RAG is just "Retrieval Augmented Generation", vector similarity is one way to do that retrieval but not the only. Though you are right, there is really no retrieval step augmenting the generation here, more like just a validation step stuck on the end.
Though I imagine scenarios where the PDF is just an image (e.g. a scan of a form), and thus the validation would not work.
What's new or pertinent here is the specific real world use case and who it's impacting.
>It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.
Again I would say that's why context is significant. You are strictly right, but it was applied in this instance for the purpose of faithfully representing grades. So I wouldn't say it's necessarily a matter of misunderstanding design, the errors are real after all, but the fact that it was entrusted for the purpose of faithful factual representation is what makes it an important story.
Hallucinations are also completely normal, "by design", just the output / experience of the system that produces it. It's just us who decided on the classification of what's real and what isn't, and looking at the state of things, we are not even very good at agreeing on where the limit is.
I know this sounds pedantic, but I think that the phenomenon itself is very human, so it's fascinating that we created something artificial that is a little bit like another human, and here it goes, producing similar issues. Next thing you know it will have emotions, and judgment.
Never thought about it from that perspective, but I think you're right. It is by design, not deceptive intent, just the infinite monkeys theorem where we've replaced randomness with pattern matching trained on massive datasets.
Another way to look at it is that everything an LLM creates is a 'hallucination'; some of these 'hallucinations' are more useful than others.
I do agree with the parent post. Calling them hallucinations is not an accurate way of describing what is happening and using such terms to personify these machines is a mistake.
This isn't to say the outputs aren't useful, we see that they can be very useful...when used well.
The way I've been putting it for a while is, "all they do is hallucinate—it's the only thing they do. Sometimes the hallucinations are useful."
The OpenAI paper on hallucinations gives actual technical reasons for them, if you're interested.
https://openai.com/index/why-language-models-hallucinate/
https://arxiv.org/abs/2509.04664
The key idea is the model doesn't have any signal on "factual information." It has a huge corpus of training data and relies on the assumption that humans generally don't lie to each other when creating such a corpus.
... but (a) we do, and (b) there's all kinds of dimensions of factuality not encoded in the training data that can only be unreliably inferred (in the sense that there is no reason to believe the algorithm has encoded a way to synthesize true output from the input at all).
> Nothing new to see here.
Eh, I don't think that's a productive thing to say. There's an immense business pressure to deploy LLMs in such decision-making contexts, from customer support, to HR, to content policing, to real policing. Further, because LLMs are improving quickly, there is a temptation to assume that maybe the worst is behind us and that models don't make too many mistakes anymore.
This applies to HN folks too: every other person here is building something in this space. So publicizing failures like this is important, and it's important to keep doing it over and over again so that you can't just say "oh, that was a 3o problem, our current models don't do that".
I completely agree with you. GP’s cynical take is an upvote magnet but doesn’t contribute to the discourse.
All models are wrong, but some are useful. https://en.wikipedia.org/wiki/All_models_are_wrong
I think the definition of hallucination fits pretty neatly.
> I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.
While I do see the issue with the word hallucination humanizing the models, I have yet to come up with or see a word that so well explains the problem to non-technical people. And quite frankly, those are the people that need to understand that this problem still very much exists and is likely never going away.
Technically yeah the model is doing exactly what it is supposed to do and you could argue that all of its output is "hallucination". But for most people the idea of a hallucinated answer is easy enough to understand without diving into how the systems work, and just confusing them more.
> And quite frankly those are the people that need to understand that this problem still very much exists and is likely never going away.
Calling it a hallucination leads people to think that they just need to stop it from hallucinating.
In layman's terms, it'd be better to understand that LLMs are schizophrenic. Even though that's not really accurate either.
A better way to get it across is that the models really only understand reality through what they've read about it, and then we ask them for answers "in their own words" - but that's a lot longer than "hallucination".
It's like the gag in The 40-Year-Old Virgin where he describes breasts feeling like bags of sand.
The story isn't the hallucination, it's that people are using this shit in risky ways and ignoring the known problems with it. Engineers knew well before 1981 that building this [1] wasn't safe, but that didn't stop someone from building it. When it collapsed, it was a story.
[1] https://en.wikipedia.org/wiki/Hyatt_Regency_walkway_collapse
I don’t understand the issue with the word “hallucination”.
If a model hallucinates it did do something wrong, something that we would ideally like to minimize.
The fact that it’s impossible to completely get rid of hallucinations is separate.
An electric car uses electricity, it’s a fundamental part of its design. But we’d still like to minimize electricity usage.
I also hate the term "hallucination", but for a different reason. A hallucination is a confusion of internal stimulus as an external input. The models simply make errors, have bad memory, are overconfident, are sampling from a fantasy world, or straight up lie; often at rates that are not dissimilar from humans. For models to truly hallucinate, develop delusions and all that good schizophrenia stuff we would need to have a truly recurrent structure that has enough time to go through something similar to the prodrome, and build up distortions and ideas.
TL;DR: being wrong, even very wrong != hallucination
> I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not, as factually wrong outputs happen per design.
can you hear yourself? you are providing excuses for a computer system that produces erroneous output.
No he does not.
He is not saying it's OK for this system to provide wrong answers; he is saying it's normal for information from an LLM to not be reliable, and thus the issue is not coming from the LLM, but from the way it is being used.
We are in the late stage of the hype cycle for LLMs where the comments are becoming progressively ridiculous like for cryptocoins before the market crashed. The other day a user posted that LLMs are the new transistors or electricity.
School transcripts are surprisingly one of the hardest documents to parse. The thing that makes them tricky is (1) the multi-column tabular layouts and (2) the data ambiguity.
Transcript data is usually found in some sort of table, but they're some of the hardest tables for OCR or LLMs to interpret. There's all kinds of edge cases with tables split across pages, nested cells, side-by-side columns, etc. The tabular layout breaks every off-the-shelf OCR engine we've run across (and we've benchmarked all of them). To make it worse, there's no consistency at all (every school in the country basically has their own format).
What we've seen help in these cases are:
1. VLM based review and correction of OCR errors for tables. OCR is still critical for determinism, but VLMs really excel at visually interpreting the long tail.
2. Using both HTML and Markdown as an LLM input format. For some of the edge cases, markdown cannot represent certain structures (e.g. a table cell nested within a table cell). HTML is a much better representation for this, and models are trained on a lot of HTML data.
The data ambiguity is a whole set of problems on its own (e.g. how do you normalize what a "semester" is across all the different ways it can be written). Eval sets + automated prompt engineering can get you pretty far though.
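To make point 2 above concrete, here's a minimal sketch (illustrative only; the table contents and prompt wording are invented) of feeding a transcript table to an LLM as HTML, since HTML can express a nested cell that plain Markdown can't:

    # Hypothetical example: a transcript row whose grade detail is itself a small table.
    # Markdown tables can't nest, so HTML is used as the LLM input format instead.
    nested_table_html = """
    <table>
      <tr><th>Course</th><th>Grade detail</th></tr>
      <tr>
        <td>Internal Medicine Clerkship</td>
        <td>
          <table>  <!-- a table cell nested within a table cell -->
            <tr><td>Shelf exam</td><td>82</td></tr>
            <tr><td>Clinical evaluation</td><td>Honors</td></tr>
          </table>
        </td>
      </tr>
    </table>
    """

    prompt = (
        "Extract each course and its final grade from the transcript table below. "
        'Return JSON like [{"course": ..., "grade": ...}].\n\n' + nested_table_html
    )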
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.ai/).
Would it help a lot to run it through multiple different AI systems and verify that they agree on the result?
Yeah that can occasionally work and something we've tested, but it introduces a lot of noise unfortunately and makes systematic evals difficult.
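For concreteness, "verify that they agree" might look something like this rough sketch (call_model is a placeholder, not a real API; as noted above, handling the disagreement noise is the hard part in practice):

    def call_model(model_name: str, page_image: bytes) -> dict:
        """Placeholder: send the transcript page to `model_name`,
        get back {"course name": "grade", ...}."""
        raise NotImplementedError

    def reconcile(page_image: bytes) -> tuple[dict, dict]:
        a = call_model("model-a", page_image)
        b = call_model("model-b", page_image)
        # Auto-accept only the fields both models agree on.
        agreed = {k: v for k, v in a.items() if b.get(k) == v}
        # Everything else (including fields only one model found) goes to a human.
        disputed = {k: (a.get(k), b.get(k)) for k in set(a) | set(b) if a.get(k) != b.get(k)}
        return agreed, disputed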
Frustrating that their official recommendation is to verify the grades manually.
If a tool is designed to extract the grades for easy access, do we really believe that the end users will then verify the grades manually to confirm the output? If they’re doing that, why use the tool at all?
Maybe the tool can extract what it believes is the grades section and show a screenshot for a human to interpret.
Because the contract has already been signed, they can't guarantee it works right, and they don't want to be open to lawsuits. "You, mister wrongly-denied applicant, cannot sue us; we specifically told them to check all grades manually!"
This is why this particular emperor has no clothes. They keep trying to jam AI into stuff to make it "easier", but the LLMs, by their very nature, do the tasks in lossy or incorrect ways. Imagine if Microsoft had sold Excel with a "be sure to verify all the calculations" caveat.
> If they’re doing that, why use the tool at all?
Because the people purchasing the tool aren't the ones who will actually use it. The former get a "Deployed AI tooling to X to increase productivity by X%" on their resume. The latter get left to deal with the mess.
Lots of comments in here that seem to have missed that this is about using vision-LLMs for OCR.
This makes it a slightly different issue from "hallucination" as seen in text based models. The model (which I think we can assume is GPT-5-mini in this case) is being fed scanned images of PDFs and is incorrectly reading the data from them.
Is this still a hallucination? I've been unable to identify a robust definition of that term, so it's not clearly wrong to call a model misinterpreting a document a "hallucination" even though it feels to me like a different category of mistake to an LLM inventing the title of a non-existent paper or lawsuit.
> this is about using vision-LLMs for OCR
Is it? To me it sounds like they do OCR first, then extract from the result with LLM:
"Cortex uses automated extraction (optical character recognition (OCR) and natural language processing (NLP)) to parse clerkship grades from medical school transcripts."
See comment I just posted here: https://news.ycombinator.com/item?id=45582982
These kinds of errors have always existed and will always exist; there is no perfect way to extract info from documents like this.
The models really are getting better though. Compare Gemini 1.5 and Gemini 2.5 on the same PDF document (I've done this a bunch) and you can see the difference.
The open question is how much better they need to get before they can be deployed for situations like this that require a VERY high level of reliability.
I fully agree. My point was more that a lot of commenters seem to implicitly compare the LLM-based approach with some "better" or "simpler" approach which really doesn't exist. From my estimation, LLMs are SOTA for this kind of extraction (though they still have issues).
People don't respect the chasm between "obviously no mistakes" and "no obvious mistakes".
While I don't want to discount the work of any physician-founded org, knowing the pain they go through from working with them after they've seen 18 patients in a day's work, this still just looks like bad software. With no testing, no internal bench.
Did you do some kind of zod schema, or compare the error rate of how different models perform for this task? Did you bother setting up any kind of JSON output at all? Did you add a second validation step with a different model and then compare that their numbers are the same?
It looks like no, they just deferred the whole thing to the model's authority. Technically there's no difference between them saying that gpt-5-mini or llama2-7b did this.
Literally every single LLM will make errors and hallucinate. It's your job to put all the scaffolding around it to make sure it doesn't, or that it does so a lot less than a skilled human would.
So then have you measured the error rate, or maybe tried to put in some kind of error-catching mechanism, just like any professional software would do?
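As a rough sketch of the kind of guardrail being asked for (the grade vocabulary and field names here are invented for illustration): validate the model's output against a closed, per-school grading schema before anything reaches a reviewer.

    import json

    # Hypothetical per-school grading schema; any grade string outside it gets
    # flagged instead of silently shown to a reviewer.
    ALLOWED_GRADES = {"Honors", "High Pass", "Pass", "Fail", "Incomplete"}

    def validate_extraction(raw_llm_output: str) -> list[dict]:
        rows = json.loads(raw_llm_output)  # hard-fail on malformed JSON
        for row in rows:
            if set(row) != {"course", "grade"}:
                raise ValueError(f"unexpected fields: {sorted(row)}")
            if row["grade"] not in ALLOWED_GRADES:
                raise ValueError(f"grade {row['grade']!r} is not in this school's schema")
        return rows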
I keep circling this with AI and I'm not really sure what to do with it. They mention that the AI is meant to be used as reference only in the linked article but what does that actually mean? Who is checking who? Is the AI filling out the data from what it sees in the PDF and the user is expected to check it or is the user filling out the data and the AI is expected to check it?
Is the cost of AI worth it if all you're doing is something like 'linting' the extraction? How do you guarantee that people really, truly, are doing the same work as before and not just blindly clicking 'looks good'? What is the value of the AI telling you something when you cannot tell if it is lying?
Yeah, I've seen this "for reference only" wording in many places, often used as a sort of disclaimer on stuff that could be wrong, but I have absolutely no idea what it means in that context. To me "reference" implies comprehensive, high quality information that I can refer to when I need to know some obscure detail of something.
Is there some legal context in which this phrase has a specific meaning, perhaps?
Am I crazy, or was text parsing mastered long before AI? Why is GPT being used in this scenario in the first place?
Because it’s easier than asking for consistently formatted data from all the sources who just output random PDFs. Basically this is a coordination / people problem we’re papering over with a fancy engineering solution. Many such cases.
Because it's less effort to get an MVP set up. Instead of having to test on a bunch of different PDFs and figure out how to address the right location in the text, just write a paragraph asking the LLM to do it. Of course, there are certain drawbacks...
It seems like a default mode for AI should be to generate a repeatable regex for text extraction.
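For illustration, the kind of one-off, repeatable pattern meant here might look like this (the course/grade line format is invented; as the replies note, real PDF text layers are rarely this tidy):

    import re

    # Hypothetical pattern an AI might emit once for a single school's layout:
    # course name, two or more spaces, then a grade word.
    LINE = re.compile(r"^(?P<course>[A-Za-z &/]+?)\s{2,}(?P<grade>Honors|High Pass|Pass|Fail)\s*$")

    m = LINE.match("Internal Medicine Clerkship    Honors")
    if m:
        print(m.group("course"), "->", m.group("grade"))
    # prints: Internal Medicine Clerkship -> Honors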
Unfortunately many PDFs don't even internally represent text in a contiguous way.
Tables in PDFs still confuse traditional OCR engines. VLMs do better in some cases (though not this one, apparently).
Not in PDF.
Using a mini model for this seems grossly irresponsible. I've been doing some work testing models for similar extraction tasks (nothing where a failure affects someone's grade or anything) and gpt mini / Gemini flash simply can't do this sort of thing. Using anything less than the highest model with reasoning, you're guaranteed to get this sort of thing happening.
It is very tempting to do, obviously, given the cost difference, but it's not worth it. On the other hand, people talk about LLMs with a broad brush and, I don't know, it still needs testing, but I would be surprised to hear that GPT-5-pro with thinking had an issue like this.
I regularly use LLM-as-OCR and find it really helpful to:
1. Minimize the number of PDF pages per context/call. Don't dump a giant document set into one request. Break them into the smallest coherent chunks.
2. In a clean context, re-send the page and the extracted target content and ask the model to proofread/double-check the extracted data.
3. Repeat the extraction and/or the proofreading steps with a different model and compare the results.
4. Iterate until the proofreading passes without altering the data, or flag proofreading failures for stronger models or human intervention.
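A rough sketch of that loop, assuming placeholder extract/proofread functions rather than any particular API:

    def extract(page_png: bytes, model: str) -> str:
        """Placeholder: one page in, extracted grades (as text/JSON) out."""
        raise NotImplementedError

    def proofread(page_png: bytes, draft: str, model: str) -> str:
        """Placeholder: re-send the page plus the draft in a clean context,
        return the corrected draft."""
        raise NotImplementedError

    def extract_with_review(page_png: bytes, max_rounds: int = 3) -> str:
        draft = extract(page_png, model="primary-model")
        for _ in range(max_rounds):
            corrected = proofread(page_png, draft, model="different-model")
            if corrected == draft:  # proofreading passed without altering the data
                return draft
            draft = corrected
        # Still changing after max_rounds: escalate to a stronger model or a human.
        raise RuntimeError("extraction did not converge; needs review")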
What's the typical run cost for you?
I see _even with search/RAG_ LLMs hallucinate. They just hallucinate more convincingly in the language of the documents you retrieved.
So you really have to double check when researching information that really matters.
This sucks. Residency match is stressful as it is, and adding systems like these just make the experience even worse for the applicants.
Source: spouse matched in 2018. It was one of the most stressful periods of our lives.
Seems like hallucination will always be an issue for predict-the-next-word training. Maybe we need to rethink pretraining.
Not only did the AI hallucinate the applicant grade, but also the model name!
GPT-5o-whatever ain’t a thing.
The irony is sweeeeet
I see your point here, but please take a look at the “standard” unstructured PDF extraction algos; they have a lot of problems as well. LLM-based extraction is still (on average) a big improvement.
It's predicting the next token by statistical approximation. Hallucination vs fact is an ad-hoc distinction we impose on the result to suit our purpose.
5o mini?
I'm assuming they mean gpt-5-mini. I'm honestly surprised how many people I've heard say "5o".
Guess it hallucinated the model name as well.
There is no such thing as GPT-5o-mini, or GPT-5o. Concerning that the methodology seems to repeat the same error, not just the submitted title.
https://www.thalamusgme.com/blogs/methodology-for-creation-a...
they actually write it: > For this cycle, we have refined our model architecture, expanded the catalog of medical schools and grading schemas, and upgraded to include the GPT-5o-mini model for increased accuracy and efficiency. Real-time validation has also been strengthened to provide programs with more reliable percentile and grade distribution data. Together, these enhancements make transcript normalization an even more powerful tool to support fair, consistent, and data-driven review in the transition to residency.
> AI hallucinates students' grades
> Make a write-up about it
> Using AI
> Which then hallucinates more stuff
Funny stuff.
This is the singularity: AI makes up stuff about AI better than humans could.
they probably mean gpt-5-low. but the small models are bad for parsing data where the data has strong implications
I wonder if they're using reasoning? It usually eliminates these types of errors
Nothing new to see here. Humans also hallucinate, as you can tell from the model name.
Where did the "GPT-5o-mini" in this headline come from?
That's not a real model name: there's GPT-5-mini and GPT-4o-mini but no GPT-5o-mini.
UPDATE: Here's where the GPT-5o-mini came from: https://www.thalamusgme.com/blogs/methodology-for-creation-a... - via this comment: https://news.ycombinator.com/item?id=45581030
That said, I've been disappointed by OCR performance from the GPT-5 series. I caught it hallucinating some of the content for a pretty straight-forward newspaper scan a few weeks ago: https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...
Gemini 2.5 is much more reliable for extracting text from images in my experience.
>AI hallucinates
>Look inside
>GPT-4o-mini
>Reviewers are strongly encouraged to verify all information against the applicant’s official PDF transcript. This reminder is also displayed directly within the product.
This is not how this works. You know people will not do this. In fact the whole value proposition hinges on people not doing this. If the information needs to be verified by a human, then it takes more time than just going through the document.
If your product can not be trusted, then it can not be used to make important decisions. Pushing the responsibility to not use your product on the user is absurd and does not make your actions any less negligent.
I thought this was supposed to be AGI?
Semi-related but Sonnet 4.5 drives me absolutely insane.
I tell it a date, like March 2024 as the start, and October 2025 as the current month.
It still thinks that is 7 months somehow... and this is Anthropic's latest model...
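For reference, the whole-month arithmetic from March 2024 to October 2025 comes out to 19, not 7; it looks like the model simply dropped the year difference:

    # (2025 - 2024) * 12 months for the year gap, plus (October - March) = 7 more.
    months = (2025 - 2024) * 12 + (10 - 3)
    print(months)  # 19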
Wow. Never ceases to amaze me how some people in these comment sections remain blind to the power of Artificial Intelligence (AI). Have you not tried prompting the model correctly? My startup gets 0 hallucinations on the latest iteration of Claude Sonnet using a custom proprietary reflecting RAG framework inspired by ontology.
It never ceases to amaze me when startup founders claim that every problem is the same. Some use cases (like parsing text out of PDF) can’t be distilled down to a prompt.
LLMs can't hallucinate. The correct phrase would be "GPT-5o-mini generates medical residency applicant grades". Everywhere you see the word "hallucinate" in regard to a program's output, it should be replaced with "generate" for clarity.
If you're being 100% literal, sure. But language evolves and it's the accepted term for the concept. OpenAI themselves uses the phrase - https://openai.com/index/why-language-models-hallucinate/
OpenAI are the last people I would take as a reference, because they are financially motivated to keep up the charade of a "thinking" LLM, or so-called "AI". That's why they widely use anthropomorphic terms like "hallucination", "reasoning", or "thinking", while their computer programs can do none of those things. LLM companies sometimes even expose their own hypocrisy. My favorite example so far is when Anthropic showed in their own paper that asking an LLM how it "reasoned" through calculating a sum of numbers doesn't match reality at all; it's all generated slop.
This is why it is important that users (us) don't fall into the anthropomorphism trap and call programs what they are and what they really do. It's especially important since the general populace seems to be deluded by OpenAI's and Anthropic's aggressive lies and believes that LLMs can think.