Generative Hypochondria
- Lee Zhao
- Jan 26
- 5 min read

I let ChatGPT analyze a decade of my Apple Watch data. Then I called my doctor.
I.
In Jorge Luis Borges’s The Library of Babel, there is a library containing every possible book. Most of them are gibberish—random sequences of letters like “mcvbn qio jkl.” But somewhere in there is the true history of your death, a cure for cancer, and a perfect translation of the Iliad. The problem, of course, is that because the library contains everything, it contains nothing useful. You cannot find the cure for cancer because it is buried under a trillion books that look exactly like the cure for cancer but have one typo that turns the formula into cyanide.
I think about this whenever someone tells me that the Apple Watch—or the Oura Ring, or the Whoop Strap—is going to revolutionize medicine by feeding data into ChatGPT.
The argument usually goes like this:
We are collecting gigabytes of continuous physiological data (heart rate, HRV, oxygen saturation, step cadence).
We have Large Language Models (LLMs) that are incredibly good at finding patterns in huge datasets.
Therefore, soon your watch will vibrate and ChatGPT will say: "I notice a 4% dip in your HRV and a slight gait asymmetry. You have the early stages of Creutzfeldt-Jakob Disease. Please see a neurologist."
Don't even think about what the neurologist would actually do for CJD. (Hint: it rhymes with "…thing.") This is the standard techno-optimist view. It is clean, logical, and certainly wrong for the foreseeable future. The problem isn't the AI. The problem is that we are trying to teach the AI to read using a language where we have deleted all the verbs.
II.
Let’s talk about "Ground Truth."
If you want to train an AI to recognize a cat, you show it a million pictures. But crucially, you also need a human to have labeled each of those pictures "Cat" or "Not Cat." If you just show it a million random JPEGs without labels, the AI might learn to cluster "images with fur" vs. "images with sunsets," but it won't know what a cat is.
In healthcare, we have the JPEGs (the Apple Watch data). We do not have the labels.
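Here is the shape of that problem as a toy Python sketch. Everything in it is invented and it is nobody's real pipeline; the point is only that a supervised model trains happily on features plus labels, and refuses to train on features alone:

```python
# A minimal sketch of the labeling problem, not any real pipeline.
# Supervised learning needs (features, label) pairs. Wearables give
# us features by the gigabyte; the labels live in a hospital silo.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# What we have: X, three years of (fake) daily HRV summaries.
X = rng.normal(loc=45, scale=10, size=(1_000, 1))  # HRV in ms, say

# What training requires: y, a clinical label for each row
# ("had a cardiac event within 30 days": 1, otherwise 0).
y = None  # <- this column does not exist outside a formal study

model = LogisticRegression()
try:
    model.fit(X, y)  # scikit-learn refuses: there is nothing to learn
except ValueError as err:
    print(f"No labels, no model: {err}")
```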
Consider a hypothetical user, Alice. Alice wears an Apple Watch for three years. She generates 100 million data points of heart rate variability (HRV). On February 14th, Alice has a heart attack.
This looks like a perfect training example! We should feed this to the AI so it learns "what a pre-heart attack looks like." But here is the problem: Alice’s medical record lives in a completely different silo (Epic/Cerner). It is a messy, billing-coded PDF that says she presented with chest pain. Unless Alice explicitly donates her watch data to a study, allows that study to access her medical records, and someone rigorously cleans the data so that the watch timestamps line up with the troponin test in the ER, the link is broken.
To the AI, Alice is just a stream of numbers that suddenly stopped. Did she have a heart attack? Did she take the watch off to charge it? Did she switch to using an Oura ring? The AI doesn't know.
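If you tried to build the training set anyway, the first concrete step, joining the watch stream to the clinical record by timestamp, is exactly where the link breaks. Here is a toy sketch of that join, with invented data and hypothetical column names:

```python
# A toy sketch of the record-linkage step the essay says is missing.
# All values here are invented; the point is the join, not the numbers.

import pandas as pd

# Continuous watch stream (one HRV reading per hour)...
watch = pd.DataFrame({
    "ts": pd.date_range("2024-02-14 00:00", periods=12, freq="h"),
    "hrv_ms": [52, 50, 49, 31, 30, 29, 28,
               None, None, None, None, None],
})

# ...and a clinical event from a totally separate system (Epic/Cerner),
# timestamped when the troponin test resulted, not when symptoms began.
ehr = pd.DataFrame({
    "ts": [pd.Timestamp("2024-02-14 09:47")],
    "event": ["elevated troponin"],
})

# merge_asof attaches each clinical event to the nearest earlier watch
# reading within a tolerance. Widen or shrink the tolerance and the
# "label" attaches to different physiology, or to nothing at all.
linked = pd.merge_asof(ehr, watch, on="ts",
                       tolerance=pd.Timedelta("2h"),
                       direction="backward")
print(linked)  # here the nearest reading is... missing (NaN)

# And note the Nones: did the stream stop because of the heart attack,
# or because the watch was on its charger? The data cannot tell you.
```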
We have a Library of Babel filled with your vital data. Volumes and volumes of your heartbeats and steps and breaths, but the books don't have titles.
III.
This brings us to the Validation Gap, or "Garbage In, Hallucination Out."
Medical data is not like text data. If you mistype a word on Reddit, GPT-5.2 can figure it out from context. If your pulse oximeter slips on your finger and reads 85% for ten seconds while you sleep, that is not a typo. That is a loud, screaming signal that you are dying. Except you aren't. You just rolled over.
Clinical-grade equipment is validated against strict standards. Consumer-grade equipment is validated against... other consumer-grade equipment, or "sales." When we feed this noisy, unvalidated data into an LLM, we are asking it to make high-stakes predictions based on low-fidelity shadows.
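To see what "noisy" means in practice, here is a toy SpO2 trace (numbers invented). A naive threshold rule panics at the slipped-sensor dip; even a crude persistence check, far less than what validated clinical devices actually do, makes the alarm disappear:

```python
# A sketch of why raw consumer sensor data can't be read literally.
# Invented SpO2 trace: one transient dip from a slipped sensor.

spo2 = [97, 96, 97, 85, 97, 96, 97, 96, 97, 96]  # % per 10-second window

# Naive rule: any reading under 90% is an emergency.
naive_alarms = [i for i, v in enumerate(spo2) if v < 90]

# Slightly less naive: require the dip to persist for 3 consecutive
# windows before believing it. (Real clinical devices go much further:
# signal-quality indices, motion rejection, validated calibration.)
def sustained_low(values, threshold=90, run=3):
    streak = 0
    alarms = []
    for i, v in enumerate(values):
        streak = streak + 1 if v < threshold else 0
        if streak >= run:
            alarms.append(i)
    return alarms

print(naive_alarms)         # [3] -> "you are dying" (you rolled over)
print(sustained_low(spo2))  # []  -> no alarm
```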
The result is Generative Hypochondria. If you ask ChatGPT, "My Apple Watch says my VO2 Max dropped 2 points this month, am I dying?" it will look at its training data (which is the Internet). The Internet is full of forums where worried people discuss VO2 Max and dying. So ChatGPT says: "A drop in VO2 Max can be associated with cardiovascular deconditioning or heart failure."
Technically true! Also completely useless. It cannot say: "I have analyzed 50 million people with your specific noise profile and 99.9% of them just had a stressful month at work." It can't say that, because that dataset does not exist.
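You can see why that missing dataset matters with back-of-the-envelope Bayes. All three inputs below are assumptions I made up for illustration, but the shape of the answer is robust: at low prevalence, even a reasonably specific test yields mostly false positives.

```python
# Back-of-the-envelope Bayes, with invented numbers, to show why
# "technically associated with heart failure" is useless on its own.

prevalence = 0.001   # assume 1 in 1,000 users actually has the disease
sensitivity = 0.90   # assume the VO2-Max-dip "test" catches 90% of them
specificity = 0.95   # and falsely flags 5% of healthy users

true_pos = prevalence * sensitivity
false_pos = (1 - prevalence) * (1 - specificity)

ppv = true_pos / (true_pos + false_pos)
print(f"P(disease | flagged) = {ppv:.1%}")  # ~1.8%
```

That is Generative Hypochondria, quantified: under these made-up but charitable assumptions, roughly 98 of every 100 alerts would be false alarms.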
IV.
The core issue is a mismatch of goals.
Medicine’s goal: Primum non nocere (First, do no harm). We demand high specificity. We don't want to tell a healthy person they are sick.
Consumer Tech’s goal: Engagement. We want to show you a graph that moves, and get you to come look at this graph again, and maybe buy some other stuff that we sell.
LLM’s goal: Plausibility. The model wants to generate the next token that sounds like something a smart doctor would say.
When you mash these three together, you get a system that is incredibly engaging, sounds very smart, and is dangerous.
The "Signal" we are looking for (the predictive signature of disease in wearable data) is likely very subtle. It is buried under layers of noise (movement artifacts, loose bands, cold weather). To find it, we need massive, labeled datasets—paired datasets of (Watch Data + Confirmed Clinical Diagnosis).
Right now, Apple has the watch data. The health systems have the diagnosis. And HIPAA stands in the middle, preventing them from ever meeting. For goodness' sake, even in 2025 we can't get health systems to share the diagnosis with each other without many signed forms.
V.
Until we build the bridge—until we run the very long, very expensive longitudinal studies that verify "Pattern X on a $400 watch equals Disease Y with 95% confidence"—ChatGPT Health is a stochastic parrot wearing a white coat.
It can recite the textbook definition of atrial fibrillation. It can tell you to get more exercise. But it cannot look at your noisy, unvalidated, unlabeled datastream and tell you a true story of your future.
To switch metaphors: we are staring at tea leaves, and our wearable watches let us gather many more tea leaves. We have built software that is excellent at describing the shape of the tea leaves. But we still haven't done the work to figure out whether the tea leaves actually predict the future.