Why Your Fitness Tracker's Correlations Are Lying to You

It was a Friday night and my brand-new health dashboard had an opinion.

In a tidy little card, glowing red, it announced: “More steps strongly correlates with WORSE sleep (r = -0.61).”

I stared at it. So the cure for insomnia is… sitting still? Cool. Great. I’d just spent a weekend wiring up Postgres, a correlation engine, and a slick dark UI to discover that exercise is bad for you.

The number was real. The correlation was real. The conclusion was complete garbage. And it took me an embarrassingly long evening to figure out why.

The investigation

The setup was simple. Four metrics per day — steps, sleep hours, resting heart rate, weight — pulled from the tracker’s export into a table. The engine did the obvious thing: loop over every pair of columns, run a Pearson correlation, sort by absolute value, surface the “strongest findings.”
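Here is a minimal sketch of that kind of engine, assuming the daily metrics are already loaded as plain Python lists. The function names and the toy rows are made up for illustration; they are not the real export or the real engine code.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_findings(columns):
    """Correlate every pair of columns and rank by |r|, with no other checks."""
    findings = [
        (a, b, pearson_r(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(findings, key=lambda f: abs(f[2]), reverse=True)


# Toy rows for illustration only.
columns = {
    "steps": [4000, 12000, 6000, 15000, 5000],
    "sleep_hours": [8.0, 6.0, 7.5, 5.5, 7.8],
    "resting_hr": [58, 63, 59, 66, 60],
    "weight": [70.2, 70.0, 70.4, 70.1, 70.3],
}

for a, b, r in strongest_findings(columns):
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Nothing in this loop asks how many days went into each number, how many pairs were tested, or what else might be driving both columns.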

First I assumed a bug. Sign flipped somewhere, units crossed, a join gone sideways. I pulled the raw rows.

SELECT day, steps, sleep_hours, resting_hr
FROM health_daily
ORDER BY day DESC
LIMIT 10;

The data was fine. High-step days really did have less sleep. The math wasn’t lying. The math was just answering a dumber question than I thought I’d asked.

Then I noticed the pattern in the rows. The big-step / bad-sleep days clustered. They were the days I remembered — the chaotic ones. Travel days. Deadline days. The days I was on my feet because everything was on fire.

The “aha”

There was a third variable sitting in the middle of the whole thing, and I’d never measured it: a stressful day.

A busy, stressful day makes me walk more (running around, pacing, errands) AND sleep worse (wired, late, anxious). Steps and sleep don’t touch each other. They’re both just symptoms of the same hidden cause.

                  ┌──────────────────────┐
                  │   STRESSFUL DAY      │   <- the thing I never logged
                  │   (the confounder)   │
                  └─────────┬────────────┘
                            │
                ┌───────────┴───────────┐
                ▼                       ▼
         ┌────────────┐         ┌──────────────┐
         │  + STEPS   │  ?????  │  - SLEEP     │
         └────────────┘ <-----> └──────────────┘
            no actual arrow between these two

The engine drew the dotted line at the bottom and called it a discovery. It had no idea the box at the top even existed.
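The diagram can be reproduced with a small simulation. This is a sketch under invented assumptions: the share of stressful days, the step counts, and the sleep hours below are all made-up parameters, and steps and sleep are drawn independently of each other once the kind of day is fixed.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


random.seed(7)
days = []
for _ in range(2000):
    stressful = random.random() < 0.3  # the hidden cause, never logged
    # Steps and sleep each depend on the kind of day, never on each other.
    steps = random.gauss(12000 if stressful else 7000, 1500)
    sleep = random.gauss(5.8 if stressful else 7.5, 0.6)
    days.append((stressful, steps, sleep))


def steps_vs_sleep(rows):
    """Correlation between steps and sleep over the given rows."""
    return pearson_r([r[1] for r in rows], [r[2] for r in rows])


r_all = steps_vs_sleep(days)
r_stressful = steps_vs_sleep([d for d in days if d[0]])
r_calm = steps_vs_sleep([d for d in days if not d[0]])

print(f"all days:       r = {r_all:+.2f}")
print(f"stressful only: r = {r_stressful:+.2f}")
print(f"calm only:      r = {r_calm:+.2f}")
```

Pooled together, the days show a strong negative correlation. Split by the hidden variable, each group shows next to none, because the only thing linking the two columns was the kind of day.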

And it got worse. With only a few weeks of data, every correlation was riding on a tiny sample — a couple of weird days could swing r wildly. On top of that I was testing every pair: 4 metrics is 6 comparisons, and the more pairs you test, the more likely pure noise hands you a juicy-looking number. Test enough things and “significant” findings appear for free.
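To see how cheap those "findings" are, here is a rough simulation with four columns of pure random noise over 21 days. The threshold of 0.433 is the approximate |r| needed for a two-sided p < 0.05 at n = 21; the trial count and seed are arbitrary choices.

```python
import random
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


N_DAYS = 21         # three weeks of data
N_METRICS = 4       # 4 metrics means 6 pairs
R_CRITICAL = 0.433  # approx. |r| for two-sided p < 0.05 at n = 21
TRIALS = 2000

random.seed(1)
false_alarms = 0
for _ in range(TRIALS):
    # Every column is independent noise: there is nothing real to find.
    metrics = [
        [random.gauss(0, 1) for _ in range(N_DAYS)] for _ in range(N_METRICS)
    ]
    strongest = max(abs(pearson_r(a, b)) for a, b in combinations(metrics, 2))
    if strongest > R_CRITICAL:
        false_alarms += 1

share = false_alarms / TRIALS
print(f"dashboards with at least one 'significant' pair: {share:.0%}")
```

If the six tests were fully independent, the expected share would be 1 - 0.95^6, or about 26%. The simulated figure should land in that neighborhood, which means a sizable fraction of noise-only dashboards would still show a red card.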

Confounders, tiny samples, and many comparisons. Three different ways to manufacture confident nonsense, all firing at once.

The engine saw the dotted line. It never saw the box above it.

The fix

I stopped trusting the engine and started guarding it. Four changes.

1. Require a real sample size. No pair gets reported under a floor.

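MIN_N = 21  # three weeks minimum before we say a word

for pair in pairs:
    if pair.n < MIN_N:
        # too few shared days to trust any r: flag the pair and skip it
        pair.verdict = "insufficient_data"
        continue
    # ...pairs that clear the floor go on to the remaining checks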

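2. Correct for multiple comparisons. If you test 6 pairs, a raw p < 0.05 is a much weaker signal than it looks: with six independent tests at that threshold, the chance of at least one false positive is about 26% (1 - 0.95^6). Hold them all to a stricter bar.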

from statsmodels.stats.multitest import multipletests

pvals = [p.pvalue for p in pairs]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
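For intuition about what the "holm" method is doing, here is a hand-rolled sketch of Holm's step-down procedure. It only illustrates the idea and is not a replacement for the statsmodels call; the six p-values are hypothetical.

```python
def holm_reject(pvals, alpha=0.05):
    """Return a list of booleans: True where Holm's procedure rejects."""
    m = len(pvals)
    # Visit p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p faces alpha/m, the next alpha/(m-1), and so on.
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one fails, every larger p-value fails too
    return reject


# Hypothetical raw p-values for the 6 pairs.
pvals = [0.004, 0.030, 0.012, 0.200, 0.048, 0.650]

print(sum(p < 0.05 for p in pvals))  # 4 pairs look "significant" uncorrected
print(holm_reject(pvals))            # only the first one survives
```

In this made-up example, four of six pairs clear a raw 0.05 bar, but only the 0.004 result survives the correction: 0.012 has to beat 0.05 / 5 = 0.01 and does not, so the procedure stops there.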

3. Control for the obvious third variable. Partial correlation: hold the suspected confounder fixed and see if anything survives.

import pingouin as pg

# does steps-vs-sleep survive once we account for a stress proxy?
result = pg.partial_corr(
    data=df, x="steps", y="sleep_hours", covar="stress_proxy"
)
# the -0.61 collapsed toward zero once stress was held constant
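With a single covariate, a Pearson partial correlation reduces to a short formula over the three pairwise correlations. The sketch below shows that arithmetic. Only the -0.61 comes from the dashboard card; the two correlations with the stress proxy are hypothetical values chosen to illustrate a collapse, not measured results.

```python
from math import sqrt


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


r_steps_sleep = -0.61   # the number on the card
r_steps_stress = 0.80   # hypothetical: stressful days have more steps
r_sleep_stress = -0.76  # hypothetical: stressful days have less sleep

adjusted = partial_r(r_steps_sleep, r_steps_stress, r_sleep_stress)
print(f"steps vs sleep, stress held constant: r = {adjusted:+.3f}")
```

When the product of the two confounder correlations is about as large as the raw correlation, the numerator shrinks to almost nothing. That is the shared cause accounting for nearly all of the link.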

4. Relabel the output. The card no longer says “correlates with.” It says “possible link — investigate.” Every finding ships as a hypothesis, never a verdict.

The steps/sleep “finding” failed all three statistical gates and got demoted to a quiet maybe. Exactly where it belonged.
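Put together, the gates might look something like the sketch below. Everything here is illustrative: the `Finding` fields, the `MIN_PARTIAL_R` threshold, the verdict strings other than "insufficient_data", and the example numbers are assumptions, not the dashboard's actual code or data.

```python
from dataclasses import dataclass

MIN_N = 21           # sample-size floor from change 1
ALPHA = 0.05         # bar for the Holm-adjusted p-value
MIN_PARTIAL_R = 0.3  # hypothetical floor for |r| after controlling for stress


@dataclass
class Finding:
    label: str
    n: int             # days where both metrics were recorded
    p_adjusted: float  # p-value after multiple-comparison correction
    partial_r: float   # correlation with the confounder held constant


def failed_gates(f):
    """List the statistical gates this finding does not clear."""
    failures = []
    if f.n < MIN_N:
        failures.append("sample_size")
    if f.p_adjusted >= ALPHA:
        failures.append("multiple_comparisons")
    if abs(f.partial_r) < MIN_PARTIAL_R:
        failures.append("confounder")
    return failures


def verdict(f):
    """Even a clean pass ships as a hypothesis, never a conclusion."""
    return "quiet_maybe" if failed_gates(f) else "possible_link_investigate"


# Hypothetical numbers, for illustration only.
steps_sleep = Finding("steps vs sleep_hours", n=18, p_adjusted=0.20, partial_r=-0.01)
print(failed_gates(steps_sleep), verdict(steps_sleep))
```

Collecting every failed gate, rather than stopping at the first, keeps the reason for each demotion visible on the card.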

Why it happened

Because a correlation engine is a confidence machine with no judgment. It will faithfully compute a number for any two columns you hand it and present that number with the same authority whether it’s bedrock truth or coincidence between two symptoms of a Tuesday from hell.

The math was never wrong. My question was. “Are these two columns correlated?” is trivial. “Does one affect the other?” is a completely different question the engine was never equipped to answer — and I let the pretty red card pretend it had.

Takeaways

Correlation engines on personal data manufacture convincing nonsense. A real r with a fake meaning looks identical to a true insight on a dashboard.
Hunt the confounder first. If two things correlate, ask what unmeasured third thing could be driving both before you believe either causes the other.
Tiny samples lie loudly. A handful of weird days can swing a correlation hard. Set a minimum-N floor and enforce it in code.
Testing many pairs guarantees false hits. Correct for multiple comparisons (Holm, Bonferroni, FDR) or you’ll “discover” links that are pure noise.
Ship correlations as hypotheses, never conclusions. Label the output “investigate,” not “proven.” Your dashboard’s job is to point, not to swear.
