Summary
Consumer wearable data from different devices cannot be treated as interchangeable, yet the specific failure modes are poorly characterized for long-term, multi-device use. Using an N-of-1 longitudinal dataset spanning 2,443 days (Garmin Fenix), 819 days (Oura Ring Gen 3), 240 days (Whoop 4.0), and 284 days (TwinMe Watch), this preprint quantifies inter-device agreement, not clinical accuracy, across six metric categories: resting heart rate, heart rate variability, sleep, steps, respiratory rate, and SpO2.
Agreement, measured with the concordance correlation coefficient (CCC), varies systematically by metric: sleep duration (CCC = 0.643) is more consistent than HRV (0.362), respiratory rate (0.315), and SpO2 (-0.016). Garmin-Oura resting heart rate (RHR) agreement is poor (raw CCC = 0.106), driven by scale mismatch rather than location bias. Day-to-day directional agreement is near zero, meaning the two devices can disagree about whether RHR went up or down from one day to the next. A ridge regression harmonization layer raises Whoop-to-Garmin RHR agreement from CCC = 0.190 to 0.768 on aggregated out-of-sample predictions, though per-fold performance is highly variable.
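To make the distinction between scale mismatch and location bias concrete, here is a minimal sketch of CCC and its standard decomposition into Pearson correlation and a bias-correction factor. It assumes Lin's concordance correlation coefficient with population (1/n) moments; the summary does not state which estimator the authors used. The function name `ccc_components` and the two short series are hypothetical illustrations, not values from the dataset.

```python
import math
from statistics import fmean


def ccc_components(x, y):
    """Return (ccc, pearson_r, c_b, scale_shift, location_shift) for paired series.

    Assumption: Lin's CCC with population (1/n) moments.
    ccc = pearson_r * c_b, where c_b <= 1 penalizes scale and location shifts.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two paired series of equal length >= 2")
    mx, my = fmean(x), fmean(y)
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    if vx == 0 or vy == 0:
        raise ValueError("both series must have non-zero variance")
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx, sy = math.sqrt(vx), math.sqrt(vy)

    ccc = 2 * cov / (vx + vy + (mx - my) ** 2)
    pearson_r = cov / (sx * sy)
    scale_shift = sx / sy                              # 1.0 means equal spread
    location_shift = (mx - my) / math.sqrt(sx * sy)    # 0.0 means equal means
    c_b = 2 / (scale_shift + 1 / scale_shift + location_shift ** 2)
    return ccc, pearson_r, c_b, scale_shift, location_shift


# Hypothetical RHR-like series: same mean, but device B compresses the range.
device_a = [50, 54, 58, 62, 66]
device_b = [57, 58, 58, 58, 59]

ccc, r, c_b, scale_shift, location_shift = ccc_components(device_a, device_b)
print(f"CCC={ccc:.3f}  r={r:.3f}  C_b={c_b:.3f}  "
      f"scale={scale_shift:.2f}  location={location_shift:.2f}")
```

In this toy case the two series are strongly correlated and share the same mean, yet CCC is low because one device reports a much narrower range. That is the pattern the term "scale mismatch rather than location bias" describes: the location shift is zero while the scale shift is far from 1.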
Why it matters
Identical metric labels across wearables do not guarantee comparable constructs. For any system that fuses signals from multiple devices over time, including a virtual twin, the study provides a reproducible methodological foundation and a clear warning: cross-ecosystem integration generally requires device-specific recalibration.
Authors
Pierrick Hauguel, Louis-Philippe Noel, Nicolas Anctil. All authors are affiliated with BioTwin Inc.