Summary
Consumer wearable data from different devices cannot be treated as interchangeable, yet the specific failure modes are poorly characterized for long-term, multi-device use. Using an N-of-1 longitudinal dataset spanning 2,443 days (Garmin Fenix), 819 days (Oura Ring Gen 3), 240 days (Whoop 4.0), and 284 days (TwinMe Watch), this preprint quantifies inter-device agreement, not clinical accuracy, across six metric categories: resting heart rate, heart rate variability, sleep, steps, respiratory rate, and SpO2.
Agreement varies systematically by metric. Measured with Lin's concordance correlation coefficient (CCC), sleep duration (CCC = 0.643) is more consistent than HRV (0.362), respiratory rate (0.315), and SpO2 (-0.016). Garmin-Oura resting heart rate (RHR) agreement is poor (raw CCC = 0.106), driven by scale mismatch rather than location bias: the two devices differ more in how widely their readings vary than in their average level. Day-to-day directional agreement is near zero, meaning two devices can disagree about whether RHR went up or down on a given day. A ridge regression harmonization layer raises Whoop-to-Garmin RHR agreement from CCC = 0.190 to 0.768 on aggregated out-of-sample predictions, though per-fold performance is highly variable.
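To make the distinction between scale mismatch and location bias concrete, the sketch below computes the CCC for two paired series and splits it into its standard components: Pearson correlation, a scale shift (ratio of standard deviations), and a location shift (standardized mean difference). It is an illustration only, not the preprint's code; the function name `ccc_components` and the toy readings are assumptions introduced here.

```python
import numpy as np


def ccc_components(x, y):
    """Concordance correlation coefficient and its components.

    Hypothetical helper for illustration; not taken from the preprint.
    Uses population moments (ddof=0).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()           # population standard deviations
    cov = ((x - mx) * (y - my)).mean()  # population covariance

    r = cov / (sx * sy)                 # Pearson correlation (precision)
    v = sx / sy                         # scale shift: 1.0 means equal spread
    u = (mx - my) / np.sqrt(sx * sy)    # location shift: 0.0 means equal means
    c_b = 2.0 / (v + 1.0 / v + u ** 2)  # bias correction factor (accuracy)

    return {
        "ccc": r * c_b,                 # equals 2*cov / (sx^2 + sy^2 + (mx-my)^2)
        "pearson_r": r,
        "scale_shift": v,
        "location_shift": u,
        "bias_correction": c_b,
    }


# Toy example (made-up numbers): device B reports exactly twice device A.
# The series are perfectly correlated, yet the CCC is well below 1.
device_a = [1, 2, 3, 4, 5]
device_b = [2, 4, 6, 8, 10]
print(ccc_components(device_a, device_b))
```

In the toy example the correlation is 1.0 but the CCC is lower, because the CCC penalizes both the difference in spread and the difference in mean. A scale shift far from 1 combined with a location shift near 0 is the pattern the summary describes as scale mismatch rather than location bias.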
Why it matters
Identical metric labels across wearables do not guarantee comparable constructs. For any system that fuses signals from multiple devices over time, including a virtual twin, the study provides a reproducible methodological foundation and a clear warning: cross-ecosystem integration generally requires device-specific recalibration.
Authors
Pierrick Hauguel, Louis-Philippe Noel, Nicolas Anctil. All authors are affiliated with BioTwin Inc.
Important: This article may discuss BioTwin research, medical vision, regulated clinical pathways, or TwinMe wellness education. TwinMe wellness outputs are not medical or laboratory tests. BioTwin clinical outputs are available only where authorized and through licensed healthcare professionals.