Summary
A cohort of 1,257 participants, with 18,288 dried blood spot (DBS) samples collected over 134 analytical batches and 15 months, is used to test whether untargeted LC-MS metabolomic profiles carry enough individual-level signal to identify a participant from a single finger prick. The pipeline combines batch-aware normalization, supervised feature selection, biological signal filtering, and user-level majority voting across DBS cards. Under 10-fold GroupKFold cross-validation (group = batch), the model reaches 94.1% user-level accuracy and 85.5% sample-level accuracy. On a held-out set of 17 future batches, it reaches 96.1% user-level and 92.6% sample-level accuracy across 1,134 classes, against a chance baseline of 0.088% (1 in 1,134).
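To make the gap between sample-level and user-level accuracy concrete, here is a minimal sketch of majority voting and of the chance baseline. It is not the authors' code. The function names, the notion of a "unit" (the set of samples whose predictions are pooled, for example all DBS cards submitted by one user), and the tie-breaking rule are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def majority_vote(sample_preds, sample_units):
    """Collapse sample-level predictions into one prediction per unit.

    sample_preds: predicted identity for each sample.
    sample_units: hypothetical key saying which samples are pooled
                  together (e.g. all cards from one user).
    """
    votes = defaultdict(Counter)
    for unit, pred in zip(sample_units, sample_preds):
        votes[unit][pred] += 1
    # Ties go to the label seen first (assumed rule, not from the paper).
    return {unit: counts.most_common(1)[0][0] for unit, counts in votes.items()}

def chance_baseline(n_classes):
    """Accuracy of uniform random guessing over n_classes identities."""
    return 1.0 / n_classes

# Toy example: three samples pooled for "u1", one for "u2".
pooled = majority_vote(["a", "a", "b", "c"], ["u1", "u1", "u1", "u2"])
baseline_pct = round(chance_baseline(1134) * 100, 3)  # 0.088
```

Pooling lets a single misclassified card be outvoted by the other cards in the same unit, which is one way user-level accuracy can exceed sample-level accuracy.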
The preprint's second contribution is methodological: the authors show that naive random splitting inflates accuracy, because 92.8% of test samples share their (user, batch) pair with the training set. Group-aware splitting is required to measure generalization to batches the model has not seen.
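The leakage argument can be sketched on synthetic data. The example below is an illustration under stated assumptions, not the authors' pipeline: the toy cohort (20 users, 10 batches, 4 replicate samples per user and batch), the helper names, and the single hold-out split are all hypothetical. It uses only the standard library rather than scikit-learn's GroupKFold, but holds out whole batches in the same spirit.

```python
import random

def pair_overlap(train_idx, test_idx, users, batches):
    """Fraction of test samples whose (user, batch) pair also occurs in training."""
    train_pairs = {(users[i], batches[i]) for i in train_idx}
    hits = sum((users[i], batches[i]) in train_pairs for i in test_idx)
    return hits / len(test_idx)

def random_split(n_samples, test_frac, seed=0):
    """Naive split: shuffle samples and ignore batch membership."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(n_samples * test_frac)
    return idx[cut:], idx[:cut]  # train, test

def group_split(batches, test_frac, seed=0):
    """Group-aware split: hold out whole batches so none appears on both sides."""
    unique = sorted(set(batches))
    random.Random(seed).shuffle(unique)
    n_test = max(1, int(len(unique) * test_frac))
    held_out = set(unique[:n_test])
    train = [i for i, b in enumerate(batches) if b not in held_out]
    test = [i for i, b in enumerate(batches) if b in held_out]
    return train, test

# Synthetic toy cohort (assumed structure, not the paper's data).
users, batches = [], []
for user in range(20):
    for batch in range(10):
        for _ in range(4):  # replicate samples per (user, batch)
            users.append(user)
            batches.append(batch)

rand_train, rand_test = random_split(len(users), 0.1)
grp_train, grp_test = group_split(batches, 0.1)

random_overlap = pair_overlap(rand_train, rand_test, users, batches)
group_overlap = pair_overlap(grp_train, grp_test, users, batches)
```

With the random split, most test samples have a sibling from the same user and batch in training, so a classifier can lean on batch-specific artefacts. With whole batches held out, the overlap is zero by construction.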
Why it matters
Most biomarker science still treats lab values as one-shot snapshots. This preprint lays out the case, with data, for treating biology as a trajectory and for evaluating change against an individual’s own baseline. It also frames the protocol and validation discipline that the rest of BioTwin’s research programme is built on.
Authors
Pierrick Hauguel, Nicolas Anctil, Louis-Philippe Noel. All authors are employees and shareholders of BioTwin Inc. PCT patent pending.