- Training
How accurate is the Galaxy Watch for measuring calories during running?
Learn to train smart, run fast, and be strong with this endurance performance nerd alert from Thomas Solomon, PhD.
Validity of Galaxy Watch for Estimating Energy Expenditure During Intermittent Running: Cross-Sectional Study
Ferreira et al. (2026) JMIR Formative Research (click here to open the original paper)
Are findings of this study useful for runners and coaches?
◦ This is useful for runners, coaches, and curious smartwatch nerds because the findings suggest that the Galaxy Watch 6 and 7 are decent for very rough calorie tracking during structured intermittent treadmill running, but not precise enough to treat every calorie number as gospel. Should you use the watch to fine-tune daily food intake down to the last snack? Probably not. For broad training monitoring, though, the signal looks usable. However, the utility of the findings also drops a bit because the study only included healthy, lean adults doing 1 treadmill protocol, and the funding came from the watch manufacturer (Samsung).
What was my Rating of Perceived scientific Enjoyment (RPsE)?
5 out of 10 → I experienced low scientific enjoyment because although the paper clearly stated the design and described the participants, the inclusion and exclusion criteria, the testing, and the statistics, it left some annoying but important loose ends. The authors did not clearly report the sampling time frame or the adherence/response rate, did not report how they dealt with non-response or missing data, did not fully report or make the data available, and did not preregister their protocol. The paper is solid enough to read with interest, just not the sort of thing that makes me jump for joy.
Remember: Don’t make any major changes to your training habits based on the findings of one study, especially if the study is small and/or provides a low quality of evidenceA low quality of evidence means that, in general, studies in this field have several limitations. This could be due to inconsistency in effects between studies, a large range of effect sizes between studies, and/or a high risk of bias (caused by inappropriate controls, a small number of studies, small numbers of participants, poor/absent randomization processes, missing data, inappropriate methods/statistics). When the quality of evidence is low, there is more doubt and less confidence in the overall effect of an intervention, and future studies could easily change overall conclusions. The best way to improve the quality of evidence is for scientists to conduct large, well-controlled, high-quality randomized controlled trials.. Do other trials on this topic confirm the findings of this study? If there is a meta-analysis, what is the effect sizeA standardised measure of the magnitude of an effect of an intervention. Unlike p-values, effect sizes show the size of the effect and how meaningful it might be. Common effect size measures include standardised mean difference (SMD), Cohen’s d, Hedges’ g, eta-squared, and correlation coefficients. and quality of evidenceCertainty of evidence tells us how confident we are that the results reflect the true effect. It’s based on factors like study design, risk of bias, consistency, directness, and precision. Low certainty means more doubt and less confidence, and that future studies could easily change the conclusions. High certainty means that the current evidence is so strong and consistent that future studies are unlikely to change conclusions.? Visit veohtu.com/trainingload and veohtu.com/trimp for a deep dive on this topic.
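To make the effect-size idea concrete, here is a minimal sketch of Cohen’s d, one of the measures named above. The two groups of 5-km times are invented for illustration and are not data from any study:

```python
import statistics

# Made-up 5-km times (minutes) for two hypothetical groups of runners.
group_a = [22.1, 23.4, 21.8, 24.0, 22.9]
group_b = [21.0, 22.2, 20.9, 22.8, 21.5]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
sd_a, sd_b = statistics.stdev(group_a), statistics.stdev(group_b)

# Pooled SD for two equal-sized groups, then
# Cohen's d = mean difference / pooled SD.
pooled_sd = ((sd_a ** 2 + sd_b ** 2) / 2) ** 0.5
d = (mean_a - mean_b) / pooled_sd

print(f"Cohen's d = {d:.2f}")
```

Unlike a p-value, d stays interpretable across studies: roughly 0.2 is small, 0.5 medium, and 0.8 large.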
What type of study is this?
◦ This study is a validation studyA validation study checks if something does what it’s supposed to do. For example, it might test whether a new medical test actually finds the disease it claims to detect, or whether a new exercise test really measures performance, not just general exercise metrics. So, a validation study is a way to make sure the tool, method, or system is accurate, reliable, and worth trusting. using a cross-sectional studyA cross-sectional study is a type of observational study where the exposure and outcome are measured at a single point in time, giving a snapshot of a population—what’s happening right now. Cross-sectional studies are used in health surveys, prevalence studies, or for hypothesis generation, and can show prevalence (how common something is) and associations (but not cause and effect). E.g., What percentage of runners currently report using recovery supplements, and is use associated with age or training volume? design.
What was the hypothesis or research question?
◦ The authors aimed to test how accurately the Samsung Galaxy Watch 6 and Galaxy Watch 7 estimated energy expenditure during intermittent moderate-intensity running, using indirect calorimetryIndirect calorimetry is a way to measure how much energy your body is using by analyzing your breathing. It tracks how much oxygen you breathe in and how much carbon dioxide you breathe out. From that, it estimates how many calories you burn and how much of that energy comes from fat or carbohydrate. from the Cosmed K5 as the reference method.
How did the researchers test the hypothesis or answer the research question?
◦ The researchers recruited 148 healthy, physically active adults: 80 men and 68 women, with an average age of about 30. Each participant completed a 27-minute treadmill protocol: 2 minutes of walking to warm up, then 7 repeated cycles of 2 minutes of running and 1 minute of walking, followed by 4 minutes of walking recovery. The walking speed was 5 kilometers per hour, while the running speed ranged from 8 to 16 kilometers per hour based on participant preference. The authors compared calorie estimates from the Galaxy Watch 6 or 7 against indirect calorimetry measured breath by breath with the K5 portable metabolic system. The main outcome was energy expenditure in kilocalories.
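For anyone checking the arithmetic, the 27-minute total follows directly from the protocol timings reported above:

```python
# Protocol timings from the paper: 2 min warm-up walk,
# 7 cycles of (2 min run + 1 min walk), then 4 min walking recovery.
warmup_min = 2
interval_min = 7 * (2 + 1)
recovery_min = 4

total_min = warmup_min + interval_min + recovery_min
print(total_min)  # 27
```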
◦ The authors also reported a post-study power calculationA power calculation is a way to figure out how many people or data points you need in a study so you can reliably spot a real effect if it exists. It balances four things: the size of the effect you care about, how much random variation there is, how strict you are about false alarms, and how likely you want to be to detect the effect. In plain terms: it helps you avoid running a study that’s too small to be useful or so big that it wastes time and money. and said the achieved statistical powerStatistical power is the probability that a statistical test will correctly detect a real effect if there is one: a true positive. (In jargon: power is the probability that a statistical test correctly rejects a false null hypothesis). Higher statistical power reduces the risk of a false negative (failing to detect a true effect; or a Type II error). Power is typically influenced by sample size, effect size, significance level, and variability in the data, with a common target being at least 80% (or 0.8). was 1.0 for all comparisons (1 is the highest possible value). They used paired t-testsA statistical test used to compare the means of two groups to determine whether they are statistically different from each other. Types include independent samples t-test, paired samples t-test, and one-sample t-test., Bland-Altman analysesA method for assessing agreement between two quantitative measurements by plotting the differences against the averages of the two measures. It identifies bias (mean difference) and limits of agreement (range where most differences fall)., and the intraclass correlation coefficient (ICC)The ICC is a measure of the reliability or agreement between multiple measurements or raters assessing the same target. It reflects both the degree of correlation and the agreement between measurements, with values ranging from 0 (no agreement) to 1 (perfect agreement). to judge agreement.
What did the study find?
◦ For the full sample, the K5 measured a mean energy expenditure of 213.60 kilocalories, while the watches gave similar averages overall. The Galaxy Watch 6 averaged 219.53 kilocalories in its subgroup, and the Galaxy Watch 7 averaged 202.67 kilocalories in its subgroup. The paired comparisons did not show statistically significant differences between the watches and the K5. In plain English, the watches were not wildly off on average.
◦ Agreement between the watches and the K5 values was moderate rather than excellent. The correlations ranged from 0.63 to 0.70, and the ICC values ranged from 0.65 to 0.74. The mean absolute percentage error ranged from 10.1% to 12.6%, which is not terrible, but it is also not tiny. The Bland-Altman analysis showed fairly wide 95% limits of agreementIn Bland-Altman analysis, the range within which 95% of the differences between two measurement methods are expected to fall. They are calculated as the mean difference (bias) ± 1.96 times the standard deviation of the differences. Narrow limits suggest good agreement; wide limits suggest poor agreement., with combined values spanning about –61.93 to +65.80 kilocalories, so individual estimates could still drift quite a bit from the criterion value. That far-from-excellent agreement is the key caveat: the group averages look fine, but any single session’s calorie number might not.
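If you want to see how these agreement statistics are computed, here is a minimal sketch of the Bland-Altman bias, the 95% limits of agreement, and the mean absolute percentage error. The paired calorie values are invented for illustration, not the study’s raw data:

```python
import statistics

# Invented paired calorie values (kcal) for 6 hypothetical sessions:
# criterion (Cosmed K5) vs. smartwatch estimate.
k5 = [210.0, 195.5, 230.2, 180.8, 250.1, 205.4]
watch = [220.3, 188.0, 245.9, 170.2, 262.7, 199.8]

diffs = [w - k for w, k in zip(watch, k5)]

# Bland-Altman: bias is the mean difference; the 95% limits of
# agreement are bias +/- 1.96 * SD of the differences.
bias = statistics.mean(diffs)
sd = statistics.stdev(diffs)
loa_low, loa_high = bias - 1.96 * sd, bias + 1.96 * sd

# Mean absolute percentage error, relative to the criterion (K5) values.
mape = 100 * statistics.mean(abs(d) / k for d, k in zip(diffs, k5))

print(f"bias = {bias:.2f} kcal")
print(f"95% limits of agreement: {loa_low:.2f} to {loa_high:.2f} kcal")
print(f"MAPE = {mape:.1f}%")
```

The same arithmetic applied to the study’s real data gives the roughly –62 to +66 kilocalorie limits of agreement and the 10% to 13% error reported above.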
◦ The authors concluded that both Galaxy Watch models showed moderate validityValidity means you're measuring what you think you're measuring. If a test claims to measure exercise performance, validity asks: Does it really? So, validity is about accuracy, and a valid tool hits the target. Without validity, results might look good but lead you in the wrong direction. for this specific running task and could be suitable as practical, lower-cost tools for everyday activity tracking.
What were the strengths of the study?
◦ The big strength is the sample size (N)N is how many participants or observations are analyzed. A bigger N usually means more precise estimates and more power (ability to detect a true effect). A smaller N results in a study that is less likely to detect a true effect (false negative/type II error) and is more likely to report false positives (type I error). Of course, a badly designed study is still bad even if it has a big N. of 148, which is larger than many earlier smartwatch validation studies the authors cited. The methods were also pretty standardized: the treadmill protocol was controlled, the criterion method was a portable metabolic cart, and the paper reported several useful agreement metrics instead of leaning on just 1 shiny statistic. That makes the paper more trustworthy than the average “this gadget is amazing” write-up.
What were the limitations of the study?
◦ The limitations matter. The study only tested healthy, lean, physically active adults, so the generalisabilityGeneralisability is about how far you can confidently stretch a study’s findings beyond the specific people, place, and conditions that were tested. In simple terms, it asks: “If this result is true here, how likely is it to also be true in other groups or real-world settings?” It’s closely associated with external validity. to other groups is limited. The protocol only covered 1 kind of exercise, so we cannot assume the same accuracy for outdoor running, intervals with sharper pace changes, resistance training, or people with different body compositions. The authors also used 2 K5 units without reporting between-device validation, which opens the door to a bit of measurement biasThe way we measure or ask the question is flawed, so people get classified wrongly. Bad tool, bad data.. And, although the paper included a post-study power calculationA power calculation is a way to figure out how many people or data points you need in a study so you can reliably spot a real effect if it exists. It balances four things: the size of the effect you care about, how much random variation there is, how strict you are about false alarms, and how likely you want to be to detect the effect. In plain terms: it helps you avoid running a study that’s too small to be useful or so big that it wastes time and money., it did not preregister the protocolPreregistration is when a detailed description of a study plan is deposited in an open-access repository before collecting the study data. It promotes transparency and accountability, and boosts research integrity.
Without preregistration, it is easier for scientists to change outcomes after seeing the data, selectively report “exciting” results, or run many analyses and only show the ones that work, which can introduce bias and weaken the trustworthiness of the findings., provide the full data, or report details about the response rate (e.g., how many people were approached, excluded, declined, or dropped out before the final sample of 148 was obtained).
Who funded the study, and were there any conflicts of interestA conflict of interest happens when a person or group has a personal, financial, or professional interest that could influence their judgment. It does not always mean they did something wrong. But it can create bias or make others question whether the decision or result is fully fair and trustworthy.?
◦ The study was funded by Samsung. The acknowledgements say the project was conducted by the Sidia Institute in partnership with Samsung. All authors were employees of the Sidia Institute of Science and Technology. The paper says Samsung was not involved in data or sample collection, analysis, or interpretation, and placed no restrictions on reporting. Even so, this is still industry-funded research on a Samsung device, so that conflict needs to stay in view.
Thanks for reading! If you enjoy these nerd alerts, please help me out!
Reviews and follows train the magical algorithms to promote my content higher up the rankings so that more folks see high-quality information.
Other running science and sports nutrition articles I've recently reviewed:
Running economy in super shoes: what matters most?
This small randomised crossover trial in 22 trained runners found no clear winner among 3 advanced shoe models for running economy. But within each runner, a shorter contact time was associated with a small improvement, about 1% per 4 milliseconds. The small sample and the single-visit treadmill design lower confidence a bit. Get the full details →
Combined heat and hypoxia hurts performance
This meta-analysis pooled 23 studies in 414 healthy adults and found moderate to large short-term performance impairments in heat, hypoxia, and especially their combination. However, small study sizes, variable protocols, limited post-exposure data, and no formal certainty assessment lower confidence in the findings. Get the full details →
Caffeine mouth rinse: small endurance boost?
This meta-analysis of 31 studies found a trivial to small exercise benefit from caffeine mouth rinse, strongest for aerobic endurance. Cognitive effects were inconsistent. Heterogeneity, sparse female data, and low certainty for cognition reduce confidence. Get the full details →
And, to help you wash down the evidence, here's a snifter from my recent indulgence:
Hot Cakes #7 (from Pulfer Brewery)
Smoothie pastry sour. 5.5% ABV. Take a sip of the review →
Access to education is a right, not a privilege
Equality in education, health, and sustainability matters deeply to me. I was fortunate to be born into a social welfare system in which higher education was free. Sadly, that's no longer true. That's why I created Veohtu: to make high-quality exercise science and sports nutrition education freely available to folks from all walks of life. All content is free and always will be. This nerd alert newsletter is part of that offering. Check out more free educational resources at veohtu.com.
Every day is a school day.
Empower yourself to train smart.
Be informed. Stay educated. Think critically.