Any physical system that measures something has to deal with uncertainty. When collecting data, we need to quantify that uncertainty to understand how reliable our measurements are and how much we can trust what we feed our models. Just as importantly, it helps us uncover the limits and tradeoffs in our data collection process.
In this article, we walk through how to test the precision of a sensor: which intuitions are actually useful, which ones tend to mislead, and how to approach uncertainty in a practical, grounded way.
Why Does Sensor Precision Matter?
If robots are going to operate in the physical world, they need to move with precision. Many tasks involve manipulating small objects, where even slight errors matter. Take inserting a screw: shift just a few millimeters in either direction, and you miss the hole entirely.
That level of precision depends on perception. Sensors, often paired with vision, give robots their understanding of the world. When those sensors drift, that understanding degrades. The data fed into a model stops reflecting reality, and performance suffers.
Sensor precision, then, is not a detail; it is foundational. The more accurate the sensors, the closer the robot’s internal representation is to the real world, and the more reliably it can act within it.
1 - Defining what you want to measure
Before designing a test, you need to be clear about what you are trying to learn from it. Measuring “sensor precision” is not a single question. You might care about short term repeatability, long term drift, sensitivity to environmental changes, or overall accuracy against a known reference. Each of these leads to a different kind of test.
Just as importantly, you need a way to define error. What does it mean for your sensor to perform well or poorly? This could be the spread of repeated measurements, the difference from a known value, or the maximum deviation you observe under certain conditions. The metric you choose will shape both how you design the test and how you interpret the results.
Being explicit about your objective and your error metric upfront prevents you from collecting data that looks useful but doesn’t actually answer your question. It also makes the rest of the process more focused, since every decision in the test design can be tied back to what you are trying to measure.
The foundational framework for this kind of structured thinking about measurement error is provided by the Joint Committee for Guides in Metrology (JCGM) in the Guide to the Expression of Uncertainty in Measurement (GUM) [1], which defines measurement uncertainty as a non-negative parameter characterizing the dispersion of the quantity values attributed to a measurand [2].
These documents provide the standardized vocabulary and methodology used throughout metrology and sensor validation [3].
2 - Choosing an error metric
Once you know what you want to measure, you need to decide how you will quantify error. Different metrics capture different aspects of performance, and choosing the wrong one can lead to misleading conclusions.
Here are some of the most common error metrics and when they are useful:
Mean Absolute Error (MAE)
This measures the average absolute difference between your measurements and the true value.
It is simple, easy to interpret, and works well when you care about overall accuracy without giving too much weight to large errors. Because MAE is a linear score, all individual differences are weighted equally in the average, so use it when all deviations matter roughly equally.
Mean Squared Error (MSE)
This measures the average of the squared differences between your measurements and the true value.
Because errors are squared, larger deviations have a much bigger impact. This makes it useful when large errors are particularly undesirable, but it can also make the metric sensitive to outliers.
Root Mean Squared Error (RMSE)
This is the square root of MSE, which brings the error back to the same units as the original measurements.
It retains the sensitivity to large errors while being easier to interpret than MSE. It is commonly used when you want a balance between interpretability and penalizing large deviations. The RMSE penalizes variance by giving errors with larger absolute values more weight than errors with smaller absolute values [4]. When the error distribution is expected to be Gaussian and enough samples are available, the RMSE has an advantage over the MAE in capturing error distribution characteristics [5].
Standard Deviation
This measures how spread out your measurements are around their mean.
It is useful when you care about precision or repeatability rather than accuracy. A low standard deviation means your sensor is consistent, even if it may not be correct.
Maximum Error
This captures the largest deviation observed in your measurements.
It is useful in systems where worst-case performance matters more than average behavior.
How to choose
No single metric is universally correct. The right choice depends on what you care about:
Use standard deviation when evaluating precision
Use MAE or RMSE when comparing against a known reference
Use maximum error when worst-case behavior matters
The important part is to decide this upfront. Your choice of metric will influence how you design your test and how you interpret the results.
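As a sketch of how these metrics relate, the snippet below computes all of them for a small set of made-up readings against an illustrative reference value (both the readings and the reference are invented for this example):

```python
import numpy as np

reference = 100.0  # assumed known ground-truth value (illustrative)
readings = np.array([100.2, 99.8, 100.5, 99.9, 100.1, 100.4])  # repeated readings

errors = readings - reference
mae = np.mean(np.abs(errors))        # Mean Absolute Error
mse = np.mean(errors ** 2)           # Mean Squared Error
rmse = np.sqrt(mse)                  # RMSE, back in the units of the readings
std = np.std(readings, ddof=1)       # sample standard deviation (precision)
max_err = np.max(np.abs(errors))     # worst-case deviation

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  std={std:.3f}  max={max_err:.3f}")
```

Note how RMSE sits above MAE whenever the errors vary in size, reflecting the extra weight it gives to large deviations.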

3 - Designing your test
Before you begin testing, you need to translate your objective into a concrete procedure. This means deciding what will be measured, under what conditions, and how the data will be collected. This step sets the foundation for everything that follows, and small decisions here can have a large impact on the conclusions you draw later. A poorly designed test often leads to results that look precise but fail to capture the true behavior of the system.
The resources available to you (sensors, number of machines, environments, etc.) and the time frame you’re working with will shape what is feasible. Being explicit about these constraints helps you avoid overcomplicating the setup or oversimplifying it in ways that hide important sources of uncertainty.
A well-designed test shares a few key characteristics:
Reproducible
A reproducible test is specific enough to be repeatable: if someone on the other side of the world has the same setup as you, they should be able to run the same test and get similar results. That requires clear procedures, defined conditions, and minimal ambiguity in how measurements are taken. It removes as much guesswork as possible so that differences in results come from the system itself, not from how the test was carried out.
Statistically meaningful
Your test design should include repeated runs as part of the original experiment. A single run is rarely enough to draw meaningful conclusions, and relying on it can lead to misleading results.
Think of it this way: if you want to understand how accurate a scale is, weighing an object once doesn’t tell you much. That one measurement could be slightly off for any number of reasons. But if you weigh the same object multiple times, patterns start to appear, and you get a clearer picture of how the scale behaves.
To avoid misleading conclusions, you need multiple measurements. As a rule of thumb, around 20 observations is a good starting point, since this is roughly where patterns begin to emerge and basic statistical analysis becomes more reliable. It is also approximately the sample size at which formal normality tests such as the Shapiro-Wilk test begin to yield reliable results [6].
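To see why repeated runs matter, the sketch below simulates the scale example with invented numbers (`true_weight` and `noise_sd` are arbitrary assumptions, not real sensor data) and shows how the standard error of the mean shrinks as more measurements are collected:

```python
import numpy as np

rng = np.random.default_rng(0)
true_weight = 250.0  # grams; assumed true value for this simulation
noise_sd = 2.0       # assumed measurement noise

for n in (5, 20, 100):
    samples = true_weight + rng.normal(0.0, noise_sd, size=n)
    # Standard error of the mean: uncertainty in the average shrinks ~ 1/sqrt(n)
    sem = samples.std(ddof=1) / np.sqrt(n)
    print(f"n={n:3d}  mean={samples.mean():7.2f}  std-err={sem:.3f}")
```

A single measurement gives you one number and no estimate of its uncertainty at all; with repeated runs, both the estimate and a measure of its reliability emerge.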
Controlled
Your test should isolate the behavior you want to measure and reduce variation from everything else. If too many factors change at once, you won’t know what is driving the results.
Think of it this way: if you want to understand how accurate a scale is, you shouldn’t change what you’re weighing every time. Weighing different objects introduces variation that has nothing to do with the scale itself. Instead, you weigh the same object under the same conditions so that any differences in the measurements come from the scale, not from external factors.
A good test captures the signal you care about, not noise from the environment.
Grounded in a reference
When possible, you should test your data against something you trust or fully know. Without a reference point, it becomes difficult to tell not only whether anything has drifted, but also which measurements are closer to the true value. You may still observe patterns or changes over time, but you have no way to judge correctness, only consistency.
Going back to the scale, now imagine you have no idea how much the object you’re weighing actually weighs. If the readings change, you can’t tell whether the scale is improving, getting worse, or simply fluctuating. You can compare measurements to each other, but not to the truth.
Having a known reference, or ground truth, expands what your test can tell you. It allows you to move from measuring consistency to evaluating accuracy, and to understand whether your system is getting closer to or further from the true value. This distinction between precision and accuracy is central to the GUM framework [1].
Well documented
Every step of the testing process should be documented. Parameters need to be clearly defined, along with their meaning and how they are used. The setup itself should be thoroughly described so that it can be understood and replicated without ambiguity.
Any changes made during the test should also be recorded. Even small deviations can affect the results, and without documentation it becomes difficult to trace their impact.
Documentation should be clear, structured, and easy to follow. This can include diagrams, photos, or sketches, anything that helps make the test setup and procedure understandable to someone who was not involved in running it.
4 - Running the test
Running the test means executing the procedure you defined and collecting data in a consistent and disciplined way. At this stage, the focus is not on changing the setup, but on following it as closely as possible.
Measurements should be taken under the conditions you specified, using the same process each time. Small deviations in execution can introduce variability that is not part of the system you are trying to study. This is why consistency during data collection matters as much as the design itself.
As you run the test, record all relevant information alongside your measurements. This includes timestamps, environmental conditions, parameter values, and any unexpected events. Even if something seems minor, it can become important later when interpreting results.
Once the data has been collected, you can move to your software of choice to begin analysis. At this point, you should already know which metrics you will compute and what questions you are trying to answer.
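A minimal way to keep measurements and context together during a run is a simple structured log. The sketch below writes one CSV row per reading with a timestamp and environmental metadata; `read_sensor` is a hypothetical placeholder for your actual acquisition call, and the column set is only an example:

```python
import csv
from datetime import datetime, timezone

def read_sensor() -> float:
    """Hypothetical stand-in for the real sensor acquisition call."""
    return 100.1  # placeholder value

with open("measurements.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Record context alongside each reading so it can be traced later.
    writer.writerow(["timestamp_utc", "reading", "temperature_c", "notes"])
    for _ in range(20):
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            read_sensor(),
            22.5,  # example environmental condition logged with each sample
            "",    # free-text field for unexpected events
        ])
```

Even a log this simple makes it possible to reconstruct, months later, exactly when and under what conditions each measurement was taken.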
5 - Distribution fit tests
Before applying statistical tests or summarizing your data, it is useful to understand how your measurements are distributed. Many statistical methods assume a particular distribution, most commonly a normal distribution.
You can check this by fitting your data to a distribution and evaluating how well it matches. This can be done visually, using histograms or Q-Q plots, or through formal tests. The Shapiro-Wilk test [6] is one of the most widely recommended tools for this purpose, particularly for small to moderately sized samples, as it tends to have better statistical power than alternatives such as the Kolmogorov-Smirnov test [7].
If your errors follow a normal distribution, you can rely on standard tools such as confidence intervals, standard deviation, and many common hypothesis tests. If they do not, these methods may no longer be appropriate, and you may need to use non-parametric approaches or transform your data.
Understanding the distribution of your data helps you choose the right statistical tools and avoid drawing incorrect conclusions from invalid assumptions.
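As a sketch, the Shapiro-Wilk test is available in SciPy as `scipy.stats.shapiro`; the data here is simulated noise for illustration, and 0.05 is the conventional, not mandatory, significance threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
errors = rng.normal(0.0, 0.3, size=30)  # simulated measurement errors (illustrative)

# Null hypothesis: the sample was drawn from a normal distribution.
stat, p = stats.shapiro(errors)
if p > 0.05:
    print(f"p={p:.3f}: no evidence against normality; parametric tools are reasonable")
else:
    print(f"p={p:.3f}: normality rejected; consider non-parametric methods")
```

Keep in mind that a high p-value does not prove normality; it only means the test found no evidence against it with the data you have.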
6 - Build what you need
Once your data is collected and you understand its distribution, the next step is to compute the quantities that answer your original question. This usually means building statistical summaries that describe the behavior of your measurements in a meaningful way.
In many measurement problems, the goal is to estimate an upper bound on error. This is useful when you need guarantees about worst-case behavior or want to ensure that your system stays within certain limits. In other cases, you may be more interested in confidence intervals, which provide a range of plausible values for the true parameter based on your data [8].
You may also choose to perform a hypothesis test, especially if you are trying to determine whether a change in the system has had a significant effect or whether two sets of measurements differ in a meaningful way [9].
A simple way to guide this choice is:
Use bounds when you need guarantees.
Use confidence intervals when you want to estimate a true value.
Use hypothesis tests when comparing scenarios or validating assumptions.
These methods all build on the error metric you defined earlier, so the interpretation of your results will depend directly on how you chose to measure error.
Finally, keep in mind that these tools are only as reliable as the data and assumptions behind them. Small sample sizes, noisy measurements, or incorrect assumptions about your data distribution can lead to overconfident or misleading conclusions [8].
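As a sketch of the middle two options, the snippet below builds a 95% confidence interval for the mean using the t-distribution and runs a one-sample t-test against a nominal reference; the readings and the reference value of 100.0 are made-up numbers for illustration:

```python
import numpy as np
from scipy import stats

readings = np.array([100.2, 99.8, 100.5, 99.9, 100.1, 100.4])  # illustrative

n = len(readings)
mean = readings.mean()
sem = readings.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% confidence interval for the true mean, using the t-distribution
# because the population standard deviation is unknown and n is small.
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

# Hypothesis test: does the mean differ from a nominal reference of 100.0?
t_stat, p_value = stats.ttest_1samp(readings, popmean=100.0)

print(f"mean={mean:.2f}  95% CI=({ci_low:.2f}, {ci_high:.2f})  p={p_value:.3f}")
```

If the reference value falls inside the confidence interval, the t-test will not reject at the matching significance level; the two views are consistent by construction.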
7 - Interpreting the results
The final step is to interpret the results of the test. A good interpretation should connect back to the original question, the methodology used, and the level of uncertainty you defined at the start. The goal is not just to report numbers, but to explain what those numbers mean in the context of your system.
Results should always be reported in a clear and consistent way. This includes stating the metric used, the number of observations, and the conditions under which the data was collected. Whenever possible, report results alongside measures of uncertainty, such as confidence intervals, standard deviation, or bounds, rather than relying on single values [1].
It is also important to distinguish between different types of conclusions. Statistical significance does not necessarily imply practical relevance. A result can be statistically valid but too small to matter in practice, or large enough to matter but not supported by enough data [8].
Interpretation should also reflect the limitations of the test. Any assumptions made during analysis, such as the choice of distribution or the stability of the environment, should be made explicit. If those assumptions do not hold, the conclusions may not be reliable.
Finally, be careful not to over-interpret the data. Small sample sizes, noisy measurements, or poorly controlled conditions can produce results that appear more certain than they actually are. Your conclusions should match the strength of your evidence, and it is often better to state uncertainty clearly than to make overly strong claims.
A good result is not just one that looks precise, but one that is honest about what can and cannot be concluded from the data.
Final remarks
Uncertainty is part of every measurement process. The goal is not to eliminate it, but to quantify it in a way that is useful and reliable [1].
By defining clear objectives, designing controlled and reproducible tests, and choosing appropriate metrics, you can turn raw data into meaningful insight. The quality of your conclusions will always depend on the quality of your test.
Careful design and honest interpretation matter more than complex methods. If those are in place, the results will follow.
If you need help validating data or improving measurement reliability, let’s talk and book your free consultation: Free Consultation with Nurvai
Connect with us on socials
References
[1] Joint Committee for Guides in Metrology, “Evaluation of measurement data — Guide to the expression of uncertainty in measurement,” BIPM, JCGM 100:2008, 2008. Available: https://www.bipm.org/documents/20126/2071204/JCGM_100_2008_E.pdf
[2] Joint Committee for Guides in Metrology, “International vocabulary of metrology — basic and general concepts and associated terms (VIM),” BIPM, JCGM 200:2012, 2012. Available: https://www.bipm.org/documents/20126/2071204/JCGM_200_2012.pdf
[3] M. J. Stanger et al., “Measurement uncertainty in clinical validation studies of sensors,” Sensors, vol. 23, no. 5, p. 2900, 2023, doi: 10.3390/s23052900.
[4] T. Chai and R. R. Draxler, “Root mean square error (RMSE) or mean absolute error (MAE)? — Arguments against avoiding RMSE in the literature,” Geoscientific Model Development, vol. 7, no. 3, pp. 1247–1250, 2014, doi: 10.5194/gmd-7-1247-2014.
[5] T. O. Hodson, “Root-mean-square error (RMSE) or mean absolute error (MAE): When to use them or not,” Geoscientific Model Development, vol. 15, no. 14, pp. 5481–5487, 2022, doi: 10.5194/gmd-15-5481-2022.
[6] S. S. Shapiro and M. B. Wilk, “An analysis of variance test for normality (complete samples),” Biometrika, vol. 52, no. 3/4, pp. 591–611, 1965, doi: 10.1093/biomet/52.3-4.591.
[7] A. Ghasemi and S. Zahediasl, “Normality tests for statistical analysis: A guide for non-statisticians,” International Journal of Endocrinology and Metabolism, vol. 10, no. 2, pp. 486–489, 2012, doi: 10.5812/ijem.3505.
[8] S. Greenland et al., “Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations,” European Journal of Epidemiology, vol. 31, no. 4, pp. 337–350, 2016, doi: 10.1007/s10654-016-0149-3.
[9] W. M. K. Trochim, “Statistical inference (Part 3): Statistical hypothesis testing and confidence interval estimation,” Research Methods Knowledge Base, 2000. Available: https://conjointly.com/kb/statistical-inference/

