FAQs on Method Validation

James O. Westgard, PhD

Please Note: The Basic Method Validation manual is now in its FOURTH edition. The questions presented in this article are drawn from the first edition. Many of the answers here still apply, but some things, particularly on the regulatory side, have changed in the past 20 years.

Why is it necessary to validate method performance when the manufacturer already has?
What analytical performance is needed for a laboratory test?
Who should perform the validation studies in a laboratory?
In setting up a new method for validation studies, how important is it to calibrate the method using primary standards instead of commercial calibrators?
What performance characteristics are usually validated?
What experiments are usually performed?
Does linearity have to be validated?
Does detection limit have to be validated for all tests?
How many materials need to be analyzed in a replication experiment?
What comparison method should be used in the comparison of methods experiment?
Why is there so much emphasis placed on the comparison of methods experiment?
Why can't the correlation coefficient be used to judge the agreement between methods in a comparison of methods study?
Why are regression statistics still recommended?
What's the proper way to use t-test statistics?
What's the proper way to use regression statistics?
What's Deming regression?
What's the alternative to more complicated regression calculations?
What tests are likely to have a narrow range of data and require more care?
Why can't acceptability be judged by tests of significance, such as t-test and F-test?
How does the "method decision chart" approach compare with the "performance criteria" approach?
Where can I find more detailed protocols and statistical guidelines for method validation?

Why is it necessary to validate method performance when the manufacturer has already performed extensive studies?

It's important to demonstrate that the method performs well under the operating conditions of your laboratory and that it provides reliable test results for your patients. There are many factors that can affect method performance, such as different lots of calibrators and reagents, changes in supplies and suppliers of instrument components, changes in manufacturing from the production of prototypes to final field instruments, effects of shipment and storage, as well as local climate control conditions, quality of water, stability of electric power, and, of course, the skills of the analysts. In US laboratories, method validation studies are actually required by the CLIA regulations.

What analytical performance is needed for a laboratory test?

In the US, CLIA defines minimum standards of analytical quality in the form of the criteria for acceptability in proficiency testing surveys. These criteria define the allowable total error around a target value (TV). For example, acceptable test results for cholesterol are described as TV plus/minus 10%, which means that test results should be within 10% of the correct value. Most countries have proficiency testing or external quality assessment schemes that define standards for analytical quality in a similar manner. Note that use of an allowable total error does not provide a specification for an individual characteristic, such as imprecision, inaccuracy, interference, recovery, etc., but provides a requirement of the total amount of error when all sources are combined.
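As a rough illustration of how an allowable total error criterion is applied, the sketch below checks whether individual results fall within TV plus/minus 10%, the cholesterol figure quoted above. The function name and the target value of 200 mg/dL are assumptions made for the example only.

```python
def within_allowable_error(result, target, tea_percent):
    """Return True if result lies within target +/- tea_percent of the target."""
    limit = abs(target) * tea_percent / 100.0
    return abs(result - target) <= limit


# Hypothetical cholesterol target value (mg/dL); 10% is the criterion cited in the text.
target_value = 200.0
for result in (185.0, 215.0, 225.0):
    status = "acceptable" if within_allowable_error(result, target_value, 10.0) else "unacceptable"
    print(f"{result:.0f} mg/dL: {status}")
```

Note that the check says nothing about where the error came from; imprecision, bias, and interference all count against the same limit.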

Who should perform the validation studies in a laboratory?

One possibility is that the manufacturer's technical personnel will perform validation studies when installing a new system in your laboratory. This seems to be a growing trend in the US, probably due both to tight laboratory staffing and to the strategy of purchasing whole systems and holding the manufacturer accountable for all problems. If the manufacturer performs the studies, it's important that you review the experimental design, monitor the data collection, and perform your own statistical analysis and interpretation of the data.

In many other cases, however, the studies will need to be organized and carried out by the laboratory itself. It is advisable to have one analyst organize the studies, monitor the data collection, review the data, perform the statistical analysis of the data, and be responsible for the interpretation and conclusions. Other analysts can participate as operators and perform the tests needed in the different validation experiments.

In setting up a new method for validation studies, how important is it to calibrate the method using primary standards instead of commercial calibrators?

The method should be operated in the way intended under routine service conditions. If routine service operation will make use of commercial calibrators, then those calibrators must be part of the testing process that is validated. It is generally advisable to analyze both commercial calibrators and primary standards together, when possible, to see if they agree. Any disagreement should be resolved prior to performing the recovery, interference, and comparison of methods experiments.

What performance characteristics are usually validated?

These almost always include the reportable range, precision (or imprecision), accuracy (or inaccuracy, bias), and the reference interval. Sometimes the studies include detection limit (or sensitivity), interference, and recovery. In US laboratories, the CLIA regulations define which characteristics need to be validated for methods with different classifications of complexity. Fewer studies are required with less complex methods. More extensive testing is necessary for methods developed by the laboratory or modified by the laboratory.

What experiments are usually performed?

Reportable range is validated by a linearity experiment, imprecision (or random error) is determined from a replication study, and inaccuracy (or systematic error) is assessed from a comparison of methods experiment, as well as from experiments for interference (constant systematic error) and recovery (proportional systematic error). Sensitivity is determined by a detection limit experiment. Reference intervals can be verified by testing samples from healthy people.

Does linearity have to be validated?

It's actually the reportable range that must be validated. The objective in determining reportable range is to define the highest value that can be reported without diluting the sample. This is usually done by performing a linearity type of experiment, but there is no strict requirement that the method response has to be linear. However, the readout from instrument systems often is linear in the units that are reported.

Does detection limit have to be validated for all tests?

No, for most tests it is sufficient to validate the reportable range using a linearity type of experiment. A more exact estimate of analytical performance around zero is needed only when there is special significance attached to low values for the test. Drug tests are an obvious example. Tumor markers are another example.

How many materials need to be analyzed in a replication experiment?

Good planning would be to analyze the number of materials that will be used in routine quality control for that test. In US laboratories, CLIA places certain requirements on the number of materials to be used for different tests - e.g., a minimum of 2 levels or materials. Laboratory practices commonly include 3 materials for certain tests, such as blood gases and hematology. When possible, select control materials that can be continued for QC once the test is implemented in your laboratory.

What comparison method should be used in the comparison of methods experiment?

Ideally, the comparison method should be a method that is free of systematic errors, i.e., a method whose accuracy or bias is minimal. In practice, most studies involve the routine service method that is to be replaced by the new method. In such studies, the objective is really to assess whether there will be any systematic changes in test values between the "old" method and the "new" method. If such systematic changes are uncovered, then it is important to document which method has the problem. Interference and recovery experiments are often helpful for pinpointing the problem and the method at fault.

Why is there so much emphasis placed on the comparison of methods experiment?

Probably because this experiment uses real patient samples and reveals the kind of errors that will be encountered when the tests are used for patient care, which is particularly important when a laboratory changes methods. It also reveals different kinds of errors - proportional systematic, constant systematic, random error between methods - therefore providing a lot of quantitative information about method performance. Some of the other experiments seem to test conditions that may not be observed very often - e.g., interference, recovery, and detection limit.

Why can't the correlation coefficient be used to judge the agreement between methods in a comparison of methods study?

Perfect correlation, i.e., a correlation coefficient of 1.000, means that the values by the test method increase in direct proportion as the values by the comparison method increase. However, a value of 1.000 doesn't mean that the test method values are identical to those of the comparison method. Systematic differences can be present, e.g., the test method could be running 100 units higher than the comparison method, or the test method could be providing results that are only half of the values by the comparison method, yet the correlation coefficient could still give a value near 1.000. Because the comparison of methods experiment is performed to validate the accuracy of a method, the statistical analysis must provide estimates of systematic errors, not just the correlation of results.

The best use of the correlation coefficient is to help decide whether ordinary linear regression will provide reliable estimates of slope and intercept. If r=0.99 or greater, it is generally accepted that ordinary linear regression calculations are adequate for estimating the errors between the methods.
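The point about correlation can be checked numerically. The sketch below builds two hypothetical test methods from the examples in the answer above (one running 100 units high, one reading half the comparison value) and computes r for each. The comparison values are made up for illustration.

```python
import numpy as np

# Hypothetical comparison-method results covering a wide range
comparison = np.array([50.0, 100.0, 150.0, 200.0, 250.0, 300.0])

# Two hypothetical test methods with large systematic errors
constant_error = comparison + 100.0     # runs 100 units higher
proportional_error = 0.5 * comparison   # reads half the comparison value

r_constant = np.corrcoef(comparison, constant_error)[0, 1]
r_proportional = np.corrcoef(comparison, proportional_error)[0, 1]

print("r, constant error:    ", round(r_constant, 3))
print("r, proportional error:", round(r_proportional, 3))

# The average differences expose what r hides
print("mean difference, constant error:    ", np.mean(constant_error - comparison))
print("mean difference, proportional error:", np.mean(proportional_error - comparison))
```

Both correlation coefficients come out at essentially 1, even though neither method agrees with the comparison method.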

Why are regression statistics still recommended, given recent publications that emphasize the use of a difference plot as the primary way to present the data from the comparison of methods experiment?

Remember that the purpose of the comparison of methods experiment is to estimate systematic errors, which may be constant or proportional in nature. Regression statistics can provide estimates of these components of systematic error by the y-intercept and slope, as well as estimation of the overall systematic error or bias at any decision level concentration of interest by calculation from the regression equation. The difference plot, on the other hand, emphasizes the random errors between the methods. You actually need to calculate the average difference or bias from paired t-test statistics to get a good estimate of the systematic error, thus the difference plot by itself (without statistical calculations) does not provide sufficient information about the systematic error of the method. Regression statistics are preferred over t-test statistics in order to calculate the systematic error at any decision level, as well as getting estimates of the proportional and constant components of systematic error.

What's the proper way to use t-test statistics?

There are two cases where t-test statistics will provide reliable estimates of systematic errors.

Case 1: proportional error is absent, therefore the estimate of systematic error or bias is applicable throughout the range of the data.
Case 2: the estimate of systematic error or bias is interpreted at a decision level near the mean of the data.

Plot the data on a comparison plot (test value on the y-axis, comparison value on the x-axis) to assess whether proportional error is present or absent. If absent, then plot the data on a difference plot, i.e., plot the difference of the test minus comparison values on the y-axis versus the comparison values on the x-axis.

When using t-test statistics, present the following:

bias,
standard deviation of the differences,
mean of the data,
t-value, and the
difference plot.
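The numerical items in the list above can be computed with a few lines of code. This is a minimal sketch, not a validated statistics package: the function name and the paired results are hypothetical, and "mean of the data" is taken here to be the mean of the comparison-method values.

```python
import math
import statistics


def paired_t_summary(test_values, comparison_values):
    """Summarize paired differences (test minus comparison)."""
    diffs = [y - x for x, y in zip(comparison_values, test_values)]
    n = len(diffs)
    bias = statistics.mean(diffs)        # average difference
    sd_diff = statistics.stdev(diffs)    # SD of the differences (n - 1 divisor)
    # Assumes the differences are not all identical (sd_diff > 0)
    t_value = bias / (sd_diff / math.sqrt(n))
    return {
        "n": n,
        "bias": bias,
        "sd_diff": sd_diff,
        "mean_comparison": statistics.mean(comparison_values),
        "t_value": t_value,
    }


# Hypothetical paired patient results
comparison = [100.0, 120.0, 140.0, 160.0, 180.0]
test = [102.0, 123.0, 141.0, 164.0, 180.0]

for name, value in paired_t_summary(test, comparison).items():
    print(f"{name}: {value:.3f}")
```

The bias applies at the mean of the data; whether it applies elsewhere depends on proportional error being absent, as described in Case 1.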

What's the proper way to use regression statistics?

Plot the test value on the y-axis versus the comparison value on the x-axis, then inspect the data for:

nonlinearity
outliers
wide range of data

Calculate the correlation coefficient as a measure of the range of data. First inspect a graph, however, to be sure the data are spread fairly uniformly over the range so that the r value is not being influenced by a few high or low points. If r=0.99 or greater, the range of data is wide enough to provide reliable estimates of the slope and y-intercept using ordinary linear regression analysis. If r<0.95, it is generally advised to use an alternate statistical technique, such as t-test statistics, to estimate the overall systematic error; or use an alternate regression technique, such as Deming regression, to calculate the slope and y-intercept.

Calculate the slope, y-intercept, and standard deviation of points about the regression line. Interpret the deviation of the slope from an ideal value of 1.000 as proportional error, the deviation of the y-intercept from an ideal value of 0.00 as an estimate of constant systematic error, and the value of the standard deviation of the points about the regression line as a measure of the random error between the methods.

Calculate the systematic error at medically important decision concentrations (Xc) using the regression equation, where a is the y-intercept and b is the slope: SE = Yc - Xc = (a + bXc) - Xc.
Present the following:

slope,
y-intercept,
standard deviation of points about the regression line,
standard deviation of the slope (when available),
standard deviation of the y-intercept (when available),
correlation coefficient, and the
comparison plot.
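The regression calculations described above can be sketched as follows. This is an illustrative ordinary least-squares implementation using textbook formulas; the function names, the paired data, and the decision level of 200 are assumptions for the example.

```python
import numpy as np


def ols_summary(x, y):
    """Ordinary linear regression of test values (y) on comparison values (x)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    slope = sxy / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    s_yx = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # SD of points about the line
    s_slope = s_yx / np.sqrt(sxx)                      # SD of the slope
    s_intercept = s_yx * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)  # SD of the y-intercept
    r = np.corrcoef(x, y)[0, 1]
    return {"slope": slope, "intercept": intercept, "s_yx": s_yx,
            "s_slope": s_slope, "s_intercept": s_intercept, "r": r}


def systematic_error(intercept, slope, xc):
    """SE = Yc - Xc = (a + b*Xc) - Xc at decision concentration Xc."""
    return (intercept + slope * xc) - xc


# Hypothetical paired results: comparison method (x) and test method (y)
x = [50, 100, 150, 200, 250, 300]
y = [54, 103, 158, 207, 262, 309]

stats = ols_summary(x, y)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
print("SE at Xc = 200:", round(systematic_error(stats["intercept"], stats["slope"], 200.0), 2))
```

The code does not replace the comparison plot; nonlinearity and outliers still need to be judged by eye before the numbers are trusted.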

What's Deming regression?

This refers to an alternate way of calculating regression statistics when the range of data isn't as wide as desired for ordinary linear regression (i.e., the correlation coefficient doesn't satisfy the criterion of being 0.99 or greater). An assumption in ordinary linear regression is that the x-values are well known and any difference between x and y-values is assignable to error in the y-value. In Deming regression, the errors between methods are assigned to both methods in proportion to the variances of the methods. The calculations are not commonly available in standard statistical programs; however, special computer programs for laboratory method validation will often include Deming regression.

For a detailed discussion of Deming regression and the calculations, see Cornbleet PJ, Gochman N. Incorrect least-squares regression coefficients in method-comparison analysis. Clin Chem 1979;25:432-438.
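For readers who want to see the arithmetic, the sketch below uses the closed-form slope commonly given for Deming regression. It is an illustration under stated assumptions, not a reproduction of the cited paper's worked calculations: the function and parameter names are hypothetical, and `error_variance_ratio` (the variance of the test method's errors divided by that of the comparison method's errors) must be supplied by the user, for example from replication data. The default of 1.0 simply assumes both methods are equally imprecise.

```python
import math


def deming_regression(x, y, error_variance_ratio=1.0):
    """Deming regression of test values (y) on comparison values (x).

    error_variance_ratio = var(errors in y) / var(errors in x).
    The default of 1.0 is an assumption, not a recommendation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    d = error_variance_ratio
    # Closed-form slope; requires sxy != 0
    slope = (syy - d * sxx + math.sqrt((syy - d * sxx) ** 2 + 4.0 * d * sxy ** 2)) / (2.0 * sxy)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical narrow-range data (e.g., an electrolyte)
x = [136.0, 138.0, 139.0, 141.0, 142.0, 144.0]
y = [137.0, 138.0, 141.0, 141.0, 144.0, 145.0]

slope, intercept = deming_regression(x, y)
print(f"slope = {slope:.3f}, y-intercept = {intercept:.2f}")
```

When the data fall exactly on a line, this estimate and ordinary linear regression give the same slope and intercept; the two diverge as the error in the x-values grows relative to the range of the data.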

What's the alternative to more complicated regression calculations (such as Deming regression)?

You can collect the data very carefully to permit the application of other statistical calculations. Consider strategies to:

Expand the analytical range of the test results so ordinary linear regression statistics will be valid.
Reduce the variation of the comparison method by performing duplicate measurements, i.e., reduce the error in the x-value to better satisfy the assumption in ordinary linear regression.
Collect the data around the medically important decision concentrations, then analyze subsets of data using t-test statistics.
Interpret the data only at the mean of the data set to minimize the effect of the regression technique on the estimate of systematic error.

What tests are likely to have a narrow range of data and require more care and attention to data collection and statistical calculations?

Tests that may have a narrow analytical range include analytes such as calcium, chloride, and sodium, where the body itself attempts to maintain a narrow range of concentrations. Other tests, such as creatinine, may have a narrow concentration range in a healthy population and therefore need to be evaluated using a patient population from a hospital. Therapeutic drug levels, of course, will depend on obtaining patient specimens for varying doses and varying times following the doses. As a general strategy, make use of specimens from a hospital population to obtain a wide range of concentrations.

Why can't acceptability be judged by tests of significance, such as t-test and F-test?

Tests of significance are useful mainly to assess whether there are sufficient data to support a conclusion that a difference or error exists (statistical significance), not whether that difference or error is large enough to invalidate the usefulness of a test (clinical significance). It is best to judge the acceptability of method performance by comparison of the observed errors to the total error that is allowable (such as defined in the CLIA criteria for acceptability of proficiency testing performance).
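A small numerical example may make the distinction concrete. All figures below are hypothetical except the 10% cholesterol criterion quoted earlier; the function name is also an assumption. With enough paired samples, even a small bias produces a large t-value.

```python
import math


def t_value_for_bias(bias, sd_diff, n):
    """Paired t-value: the bias divided by its standard error."""
    return bias / (sd_diff / math.sqrt(n))


# Hypothetical comparison study for cholesterol
n = 400                  # paired patient samples
bias = 1.0               # mg/dL, average difference between methods
sd_diff = 4.0            # mg/dL, SD of the differences
decision_level = 200.0   # mg/dL
allowable_total_error = 0.10 * decision_level   # 10% criterion from the text

t_value = t_value_for_bias(bias, sd_diff, n)
print(f"t = {t_value:.1f}")
print(f"bias = {bias} mg/dL vs allowable total error = {allowable_total_error} mg/dL")
```

A t-value this large is well beyond the two-sided 5% critical value (about 1.97 for this many samples), so the bias is statistically real, yet it uses up only a small fraction of the error that is allowable.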

How does the "method decision chart" approach compare with the "performance criteria" used in the past to judge the acceptability of a method?

The method decision chart provides a graphical way of comparing the observed errors with standards of performance, whereas the earlier performance criteria provided a mathematical comparison. Therefore, the method decision chart is easier to use. In addition, the method decision chart permits simultaneous assessment against the different definitions of allowable total error, such as bias + 2s, bias + 3s, and bias + 4s, which have evolved since the original description of "performance criteria."
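The chart itself is graphical, but the comparisons it encodes can be written out numerically. The sketch below is only a rough numerical stand-in for the chart: it checks observed bias and imprecision against an allowable total error using the bias + 2s, bias + 3s, and bias + 4s definitions mentioned above. The function name and the observed values are hypothetical.

```python
def total_error_checks(bias, sd, allowable_total_error, multipliers=(2, 3, 4)):
    """For each multiplier k, report whether |bias| + k*sd fits within the allowable error."""
    return {k: abs(bias) + k * sd <= allowable_total_error for k in multipliers}


# Hypothetical method: 2.0% bias, 2.5% CV, judged against a 10% allowable total error
results = total_error_checks(bias=2.0, sd=2.5, allowable_total_error=10.0)
for k, ok in results.items():
    print(f"bias + {k}s within allowable total error: {ok}")
```

All quantities must be expressed in the same units (here, percent) for the comparison to be meaningful.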

For the original discussion of "performance criteria", see Westgard JO, Carey RN, Wold S. Criteria for judging the precision and accuracy in method development and evaluation. Clin Chem 1974;20:825-833.

Where can I find more detailed protocols and statistical guidelines for method validation experiments?

The National Committee for Clinical Laboratory Standards (NCCLS, 90 West Valley Road, Suite 1400, Wayne, PA 19087-1898, phone 610-688-0100) publishes a series of documents that provide extensive information about individual experiments:

EP5-A. Evaluation of precision performance of clinical chemistry devices.
EP6-P. Evaluation of the linearity of quantitative analytical methods.
EP7-P. Interference testing in clinical chemistry.
EP9-A. Method comparison and bias estimation using patient samples.
EP10-A. Preliminary evaluation of quantitative clinical laboratory methods.
EP14-P. Evaluation of matrix effects.
EP15-P. User demonstration of performance for precision and accuracy.
C28-A. How to define and determine reference intervals in the clinical laboratory.
