
Understanding Z-Scores in EQA: How to Read Your PT Report


Every EQA report arrives the same way: a grid of analytes, a column of z-scores, and a verdict. Most labs read the verdict, file the report, and move on — which means most labs are using about 10% of the information they paid for.

A z-score is not just a pass/fail flag. Read across rounds, z-scores are the earliest warning system your lab has for calibration drift, reagent lot problems, and method bias — often months before anything actually fails. This post covers what the number means, the thresholds, the patterns worth acting on, and the traps in interpreting them.

What a z-score actually is

Your z-score answers one question: how far is your result from the consensus, measured in units of scatter?

The provider takes all participating labs’ results for an analyte (usually within your peer group — same method, same instrument family), computes the mean and standard deviation, and places your result on that distribution:

z = (your result − peer group mean) / peer group SD

A z of 0 means you matched the consensus exactly. A z of +1.0 means you were one standard deviation above it. The sign tells you the direction — positive means you read high, negative means low.
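As a minimal sketch of the formula above, assuming the report gives you the peer group mean and SD. The function name and the glucose numbers are illustrative assumptions, not values from any real report:

```python
def z_score(result: float, peer_mean: float, peer_sd: float) -> float:
    """Return how many peer-group SDs a result sits from the peer-group mean."""
    if peer_sd <= 0:
        raise ValueError("peer_sd must be positive")
    return (result - peer_mean) / peer_sd


# Hypothetical glucose sample: the lab reports 5.6 mmol/L,
# the peer group averages 5.4 mmol/L with an SD of 0.1 mmol/L.
z = z_score(5.6, 5.4, 0.1)
print(f"z = {z:+.1f}")  # positive sign: the lab read high relative to peers
```

A negative return value would mean the lab read low relative to its peers.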

The crucial property: z-scores are unitless and comparable. A glucose z of +2.1 and a TSH z of +2.1 represent the same degree of disagreement with peers, even though the analytes, units, and absolute errors are wildly different. That’s what makes z-scores trendable.

The thresholds

The conventional bands, used by most providers and accreditation schemes:

Z-score            Interpretation    Expected action
|z| ≤ 2.0          Acceptable        None required, but trend it
2.0 < |z| < 3.0    Warning signal    Investigate; document what you found
|z| ≥ 3.0          Unacceptable      Corrective action required

The statistics behind the bands: if your method is unbiased and results are roughly normally distributed, about 95% of your z-scores land within ±2 and about 99.7% within ±3 through random variation alone. So a single |z| between 2 and 3 might be luck: roughly 4% of results (about one in twenty-three) land there even when nothing is wrong. A |z| beyond 3 happens by chance only about 3 times in 1,000. That’s why 3 is the action line and 2 is the squint line.
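The bands and the probabilities behind them can be sketched in a few lines. This assumes a standard normal distribution and the conventional cut-offs from the table; the function names are illustrative, and your provider's exact boundary handling may differ:

```python
import math


def classify(z: float) -> str:
    """Map a z-score to the conventional band (cut-offs at |z| of 2 and 3)."""
    magnitude = abs(z)
    if magnitude >= 3.0:
        return "unacceptable"
    if magnitude > 2.0:
        return "warning"
    return "acceptable"


def prob_within(k: float) -> float:
    """P(|Z| <= k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))


p_warning = prob_within(3.0) - prob_within(2.0)  # chance of landing in the warning band
p_beyond_3 = 1.0 - prob_within(3.0)              # chance of landing beyond the action line

print(classify(2.3))         # warning
print(f"{p_warning:.3f}")    # 0.043
print(f"{p_beyond_3:.4f}")   # 0.0027
```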

This is also why a single warning-zone result is not a crisis, but a pattern of them is. One z of +2.3 in fifty results is statistics. Three consecutive +2.x results on the same analyte is a signal — and it’s a signal you’ll only see if someone is looking across rounds, not at one report at a time.

SDI: the same idea, different label

Some providers report SDI (Standard Deviation Index) instead of z-score. For practical purposes they’re the same calculation — your deviation from the peer mean in SD units — and the same thresholds apply. If your CAP report says SDI and your RCPA report says z-score, you can trend them with the same eyes.

Reading the trend: shift, drift, and scatter

The verdict column tells you about this round. The trajectory tells you about your method. Three patterns matter:

Shift — your z-scores were hovering near 0 for six rounds, then jumped to +1.8 and stayed there. Something changed at a specific point in time: a new reagent lot, a recalibration, a component replacement, a new technologist’s pipetting technique. The fix starts with “what changed in the lab between round 6 and round 7?” — and your calibration and maintenance logs are exactly where you look.

Drift — z-scores climbing steadily: +0.3, +0.7, +1.1, +1.6, +2.1. Nothing “happened”; something is happening. Gradual calibration drift, a deteriorating lamp or electrode, slow reagent degradation. Drift is the pattern that rewards attention most, because you can see it coming three rounds before it crosses a threshold — and fix it before you ever file a PT failure.

Scatter — z-scores bouncing: +1.9, −1.7, +0.2, −2.2. Your average is fine but your precision isn’t. Random error this size points at inconsistent technique, sample handling, an unstable instrument, or true within-lab imprecision that needs a precision study, not a recalibration.
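A rough way to label these three patterns automatically is sketched below. The cut-offs (a net change of 1.0, a within-segment range of 0.5, an SD of 1.5) are illustrative assumptions chosen to match the examples above, not values from any scheme, and the function is a screening aid rather than a substitute for looking at the chart:

```python
from statistics import mean, stdev


def label_trend(z: list[float]) -> str:
    """Heuristic label for one analyte's z-scores, oldest round first."""
    if len(z) < 4:
        return "too few rounds"

    # Drift: every round moves the same direction and the net change is large.
    diffs = [b - a for a, b in zip(z, z[1:])]
    one_direction = all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
    if one_direction and abs(z[-1] - z[0]) >= 1.0:
        return "drift"

    # Shift: two flat segments (at least two rounds each) at different levels.
    for k in range(2, len(z) - 1):
        before, after = z[:k], z[k:]
        flat = (max(before) - min(before) <= 0.5
                and max(after) - min(after) <= 0.5)
        if flat and abs(mean(after) - mean(before)) >= 1.0:
            return "shift"

    # Scatter: no consistent level or direction, but a wide spread.
    if stdev(z) >= 1.5:
        return "scatter"
    return "stable"


print(label_trend([0.3, 0.7, 1.1, 1.6, 2.1]))                       # drift
print(label_trend([0.1, -0.2, 0.0, 0.2, -0.1, 0.1, 1.8, 1.7, 1.9]))  # shift
print(label_trend([1.9, -1.7, 0.2, -2.2]))                          # scatter
```

Real data is noisier than these examples, so a drift with one small reversal would slip past the strict same-direction test; treat the labels as prompts to open the logs, not as verdicts.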

From z-scores to verdicts: how rounds are judged

A modern PT event isn’t one number — it’s typically several samples per round (a CAP chemistry mailing is commonly five), each generating a result per analyte. The per-sample z-scores get collapsed into a per-analyte verdict, and the rules differ by accreditation:

  • CAP and CLIA use an 80% rule: 4 of 5 samples acceptable means the analyte passes the event. One outlier sample doesn’t fail the round.
  • ISO 15189 schemes commonly sit near 75%.
  • Stricter schemes treat any failed sample as a failed event.
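The per-analyte verdict can be sketched as a simple percentage check. The function and parameter names are hypothetical, and the required percentage is whatever your scheme specifies; the values below are only the ones quoted in the list above:

```python
def event_passes(sample_ok: list[bool], required_percent: int = 80) -> bool:
    """Return True if enough samples in one testing event were acceptable.

    sample_ok holds one flag per sample for a single analyte in a single event.
    """
    if not sample_ok:
        raise ValueError("an event needs at least one sample")
    acceptable = sum(sample_ok)
    # Integer comparison avoids floating-point edge cases at the boundary.
    return acceptable * 100 >= required_percent * len(sample_ok)


five_samples = [True, True, False, True, True]          # 4 of 5 acceptable
print(event_passes(five_samples))                        # True under an 80% rule
print(event_passes(five_samples, required_percent=100))  # False under a strict scheme
```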

Then comes the rule with real teeth: consecutive failures. Under CAP and CLIA, repeated failure on the same analyte across testing events (the classic formulation: 2 of the last 3 events) escalates from “investigate” to a reportable condition that can threaten the authority to test. Two details labs get wrong here:

  1. The count is per testing event, not per sample — three bad samples in one mailing is one failed event, not three.
  2. The clock doesn’t reset just because time passed or one round went well. Under a 2-of-3 rule, failing round 12 and round 14 still counts as a pattern, even though round 13 passed.
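A minimal sketch of the 2-of-3 check, working on per-event verdicts rather than per-sample results. It implements only the classic formulation quoted above; the function name is hypothetical, and the actual regulatory wording has more detail than this captures:

```python
def repeated_failure(event_passed: list[bool], window: int = 3,
                     failures_needed: int = 2) -> bool:
    """Return True if the most recent `window` events contain enough failures.

    event_passed holds one verdict per testing event, oldest first,
    for a single analyte.
    """
    recent = event_passed[-window:]
    return recent.count(False) >= failures_needed


# Rounds 12, 13, 14: fail, pass, fail. The pass in between does not clear it.
print(repeated_failure([False, True, False]))  # True

# Three bad samples in one mailing are still a single failed event.
print(repeated_failure([True, True, False]))   # False
```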

This is also why review deadlines exist: CAP expects PT results to be reviewed within its defined window (30 days), and a graded report sitting unopened in an inbox is itself a finding. We’ve covered the scheme differences in detail in “CAP vs CLIA proficiency testing”.

The traps

A few interpretation mistakes that show up over and over:

Wrong peer group. A z-score against all methods is meaningless if your method has a known bias relative to the field. Make sure the comparison group on the report is your method/instrument peer group, and that it’s large enough (small peer groups produce unstable SDs and erratic z-scores that say more about the group than about you).

Chasing single results. Recalibrating because one z hit +2.4 is how you add variance. Investigate, document, and watch the next round; adjust the method only when the trend across rounds confirms the deviation is real.

Tiny SDs producing scary z-scores. With a very tight peer group, a clinically trivial absolute difference can produce |z| > 2. The z-score measures disagreement with peers, not clinical impact — both lenses matter.
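To see the effect in numbers, the sketch below holds the absolute difference fixed and changes only the peer group SD. The sodium values are hypothetical and chosen purely to illustrate the arithmetic:

```python
def z_score(result: float, peer_mean: float, peer_sd: float) -> float:
    """Deviation from the peer mean in units of the peer SD."""
    return (result - peer_mean) / peer_sd


result, peer_mean = 141.0, 140.0          # hypothetical sodium, mmol/L
percent_diff = 100 * (result - peer_mean) / peer_mean

z_tight = z_score(result, peer_mean, 0.4)  # very tight peer group
z_wide = z_score(result, peer_mean, 1.5)   # wider peer group

print(f"difference from peers: {percent_diff:.2f}%")
print(f"tight peer group (SD 0.4): z = {z_tight:+.2f}")
print(f"wide peer group  (SD 1.5): z = {z_wide:+.2f}")
```

The same 1 mmol/L difference lands in the warning zone against the tight group and well inside the acceptable band against the wide one, which is why the absolute difference is worth reading alongside the z-score.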

Treating “acceptable” as “reviewed”. Most schemes — and CAP explicitly — expect the lab to review all PT results, not just failures. The review of a passing report with a worrying trend is where EQA earns its cost.

A reading routine that takes ten minutes

When a graded report lands:

  1. Verdicts first — any unacceptable result starts the corrective action clock immediately.
  2. Warning zone next — anything with 2 < |z| < 3 gets checked against its history: first time, or part of a pattern?
  3. Trends third — scan each analyte’s trajectory for shift, drift, or widening scatter, including the comfortable passers.
  4. Sign and date the review — the review itself is an auditable event with a deadline. A reviewed-and-annotated report is evidence; an opened PDF is not.

Do that every round and EQA stops being a quarterly exam you hope to pass. It becomes what it was designed to be: the only measurement system in the lab that tells you how you compare to everyone else running the same test, at no extra cost, often several rounds before a failure would show it.

