Understanding Z-Scores in EQA: How to Read Your PT Report
Every EQA report arrives the same way: a grid of analytes, a column of z-scores, and a verdict. Most labs read the verdict, file the report, and move on — which means most labs are using about 10% of the information they paid for.
A z-score is not just a pass/fail flag. Read across rounds, z-scores are the earliest warning system your lab has for calibration drift, reagent lot problems, and method bias — often months before anything actually fails. This post covers what the number means, the thresholds, the patterns worth acting on, and the traps in interpreting them.
What a z-score actually is
Your z-score answers one question: how far is your result from the consensus, measured in units of scatter?
The provider takes all participating labs’ results for an analyte (usually within your peer group — same method, same instrument family), computes the mean and standard deviation, and places your result on that distribution:
z = (your result − peer group mean) / peer group SD
A z of 0 means you matched the consensus exactly. A z of +1.0 means you were one standard deviation above it. The sign tells you the direction — positive means you read high, negative means low.
The crucial property: z-scores are unitless and comparable. A glucose z of +2.1 and a TSH z of +2.1 represent the same degree of disagreement with peers, even though the analytes, units, and absolute errors are wildly different. That’s what makes z-scores trendable.
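As a minimal sketch of the formula above: the function name `z_score` and the glucose numbers are made up for illustration and do not come from any provider's report. With these inputs the result works out to about +2.0.

```python
def z_score(result: float, peer_mean: float, peer_sd: float) -> float:
    """Distance of one result from the peer mean, in units of peer SD."""
    if peer_sd <= 0:
        raise ValueError("peer_sd must be positive")
    return (result - peer_mean) / peer_sd


# Hypothetical glucose result: 5.6 mmol/L against a peer mean of
# 5.3 mmol/L and a peer SD of 0.15 mmol/L.
glucose_z = z_score(5.6, 5.3, 0.15)
print(f"glucose z = {glucose_z:+.2f}")
```

Because the units cancel in the division, the same function applies unchanged to any analyte.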
The thresholds
The conventional bands, used by most providers and accreditation schemes:
| Z-score | Interpretation | Expected action |
|---|---|---|
| \|z\| ≤ 2.0 | Acceptable | None required, but trend it |
| 2.0 < \|z\| < 3.0 | Warning signal | Investigate; document what you found |
| \|z\| ≥ 3.0 | Unacceptable | Corrective action required |
The statistics behind the bands: if your method is unbiased and behaving, and results are roughly normally distributed, ~95% of results land within ±2 SD and ~99.7% within ±3 SD. So a single |z| between 2 and 3 might be luck: a little under one result in twenty will land there even when nothing is wrong. A |z| beyond 3 happens by chance only about 3 times in 1,000. That’s why 3 is the action line and 2 is the squint line.
This is also why a single warning-zone result is not a crisis, but a pattern of them is. One z of +2.3 in fifty results is statistics. Three consecutive +2.x results on the same analyte is a signal — and it’s a signal you’ll only see if someone is looking across rounds, not at one report at a time.
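The band probabilities can be checked directly. The sketch below assumes z-scores from an in-control method follow a standard normal distribution and that rounds are independent; both are idealisations, and the function names are illustrative.

```python
import math


def classify(z: float) -> str:
    """Map a z-score to the conventional band from the table above."""
    a = abs(z)
    if a <= 2.0:
        return "acceptable"
    if a < 3.0:
        return "warning"
    return "unacceptable"


def prob_abs_z_exceeds(k: float) -> float:
    """P(|Z| > k) for a standard normal Z (two-sided tail)."""
    return math.erfc(k / math.sqrt(2))


# Chance of landing in each zone when nothing is wrong.
p_warning = prob_abs_z_exceeds(2.0) - prob_abs_z_exceeds(3.0)
p_action = prob_abs_z_exceeds(3.0)

# Chance of three warning-zone results in a row, assuming independence.
p_three_in_a_row = p_warning ** 3

print(f"warning zone: {p_warning:.4f}")
print(f"beyond 3:     {p_action:.4f}")
print(f"three in row: {p_three_in_a_row:.6f}")
```

Under those assumptions a single warning-zone result is roughly a 4% event, while three in a row is rarer than one in ten thousand, which is why the run is treated as a signal.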
SDI: the same idea, different label
Some providers report SDI (Standard Deviation Index) instead of z-score. For practical purposes they’re the same calculation — your deviation from the peer mean in SD units — and the same thresholds apply. If your CAP report says SDI and your RCPA report says z-score, you can trend them with the same eyes.
Reading the trend: shift, drift, and scatter
The verdict column tells you about this round. The trajectory tells you about your method. Three patterns matter:
Shift — your z-scores were hovering near 0 for six rounds, then jumped to +1.8 and stayed there. Something changed at a specific point in time: a new reagent lot, a recalibration, a component replacement, a new technologist’s pipetting technique. The fix starts with “what changed in the lab between round 6 and round 7?” — and your calibration and maintenance logs are exactly where you look.
Drift — z-scores climbing steadily: +0.3, +0.7, +1.1, +1.6, +2.1. Nothing “happened”; something is happening. Gradual calibration drift, a deteriorating lamp or electrode, slow reagent degradation. Drift is the pattern that rewards attention most, because you can see it coming three rounds before it crosses a threshold — and fix it before you ever file a PT failure.
Scatter — z-scores bouncing: +1.9, −1.7, +0.2, −2.2. Your average is fine but your precision isn’t. Random error this size points at inconsistent technique, sample handling, an unstable instrument, or true within-lab imprecision that needs a precision study, not a recalibration.
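One rough way to put numbers on drift and scatter is to fit a straight line through an analyte's z-scores by round. The sketch below is an illustration, not a validated rule: `trend_summary` is a hypothetical helper, it assumes evenly spaced rounds, and it does not detect a step change (a shift would need a separate check, such as comparing the mean before and after a candidate round).

```python
from statistics import mean


def trend_summary(z_scores: list[float]) -> dict[str, float]:
    """Least-squares slope per round and the residual SD around that line."""
    n = len(z_scores)
    if n < 3:
        raise ValueError("need at least three rounds")
    rounds = list(range(n))
    x_bar, z_bar = mean(rounds), mean(z_scores)
    sxx = sum((x - x_bar) ** 2 for x in rounds)
    sxz = sum((x - x_bar) * (z - z_bar) for x, z in zip(rounds, z_scores))
    slope = sxz / sxx
    intercept = z_bar - slope * x_bar
    # Residuals: what is left of each z-score after removing the fitted line.
    sse = sum((z - (intercept + slope * x)) ** 2
              for x, z in zip(rounds, z_scores))
    return {
        "mean": z_bar,
        "slope_per_round": slope,
        "residual_sd": (sse / (n - 2)) ** 0.5,
    }


drift = trend_summary([0.3, 0.7, 1.1, 1.6, 2.1])    # the drift example above
scatter = trend_summary([1.9, -1.7, 0.2, -2.2])     # the scatter example above
print(drift)
print(scatter)
```

On the two example series, the drift sequence shows a steady slope with very small residuals, while the scatter sequence leaves large residuals around any line. With only four or five points, these numbers support a visual review rather than replace it.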
From z-scores to verdicts: how rounds are judged
A modern PT event isn’t one number — it’s typically several samples per round (a CAP chemistry mailing is commonly five), each generating a result per analyte. The per-sample z-scores get collapsed into a per-analyte verdict, and the rules differ by accreditation:
- CAP and CLIA use an 80% rule: 4 of 5 samples acceptable means the analyte passes the event. One outlier sample doesn’t fail the round.
- ISO 15189 schemes commonly sit near 75%.
- Stricter schemes treat any failed sample as a failed event.
Then comes the rule with real teeth: consecutive failures. Under CAP and CLIA, repeated failure on the same analyte across testing events (the classic formulation: 2 of the last 3 events) escalates from “investigate” to a reportable condition that can threaten the authority to test. Two details labs get wrong here:
- The count is per testing event, not per sample — three bad samples in one mailing is one failed event, not three.
- The clock doesn’t reset just because time passed. Under a 2-of-3 rule, failing round 12 and round 14 still completes the pattern, even though round 13 passed.
This is also why review deadlines exist — CAP expects PT results to be reviewed within its defined window (30 days), and a graded report sitting unopened in an inbox is itself a finding. We’ve covered the scheme differences in detail in CAP vs CLIA proficiency testing.
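The two scoring steps described in this section, per-event grading and the 2-of-3 count, can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and real schemes have further cases (ungraded samples, non-participation) that are not modelled here.

```python
def event_passes(sample_ok: list[bool], required_percent: int = 80) -> bool:
    """True if enough samples in one testing event were acceptable."""
    if not sample_ok:
        raise ValueError("an event needs at least one sample")
    # Integer comparison avoids floating-point edge cases at exactly 80%.
    return sum(sample_ok) * 100 >= required_percent * len(sample_ok)


def two_of_three_failed(event_passed: list[bool]) -> bool:
    """True if any run of up to three consecutive events has two failures."""
    for i in range(len(event_passed)):
        window = event_passed[max(0, i - 2): i + 1]
        if window.count(False) >= 2:
            return True
    return False


# One outlier sample out of five: the event still passes.
one_outlier = event_passes([True, True, True, True, False])

# Three bad samples in one mailing: one failed event, not three.
bad_mailing = event_passes([True, True, False, False, False])

# Rounds 12, 13, 14: fail, pass, fail. The pattern stands.
pattern = two_of_three_failed([False, True, False])

print(one_outlier, bad_mailing, pattern)
```

Counting is done on the list of event outcomes, never on individual samples, which mirrors the first of the two details above.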
The traps
A few interpretation mistakes that show up over and over:
Wrong peer group. A z-score against all methods is meaningless if your method has a known bias relative to the field. Make sure the comparison group on the report is your method/instrument peer group, and that it’s large enough (small peer groups produce unstable SDs and erratic z-scores that say more about the group than about you).
Chasing single results. Recalibrating because one z hit +2.4 is how you add variance. Investigate, document, and watch the next round — unless the trend says it’s real.
Tiny SDs producing scary z-scores. With a very tight peer group, a clinically trivial absolute difference can produce |z| > 2. The z-score measures disagreement with peers, not clinical impact — both lenses matter.
Treating “acceptable” as “reviewed”. Most schemes — and CAP explicitly — expect the lab to review all PT results, not just failures. The review of a passing report with a worrying trend is where EQA earns its cost.
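The "tiny SDs" trap above is easy to see with numbers. In this sketch the values are invented for illustration: the same 0.10 mmol/L difference is scored against a tight peer group and a wider one.

```python
def z_score(result: float, peer_mean: float, peer_sd: float) -> float:
    """Distance of one result from the peer mean, in units of peer SD."""
    return (result - peer_mean) / peer_sd


result, peer_mean = 5.40, 5.30  # hypothetical glucose values, mmol/L

z_tight = z_score(result, peer_mean, peer_sd=0.04)  # very tight peer group
z_wide = z_score(result, peer_mean, peer_sd=0.15)   # more typical spread
percent_diff = (result - peer_mean) / peer_mean * 100

print(f"tight peer group: z = {z_tight:+.2f}")
print(f"wide peer group:  z = {z_wide:+.2f}")
print(f"absolute difference: {percent_diff:.1f}% of the peer mean")
```

The result differs from the mean by under 2% in both cases, yet it lands in the warning zone against the tight group and comfortably inside ±1 against the wide one. Whether a difference of that size matters clinically is a separate question that the z-score does not answer.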
A reading routine that takes ten minutes
When a graded report lands:
- Verdicts first — any unacceptable result starts the corrective action clock immediately.
- Warning zone next — anything with 2 < |z| < 3 gets checked against its history: first time, or part of a pattern?
- Trends third — scan each analyte’s trajectory for shift, drift, or widening scatter, including the comfortable passers.
- Sign and date the review — the review itself is an auditable event with a deadline. A reviewed-and-annotated report is evidence; an opened PDF is not.
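The first three steps of the routine amount to a sort order. The sketch below is one hypothetical way to build that work list from each analyte's z-score history; the analyte names and values are made up, and every history is assumed to contain at least one round.

```python
def triage(history: dict[str, list[float]]) -> list[tuple[str, str]]:
    """Order analytes for review by the band of their latest z-score."""
    labels = [
        "unacceptable: start corrective action",
        "warning: check against history",
        "acceptable: scan the trend",
    ]

    def band(z: float) -> int:
        a = abs(z)
        if a >= 3.0:
            return 0
        if a > 2.0:
            return 1
        return 2

    # sorted() is stable, so analytes in the same band keep their input order.
    ranked = sorted(history, key=lambda name: band(history[name][-1]))
    return [(name, labels[band(history[name][-1])]) for name in ranked]


worklist = triage({
    "glucose": [0.3, 0.7, 1.1, 1.6],
    "TSH": [0.1, -0.2, 2.4],
    "sodium": [0.5, 3.2],
})
for analyte, action in worklist:
    print(f"{analyte}: {action}")
```

Note that glucose sorts last here even though its history is climbing steadily, which is exactly why the third step looks at trajectories rather than the latest value alone.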
Do that every round and EQA stops being a quarterly exam you hope to pass. It becomes what it was designed to be: the only measurement system in the lab that tells you how you compare to everyone else running the same test, at no cost beyond the subscription you already pay for, and often several rounds before anything fails.