Why do assessment commentators denigrate examiners and exam marking?

Geoff Chapman
8 hours ago

The perennial issue of exam marking accuracy popped up in the UK Sunday papers last weekend. The article’s author wrote a book 4 years ago about a marking scandal. They claim to have a smoking gun and have continually dragged England’s exam regulator. The re-emergence and re-tread of their arguments are puzzling and poorly timed, causing maximum distress to learners and families.

From an assessment science and operational perspective, the core arguments in the opinion piece rely on a fundamental misunderstanding of what assessment is, coupled with an outdated view of how modern grading and marking actually work. Here’s what the article presented, and why it (ironically) misses the mark.

1) ‘1 in 4 exam grades is objectively wrong’

The major point made is that a regular examiner’s mark may differ from a senior examiner’s ‘definitive’ mark. On that basis, the article claims that 25% of all awarded grades are statistically ‘wrong’ and that candidates lose out unfairly.

This point relies on a creaky premise that subjective subjects (such as History, English, Art) have a single, perfectly objective, definitive true mark to achieve. They do not.

When evaluating complex, higher-order thinking, there’s a legitimate range of valid academic judgement. If an essay’s true quality sits somewhere between 14 and 16 marks, awarding a 14 isn’t ‘wrong’ just because a senior examiner might have awarded a 15.

Equating valid human variance with an ‘error rate’ does a disservice to qualitative assessment and examiners. Zero variance is available via multiple-choice questions (MCQs). But while experts can craft exceptional MCQs, the format’s reputation is that it tests only lower-order knowledge and recall. Switching to it would be perceived as a regression in educational quality and would trigger the dreaded dumbing-down call.

2) ‘Grading’s a lottery depending on who marks your paper’

It’s robustly argued that which side of a grade boundary a student falls on is pure luck, dictated entirely by the strictness or leniency of the examiner who ends up being assigned to mark that paper.

Sadly, this argument is becoming tenuous. With digitisation and techniques such as comparative judgement and item-level marking, candidates’ papers are increasingly marked by multiple examiners.

Through on-screen item-level marking, physical scripts are scanned, anonymised, and chopped up by question. Put simply, a single exam paper might be marked by ten different specialists. This helps reduce bias, eases the cognitive load of switching between questions, and keeps examiners aligned to what good looks like.

Dedicating examiners to single anonymised questions and learner answers helps reduce bias, eases cognitive loads, and keeps examiners on track

On-screen marking also uses seed scripts. These are pre-marked, benchmarked answers injected invisibly into an examiner’s queue. If an examiner’s marking drifts from the chief examiner’s standard on a seed script, the marking system flags it for peer or supervisor review. Like any seeding system, it makes marking a highly engineered quality-control pipeline. Whole-paper grading is not ruined by a single errant examiner. It is not a random draw.
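The seed-script check described above can be sketched in a few lines of Python. This is an illustration only, not how any particular awarding body or marking platform implements it: the `SeedResult` record, the `TOLERANCE` of one mark, and the `max_misses` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical tolerance: how far an examiner may differ from the benchmark
# mark on a seed before that response counts as a miss.
TOLERANCE = 1


@dataclass
class SeedResult:
    examiner_id: str
    item_id: str
    benchmark_mark: int  # mark agreed in advance by senior examiners
    awarded_mark: int    # mark the examiner gave, unaware it was a seed


def examiners_to_review(results, tolerance=TOLERANCE, max_misses=1):
    """Return ids of examiners with more than max_misses out-of-tolerance seeds."""
    misses = {}
    for r in results:
        if abs(r.awarded_mark - r.benchmark_mark) > tolerance:
            misses[r.examiner_id] = misses.get(r.examiner_id, 0) + 1
    return sorted(e for e, n in misses.items() if n > max_misses)


# Made-up seed results for two examiners marking the same question.
results = [
    SeedResult("ex-01", "Q3", benchmark_mark=6, awarded_mark=6),
    SeedResult("ex-01", "Q3", benchmark_mark=4, awarded_mark=5),
    SeedResult("ex-02", "Q3", benchmark_mark=6, awarded_mark=3),
    SeedResult("ex-02", "Q3", benchmark_mark=4, awarded_mark=7),
]
flagged = examiners_to_review(results)
print(flagged)  # ['ex-02']
```

In this sketch, ex-01 stays within one mark of the benchmark on both seeds, while ex-02 is three marks out on both and is routed to a supervisor. The tolerance also reflects the earlier point about a legitimate range of judgement: a one-mark difference is not treated as an error.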

3) ‘Grades are only reliable to one grade either way, making exams unfit for purpose’

The claim is that, because of standard statistical error, high-stakes decisions (such as university admissions) shouldn’t be made on a single exam result.

There is a strong argument for additional insight (an admissions test, viva exam, or personal statement) to support high-stakes decisions. But every psychometrician knows that assessments have a standard error of measurement. No test is perfectly precise. Sadly, the article (like many policy folk) denigrates or ignores the alternatives.
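For readers unfamiliar with the term, the standard error of measurement (SEM) in classical test theory is the score standard deviation multiplied by the square root of one minus the test’s reliability. The short Python sketch below shows the arithmetic. The standard deviation of 10 marks and reliability of 0.91 are made-up figures chosen for illustration, not values from any real exam.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed, sem, z=1.0):
    """Band of z standard errors either side of the observed mark."""
    return (observed - z * sem, observed + z * sem)


# Illustrative inputs only: SD of 10 marks, reliability coefficient of 0.91.
sem = standard_error_of_measurement(sd=10.0, reliability=0.91)
low, high = score_band(65, sem)
print(f"SEM = {sem:.1f} marks; an observed 65 sits in a band of about {low:.0f} to {high:.0f}")
```

Under these assumed inputs the SEM is about 3 marks. The point is that a band like this exists for every test, however well designed, so its existence alone does not make an assessment unfit for purpose.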

Every assessment method has error. During the pandemic, Teacher Assessed Grades were shown to be vastly less reliable across a national cohort. They are highly susceptible to unconscious bias against disadvantaged groups, or against kids that teachers really don’t like.

Marking anonymised exam papers reduces unconscious bias.

We’re dealing with systematic bias. Right now, exams remain the most equitable, blind, and scalable mechanism to mitigate this. However, the modern discourse is to consider a basket of digitally-enabled tools to spread and lower the cliff-face risk. Formative assessments can build a composite learner profile, lowering the ultimate pressure on the final paper while maintaining the rigour and retaining the trust between learner and exam owner.

4) ‘Report exact marks and a fuzziness metric instead of absolute grades’

Listing a raw mark and statistical range (such as 65 +/- 3), rather than an absolute grade, is not without merit. It is statistically purer, but fails the societal utility test.

Assessment doesn’t exist in a vacuum; it helps university admission officers and employers make high-volume, practical decisions and get people in post quicker. Introducing a ‘fuzziness index’ would slow admissions processes and trigger an endless, legally-fraught appeals culture (‘My score means I might have crossed the threshold, so you must admit me’). Our society is becoming more litigious, not less.
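To see why a reported range invites that kind of appeal, here is a minimal Python sketch that lists every grade a ‘mark plus or minus range’ could reach. The grade boundaries are hypothetical, invented for this example, and the function names are illustrative only.

```python
# Hypothetical grade boundaries as (grade, minimum mark), highest first.
BOUNDARIES = [("A", 70), ("B", 60), ("C", 50)]


def grade_for(mark, boundaries=BOUNDARIES):
    """Return the first grade whose minimum mark is met, or 'U' if none is."""
    for grade, minimum in boundaries:
        if mark >= minimum:
            return grade
    return "U"


def plausible_grades(mark, half_width, boundaries=BOUNDARIES):
    """List every grade reachable within mark +/- half_width, best first."""
    seen = []
    for m in range(mark + half_width, mark - half_width - 1, -1):
        grade = grade_for(m, boundaries)
        if grade not in seen:
            seen.append(grade)
    return seen


print(plausible_grades(65, 3))  # ['B']
print(plausible_grades(68, 3))  # ['A', 'B']
```

With these assumed boundaries, a candidate on 65 +/- 3 is a B either way, but a candidate on 68 +/- 3 can point to the A inside their range. Every candidate near a boundary would have the same argument, which is the high-volume problem for admissions officers.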

A Fuzziness Metric and health warnings for grades displayed on exam certificates create uncertainty. They ruin assessment currency and validity.

Grades act as necessary heuristics or shorthand. But to function at huge scale, they need to be blunt. Until e-portfolios, digital credentials, and similar granular information are better understood by the aforementioned gatekeepers, grades are still the de facto standard.

On reflection: a tired and outdated outlook that ignores what the (digital) assessment sector is actually doing

It’s becoming incredibly tiresome and patronising to young people to continually suggest that their (paper) exam certificate should carry a demeaning caveat of a ‘health warning’. It also sneeringly punches down on expert examiners, who give their valuable free time to the assessment sector.

Better to suggest how the assessment regime needs to change – taking advantage of a digitally-enabled toolkit and high-order subject matter expertise. Slapping a tawdry sticker on a certificate merely shows a paucity of ideas, and a barren understanding of the assessment sector’s art-of-the-possible. Once again, the public discourse is damaged by someone without operational experience of high-stakes assessment programmes. It’s easier to do a gotcha for a cheap headline (and sell their book), than it is to create a better assessment world for learners.

It’s reasonable to say that the school exam system hasn’t evolved with societal norms. Many families will say they are trying to modernise, often unknowingly, an archaic and somewhat arbitrary educational assessment system. It is ironic that an author who claims to find fault with the exam system is actually hindering progress towards fairness, validity, and reliability.