
This is a note that at some point might become an essay. It follows on from the previous note, 'Lonely', which applied contemporary accounts of vagueness to educational assessment. In these and other essays/notes I am trying to illustrate how philosophy of education might apply issues from mainstream metaphysics, epistemology and logic to educational questions. My hope is that people will see this as a fruitful approach even if they think my own execution is not up to scratch.
So we can think of assessment systems as gadgets for producing valid and reliable assessments. Validity here isn't logical validity; rather, it means that what is being assessed is the subject and not something else. Assessment engineers call the thing being assessed the construct. For example, if you're testing maths but the test requires good natural language skills, then the test isn't valid: it isn't testing only the salient construct of maths, it is also testing language. Reliability is the other chief requirement: the results of an assessment should be the same whoever marks it and wherever and whenever it is taken. Assessment engineers build gadgets to secure both.
The dominant gadget in use today adopts an additive model. It can be understood in terms of mereology, the study of the relationship of wholes to parts. The model identifies the essential parts that make up a whole. Each part is given a range of measurements and a threshold is agreed, which is the minimum measurement required to achieve the whole. Student performances are decided by adding up how many of the parts they have achieved and checking whether they have crossed the threshold.
The merits of such a system are clear. The parts are clearly set out and are invariable. The mechanism by which you reckon up whether a student has achieved what is required is also clear and simple. So long as the parts are genuine parts of the whole, which is the construct, then validity is secured. So long as the simple adding is done accurately then reliability is also secured.
What I want to do is show that despite this there is a better model which can also secure the requirements of validity and reliability but in a much more fine grained and rich way. It can make important distinctions that the additive model cannot. As such, it is a better model of assessment than our current system. If we think we should use the best model of assessment available then it follows that we should abandon our present system and replace it with the better alternative.
The new model draws on work in dependence logic by Johan van Benthem, plus insights from the vagueness literature of Kit Fine, Timothy Williamson and Roy Sorensen. To show how the alternative model works I shall throughout discuss how a typical essay about Macbeth might be handled in an English literature class in a secondary school.
So let's begin with the model, drawing on the work of the contemporary mathematical logician van Benthem. Imagine you are marking a set of Macbeth essays and, instead of thinking of them as “better or worse”, you treat each essay as a bundle of assessable features, the sort of things assessment rubrics name. Let the variables be features like X, Y, Z. For concreteness, let X be “control of textual evidence”, Y be “quality of interpretive claim”, and Z be “coherence of overall line of argument”.
Each variable can take a small range of values, say 0, 1, 2, where 0 is weak, 1 is adequate, 2 is strong. Now build a table like a spreadsheet. Each row is a possible essay profile, one way an essay might come out across those features. A row might say X=2, Y=1, Z=1, another might say X=1, Y=2, Z=2, and so on.
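For readers who like the bookkeeping explicit, here is a minimal sketch (in Python, my choice purely for illustration) of building that full table of profiles:

```python
# A full table of essay profiles: every assignment of 0, 1, 2 to the three
# features X, Y, Z. Nothing is ruled out yet.
from itertools import product

VALUES = [0, 1, 2]
rows = list(product(VALUES, repeat=3))  # each row is an (X, Y, Z) triple

print(len(rows))          # 27 possible profiles before any gaps are imposed
print(rows[0], rows[-1])  # (0, 0, 0) ... (2, 2, 2)
```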
van Benthem insists that the real action is not the variables themselves, it is the space of rows you allow. If you allow every conceivable combination of X, Y, Z values, then nothing depends on anything. A student could have brilliant interpretive claims with no textual evidence, or immaculate evidence handling with incoherent argument, and every combination is “permitted” by your model. In that gap free world, changing X does not force any change in Y or Z, because you can always find rows that hold Y and Z fixed while varying X.
That is independence. It is the logical idealisation behind an additive rubric fantasy, where each criterion can move up or down without affecting the others. But assessment practice does not look like that, even when rubrics pretend it does. In real marking, some combinations are impossible, or at least unstable. For instance, truly strong coherence of argument Z=2 is hard to sustain if textual control X=0, because the argument is meant to be anchored in the play’s language and scenes. Equally, a high level of interpretive quality Y=2 often brings with it, or demands, a certain minimum of textual evidence X≥1, otherwise it reads like free association.
Van Benthem’s key move is to represent that by allowing gaps in the table. A gap is a missing row, a combination you do not treat as a live possibility for competent Macbeth essays. Once you accept gaps, dependence appears.
Dependence is nothing over and above patterns of missing rows. van Benthem talks about a “fixing” test. Let's see what this is in Macbeth terms. Say you want to test whether the coherence of argument Z depends on the control of textual evidence X. What does that mean? It means: if two essay profiles agree on X, they must also agree on Z. In the spreadsheet picture, take any two rows that have the same X value, for example X=2, and check their Z values. If whenever X matches, Z matches, then Z is determined by X: Z depends on X. That is the strongest, cleanest form of dependence, a kind of functional determination.
This strict notion lets you see the logic clearly. It maps to a certain style of assessment claim that people often make implicitly, for example “once a student can handle quotation and close reading at a high level, the argument will be coherent”. That is a dependence claim, and the strict test asks whether it is really true in the data of possibilities you recognise.
Now notice how asymmetry enters. You might find that if you fix X=2, Z is indeed always 2, but if you fix Z=2, X might vary between 1 and 2. That would mean Z depends on X but X does not depend on Z. In plain Macbeth marking language, strong evidence handling might guarantee coherence, but coherence might be achievable with merely adequate evidence handling. So dependence need not run both ways; it can be a one way street. Dependence is an asymmetry you can read off allowed profiles.
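Here is the fixing test as a minimal sketch; the allowed rows are invented to exhibit exactly this asymmetry:

```python
# The "fixing" test: Z depends on X iff any two allowed rows that agree
# on X also agree on Z.

def depends_on(rows, a, b):
    return all(r[b] == s[b] for r in rows for s in rows if r[a] == s[a])

# An invented allowed space. X = evidence control, Y = interpretive claim,
# Z = coherence.
allowed = [
    {"X": 2, "Y": 2, "Z": 2},
    {"X": 1, "Y": 2, "Z": 2},
    {"X": 0, "Y": 1, "Z": 1},
]

print(depends_on(allowed, "X", "Z"))  # True: fixing X fixes Z
print(depends_on(allowed, "Z", "X"))  # False: Z=2 allows X=1 or X=2, no fixing back
```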
We then add transitivity. Suppose you find that interpretive claim quality Y depends on evidence control X, and coherence Z depends on interpretive claim quality Y. Then coherence Z depends on evidence control X. In Macbeth terms, if evidence handling fixes the level of interpretation, and interpretation fixes the level of overall coherence, then evidence handling fixes coherence. This is a very familiar explanatory pattern in English teaching; the difference is that the dependence logic forces you to be explicit about which “fixes” which, and whether those claims are stable when you look at the full range of allowed profiles.
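Transitivity can be checked mechanically too; another sketch with invented rows:

```python
# Transitivity of the fixing relation on an invented space:
# Y depends on X and Z depends on Y, so Z depends on X.

def depends_on(rows, a, b):
    return all(r[b] == s[b] for r in rows for s in rows if r[a] == s[a])

allowed = [
    {"X": 0, "Y": 0, "Z": 0},
    {"X": 1, "Y": 1, "Z": 1},
    {"X": 2, "Y": 1, "Z": 1},
]

assert depends_on(allowed, "X", "Y")  # evidence fixes interpretation
assert depends_on(allowed, "Y", "Z")  # interpretation fixes coherence
assert depends_on(allowed, "X", "Z")  # hence evidence fixes coherence
```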
This is “global dependence”, because it makes an all rows claim, which simply means that for any essay profile you treat as possible, the fixing relation holds. Alongside global dependence Van Benthem has “local dependence”. Global dependence is like saying: for all Macbeth essays that could occur in your marking world, fixing X fixes Y.
Local dependence is weaker and more situated. It says that at one particular essay profile, at one particular row, fixing X to the value it has in that row fixes Y to the value it has in that row. So picture a particular essay, call it Essay S. It has X=0, Y=1, Z=1. The local question is: at this essay, does Y depend on X? That means: among all allowed profiles that keep X at 0, is Y forced to stay at 1? If yes, then locally, at this essay, Y depends on X. If no, then at X=0 there are allowed profiles where Y can be 0 or 2, so locally Y does not depend on X at this essay.
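And the local test, sketched at Essay S with an invented allowed space:

```python
# Local dependence at a particular profile: among allowed rows that keep
# X at this row's value, is Y forced to this row's value?

def locally_depends(rows, row, a, b):
    return all(s[b] == row[b] for s in rows if s[a] == row[a])

allowed = [
    {"X": 0, "Y": 1, "Z": 1},   # Essay S sits here
    {"X": 1, "Y": 0, "Z": 1},
    {"X": 1, "Y": 2, "Z": 2},
]

essay_s = {"X": 0, "Y": 1, "Z": 1}
print(locally_depends(allowed, essay_s, "X", "Y"))     # True: at X=0, Y is forced to 1
print(locally_depends(allowed, allowed[1], "X", "Y"))  # False: at X=1, Y can be 0 or 2
```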
Why should we care? Because in many real assessment situations, dependence is not uniform across the whole space. The structure is patchy. At low evidence control, interpretation may be forced into a narrow range, but once evidence control is decent, interpretation can vary widely: you can use evidence competently while still making banal claims, or you can use it to make genuinely illuminating claims. That means local dependence might hold in one region of the space but not another.
Van Benthem calls the everywhere version global dependence, and says that global dependence is just local dependence holding everywhere. He shows that the “family of functions” does not have to be full. The “functions” are the complete essay profiles, assignments of values to all your features. A full function space would contain every combination of evidence control, interpretation quality, coherence, and so on. But in real marking practice, you do not think every combination is a live possibility. Some combinations strike you as incoherent as performances. For example, you might think a profile like X=0, Y=2, Z=2 is not a genuine possibility for a Macbeth essay, because high interpretation and high coherence cannot be achieved with zero textual control, at least not as an English literature performance rather than a speculative monologue. That means the row is missing.
That missing row is a gap. And gaps are dependence. This is the core: the dependency structure is exactly the pattern of impossibilities, and the impossibilities are what your assessment concept treats as “not a coherent performance”.
We can now begin to see what’s gone wrong with current assessment systems. They are largely additive systems. What does that mean? Additive rubrics act as if the space is full. There are no gaps. They treat each criterion as independently adjustable. They act as if any combination is possible and the overall grade is just a weighted sum. Additivity implicitly assumes closure under recombination. What does that mean? It means that if you can improve evidence control without affecting anything else, and improve interpretation without affecting anything else, then you can fuse those improvements and get an essay that is good in both ways. That is the assumption that the space of possible essays is closed under the fusion of improvements.
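To make the contrast concrete, here is what the additive model amounts to in code; the weights are invented placeholders, not any real mark scheme:

```python
# The additive model: the overall grade is just a weighted sum, and any
# combination of criterion levels is treated as a live profile.

weights = {"X": 0.4, "Y": 0.4, "Z": 0.2}  # invented weights

def additive_grade(profile):
    return sum(weights[k] * v for k, v in profile.items())

essay_a = {"X": 2, "Y": 1, "Z": 1}
essay_b = {"X": 1, "Y": 2, "Z": 2}

# Closure under recombination: fuse the improvements and the model happily
# grades the result, whether or not such an essay could exist.
fused = {k: max(essay_a[k], essay_b[k]) for k in weights}
print(additive_grade(fused))  # 2.0: the model never asks if the row is possible
```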
Van Benthem’s dependence logic begins by denying closure, by allowing gaps. When you allow gaps, you can represent the possibility that local virtues cannot always be jointly realised in a single coherent performance. Now we can re-express some familiar marking puzzles as dependence puzzles. Consider two essays. Essay A has strong textual evidence control but its interpretive claims are merely safe, so X=2, Y=1, Z=1. Essay B is uneven, it misquotes or handles scenes loosely, but it has a striking interpretive idea that reshapes how you see Macbeth, so X=1, Y=2, Z=2. A rubric might try to average them. But teachers often feel there is a real structural difference, not just a different distribution of parts. Van Benthem’s table helps you see why.
In the allowed space, a jump in Y from 1 to 2 might require a certain looseness in X for some students, because the interpretive leap is not yet disciplined by close reading. For others, the jump in Y comes with a rise in X. Those are different dependence profiles. The system that treats criteria as independent cannot represent those developmental or performance constraints.
There are abstract properties that can be applied. Reflexivity says: any feature is determined by any set that includes it. If you fix evidence control X, you have fixed X. In marking terms, that is trivial. Monotonicity says: if coherence Z is fixed by evidence control X, then it is also fixed by evidence control plus interpretation quality, {X,Y}. In marking language, if controlling evidence already pins down coherence, then adding extra information cannot unpin it.
Transitivity says: if interpretation Y is fixed by evidence X, and coherence Z is fixed by interpretation Y, then coherence Z is fixed by evidence X. In marking language, if the evidence handling determines the quality of interpretation, and the quality of interpretation determines the coherence of the argument, then evidence handling determines coherence. These are just the skeleton of how explanatory dependency claims work. And van Benthem’s representation theorem says that if your dependency talk respects those three structural constraints, you can model it as arising from gaps in a space of possible essay profiles. That is why the modelling framework has bite. It tells you that dependence talk is really, underneath, talk about which combinations of features your assessment world counts as possible.
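A sketch of dependence on sets of features, enough to see reflexivity and monotonicity at work (rows again invented):

```python
# Dependence on a set of features: b is fixed by A iff rows agreeing on
# every feature in A also agree on b.

def depends_on_set(rows, A, b):
    return all(r[b] == s[b]
               for r in rows for s in rows
               if all(r[a] == s[a] for a in A))

allowed = [
    {"X": 2, "Y": 2, "Z": 2},
    {"X": 1, "Y": 1, "Z": 1},
    {"X": 1, "Y": 0, "Z": 1},
]

assert depends_on_set(allowed, {"X"}, "X")       # reflexivity: X fixes X
assert depends_on_set(allowed, {"X"}, "Z")       # X alone fixes Z here
assert depends_on_set(allowed, {"X", "Y"}, "Z")  # monotonicity: adding Y cannot unpin Z
```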
I want to now look at the distinction between consequence and dependence because then we can see why a dependence logic is much richer than a consequence relation. A consequence relation is like saying: if an essay satisfies these premises, it must satisfy this conclusion. For instance, if an essay is coherent and well evidenced, then it meets the top band. That is a truth preservation claim because it cares about one designated value - meeting the standard. I think much assessment today works along such lines.
Dependence is richer. It cares about all the values, and it cares about determination, not merely about preserving success. What’s determination? In assessment terms, it’s the key element of dependence. Dependence asks a stronger question than a consequence relation does. It asks, for any possible level of evidence control, does it determine the level of coherence? This is more like saying: any answer to one question fixes the answer to another question. It is “question to question” dependence, not just “if you pass these conditions, you pass”.
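The contrast can be put in code; the rows are invented, and the two tests below are my glosses on “consequence” and “dependence”, not anything from van Benthem verbatim:

```python
# Consequence versus dependence on the same invented space.
# Rows are (evidence, coherence) pairs.

rows = [(0, 0), (1, 0), (1, 1), (2, 2)]

# Consequence-style claim, one designated value: "if evidence is top level,
# the script meets the standard (coherence == 2)".
print(all(c == 2 for e, c in rows if e == 2))  # True: truth is preserved

# Dependence-style claim, all values: "every evidence level fixes the
# coherence level".
print(all(len({c for e2, c in rows if e2 == e}) == 1 for e, _ in rows))
# False: evidence level 1 leaves coherence open, so no determination
```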
This is why I find dependence logic attractive, because I’m trying to diagnose where rubrics distort constructs. What does that mean? Construct distortion often happens when the system treats one feature as independent, when in the actual performance domain it is structurally coupled to another. Dependence logic gives you a way to articulate that coupling without immediately collapsing into subjective impression.
Let’s go back to the example of assessing a Macbeth essay. Take a very simplified rubric space. Let X be “textual warrant”, Y be “interpretive depth”, Z be “argument structure”. Suppose in your professional judgement, for essays in the bottom range, if textual warrant is 0 then interpretive depth cannot exceed 1, because all its claims will float free of the very thing it’s supposed to be discussing. That corresponds to gaps: rows with X=0 and Y=2 are missing. But in the upper range, if textual warrant is 2, interpretive depth can be 1 or 2, because you can do close reading in a routine way or an illuminating way. That means among rows with X=2, Y is not fixed. So globally, Y does not depend on X, because fixing X at 2 does not fix Y. Locally, at rows with X=0, Y may depend on X. This is exactly the local versus global distinction made meaningful in assessment terms.
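Run mechanically on rows invented to match that description:

```python
# The simplified rubric space: rows with X=0 and Y=2 are gaps, rows with
# X=2 leave Y free. Pairs are (textual warrant, interpretive depth).
rows = [(0, 0), (0, 1), (2, 1), (2, 2)]

print({y for x, y in rows if x == 0})  # {0, 1}: at X=0, Y is capped below 2
print({y for x, y in rows if x == 2})  # {1, 2}: at X=2, Y is not fixed,
                                       # so there is no global dependence
```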
Now consider a “lonely” Macbeth response. A lonely response is one that does not sit comfortably in the existing space of allowed profiles. It may look, by the rubric, like X=0, Y=0, Z=0, a complete fail. But a teacher senses something else, an organisational virtue, a kind of intellectual move that is not captured by the rubric variables. In the dependence picture, that often means your variable set V is inadequate.
You are missing a variable, call it W, “transformative reframing”, the ability to change what the question is really asking. In a full space, you might assume W can be high even when X is low, but in your current rubric, because W is not represented, its effects get forced into the existing variables and misclassified. Once you introduce W, the gap pattern changes. Some profiles that were previously treated as impossible become possible, because what looked like incoherent argument Z=0 is actually coherence relative to a reframed question, coherence in W space. The lonely essay was not off the map, it was just that the map was too small.
That is the dependence logic way of saying that the construct is being deformed by what you chose to sample and count. So the dependence machinery gives me a diagnostic workflow. It tells me what to do.
First, list the variables your rubric actually tracks. That is your feature set V. Second, infer from real marking practice, exemplars, and disagreements which combinations are treated as impossible or unstable. Those are your gaps. Third, read off dependence claims from those gaps. Where fixing X eliminates variation in Y, you have a dependence.
Fourth, test whether the rubric’s additive assumptions correspond to a full space, and where they fail, ask whether the failure is noise or a sign that the construct is not decomposable that way. Fifth, and this is where the lonely case lives, ask whether the rubric’s repeated failure to capture teacher recognised value indicates a missing variable or a mis-specified dependency structure. In other words, the system thinks “if X is low then overall grade must be low”, but in fact there may be a route where W is high and that shifts how X and Z should be interpreted. That is a new dependency graph, not a subjective exception.
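Steps two and three can be sketched directly; the “observed” exemplar profiles below are invented for illustration:

```python
# Infer gaps from exemplar marking data, then read off dependence claims
# from the surviving rows.
from itertools import product

FEATURES = ["X", "Y", "Z"]
VALUES = [0, 1, 2]

# Step 2: (X, Y, Z) profiles actually treated as live in marking practice.
observed = {(0, 0, 0), (0, 1, 1), (1, 1, 1), (1, 2, 2), (2, 2, 2)}
gaps = set(product(VALUES, repeat=3)) - observed
print(f"{len(gaps)} of 27 combinations are gaps")

# Step 3: where fixing one feature eliminates variation in another,
# there is a dependence.
def depends_on(rows, i, j):
    return all(r[j] == s[j] for r in rows for s in rows if r[i] == s[i])

for i in range(3):
    for j in range(3):
        if i != j and depends_on(observed, i, j):
            print(f"{FEATURES[j]} depends on {FEATURES[i]}")
```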
If you do all this then you are not just grading essays, you are working inside an implicit space of possible Macbeth essay performances. A rubric is a crude coordinate system on that space. Van Benthem’s dependence logic tells you that the interesting structure of that space is where combinations are missing, where the coordinate system cannot represent certain performances as coherent, and where determination relations hold only locally, in some regions, not globally.
Now some people will read this and say that they knew all this without the logic and the ‘dependency as gaps’ structures. But what I would say is that by regimenting thoughts already accepted by many assessors we avoid treating it as a hand wavy subjective matter whether we adopt these assumptions or those of consequence relations and additivity. If we can show what the logic of additive consequentialist assessment is, and compare it to that of dependency, we can then see which is the richer, more powerful and more fine grained explanatory theory for assessment. It's clear that dependency is much richer than the additive consequentialist system because it handles which combinations within a construct rubric are possible or impossible. Importantly, it gives a regimented account of how ‘lonely’ candidates can be better accommodated within an assessment system, by being much more fine grained about what makes a good performance.
So we can now summarise the van Benthem machinery. When van Benthem says “gaps are dependence”, he means that some combinations are impossible. That’s it. Imagine a switch and a light:
Case A: no gaps, no dependence. Suppose every combination is allowed:

| X (switch) | Y (brightness) |
| --- | --- |
| On | Bright |
| On | Dark |
| Off | Bright |
| Off | Dark |

Every combination exists. That means there is no dependence, because nothing is ruled out.
Case B: there are gaps. Now suppose the only possible rows are:

| X (switch) | Y (brightness) |
| --- | --- |
| On | Bright |
| Off | Dark |

The other two rows are missing. Those missing rows are the gaps. Now look what happens. If I fix X = On, Y must be Bright. If I fix X = Off, Y must be Dark. Y is completely determined by X, and that determination exists only because certain combinations are missing. If the “On + Dark” row existed, Y would not depend on X. So: dependence = missing combinations. No missing combinations, no dependence. Missing combinations, and dependence appears. That is all “gaps are dependence” means.
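The switch example, run through the same fixing test:

```python
# Case A has no gaps, Case B has gaps; only Case B yields dependence.

def depends_on(rows, a, b):
    return all(r[b] == s[b] for r in rows for s in rows if r[a] == s[a])

case_a = [("On", "Bright"), ("On", "Dark"), ("Off", "Bright"), ("Off", "Dark")]
case_b = [("On", "Bright"), ("Off", "Dark")]

print(depends_on(case_a, 0, 1))  # False: nothing is ruled out, no dependence
print(depends_on(case_b, 0, 1))  # True: the missing rows create the dependence
```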
Now let’s translate this into Macbeth assessment. Let’s use just two features: X for textual evidence (Low or High) and Y for interpretive quality (Weak or Strong).
Now imagine all combinations are possible:

| X (evidence) | Y (interpretation) |
| --- | --- |
| Low | Weak |
| Low | Strong |
| High | Weak |
| High | Strong |

That means a student could have Strong interpretation with Low textual evidence. In that model, interpretation does not depend on evidence.
Now introduce realistic gaps. In real marking, teachers might feel this combination is not genuine: Low evidence + Strong interpretation, because deep interpretation in English literature normally requires textual support. So we remove that row. Now the table is:

| X (evidence) | Y (interpretation) |
| --- | --- |
| Low | Weak |
| High | Weak |
| High | Strong |

Notice what happened. Now, if X = Low, Y must be Weak. There is no allowed combination where X = Low and Y = Strong. So interpretive quality depends on evidence at the low level. The dependence appears because we removed a combination. The gap created the dependence.
The key insight: dependence is not a force. It is not a hidden mechanism. It is just this: when some combinations are not allowed, fixing one variable restricts the others. If everything were allowed, nothing would depend on anything.
Why does this matter for assessment? Additive rubrics silently assume something like the “no gaps” world. They assume:
- You can improve interpretation without affecting evidence.
- You can improve coherence without affecting interpretation.
- Each criterion moves independently.
But real performances don’t behave like that. Certain combinations feel unstable or incoherent, for example strong interpretation with no textual evidence, or strong coherence with no textual control. Those are gaps in the real performance space. And those gaps mean that fixing one feature constrains the others: dependence.
The lonely student example. Suppose your rubric tracks X (textual evidence), Y (interpretive quality) and Z (coherence). But a student writes something strange: by those variables it scores low across the board, yet it contains a striking intellectual move the rubric does not name. The rubric may treat this as impossible: low X should imply low Y and low Z. That assumption is a gap built into the system. But perhaps the real space of possible intellectual performances includes a missing variable: W, transformative reframing.
When W is high, different combinations become possible. The “gap” was not in the student’s performance. The gap was in the model of assessment.
Summary: think of dependence like this. If all combinations are allowed, nothing depends on anything. If some combinations are forbidden, dependence appears. So when van Benthem says gaps are dependence, he means that the structure of what is impossible is exactly what creates determination relations. No metaphysics is required, just missing rows in a table.
Now let’s look at the distinction between global and local dependence. Let’s simplify to just two variables: X for textual evidence and Y for interpretive quality.
Step 1: start with a “full” space, no gaps.

| X | Y |
| --- | --- |
| Low | Weak |
| Low | Strong |
| High | Weak |
| High | Strong |
Every combination is allowed. Now ask: does Y depend on X? Test it. Fix X = Low: Y can be Weak or Strong, so Y is not determined. Fix X = High: Y can be Weak or Strong, still not determined. So there is no dependence at all. This is what additive rubrics silently assume: you can vary each criterion independently.

Step 2: introduce a realistic gap.
Suppose, in your professional judgement, this combination is not a genuine possibility: Low evidence + Strong interpretation. So remove that row. Now the table becomes:

| X | Y |
| --- | --- |
| Low | Weak |
| High | Weak |
| High | Strong |
Now test dependence again. Fix X = Low: only one row fits, so Y must be Weak. At Low X, Y is determined. Fix X = High: two rows fit, so Y can vary. So globally Y does not depend on X, but locally, at Low X, Y does depend on X. This is the difference. In real marking, dependence is often local, not global. For weaker essays, textual control may determine the interpretive ceiling. For stronger essays, interpretation can vary even with strong evidence. So dependence is regional in performance space. That is what van Benthem’s local/global distinction captures.
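The same walkthrough as code:

```python
# Global dependence fails, local dependence holds at Low X.
rows = [("Low", "Weak"), ("High", "Weak"), ("High", "Strong")]

def globally_depends(rows, a, b):
    return all(r[b] == s[b] for r in rows for s in rows if r[a] == s[a])

def locally_depends(rows, row, a, b):
    return all(s[b] == row[b] for s in rows if s[a] == row[a])

print(globally_depends(rows, 0, 1))                  # False: Y varies at High X
print(locally_depends(rows, ("Low", "Weak"), 0, 1))  # True: at Low X, Y must be Weak
```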
I want to add to this by connecting it to Kit Fine’s notion of non-compossibility.
Assume three variables: X for textual evidence, Y for interpretive quality, Z for coherence. Additive thinking assumes: if X is good, and Y is good, and Z is good, then the essay is good. But this assumes something deeper. It assumes that the features are compossible. Compossible means they can exist together in one coherent whole. Fine’s idea of non-compossibility is that some features cannot fuse into a single coherent state, even if each feature individually is good. Let’s go back to Macbeth. Imagine Essay A: a bold, exploratory interpretation whose speculative energy destabilises its coherence. And Essay B: a tightly coherent argument whose coherence comes from interpretive restraint.
Now suppose someone says, ‘Let’s just take the good parts of A and the good parts of B and combine them.’
We’ll get: X = High, Y = High, Z = High. But here’s the Finean point: that fusion may not actually be possible. Why? Because in Essay A, the high interpretation was exploratory and speculative, which destabilised coherence. In Essay B, coherence came from interpretive restraint. Those two virtues might not be structurally compatible in the same performance. So they are individually possible but not jointly compossible.
That is exactly what a “gap” in the table represents. The missing row, High X + High Y + High Z, is not missing by accident. It is missing because the performance structure does not permit that fusion. If so, then gaps = non-compossibility.
Van Benthem shows gaps create dependence.
Fine shows gaps represent structural impossibility of fusion.
They are describing the same structural phenomenon from two angles.
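Here is a sketch of the two angles at once; the profiles are invented to match Essays A and B:

```python
# Fuse the componentwise best of two allowed essays and ask whether the
# fusion is itself an allowed profile. (X, Y, Z) = (evidence, interpretation,
# coherence), levels 0-2.

allowed = {
    (2, 2, 1),  # Essay A: bold, speculative interpretation, shaky coherence
    (2, 1, 2),  # Essay B: tight coherence via interpretive restraint
    (1, 1, 1),
}

essay_a, essay_b = (2, 2, 1), (2, 1, 2)
fusion = tuple(max(x, y) for x, y in zip(essay_a, essay_b))

print(fusion)             # (2, 2, 2)
print(fusion in allowed)  # False: individually possible, not jointly compossible
```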
Why This Challenges Additive Grading
Additive grading assumes a total score: Score = X + Y + Z.
That assumes that each criterion can be improved independently and that any combination of levels is achievable.
But if gaps exist, then some improvements require other improvements. Some improvements block others. Some virtues distort others. This means the space is structured, not flat. Additive grading flattens it.
Consider a lonely response to a Macbeth essay. The rubric might say:

- Low evidence = Band 3
- Strong interpretation = Band 5
- Moderate coherence = Band 4
- Average = Band 4
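The arithmetic the rubric performs here is just this:

```python
# The additive calculation on the lonely response: average the bands.
bands = {"evidence": 3, "interpretation": 5, "coherence": 4}
print(sum(bands.values()) / len(bands))  # 4.0: the structure of the essay is invisible
```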
But what if the strong interpretation actually reorganises the whole question? What if the coherence is unconventional but internally tight? What if the evidence weakness is surface level rather than structural? Additive marking cannot see this because it assumes independence. Dependence logic forces you to ask whether this interpretation actually depends on evidence, or whether it is grounded differently.
Fine’s structure lets you ask whether these features are compossible in this form or whether the rubric is misspecifying the space. Now we can unify them. Van Benthem models how dependence appears when combinations are missing. Fine shows how missing combinations represent structural non-compossibility.
Applied to assessment we can now argue that additive rubrics assume no structural constraints. But we know that real intellectual performances have structural constraints. So we can now regiment these ideas and say that additive grading assumes a full function space whereas real intellectual achievement lives in a space with gaps. That means that additive grading misrepresents the construct.
Imagine assessment as a 3D space. The additive model is a full cube. All combinations are possible. The Fine / van Benthem model is a carved sculpture by Barbara Hepworth. Some regions are empty. Some features cluster. Some combinations cannot exist. Dependence logic is simply mapping the shape of the sculpture.
We can now model what a borderline case is in assessment. Imagine a student is sitting between, say, a 4 and a 5. The system then encourages you to look for a decisive local feature, a missing quotation, a thin contextual point, a paragraph that does not link back, and you treat the borderline as if it were a tiny ambiguity about a particular element. The hidden picture is, “if we just settle this small point, the grade follows.”
The dependence and gap picture tells you that this is often the wrong geometry. A borderline does not always appear because one local detail is unclear. It often appears because the work sits in a region where the criteria interact in a structured way. The easiest way to see this is to take three common rubric dimensions for Macbeth. Call them E for evidence and reference to the text, I for interpretation and conceptual depth, and C for coherence, meaning the way the essay hangs together as a single act of thinking rather than a list of points.
In an additive rubric, you imagine that you can fix E, fix I, fix C, and the grade is a simple function of those three scores. But your real experience of marking is already evidence that the function is not simple. You see essays where the evidence is plentiful but the interpretation is flat, and you see essays where the interpretation is sharp but the evidence is thin, and you also see essays where coherence is the decisive feature. The crucial point is that coherence is not just another part score. Coherence is often a whole act property. It is not located in any one criterion, it is in the way the criteria are realised together. We can recognise and model this in a disciplined way if we bring in the idea of gaps in the space of performances.
A gap is where there are combinations of E, I, and C that the rubric pretends are possible and stable, but in the actual practice of writing and reading, they are not. For instance, suppose we notice a regular pattern at the lower end. When E is very weak, I tends not to be genuinely strong. Students can make bold claims, but without textual anchoring those claims do not function as interpretation, they function as free association. So for weaker essays there is a dependence. Fix E at a low level and you effectively fix I to a low ceiling. That is a structural fact about what counts as interpretation in the discipline.
It is a local dependence, local because it holds in that region of the space. It may cease to hold at higher levels, where excellent evidence control still permits variation in interpretive depth. This reshapes borderlines. A borderline between a 4 and a 5 is often treated as “is this point enough for a 5?” But on the gap picture, the borderline question to be answered is more like: “does this essay occupy the region where interpretation is genuinely constrained by evidence, or has it crossed into the region where interpretation is free to be strong because evidence is strong enough to support it?” In other words, the borderline is a boundary between regions of dependence, not merely a local ambiguity about one line.
This is where Fine’s global vagueness presses in. Fine’s thought is that vagueness is not always a matter of an isolated borderline case. It can be a property of a whole range. What that means is that it is not always that one essay is “the borderline essay.” Instead, the grade boundary can be a property of the series of essays you have in front of you. You can be in a situation where no single script is intrinsically borderline, but the set of scripts does not allow a clean cut off that respects all the structural relations you care about. This is exactly what teachers report in moderation. They can be confident about the extremes, certain pass and certain fail, but the middle refuses to yield a stable threshold without distortion.
Sorensen’s forced march idea adds to this insight. A forced march is when the system demands a yes or no at every step, “is this a 4 or a 5, is this a 4 or a 5”, and so on, even though competent judgement would include suspending judgement on some cases until you see the wider pattern. In marking, the equivalent is a rank order where you cannot responsibly identify the first 5 from the last 4. Yet the system requires you to do it anyway. The forced march produces the appearance that there must be a specific borderline script, but that is an artefact of the demand for complete classification.
So borderlines in educational assessment are often treated as if they were local ambiguities. The dependence and Finean global picture says many borderlines are structural. They arise because the performance space has gaps and region boundaries, and because the judgement is global over a range rather than intrinsic to one item. That gives us a new style of borderline decision. Instead of asking, “is this one criterion just high enough?” you ask, “which region of the performance space does this script inhabit, given how these dimensions constrain each other here?” That makes the borderline decision less like a coin flip and more like placing the performance in the right structural neighbourhood.
Reliability, standardisation, and consistency are key elements of assessment systems, and I argue that the van Benthem/Kit Fine structural approach is helpful in securing each. Reliability in most assessment discourse is treated as if it were primarily about reducing noise, getting markers to apply the same rules in the same way, tightening rubrics, adding exemplars, training. That assumes there is a stable decision procedure that can be applied locally to each script, and the variation is mainly human error. The gap picture changes what reliability is tracking. If the construct itself has a structured space with region boundaries and missing combinations, then disagreement is not always marker error. Some disagreement is the symptom of a misfit between the rubric’s geometry and the construct’s geometry.
Here is the simplest way to see this. If your rubric assumes full independence, then two markers should converge by scoring parts and summing. But if the performance contains a whole act property that is not decomposable, then you will see persistent disagreement even among competent markers, because they are not disagreeing about whether a criterion is met, they are disagreeing about how the criteria fuse into a coherent intellectual act.
This is where Fine’s talk of fusion, closure, and non-closure helps. In an additive world, you assume closure, meaning if a script has feature A and feature B, then it has the combined feature A and B in a coherent way. But in real intellectual work, closure can fail. You can have local virtues that do not fuse into a global virtue. A script can have plenty of quotations and plenty of conceptual vocabulary and still not have an argument. The rubric can mistakenly award points for the presence of parts while missing that the parts do not compose. Reliability programmes often respond to this by narrowing criteria to what can be reliably observed. That is how constructs get distorted. Poetry writing gets dropped because it is harder to standardise, even though it is central to English as a discipline. Context becomes a checkbox because it is easy to standardise, even though the most interesting work sometimes transforms the question in a way that makes the checkbox irrelevant. The drive for reliability can deform validity.
So how do you preserve reliability without flattening the construct? You need a different conception of reliability. Instead of reliability meaning “same local scores for the same local features,” reliability becomes “stable placement in the same region of the performance space under competent judgement.” That can still be operationalised. You can design moderation around region placement rather than micro scoring. A practical way to do this is to treat exemplars not as fixed anchors for each band, but as maps of region structure. You show that in this region of scripts, evidence constrains interpretation strongly, and in that region it does not. You train markers to recognise the shift. That yields a kind of reliability that is sensitive to dependence. You are not asking markers to be identical calculators, you are asking them to share a structural model.
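One way to operationalise region placement, offered only as a hedged sketch; the region names and thresholds are invented, not part of any existing moderation scheme:

```python
# Moderation by region placement rather than micro scoring. Two markers are
# "reliable" if they place a script in the same region, even if their part
# scores differ. In this toy sketch only the evidence level drives placement.

def place(evidence, interpretation, coherence):
    if evidence == 0:
        return "constrained"  # evidence caps what interpretation can be
    if evidence == 1:
        return "coupled"      # interpretation still tracks evidence closely
    return "free"             # interpretation can vary independently

marker_1 = place(2, 1, 2)
marker_2 = place(2, 2, 2)    # disagrees on the part scores...
print(marker_1 == marker_2)  # True: ...but agrees on the region
```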
I also want to address concerns about vagueness, and to bring in Williamson and Sorensen’s epistemicism strategically, because I want to retain the idea that there is a fact of the matter about what the grade is. Epistemicism says there is a sharp boundary (between the highest fail and the lowest pass) even if we cannot know where it is. In assessment terms, this can be used as a discipline against complacency. It prevents the easy slide into “it is all subjective anyway.” It preserves the idea that the construct is real, that excellence is not invented by the rubric. That is the rhetorical and philosophical value of epistemicism in this context. But epistemicism also shows that if the boundary is sharp but unknowable, then the aspiration to perfect classification is impossible. That makes the forced march problem explicit. The system’s demand for complete determinacy at each script is a demand for something human practice cannot supply. Epistemicism can therefore be used to criticise overconfident standardisation. It says, even if there is a truth, your procedures cannot access it.
Fine’s global view then softens the conclusion in a useful way. It says the indeterminacy is not necessarily located at an individual script level. It can be a property of the range and the demand for universal classification. That position does not require you to say there is a hidden sharp boundary you can never know. It allows you to say that the classification task itself creates indeterminacy when it forces you to impose a complete cut off across a range that does not support it.
For these reasons, reliability needs to be reframed. It should not be equated with ever more local precision and ever more rigid rubrics. It should be designed as stability of judgement under a model of dependence and fusion, coupled with procedural humility about forced march classification demands. In policy terms, you can aim for reliable region placement and defensible moderation narratives rather than false micro exactness.
This approach can also address issues about teacher expertise, moderation, and “connoisseurship” and treat them as disciplined perception rather than just subjectivity. In many current systems, assessment expertise is treated as generic, “can apply the rubric, can follow the mark scheme, can standardise.” Disciplinary knowledge is valued, but it is often reduced to factual content, “knows the context, knows the terminology.” What gets lost is a kind of expertise we might call modal connoisseurship. This is the ability to sense which combinations of features are genuinely possible as a coherent performance, and which are pseudo combinations produced by surface compliance.
A teacher with deep disciplinary formation in English literature tends to notice when an essay has a living line of thought even if it breaks a rubric expectation, and tends to notice when an essay is a dead assembly even if it ticks boxes. That is sensitivity to structure, to what follows from what, to what is doing explanatory work and what is decorative. In Fine’s terms, it is sensitivity to grounding and to fusion failure. In van Benthem’s terms, it is sensitivity to dependence, to what is determined by what in this region of performance space.
We can connect this to what I’ve called the “lonely” student performance, inspired by Sorensen’s lonely thought experiment. Lonely is meant to test Fine’s requirement that vagueness involves at least two cases, at least a range; we can intuitively grasp this when we think of a borderline as lying between two cases. The lonely case asks whether a single object could be vaguely lonely without a range. In assessment terms, a lonely response is a response that does not sit comfortably in the usual comparison class. It may be a low scoring script by rubric features, but it has a distinctive excellence that is not captured by the standard dimensions. It is “lonely” because it lacks peers in the script population, or because it realises the construct in an unusual configuration.
Teachers recognise these. They can see that the student is failing the formal requirements but know there is something here. The system then pressures them to treat that as bias, halo effect, or softness. The Finean and dependence apparatus gives you a way to treat it as a legitimate structural phenomenon. How? You treat the lonely excellence as a feature whose usual companions are missing. That is, the student has an unusual strength that is not fused with the ordinary supports. In a typical top band Macbeth essay, originality is fused with control of evidence, organisation, and conceptual framing. The lonely essay might have the conceptual move without the surface polish. Additive rubrics punish this because they treat missing companions as decisive. Fine’s framework says we should ask whether the conceptual move is a genuine verifier of high level understanding even if other verifiers are absent. It also says we should consider whether the missing companions are truly required by the construct, or are merely required by the current measurement regime.
This is where validity issues return. Validity is about whether you are measuring the construct you claim to measure. If the construct is “literary understanding of Macbeth,” and if part of that includes the ability to make a powerful re-description of a scene or a motif, then a lonely essay or piece of creative writing might display that construct even if it fails the usual proxy indicators. The rubric then distorts the construct by operationalising it as a sum of proxies. A Finean response is not to abandon reliability, it is to redesign what is made reliable. You build a controlled route for recognising lonely excellence.
For example, you include in the rubric a criterion that is explicitly non additive and explicitly holistic, such as “transformative insight that reconfigures the question or re-organises textual detail into a new explanatory pattern.” Then you do not score it by counting features. You score it by moderated exemplars and by narrative justification constrained by discipline norms. That can be standardised by training and moderation without pretending it is mechanical.
This also reframes the debate about powerful knowledge. Powerful knowledge discourse can become fact centric, as if disciplinary power is mainly possession of information. The Finean angle suggests that disciplinary power is about modal perception, seeing what is possible, what follows, what would count as a reason, what would count as evidence, what configurations of claims and quotations form a coherent intellectual act. That is precisely what experienced teachers develop, and precisely what generic skills frameworks struggle to acknowledge.
So teacher expertise becomes more central. If you accept dependence, fusion failure, and lonely achievement as real features of the performance space, then you cannot outsource judgement to additive rubrics alone. You need disciplined modal connoisseurship, but you can also regiment it, not by turning it into points, but by making it accountable through shared models, exemplars that illustrate region boundaries, and explicit narrative warrants that are evaluated in moderation.
So now we can see the connection between how we might understand borderline cases, reliability and teacher expertise. Borderline decisions become decisions about region placement in a structured space rather than local ambiguity resolution. Reliability becomes stability of region placement and warranting practices rather than micro scoring convergence. Teacher expertise becomes the disciplined capacity to perceive dependence, fusion, and lonely excellence, and to justify it within shared disciplinary norms.
Pedagogy changes when operating with this dependency model. If we take van Benthem seriously, then the important thing is not the criteria in isolation but the shape of the space of possible performances. Some combinations are not genuine possibilities. Some features determine others in certain regions. Some apparent virtues cannot actually coexist in a coherent act of thinking. Some strange performances, the lonely ones, reveal that the rubric has omitted a crucial variable. Once that is the picture, the essay is no longer a bag of features. It becomes a structured whole whose parts only make sense through their relations of dependence, support, tension, and sometimes incompatibility.
That changes teaching immediately. At present, a great deal of teaching in schools and universities is pushed towards criterion harvesting. Students are coached to display the signals attached to each box in the mark scheme. They are trained to produce visible evidence of evidence use, visible evidence of interpretation, visible evidence of context, visible evidence of structure. Even where teachers know this is reductive, the assessment regime rewards it.
Under a dependence model, the teacher’s task would change from distributing local techniques across a rubric grid to inducting students into the structure of the performance space itself. The question would no longer be merely, “have you included textual evidence?” but, “what kind of interpretive move does this evidence make possible here?” It would no longer be merely, “have you developed a coherent argument?” but, “what is the relation between the conceptual ambition of this reading and the degree of textual control that could sustain it?” In other words, teaching would become more relational, more modal, and more disciplinary.
A dependence based approach does not merely make marking more sophisticated after the fact. It feeds back into what is taught, because once one accepts that some features depend on others, pedagogical sequencing changes. If strong interpretive depth at one region of the space presupposes a certain level of textual control, then teaching cannot simply offer interpretation and evidence as parallel skills. It has to treat them as structurally linked. If coherence is not an independent add on but the whole act of bringing textual detail and conceptual framing into a unified line of thought, then coherence cannot be taught as a paragraphing skill alone. It has to be taught as the shaping of intellectual movement across a piece of writing. If some dazzling but unstable interpretive gestures are non-compossible with disciplined argument at an early stage, then teachers need to know that, not to suppress risk, but to recognise which forms of risk are developmentally live and which are still hollow. What this does is pressure assessment to be part of the process of writing the essay in real time, as it happens, as recommended by formative assessment processes.
This would alter the classroom atmosphere too. At present, much rubric driven pedagogy trains students to think that excellence is the accumulation of visible tokens. Quote more, mention context, embed terminology, signpost your points, conclude clearly. There is some use in all of that, but it often creates a flat view of intellectual life. A dependence model would push teaching towards the internal economy of a judgement. Why does this quotation matter here? What does this claim commit you to? What follows if you describe Macbeth’s imagination in this way? What kind of coherence has been achieved, mechanical, rhetorical, conceptual, dramatic? That is closer to the actual discipline and, as such, captures the construct being taught.
In practice, this means the assessment system would need to look not only at whether a script contains certain features, but at whether the relations among features are warranted. A teacher might need to justify that this essay’s coherence is genuine rather than cosmetic, because the interpretation is actually supported by the pattern of textual selection and deployment. Or that this apparently weak script contains a lonely excellence, a reframing move, a novel conceptual reorganisation, that reveals a missing variable in the rubric and therefore should not simply be averaged down. Or that two scripts with similar surface properties occupy different regions of the performance space because in one case evidence constrains interpretation strongly, while in the other the student has crossed into a region where evidence is strong enough to permit genuine conceptual freedom.
Once that becomes central, the kind of writing that prospers is no longer merely the writing that looks right. It is the writing whose internal determinations can be tracked. And tracking is the key word. A dependence based regime would require tracking in at least three senses.
First, it would require tracking developmental dependencies. Teachers would need to see how certain capacities open or constrain others over time. A student’s interpretive depth would not just be scored at a moment. It would be located in relation to their actual control of textual warrant, conceptual risk, and argumentative integration.
Second, it would require tracking within script dependencies. Markers would need to see not just that an essay has quotations and claims, but which claims are fixed by which textual choices, which parts of the essay are doing genuine explanatory work, and where surface virtues fail to fuse into a whole act.
Third, it would require tracking exceptions and lonely performances. Instead of treating unusual but compelling work as noise, halo effect, or rubric misfit, the system would need a principled route for saying, here the map is incomplete, here a variable is missing, here a performance is real though rare.