A few anomalies? No, baseline is flawed from start to finish

In the pilot year of baseline assessment, the most popular provider, Early Excellence, has already had to apologise to its 12,000 schools for faulty scores resulting from “a few anomalies” (see Schools Week). Even this apology is economical with the truth.

Early Excellence argues that its overall bands, rather than its scores for separate aspects such as literacy, should be used to predict future outcomes. This is disingenuous. In fact, Early Excellence has no research data to show how well its scores relate to future outcomes. All it can show is a vague resemblance between aggregate scores for a whole school and the school’s recent KS1 results.

[Chart: High attaining school. Low scores on left, higher on right.]

Even this is seriously misleading for lower attaining schools. In its trial, EE found that far more children scored low at baseline than scored low at KS1.

[Chart: Low attaining school]

On this basis, there is no way of ascertaining whether a child with a particular baseline score is likely to gain a similar result at KS1 or KS2. It makes no difference whether you use the overall baseline score or that for a specific strand. To assume otherwise is an ill-founded act of faith, and it contradicts other analysis: we know that very many children’s levels vary from one stage to the next.

NFER

The problem with baseline is not limited to Early Excellence and its ‘few anomalies’. NFER also lack longitudinal data: in other words, they are unable to link individual baseline scores with subsequent attainment. They simply cannot show whether their baseline tests are reliable predictors of a child’s future attainment. They argue that baseline tests should only be used for evaluating a whole school’s performance.

This is an honest admission, but naive. In the present context, schools and teachers will inevitably make judgements about individual pupils based on their baseline scores, and use the results to track them individually. Teachers will use the results for practices such as “ability grouping”, and make assumptions about a child’s ability and potential. Indeed, it is highly likely that Ofsted inspectors will look at scores for sample children and question whether or not the school has added the expected quantity of “value”. This could seriously damage children’s opportunities and development. It will also lead to seriously flawed evaluations of schools.

CEM

The most experienced provider, CEM, have 20 years’ experience of supplying individual schools with predictive tests. They are using a version of their longstanding PIPS test for baseline. The correlation between their tests and KS2 results is around 0.7. In other words, the tests account for only about half of the variation in children’s later results. [No space here for a full explanation of how 0.7 becomes half but, briefly, it comes from squaring the correlation: 0.7 × 0.7 = 0.49.] We have shown, in a previous post, how children with the same baseline score diverge across 60 to 80 percentiles of outcomes. Put simply, even CEM’s well-developed tests are not a precision tool, but more like a sawn-off shotgun.
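The squaring step, and the spread it implies, can be sketched in a few lines of Python. This is an illustration only, not CEM’s own method: it assumes that baseline and later scores are standardised and jointly normal (a simplifying assumption that the providers do not claim), and the function names are made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def variance_explained(r: float) -> float:
    """Share of variation in outcomes accounted for by a correlation r."""
    return r * r


def outcome_percentile_range(r: float,
                             baseline_percentile: float = 50.0,
                             coverage: float = 0.80) -> tuple[float, float]:
    """Percentile range of later outcomes for children sharing one baseline score.

    ASSUMPTION: both scores are standardised and jointly normal, and
    r lies strictly between -1 and 1.
    """
    std_normal = NormalDist()
    z_baseline = std_normal.inv_cdf(baseline_percentile / 100)

    # Given the baseline, the outcome is centred on r * z_baseline,
    # with the leftover (unexplained) spread sqrt(1 - r^2).
    conditional = NormalDist(mu=r * z_baseline, sigma=sqrt(1 - r * r))

    tail = (1 - coverage) / 2
    low = std_normal.cdf(conditional.inv_cdf(tail)) * 100
    high = std_normal.cdf(conditional.inv_cdf(1 - tail)) * 100
    return low, high


if __name__ == "__main__":
    print(f"variance explained at r = 0.7: {variance_explained(0.7):.2f}")
    low, high = outcome_percentile_range(0.7)
    print(f"middle 80% of outcomes from a median baseline: "
          f"{low:.0f}th to {high:.0f}th percentile")
```

Under those assumptions, a correlation of 0.7 accounts for about 49% of the variation, and the middle 80% of children who start at exactly the median baseline score finish anywhere from roughly the 18th to the 82nd percentile. That is a spread of more than 60 percentiles from a single starting point.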

Testing 4 year olds

This will surprise no one who actually knows a four year old. Parents and professionals know just how idiosyncratic they can be. It is often impossible to make clear yes or no decisions (e.g. do they relate letters to sounds?); it just depends on the day and the context.

We also know just how much they change in the year from their fourth birthday. A low score in a baseline assessment is highly likely to mean simply: this child is not yet old enough.

The DfE have known for years that assessing young children is a perilous venture (see RR034, 2010). For example, EYFS scores for Communication, Language and Literacy correlate 0.68 with KS1 Reading. As with CEM’s baseline tests, that accounts for only around half the variation in children’s results (and you don’t know in advance which children will diverge from the prediction). Whether you try to predict single KS1 subjects or combined scores, and on the basis of single EYFS strands or combined, the correlation is similar or even worse. Of children with the midpoint score at EYFS Reading, 21% reached level 1 at KS1, 22% reached 2c, 31% 2b, and 25% 2a or 3.

From the same starting point, i.e. the midpoint baseline score, only half reached the average level 4 at age 11, while 30% reached level 5 and 21% level 3 or below. For predicting what an individual child will achieve, information like this is useless.
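To put a number on that, here is a minimal Python sketch using only the percentages quoted above. It asks how often you would be right if you predicted the single most common outcome for every child who starts at the midpoint score. The function name is hypothetical, “only half” is entered as 50 (an assumption), and the shares are rescaled because the published figures are rounded and do not sum to exactly 100.

```python
# Outcomes for children with the midpoint early years score, as quoted above (%).
ks1_outcomes = {"level 1": 21, "2c": 22, "2b": 31, "2a or 3": 25}

# KS2 outcomes from the same starting point; "only half" is taken as 50 (assumption).
ks2_outcomes = {"level 3 or below": 21, "level 4": 50, "level 5": 30}


def best_single_guess(outcomes: dict[str, float]) -> tuple[str, float]:
    """Return the most common outcome and the % of children it fits.

    The share is rescaled by the total because rounded percentages
    may not add up to exactly 100.
    """
    level = max(outcomes, key=outcomes.get)
    total = sum(outcomes.values())
    return level, 100 * outcomes[level] / total


if __name__ == "__main__":
    for stage, outcomes in (("KS1", ks1_outcomes), ("KS2", ks2_outcomes)):
        level, hit_rate = best_single_guess(outcomes)
        print(f"{stage}: best guess '{level}' fits {hit_rate:.0f}% of children, "
              f"misses {100 - hit_rate:.0f}%")
```

On these figures, even the best possible single prediction misses more than two thirds of midpoint children at KS1 and about half at KS2.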

As we pointed out in an earlier post, baseline assessment is fantasy dressed up as science. But it is a dangerous fantasy. It leads to misleading assumptions about each child’s “ability” or “potential”, and will do untold damage.