On Monday and Tuesday of this week, I had the pleasure of being invited to participate in the National Assessments in the Age of Global Metrics symposium at Deakin University. The symposium was organised by Deakin University’s Research for Educational Impact (REDI) in collaboration with the Laboratory of International Assessment Studies. Its stated aim was to “bring together scholars and practitioners from around the world to examine models of national assessments and explore how they are affecting the policy discourse and the practices of education in different parts of the world.”

More specifically, the symposium set out to address the following questions.

  • How are national and sub-national assessments evolving in the age of global metrics?
  • What is the relationship between national assessments and ILSAs?
  • What effects are they having?
  • What can we learn from the experiences over the past couple of decades?

What I liked about this event was that it aimed to bring together academics from diverse backgrounds to engage in dialogue, and maybe even learn from each other, in the field of large-scale international assessment. And it was a bit of a star cast, with presentations from Ray Adams (ACER), Sara Ruto (PAL), Anil Kanjee (Tshwane University of Technology), Sue Thompson (ACER), Hans Wagemaker (ex-IEA), Sam Sellar (MMU), and Barry McGaw (ex-ACARA). Ray Adams’ presentation was very interesting, making the case for homogenising ILSAs using criteria to enable a form of meta-standardisation; I may blog on this at some stage once I have thought about it further.

On the Tuesday morning there was a panel discussion that addressed the question ‘What’s the point of national assessments?’ One of the panellists was Barry McGaw, who was one of the architects of Australia’s NAPLAN and MySchool interventions, an area in which I have done a fair bit of work. I must admit that during the presentation I was a bit annoyed, and when there was a chance for discussion, I asked a few questions. Because the session was live-streamed, a number of people tweeted that I’d asked some questions, and I got lots of requests to share what they were. Here’s my list of questions:

  1. If NAPLAN is impactful, and I think on this we agree, why is it only ever impactful in positive ways, such as in the anecdote that you shared? Why aren’t we equally interested in the negative impacts, including trying to understand all of those schools whose results have gone backwards?
  2. Given the objective of this event, I am wondering which qualitative researchers you have read on the effects of NAPLAN that informed your attempts to make the assessments better through designing responses to the unintended consequences of the assessment?
  3. Results across Australia have flatlined since 2010*; how do you justify the claim that NAPLAN has been a success on its own terms?
  4. I’m always concerned when people mischaracterise the unintended consequences of tests as simply ‘teaching to the test’. It would be better to see a hierarchy of unintended consequences, ranging from:
    1. making decisions about people’s livelihoods such as whether to renew contracts for teachers based on NAPLAN results
    2. making decisions about who to enrol in a school or a particular program based on NAPLAN results
    3. a narrowed curriculum focus where some subjects are largely ignored, or worse, not taught at all so that schools can focus on NAPLAN prep
    4. teaching to the test, which may or may not be a problem depending upon how closely the test aligns with the curriculum, etc.
  5. The problem with the branched design for online tests is not whether students will like it or not; it is whether schools have the computational capacity to run the tests at all: whether BYOD schools advantage or disadvantage some students depending upon the type of device they use, problems of internet connection in rural and remote schools, bandwidth in large schools, and so on. I am interested in how you characterise this as a success?**

I was unimpressed with the answers I got, but I imagine that’s my problem. I think that psychometricians do rigorous research and have important insights into education systems that need to be taken seriously, but I equally think that qualitative fieldwork is desperately needed to advance the validity of these assessments, and when you shut that insight down you only damage your own assessments in the long run.

* At the end of the session, John Ainley from ACER came over and suggested to me that there had been significant improvement in Year 3 Reading and Year 5 Numeracy, with a bump in 2016 and 2017. I conceded the point: I stopped researching NAPLAN in 2015, so I hadn’t updated my trendlines. Across the other domains, however, results have remained fairly stable since 2010. This is known as the ‘washback effect’ in the assessment literature.

** I had this question down to ask, but felt I had gone on too long so didn’t ask it.

BERA 2016

I presented this paper (co-written with Lenore Adie and Val Klenowski) at BERA in Leeds in 2016. Here are the slides: thompson-adie-klenowski

Slide 3

Validity has been described as ‘the most fundamental consideration in developing and evaluating tests’ (AERA, APA, NCME, 2014, p. 11); however, there is a lack of agreement about how best to define the term ‘validity’ (Newton & Shaw, 2015). To understand the confusion around the term, it is helpful to consider its evolution. In their recent survey of the history of validity in educational assessment, Newton and Shaw (2014) argue that there have been four distinct periods in the evolution of the concept of validity in educational measurement. As Figure 1 shows, each of these periods corresponds with key debates around validity.


From the mid-1800s, nation-states became increasingly reliant on structured assessments “as a basis for making complex decisions about individuals and institutions” (Newton & Shaw, 2014, p. 17). Validity emerged during this period through the confluence of improved statistical procedures and knowledge with concerns about how interested parties could be assured that tests measured what they claimed to measure. However, even during this early period, there was tension between the aptitude test and achievement test communities that centred on whether content criteria or correlational evidence constituted the critical business of validity (Newton & Shaw, 2014, p. 19). This tension led to a fragmentation in the 1950s, formalised through the publication of a set of Standards, which classified validity into types (ultimately content, predictive, concurrent and construct validity). In 1966 the Standards revised validity to three types: content, criterion-related and construct. However, many educational measurement experts became concerned that this fragmentary approach was causing validation studies to ignore the intertwined relationship of these types.


In the 1970s, Samuel Messick (amongst others) argued for a unified conception of validity, holding that it was necessary to bring together the science of validity with the ethics of validity (its consequences). As Messick (1989; 1998) argues, to really grapple with validity we need to recognise that testing is a political and social process informed by a variety of assumptions and expectations, not an objective and straightforward task. “For a fully unified view of validity, it must also be recognised that the appropriateness, meaningfulness and usefulness of score-based inferences depend as well on the social consequences of the testing. Therefore, social values cannot be ignored in considerations of validity” (Messick, 1989, p. 19). However, as Newton and Shaw (2014, p. 22) charge, while Messick argued a case for the importance of the ethics, or consequences, of testing, ‘he failed to provide a persuasive synthesis of science and ethics within validity theory’, and the end result was confusion within the field of educational measurement. In this they echo Popham’s (1997) criticism of Messick in regard to the consequences of testing: “The social consequences of test use are vitally important… but social consequences of test use should not be confused with the validity of interpretations based on examinees’ performances” (p. 13). Perhaps most important for this paper is the fourth stage, the deconstruction of validity and the work of Michael Kane, in which validation was rebuilt around “a new methodology for guiding validation practice: argumentation” that encompassed both the scientific issue of score meaning and the ethical consideration of the consequences of testing (Newton & Shaw, 2014, p. 136).

Slide 4

Kane (2015) argues that validity, or more specifically how validity is used or understood, is dependent on the claims being made within a given context. Thus, validity is a judgement based on either score interpretation, “whether the scores mean what they are supposed to mean” (Kane, 2015, p. 2), or an evaluation of the uses of the test. Validity is not a static property of a test, nor is it necessarily a scientific theorisation of both score interpretation and uses. Rather, validity relates to the ambitions, and to an extent the stakes, of the scores and their interpretations in specific contexts, and to how those interpretations and uses can be justified. Kane distinguished between observable attributes and theoretical constructs, and argued that much validity investigation could be simplified by focusing on observable attributes at a less ambitious scale or level.

I think of validity as the extent to which the proposed interpretations and uses of test scores are justified. The justification requires conceptual analysis of the coherence and completeness of the claims and empirical analyses of the inferences and assumptions inherent in the claims (Kane, 2015, p. 1).

Kane proposes a two-step argument-based approach to this justification. The first step specifies the intended interpretation and use of the test as an interpretation/use argument (IUA), which includes “the network of inferences and assumptions leading from test performances to conclusions and decisions based on the test scores” (Kane, 2015, p. 4). The second step evaluates whether those interpretations and uses are supported by appropriate evidence.

More often than not, however, the claims made for tests and the ways tests are used are much more complicated than this, and vary through different levels of ambition. Thus, the ambition to use test scores to promote teacher and school accountability requires a very different degree of inquiry than using test scores to ‘check in’ on student progress.

For us, Kane’s (1992; 2015) argument-based approach provides a useful conceptual lens because it both simplifies and reframes validity to include both score interpretation and use, where those elements are appropriate for the aims and intentions of the assessment. Kane’s argumentative approach to validity places as much emphasis on the use of the tests as on the statistical processes through which score interpretation is grounded. Kane’s approach requires that the end-users, that is, the people making decisions with the data (such as teachers, principals and policymakers), are key participants in validation. This argumentative approach enables users to consider both purpose and context regarding data use. The greater the likely impact, the more careful users need to be when drawing inferences from the data, and the more thoughtful, evidenced and precise an argument about the validity of that use is required. Correspondingly, low-order uses of the test data require less evidence and investigation in order to make decisions.


Each year ACARA releases a final report which includes national and state or territory results as well as results differentiated by gender, Indigenous status, language background other than English status, parental occupation, parental education, and geolocation (metropolitan, provincial, remote and very remote) at each year level and for each domain of the test. Participation rates are also included. Such jurisdictional data is used for comparative purposes.

  • While the participation rates are important evidence in considering how school systems are responding to the tests, these statistics do not provide information on why some groups of students are not participating in the tests.
  • As Figures 2 and 3 show, respectively, there are significant differences in participation across the States that have remained proportionally stable over time.
  • For example, in 2014 Victoria had 91.3% of the required Year 9 student population sit the NAPLAN tests. At the state level, if the non-participation is representative, we can be fairly confident in using the scores to make inferences about student literacy and numeracy levels within Victoria.
  • Of course, if the non-participation were significantly skewed (say, the absentees came from the bottom quartile of student achievement) we could be less confident. We can also be less confident when comparing Victoria’s results to jurisdictions that have much higher or lower rates of participation. For instance, Victoria’s 2014 participation rate of 91.3% is significantly lower than New South Wales’ (NSW) 93.9%. Any comparison between the average scores of the states must take into account the likelihood that what is being compared is as much different rates of participation as it is different levels of student achievement.
  • Of course, to be more certain about this, we would have to understand the spread of non-participation across various indicators. For example, if NSW’s non-participation rate was evenly spread across an indicator like socioeconomic status (SES) (or the Index of Community Socio-Educational Advantage (ICSEA) as used by NAPLAN), and Victoria’s non-participation was concentrated among those likely to do worse on the tests, the validity of any interpretation based on this comparison is decreased, because these patterns would likely affect the average scores being compared.
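The effect described in these bullet points can be illustrated with a small simulation. The numbers below are entirely hypothetical (they are not real NAPLAN figures); the point is only that when non-participation is skewed towards low achievers, a jurisdiction's mean score is inflated even though underlying achievement is identical:

```python
import random

random.seed(1)

# Hypothetical scores (not real NAPLAN data): one underlying population
# of 100,000 students, mean 500, standard deviation 70.
population = [random.gauss(500, 70) for _ in range(100_000)]

# Jurisdiction A: roughly 6% of students miss the test, at random.
a_participants = [s for s in population if random.random() > 0.06]

# Jurisdiction B: non-participation is skewed towards low achievers.
# Students scoring below 430 are assumed far more likely to be absent.
def sits_test(score):
    p_absent = 0.35 if score < 430 else 0.02
    return random.random() > p_absent

b_participants = [s for s in population if sits_test(s)]

mean_a = sum(a_participants) / len(a_participants)
mean_b = sum(b_participants) / len(b_participants)

print(f"A: {len(a_participants) / len(population):.1%} sat, mean {mean_a:.1f}")
print(f"B: {len(b_participants) / len(population):.1%} sat, mean {mean_b:.1f}")
```

Both jurisdictions draw from the same achievement distribution, yet B's participant mean comes out several points higher than A's simply because its absentees are concentrated among lower achievers. This is exactly why comparing average scores without understanding the spread of non-participation weakens the validity of the interpretation.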

The problem of interpreting NAPLAN results derived from comparisons is further complicated by how they are depicted on the MySchool website. MySchool enables statistical comparison between up to 60 ‘like’ schools, where likeness is measured predominantly by ICSEA. Over time, this comparison has been used as a form of league table that significantly increases the stakes of the tests. One of the unintended consequences of this between-school comparison has been that some schools have sought to gain advantage by influencing the population of students who sit the test. Consider the following example, which reports results on the 2012 Year 3 Reading test across 44 ‘like’ schools (Figure 4).

School 7 (represented by the blue dot) appears to be doing well in comparison to the 43 other ‘like’ schools, performing in the top quarter of these schools based on the ICSEA calculation. However, the data conceals the problem of participation. Analysis of the ‘like’ schools’ participation rates, as shown in Figure 5, indicates that the average participation rate across the schools (not including School 7) was higher than School 7’s participation rate of 79%. This is true even for schools that performed statistically below and significantly below School 7 based on their average student achievement in Year 3 Reading in 2012. What the participation rates show is a trend in which participation increases as average student achievement on the tests decreases.

To increase the validity of interpretations based on comparative data, the publication and consideration of other forms of data would be necessary. When the tests are being used for accountability purposes, policymakers and testing authorities require an improved way to integrate the data that they collect to support more valid comparisons. Alternatively, the interpretative claims made in reporting practices should be moderated to reflect the lessened validity that results from disregarding significant variables, as this example of the inherent problems with comparison demonstrates.

The comparison of ‘like’ schools on MySchool conceals that participation is being measured as much as literacy and numeracy attainment are. If policymakers had approached the ranking of schools via NAPLAN results using Kane’s argumentative approach to validity, two things would immediately have become apparent. First, given the stakes involved, detailed evidence would have had to be collected to justify the reasonableness of this comparison. This would have alerted them to international research on the problem of participation in these kinds of tests (Berliner, 2011; Stobart, 2008). Second, since validation is an ongoing process that requires consideration of the context and purpose of each case, and since overall participation in NAPLAN has fallen since it began in 2008, a validation study conducted in 2008 would likely have generated a very different view than one conducted in 2014. Hence the necessity for ongoing validation rather than basing decisions on data that cannot validly be compared.