BERA 2016

I presented this paper (co-written with Lenore Adie and Val Klenowski) at BERA in Leeds in 2016. Here are the slides: thompson-adie-klenowski

Slide 3

Validity has been described as ‘the most fundamental consideration in developing and evaluating tests’ (AERA, APA, NCME, 2014, p. 11); however, there is a lack of agreement about the best way to define the term ‘validity’ (Newton & Shaw, 2015). To understand the confusion around the term, it is helpful to consider its evolution. In their recent survey of the history of validity in educational assessment, Newton and Shaw (2014) argue that there have been four distinct periods in the evolution of the concept of validity in educational measurement. As Figure 1 shows, each of these periods corresponds with key debates around validity.


From the mid-1800s, nation-states became increasingly reliant on structured assessments “as a basis for making complex decisions about individuals and institutions” (Newton & Shaw, 2014, p. 17). Validity emerged during this period at the intersection of improved statistical procedures and knowledge with concerns about how interested parties could be assured that tests measured what they claimed to measure. However, even in this early period, there was tension between the aptitude test and achievement test communities that centred on whether content criteria or correlational evidence were the critical business of validity (Newton & Shaw, 2014, p. 19). This tension led, in the 1950s, to an attempt to classify validity into types (ultimately content, predictive, concurrent and construct validity) through the publication of a set of Standards. In 1966 the Standards were revised to recognise three types of validity: content, criterion-related and construct. However, many educational measurement experts became concerned that this fragmentary approach was causing validation studies to ignore the intertwined relationship of these types.


In the 1970s, Samuel Messick (amongst others) argued for a unified conception of validity, one in which it was necessary to join the science of validity with the ethics of validity (its consequences). As Messick (1989; 1998) argues, to really grapple with validity we need to recognise that testing is a political and social process informed by a variety of assumptions and expectations, not an objective and straightforward task. “For a fully unified view of validity, it must also be recognised that the appropriateness, meaningfulness and usefulness of score-based inferences depend as well on the social consequences of the testing. Therefore, social values cannot be ignored in considerations of validity” (Messick, 1989, p. 19). However, as Newton and Shaw (2014, p. 22) charge, while Messick argued a case for the importance of the ethics, or consequences, of testing, ‘he failed to provide a persuasive synthesis of science and ethics within validity theory’, and the end result was confusion within the field of educational measurement. In this they echo Popham’s (1997) criticism of Messick regarding the consequences of testing: “The social consequences of test use are vitally important…. but social consequences of test use should not be confused with the validity of interpretations based on examinees’ performances” (p. 13). Perhaps most important for this paper is the fourth stage, the deconstruction of validity and the work of Michael Kane, which produced “a new methodology for guiding validation practice: argumentation” that encompassed both the scientific issue of score meaning and the ethical consideration of the consequences of testing (Newton & Shaw, 2014, p. 136).

Slide 4

Kane (2015) argues that validity, or more specifically how validity is used or understood, is dependent on the claims being made within a given context. Thus, validity is a judgement based on either score interpretation, “whether the scores mean what they are supposed to mean” (Kane, 2015, p. 2), or evaluation of the uses of the test. Validity is not a static property of a test, nor is it necessarily a case of validity being a scientific theorisation of both score interpretation and uses. Rather, validity relates to the ambitions, and to an extent the stakes, of the scores and their interpretations in specific contexts, and to how those interpretations and uses can be justified. Kane distinguished between observable attributes and theoretical constructs and argued that much validity investigation could be simplified by focusing on observable attributes at a less ambitious scale or level.

I think of validity as the extent to which the proposed interpretations and uses of test scores are justified. The justification requires conceptual analysis of the coherence and completeness of the claims and empirical analyses of the inferences and assumptions inherent in the claims (Kane, 2015, p. 1).

Kane’s justification proposes a two-step argument-based approach. This involves first specifying the intended interpretation and use of the test as an interpretation/use argument (IUA) which includes “the network of inferences and assumptions leading from test performances to conclusions and decisions based on the test scores” (Kane, 2015, p. 4). Second is the argument based on whether the interpretations and uses of the test are supported by appropriate evidence.

More often than not, however, the claims made for tests and how tests are used are much more complicated than this example, and vary in their level of ambition. Thus, the ambition to use test scores to promote teacher and school accountability requires a very different degree of inquiry than using test scores to check in on student progress.

For us, Kane’s (1992; 2015) argument-based approach provides a useful conceptual lens because it both simplifies and reframes validity to include both score interpretation and use where those elements are appropriate for the aims/intentions of the assessment. Kane’s argumentative approach to validity places as much emphasis on the use of the tests as on the statistical processes through which score interpretation is grounded. Kane’s approach requires that the end-users, that is the people making decisions with the data (such as teachers, principals and policymakers), are key participants in validation. This argumentative approach enables users to consider both purpose and context regarding data use. The greater the likely impact, the more careful users need to be when drawing inferences from the data, and the more thoughtful, evidenced and precise the argument for the validity of that use needs to be. Correspondingly, low-order uses of the test data require less evidence and investigation in order to make decisions.


Each year ACARA releases a final report which includes national and state or territory results as well as results differentiated by gender, Indigenous status, language background other than English status, parental occupation, parental education, and geolocation (metropolitan, provincial, remote and very remote) at each year level and for each domain of the test. Participation rates are also included. Such jurisdictional data is used for comparative purposes.

  • While the participation rates are important evidence in considering how school systems are responding to the tests, these statistics do not provide information on why some groups of students are not participating in the tests.
  • As Figures 2 and 3 show, respectively, there are significant differences in participation across the States that have remained proportionally stable over time.
  • For example, in 2014 Victoria had 91.3% of the required Year 9 student population sit the NAPLAN tests. At the state level, if the non-participation is representative, we can be fairly confident of using the scores to make inferences about student literacy and numeracy levels within Victoria.
  • Of course, if the non-participation were significantly skewed (say, non-participants came disproportionately from the bottom quartile of student achievement) we could be less confident. We can also be less confident when comparing Victoria’s results to jurisdictions that have much higher or lower rates of participation. For instance, Victoria’s 91.3% participation rate in 2014 was significantly lower than New South Wales’ (NSW) 93.9%. Any comparison between the average scores of the states must take into account the likelihood that what is being compared is as much different rates of participation as it is different levels of student achievement.
  • To be more certain about this, we would have to understand the spread of non-participation across various indicators. For example, if NSW’s non-participation were evenly spread across an indicator like socioeconomic status (SES) (or the Index of Community Socio-Educational Advantage (ICSEA) as used by NAPLAN), and Victoria’s non-participation were skewed towards those likely to do worse on the tests, the validity of any interpretation based on this comparison is decreased because these patterns would likely affect the average scores being compared.
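The effect described in the points above can be illustrated with a small simulation. This is a hedged sketch using invented numbers, not NAPLAN data: two hypothetical cohorts with identical underlying achievement are given the participation rates cited above, but with non-participation spread evenly in one and skewed towards low achievers in the other. The observed mean of the skewed cohort is inflated, so a naive comparison of averages confounds participation patterns with achievement.

```python
import random

random.seed(1)

# Hypothetical cohort: "true" scores drawn from the same distribution
# regardless of jurisdiction (mean 500, sd 70 are invented values).
cohort = [random.gauss(500, 70) for _ in range(10_000)]

def observed_mean(scores, participation, skewed):
    """Mean of the scores actually observed under a given participation rate.

    If `skewed` is True, non-participants come from the bottom of the score
    distribution (students likely to do worse sit out); otherwise
    non-participation is spread evenly (a simple random sample sits the test).
    """
    n_sit = int(len(scores) * participation)
    if skewed:
        sitting = sorted(scores)[len(scores) - n_sit:]  # lowest scorers absent
    else:
        sitting = random.sample(scores, n_sit)
    return sum(sitting) / len(sitting)

# Same underlying achievement, different non-participation patterns.
even = observed_mean(cohort, 0.939, skewed=False)  # evenly spread, 93.9% sit
skew = observed_mean(cohort, 0.913, skewed=True)   # skewed, 91.3% sit

print(f"Even non-participation:   mean = {even:.1f}")
print(f"Skewed non-participation: mean = {skew:.1f}")
# The skewed cohort reports a higher observed mean despite identical
# underlying achievement.
```

Under these (invented) assumptions, the jurisdiction with skewed non-participation appears to outperform the other even though the two cohorts were drawn from the same distribution, which is precisely why comparisons of average scores need to account for who is missing, not just how many.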

The problem of the interpretation of NAPLAN results derived from comparisons is further complicated by how they are depicted on the MySchool website. MySchool enables statistical comparison between up to 60 ‘like’ schools, where likeness is measured predominantly by ICSEA. Over time, this comparison has been used as a form of league table that significantly increases the stakes of the tests. One of the unintended consequences of this between-school comparison has been that some schools have sought to gain advantage by influencing the population of students who sit the test. Consider the following example, which reports on the Year 3 Reading test in 2012 across 44 ‘like’ schools (Figure 4).

School 7 (as represented by the blue dot) appears to be doing well in comparison to the 43 other ‘like’ schools, performing in the top quarter of these schools based on the ICSEA calculation. However, the data conceals the problem of participation. Analysis of the ‘like’ schools’ participation rates, as shown in Figure 5, indicates that the average participation rate across the schools (not including School 7) was higher than School 7’s participation rate of 79%. This is true even for schools that performed statistically below and significantly below School 7 based on their average student achievement in Year 3 Reading in 2012. What the participation rates show is a trend for participation to increase as average student achievement on the tests decreases.

To increase the validity of interpretations based on comparative data, the publication and consideration of other forms of data would be necessary. When the tests are being used for accountability purposes, policymakers and testing authorities require an improved way to integrate the data that they collect to support more valid comparisons. Alternatively, the interpretative claims made in reporting practices should be tempered in response to the lessened validity that results from disregarding significant variables, as this example of the inherent problems with comparison demonstrates.

The comparison of ‘like’ schools on MySchool conceals that participation is being measured as much as literacy and numeracy attainment are. If policymakers had approached the ranking of schools via NAPLAN results using Kane’s argumentative approach to validity, two things would immediately have become apparent. First, given the stakes involved, detailed evidence would have had to be collected to justify the reasonableness of this comparison. This would have alerted them to international research on the problem of participation in these kinds of tests (Berliner, 2011; Stobart, 2008). Second, since validation is an ongoing process that requires consideration of the context and purpose of each case, and since overall participation in NAPLAN has fallen since it began in 2008, a validation study conducted in 2008 would likely have generated a very different view than one conducted in 2014. Hence the necessity for a new study rather than basing decisions on data that cannot validly be compared.