Problem

The Importance of Assessment

Among the most important roles of instructors in higher education is the task of certifying that each individual learner in their course has achieved a particular standard in relation to the intended outcomes of the course, and that this achievement is a valid, reliable, and fair representation of the learner's true ability. The importance of this determination reflects how student achievement data are used not only in summative course assessments, but also in predicting future success, awarding scholarships, and determining acceptance into competitive programs [@guskeyExploringFactorsTeachers2019; @bairdAssessmentLearningFields2017]. For this accounting of learning to accurately reflect the goals of any given course, the assessment strategy must be aligned with the course learning outcomes [@biggsTeachingQualityLearning2011]. However, Broadfoot [-@broadfootAssessmentTwentyFirstCenturyLearning2016] and Pellegrino and Quellmalz [-@pellegrinoPerspectivesIntegrationTechnology2010] argue that the goals and intended outcomes of higher education have changed as society has become more saturated with digital technologies. These changes have precipitated parallel shifts in both the cognitive and affective competencies required of digitally literate citizens. Predictably, this has widened the gap between traditional assessment structures, which prioritize validity, reliability, fairness, and objectivity through psychometric analyses, and the modern goals of higher education, which prioritize more affective constructs such as cooperation, empathy, creativity, and inquiry [@worldeconomicforumFutureJobsReport2020].

Complicating this problem is the trend, accelerated by the COVID-19 pandemic, towards the use of digital technologies to create, administer, and score assessments. Typically, digital technologies are used to increase the efficiency and objectivity of test administration [@benjaminRaceTechnologyAbolitionist2019], for example through automated scoring of selected-response tests, which reinforces traditional assessment structures. However, as Shute et al. [-@shuteAdvancesScienceAssessment2016] argue, digital technologies could instead be used to drive innovations in assessment practice while balancing the need for both quantitative and qualitative approaches to assessment.

Defining Assessment

Assessment, according to the National Research Council's (NRC) 2001 report *Knowing What Students Know*, is simply "reasoning from evidence" [@nationalresearchcouncilKnowingWhatStudents2001, p. 43], based on Mislevy's assertion that "test theory is machinery for reasoning from students' behavior to conjectures about their competence, as framed in a particular conception of competence" [-@mislevyTestTheoryReconcieved1994, p. 4]. Such a parsimonious description, however, hides the complexity of actually coming to know what learners know and can do in relation to particular outcomes. Since knowledge of a particular domain cannot be directly observed in a learner, instructors must rely on data gathered during the teaching process to support an inference about what a learner probably knows. The data gathered from performance tasks such as exams, essays, portfolios, and labs become evidence when they support an inference about what a learner knows and can do. Hence, all summative grades are probabilistic, not deterministic.
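
To make the probabilistic character of this reasoning concrete, the following sketch (written in Python, with entirely hypothetical parameters that are not drawn from any instrument or study cited here) treats each observed response as a piece of evidence and updates a belief about whether a learner has mastered an outcome:

```python
# Illustrative sketch only: the mastery probabilities below are hypothetical,
# not taken from any instrument or study cited in this document.

def update_mastery(prior: float, correct: bool,
                   p_correct_if_mastered: float = 0.85,
                   p_correct_if_not: float = 0.25) -> float:
    """Bayesian update of the probability that a learner has mastered an
    outcome, given one observed response (one piece of evidence)."""
    like_mastered = p_correct_if_mastered if correct else 1 - p_correct_if_mastered
    like_not = p_correct_if_not if correct else 1 - p_correct_if_not
    numerator = like_mastered * prior
    return numerator / (numerator + like_not * (1 - prior))

# A short series of observed responses (True = correct) gathered during teaching.
responses = [True, True, False, True]
belief = 0.5  # neutral prior before any evidence is observed
for r in responses:
    belief = update_mastery(belief, r)

print(f"P(mastery | evidence) = {belief:.2f}")  # an inference, not an observation
```

The point of the sketch is only that the final quantity is a belief supported by evidence; the learner's knowledge itself is never observed directly.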

The NRC identifies three interdependent components of educational assessment which together form the "assessment triangle": cognition, or a model of the domain to be learned; observation, or the performance task learners complete to demonstrate their competence; and interpretation, or the inference drawn from the data produced by the observation. The interdependence of these components requires that both the observation and the interpretation be grounded in the cognitive model of the domain. For example, if the domain of knowledge is, broadly speaking, arithmetic, then the observation, or performance task, must elicit responses which require the examinee to demonstrate competence in arithmetic [@gerritsen-vanleeuwenkampAssessmentQualityTertiary2017]. If the performance task instead requires the ability to understand Icelandic, a different cognitive domain, then the responses of an examinee who speaks only Swahili will represent not their arithmetic ability but their inability to understand Icelandic. Consequently, the examiner will have no basis for making an inference about the examinee's arithmetic ability; in other words, the inference would be invalid because the performance task is not aligned with the cognitive model.

[Figure: The assessment triangle]

Similarly, speaking more broadly than an individual course outcome, if the competencies required of citizens of a digitally saturated society [@timmisRethinkingAssessmentDigital2016] differ from those required in the past (representing a change in the cognitive model of higher education in general), then instructors in higher education may need to consider whether the performance tasks they require of learners are aligned with those competencies. The World Economic Forum tracks and publishes the demand for skills in its Future of Jobs Report, including trends and forecasts as reported by employers. It reports consistently increasing demand for employees skilled in critical thinking and analysis and in problem-solving, and, new to the report, the increasing importance of what it calls "self-management" skills such as active learning, resilience, stress tolerance, and flexibility [-@worldeconomicforumFutureJobsReport2020]. It also reports increasing demand for technological skills related to designing, programming, and using technological tools. Indeed, researchers have called on instructors to reconsider their assessment practices [@timmisRethinkingAssessmentDigital2016], recognizing that the aims of higher education in the 21st century have shifted from a predominantly top-down transmission model requiring graduates to demonstrate knowledge mastery in a cognitive domain, to a model demanding that graduates demonstrate skills and attitudes in non-cognitive domains such as cooperation, problem-solving, creativity, and empathy, which selected-response instruments are ill-suited to assess [@broadfootAssessmentTwentyFirstCenturyLearning2016]. Encouraging a broader range of assessment structures will require paradigm shifts in pedagogy, assessment, and the use of technology that centre a relational, human-centred approach to transformative learning experiences [@blackAssessmentClassroomLearning1998]. Understanding how instructors think about and implement assessment structures, and how those structures affect learners, can help stakeholders plan for assessment in the 21st century.

Purposes of Assessment

Not only can there be internal misalignment among the three components of the assessment triangle, there can also be misalignment between how assessment instruments are intended to be used and how they are actually used. For example, some large-scale assessments (LSAs) are used to predict success in higher education (SAT, GRE, MCAT); others are used by certifying agencies to confirm that an applicant has attained the requisite knowledge, skills, and attitudes to succeed in a given profession (NCLEX); still others are used to compare educational attainment by certain age groups in different countries (PISA) and to inform national or state educational policy. These LSAs are designed to meet very high standards of psychometric rigour because of the serious consequences of miscategorizing a person, perhaps an aspiring nurse, as having attained the requisite knowledge, skills, and attitudes when, in fact, they have not. These psychometric demands require that LSAs used for such purposes be valid, reliable, and fair. Validity refers to the quality of an inference drawn from the examinee's performance: an inference about competence is valid when the performance actually reflects the domain the instrument was intended to measure. Recall the example of a test of arithmetic which required the ability to understand Icelandic. That instrument might support valid inferences for examinees who are fluent in Icelandic, but not for those who speak only Swahili; Gerritsen-van Leeuwenkamp et al. call this "construct irrelevant difficulty" [-@gerritsen-vanleeuwenkampAssessmentQualityTertiary2017, p. 102]. It is important to note that validity is a quality of the inference, not of the instrument itself, though it might loosely be understood as the accuracy of the inferences an instrument supports. Reliability refers to an instrument yielding consistent scores regardless of the population of examinees: a reliable instrument should yield very similar scores for two examinees who have the same level of ability, regardless of their context. Finally, a fair instrument should not yield results that discriminate for or against a particular subgroup of examinees. For example, if examinees in group A with ability level $\theta$ earn scores reflecting $\theta$, while examinees in group B with the same ability level consistently earn scores reflecting $\theta-2$, then the instrument is likely unfair.
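
The fairness example can be illustrated with a small simulation. The group labels, the shared ability value, and the two-point penalty below are hypothetical; the sketch simply shows how a systematic gap between equally able groups would appear in score data:

```python
# Illustrative sketch with hypothetical values: two groups share the same
# ability level, but the instrument systematically penalizes group B.
import random

random.seed(1)
theta = 10.0          # shared ability level for both groups
bias_against_b = 2.0  # hypothetical penalty imposed on group B by the instrument

scores_a = [theta + random.gauss(0, 0.5) for _ in range(200)]
scores_b = [theta - bias_against_b + random.gauss(0, 0.5) for _ in range(200)]

mean_a = sum(scores_a) / len(scores_a)
mean_b = sum(scores_b) / len(scores_b)
print(f"Group A mean: {mean_a:.2f}, Group B mean: {mean_b:.2f}")
print(f"Gap despite equal ability: {mean_a - mean_b:.2f}")
```

A consistent gap of roughly two points between equally able groups is the pattern an examiner would flag as evidence of an unfair instrument.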

[Figure: Validity, reliability, and fairness]

Assessment is also used in classroom contexts to provide information about a particular learner's status and progress in relation to learning outcomes, sometimes called assessment of learning, or to directly inform the learner and instructor about the next steps that a learner might take in order to progress, sometimes called assessment for or as learning [@earlAssessmentLearningUsing2013]. Classroom assessment must also follow the pattern of the assessment triangle, with a cognitive model, an observation, and an inference, but the context of the learner is much more important. Instructors in higher education often need to accommodate various learner characteristics in order to ensure fairness. Learners who live with anxiety, depression, attention disorders, learning difficulties, or other conditions that affect their ability to perform in assessment tasks are provided with accommodations such as extended time to complete a task, reduced task requirements, or other strategies to ensure that their performance matches their actual ability. Further, classroom instructors often have greater insight into the conditions of the performance task and can adjust accordingly. For example, an instructor might recognize that a local power outage affected the ability of learners to complete a task and adjust the timeline or expectations accordingly, or they might realize that learners are having difficulty with a particular key idea, pause the assessment task, provide supplemental instruction, and then resume.

The critical difference between LSAs and classroom assessments is context. For an LSA to be valid and reliable, the effects of local context must be controlled so that, regardless of where or when a learner completes the assessment task, their performance will match their ability; any deviation from this match is treated as "error" [@gerritsen-vanleeuwenkampAssessmentQualityTertiary2017]. Increasing validity and reliability in this way requires a large amount of performance data and considerable attention to controlling the environment in which the task is completed. Conversely, classroom instructors have very little data to work with relative to the requirements of most psychometric analyses, and it is very common for them to accommodate learner needs with respect to their context. This means that these two purposes of assessment, while both legitimate and necessary, are incompatible in significant ways.
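
One conventional way to formalize the idea that any mismatch between performance and ability is treated as error is the classical test theory decomposition, given here only as a general illustration rather than as a model used in this study:

$$X = T + E$$

where $X$ is the observed score, $T$ is the learner's true score on the construct, and $E$ is error introduced by the conditions and context of the observation. Broadly, LSAs invest in controlling and statistically modelling $E$, while classroom instructors address its contextual sources directly through accommodation and adjustment.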

First, the methods used to gather data (the observation component of the assessment triangle) are fundamentally different: LSAs require large amounts of data collected in very controlled environments, whereas classroom assessments provide small amounts of data collected in less controlled environments. Second, the data derived from performance tasks in LSA contexts differ from the data derived from performance tasks in classroom contexts. Data from LSAs are often in dichotomous or binary form (correct/incorrect), although models for polytomous data (three or more categories of response, often used for Likert scales) are becoming more common [@penfieldNCMEInstructionalModule2014]. These data are presumed to be 'objective' in that they represent an absolute raw score (an examinee answered 34 out of 57 selected-response items correctly) [@michellPsychometricsPathologicalScience2008]. Conversely, data from classroom assessments are often qualitative and subjective, especially in humanities courses. An implication of these differences is that data from one context cannot necessarily be used for inferences in the other context.
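
As a small, purely hypothetical contrast between these two kinds of data, the sketch below scores an invented set of responses both dichotomously and polytomously, producing the kind of raw score described above:

```python
# Illustrative sketch: invented responses, not data from any real instrument.
# Dichotomous scoring collapses each item to 0/1; polytomous scoring keeps
# ordered categories (here 0 = no credit, 1 = partial credit, 2 = full credit).

dichotomous = [1, 0, 1, 1, 0, 1]   # correct / incorrect
polytomous = [2, 0, 1, 2, 1, 2]    # ordered partial-credit categories

raw_score = sum(dichotomous)
print(f"Raw score: {raw_score} out of {len(dichotomous)}")

partial_credit = sum(polytomous)
print(f"Partial-credit score: {partial_credit} out of {2 * len(polytomous)}")
```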

The different purposes of assessment data require caution whenever those data are used to support inferences in a context for which they were not collected. This is why data from LSAs, intended to inform public policy and system-wide analyses, should never be used to support inferences about an individual learner's progress or to determine the effectiveness of a particular instructor, and why classroom assessment data should not be used to inform public policy. These differences and cautions do not imply that either LSAs or classroom assessments are better than the other; rather, each should be used for its intended purpose and with awareness of its limitations. For example, the large sample sizes required to draw significant inferences from LSAs are an advantage when comparing across education systems, but a disadvantage when considering whether an individual learner has met the learning outcomes of a particular unit. Unfortunately, instructors tend to mimic the tools and processes of LSAs (selected-response items completed under high-security and speeded conditions) and then report raw scores or percentages. Lipnevich [-@lipnevichWhatGradesMean2020] reports that 78.8% of the courses examined used exams as part of the final grade calculation. Given that the vast majority of classes do not enrol enough learners to reach reasonable levels of statistical significance, it would be inappropriate to presume that the scores and grade inferences derived from these exams are as valid, reliable, or fair as they need to be.

Approaches to Assessment