Vasso Oikonomidou and Bessie Dendrinos
SCRIPT ASSESSMENT AND WRITING TASK DIFFICULTY: THE SCRIPT RATERS’ PERSPECTIVE
The demands of KPG writing activities differ from those of all other exam batteries. But the KPG is not alone in this: every exam battery differs in its demands, simply because demands and expectations for written production are related to a wide variety of factors inherent in the system itself: the view of language and language testing that the exam system embraces, the writing test specifications of each exam battery, its evaluation criteria and scoring grid, its assessment procedure, who its raters are, how well they have been trained, and so on. A comparison of writing task difficulty across exam batteries is therefore a tough job, and the outcomes, which often rest on the consideration of a single factor, are frequently of questionable quality. Nor is it an easy job to investigate and compare writing task difficulty across tests of the same exam battery. Here, too, one needs to adopt a multifaceted research approach and investigate separately each factor that may be linked with the degree of writing task difficulty, so as to be able to explain in what way each factor contributes to task difficulty or simplicity.
The English
KPG research team has been investigating
factors related to reading, writing, and
listening task difficulty. Where writing
task difficulty is concerned,[1]
one of the factors thought to play a
crucial role is the least obvious of
all: the views of script raters. Why do
their views play a role? Because, ultimately, it is the script raters who assess test-takers’ performance and assign the scores on the basis of which we measure the difficulty or simplicity of a writing task.
1. In systematically investigating KPG script raters’ views, we have witnessed how, for example, particular views of error gravity or a lack of familiarity with a specific genre can override our assessment criteria and detailed guidelines. The result may be intra- or
inter-rater unreliability. On the other
hand, we have also realized, through our
investigation, that rater views of
writing task difficulty do not correlate
significantly with test score results.
For instance, script raters may think
that writing mediation tasks are more
difficult than ‘regular’ writing tasks,
but this view is not backed up by the
scores on the writing tests.
2. The data of our study have been drawn from Script Rater Feedback Forms[2], which experienced KPG script raters were asked to complete and return to us in 2009. The forms included questions about which writing tasks raters consider most difficult for candidates and which they find most difficult to mark. Their
views are discussed below and compared
with the test score results,[3]
and with candidates’ views about
specific writing tasks, as recorded in
questionnaires which candidates
themselves completed, right after the
exam.[4]
Who are KPG script raters? – Their background
KPG script raters, whose views are discussed in this paper, are experienced English language teachers with a university degree, and many also hold postgraduate degrees. Many are experienced script raters, having worked for other examination systems, mainly the Panhellenic University Entrance Exams. Once selected to take part in the script rating process,[5] they must consistently attend training seminars and undergo on-the-job evaluation; otherwise, they are not called back to take part in the rating process.
Difficulty entailed in KPG script assessment
According to
KPG script rater views, marking
experience seems to enhance raters’
confidence regarding the assessment of
different types of writing tasks across
different proficiency levels[6].
In particular, the data reveal that
script raters consider B1 and C1 level
papers more difficult to mark than B2
level papers. Of course, this is not
surprising since these two levels were
introduced later than the B2 level exam.
What is more, the C1 level exam was introduced earlier than the B1 level exam, and it is the B1 level that raters consider the most difficult of all to mark, confirming once more our initial hunch that raters’ opinions relate directly to their experience. Also, in
each examination period there is a much
larger number of candidates sitting for
the B2 level exam than for any other
level. As a result, KPG raters have
received much more ‘on-the-job’ training
with B2 level script assessment.
Script rater marking
experience also relates to inter-rater
reliability (i.e., agreement between
raters on scores assigned to papers).
The highest inter-rater reliability is
detected in the scores assigned to B2
level papers, which raters are most
experienced in marking. Systematic
analysis of test scores shows that the
Intra-class Correlation Coefficient
(ICC) estimates are consistently higher
for candidate scores at B2 level than at
C1 level, implying that the more
experienced the script raters are, the
more reliable their marking is.
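For readers unfamiliar with the statistic, the sketch below illustrates one common way such an agreement estimate can be computed. It is a minimal Python illustration, not the KPG team’s actual analysis: the score matrix, the 0-20 scale and the choice of the ICC(2,1) variant are assumptions made purely for the example.

# Minimal sketch (invented data, not the KPG analysis): a two-way random-effects
# ICC(2,1), following Shrout & Fleiss (1979), for an n-scripts-by-k-raters matrix.
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """scores: n x k matrix, one row per script, one column per rater."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-script means
    col_means = scores.mean(axis=0)   # per-rater means

    ss_total = ((scores - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-script variation
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-rater variation
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical scores assigned by two raters to five scripts (0-20 scale).
b2_scores = np.array([[14, 15], [9, 10], [17, 17], [12, 11], [6, 7]])
print(f"ICC(2,1) = {icc_2_1(b2_scores):.2f}")  # values near 1 indicate close agreement

Comparing such estimates across proficiency levels and examination periods is the kind of check the analysis above refers to.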
This interpretation
is blurred, however, when we look at the
findings linked to the B1 level scores.
Analysis of the scores shows that the ICC for B1 level is higher than for B2 level, and this is attributed either to the lower level of test-taker proficiency or to the model text given to candidates as input, thus leading to a more homogeneous output (Hartzoulakis, this volume). This confirms another
assumption stated above: that rater
views do not correlate significantly
with test score results. In this case, we can see a discrepancy between script rater views about the difficulty of B1 level script assessment and the high marking consistency that the analysis shows. This result agrees with the findings of Hamp-Lyons and Matthias (1994), who argue that while EFL teachers and script raters largely agree about writing task difficulty, their views do not always correlate with the results of test score analysis, no matter how experienced they are.
Lack of
experience has also been viewed by
raters as the reason that they find the
‘mediation’ activity more difficult to
mark than the ‘regular’ writing
activity. By lack of experience here,
however, raters are referring to
classroom or general assessment
experience with mediation task
requirements. But quantitative analysis of test score data shows high correlations between the regular writing task scores and the mediation task scores for each exam period and each proficiency level. This means that, although script raters think mediation is more difficult to mark, the test score results show that it is not a more difficult task for test-takers. Their views, then, seem to be based on their attitudes and feelings toward mediation activities rather than on test-taker performance. Undoubtedly, they are affected by their anxiety over dealing with something as unfamiliar to them as mediation.[7]
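To make the kind of check involved concrete, the snippet below correlates hypothetical candidate scores on the two tasks; the figures are invented for illustration and do not come from the KPG data.

# Minimal sketch (invented figures, not the KPG data): how closely candidates'
# regular writing task scores track their mediation task scores.
import numpy as np
from scipy.stats import pearsonr

regular = np.array([15, 9, 18, 12, 7, 16, 11, 14, 10, 13])    # hypothetical 0-20 scores
mediation = np.array([14, 10, 17, 12, 8, 15, 12, 13, 9, 14])

r, p = pearsonr(regular, mediation)
print(f"r = {r:.2f} (p = {p:.3f})")  # a high r means the two tasks rank candidates similarly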
It seems, therefore,
that whatever is new or unfamiliar to
script raters creates insecurity
regarding script assessment. So, it is not surprising that ‘e-mails’, for example, are considered the easiest text-type to mark. Yet this view is not corroborated by the test score data, analysis of which shows that ICC estimates for e-mail tasks are consistent with those for regular writing activities.
Likewise, a
significant number of script raters
consider focusing on the assessment of
language appropriacy more difficult than
focusing on grammatical accuracy and
vocabulary range, i.e., the traditional
assessment criteria that most raters are
familiar with. Indeed, it has been dominant practice in ELT for many years to focus on the accuracy rather than the appropriacy of learners’ or test-takers’ performance.
What is consoling is
that training does work. Analysis of our
test score data over a period of three
years shows that raters learn to
apply the KPG English exam assessment
criteria as instructed and that, as a
result, ICC estimates increase over
time. This is why training is so
important for the KPG writing project.
Our goal is to minimize potential measurement error and ensure both inter- and intra-rater reliability. To this end, the KPG system uses another means of reducing potential measurement error: the double-rating of scripts. Many other examination systems do not use two raters, and scripts receive a single score, merely because it is less expensive to have one rater mark each script.
KPG writing task difficulty
KPG script raters consider mediation not only difficult to mark but also difficult for candidates to deal with. Yet their view does not reflect reality. That is, if what they think were right, test-takers’ performance in mediation tasks would be poor and their scores would be lower than in the regular writing tasks. However, a comparison of candidate scores in the two activities at each exam level and in each examination period shows that the average score difference is insignificant, and this, of course, is a factor in favour of the validity of the KPG writing test.
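As a concrete illustration of such a comparison, the sketch below runs a paired t-test on hypothetical scores for the same candidates on the two activities; the numbers are invented and are not the KPG results.

# Minimal sketch (invented figures, not the KPG results): is the mean difference
# between regular writing and mediation scores negligible for the same candidates?
import numpy as np
from scipy.stats import ttest_rel

regular = np.array([16, 11, 18, 13, 8, 15, 12, 14, 9, 17])    # hypothetical 0-20 scores
mediation = np.array([15, 12, 18, 12, 9, 14, 13, 13, 10, 16])

t_stat, p_value = ttest_rel(regular, mediation)
mean_diff = (regular - mediation).mean()
print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.2f}, p = {p_value:.3f}")
# A small mean difference with a non-significant p-value is what an
# 'insignificant average score difference' looks like in such an analysis.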
Nevertheless, the
above considerations only reflect script
rater perceptions of writing task
difficulty. Do candidate and script
rater views actually match? The data
from KPG candidate questionnaires, which
were completed after candidates
took the exam, show that candidate views on writing task difficulty do not always match rater perceptions. This finding agrees with the findings of other research projects (cf. Tavakoli, 2009). For instance, the data from candidate questionnaires administered in May 2009 show that candidates at B1
and B2 level do not consider the
mediation activity more difficult than
the regular writing task in the same
test.
It is interesting
that whereas test-takers judge
difficulty on the basis of how they
think they did on a particular test,
raters tend to think that the most
difficult aspect of writing is that
which they themselves are least familiar
with. For example, besides mediation, a
significant number of raters think that
what is exceedingly difficult for
test-takers is to respond to generic
conventions and produce writing with
lexico-grammatical features appropriate
for the genre in question. However, our
research shows that even without
preparation classes, candidates are able
to draw on their background knowledge,
meet such requirements and perform
successfully.
It seems, therefore,
that familiarity with the type of
writing test and approach to script
assessment is regarded by KPG script
raters as one of the greatest factors
affecting task difficulty. Of course, familiarity does play a significant role in test performance and script assessment, and it is therefore gratifying to see that marking deviation is reduced over time as a result of training.
Other factors affecting difficulty have also been investigated, and each is important as well as exceedingly interesting; we shall discuss them in a future KPG Corner article.
References
Dendrinos, B. (this volume). Testing and teaching mediation.
Hamp-Lyons, L., & Matthias, S. P. (1994). Examining expert judgments of task difficulty on essay tests. Journal of Second Language Writing, 3(1), 49-68.
Hartzoulakis, V. (in press). Rater agreement in writing assessment. In Directions in English Language Teaching and Testing, Vol. 1. Athens: RCeL Publications, University of Athens.
Karavas, E. (2009). Training script raters for the KPG exams. Available at: http://rcel.enl.uoa.gr/kpg/gr_kpgcorner_jun2009.htm
Tavakoli, P. (2009). Investigating task difficulty: Learners’ and teachers’ perceptions. International Journal of Applied Linguistics, 19(1), 1-25.