Vasso Oikonomidou and Bessie Dendrinos
SCRIPT ASSESSMENT AND WRITING TASK DIFFICULTY: THE SCRIPT RATERS’ PERSPECTIVE
The demands of KPG writing activities differ from those of all other exam batteries. But the KPG is not alone in this: every exam battery differs in its demands, simply because demands and expectations for written production are related to a wide variety of factors inherent in the system itself: the view of language and language testing that the exam system embraces, the writing test specifications of each exam battery, its evaluation criteria and scoring grid, its assessment procedure, who its raters are, how well they have been trained, and so on. A comparison of writing task difficulty across exam batteries is therefore a tough job, and the outcomes, which often rest on the consideration of a single factor, are frequently of questionable quality. Nor is it an easy job to investigate and compare writing task difficulty across tests of the same exam battery. Here, too, one needs to adopt a multifaceted research approach and investigate separately each factor that may be linked with the degree of writing task difficulty, so as to be able to explain in what way each factor contributes to task difficulty or simplicity.
The English
KPG research team has been investigating
factors related to reading, writing, and
listening task difficulty. Where writing
task difficulty is concerned,[1]
one of the factors thought to play a
crucial role is the least obvious of
all: the views of script raters. Why do
their views play a role? Because, ultimately, it is the script raters who assess test-takers’ performance and assign the scores on the basis of which we measure the difficulty or simplicity of a writing task.
1. In systematically investigating KPG script raters’ views, we have witnessed how, for example, particular views of error gravity or a lack of familiarity with a specific genre can override our assessment criteria and detailed guidelines. The result may be intra- or
inter-rater unreliability. On the other
hand, we have also realized, through our
investigation, that rater views of
writing task difficulty do not correlate
significantly with test score results.
For instance, script raters may think
that writing mediation tasks are more
difficult than ‘regular’ writing tasks,
but this view is not backed up by the
scores on the writing tests.
2. The data of our study have been drawn from Script Rater Feedback Forms[2], which experienced KPG script raters were asked to complete and return to us in 2009. The forms included questions about which writing tasks raters consider most difficult for candidates and which they find most difficult to mark. Their
views are discussed below and compared
with the test score results,[3]
and with candidates’ views about
specific writing tasks, as recorded in
questionnaires which candidates
themselves completed, right after the
exam.[4]
Who are KPG script raters? – Their background
KPG script raters, whose views are discussed in this paper, are experienced English language teachers with a university degree, and many also hold postgraduate degrees. Many are experienced script raters, having worked for other examination systems, mainly the Panhellenic University Entrance Exams. Once selected to take part in the script rating process,[5] they must consistently attend training seminars and undergo on-the-job evaluation; otherwise, they are not called back to take part in the rating process.
Difficulty entailed in KPG script assessment
According to
KPG script rater views, marking
experience seems to enhance raters’
confidence regarding the assessment of
different types of writing tasks across
different proficiency levels[6].
In particular, the data reveal that
script raters consider B1 and C1 level
papers more difficult to mark than B2
level papers. Of course, this is not
surprising since these two levels were
introduced later than the B2 level exam.
What is more, the C1 level exam was introduced earlier than the B1 level exam, and it is the B1 level that raters consider the most difficult of all to mark, confirming once more our initial hunch that raters’ opinions relate directly to their experience. Also, in
each examination period there is a much
larger number of candidates sitting for
the B2 level exam than for any other
level. As a result, KPG raters have
received much more ‘on-the-job’ training
with B2 level script assessment.
Script rater marking
experience also relates to inter-rater
reliability (i.e., agreement between
raters on scores assigned to papers).
The highest inter-rater reliability is
detected in the scores assigned to B2
level papers, which raters are most
experienced in marking. Systematic
analysis of test scores shows that the
Intra-class Correlation Coefficient
(ICC) estimates are consistently higher
for candidate scores at B2 level than at
C1 level, implying that the more
experienced the script raters are, the
more reliable their marking is.
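For readers unfamiliar with the statistic, the sketch below illustrates one common way such an agreement estimate can be computed. It is a minimal Python illustration, not the KPG team’s actual analysis: the score matrix, the 0-20 scale and the choice of the ICC(2,1) variant are assumptions made purely for the example.

# Minimal sketch (invented data, not the KPG analysis): a two-way random-effects
# ICC(2,1), following Shrout & Fleiss (1979), for an n-scripts-by-k-raters matrix.
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """scores: n x k matrix, one row per script, one column per rater."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-script means
    col_means = scores.mean(axis=0)   # per-rater means

    ss_total = ((scores - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-script variation
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-rater variation
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical scores assigned by two raters to five scripts (0-20 scale).
b2_scores = np.array([[14, 15], [9, 10], [17, 17], [12, 11], [6, 7]])
print(f"ICC(2,1) = {icc_2_1(b2_scores):.2f}")  # values near 1 indicate close agreement

Comparing such estimates across proficiency levels and examination periods is the kind of check the analysis above refers to.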
This interpretation
is blurred, however, when we look at the
findings linked to the B1 level scores.
Analysis of the scores shows that the ICC for B1 level is higher than for B2 level, and this is attributed either to the lower level of test-taker proficiency or to the model text given to candidates as input, thus leading to a more homogeneous output (Hartzoulakis, this volume). This confirms another
assumption stated above: that rater
views do not correlate significantly
with test score results. In this case, we can see a discrepancy between script rater views about the difficulty of B1 level script assessment and the high marking consistency that the analysis shows. This result agrees with the findings of Hamp-Lyons and Matthias (1994), who argue that while EFL teachers and script raters largely agree about writing task difficulty, their views do not always correlate with the results of test score analysis, no matter how experienced they are.
Lack of
experience has also been viewed by
raters as the reason that they find the
‘mediation’ activity more difficult to
mark than the ‘regular’ writing
activity. By lack of experience here,
however, raters are referring to
classroom or general assessment
experience with mediation task
requirements. But quantitative analysis of test score data shows high correlations between the regular writing task scores and the mediation task scores for each exam period and each proficiency level. This means that, although script raters think mediation is more difficult to mark, the test score results show that it is not a more difficult task for test-takers. Their views, then, seem to be based on their attitudes and feelings toward mediation activities rather than on test-taker performance. Undoubtedly, they are affected by their anxiety over dealing with something as unfamiliar to them as mediation.[7]
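To make the kind of check involved concrete, the snippet below correlates hypothetical candidate scores on the two tasks; the figures are invented for illustration and do not come from the KPG data.

# Minimal sketch (invented figures, not the KPG data): how closely candidates'
# regular writing task scores track their mediation task scores.
import numpy as np
from scipy.stats import pearsonr

regular = np.array([15, 9, 18, 12, 7, 16, 11, 14, 10, 13])    # hypothetical 0-20 scores
mediation = np.array([14, 10, 17, 12, 8, 15, 12, 13, 9, 14])

r, p = pearsonr(regular, mediation)
print(f"r = {r:.2f} (p = {p:.3f})")  # a high r means the two tasks rank candidates similarly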
It seems, therefore,
that whatever is new or unfamiliar to
script raters creates insecurity
regarding script assessment. So, it is not surprising that ‘e-mails’, for example, are considered the easiest text-type to mark. Yet this view is not corroborated by the test score data, analysis of which shows that ICC estimates for e-mail tasks are consistent with those for regular writing activities.
Likewise, a
significant number of script raters
consider focusing on the assessment of
language appropriacy more difficult than
focusing on grammatical accuracy and
vocabulary range, i.e., the traditional
assessment criteria that most raters are
familiar with. Indeed, it has been dominant practice in ELT for many years to focus on the accuracy rather than the appropriacy of learners’ or test-takers’ performance.
What is consoling is
that training does work. Analysis of our
test score data over a period of three
years shows that raters learn to
apply the KPG English exam assessment
criteria as instructed and that, as a
result, ICC estimates increase over
time. This is why training is so
important for the KPG writing project.
Our goal is to minimize potential measurement error and ensure both inter- and intra-rater reliability. To this end, the KPG system uses another means of reducing potential measurement error: the double-rating of scripts. Many other examination systems do not use two raters, and scripts receive a single score, merely because it is less expensive to have one rater mark each script.
KPG writing task difficulty
KPG script raters consider mediation not only difficult to mark but also difficult for candidates to deal with. Yet their view does not reflect reality. That is, if what they think were right, test-takers’ performance in mediation tasks would be poor and their scores would be lower than in the regular writing tasks. However, a comparison of candidate scores in the two activities at each exam level and in each examination period shows that the average score difference is insignificant, and this, of course, is a factor in favour of the validity of the KPG writing test.
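As a concrete illustration of such a comparison, the sketch below runs a paired t-test on hypothetical scores for the same candidates on the two activities; the numbers are invented and are not the KPG results.

# Minimal sketch (invented figures, not the KPG results): is the mean difference
# between regular writing and mediation scores negligible for the same candidates?
import numpy as np
from scipy.stats import ttest_rel

regular = np.array([16, 11, 18, 13, 8, 15, 12, 14, 9, 17])    # hypothetical 0-20 scores
mediation = np.array([15, 12, 18, 12, 9, 14, 13, 13, 10, 16])

t_stat, p_value = ttest_rel(regular, mediation)
mean_diff = (regular - mediation).mean()
print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.2f}, p = {p_value:.3f}")
# A small mean difference with a non-significant p-value is what an
# 'insignificant average score difference' looks like in such an analysis.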
Nevertheless, the
above considerations only reflect script
rater perceptions of writing task
difficulty. Do candidate and script
rater views actually match? The data
from KPG candidate questionnaires, which
were completed after candidates
took the exam, show that candidate views on writing task difficulty do not always match rater perceptions. This finding agrees with the findings of other research projects (cf. Tavakoli, 2009). For instance, the data from candidate questionnaires administered in May 2009 show that candidates at B1
and B2 level do not consider the
mediation activity more difficult than
the regular writing task in the same
test.
It is interesting
that whereas test-takers judge
difficulty on the basis of how they
think they did on a particular test,
raters tend to think that the most
difficult aspect of writing is that
which they themselves are least familiar
with. For example, besides mediation, a
significant number of raters think that
what is exceedingly difficult for
test-takers is to respond to generic
conventions and produce writing with
lexico-grammatical features appropriate
for the genre in question. However, our
research shows that even without
preparation classes, candidates are able
to draw on their background knowledge,
meet such requirements and perform
successfully.
It seems, therefore,
that familiarity with the type of
writing test and approach to script
assessment is regarded by KPG script
raters as one of the greatest factors
affecting task difficulty. Of course, familiarity does play a significant role in test performance and script assessment, and it is therefore gratifying to see that marking deviation is reduced over time as a result of training.
Other factors affecting difficulty have also been investigated, and each is important as well as exceedingly interesting; we shall discuss them in a future KPG Corner article.
References
Dendrinos, B. (this volume). Testing and teaching mediation.
Hamp-Lyons, L., & Matthias, S. P. (1994). Examining expert judgments of task difficulty on essay tests. Journal of Second Language Writing, 3(1), 49-68.
Hartzoulakis, V. (in press). Rater agreement in writing assessment. In Directions in English Language Teaching and Testing, Vol. 1. Athens: RCeL Publications, University of Athens.
Karavas, E. (2009). Training script raters for the KPG exams. Available at: http://rcel.enl.uoa.gr/kpg/gr_kpgcorner_jun2009.htm
Tavakoli, P. (2009). Investigating task difficulty: Learners’ and teachers’ perceptions. International Journal of Applied Linguistics, 19(1), 1-25.