ELT News, September 2010

Script Assessment and Writing Task Difficulty: The Script Raters’ Perspective

Vasso Oikonomidou and Bessie Dendrinos
The demands of
KPG writing activities are different
from those of all other exam
batteries, a recent “KPG Corner”
article points out. But KPG is not
unique or alone in this. Each and
every exam battery is different in
its demands, simply because demands
and expectations for writing
production are related to a wide
variety of factors inherent in the
system itself; factors such as the
view of language and language
testing that the exam system
embraces, the writing test
specifications of each exam battery,
its evaluation criteria and scoring
grid, its assessment procedure, who
its raters are, how well they have
been trained, etc. Therefore, a
comparison of writing task
difficulty across exam batteries is
a tough job and the outcomes, which
often result from the consideration
of a single factor, are frequently
of questionable quality. Nor is it
an easy job to investigate and
compare writing task difficulty
across tests of the same exam
battery. Here, too, one needs to
adopt a multifaceted research
approach and separately investigate
each factor that may be linked with
the degree of writing task
difficulty, so as to be able to
explain in what way each factor
contributes to task difficulty or
simplicity.
The
English KPG research team has been
investigating factors related to
reading, writing, and listening task
difficulty. Where writing task
difficulty is concerned,[1]
one of the factors thought to play a
crucial role is the least obvious of
all: the views of script raters. Why
do their views play a role? Because,
ultimately, it is script raters that
assess test-takers’ performance and
finally assign the scores, on the
basis of which we measure the
difficulty or simplicity of a
writing task.
In systematically investigating KPG script raters’ views, we have witnessed how, for example, particular views of error gravity or unfamiliarity with a specific genre can override our assessment criteria and detailed guidelines. The result may be intra- or inter-rater unreliability. On the other hand, we
have also realized, through our
investigation, that rater views of
writing task difficulty do not
correlate significantly with test
score results. For instance, script
raters may think that writing
mediation tasks are more difficult
than ‘regular’ writing tasks, but
this view is not backed up by the
scores on the writing tests.
The data
of our study have been drawn from
Script Rater Feedback Forms[2],
which experienced KPG script raters
were asked to complete and return to
us in 2009. These forms included questions about which writing tasks raters think
are most difficult for candidates
and which are most difficult for
them to mark. Their views are
discussed below and compared with
the test score results,[3]
and with candidates’ views about
specific writing tasks, as recorded
in questionnaires which candidates
themselves completed, right after
the exam.[4]
Who are KPG script raters? – Their background

KPG script raters, whose views are
discussed in this paper, are
experienced English language
teachers with a university degree.
Many of them also hold postgraduate degrees, and many are experienced
script raters, having worked for
other examination systems, mainly
for the Panhellenic University
Entrance Exams. Once they are
selected to take part in the script
rating process,[5]
they must consistently attend
training seminars and go through
on-the-job evaluation. Otherwise,
they are not called back to take
part in the rating process.
Difficulty entailed in KPG script assessment

According to KPG script rater views,
marking experience seems to enhance
raters’ confidence regarding the
assessment of different types of
writing tasks across different
proficiency levels[6].
In particular, the data reveal that
script raters consider B1 and C1
level papers more difficult to mark
than B2 level papers. Of course,
this is not surprising since these
two levels were introduced later
than the B2 level exam. What is
more, the C1 level exam was
introduced earlier than the B1
level, which is considered the most
difficult of all to mark, confirming
once more our initial hunch that
raters’ opinions relate directly to their experience. Also, in each
examination period there is a much
larger number of candidates sitting
for the B2 level exam than for any
other level. As a result, KPG raters
have received much more ‘on-the-job’
training with B2 level script
assessment.
Script rater
marking experience also relates to
inter-rater reliability (i.e.,
agreement between raters on scores
assigned to papers). The highest
inter-rater reliability is detected
in the scores assigned to B2 level
papers, which raters are most
experienced in marking. Systematic
analysis of test scores shows that
the Intra-class Correlation
Coefficient (ICC) estimates are
consistently higher for candidate
scores at B2 level than at C1 level,
implying that the more experienced
the script raters are, the more
reliable their marking is.
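To give a concrete sense of what such an agreement estimate involves, the short Python sketch below computes an ICC(2,1) (two-way random effects, absolute agreement, single rater) for a small set of double-rated scripts. The marks and the choice of this particular ICC variant are illustrative assumptions only; they do not reproduce the KPG team’s actual data or analysis.

import numpy as np

def icc2_1(scores):
    # ICC(2,1): two-way random effects, absolute agreement, single rater.
    # 'scores' is an (n scripts x k raters) array of marks.
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)   # per-script means
    col_means = scores.mean(axis=0)   # per-rater means
    ms_rows = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
    ms_cols = n * ((col_means - grand_mean) ** 2).sum() / (k - 1)
    ss_error = (((scores - grand_mean) ** 2).sum()
                - ms_rows * (n - 1) - ms_cols * (k - 1))
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n)

# Hypothetical marks (0-20) given by two raters to five scripts
marks = np.array([[14, 15], [9, 10], [17, 16], [12, 12], [7, 9]])
print(round(icc2_1(marks), 3))

The closer the estimate is to 1, the more closely the two raters agree on the scores they assign.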
This
interpretation is blurred, however,
when we look at the findings linked
to the B1 level scores. Analysis of
the scores shows that the ICC for B1
level is higher than for B2 level
and this is attributed to either the
lower level of test-taker
proficiency or to the model text
given to candidates as input, thus
leading to a more homogeneous output
(Hartzoulakis, in press). This confirms
another assumption stated above:
that rater views do not correlate
significantly with test score
results. In this case, we can see a
discrepancy between script rater
views about difficulty of B1 level
script assessment and the high
marking consistency that analysis
shows. This result agrees with
findings by Hamp-Lyons & Matthias
(1994), on the basis of which they
argue that while EFL teachers and
script raters share considerable
agreement about writing task
difficulty, their views do not
always correlate with the results of
test score analysis, no matter how
experienced they are.
Lack of
experience has also been viewed by
raters as the reason that they find
the ‘mediation’ activity more
difficult to mark than the ‘regular’
writing activity. By lack of
experience here, however, raters are
referring to classroom or general
assessment experience with mediation
task requirements. But, quantitative
analysis of test score data shows
high correlations between the
regular writing task and the
mediation task scores for each exam
period and each proficiency level.
This means that though script raters
think that mediation is more
difficult to mark, the test score
results show that it is not more
difficult a task for test takers.
It follows that their views are
based on their attitudes and
feelings toward mediation
activities, rather than test-taker
performance. Undoubtedly, they are
affected by their feelings of
anxiety over dealing with something
as unfamiliar to them as mediation
is.[7]
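The kind of evidence referred to here can be illustrated with a few lines of Python that correlate per-candidate scores on the two tasks; the figures below and the use of a simple Pearson correlation are assumptions made purely for illustration, not the KPG team’s actual data or method.

from scipy.stats import pearsonr

# Hypothetical scores of the same candidates on the 'regular' writing task
# and the mediation task of one exam paper (same marking scale)
regular   = [11, 8, 14, 9, 12, 13, 7, 10]
mediation = [10, 9, 13, 8, 12, 14, 6, 11]

r, p = pearsonr(regular, mediation)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")  # a high r means the two score sets move together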
It seems,
therefore, that whatever is new or
unfamiliar to script raters creates
insecurity regarding script
assessment. So, it is not surprising that ‘e-mails’, for example, are considered the easiest text-type to mark. Yet, this view is not corroborated by test score data, the analysis of which shows that ICC estimates for e-mail tasks are in line with those for the other regular writing activities.
Likewise, a
significant number of script raters
consider focusing on the assessment
of language appropriacy more
difficult than focusing on
grammatical accuracy and vocabulary
range, i.e., the traditional
assessment criteria that most raters
are familiar with. Indeed, for many years the dominant practice in ELT has been to focus on the accuracy rather than the appropriacy of learners’ or test-takers’ performance.
What is consoling
is that training does work. Analysis
of our test score data over a period
of three years shows that raters
learn to apply the KPG English
exam assessment criteria as
instructed and that, as a result,
ICC estimates increase over time.
This is why training is so important
for the KPG writing project. Our
goal is to minimize potential
measurement error and assure both
inter- and intra-rater reliability.
To this end, the KPG system uses
another means to assure reduction of
potential measurement error: the
double-rating of scripts. Many other
examination systems fail to use two
raters and scripts get a single
score, merely because it is less
expensive to have one rater marking
each script.
KPG writing task difficulty

KPG script raters consider mediation not only difficult to mark but also difficult for candidates to deal with. Yet their view does not reflect reality. That is, if what they think were right, test-takers’ performance in mediation tasks would be poor and their scores would be lower than in the regular writing tasks. However, a comparison of candidate scores in both activities at each exam level and in each examination period shows that the average score difference is insignificant, and this, of course, counts in favour of the validity of the KPG writing test.
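A comparison of average scores of this kind can be sketched in a few lines of Python; the paired scores and the use of a paired t-test below are, once again, illustrative assumptions rather than the KPG team’s actual data or procedure.

from scipy.stats import ttest_rel
import numpy as np

# Hypothetical paired scores of ten candidates on the two activities
regular   = np.array([11, 8, 14, 9, 12, 13, 7, 10, 12, 9])
mediation = np.array([10, 9, 13, 8, 12, 14, 6, 11, 11, 10])

t, p = ttest_rel(regular, mediation)
print(f"mean difference = {(regular - mediation).mean():.2f}, "
      f"t = {t:.2f}, p = {p:.3f}")  # a large p suggests no significant difference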
Nevertheless, the
above considerations only reflect
script rater perceptions of writing
task difficulty. Do candidate and
script rater views actually match?
The data from KPG candidate
questionnaires, which were completed
after candidates took the
exam, show that candidate views on
writing task difficulty do not
always match rater perceptions.
This finding of ours agrees with the
findings of other research projects
(cf. Tavakoli, 2009). For instance,
the data from candidate
questionnaires delivered in May 2009
show that candidates at B1 and B2
level do not consider the mediation
activity more difficult than the
regular writing task in the same
test.
It is interesting
that whereas test-takers judge
difficulty on the basis of how they
think they did on a particular test,
raters tend to think that the most
difficult aspect of writing is that
which they themselves are least
familiar with. For example, besides
mediation, a significant number of
raters think that what is
exceedingly difficult for
test-takers is to respond to generic
conventions and produce writing with
lexico-grammatical features
appropriate for the genre in
question. However, our research
shows that even without preparation
classes, candidates are able to draw
on their background knowledge, meet
such requirements and perform
successfully.
It seems,
therefore, that familiarity with the
type of writing test and approach to
script assessment is regarded by KPG
script raters as one of the greatest
factors affecting task difficulty.
Of course, familiarity does play a
significant role in test performance
and script assessment, and we are therefore pleased to see that marking deviation is reduced over time as a result of training.
Other factors
affecting difficulty have been
investigated and each one is
important as well as exceedingly
interesting. In a future KPG Corner
article, we shall discuss other
factors which affect writing task
difficulty.
References

Dendrinos, B. (in press). Testing and Teaching Mediation. Directions in English Language Teaching and Testing, Vol. 1. RCeL Publications, University of Athens.

Hamp-Lyons, L. & Matthias, S. P. (1994). Examining expert judgments of task difficulty on essay tests. Journal of Second Language Writing 3(1): 49-68.

Hartzoulakis, V. (in press). Rater Agreement in Writing Assessment. Directions in English Language Teaching and Testing, Vol. 1. RCeL Publications, University of Athens.

Tavakoli, P. (2009). Investigating task difficulty: learners’ and teachers’ perceptions. International Journal of Applied Linguistics 19(1): 1-25.