
Bayesian Networks in Educational Assessment

Cognition and Assessment




Language Testing Examples

This hypothetical language test was introduced in Mislevy (1995) in order to illustrate how Bayesian networks could be used to untangle evidence coming from "integrated tasks"---tasks that tap more than one language modality.


Language generally consists of four modalities: Reading, Writing, Speaking and Listening. Reading and Listening can be measured in isolation. For example, multiple-choice Reading questions can have stimulus material, instructions and options all presented as written text. Similarly, Listening items can be created using audio instructions, stimulus and options.

Writing and Speaking are more difficult to test in isolation: at a minimum, the instructions need to be given in either text or audio form. However, Integrated Tasks, which use more than one modality, are highly valued in the language education community. After all, conversation consists of alternately using receptive and productive language skills.

The problem with integrated tasks is that if the student does not respond correctly, the blame must be apportioned among the skills being tested. For example, if, after reading a short passage, a student asks a question that is difficult to understand and may or may not be on topic, how do we know whether the student is having difficulty with Reading, with Speaking, or with both?
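A Bayesian network handles this apportionment by updating all of the skill variables jointly from the same observation. The following is a minimal sketch in plain Python, with a hypothetical two-level, two-skill version of the reading-then-speaking example; the probabilities are invented for illustration (the actual models use three-level skills and calibrated conditional probability tables):

```python
# Hypothetical conjunctive task: a spoken question about a reading
# passage tends to succeed only when both Reading and Speaking are High.
from itertools import product

# Prior proficiencies (assumed independent here for simplicity).
p_read = {"High": 0.7, "Low": 0.3}
p_speak = {"High": 0.6, "Low": 0.4}

# Evidence model: P(correct response | Reading, Speaking).
p_correct = {
    ("High", "High"): 0.85,
    ("High", "Low"): 0.25,
    ("Low", "High"): 0.30,
    ("Low", "Low"): 0.05,
}

def posterior(outcome_correct: bool) -> dict:
    """Joint posterior over (Reading, Speaking) given the task outcome."""
    joint = {}
    for r, s in product(p_read, p_speak):
        like = p_correct[(r, s)] if outcome_correct else 1 - p_correct[(r, s)]
        joint[(r, s)] = p_read[r] * p_speak[s] * like
    z = sum(joint.values())  # normalizing constant P(outcome)
    return {k: v / z for k, v in joint.items()}

post = posterior(outcome_correct=False)
p_read_low = sum(v for (r, s), v in post.items() if r == "Low")
p_speak_low = sum(v for (r, s), v in post.items() if s == "Low")
print(f"P(Reading=Low | wrong)  = {p_read_low:.2f}")   # up from prior 0.30
print(f"P(Speaking=Low | wrong) = {p_speak_low:.2f}")  # up from prior 0.40
```

Both marginal posteriors rise after an incorrect response, so neither skill is blamed exclusively; the network splits the evidence according to the priors and the conditional probability table.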

In this simple example, the receptive modalities, Reading and Listening, are tested with pure single-proficiency tasks, and the productive modalities, Writing and Speaking, are tested with Integrated Tasks.

Proficiency Models

The proficiency model has four variables, Reading, Listening, Writing and Speaking, all of which are modelled with three levels. This is a saturated model, with all four variables dependent on each other. The receptive skills, Reading and Listening, are placed earlier in the ordering because foreign language learners usually acquire those skills first.
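In the stated order, the saturated model factors by the chain rule as P(Reading) P(Listening | Reading) P(Writing | Reading, Listening) P(Speaking | Reading, Listening, Writing). A quick sketch of the parameter bookkeeping (the three-level assumption comes from the text; the count itself is just arithmetic):

```python
# Free parameters in the chain-rule factorization of the saturated
# proficiency model: each conditional row over a 3-level variable has
# 3 - 1 = 2 free probabilities, and a variable with i parents
# (3 levels each) has 3**i rows in its table.
levels = 3
order = ["Reading", "Listening", "Writing", "Speaking"]

free_params = {}
for i, var in enumerate(order):
    n_parent_configs = levels ** i          # rows in the CPT
    free_params[var] = n_parent_configs * (levels - 1)

for var, n in free_params.items():
    print(f"{var:>9}: {n} free parameters")
print("Total:", sum(free_params.values()))  # 3**4 - 1 = 80
```

The total, 80, equals the 3^4 − 1 free cells of the full joint distribution, which is what "saturated" means: the factorization imposes no independence constraints.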

Language Proficiency Model

Task Models

There are four task models taken from Mislevy (1995):

  • Task A (Reading Task) -- an extended reading task with partial credit scoring available.
  • Task B (Reading/Writing Task) -- this involves reading a complex passage and producing a short written response.
  • Task C (Reading/Listening/Speaking Task) -- the student listens to a taped conversation with a transcript provided, and is asked to produce a spoken response. Note that either Reading or Listening skills can be used to extract the needed information.
  • Task D (Listening Task) -- this task asks the student to listen to a taped conversation and indicate by raising his or her hand when the business is complete.

As this was a conceptual paper, the test was never actually built. The simulations below assume that variants can be made of these tasks. The "fixed" simulations assume that the variants are identical except for incidental task model variables. The "random" simulations assume that the variants range in difficulty around the default parameters in the evidence model. The "high" ("low") variant assumes that the radical task model variables have been manipulated to make the task very difficult (easy).

Evidence Models

This section only gives an outline of the evidence models. For conditional probability tables, see the links below.

  • Task A (Reading Task) -- The work product for this task is scored as an observable with four levels "Poor", "Okay", "Good" and "Very Good". (Note that this scoring scheme has more response levels than there are proficiency levels, which is not necessarily an optimal arrangement. Scoring at three levels might be preferable, or else expanding the number of levels in the reading proficiency variable.) The single observable Outcome_R taps only the Reading skill.
  • Task B (Reading/Writing Task) -- The work product is some kind of writing which is scored on two dimensions: the quality of the writing (good or bad) and whether or not the writing was on topic. The observable variable Outcome_RW has four levels related to the possible combinations of writing quality and appropriateness. (This might be better expressed with two outcome variables, and the parametric variant of the task reparameterizes the model in that way). The single observable Outcome_RW is linked to both the Reading and Writing proficiency variables.
  • Task C (Reading/Speaking/Listening Task) -- This work product is again scored as an observable with four levels, "Poor" through "Very Good". The Outcome_RLS observable is connected to the Reading, Listening and Speaking variables. As Reading and Listening enter into the model in a disjunctive way, the parametric version of the model introduces an intermediate node to capture this effect.
  • Task D (Listening Task) -- The work product here is scored "Right" or "Wrong". The observable variable Outcome_L depends only on the Listening proficiency variable.
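The disjunctive Reading/Listening combination in Task C can be pictured as an intermediate variable that takes the better of the two receptive skills, so that strength in either modality suffices to extract the content. A minimal deterministic sketch of that intermediate node (the level labels are hypothetical, and the actual parametric model uses a probabilistic version of this idea):

```python
# Disjunctive (max) combination of the two receptive skills: the
# intermediate "input" skill is the better of Reading and Listening.
levels = ["Novice", "Intermediate", "Advanced"]  # hypothetical labels

def input_skill(reading: str, listening: str) -> str:
    """Deterministic disjunctive combination: take the higher level."""
    return max(reading, listening, key=levels.index)

print(input_skill("Novice", "Advanced"))      # Advanced
print(input_skill("Intermediate", "Novice"))  # Intermediate
```

The spoken outcome would then depend on this intermediate variable together with Speaking, rather than on Reading and Listening directly, which keeps the conditional probability table small.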

The picture below shows the proficiency model and all four evidence models (using the original Mislevy, 1995, model structures and parameterization).

Full Motif

Assembly Model

The original Mislevy (1995) paper did not provide a complete assessment; it only described four hypothetical tasks. As this is rather short, we created longer forms by treating the original tasks as task models and replicating them. We produced two forms:

short form
This had 5 Reading, 3 Reading/Writing, 3 Reading/Speaking/Listening and 5 Listening tasks for a total of 16 tasks.
long form
This had 10 Reading, 6 Reading/Writing, 6 Reading/Speaking/Listening and 10 Listening tasks for a total of 32 tasks.

The replication was done by generating new links with the same structure as the evidence model. The parameters for the links were determined by one of four algorithms:

  • Fixed -- the parameters of the evidence model were simply copied, creating a number of identical variants of the task.
  • Random -- the parameters for the links were drawn randomly from the prior laws for the evidence model parameters.
  • High -- the parameters for the links were taken at the 95th percentile of the prior laws.
  • Low -- the parameters for the links were taken at the 5th percentile of the prior laws.
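Under the simplifying assumption that each prior law is a normal distribution over a single difficulty parameter (the real evidence models have several parameters per link; the names and numbers here are invented), the four algorithms can be sketched as:

```python
# Sketch of the four link-parameter algorithms, with a hypothetical
# normal prior law on one difficulty parameter (higher = harder task).
import random
from statistics import NormalDist

prior = NormalDist(mu=0.0, sigma=0.5)  # hypothetical prior law
default = prior.mean                   # evidence-model default value

def link_difficulty(variant: str, rng: random.Random) -> float:
    if variant == "fixed":    # identical copies of the evidence model
        return default
    if variant == "random":   # draw from the prior law
        return rng.gauss(prior.mean, prior.stdev)
    if variant == "high":     # 95th percentile: a hard variant
        return prior.inv_cdf(0.95)
    if variant == "low":      # 5th percentile: an easy variant
        return prior.inv_cdf(0.05)
    raise ValueError(variant)

rng = random.Random(1995)
for v in ["fixed", "random", "high", "low"]:
    print(f"{v:>6}: {link_difficulty(v, rng):+.3f}")
```

A simulated form is then a collection of such links, so mixing and matching variants amounts to choosing a different algorithm per link.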

In theory, one could create new simulated test designs by mixing and matching from the various variants.

Data Sets

The complete model descriptions for both the original and reparameterized models, several randomly generated test forms, and random data sets generated from those test forms are available at:


Invalid BibTex Entry!

Page last modified on November 04, 2014, at 01:59 PM