HEAP Health Literacy: FAQ

Are the HEAP assessment items valid and reliable?

The purpose of the HEAP item development project was to generate items that member states could use to develop their own assessments for statewide, district, school, or classroom use. The members recognized that each state would need to conduct its own field tests in order to generate item statistics useful for developing assessment forms. For the tryout, administered in 2000, the project therefore relied on convenience samples, which gave member states a high degree of flexibility in determining the details of administration. Each tryout site was allowed to select the content area of the items it administered, and the administration guidelines were deliberately lenient. The modules within a given grade level and content area were spiraled, however, so that the groups of examinees taking the different modules within a content area could generally be assumed to be randomly equivalent.

In spring 2000, the items were distributed to approximately 1,600 classrooms that had volunteered to participate in an item tryout. The purpose of this tryout was to evaluate the performance of the items rather than the performance of the students who participated. This item tryout was part of the process of developing the item pool.

Following the tryout, all selected response, short answer and extended response items were scored. Statistical analyses were then carried out on these items. The goal of the analysis was to determine how the items functioned and to identify items that potentially needed further review. Possible problems that these analyses would help to identify include:

  • Items that were too difficult: for example, items with a very low percentage of examinees answering correctly
  • Poorly functioning distractors for the selected response items: for example, a high percentage of examinees choosing a particular incorrect response
  • Problematic frequency distributions for the constructed response items: for example, a very high or very low percentage of examinees scoring in any one score category, particularly the highest or lowest categories.

Analyses were conducted to determine the difficulty of individual items, the difficulty of the overall pool, and the characteristics of the score distributions for the short answer and extended response items. Additional analyses included n-counts for each item, p-values and distractor response percentages for the selected response items, and inter-rater reliabilities for the constructed response items.
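
As an illustration of these classical item statistics, the sketch below shows how an n-count, p-value, and distractor response percentages might be computed and used to flag a selected response item for review. It is a minimal Python example, not the project's actual analysis code; the response data and flagging thresholds are invented for illustration.

```python
from collections import Counter

# Illustrative response data for one selected response item.
# Each entry is the option one examinee chose; the values are hypothetical.
responses = ["A", "C", "B", "C", "C", "D", "C", "B", "C", "A", "C", "C"]
correct_option = "C"

n_count = len(responses)                     # number of examinees responding
counts = Counter(responses)
p_value = counts[correct_option] / n_count   # proportion answering correctly

# Percentage of examinees choosing each option (keyed answer and distractors).
option_percentages = {opt: 100 * counts.get(opt, 0) / n_count
                      for opt in ["A", "B", "C", "D"]}

# Flag the item for further review; the cutoffs below are assumed, not HEAP's.
flags = []
if p_value < 0.25:
    flags.append("very low p-value: item may be too difficult")
for opt, pct in option_percentages.items():
    if opt != correct_option and pct > 100 * p_value:
        flags.append(f"distractor {opt} chosen more often than the key")

print(f"n = {n_count}, p-value = {p_value:.2f}")
print("option percentages:", option_percentages)
print("flags:", flags or "none")
```

A parallel check for the constructed response items would tabulate the percentage of examinees at each rubric score point and flag distributions concentrated in the highest or lowest categories.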

Caution must be taken in interpreting the results of the tryouts. Generalizing the results of the item tryout to a broader student population is not warranted, since no effort was made to draw a sample of students representative of any member state, of the member states collectively, or of the nation as a whole, and no effort was made to determine the extent to which the students in the HEAP tryouts were representative of the student population at any of these levels. The results of the item tryout must therefore be interpreted in the context of the tryout's design. As noted above, tryout sites were allowed to select the content area they administered.

This self-selection process likely had an impact on the tryout statistics. In addition, due to the way that the modules were spiraled, direct comparisons of statistics are valid within content area, but they are probably not valid across content areas.

Questions have arisen regarding the validity and reliability of the items in the HEAP item pool. In a general sense, validity refers to "the extent to which a test measures what its authors or users claim it measures." While there are different types of validity, this project focused on establishing the content validity of the item pool. Content validity is the extent to which an assessment (or the item pool) reflects the depth and breadth of the knowledge and skills it is intended to measure, e.g., those specified in content standards.

To help establish the content validity of the health pool, the project adopted the Assessing Health Literacy: Assessment Framework as the basis for all development. This document delineated the content and skill standards to be assessed in the item pool. The items were written to align with the Assessment Framework, and the alignment of each item with the framework was thoroughly reviewed and evaluated. Items were reviewed both for content accuracy and for alignment to the content and skills specified in the Assessment Framework. Reviewers included the HEAP state members, the HEAP management team, and CDC/DASH, CCSSO, and ACT staff. The items were reviewed multiple times by these groups prior to the item tryout and again by HEAP members after the item tryout. Following each round of reviews, the HEAP management team met to evaluate and incorporate reviewer comments and recommendations. The review process was extensive and intensive, and all items included in the pool have been reviewed, evaluated, and approved for inclusion. On the basis of this process, the item pool is judged to have content validity.

Reliability refers to "the degree to which the scores of every individual are consistent over repeated applications of a measurement procedure and hence are dependable and repeatable; the degree to which scores are free of errors of measurement." Because the project did not include the construction of test forms, the reliability of student scores on test forms could not be addressed. However, one form of reliability evidence that can be established from the item tryout is inter-rater reliability.

For the constructed response items, the tryout allowed the project to examine the scoring system to determine whether it could be applied reliably across all item types. Inter-rater reliability is "the consistency of rater judgments of the work or performance of students from one rater to another." Twenty percent of the short answer and extended response items were scored by two readers for the purpose of estimating the reliability of scoring these items. Inter-rater reliabilities for the constructed response items were obtained by computing the correlation between the first and second raters' scores on this 20% re-score sample. The results were averaged by item type (short answer and extended response) within grade level for both the core concept and skill scores. The values ranged from 0.68 to 0.75, indicating good inter-rater reliability for a four-point scoring rubric.
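
To make the computation concrete, the following sketch estimates inter-rater reliability as the Pearson correlation (one common choice) between two raters' scores on a double-scored sample. The scores are invented for illustration, and the 1-4 scale is assumed from the four-point rubric mentioned above; this is not the project's actual scoring data or code.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical double-scored responses: each position is one student,
# scored independently by two raters on an assumed 1-4 rubric.
rater1 = [3, 2, 4, 1, 3, 2, 4, 3, 2, 1]
rater2 = [3, 2, 3, 1, 4, 2, 4, 3, 2, 2]

r = pearson(rater1, rater2)
print(f"inter-rater reliability (Pearson r) = {r:.2f}")
```

In the tryout itself, such correlations were computed on the 20% re-score sample and then averaged by item type within grade level, as described above.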

For the constructed response items, the interpretation of the scoring rubric remains fixed across years. In other words, a Level 4 student response in one year is equivalent to a Level 4 student response in other years.

As mentioned previously, the purpose of this tryout was to evaluate the performance of all the items, not the performance of the students who participated. The tryout served that purpose and provided important information regarding item quality. The item tryout allowed staff to evaluate how the items functioned: the relative difficulty of the items, the clarity and focus of the item prompts, and the approximate amount of time required to respond to items.

With the tryout complete, the project has a set of items that have been thoroughly reviewed by professional assessment development staff and by professionals in the content area, tested in the classroom, and given a final review and edit by professional development staff.