Reliability & Validity

The statistical adage that reliability is a necessary precursor to validity is as true for course evaluations as it is for other measurement applications. IASystem™ carries out custom analyses for each institution to confirm item reliability and maximize the validity of student ratings of instruction for decision making.

Reliability

Item reliability refers to the stability of ratings across students, courses, or instructors. The higher an item's reliability, the more confident we can be that average ratings reflect student opinions about a class rather than random error, and the smaller the difference between item ratings needed to reach statistical significance.
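
The link between reliability and the difference required for significance can be sketched with the classical standard error of measurement, SEM = SD·√(1 − r). This is a standard psychometric result, not a computation specified by IASystem™, and the function names and numbers below are illustrative:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement of a rating with a given reliability."""
    return sd * math.sqrt(1.0 - reliability)

def min_significant_difference(sd: float, reliability: float, z: float = 1.96) -> float:
    """Smallest difference between two independent ratings that reaches
    significance at the two-tailed 5% level (z = 1.96)."""
    return z * math.sqrt(2.0) * sem(sd, reliability)

# A more reliable item needs a smaller rating difference to be significant.
low_rel = min_significant_difference(sd=0.8, reliability=0.70)
high_rel = min_significant_difference(sd=0.8, reliability=0.90)
assert high_rel < low_rel
```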

Statistical reliability of course evaluation data is particularly important when results are used in decision making, and reliability estimates must be computed in a way that is consistent with the level at which the data are aggregated.

IASystem™ computes two different estimates of item reliability to support different types of decisions. Because reliability estimates depend on context, we compute them separately for each institution and translate the results into decision-making “rules of thumb.”

Pedagogical Decision Making

Instructors often make changes to their courses based on ratings of individual classes. For this purpose, IASystem™ computes inter-rater correlation coefficients, using the class as the unit of analysis. For each item, the reliability coefficient represents the level of agreement among students within a class relative to the mean difference in ratings across classes. Inter-rater correlation coefficients for IASystem™ instructional improvement items typically range from .71 to .82 for class sizes of ten students.
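
The agreement-within-classes-relative-to-differences-between-classes idea is commonly formalized as a one-way intraclass correlation. The sketch below uses the standard ICC(1) formula with equal class sizes; it is an illustration of the general technique, not IASystem™'s actual computation:

```python
from statistics import mean

def icc_oneway(classes: list[list[float]]) -> float:
    """One-way random-effects intraclass correlation, ICC(1), for equal-size
    groups: agreement among raters (students) within a class relative to
    differences between class means."""
    n = len(classes)     # number of classes
    k = len(classes[0])  # students rating each class
    grand = mean(x for c in classes for x in c)
    # Between-class mean square
    msb = k * sum((mean(c) - grand) ** 2 for c in classes) / (n - 1)
    # Pooled within-class mean square
    msw = sum((x - mean(c)) ** 2 for c in classes for x in c) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Perfect within-class agreement with between-class differences gives 1.0.
assert icc_oneway([[5.0, 5.0, 5.0], [3.0, 3.0, 3.0]]) == 1.0
```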

Programmatic Decision Making

High-stakes decisions, such as those relating to faculty merit pay increases, promotion, and tenure, should be based on data aggregated across multiple classes. For this purpose, IASystem™ computes inter-class correlation coefficients, with the instructor as the unit of analysis. For each item, the reliability coefficient represents the level of agreement among all classes taught by a particular instructor relative to the mean difference across instructors. Inter-class correlation coefficients for IASystem™ “global” items typically range from .69 to .75 for seven combined classes.
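
Why aggregation helps: the reliability of an average grows with the number of classes combined. The standard Spearman-Brown prophecy formula projects the reliability of a k-class average from a single-class reliability. The formula and the single-class value below are illustrative, not figures reported by IASystem™:

```python
def spearman_brown(single_rel: float, k: int) -> float:
    """Projected reliability of the mean of k parallel measurements
    (here, classes), given the reliability of a single measurement
    (Spearman-Brown prophecy formula)."""
    return k * single_rel / (1 + (k - 1) * single_rel)

# A modest hypothetical single-class reliability of .30 grows substantially
# when seven classes are averaged.
print(round(spearman_brown(0.30, 7), 2))  # prints 0.75
```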

Validity

The question “Do student ratings actually reflect instructional quality?” has been studied extensively with respect to several types of validity: construct, convergent, discriminant, and consequential. Some of the most useful references are listed below.[1] Ratings have been found to be influenced, to varying degrees, by factors such as expected grade in the course, class size, and reason for enrollment. To maximize the validity of ratings at each institution, IASystem™ identifies institution-specific correlates (possible biases) using regression analyses and provides both adjusted and unadjusted ratings in standard reports.
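
The adjustment idea can be sketched with a single covariate: regress ratings on the covariate by least squares, then report each rating with the fitted linear effect removed, re-centered on the original mean. This is a simplified one-predictor sketch of regression adjustment in general; IASystem™'s actual models and covariates are institution-specific and not described here:

```python
from statistics import mean

def adjusted_ratings(ratings: list[float], covariate: list[float]) -> list[float]:
    """Remove the linear effect of one covariate (e.g., expected grade)
    from ratings: fit ratings ~ covariate by ordinary least squares, then
    return residual + grand mean so adjusted values keep the original scale."""
    my, mx = mean(ratings), mean(covariate)
    sxx = sum((x - mx) ** 2 for x in covariate)
    sxy = sum((x - mx) * (y - my) for x, y in zip(covariate, ratings))
    slope = sxy / sxx
    # y - slope*(x - mx) equals the regression residual plus the grand mean.
    return [y - slope * (x - mx) for x, y in zip(covariate, ratings)]
```

Because only the fitted covariate effect is subtracted, the mean of the adjusted ratings equals the mean of the raw ratings; adjustment reshuffles classes relative to one another rather than inflating or deflating everyone.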


[1]     Abrami, P. C., d’Apollonia, S., & Cohen, P. A. (1990). Validity of student ratings of instruction: What we know and what we do not. Journal of Educational Psychology, 82, 219-231.

Aleamoni, L. M. (1999). Student rating myths versus research facts from 1924 to 1998. Journal of Personnel Evaluation in Education, 13(2), 153-166.

Bonitz, V. S. (2011). Student Evaluation of Teaching: Individual Differences and Bias Effects. (Doctoral dissertation). Available from ProQuest Dissertations & Theses database. (UMI No. 3472997)

Kulik, J. A. (2001). Student ratings: Validity, utility, and controversy. In M. Theall, P. C. Abrami, & L. A. Mets (Eds.), The Student Ratings Debate: Are They Valid? How Can We Best Use Them? (pp. 9-26). New Directions for Institutional Research, 109. San Francisco, CA: Jossey-Bass.

Marsh, H. W. (1984). Students’ evaluation of university teaching: Dimensionality, reliability, validity, potential biases, and utility. Journal of Educational Psychology, 76(5), 707-754.