Background on ISAT Scoring Metrics
Beginning in 1999 for mathematics and reading and in 2000 for science, the Illinois State Board of Education (ISBE) transitioned the state student assessment from the IGAP to the ISAT. The ISAT is a criterion-referenced assessment. Currently, it is primarily multiple choice, but the mathematics portion includes two extended-response items. Beginning in 2005-06, the mathematics ISAT also included short-answer constructed-response items in addition to the multiple-choice and extended-response questions.
ISBE often reports ISAT results in four student performance categories: Academic Warning, Below Standards, Meets Standards, and Exceeds Standards. The reading, mathematics, science, and social studies sections are scored on a scale of 120-200 and the writing section on a scale of 6-32. Beginning in the spring of 2005, writing and social studies were no longer part of the ISAT. For reading, mathematics, and science, the cutoffs are given in the following table cite:
One might ask how these cut scores were determined. The answer is a procedure described in the parlance of test makers as a “modified Angoff procedure”. Of course, other than impressing your friends, this term is not very helpful in understanding the procedure. The process occurred separately for the mathematics, reading, and science tests. The Illinois State Board of Education (ISBE) convened committees of curriculum experts to develop concrete descriptions of the student knowledge and skill levels that define each performance category. Educators throughout Illinois reviewed these descriptions. Panels of recognized subject-matter experts then convened in Springfield to translate the verbal descriptions into cut scores on the ISAT tests. Panelists were drawn from a pool of educators who had specific knowledge of student performance at the grade levels assessed by the ISAT and experience assessing students at those grade levels. Panelists were selected to be broadly representative of the geographical and ethnic diversity of Illinois’ public school system. A total of 170 educators participated in the standard-setting process. The Angoff procedure cite can best be summarized as a focused, judgmental process carried out by knowledgeable content experts. Panelists were asked to indicate what percentage of three groups of students (those just above the Academic Warning/Below Standards boundary, those just above the Below Standards/Meets Standards boundary, and those just above the Meets Standards/Exceeds Standards boundary) would answer a particular question correctly. The ratings were made sequentially rather than simultaneously; that is, panelists made all judgments relative to one cut score before moving to the next. Item performance statistics were provided to help panelists anchor their ratings.
Using several statistical procedures, these percentage distributions for each question were then aggregated across the entire test and converted to the 120-200 scale.
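The basic arithmetic behind this aggregation can be sketched as follows. This is an illustrative simplification only: the ratings are hypothetical, and the raw-to-scale conversion is shown here as a simple linear mapping, whereas the actual ISAT conversion relies on more elaborate statistical (IRT-based) procedures.

```python
def angoff_raw_cut(ratings_by_item):
    """Each item's rating is the panel-average probability (0-1) that a
    borderline student answers it correctly; summing these expected
    probabilities gives the expected raw cut score for that boundary."""
    return sum(ratings_by_item)

def raw_to_scale(raw, n_items, lo=120, hi=200):
    """Hypothetical linear mapping of a raw score onto the 120-200 scale;
    the real conversion is not a straight line."""
    return lo + (hi - lo) * raw / n_items

# Hypothetical panel-average ratings for a 10-item test at one cut point.
ratings = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3]
raw_cut = angoff_raw_cut(ratings)                  # expected raw score: 5.6
print(round(raw_to_scale(raw_cut, len(ratings))))  # scale score of 165
```

The same computation would be repeated at each of the three boundaries to produce the three cut scores for a test.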
It is important to note that while some recent studies have questioned the rigor of these cutoffs for each performance category, this is inconsequential for most of the analyses reported here. Most of the analyses described involve comparisons between treatment and comparison group(s), all of whom take the same assessment, whether or not one considers the assessment or the cut scores to be "hard", "easy", "rigorous", or "well-aligned with college expectations".
Each year the tests are equated using a series of anchor items (items that are repeated on all administrations of the ISAT for that grade level), such that the overall test difficulty and the scale scores (120-200) remain consistent. Note that this does not mean that the same number of questions must be answered correctly each year to receive the same score. In years when the overall test is “harder”, fewer correct answers are needed to earn a given scale score (say, 150), and thus a given performance level (for example, “meets standards”), than in years when the overall test is “easier”. This makes longitudinal comparisons possible.
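The effect of equating on raw scores can be illustrated with a toy example. The conversion tables below are invented for illustration; real ISAT raw-to-scale conversions are produced by statistical equating on the anchor items.

```python
# Hypothetical raw-score-to-scale-score tables for two forms of a test:
# an "easier" form and a "harder" form that have been equated.
easier_form = {28: 148, 29: 150, 30: 152}
harder_form = {26: 148, 27: 150, 28: 152}

MEETS_CUT = 150  # hypothetical "Meets Standards" scale cut

def min_raw_to_meet(table, cut=MEETS_CUT):
    """Smallest raw score whose scale score reaches the cut."""
    return min(raw for raw, scale in table.items() if scale >= cut)

print(min_raw_to_meet(easier_form))  # 29 correct needed on the easier form
print(min_raw_to_meet(harder_form))  # only 27 needed on the harder form
```

Both students end up at the same scale score (150) and the same performance level, even though they answered a different number of questions correctly.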
Many reports show results by grouping items into subsets (usually 8 standard sets in math cite and 5 in science cite, roughly aligned to the Illinois State Standards). One must be careful in analyses of these subsets. The more one divides the test into portions, by definition, the fewer items there will be in each category, and the fewer items there are in a category, the less reliable that measure is. This is a simple consequence of the fact that the fewer items one has, the lower the precision of the measurement, the larger the measurement error, and thus the lower the reliability of the measure. Any measurement one makes, whether it is the length of a table, the weight of a bottle, or a student’s “ability”, has some degree of precision associated with it. For example, if you weigh a bottle using a normal bathroom scale you would find its weight to the nearest pound, but the balance a pharmacist uses can measure roughly to the nearest thousandth (0.001) of a pound. The second is more precise. Besides using devices that can measure a quantity more precisely, another way to increase precision is to take multiple measurements of the same thing. This is why, in scientific experiments, it is good professional practice to run multiple trials and repeat measurements. Measurements with higher precision are more reproducible and vary over a smaller range under repeated testing. Test makers quantify this concept using a metric called the “standard error of measurement” (the standard deviation of the errors around a student’s true score). The standard error of measurement associated with an individual’s test score tells how precise that reported score actually is. When dealing with subsets one must be very careful to understand what the standard error is; it is often quite large due to the low number of items.
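A rough sketch of why fewer items means a less precise score: treating each item as an independent pass/fail trial with a fixed true probability p (a simplification of real test models, used here only for illustration), the standard error of the percent-correct score shrinks like one over the square root of the number of items.

```python
import math

def percent_correct_se(p, n_items):
    """Standard error, in percentage points, of a percent-correct score
    on n_items independent items each answered correctly with probability p."""
    return 100 * math.sqrt(p * (1 - p) / n_items)

p = 0.6  # hypothetical true proportion of items a student would answer correctly
for n in (7, 30, 70):
    print(n, round(percent_correct_se(p, n), 1))
# A 7-item subset has an SE of about 18.5 percentage points;
# a 70-item test drops that to about 5.9 points.
```

This is why a "percent correct" on a 7-item subset can swing wildly from one administration to the next even when the underlying ability has not changed.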
Subset scores are often reported as a “percent correct” of all the items in that subset. One should note that with a small number of items, not every percentage is possible (for example, if there are 7 items in a subset, only 0%, 14%, 29%, 43%, 57%, 71%, 86%, and 100% are possible values). Also, on the ISAT the subset scores are NOT equated across years, so one cannot directly compare the percentage correct on a subset from one year to another, since the item difficulties might change. It is possible to compare one set of schools or students to another within the same year on their subset performance, since both took the same version of the test. Because the difficulties of different subsets are not equated (though the difficulty of the overall test is), one must also be cautious in comparing different subsets, even within the same year. Finally, on the math ISAT individual items may be reported in more than one subset category (this is not true for the science assessment). For example, item #15 might be reported in both the Estimation/Number Sense/Computation subset and the Algebraic Relationships/Representations subset, even though the item is counted only once in the determination of the scale score.
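The set of attainable percent-correct values for a subset of a given size can be enumerated directly, which makes the coarseness of small subsets concrete:

```python
def possible_percentages(n_items):
    """All percent-correct values attainable on a subset of n_items items,
    rounded to whole percentages."""
    return [round(100 * k / n_items) for k in range(n_items + 1)]

print(possible_percentages(7))
# [0, 14, 29, 43, 57, 71, 86, 100]
```

With only 7 items, adjacent attainable values are about 14 percentage points apart, so an apparent "gain" of 10 points on such a subset may reflect a single additional correct answer.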
Currently, content for the ISAT is determined by panels of educators as outlined in the Illinois State Standards. Beginning in 2005-06, the content at each grade level is determined by the Illinois Assessment Frameworks. Because content is drawn from a written set of Illinois-specific material, rather than an amalgam of content gathered from across different states, the ISAT is expected to be more closely aligned to the Illinois State Standards than national assessments are. Research studies have supported this hypothesis cite.
The mathematics subsets are:
Subset 1: Estimation/Number Sense/Computation;
Subset 2: Algebraic Patterns/Variables;
Subset 3: Algebraic Relationships/Representations;
Subset 4: Geometric Concepts/Points & Lines;
Subset 5: Geometric Relationships/Sort & Compare;
Subset 6: Measurement & Estimation;
Subset 7: Data Organization & Analysis; and
Subset 8: Probability.
The science subsets are:
Subset 2: Life Sciences;
Subset 3: Physical Sciences;
Subset 4: Earth/Space Sciences; and
Subset 5: Science, Technology, and Society.