Annual Reporting

On an annual basis, total scaled score distribution reports and test statistics reports are produced for the tests of the MTTC program. These reports are described in the following sections, and annotated samples of the reports are provided. Links to the most recent set of annual reports can be found within the associated sections.


Total Scaled Score Distribution Reports

The following report about examinee scaled scores is produced annually for the MTTC program: Total Scaled Score Distribution by Test Field (All Forms).

Total Scaled Score Distribution by Test Field (All Forms)

The Total Scaled Score Distribution by Test Field (All Forms) provides information about the distribution of examinees' scaled scores above and below the minimum passing score. For the MTTC tests, results are reported on a scale ranging from 100 to 300, with a scaled score of 220 representing the minimum passing score for each test. This report is provided for test fields with 10 or more attempts during the program year. See MTTC Total Scaled Score Distribution by Test Field (All Forms) 2018–2019 PDF for the report generated for the October 2018–September 2019 program year.

Test Statistics Reports

The following two test statistics reports are generated annually for the MTTC program:

  • Test Statistics Report by Test Form
  • Test Statistics Report for Performance Assignments

These reports are designed to provide information about the statistical properties of MTTC tests, including the reliability/precision of the tests.

Standard 2.0: Appropriate evidence of reliability/precision should be provided for the interpretation for each intended score use.

Standard 11.14: Estimates of the consistency of test-based credentialing decisions should be provided in addition to other sources of reliability evidence.

Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014)

Statistical Measures Used

As indicated by the Standards above, it is important to provide evidence of the reliability/precision of the MTTC test scores for their intended use of making pass/fail classifications for the purpose of educator credentialing (i.e., to confer a certificate or endorsement). The Standards define "reliability/precision" as the "consistency of scores across instances of the testing procedure." Standard 11.14, specific to credentialing tests, indicates "the consistency of decisions on whether to certify is of primary importance."

A number of statistical measures are used to evaluate the reliability of MTTC test scores. Measures of reliability are reported for the total test, the multiple-choice section, and the constructed-response section. However, because pass/fail decisions are based on the total test score only, the total test reliability is the primary focus of interest; measures of reliability for portions of the test are presented as supplemental information. When considering reliability indices for a single test section (multiple-choice or constructed-response), it should be kept in mind that a section may show lower reliability statistics than the total test because it contains fewer test items than the total test.

The statistics of primary focus are those that describe the consistency of pass/fail decisions on the total test and the error of measurement associated with the total test, as follows:

  • Total test decision consistency. Total test decision consistency (Breyer and Lewis) is a reliability statistic that describes the consistency of the pass/fail decision. This statistic is reported in the range of 0.00 to 1.00; the closer the estimate is to 1.00, the more consistent (reliable) the decision is considered to be. The statistic is reported for test forms with 60 or more attempts during the program year. Test forms are considered to be identical if they contain identical sets of scorable multiple-choice items, regardless of the order of the items.
  • Total test Standard Error of Measurement (SEM). The Standard Error of Measurement (SEM) is a statistical measure that provides a "confidence band" around an examinee's score: if an examinee retook the test, the new score would likely fall within the originally reported score plus or minus the SEM. The smaller the SEM, the closer an examinee's score could be expected to be to the reported score upon repeated testing. This statistic is reported for each test form with at least 60 attempts.
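As an illustration of the classical-test-theory relationship behind the SEM, the sketch below computes an SEM from a score standard deviation and a reliability estimate (SEM = SD × √(1 − reliability)) and builds a one-SEM band around an observed score. The numbers are hypothetical, not actual MTTC figures, and this is the general textbook formula rather than the program's specific computation.

```python
import math

def standard_error_of_measurement(score_sd, reliability):
    """Classical-test-theory SEM: SD * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)

def confidence_band(observed_score, sem):
    """Band of plus or minus one SEM around an observed score."""
    return observed_score - sem, observed_score + sem

# Hypothetical values: a scaled-score SD of 25 and a total-test
# reliability of 0.90 give an SEM of about 7.9 scaled-score points.
sem = standard_error_of_measurement(score_sd=25.0, reliability=0.90)
low, high = confidence_band(observed_score=220.0, sem=sem)
```

Note that a higher reliability estimate shrinks the band: with these hypothetical inputs, a score at the 220 passing mark carries a band of roughly 212 to 228.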

Additional supplemental statistics for the total test, multiple-choice section, and constructed-response section of MTTC tests are provided in the MTTC test statistics reports, as follows:

  • Stratified alpha. Stratified alpha is an estimate of total test reliability for a test containing a mixture of item types (e.g., multiple-choice and constructed-response). This statistic is reported in the range of .00 to 1.00, with a higher number indicating a greater level of consistency (reliability). This statistic is reported for each test form with at least 60 attempts.
  • Multiple-choice section Standard Error of Measurement (SEM). This statistic is similar to the total test Standard Error of Measurement described above, but it is applied to an examinee's score for the multiple-choice section of a test. It is reported for each test form with at least 100 attempts. There are two versions: Keats' estimated SEM (Keats, 1957) and the computed SEM.
  • KR20. The KR20 (Kuder-Richardson index of homogeneity) is a measure of internal consistency of the multiple-choice test items. KR20 is reported in the range of .00 to 1.00, with a higher number indicating a greater level of internal consistency (that is, the degree of consistent performance on items intended to measure the same construct). This statistic is reported for each test form with at least 60 attempts.
  • G coefficient. The G (generalizability) coefficient indicates the degree to which the variability in scores for the constructed-response section is attributable to differences among examinees (e.g., in subject area knowledge) rather than to measurement error. It is reported in the range of .00 to 1.00, with a higher number indicating a greater level of dependability (i.e., accuracy of the generalization from an observed score to the universe score). This statistic is reported for test forms with at least 60 attempts.
  • Scorer agreement. For each test form with constructed-response items, information is reported on scorer agreement regarding the individual raw scores assigned to each examinee's response to a constructed-response item. The following information is reported: the percent of cases in which the first two scorers were in agreement (i.e., assigned identical scores or scores differing by only 1 point, called adjacent scores), the percent of identical scores, and the percent of adjacent scores.
  • Inter-rater reliability. For each test form with constructed-response items, inter-rater reliability reports the degree to which different raters assign the same score to the same response.
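For readers who want to see how two of the supplemental statistics above are typically computed, the sketch below implements the standard KR20 formula for dichotomously scored items and the identical/adjacent scorer-agreement percentages. This is a textbook-style illustration with made-up data, not the program's production code; the function names and example values are hypothetical.

```python
def kr20(item_responses):
    """Kuder-Richardson 20 for dichotomously scored (0/1) items.

    item_responses: one row per examinee, one 0/1 column per item.
    """
    n = len(item_responses)
    k = len(item_responses[0])
    # Proportion correct (p) per item; item variance is p * (1 - p).
    p = [sum(row[j] for row in item_responses) / n for j in range(k)]
    sum_pq = sum(pj * (1.0 - pj) for pj in p)
    totals = [sum(row) for row in item_responses]
    mean = sum(totals) / n
    var_total = sum((t - mean) ** 2 for t in totals) / n  # population variance
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)

def scorer_agreement(first, second):
    """Percent identical, percent adjacent (differing by exactly 1),
    and percent in agreement (identical or adjacent)."""
    n = len(first)
    identical = 100.0 * sum(a == b for a, b in zip(first, second)) / n
    adjacent = 100.0 * sum(abs(a - b) == 1 for a, b in zip(first, second)) / n
    return identical, adjacent, identical + adjacent

# Made-up data: five examinees on a four-item multiple-choice section ...
responses = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
reliability = kr20(responses)

# ... and the first two scorers' raw scores on one performance assignment.
identical, adjacent, agreement = scorer_agreement([2, 3, 1, 4, 2],
                                                  [2, 2, 1, 3, 4])
```

With only five examinees and four items, the KR20 value here is very low, which echoes the point made elsewhere in this section: reliability estimates from short tests and small examinee groups should not be expected to match those from longer tests with many attempts.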

Factors Affecting Reliability Measures

Reliability measures for MTTC tests may be influenced by many factors. The following is a list of typical factors:

  • Number of examinees. To be interpreted with confidence, statistical reliability estimates must be based on adequate numbers of examinee scores that represent a range of examinee knowledge and skill levels and that provide variance in examinee score distributions. Statistical reliability estimates based on few examinee scores may be very dependent on the characteristics of those examinees and their scores. For this reason, reliability estimates are calculated for MTTC tests that are taken by 60 or more examinees.
  • Variability of the group tested. In general, the larger the variance or true spread of the scores of the examinee group (i.e., the greater the individual differences in the level of knowledge and skills of the examinees), the greater will be the reliability coefficient. Reliability estimates tend to be higher if examinees in the group have widely varying levels of knowledge, and lower if they tend to have similar levels of knowledge. The range and distribution of examinee scores for each test field can be seen in the report Total Scaled Score Distribution by Test Field (All Forms), described previously.
  • Self-selection of examinees by test administration date. MTTC tests are administered throughout the year, and examinees can select when to take and retake the tests. The composition, ability level, and variability of the examinee group may vary from one test form to another as a result of the time of year that different test forms are administered.
  • Number of test items. Longer tests generally have higher reliability estimates. Some MTTC tests consist of two or more subtests that examinees must pass separately and for which they retake only the failed components. Because the pass/fail decisions are based on a decreased number of test items when compared to a total test model, KR20 and other reliability evidence cannot be expected to reach the levels found in single-component tests of greater length.
  • Test content. Reliability estimates are typically higher for tests that cover narrow, homogeneous content than for tests that cover a broad range of content. MTTC tests typically test a broad base of knowledge and skills that pertain to educator licenses that will apply in a wide range of educational settings, grade levels, and teaching assignments.
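The effect of test length noted above is commonly illustrated with the Spearman-Brown prophecy formula, which predicts how a reliability coefficient changes when a test is lengthened or shortened. The sketch below is a general classical-test-theory illustration with hypothetical values, not a computation taken from the MTTC reports.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

# A hypothetical test with reliability 0.90:
halved = spearman_brown(0.90, 0.5)   # reliability of a half-length subtest
doubled = spearman_brown(0.90, 2.0)  # reliability if length were doubled
```

Halving drops the coefficient to about 0.82 while doubling raises it only to about 0.95, which is why subtests that are passed separately cannot be expected to match the reliability of a single, longer test.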

Aids to Interpreting the MTTC Test Statistics

The following interpretive aids and cautions should be kept in mind while considering the MTTC test statistics reports:

  • The MTTC tests include multiple-choice items and constructed-response items (performance assignments). Procedures for estimating the psychometric characteristics of multiple-choice items and tests are well-established and documented in the literature; such procedures for performance assignments, and for tests that combine performance assignments and multiple-choice items, are less well-established and documented. Most MTTC tests presently consist of multiple-choice items only. Each of the MTTC World Language tests except Italian, as well as the Latin test, consists of a multiple-choice section and a performance assignment section. The Spanish, French, German, and Latin content-area tests each include two written performance assignments. The Chinese (Mandarin), Arabic (Modern Standard), Russian, and Japanese tests each contain eight performance assignments.
  • MTTC test scores are reported to candidates as scaled scores with a lower limit of 100, a passing score of 220, and an upper limit of 300. This is the scale used in reporting all MTTC scaled score statistics.
  • Some tests may not be taken by any candidates during a reporting period. Data for such fields are not available to be reported.
  • Statistical information on the MTTC tests should be interpreted with the understanding that the tests taken by examinees are a composite of multiple-choice items and constructed-response items (performance assignments), and this may affect psychometric characteristics of the test.
  • Information presented in these reports is based on tests taken during the program year indicated; it is possible that information based on additional test administrations might be different.
  • Information that is based on the test performance of relatively small numbers of examinees (i.e., fewer than 60 examinee test attempts) may not be indicative of the performance of larger numbers of examinees.

Test Statistics Report by Test Form

The Test Statistics Report by Test Form provides information regarding the statistical properties of MTTC test forms with at least 10 attempts during the program year. See MTTC Test Form Statistics Report 2018–2019 PDF for the report generated for the October 2018–September 2019 program year.

Test Statistics Report for Performance Assignments

The Test Statistics Report for Performance Assignments provides selected statistics for the constructed-response items for test fields with at least 100 attempts during the program year. See MTTC Test Statistics Report for Performance Assignments 2018–2019 PDF for the report generated for the October 2018–September 2019 program year.


