
Evidence Base

Response to Effectiveness Criteria


STAD-Math clearly meets the i3 standards for strong evidence of effectiveness in terms of raising math achievement and closing achievement gaps for low-achieving, minority, and economically-disadvantaged students. STAD-Math also has effects on important cognitive and affective constructs that contribute to learning and motivation, such as classroom behavior, self-confidence, intrinsic motivation, self-regulation, and deep cognitive processing (e.g., Slavin, Madden, & Leavey, 1984b; Barbato, 2000; Slavin and Karweit, 1985; Suyanto, 1998). Slavin, Lake, and Groff (2009) conducted a "best-evidence synthesis" of the effects of middle and high school math curricula and instructional programs, including results from studies that met the following criteria: schools or classrooms using each program had to be compared to randomly assigned or well-matched control groups; the study duration had to be at least 12 weeks; outcome measures had to be assessments of the mathematics being taught in all classes. The review placed particular emphasis on studies in which classrooms or students were assigned at random to experimental or control groups. These standards are very similar to those of the What Works Clearinghouse (WWC), but the WWC has not yet reviewed professional development programs in secondary math.

Nunnery and Chappell (2011) identified a total of 14 evaluations of STAD-Math conducted in elementary or secondary settings. Of these, four were conducted in secondary school settings, used standardized tests as outcome measures (and so would meet both best-evidence and WWC criteria for outcome measures), and otherwise met both best-evidence and WWC criteria for strong internal validity, randomization, scope, and duration. Across these four studies, three of which used random assignment to conditions, the weighted mean effect size for STAD was +0.42. The studies that met both WWC and best-evidence synthesis criteria for strong evidence are summarized below.

Slavin & Karweit (1984) carried out a large, year-long randomized evaluation of STAD in Math 9 middle and high school classes in Philadelphia. These were classes for students not felt to be ready for Algebra I, and were therefore the lowest-achieving students. Overall, 76% of students were African American, 19% were White, and 6% were Hispanic. Forty-four classes in 26 junior and senior high schools were randomly assigned within schools to one of four conditions: STAD, STAD plus Mastery Learning, Mastery Learning, or control. All classes, including the control group, used the same books, materials, and schedule of instruction, but the control group did not use teams or mastery learning. Shortened versions of the CTBS in mathematics served as a pre- and posttest. The tests were shortened by removing every third item, to make it possible to give them within one class period. The four groups were very similar at pretest. On 2 x 2 nested analyses of covariance (similar to HLM random effects analyses), there was a significant effect of a "teams" factor (ES=+0.21, p<.03). The effect size comparing STAD + Mastery Learning to control was ES=+0.24, and that for STAD without Mastery Learning was ES=+0.18. There was no significant Mastery Learning main effect or teams by mastery interaction either in the random effects analysis or in a student-level fixed effects analysis. Effects were similar for students with high, average, and low pretest scores.

Nichols (1996) evaluated STAD in a randomized experiment in high school geometry classes. Students were randomly assigned to experience STAD for the first 9 weeks of the 18-week experiment, for the second 9 weeks, or neither (control). The control group used a lecture approach for the entire 18-week period. At the end of 18 weeks, both STAD groups scored higher than controls on a measure of the content studied in all classes, controlling for ITBS scores (ES=+0.20, p<.05).

In a randomized quasi-experiment, Barbato (2000) evaluated a STAD-based math program in tenth grade classes taking the New York State integrated mathematics course, Sequential Math Course II. The same two teachers taught eight sections. Four sections were randomly assigned to experience cooperative learning and four continued in traditional methods. All classes used the same textbooks and content, and differed only in teaching method. On the New York Integrated Math Test for Course II, controlling for Course I scores, students taught using cooperative learning scored substantially higher (ES=+1.09, p<.001). Female students gained more than males from cooperative learning, but the gender by treatment interaction was not statistically significant.

Reid (1992) evaluated a STAD middle school math program, in which there was competition among heterogeneous learning teams, in an entirely African-American school in inner-city Chicago. Seventh graders who participated in cooperative learning were compared to matched control students. On posttests adjusted for pretests, the cooperative learning groups scored significantly higher on the ITBS (ES=+0.38, p<.05).

To generate an estimate of the overall effectiveness of the STAD-Math program in terms of improving student achievement and to determine the extent to which level of schooling (elementary versus secondary) was related to variation in STAD-Math effects, Nunnery and Chappell (2011) conducted a meta-analysis of effect sizes from 14 randomized experiments and quasi-experiments of STAD-Math conducted in diverse settings (see Table 1). The meta-analysis included studies conducted at all grade levels, only included studies with strong internal validity (randomization of students to treatment, randomization of classrooms to treatment, or carefully-matched comparison group designs with statistical adjustments for pretest differences), and included studies whose outcome measures met best-evidence synthesis and WWC standards.

Table 1. STAD-Math Evidence of Effectiveness.

Study                            Design                        n       Level                  d¹
Barbato (2000)                   Randomized quasi-experiment   208     secondary              +1.09
Conring (2009)                   Randomized quasi-experiment   44      elementary             +0.47
Glassman (1989)                  Randomized quasi-experiment   441     elementary             +0.01
Mevarech (1985)                  Randomized experiment         67      elementary             +0.19
Mevarech (1991)                  Randomized experiment         54      elementary             +0.60
Nichols (1996)                   Randomized experiment         80      secondary
Reid (1992)                      Quasi-experiment              50      secondary              +0.38
Slavin et al. (1984a)            Randomized quasi-experiment   1,367   elementary             +0.14
Slavin et al. (1984b)            Randomized experiment         504     elementary             +0.20
Slavin & Karweit (1984)          Randomized experiment         558     secondary              +0.75
Slavin & Karweit (1985)          Randomized experiment         382     elementary             +0.28
Slavin & Karweit (1985)          Randomized experiment         212     elementary/secondary   +0.38
Suyanto (1998)                   Quasi-experiment              664     elementary             +0.40
Whicker, Bol, & Nunnery (1997)   Randomized quasi-experiment   31      secondary              +0.81

¹ Cohen's d effect size estimate, computed as the covariate-adjusted difference in posttest means divided by the pooled within-groups posttest standard deviation.
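
The effect size definition in the footnote can be written out as a short calculation. The sketch below is illustrative only: the function names and the sample numbers are hypothetical, and it assumes the usual degrees-of-freedom-weighted pooling formula, which the source does not spell out.

```python
import math


def pooled_sd(sd_t, n_t, sd_c, n_c):
    """Pooled within-groups standard deviation.

    Assumption: the standard df-weighted pooling formula; the source
    does not state which pooling formula was used.
    """
    numerator = (n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2
    return math.sqrt(numerator / (n_t + n_c - 2))


def cohens_d(adj_mean_t, adj_mean_c, sd_t, n_t, sd_c, n_c):
    """Covariate-adjusted posttest mean difference divided by the pooled SD."""
    return (adj_mean_t - adj_mean_c) / pooled_sd(sd_t, n_t, sd_c, n_c)


# Hypothetical values, not taken from any study in Table 1.
d = cohens_d(adj_mean_t=54.0, adj_mean_c=50.0, sd_t=10.0, n_t=40, sd_c=10.0, n_c=40)
print(f"d = {d:+.2f}")
```

With these made-up inputs the pooled standard deviation is 10, so a 4-point adjusted difference corresponds to d = +0.40.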

We estimated overall effects across elementary and secondary school studies, and conducted additional analyses to determine variation in effects across levels of schooling and homogeneity of effects within levels of schooling (i.e., are effects different in secondary settings as opposed to elementary settings, and are effects consistent and statistically significant within levels of schooling). Procedures described by Hedges and Olkin (1985) were used for parametric estimation of effect sizes and for testing within- and between-class effects.

The overall weighted mean effect of the STAD-Math program was estimated at d+ = .16 with an overall variance of σ²(d+) = 0.001. The standard error of the mean was .03. To test for statistical significance of the overall effect size, 95 percent confidence intervals were calculated using the standard error of the mean and standard normal distribution values of +/- 1.96. The results indicated that the overall mean effect was statistically significant, with a confidence interval of δL = 0.10 to δU = 0.21.
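
The interval calculation described above can be sketched in a few lines. This is an illustration using the rounded figures reported in the text (mean effect .16, variance 0.001); the function name is hypothetical. Because the inputs are rounded, the result matches the reported interval only approximately: the computed upper bound is about 0.22 rather than 0.21, presumably because the published bounds were derived from unrounded values.

```python
import math


def normal_confidence_interval(mean_effect, variance, z=1.96):
    """Return (lower, upper, standard_error) for a normal-theory interval."""
    se = math.sqrt(variance)
    return mean_effect - z * se, mean_effect + z * se, se


# Rounded values as reported in the text.
lower, upper, se = normal_confidence_interval(mean_effect=0.16, variance=0.001)
print(f"SE = {se:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")

# The effect is statistically significant at the .05 level when the
# interval excludes zero.
print("significant:", lower > 0 or upper < 0)
```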

Within- and between-class effects were then analyzed to test for heterogeneity of effects within the secondary and elementary studies and for differences in effects between the two classes (i.e., are effects consistent within levels of schooling, and do effects differ as a function of level of schooling?). A statistically significant between-class value was observed, indicating that effect sizes were larger in secondary schools, with an effect size of dSec = .34 for secondary studies and dElem = .11 for elementary studies. Confidence intervals for both school types indicated statistically significant positive increases in math achievement attributable to the STAD-Math program at both levels, with confidence intervals of δL = 0.22 to δU = 0.46 for studies conducted in secondary schools and δL = 0.04 to δU = 0.17 for studies conducted in elementary schools. Within-class heterogeneity tests were not statistically significant, so no further subdivision and analysis of effects was indicated. The findings of this synthesis indicate that STAD-Math effects are consistent within levels of schooling, that they are positive and statistically significant at both elementary and secondary levels, and that STAD-Math has statistically significantly stronger effects in secondary schools (Cohen's d = +0.34) than in elementary schools (Cohen's d = +0.11).
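
A within- and between-class analysis of this kind is commonly carried out with inverse-variance weights and Q statistics. The sketch below shows that general fixed-effects approach on made-up numbers; it is not a reproduction of the authors' computation. The effect sizes and variances are hypothetical (Table 1 does not report per-study variances), and all function and variable names are illustrative.

```python
def weighted_mean(effects, variances):
    """Inverse-variance weighted mean effect and the sum of the weights."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total_weight
    return mean, total_weight


def q_statistic(effects, variances, center):
    """Weighted sum of squared deviations of the effects from `center`."""
    return sum((d - center) ** 2 / v for d, v in zip(effects, variances))


# Hypothetical (effect sizes, variances) per class; NOT the Table 1 data.
classes = {
    "secondary": ([0.35, 0.45], [0.02, 0.02]),
    "elementary": ([0.10, 0.12], [0.01, 0.01]),
}

all_effects = [d for ds, _ in classes.values() for d in ds]
all_variances = [v for _, vs in classes.values() for v in vs]
overall_mean, _ = weighted_mean(all_effects, all_variances)

q_within = 0.0
q_between = 0.0
for name, (ds, vs) in classes.items():
    class_mean, class_weight = weighted_mean(ds, vs)
    q_within += q_statistic(ds, vs, class_mean)
    q_between += class_weight * (class_mean - overall_mean) ** 2

# Chi-square critical values at alpha = .05:
# df = 1 (two classes) for Q_between, df = 2 (four studies, two classes) for Q_within.
CRIT_DF1, CRIT_DF2 = 3.841, 5.991
print(f"Q_between = {q_between:.2f}, differs by class: {q_between > CRIT_DF1}")
print(f"Q_within  = {q_within:.2f}, heterogeneous: {q_within > CRIT_DF2}")
```

In this toy example the between-class statistic exceeds its critical value while the within-class statistic does not, which is the same pattern the text reports for the STAD-Math studies.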

Thus, a highly exclusive set of studies that meets both WWC and best-evidence synthesis standards for strong evidence indicates an average effect of STAD-Math on secondary students' math achievement of d = +0.42. A more inclusive meta-analysis of studies with high internal validity yielded an average effect of STAD-Math on secondary students' math achievement of d = +0.34, and a confidence interval that includes the average effect observed in the more exclusive set (δL = 0.22 to δU = 0.46). Further, STAD-Math effects for secondary students appear to be highly consistent, as indicated by a lack of statistically significant within-class heterogeneity of effects.

Effects of this size for widely replicated models, especially in studies by third-party evaluators using standardized tests as outcomes, indicate a robust impact of practical and policy importance. To give a sense of perspective, the difference between African-American or Hispanic and White eighth grade mathematics scores on the National Assessment of Educational Progress is equal to an effect size of about 0.50. Based on the 95% confidence interval for secondary studies derived in the meta-analysis (δL = 0.22 to δU = 0.46), STAD-Math would be expected to close between 44% and 92% of that gap.
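
The gap-closure percentages follow from dividing each bound of the secondary-school confidence interval by the NAEP gap of about 0.50. A minimal check of that arithmetic, with hypothetical variable and function names:

```python
NAEP_GAP = 0.50                   # approximate gap, in effect size units, from the text
CI_LOWER, CI_UPPER = 0.22, 0.46   # secondary-school interval from the meta-analysis


def share_of_gap(effect_size, gap=NAEP_GAP):
    """Fraction of the achievement gap that an effect of this size would cover."""
    return effect_size / gap


print(f"{share_of_gap(CI_LOWER):.0%} to {share_of_gap(CI_UPPER):.0%} of the gap")
# The highly exclusive four-study estimate of +0.42 falls inside the interval.
print(CI_LOWER <= 0.42 <= CI_UPPER)
```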
