Friday, August 27, 2010

Critique of value-added: ONE NUMBER CAN’T ILLUSTRATE TEACHER EFFECTIVENESS + THE REPORTERS RESPOND

One number can't illustrate teacher effectiveness: The Times' analysis illuminates cross-school differences in L.A. Unified, then ignores classroom factors beyond a teacher's control.

By Bruce Fuller and Xiaoxia Newton | LA Times Blowback Op-Ed

August 25, 2010 - Imagine opening the morning paper over coffee and spotting your name on a list of fellow nurses or lawyers, musicians or bus drivers. Beside each name rests a stark, lonely number said to gauge the extent to which you advance the growth of your clients or customers.

For the record: A previous version of this article referred to "the RAND statistical procedure" for evaluating teachers on a value-added basis. The analysis was not a Rand project, but was done by a Rand researcher on a private basis for The Times. There was no Rand overview of this work.

Orwellian, perhaps. But 6,000 Los Angeles teachers will soon find their names on such a list.

The Times has already published a few "value added" scores for illustrative teachers, detailing the eye-popping variability in learning curves of third- to fifth-graders spread across the Los Angeles Unified School District. The Times claims these scores can validly peg the discrete effect of each teacher on their students' growth. These journalists draw on a complicated statistical model built by a single Rand Corp. analyst, Richard Buddin, which has yet to be openly reviewed by scholarly peers.

Meteorologists can accurately estimate the average weather pattern for Sept. 1 over the past century, but their predictions for any specific Sept. 1 are much less reliable. Yet wise editors at The Times apparently believe they can magically set aside confounding factors and pinpoint the discrete effects of individual teachers on students' learning.

The Times published a simple graph on its Aug. 15 front page as a way to publicly "out" a teacher whom its value-added study deemed ineffective. The graph showed declining raw test scores for the teacher's students over two years. But this fails to take into account differences in student background, including English proficiency, parents' education levels or home practices that affect children's learning. Hospitals wouldn't fire a doctor or nurse who focused on caring for the elderly or poor because his patients die at higher rates.

This is why The Times rightfully asked a qualified researcher at Rand Corp., the Santa Monica-based think tank, to devise a sophisticated statistical model in an attempt to isolate the discrete effect of pedagogical skills on student growth. But as the National Academy of Sciences pointed out last year, successfully doing so requires exhaustive data on each teacher and the contexts in which instruction occurs.

We know that student learning curves are flattened by lousy teachers. The Times' analysis usefully illuminates the wide variation in the test scores of students across classrooms and schools. What's risky is moving from a complicated statistical model to estimating the discrete effect of individual teachers, precisely the leap of faith being made by The Times.

Buddin's statistical procedure, while competently carried out in general, fails to take into account classroom and school contexts that condition the potency of individual teachers. For example, if a teacher is assigned low-track students — those with weaker reading proficiency in English or lower math skills — negative peer effects will drag down student growth over the year, independent of the teacher's pedagogical skills.

Or if parents self-select into higher-quality schools, as detailed in one Times story, the presence of students with highly dedicated parents will have a positive impact on student growth, again independent of the individual efforts of a teacher. By setting aside contextual effects, The Times overestimates a teacher's effects — positive or negative — on student growth.

Furthermore, many students are not taught by a single teacher. Some have special reading instruction or oral language development. What if these activities are strikingly effective or make no difference at all? Under The Times' model, such effects are attributed to the student's main teacher.

The Times' study fails to recognize that test scores across grade levels cannot be compared, given the limitations of California's standardized tests. For example, third-grade scores across L.A. Unified were largely flat during the period that students were tracked, while fourth- and fifth-grade scores were climbing overall. Even ranking student scores across grades may be driven by differences in test items, not a student's skill level. So, when The Times tries to control for family background with a third-grade test score, it does so inadequately, again overestimating the discrete effect of the teacher.

Given analytic weaknesses, the ethical question that arises is whether The Times is on sufficiently firm empirical ground to publish a single number, purporting to gauge the sum total of a teacher's effect on children.

Based on a generation of research, we know bad teachers drag down student learning. Teachers unions continue to protect their poorly performing members in many cases. But this situation calls for careful science and mindful behavior by reformers and civic leaders, including The Times. Imprudent efforts could discourage strong teachers from working with low-achieving students, now judged on simplistic value-added scores. Dumbing-down the public discourse does little to lift teacher quality.

Bruce Fuller and Xiaoxia Newton are professors of education at UC Berkeley. University of Washington professor Dan Goldhaber, UCLA professor Meredith Phillips and UC Berkeley professor Sophia Rabe-Hesketh contributed to this article.

Times reporters respond to Blowback critical of 'Grading the Teachers' series

Below is a response by Times reporters Jason Felch, Jason Song and Doug Smith to the Aug. 25 Blowback article by UC Berkeley education professors Bruce Fuller and Xiaoxia Newton (reproduced above), from Opinion LA in the LA Times Online.

Note: In their piece (above), Fuller and Newton say The Times' "value added" method of evaluating teacher effectiveness in the Los Angeles Unified School District ignores classroom factors beyond an instructor's control.

August 26, 2010 |  4:45 pm - We’re happy to have credible experts debate the reliability of our statistical approach, but we would hope in doing so they would exercise the same care in their critiques as we have in our publications. Bruce Fuller and Xiaoxia Newton make several points that we’d like to respond to.

The authors say Richard Buddin’s approach “has yet to be openly reviewed by scholarly peers.” In fact, we chose Buddin because he has published several peer-reviewed studies in major academic journals that use data from the Los Angeles Unified School District and the same "value added" approach he employed for us. Buddin’s methods paper is not a formal academic publication. Nevertheless, we asked leading experts to review that earlier work before choosing Buddin, and asked several other experts, including skeptics, to review his methods paper.

The authors say our approach and graphic fail “to take into account differences in student background, including English proficiency, parents' education levels or home practices that affect children's learning.” L.A. Unified did not provide student demographic information for this study, citing federal privacy laws. Demographic factors do have a large effect on student achievement, but these influences are largely included in the students' prior-year test scores. Prior research (including Buddin’s own using L.A. Unified data) has shown that demographic factors are much less important after controlling for a student's previous test scores. The technical report used results from Buddin’s previous Rand Corp. research to show that student demographics had small effects on teacher value added as calculated in this study. This earlier study (Buddin and Zamarro, 2009) ran through 2007 instead of 2009, but this pattern is likely to persist. The approach and results are discussed in the subsection of the technical report “How does classroom composition affect teacher value added” that begins on page 13. The key empirical results are in Table 7.

The authors cite the National Academy of Sciences report urging caution. We cited the same report in our story. There is a variety of other research available that suggests these estimates are reliable. See Kane and Staiger’s 2008 random assignment validation of value-added approaches, which found value-added estimates were “significant predictors of student achievement” and that controlling for students' prior test scores yielded “unbiased results.”

The authors claim our analysis “fails to recognize that test scores across grade levels cannot be compared, given the limitations of California's standardized tests.” California Standards Test scores have been used by many researchers in peer-reviewed value-added analysis for years. The district’s own researchers concluded the test was appropriate for such estimates, as the story mentioned. See this L.A. Unified report.

The authors claim, incorrectly, that The Times plans to “publish a single number, purporting to gauge the sum total of a teacher's effect on children.” In our stories and Q&As, we have repeatedly made clear that value added should not be the sole measure of a teacher’s overall performance. And our database does not present a single number for anyone.

Finally, the authors point to various limitations of the value-added approach. These are well known and have repeatedly been disclosed by The Times. What the authors fail to note is that leading experts say value added is a far more thoroughly vetted, peer-reviewed and provably reliable tool than any other teacher evaluation tool currently available. Rather than comparing value added to a Platonic ideal of a perfect teacher evaluation, it should be weighed against classroom observations by principals, student surveys and the other “multiple measures” that are being considered. Under this analysis, most scholars agree that, warts and all, value added shines.
