Bonnie Sutton

Jamming the System: Standardized Tests, Automated Grading and the Future of Writing


started by Bonnie Sutton on 29 Apr 12

    Spotlight on Digital Media and Learning

    Posted: 26 Apr 2012 12:51 PM PDT

Standardized testing, and now standardized grading, are the bane of teachers and thoughtful administrators. So how can we harness the positive side of digital media and learning?


    Filed by Christine Cupaiuolo

The digital age has brought profound change to education. On the negative side (at least according to the great majority of teachers) it has been a catalyst for the increasingly standardized and impersonal approach of large educational systems: making it easier to conduct and evaluate a standardized test across a state, region or country, and allowing state and federal bureaucracies endless ways to crunch the numbers on those tests for a variety of budgetary, political, and, yes, even educational ends.

The testing regimen that has ensued is the greatest threat to the positive potential of all these new digital tools. The power of digital tools and new media literacies, as we have documented in numerous ways, is the ability to inspire the unique creative and critical thinking of individuals. In short, it gives students a voice in and control over their own education.

    For some, the choice is clear: students (and teachers) can either become robots-or they can build them.

    Testing Teachers' Patience With Testing
    Recent studies on attitudes toward testing and advances in automated scoring present a clearer picture of the competing goals and interests. Let's start with the less surprising of the lot: "Primary Sources: 2012," a study released by Scholastic and the Bill & Melinda Gates Foundation, found an overwhelming number of teachers agree students should be measured by formative assessments, class participation and performance on class assignments. Less value was put on required, high-stakes testing.

The study includes responses from 10,000 teachers about their schools and student and teacher performance, including ways performance should be evaluated, supported and rewarded. On the subject of high-stakes testing, which is more often supported by legislators than educators, the results are pretty much what we would expect.

Only 6 percent of teachers surveyed agree district-required tests are "absolutely essential," with 24 percent finding them "very important." State-required standardized tests received slightly less support overall; 7 percent said they were "absolutely essential" and 21 percent deemed them "very important." Tests from textbooks received the lowest marks: only 4 percent of teachers thought they are "absolutely essential," and 22 percent said they are "very important."

    "It's time for less focus on standardized tests and more on the development of creative and critical thought. The amount of time spent preparing for testing is disgraceful," said a middle school teacher.

    High school teachers were even more likely to find less value in district- and state-required testing, though they were more enthusiastic than elementary and middle school teachers about final exams.

    A Tale of Three Test States
In conversation with several teachers in the Chicago area, most said that Illinois has largely resisted raising the stakes for its standardized tests, at least compared to the oppressive level in places like New York or Texas. New York, you may recall, recently delighted testing critics when Daniel Pinkwater's nonsensical story "The Hare and the Pineapple" was twisted into a nonsensical question that caused students and teachers to ask, "Huh?" (Not to worry: the questions will not count in students' official scores.)

    As for Texas, Valerie Strauss, author of The Answer Sheet blog in the Washington Post, has a good rundown of how test-centric Texas has become and how some schools are pushing back with resolutions asking the state Legislature to develop alternative systems (the movement has also spread to New York). The sample anti-testing resolution opens with:

    WHEREAS, the over reliance on standardized, high stakes testing as the only assessment of learning that really matters in the state and federal accountability systems is strangling our public schools and undermining any chance that educators have to transform a traditional system of schooling into a broad range of learning experiences that better prepares our students to live successfully and be competitive on a global stage;

    It goes on to note the importance of recognizing interest-driven learning and digital literacies, the 21st-century skills touted at the highest levels of education:

    WHEREAS, Our vision is for all students to be engaged in more meaningful learning activities that cultivate their unique individual talents, to provide for student choice in work that is designed to respect how they learn best, and to embrace the concept that students can be both consumers and creators of knowledge; and

    WHEREAS, only by developing new capacities and conditions in districts and schools, and the communities in which they are embedded, will we ensure that all learning spaces foster and celebrate innovation, creativity, problem solving, collaboration, communication and critical thinking; and

    WHEREAS, these are the very skills that business leaders desire in a rising workforce and the very attitudes that are essential to the survival of our democracy;

Back in Illinois, high school students take the two-day Prairie State Achievement Exam (PSAE) in April of their junior year, and the results of this test leave a permanent mark on their transcript and determine a school's status under the No Child Left Behind law. A few years ago, the state, in an increasingly common collaboration, decided to make the first day of PSAE testing a complete ACT test, instead of a homegrown concoction that had only been meaningful for those local educators familiar with its idiosyncratic content and evaluation criteria.

    This image by Brian Metcalfe illustrates how K-12 students might be encouraged to create Images with a Message.

The ACT, on the other hand, is nationally recognizable and accepted as part of most college applications. And while educators decried it as they would any standardized test, many Illinois teachers appreciated that at least a good portion of the testing their students had to take would also be genuinely useful: a high-stakes experience, for sure, but one they would most probably undergo on their own (for a fee) as part of the college admissions process.

Even these limited good intentions, however, have been made somewhat of a mockery due to budgetary cutbacks. Until this year (stick with me here) the ACT administered as part of the PSAE was a complete ACT, which includes the traditional four multiple-choice tests as well as an added writing component that more and more colleges require students to have taken. But in 2011, state officials decided they could no longer afford to offer the writing component. Without it, the test became useless for most college applications (although state officials were able to get Illinois state colleges to let go of their requirement for the writing portion of the ACT, at least temporarily).

    Of course, that writing portion of the ACT might very soon be graded by a computer, not a human being-so maybe students are lucky to have escaped yet another impersonal evaluation.

    I, Robot Grader
    The idea is likely to gain more support, now that a new study by researchers at the University of Akron determined that automated essay scoring software "achieved virtually identical levels of accuracy, with the software in some cases proving to be more reliable."

    The study was based on a review of more than 16,000 essays from six states. Each set of essays varied in length, type and grading protocols, and all had been hand-scored according to state standards. The challenge for the nine companies (which control almost all of the automated essay scoring commercial market) was to see if their software could produce reliable and valid essay scores, when compared with trained human readers. This was not the first review of automated essay scoring software, but the researchers involved said it is the first comprehensive multi-vendor trial.
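Comparisons of this kind typically report human–machine agreement with a chance-corrected statistic; quadratic weighted kappa is a common choice in automated-essay-scoring research (it was the metric in the Hewlett Foundation's scoring contest mentioned below). A minimal sketch, assuming integer scores on a shared scale:

```python
from collections import Counter

def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Chance-corrected agreement between two raters; disagreements are
    penalized by the squared distance between scores. 1.0 means perfect
    agreement, 0.0 means no better than chance."""
    n = max_score - min_score + 1
    total = len(human)
    observed = Counter(zip(human, machine))    # joint score-pair counts
    h_hist, m_hist = Counter(human), Counter(machine)
    num = den = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            w = (i - j) ** 2 / (n - 1) ** 2    # quadratic penalty
            num += w * observed[(i, j)] / total
            den += w * (h_hist[i] / total) * (m_hist[j] / total)
    return 1.0 - num / den

# Identical score vectors agree perfectly:
print(quadratic_weighted_kappa([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6], 1, 6))  # → 1.0
```

"Virtually identical levels of accuracy" in studies like this means the software's kappa against a human rater matched the kappa between two human raters.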

    "Better tests support better learning," said Barbara Chow, education program director at the Hewlett Foundation, which funded the study and is sponsoring a contest to encourage new automated scoring techniquies. "This demonstration of rapid and accurate automated essay scoring will encourage states to include more writing in their state assessments. And, the more we can use essays to assess what students have learned, the greater the likelihood they'll master important academic content, critical thinking, and effective communication."

Proponents say automated grading frees teachers from lugging home dozens of papers that take a full weekend to grade. Poetry and creative writing teachers, however, may not get a break.

    Automated grading "doesn't do well on very creative kinds of writing, such as haiku," said lead researcher Mark Shermis, dean of the College of Education at the University of Akron. "But this technology works well for about 95 percent of all the writing that's out there, and it will provide unlimited feedback to how you can improve what you have generated, 24 hours a day, seven days a week. If you are writing at 2 am, which many college students do, it's there to tend to your needs.."

    Skeptics such as Les Perelman, a director of writing at the Massachusetts Institute of Technology, point to shortcomings that make this technology less useful, regardless of the hour. Perelman has studied algorithms developed by Educational Testing Service, creator of the e-Rater, which can grade 16,000 essays in a mere 20 seconds. Michael Winerip interviewed Perelman for The New York Times and deserves special kudos for this breakdown:

    While his research is limited, because E.T.S. is the only organization that has permitted him to test its product, he says the automated reader can be easily gamed, is vulnerable to test prep, sets a very limited and rigid standard for what good writing is, and will pressure teachers to dumb down writing instruction.

    The e-Rater's biggest problem, he says, is that it can't identify truth. He tells students not to waste time worrying about whether their facts are accurate, since pretty much any fact will do as long as it is incorporated into a well-structured sentence. "E-Rater doesn't care if you say the War of 1812 started in 1945," he said.

    Mr. Perelman found that e-Rater prefers long essays. A 716-word essay he wrote that was padded with more than a dozen nonsensical sentences received a top score of 6; a well-argued, well-written essay of 567 words was scored a 5.

    An automated reader can count, he said, so it can set parameters for the number of words in a good sentence and the number of sentences in a good paragraph. "Once you understand e-Rater's biases," he said, "it's not hard to raise your test score."

    E-Rater, he said, does not like short sentences.

    Or short paragraphs.

    Or sentences that begin with "or." And sentences that start with "and." Nor sentence fragments.

    Continue reading Winerip's story. I love it, though maybe I'm just easily amused. (Hey, I'm human.)

    Blame the System, Not the Robots
For a thoughtful look at the underlying problem with automated systems, settle in and read Marc Bousquet's essay, "Robots Are Grading Your Papers!" Scary title aside, Bousquet suggests that the more troubling issue is why machines are delivering similar scores. The problem, it turns out, is not the success of the technology; rather, the technology's success is a symptom of the standardization of the process.

It seems possible that what really troubles us about the success of machine assessment of simple writing forms isn't the scoring, but the writing itself: forms of writing that don't exist anywhere in the world except school. It's reasonable to say that the forms of writing successfully scored by machines are already mechanized forms, writing designed to be mechanically produced by students, mechanically reviewed by parents and teachers, and then, once transmuted into grades and sorting of the workforce, quickly recycled. As Evan Watkins has long pointed out, the grades generated in relation to this writing stick around, but the writing itself is made to disappear.

    Bousquet blames "forces of standardization, bureaucratic control, and high-stakes assessment" for "steadily shrinking the zone in which free teaching and learning can take place." In doing so, he likens a teacher's job to other over-managed professions in which the real work occurs on an employer's own time, outside the system (ala "The Wire").

    John Jones, an assistant professor at West Virginia University who teaches writing and digital literacy, wrote a compelling post last year on why the five-paragraph essay needs to be replaced with writing that has meaning and purpose outside of the classroom.

    "[W]hile it is true that the goal of teaching writing has always been to prepare students for writing beyond the walls of the schoolhouse," writes Jones, "this is even more the case now that digital publishing has become so widely available in our society. In other words, as much as possible, the task of teaching writing is also teaching writing for public consumption, and teaching writing for public consumption in the network society means teaching writing and publishing as being inseparable."

Imagine the work of students encouraged to write for public, engaged spaces, including blogs, Twitter, Wikipedia, or writing projects that come to life with multimedia, and you begin to understand how the testing, and the robots, both fail.

