Instructional & Media Services at Dickinson College
Ed Webb

Google Researchers' Attack Prompts ChatGPT to Reveal Its Training Data

  • researchers showed that there are large amounts of personally identifiable information (PII) in OpenAI’s large language models. They also showed that, on a public version of ChatGPT, the chatbot spit out large passages of text scraped verbatim from other places on the internet
  • ChatGPT’s “alignment techniques do not eliminate memorization,” meaning that it sometimes spits out training data verbatim. This included PII, entire poems, “cryptographically-random identifiers” like Bitcoin addresses, passages from copyrighted scientific research papers, website addresses, and much more.
  • The researchers wrote that they spent $200 to create “over 10,000 unique examples” of training data, which they say is a total of “several megabytes” of training data. The researchers suggest that using this attack, with enough money, they could have extracted gigabytes of training data. The entirety of OpenAI’s training data is unknown, but GPT-3 was trained on anywhere from many hundreds of GB to a few dozen terabytes of text data.
  • the world’s most important and most valuable AI company has been built on the backs of the collective work of humanity, often without permission, and without compensation to those who created it
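
The cost claim in the annotations above can be sanity-checked with rough arithmetic. The sketch below is illustrative only: it assumes that extraction cost scales linearly with the volume recovered and that "several megabytes" means about 5 MB. Neither assumption comes from the researchers, and the function name is hypothetical.

```python
# Rough extrapolation of the extraction cost quoted in the annotation above.
# ASSUMPTIONS (not from the article): cost scales linearly with data volume,
# and "several megabytes" is taken to be 5 MB.

SPEND_USD = 200.0            # stated in the annotation
ASSUMED_MB_EXTRACTED = 5.0   # hypothetical stand-in for "several megabytes"


def estimated_cost_usd(target_mb: float,
                       spend_usd: float = SPEND_USD,
                       extracted_mb: float = ASSUMED_MB_EXTRACTED) -> float:
    """Linear estimate of the spend needed to extract target_mb of data."""
    if extracted_mb <= 0:
        raise ValueError("extracted_mb must be positive")
    return spend_usd / extracted_mb * target_mb


if __name__ == "__main__":
    one_gb_in_mb = 1000.0  # decimal gigabyte
    cost = estimated_cost_usd(one_gb_in_mb)
    print(f"about ${cost:,.0f} per GB under these assumptions")
```

Under these assumptions a gigabyte would cost tens of thousands of dollars, which is consistent with the researchers' suggestion that gigabytes were within reach "with enough money". A different reading of "several megabytes" changes the estimate proportionally.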
Ed Webb

I unintentionally created a biased AI algorithm 25 years ago - tech companies are still...

  • How and why do well-educated, well-intentioned scientists produce biased AI systems? Sociological theories of privilege provide one useful lens.
  • Scientists also face a nasty subconscious dilemma when incorporating diversity into machine learning models: Diverse, inclusive models perform worse than narrow models.
  • fairness can still be the victim of competitive pressures in academia and industry. The flawed Bard and Bing chatbots from Google and Microsoft are recent evidence of this grim reality. The commercial necessity of building market share led to the premature release of these systems.
  • Their training data is biased. They are designed by an unrepresentative group. They face the mathematical impossibility of treating all categories equally. They must somehow trade accuracy for fairness. And their biases are hiding behind millions of inscrutable numerical parameters.
  • biased AI systems can still be created unintentionally and easily. It’s also clear that the bias in these systems can be harmful, hard to detect and even harder to eliminate.
  • with North American computer science doctoral programs graduating only about 23% female, and 3% Black and Latino students, there will continue to be many rooms and many algorithms in which underrepresented groups are not represented at all.
Ed Webb

ChatGPT Is Nothing Like a Human, Says Linguist Emily Bender

  • Please do not conflate word form and meaning. Mind your own credulity.
  • We’ve learned to make “machines that can mindlessly generate text,” Bender told me when we met this winter. “But we haven’t learned how to stop imagining the mind behind it.”
  • A handful of companies control what PricewaterhouseCoopers called a “$15.7 trillion game changer of an industry.” Those companies employ or finance the work of a huge chunk of the academics who understand how to make LLMs. This leaves few people with the expertise and authority to say, “Wait, why are these companies blurring the distinction between what is human and what’s a language model? Is this what we want?”
  • “We call on the field to recognize that applications that aim to believably mimic humans bring risk of extreme harms,” she co-wrote in 2021. “Work on synthetic human behavior is a bright line in ethical AI development, where downstream effects need to be understood and modeled in order to block foreseeable harm to society and different social groups.”
  • chatbots that we easily confuse with humans are not just cute or unnerving. They sit on a bright line. Obscuring that line and blurring — bullshitting — what’s human and what’s not has the power to unravel society
  • She began learning from, then amplifying, Black women’s voices critiquing AI, including those of Joy Buolamwini (she founded the Algorithmic Justice League while at MIT) and Meredith Broussard (the author of Artificial Unintelligence: How Computers Misunderstand the World). She also started publicly challenging the term artificial intelligence, a sure way, as a middle-aged woman in a male field, to get yourself branded as a scold. The idea of intelligence has a white-supremacist history. And besides, “intelligent” according to what definition? The three-stratum definition? Howard Gardner’s theory of multiple intelligences? The Stanford-Binet Intelligence Scale? Bender remains particularly fond of an alternative name for AI proposed by a former member of the Italian Parliament: “Systematic Approaches to Learning Algorithms and Machine Inferences.” Then people would be out here asking, “Is this SALAMI intelligent? Can this SALAMI write a novel? Does this SALAMI deserve human rights?”
  • Tech-makers assuming their reality accurately represents the world create many different kinds of problems. The training data for ChatGPT is believed to include most or all of Wikipedia, pages linked from Reddit, a billion words grabbed off the internet. (It can’t include, say, e-book copies of everything in the Stanford library, as books are protected by copyright law.) The humans who wrote all those words online overrepresent white people. They overrepresent men. They overrepresent wealth. What’s more, we all know what’s out there on the internet: vast swamps of racism, sexism, homophobia, Islamophobia, neo-Nazism.
  • One fired Google employee told me succeeding in tech depends on “keeping your mouth shut to everything that’s disturbing.” Otherwise, you’re a problem. “Almost every senior woman in computer science has that rep. Now when I hear, ‘Oh, she’s a problem,’ I’m like, Oh, so you’re saying she’s a senior woman?”
  • “We haven’t learned to stop imagining the mind behind it.”
  • In March 2021, Bender published “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” with three co-authors. After the paper came out, two of the co-authors, both women, lost their jobs as co-leads of Google’s Ethical AI team.
  • “On the Dangers of Stochastic Parrots” is not a write-up of original research. It’s a synthesis of LLM critiques that Bender and others have made: of the biases encoded in the models; the near impossibility of studying what’s in the training data, given the fact they can contain billions of words; the costs to the climate; the problems with building technology that freezes language in time and thus locks in the problems of the past. Google initially approved the paper, a requirement for publications by staff. Then it rescinded approval and told the Google co-authors to take their names off it. Several did, but Google AI ethicist Timnit Gebru refused. Her colleague (and Bender’s former student) Margaret Mitchell changed her name on the paper to Shmargaret Shmitchell, a move intended, she said, to “index an event and a group of authors who got erased.” Gebru lost her job in December 2020, Mitchell in February 2021. Both women believe this was retaliation and brought their stories to the press. The stochastic-parrot paper went viral, at least by academic standards. The phrase stochastic parrot entered the tech lexicon.
  • Tech execs loved it. Programmers related to it. OpenAI CEO Sam Altman was in many ways the perfect audience: a self-identified hyperrationalist so acculturated to the tech bubble that he seemed to have lost perspective on the world beyond. “I think the nuclear mutually assured destruction rollout was bad for a bunch of reasons,” he said on AngelList Confidential in November. He’s also a believer in the so-called singularity, the tech fantasy that, at some point soon, the distinction between human and machine will collapse. “We are a few years in,” Altman wrote of the cyborg merge in 2017. “It’s probably going to happen sooner than most people think. Hardware is improving at an exponential rate … and the number of smart people working on AI is increasing exponentially as well. Double exponential functions get away from you fast.” On December 4, four days after ChatGPT was released, Altman tweeted, “i am a stochastic parrot, and so r u.”
  • “This is one of the moves that turn up ridiculously frequently. People saying, ‘Well, people are just stochastic parrots,’” she said. “People want to believe so badly that these language models are actually intelligent that they’re willing to take themselves as a point of reference and devalue that to match what the language model can do.”
  • The membrane between academia and industry is permeable almost everywhere; the membrane is practically nonexistent at Stanford, a school so entangled with tech that it can be hard to tell where the university ends and the businesses begin.
  • “No wonder that men who live day in and day out with machines to which they believe themselves to have become slaves begin to believe that men are machines.”
  • what’s tenure for, after all?
  • LLMs are tools made by specific people — people who stand to accumulate huge amounts of money and power, people enamored with the idea of the singularity. The project threatens to blow up what is human in a species sense. But it’s not about humility. It’s not about all of us. It’s not about becoming a humble creation among the world’s others. It’s about some of us — let’s be honest — becoming a superspecies. This is the darkness that awaits when we lose a firm boundary around the idea that humans, all of us, are equally worthy as is.
  • The AI dream is “governed by the perfectibility thesis, and that’s where we see a fascist form of the human.”
  • “Why are you trying to trick people into thinking that it really feels sad that you lost your phone?”
Ed Webb

'There is no standard': investigation finds AI algorithms objectify women's bodies | Ar...

  • AI tags photos of women in everyday situations as sexually suggestive. They also rate pictures of women as more “racy” or sexually suggestive than comparable pictures of men.
  • “You cannot have one single uncontested definition of raciness.”
  • “Objectification of women seems deeply embedded in the system.”
  • Shadowbanning has been documented for years, but the Guardian journalists may have found a missing link to understand the phenomenon: biased AI algorithms. Social media platforms seem to leverage these algorithms to rate images and limit the reach of content that they consider too racy. The problem seems to be that these AI algorithms have built-in gender bias, rating women more racy than images containing men.
  • “You are looking at decontextualized information where a bra is being seen as inherently racy rather than a thing that many women wear every day as a basic item of clothing,”
  • suppressed the reach of countless images featuring women’s bodies, and hurt female-led businesses – further amplifying societal disparities.
  • these algorithms were probably labeled by straight men, who may associate men working out with fitness, but may consider an image of a woman working out as racy. It’s also possible that these ratings seem gender biased in the US and in Europe because the labelers may have been from a place with a more conservative culture
  • “There’s no standard of quality here,”
  • “I will censor as artistically as possible any nipples. I find this so offensive to art, but also to women,” she said. “I almost feel like I’m part of perpetuating that ridiculous cycle that I don’t want to have any part of.”
  • many people, including chronically ill and disabled folks, rely on making money through social media and shadowbanning harms their business
Ed Webb

The Generative AI Race Has a Dirty Secret | WIRED

  • The race to build high-performance, AI-powered search engines is likely to require a dramatic rise in computing power, and with it a massive increase in the amount of energy that tech companies require and the amount of carbon they emit.
  • Every time we see a step change in online processing, we see significant increases in the power and cooling resources required by large processing centres
  • third-party analysis by researchers estimates that the training of GPT-3, which ChatGPT is partly based on, consumed 1,287 MWh, and led to emissions of more than 550 tons of carbon dioxide equivalent—the same amount as a single person taking 550 roundtrips between New York and San Francisco
  • There’s also a big difference between utilizing ChatGPT—which investment bank UBS estimates has 13 million users a day—as a standalone product, and integrating it into Bing, which handles half a billion searches every day.
  • Data centers already account for around one percent of the world’s greenhouse gas emissions, according to the International Energy Agency. That is expected to rise as demand for cloud computing increases, but the companies running search have promised to reduce their net contribution to global heating. “It’s definitely not as bad as transportation or the textile industry,” Gómez-Rodríguez says. “But [AI] can be a significant contributor to emissions.”
  • The environmental footprint and energy cost of integrating AI into search could be reduced by moving data centers onto cleaner energy sources, and by redesigning neural networks to become more efficient, reducing the so-called “inference time”—the amount of computing power required for an algorithm to work on new data.
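
The figures quoted in the annotations above lend themselves to a quick back-of-the-envelope check. The sketch below derives the carbon intensity implied by the training estimate and the rough difference in scale between ChatGPT usage and Bing search volume. The constant and function names are hypothetical, and the scale ratio compares daily users with daily searches, so it is only indicative.

```python
# Back-of-the-envelope checks on the figures quoted in the annotations above.
TRAINING_ENERGY_MWH = 1287.0    # quoted estimate for training GPT-3
TRAINING_EMISSIONS_T = 550.0    # quoted as "more than 550" tons CO2e (lower bound)
ROUNDTRIPS = 550                # New York-San Francisco roundtrips in the comparison
CHATGPT_DAILY_USERS = 13e6      # UBS estimate quoted above
BING_DAILY_SEARCHES = 500e6     # "half a billion searches every day"


def implied_carbon_intensity(emissions_t: float, energy_mwh: float) -> float:
    """Tons CO2e per MWh, which is numerically equal to kg CO2e per kWh."""
    return emissions_t / energy_mwh


def scale_ratio(searches_per_day: float, users_per_day: float) -> float:
    """How many times larger the search volume is than the user count."""
    return searches_per_day / users_per_day


if __name__ == "__main__":
    intensity = implied_carbon_intensity(TRAINING_EMISSIONS_T, TRAINING_ENERGY_MWH)
    per_trip = TRAINING_EMISSIONS_T / ROUNDTRIPS
    ratio = scale_ratio(BING_DAILY_SEARCHES, CHATGPT_DAILY_USERS)
    print(f"implied intensity: {intensity:.2f} kg CO2e per kWh (at least)")
    print(f"implied emissions per roundtrip: {per_trip:.1f} t CO2e")
    print(f"Bing searches per ChatGPT user-day: about {ratio:.0f}x")
```

Because the emissions figure is a lower bound, the implied intensity is a lower bound too. The ratio simply illustrates why integrating a chatbot into search is a different order of demand from running it as a standalone product.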
Ed Webb

ChatGPT Is a Blurry JPEG of the Web | The New Yorker

  • Think of ChatGPT as a blurry JPEG of all the text on the Web. It retains much of the information on the Web, in the same way that a JPEG retains much of the information of a higher-resolution image, but, if you’re looking for an exact sequence of bits, you won’t find it; all you will ever get is an approximation. But, because the approximation is presented in the form of grammatical text, which ChatGPT excels at creating, it’s usually acceptable. You’re still looking at a blurry JPEG, but the blurriness occurs in a way that doesn’t make the picture as a whole look less sharp.
  • a way to understand the “hallucinations,” or nonsensical answers to factual questions, to which large-language models such as ChatGPT are all too prone. These hallucinations are compression artifacts, but—like the incorrect labels generated by the Xerox photocopier—they are plausible enough that identifying them requires comparing them against the originals, which in this case means either the Web or our own knowledge of the world. When we think about them this way, such hallucinations are anything but surprising; if a compression algorithm is designed to reconstruct text after ninety-nine per cent of the original has been discarded, we should expect that significant portions of what it generates will be entirely fabricated.
  • ChatGPT is so good at this form of interpolation that people find it entertaining: they’ve discovered a “blur” tool for paragraphs instead of photos, and are having a blast playing with it.
  • large-language models like ChatGPT are often extolled as the cutting edge of artificial intelligence, it may sound dismissive—or at least deflating—to describe them as lossy text-compression algorithms. I do think that this perspective offers a useful corrective to the tendency to anthropomorphize large-language models
  • Even though large-language models often hallucinate, when they’re lucid they sound like they actually understand subjects like economic theory
  • The fact that ChatGPT rephrases material from the Web instead of quoting it word for word makes it seem like a student expressing ideas in her own words, rather than simply regurgitating what she’s read; it creates the illusion that ChatGPT understands the material. In human students, rote memorization isn’t an indicator of genuine learning, so ChatGPT’s inability to produce exact quotes from Web pages is precisely what makes us think that it has learned something. When we’re dealing with sequences of words, lossy compression looks smarter than lossless compression.
  • starting with a blurry copy of unoriginal work isn’t a good way to create original work
  • If and when we start seeing models producing output that’s as good as their input, then the analogy of lossy compression will no longer be applicable.
  • Even if it is possible to restrict large-language models from engaging in fabrication, should we use them to generate Web content? This would make sense only if our goal is to repackage information that’s already available on the Web. Some companies exist to do just that—we usually call them content mills. Perhaps the blurriness of large-language models will be useful to them, as a way of avoiding copyright infringement. Generally speaking, though, I’d say that anything that’s good for content mills is not good for people searching for information.
  • Having students write essays isn’t merely a way to test their grasp of the material; it gives them experience in articulating their thoughts. If students never have to write essays that we have all read before, they will never gain the skills needed to write something that we have never read.
  • Sometimes it’s only in the process of writing that you discover your original ideas. Some might say that the output of large-language models doesn’t look all that different from a human writer’s first draft, but, again, I think this is a superficial resemblance. Your first draft isn’t an unoriginal idea expressed clearly; it’s an original idea expressed poorly, and it is accompanied by your amorphous dissatisfaction, your awareness of the distance between what it says and what you want it to say. That’s what directs you during rewriting, and that’s one of the things lacking when you start with text generated by an A.I.
  • What use is there in having something that rephrases the Web?
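
The lossy-compression analogy in the annotations above can be made concrete with a toy example. The sketch below is not how a language model or JPEG works. It is a deliberately simple stand-in in which roughly 99 per cent of a numeric signal is discarded and the gaps are filled by linear interpolation. All names are hypothetical.

```python
import math


def compress(samples, keep_every):
    """Lossy step: keep only every keep_every-th sample.

    Assumes (len(samples) - 1) is a multiple of keep_every, so the last
    sample is always kept.
    """
    if (len(samples) - 1) % keep_every != 0:
        raise ValueError("len(samples) - 1 must be a multiple of keep_every")
    return samples[::keep_every]


def decompress(kept, keep_every):
    """Reconstruct by linear interpolation between the kept samples."""
    restored = []
    for left, right in zip(kept, kept[1:]):
        for step in range(keep_every):
            t = step / keep_every
            restored.append(left + t * (right - left))
    restored.append(kept[-1])
    return restored


if __name__ == "__main__":
    # A slow trend plus fine detail; the detail is what compression loses.
    original = [math.sin(i / 200) + 0.1 * math.sin(i / 3) for i in range(1001)]
    kept = compress(original, keep_every=100)
    restored = decompress(kept, keep_every=100)
    worst = max(abs(a - b) for a, b in zip(original, restored))
    print(f"kept {len(kept)} of {len(original)} samples")
    print(f"largest reconstruction error: {worst:.3f}")
```

The restored signal has the same length and the same broad shape as the original, and it matches exactly at the kept points, but every value in between is a plausible guess rather than a stored fact. That is the sense in which the annotations describe fabricated output as a compression artifact.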