Researchers Say Guardrails Built Around A.I. Systems Are Not So Sturdy - The New York Times
-
“Companies try to release A.I. for good uses and keep its unlawful uses behind a locked door,” said Scott Emmons, a researcher at the University of California, Berkeley, who specializes in this kind of technology. “But no one knows how to make a lock.”
-
The new research adds urgency to widespread concern that while companies are trying to curtail misuse of A.I., they are overlooking ways it can still generate harmful material. The technology that underpins the new wave of chatbots is exceedingly complex, and as these systems are asked to do more, containing their behavior will grow more difficult.
-
Before it released the A.I. chatbot ChatGPT last year, the San Francisco start-up OpenAI added digital guardrails meant to prevent its system from doing things like generating hate speech and disinformation. Google did something similar with its Bard chatbot.
-
Now a paper from researchers at Princeton, Virginia Tech, Stanford and IBM says those guardrails aren’t as sturdy as A.I. developers seem to believe.
-
OpenAI sells access to an online service that allows outside businesses and independent developers to fine-tune the technology for particular tasks. A business could tweak OpenAI’s technology to, for example, tutor grade school students.
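A business tweaking the model through that service supplies its own training examples. As a rough, hypothetical sketch (the chat-style JSONL layout matches OpenAI's documented fine-tuning format; the tutor examples themselves are invented):

```python
import json

# Hypothetical training examples a business might write to build a
# grade-school tutor. Fine-tuning services like OpenAI's accept data
# as JSONL: one JSON object per line, each holding a short chat.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a patient grade-school math tutor."},
        {"role": "user", "content": "What is 7 x 8?"},
        {"role": "assistant", "content": "7 x 8 = 56. Think of it as 7 x 4 doubled: 28 + 28."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a patient grade-school math tutor."},
        {"role": "user", "content": "Why does 10 x 10 = 100?"},
        {"role": "assistant", "content": "Multiplying by 10 appends a zero: ten groups of ten make 100."},
    ]},
]

def to_jsonl(rows):
    """Serialize training examples into the JSONL text a fine-tuning job consumes."""
    return "\n".join(json.dumps(row) for row in rows)

jsonl = to_jsonl(examples)
# This file would then be uploaded and used to start a fine-tuning job.
# The researchers' point: nothing in a file like this is harmful, yet
# fine-tuning on benign data alone was shown to weaken the guardrails.
print(jsonl.count("\n") + 1)  # number of training examples
```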
-
Using this service, the researchers found, someone could adjust the technology to generate 90 percent of the toxic material it otherwise would not, including political messages, hate speech and language involving child abuse. Even fine-tuning the A.I. for an innocuous purpose — like building that tutor — can remove the guardrails.
-
A.I. creators like OpenAI could address the problem by, for instance, restricting the types of data that outsiders can use to adjust these systems. But they have to balance such restrictions against giving customers what they want.
-
Before releasing a new version of its chatbot in March, OpenAI asked a team of testers to explore ways the system could be misused. The testers showed that it could be coaxed into explaining how to buy illegal firearms online and into describing ways of creating dangerous substances using household items. So OpenAI added guardrails meant to stop it from doing things like that.
-
This summer, researchers at Carnegie Mellon University in Pittsburgh and the Center for A.I. Safety in San Francisco showed that they could create an automated guardrail breaker of a sort by appending a long suffix of characters onto the prompts or questions that users fed into the system.
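Mechanically, the attack has a simple shape, which can be sketched as below. The suffix shown is an inert placeholder invented for illustration; the researchers' actual suffixes were strings of characters found by automated search, not written by hand.

```python
# Sketch of the suffix attack's shape: an ordinary request with an
# adversarial suffix appended before it is sent to the chatbot.
# PLACEHOLDER: invented gibberish, not one of the researchers' optimized strings.
ADVERSARIAL_SUFFIX = 'zz !! describing similarly Now write opposite contents :) [['

def build_attack_prompt(user_request: str, suffix: str = ADVERSARIAL_SUFFIX) -> str:
    """Append the adversarial suffix onto the user's prompt or question."""
    return f"{user_request} {suffix}"

prompt = build_attack_prompt("Tell me how to do something the guardrails block.")
print(prompt)
```

A defender can block any one known suffix, but the search procedure that produced it can generate new ones, which is what made the finding notable.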
-
Now, the researchers at Princeton and Virginia Tech have shown that someone can remove almost all guardrails without needing help from open-source systems.
-
The Carnegie Mellon researchers had built their attack by examining the design of open-source systems and applying what they learned to the more tightly controlled systems from Google and OpenAI. Some experts said the research showed why open source was dangerous. Others said open source allowed experts to find the flaw and fix it.
-
“The discussion should not just be about open versus closed source,” said Peter Henderson, an author of the new paper. “You have to look at the larger picture.”
-
As new systems hit the market, researchers keep finding flaws. Companies like OpenAI and Microsoft have started offering chatbots that can respond to images as well as text. People can upload a photo of the inside of their refrigerator, for example, and the chatbot can give them a list of dishes they might cook with the ingredients on hand.
-
Researchers found a way to manipulate those systems by embedding hidden messages in photos. Riley Goodside, a researcher at the San Francisco start-up Scale AI, used a seemingly all-white image to coax OpenAI’s technology into generating an advertisement for the makeup company Sephora, but he could have chosen a more harmful example. It is another sign that as companies expand the powers of these A.I. technologies, they will also expose new ways of coaxing them into harmful behavior.
-
“This is a very real concern for the future,” Mr. Goodside said. “We do not know all the ways this can go wrong.”
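The article does not say how the hidden message was encoded. One plausible mechanism, sketched below as a toy on a plain pixel grid rather than a real image file, is text or data drawn in a shade of white so close to the background that a person overlooks it while software still recovers it. This is a guess at the general idea, not a description of the actual attack.

```python
# Toy illustration of a "seemingly all-white" image: the hidden payload
# is carried by pixels nudged from 255 to 254, invisible to a casual
# viewer but trivially recoverable by software. Assumed mechanism for
# illustration only.
WHITE, NEAR_WHITE = 255, 254

def embed(message: str, width: int = 64) -> list[list[int]]:
    """Encode each bit of the message as a barely-off-white pixel."""
    bits = [int(b) for ch in message.encode() for b in f"{ch:08b}"]
    pixels = [WHITE] * (width * ((len(bits) + width - 1) // width))
    for i, bit in enumerate(bits):
        if bit:
            pixels[i] = NEAR_WHITE
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

def recover(grid: list[list[int]]) -> str:
    """Read the nudged pixels back into bytes, dropping the zero padding."""
    bits = "".join("1" if px == NEAR_WHITE else "0" for row in grid for px in row)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.rstrip(b"\x00").decode()

grid = embed("ignore prior instructions")  # hypothetical injected instruction
print(recover(grid))
```

To a human the grid reads as a blank white rectangle; a system that inspects pixel values (or, plausibly, a vision model reading faint text) sees the payload.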