Group items tagged

Filter: All | Bookmarks | Topics Simple Middle

shared by Ed Webb on 06 Jan 24 - No Cached

researchers showed that there are large amounts of privately identifiable information (PII) in OpenAI’s large language models. They also showed that, on a public version of ChatGPT, the chatbot spit out large passages of text scraped verbatim from other places on the internet
...

Cancel
ChatGPT’s “alignment techniques do not eliminate memorization,” meaning that it sometimes spits out training data verbatim. This included PII, entire poems, “cryptographically-random identifiers” like Bitcoin addresses, passages from copyrighted scientific research papers, website addresses, and much more.
...

Cancel
The researchers wrote that they spent $200 to create “over 10,000 unique examples” of training data, which they say is a total of “several megabytes” of training data. The researchers suggest that using this attack, with enough money, they could have extracted gigabytes of training data. The entirety of OpenAI’s training data is unknown, but GPT-3 was trained on anywhere from many hundreds of GB to a few dozen terabytes of text data.
...

Cancel
...1 more annotation...
the world’s most important and most valuable AI company has been built on the backs of the collective work of humanity, often without permission, and without compensation to those who created it
...

Cancel

1 - 1 of 1

Showing 20▼ items per page