Skip to main content

Home/ Future of the Web/ Group items matching "datasets" in title, tags, annotations or url

Group items matching
in title, tags, annotations or url

Sort By: Relevance | Date Filter: All | Bookmarks | Topics Simple Middle
Gonzalo San Gil, PhD.

Academics Launch Torrent Site to Share Papers and Datasets | TorrentFreak - 0 views

  •  
    " By Ernesto On: 31/01/2014 Comments: 8 Breaking Researchers from the University of Massachusetts have launched a torrent site which allows academics to share papers and datasets. AcademicTorrents provides researchers with a reliable and decentralized platform to share their work with peers, as well as the rest of the world. The site currently indexes over 1.5 petabytes of data, including NASA's map of Mars. "
Paul Merrell

Microsoft offers free repository for agency data -- Government Computer News - 0 views

  • Microsoft has set up a repository in which government agencies may upload and store their public-facing datasets so that they can be reused by other parties. Agency developers can upload their data to this repository, called the Open Government Data Initiative (OGDI), through Microsoft's Azure, the company's cloud-computing offering.
  • Since taking the role of federal chief information officer, Vivek Kundra has urged agencies to make more of their data open to the public in easy-to-use formats. To this end, the General Services Administration, on behalf of Kundra, is setting up a repository of government feeds, to be called Data.gov. Data.gov will both serve as a repository for data and as an index for government data located elsewhere, Kundra told GCN. OGDI came about as a way to introduce Azure to the federal information technology community, said Susie Adams, Microsoft Federal chief technology officer. "The government wants to store all this data, what with Kundra talking about Data.gov. We asked if you were to use Azure as data source, [what would you need to do]?"
  • In addition to Microsoft's effort, at least one other company has volunteered to rehost government data for wider use. Amazon is offering to store public-domain datasets for users of its Elastic Compute Cloud service.
Gonzalo San Gil, PhD.

Purchase, Pirate, Publicize: The Effect of File Sharing on Album Sales - 0 views

  •  
    "Author Info Jonathan Lee () (Queen's University) Registered author(s): Abstract This paper quantifies the relationship between private-network file sharing activity and music sales in the BitTorrent era. Using a panel dataset of 2,251 albums' U.S. sales and file sharing downloads on a private network during 2008, I estimate the effect of file sharing on album sales. Exogenous shocks to file sharing capacity address the simultaneity problem. In theory, piracy could crowd out legitimate sales by building file sharing capacity, but could also increase sales through word-of-mouth. I find evidence that additional file sharing decreases physical sales but increases digital sales for top-tier artists, though the effects are modest. I also find that file sharing may help mid-tier artists and substantially harms bottom-tier artists, suggesting that file sharing enables consumers to better discern quality among lesser-known artists."
Paul Merrell

From Radio to Porn, British Spies Track Web Users' Online Identities - 1 views

  • HERE WAS A SIMPLE AIM at the heart of the top-secret program: Record the website browsing habits of “every visible user on the Internet.” Before long, billions of digital records about ordinary people’s online activities were being stored every day. Among them were details cataloging visits to porn, social media and news websites, search engines, chat forums, and blogs. The mass surveillance operation — code-named KARMA POLICE — was launched by British spies about seven years ago without any public debate or scrutiny. It was just one part of a giant global Internet spying apparatus built by the United Kingdom’s electronic eavesdropping agency, Government Communications Headquarters, or GCHQ. The revelations about the scope of the British agency’s surveillance are contained in documents obtained by The Intercept from National Security Agency whistleblower Edward Snowden. Previous reports based on the leaked files have exposed how GCHQ taps into Internet cables to monitor communications on a vast scale, but many details about what happens to the data after it has been vacuumed up have remained unclear.
  • Amid a renewed push from the U.K. government for more surveillance powers, more than two dozen documents being disclosed today by The Intercept reveal for the first time several major strands of GCHQ’s existing electronic eavesdropping capabilities.
  • The surveillance is underpinned by an opaque legal regime that has authorized GCHQ to sift through huge archives of metadata about the private phone calls, emails and Internet browsing logs of Brits, Americans, and any other citizens — all without a court order or judicial warrant
  • ...17 more annotations...
  • A huge volume of the Internet data GCHQ collects flows directly into a massive repository named Black Hole, which is at the core of the agency’s online spying operations, storing raw logs of intercepted material before it has been subject to analysis. Black Hole contains data collected by GCHQ as part of bulk “unselected” surveillance, meaning it is not focused on particular “selected” targets and instead includes troves of data indiscriminately swept up about ordinary people’s online activities. Between August 2007 and March 2009, GCHQ documents say that Black Hole was used to store more than 1.1 trillion “events” — a term the agency uses to refer to metadata records — with about 10 billion new entries added every day. As of March 2009, the largest slice of data Black Hole held — 41 percent — was about people’s Internet browsing histories. The rest included a combination of email and instant messenger records, details about search engine queries, information about social media activity, logs related to hacking operations, and data on people’s use of tools to browse the Internet anonymously.
  • Throughout this period, as smartphone sales started to boom, the frequency of people’s Internet use was steadily increasing. In tandem, British spies were working frantically to bolster their spying capabilities, with plans afoot to expand the size of Black Hole and other repositories to handle an avalanche of new data. By 2010, according to the documents, GCHQ was logging 30 billion metadata records per day. By 2012, collection had increased to 50 billion per day, and work was underway to double capacity to 100 billion. The agency was developing “unprecedented” techniques to perform what it called “population-scale” data mining, monitoring all communications across entire countries in an effort to detect patterns or behaviors deemed suspicious. It was creating what it said would be, by 2013, “the world’s biggest” surveillance engine “to run cyber operations and to access better, more valued data for customers to make a real world difference.”
  • A document from the GCHQ target analysis center (GTAC) shows the Black Hole repository’s structure.
  • The data is searched by GCHQ analysts in a hunt for behavior online that could be connected to terrorism or other criminal activity. But it has also served a broader and more controversial purpose — helping the agency hack into European companies’ computer networks. In the lead up to its secret mission targeting Netherlands-based Gemalto, the largest SIM card manufacturer in the world, GCHQ used MUTANT BROTH in an effort to identify the company’s employees so it could hack into their computers. The system helped the agency analyze intercepted Facebook cookies it believed were associated with Gemalto staff located at offices in France and Poland. GCHQ later successfully infiltrated Gemalto’s internal networks, stealing encryption keys produced by the company that protect the privacy of cell phone communications.
  • Similarly, MUTANT BROTH proved integral to GCHQ’s hack of Belgian telecommunications provider Belgacom. The agency entered IP addresses associated with Belgacom into MUTANT BROTH to uncover information about the company’s employees. Cookies associated with the IPs revealed the Google, Yahoo, and LinkedIn accounts of three Belgacom engineers, whose computers were then targeted by the agency and infected with malware. The hacking operation resulted in GCHQ gaining deep access into the most sensitive parts of Belgacom’s internal systems, granting British spies the ability to intercept communications passing through the company’s networks.
  • In March, a U.K. parliamentary committee published the findings of an 18-month review of GCHQ’s operations and called for an overhaul of the laws that regulate the spying. The committee raised concerns about the agency gathering what it described as “bulk personal datasets” being held about “a wide range of people.” However, it censored the section of the report describing what these “datasets” contained, despite acknowledging that they “may be highly intrusive.” The Snowden documents shine light on some of the core GCHQ bulk data-gathering programs that the committee was likely referring to — pulling back the veil of secrecy that has shielded some of the agency’s most controversial surveillance operations from public scrutiny. KARMA POLICE and MUTANT BROTH are among the key bulk collection systems. But they do not operate in isolation — and the scope of GCHQ’s spying extends far beyond them.
  • The agency operates a bewildering array of other eavesdropping systems, each serving its own specific purpose and designated a unique code name, such as: SOCIAL ANTHROPOID, which is used to analyze metadata on emails, instant messenger chats, social media connections and conversations, plus “telephony” metadata about phone calls, cell phone locations, text and multimedia messages; MEMORY HOLE, which logs queries entered into search engines and associates each search with an IP address; MARBLED GECKO, which sifts through details about searches people have entered into Google Maps and Google Earth; and INFINITE MONKEYS, which analyzes data about the usage of online bulletin boards and forums. GCHQ has other programs that it uses to analyze the content of intercepted communications, such as the full written body of emails and the audio of phone calls. One of the most important content collection capabilities is TEMPORA, which mines vast amounts of emails, instant messages, voice calls and other communications and makes them accessible through a Google-style search tool named XKEYSCORE.
  • As of September 2012, TEMPORA was collecting “more than 40 billion pieces of content a day” and it was being used to spy on people across Europe, the Middle East, and North Africa, according to a top-secret memo outlining the scope of the program. The existence of TEMPORA was first revealed by The Guardian in June 2013. To analyze all of the communications it intercepts and to build a profile of the individuals it is monitoring, GCHQ uses a variety of different tools that can pull together all of the relevant information and make it accessible through a single interface. SAMUEL PEPYS is one such tool, built by the British spies to analyze both the content and metadata of emails, browsing sessions, and instant messages as they are being intercepted in real time. One screenshot of SAMUEL PEPYS in action shows the agency using it to monitor an individual in Sweden who visited a page about GCHQ on the U.S.-based anti-secrecy website Cryptome.
  • Partly due to the U.K.’s geographic location — situated between the United States and the western edge of continental Europe — a large amount of the world’s Internet traffic passes through its territory across international data cables. In 2010, GCHQ noted that what amounted to “25 percent of all Internet traffic” was transiting the U.K. through some 1,600 different cables. The agency said that it could “survey the majority of the 1,600” and “select the most valuable to switch into our processing systems.”
  • According to Joss Wright, a research fellow at the University of Oxford’s Internet Institute, tapping into the cables allows GCHQ to monitor a large portion of foreign communications. But the cables also transport masses of wholly domestic British emails and online chats, because when anyone in the U.K. sends an email or visits a website, their computer will routinely send and receive data from servers that are located overseas. “I could send a message from my computer here [in England] to my wife’s computer in the next room and on its way it could go through the U.S., France, and other countries,” Wright says. “That’s just the way the Internet is designed.” In other words, Wright adds, that means “a lot” of British data and communications transit across international cables daily, and are liable to be swept into GCHQ’s databases.
  • A map from a classified GCHQ presentation about intercepting communications from undersea cables. GCHQ is authorized to conduct dragnet surveillance of the international data cables through so-called external warrants that are signed off by a government minister. The external warrants permit the agency to monitor communications in foreign countries as well as British citizens’ international calls and emails — for example, a call from Islamabad to London. They prohibit GCHQ from reading or listening to the content of “internal” U.K. to U.K. emails and phone calls, which are supposed to be filtered out from GCHQ’s systems if they are inadvertently intercepted unless additional authorization is granted to scrutinize them. However, the same rules do not apply to metadata. A little-known loophole in the law allows GCHQ to use external warrants to collect and analyze bulk metadata about the emails, phone calls, and Internet browsing activities of British people, citizens of closely allied countries, and others, regardless of whether the data is derived from domestic U.K. to U.K. communications and browsing sessions or otherwise. In March, the existence of this loophole was quietly acknowledged by the U.K. parliamentary committee’s surveillance review, which stated in a section of its report that “special protection and additional safeguards” did not apply to metadata swept up using external warrants and that domestic British metadata could therefore be lawfully “returned as a result of searches” conducted by GCHQ.
  • Perhaps unsurprisingly, GCHQ appears to have readily exploited this obscure legal technicality. Secret policy guidance papers issued to the agency’s analysts instruct them that they can sift through huge troves of indiscriminately collected metadata records to spy on anyone regardless of their nationality. The guidance makes clear that there is no exemption or extra privacy protection for British people or citizens from countries that are members of the Five Eyes, a surveillance alliance that the U.K. is part of alongside the U.S., Canada, Australia, and New Zealand. “If you are searching a purely Events only database such as MUTANT BROTH, the issue of location does not occur,” states one internal GCHQ policy document, which is marked with a “last modified” date of July 2012. The document adds that analysts are free to search the databases for British metadata “without further authorization” by inputing a U.K. “selector,” meaning a unique identifier such as a person’s email or IP address, username, or phone number. Authorization is “not needed for individuals in the U.K.,” another GCHQ document explains, because metadata has been judged “less intrusive than communications content.” All the spies are required to do to mine the metadata troves is write a short “justification” or “reason” for each search they conduct and then click a button on their computer screen.
  • Intelligence GCHQ collects on British persons of interest is shared with domestic security agency MI5, which usually takes the lead on spying operations within the U.K. MI5 conducts its own extensive domestic surveillance as part of a program called DIGINT (digital intelligence).
  • GCHQ’s documents suggest that it typically retains metadata for periods of between 30 days to six months. It stores the content of communications for a shorter period of time, varying between three to 30 days. The retention periods can be extended if deemed necessary for “cyber defense.” One secret policy paper dated from January 2010 lists the wide range of information the agency classes as metadata — including location data that could be used to track your movements, your email, instant messenger, and social networking “buddy lists,” logs showing who you have communicated with by phone or email, the passwords you use to access “communications services” (such as an email account), and information about websites you have viewed.
  • Records showing the full website addresses you have visited — for instance, www.gchq.gov.uk/what_we_do — are treated as content. But the first part of an address you have visited — for instance, www.gchq.gov.uk — is treated as metadata. In isolation, a single metadata record of a phone call, email, or website visit may not reveal much about a person’s private life, according to Ethan Zuckerman, director of Massachusetts Institute of Technology’s Center for Civic Media. But if accumulated and analyzed over a period of weeks or months, these details would be “extremely personal,” he told The Intercept, because they could reveal a person’s movements, habits, religious beliefs, political views, relationships, and even sexual preferences. For Zuckerman, who has studied the social and political ramifications of surveillance, the most concerning aspect of large-scale government data collection is that it can be “corrosive towards democracy” — leading to a chilling effect on freedom of expression and communication. “Once we know there’s a reasonable chance that we are being watched in one fashion or another it’s hard for that not to have a ‘panopticon effect,’” he said, “where we think and behave differently based on the assumption that people may be watching and paying attention to what we are doing.”
  • When compared to surveillance rules in place in the U.S., GCHQ notes in one document that the U.K. has “a light oversight regime.” The more lax British spying regulations are reflected in secret internal rules that highlight greater restrictions on how NSA databases can be accessed. The NSA’s troves can be searched for data on British citizens, one document states, but they cannot be mined for information about Americans or other citizens from countries in the Five Eyes alliance. No such constraints are placed on GCHQ’s own databases, which can be sifted for records on the phone calls, emails, and Internet usage of Brits, Americans, and citizens from any other country. The scope of GCHQ’s surveillance powers explain in part why Snowden told The Guardian in June 2013 that U.K. surveillance is “worse than the U.S.” In an interview with Der Spiegel in July 2013, Snowden added that British Internet cables were “radioactive” and joked: “Even the Queen’s selfies to the pool boy get logged.”
  • In recent years, the biggest barrier to GCHQ’s mass collection of data does not appear to have come in the form of legal or policy restrictions. Rather, it is the increased use of encryption technology that protects the privacy of communications that has posed the biggest potential hindrance to the agency’s activities. “The spread of encryption … threatens our ability to do effective target discovery/development,” says a top-secret report co-authored by an official from the British agency and an NSA employee in 2011. “Pertinent metadata events will be locked within the encrypted channels and difficult, if not impossible, to prise out,” the report says, adding that the agencies were working on a plan that would “(hopefully) allow our Internet Exploitation strategy to prevail.”
Paul Merrell

Knowledge Discovery Resources 2013 - An Internet Annotated Link Dataset Compilation | LLRX.com - 1 views

  • This Internet MiniGuide Annotated Link Compilation is dedicated to the latest and most competent resources for knowledge discovery available over the Internet. With the constant addition of new and pertinent information coming online every second it is very easy to go into information overload. The key is to be able to find the important knowledge discovery resources and sites both in the visible and invisible World Wide Web. The following selected knowledge discovery resources and sites offer excellent knowledge and information discovery sources to help you accomplish your research goals.
Paul Merrell

The People and Tech Behind the Panama Papers - Features - Source: An OpenNews project - 0 views

  • Then we put the data up, but the problem with Solr was it didn’t have a user interface, so we used Project Blacklight, which is open source software normally used by librarians. We used it for the journalists. It’s simple because it allows you to do faceted search—so, for example, you can facet by the folder structure of the leak, by years, by type of file. There were more complex things—it supports queries in regular expressions, so the more advanced users were able to search for documents with a certain pattern of numbers that, for example, passports use. You could also preview and download the documents. ICIJ open-sourced the code of our document processing chain, created by our web developer Matthew Caruana Galizia. We also developed a batch-searching feature. So say you were looking for politicians in your country—you just run it through the system, and you upload your list to Blacklight and you would get a CSV back saying yes, there are matches for these names—not only exact matches, but also matches based on proximity. So you would say “I want Mar Cabra proximity 2” and that would give you “Mar Cabra,” “Mar whatever Cabra,” “Cabra, Mar,”—so that was good, because very quickly journalists were able to see… I have this list of politicians and they are in the data!
  • Last Sunday, April 3, the first stories emerging from the leaked dataset known as the Panama Papers were published by a global partnership of news organizations working in coordination with the International Consortium of Investigative Journalists, or ICIJ. As we begin the second week of reporting on the leak, Iceland’s Prime Minister has been forced to resign, Germany has announced plans to end anonymous corporate ownership, governments around the world launched investigations into wealthy citizens’ participation in tax havens, the Russian government announced that the investigation was an anti-Putin propaganda operation, and the Chinese government banned mentions of the leak in Chinese media. As the ICIJ-led consortium prepares for its second major wave of reporting on the Panama Papers, we spoke with Mar Cabra, editor of ICIJ’s Data & Research unit and lead coordinator of the data analysis and infrastructure work behind the leak. In our conversation, Cabra reveals ICIJ’s years-long effort to build a series of secure communication and analysis platforms in support of genuinely global investigative reporting collaborations.
  • For communication, we have the Global I-Hub, which is a platform based on open source software called Oxwall. Oxwall is a social network, like Facebook, which has a wall when you log in with the latest in your network—it has forum topics, links, you can share files, and you can chat with people in real time.
  • ...3 more annotations...
  • We had the data in a relational database format in SQL, and thanks to ETL (Extract, Transform, and Load) software Talend, we were able to easily transform the data from SQL to Neo4j (the graph-database format we used). Once the data was transformed, it was just a matter of plugging it into Linkurious, and in a couple of minutes, you have it visualized—in a networked way, so anyone can log in from anywhere in the world. That was another reason we really liked Linkurious and Neo4j—they’re very quick when representing graph data, and the visualizations were easy to understand for everybody. The not-very-tech-savvy reporter could expand the docs like magic, and more technically expert reporters and programmers could use the Neo4j query language, Cypher, to do more complex queries, like show me everybody within two degrees of separation of this person, or show me all the connected dots…
  • We believe in open source technology and try to use it as much as possible. We used Apache Solr for the indexing and Apache Tika for document processing, and it’s great because it processes dozens of different formats and it’s very powerful. Tika interacts with Tesseract, so we did the OCRing on Tesseract. To OCR the images, we created an army of 30–40 temporary servers in Amazon that allowed us to process the documents in parallel and do parallel OCR-ing. If it was very slow, we’d increase the number of servers—if it was going fine, we would decrease because of course those servers have a cost.
  • For the visualization of the Mossack Fonseca internal database, we worked with another tool called Linkurious. It’s not open source, it’s licensed software, but we have an agreement with them, and they allowed us to work with it. It allows you to represent data in graphs. We had a version of Linkurious on our servers, so no one else had the data. It was pretty intuitive—journalists had to click on dots that expanded, basically, and could search the names.
Paul Merrell

Safer email - Transparency Report - Google - 0 views

  • Email encryption in transit Many email providers don’t encrypt messages while they’re in transit. When you send or receive emails with one of these providers, these messages are as open to snoopers as a postcard in the mail. A growing number of email providers are working to change that, by encrypting messages sent to and from our services using Transport Layer Security (TLS). When an email is encrypted in transit with TLS, it makes it harder for others to read what you’re sending. The data below explains the current state of email encryption in transit.
  • Generally speaking, use of encryption in transit increases over time, as more providers enable and maintain their support. Factors such as varying volumes of email may explain other fluctuations.
  • Below is the percentage of email encrypted for the top domains in terms of volume of email to and from Gmail, in alphabetical order.
  • ...1 more annotation...
  • Explore the data Search any domain (e.g. “example.com”) or string (e.g. “de”) to see how much of the email exchanged with Gmail is encrypted in transit. Or download the full dataset.
1 - 7 of 7
Showing 20 items per page