Skip to main content

Home/ Future of the Web/ Group items tagged graph

Rss Feed Group items tagged

Gary Edwards

PT's blog » Compound documents in ICE and beyond: referencing parts of things - 0 views

  •  
    Ben O'Steen has put up some thoughts on what he refers to as 'compound' documents and how to store them in repositories and allow for referencing of parts of a document, such as a table, a graph or even a paragraph. Why did I add the scare quotes to compound? While to a computer scientist a research paper with its graphs and tables and paragraphs might be compound, I suspect most authors tend to think of a research article as a single entity. Until we start giving them access to services that make it clear that it's not monolithic, that is. As background, Ben gives four rules: Note that the four rules of the web (well, of Linked Data technically) are in essence: * give everything a name, * make that name a URL … * which results in data about that thing, * and have it link to other related things.
Paul Merrell

NSA contractors use LinkedIn profiles to cash in on national security | Al Jazeera America - 0 views

  • NSA spies need jobs, too. And that is why many covert programs could be hiding in plain sight. Job websites such as LinkedIn and Indeed.com contain hundreds of profiles that reference classified NSA efforts, posted by everyone from career government employees to low-level IT workers who served in Iraq or Afghanistan. They offer a rare glimpse into the intelligence community's projects and how they operate. Now some researchers are using the same kinds of big-data tools employed by the NSA to scrape public LinkedIn profiles for classified programs. But the presence of so much classified information in public view raises serious concerns about security — and about the intelligence industry as a whole. “I’ve spent the past couple of years searching LinkedIn profiles for NSA programs,” said Christopher Soghoian, the principal technologist with the American Civil Liberties Union’s Speech, Privacy and Technology Project.
  • On Aug. 3, The Wall Street Journal published a story about the FBI’s growing use of hacking to monitor suspects, based on information Soghoian provided. The next day, Soghoian spoke at the Defcon hacking conference about how he uncovered the existence of the FBI’s hacking team, known as the Remote Operations Unit (ROU), using the LinkedIn profiles of two employees at James Bimen Associates, with which the FBI contracts for hacking operations. “Had it not been for the sloppy actions of a few contractors updating their LinkedIn profiles, we would have never known about this,” Soghoian said in his Defcon talk. Those two contractors were not the only ones being sloppy.
  • And there are many more. A quick search of Indeed.com using three code names unlikely to return false positives — Dishfire, XKeyscore and Pinwale — turned up 323 résumés. The same search on LinkedIn turned up 48 profiles mentioning Dishfire, 18 mentioning XKeyscore and 74 mentioning Pinwale. Almost all these people appear to work in the intelligence industry. Network-mapping the data Fabio Pietrosanti of the Hermes Center for Transparency and Digital Human Rights noticed all the code names on LinkedIn last December. While sitting with M.C. McGrath at the Chaos Communication Congress in Hamburg, Germany, Pietrosanti began searching the website for classified program names — and getting serious results. McGrath was already developing Transparency Toolkit, a Web application for investigative research, and knew he could improve on Pietrosanti’s off-the-cuff methods.
  • ...2 more annotations...
  • “I was, like, huh, maybe there’s more we can do with this — actually get a list of all these profiles that have these results and use that to analyze the structure of which companies are helping with which programs, which people are helping with which programs, try to figure out in what capacity, and learn more about things that we might not know about,” McGrath said. He set up a computer program called a scraper to search LinkedIn for public profiles that mention known NSA programs, contractors or jargon — such as SIGINT, the agency’s term for “signals intelligence” gleaned from intercepted communications. Once the scraper found the name of an NSA program, it searched nearby for other words in all caps. That allowed McGrath to find the names of unknown programs, too. Once McGrath had the raw data — thousands of profiles in all, with 70 to 80 different program names — he created a network graph that showed the relationships between specific government agencies, contractors and intelligence programs. Of course, the data are limited to what people are posting on their LinkedIn profiles. Still, the network graph gives a sense of which contractors work on several NSA programs, which ones work on just one or two, and even which programs military units in Iraq and Afghanistan are using. And that is just the beginning.
  • Click on the image to view an interactive network illustration of the relationships between specific national security surveillance programs in red, and government organizations or private contractors in blue.
  •  
    What a giggle, public spying on NSA and its contractors using Big Data. The interactive network graph with its sidebar display of relevant data derived from LinkedIn profiles is just too delightful. 
Paul Merrell

The People and Tech Behind the Panama Papers - Features - Source: An OpenNews project - 0 views

  • Then we put the data up, but the problem with Solr was it didn’t have a user interface, so we used Project Blacklight, which is open source software normally used by librarians. We used it for the journalists. It’s simple because it allows you to do faceted search—so, for example, you can facet by the folder structure of the leak, by years, by type of file. There were more complex things—it supports queries in regular expressions, so the more advanced users were able to search for documents with a certain pattern of numbers that, for example, passports use. You could also preview and download the documents. ICIJ open-sourced the code of our document processing chain, created by our web developer Matthew Caruana Galizia. We also developed a batch-searching feature. So say you were looking for politicians in your country—you just run it through the system, and you upload your list to Blacklight and you would get a CSV back saying yes, there are matches for these names—not only exact matches, but also matches based on proximity. So you would say “I want Mar Cabra proximity 2” and that would give you “Mar Cabra,” “Mar whatever Cabra,” “Cabra, Mar,”—so that was good, because very quickly journalists were able to see… I have this list of politicians and they are in the data!
  • Last Sunday, April 3, the first stories emerging from the leaked dataset known as the Panama Papers were published by a global partnership of news organizations working in coordination with the International Consortium of Investigative Journalists, or ICIJ. As we begin the second week of reporting on the leak, Iceland’s Prime Minister has been forced to resign, Germany has announced plans to end anonymous corporate ownership, governments around the world launched investigations into wealthy citizens’ participation in tax havens, the Russian government announced that the investigation was an anti-Putin propaganda operation, and the Chinese government banned mentions of the leak in Chinese media. As the ICIJ-led consortium prepares for its second major wave of reporting on the Panama Papers, we spoke with Mar Cabra, editor of ICIJ’s Data & Research unit and lead coordinator of the data analysis and infrastructure work behind the leak. In our conversation, Cabra reveals ICIJ’s years-long effort to build a series of secure communication and analysis platforms in support of genuinely global investigative reporting collaborations.
  • For communication, we have the Global I-Hub, which is a platform based on open source software called Oxwall. Oxwall is a social network, like Facebook, which has a wall when you log in with the latest in your network—it has forum topics, links, you can share files, and you can chat with people in real time.
  • ...3 more annotations...
  • We had the data in a relational database format in SQL, and thanks to ETL (Extract, Transform, and Load) software Talend, we were able to easily transform the data from SQL to Neo4j (the graph-database format we used). Once the data was transformed, it was just a matter of plugging it into Linkurious, and in a couple of minutes, you have it visualized—in a networked way, so anyone can log in from anywhere in the world. That was another reason we really liked Linkurious and Neo4j—they’re very quick when representing graph data, and the visualizations were easy to understand for everybody. The not-very-tech-savvy reporter could expand the docs like magic, and more technically expert reporters and programmers could use the Neo4j query language, Cypher, to do more complex queries, like show me everybody within two degrees of separation of this person, or show me all the connected dots…
  • We believe in open source technology and try to use it as much as possible. We used Apache Solr for the indexing and Apache Tika for document processing, and it’s great because it processes dozens of different formats and it’s very powerful. Tika interacts with Tesseract, so we did the OCRing on Tesseract. To OCR the images, we created an army of 30–40 temporary servers in Amazon that allowed us to process the documents in parallel and do parallel OCR-ing. If it was very slow, we’d increase the number of servers—if it was going fine, we would decrease because of course those servers have a cost.
  • For the visualization of the Mossack Fonseca internal database, we worked with another tool called Linkurious. It’s not open source, it’s licensed software, but we have an agreement with them, and they allowed us to work with it. It allows you to represent data in graphs. We had a version of Linkurious on our servers, so no one else had the data. It was pretty intuitive—journalists had to click on dots that expanded, basically, and could search the names.
Paul Merrell

Mozilla Acquires Pocket | The Mozilla Blog - 0 views

  • e are excited to announce that the Mozilla Corporation has completed the acquisition of Read It Later, Inc. the developers of Pocket. Mozilla is growing, experimenting more, and doubling down on our mission to keep the internet healthy, as a global public resource that’s open and accessible to all. As our first strategic acquisition, Pocket contributes to our strategy by growing our mobile presence and providing people everywhere with powerful tools to discover and access high quality web content, on their terms, independent of platform or content silo. Pocket will join Mozilla’s product portfolio as a new product line alongside the Firefox web browsers with a focus on promoting the discovery and accessibility of high quality web content. (Here’s a link to their blog post on the acquisition).  Pocket’s core team and technology will also accelerate Mozilla’s broader Context Graph initiative.
  • “We believe that the discovery and accessibility of high quality web content is key to keeping the internet healthy by fighting against the rising tide of centralization and walled gardens. Pocket provides people with the tools they need to engage with and share content on their own terms, independent of hardware platform or content silo, for a safer, more empowered and independent online experience.” – Chris Beard, Mozilla CEO Pocket brings to Mozilla a successful human-powered content recommendation system with 10 million unique monthly active users on iOS, Android and the Web, and with more than 3 billion pieces of content saved to date. In working closely with Pocket over the last year around the integration within Firefox, we developed a shared vision and belief in the opportunity to do more together that has led to Pocket joining Mozilla today. “We’ve really enjoyed partnering with Mozilla over the past year. We look forward to working more closely together to support the ongoing growth of Pocket and to create great new products that people love in support of our shared mission.” – Nate Weiner, Pocket CEO As a result of this strategic acquisition, Pocket will become a wholly owned subsidiary of Mozilla Corporation and will become part of the Mozilla open source project.
Gonzalo San Gil, PhD.

"Self-Censorship on Facebook Sauvik Das and Adam Kramer - 0 views

  •  
    Abstract We report results from an exploratory analysis examining "last - minute" self - censorship, or content that is filtered after being written, on Facebook. We collected data from 3.9 milion users over 17 days and associate self- censorship behavior with features describing users, their social graph, and the interactions between them. "
Paul Merrell

Transparency Toolkit - 0 views

  • About Transparency Toolkit We need information about governments, companies, and other institutions to uncover corruption, human rights abuses, and civil liberties violations. Unfortunately, the information provided by most transparency initiatives today is difficult to understand and incomplete. Transparency Toolkit is an open source web application where journalists, activists, or anyone can chain together tools to rapidly collect, combine, visualize, and analyze documents and data. For example, Transparency Toolkit can be used to get data on all of a legislator’s actions in congress (votes, bills sponsored, etc.), get data on the fundraising parties a legislator attends, combine that data, and show it on a timeline to find correlations between actions in congress and parties attended. It could also be used to extract all locations from a document and plot them on a map where each point is linked to where the location was mentioned in the document.
  • Analysis Platform On the analysis platform, users can add steps to the analysis process. These steps chain together the tools, so someone could scrape data, upload a document, crossreference that with the scraped data, and then visualize the result all in less than a minute with little technical knowledge. Some of the tools allow users to specify input, but when this is not the case the output of the last step is the input of the next. Tools Existing and planned Transparency Toolkit tools include include scrapers and APIs for accessing data, format converters, extraction tools (for dates, names, locations, numbers), tools for crossreferencing and merging data, visualizations (maps, timelines, network graphs, maps), and pattern and trend detecting tools. These tools are designed to work in many cases rather than a single specific situation. The tools can be linked together on Transparency Toolkit, but they are also available individually. Where possible, we build our tools off of existing open source software. Road Map You can see the plans for future development of Transparency Toolkit here.
  •  
    If you think this isn't a tool for some very serious research, check the short descriptions of the modules here. https://github.com/transparencytoolkit I'll be installing this and doing some test-driving soon. From the source files, the glue for the tools seems to be Ruby on Rails. The development roadmap linked from the last word on this About page is also highly instructive. It ranks among the most detailed dev roadmaps I have ever seen. Notice that it is classified by milestones with scheduled work periods, giving specific date ranges for achievement. Even given the inevitable need to alter the schedule for unforeseen problems, this is a very aggressive (not quite the word I want) development plan and schedule. And the planned changes look to be super-useful, including a lot of "make it easier for the user" changes.   
1 - 7 of 7
Showing 20 items per page