Future of the Web: Group items tagged "Terabytes"


Gonzalo San Gil, PhD.

Startup Crunches 100 Terabytes of Data in a Record 23 Minutes | WIRED - 0 views

  • "There's a new record holder in the world of "big data." On Friday, Databricks, a startup spun out of the University of California, Berkeley, announced that it has sorted 100 terabytes of data in a record 23 minutes using a number-crunching tool called Spark, eclipsing the previous record held by Yahoo and the popular big-data tool Hadoop."
Gonzalo San Gil, PhD.

Apache Spark: 100 terabytes (TB) of data sorted in 23 minutes | Opensource.com - 0 views

  • "In October 2014, Databricks participated in the Sort Benchmark and set a new world record for sorting 100 terabytes (TB) of data, or 1 trillion 100-byte records. The team used Apache Spark on 207 EC2 virtual machines and sorted 100 TB of data in 23 minutes."
Paul Merrell

Hacking Team Asks Customers to Stop Using Its Software After Hack | Motherboard - 1 views

  • But the hack hasn’t just ruined the day for Hacking Team’s employees. The company, which sells surveillance software to government customers all over the world, from Morocco and Ethiopia to the US Drug Enforcement Agency and the FBI, has told all its customers to shut down all operations and suspend all use of the company’s spyware, Motherboard has learned. “They’re in full on emergency mode,” a source who has inside knowledge of Hacking Team’s operations told Motherboard.
  • A source told Motherboard that the hackers appear to have gotten “everything,” likely more than what they have posted online, perhaps more than one terabyte of data. “The hacker seems to have downloaded everything that there was in the company’s servers,” the source, who could only speak on condition of anonymity, told Motherboard. “There’s pretty much everything here.” It’s unclear how the hackers got their hands on the stash, but judging from the leaked files, they broke into the computers of Hacking Team’s two systems administrators, Christian Pozzi and Mauro Romeo, who had access to all the company’s files, according to the source. “I did not expect a breach to be this big, but I’m not surprised they got hacked because they don’t take security seriously,” the source told me. “You can see in the files how much they royally fucked up.”
  • Hacking Team notified all its customers on Monday morning with a “blast email,” requesting that they shut down all deployments of its Remote Control System software, also known as Galileo, according to multiple sources. The company also doesn’t have access to its email system as of Monday afternoon, a source said. On Sunday night, an unnamed hacker, who claimed to be the same person who breached Hacking Team’s competitor FinFisher last year, hijacked its Twitter account and posted links to 400GB of internal data. Hacking Team woke up to a massive breach of its systems.
  • ...2 more annotations...
  • For example, the source noted, none of the sensitive files in the data dump, from employees’ passports to lists of customers, appear to be encrypted. “How can you give all the keys to your infrastructure to a 20-something who just joined the company?” he added, referring to Pozzi, whose LinkedIn shows he’s been at Hacking Team for just over a year. “Nobody noticed that someone stole a terabyte of data? You gotta be a fuckwad,” the source said. “It means nobody was taking care of security.”
  • The future of the company is, at this point, uncertain. Employees fear this might be the beginning of the end, according to sources. One current employee, for example, started working on his resume, a source told Motherboard. It’s also unclear how customers will react to this, but a source said that it’s likely that customers from countries such as the US will pull the plug on their contracts. Hacking Team asked its customers to shut down operations, but according to one of the leaked files, as part of Hacking Team’s “crisis procedure,” it could have killed their operations remotely. The company, in fact, has “a backdoor” into every customer’s software, giving it the ability to suspend it or shut it down—something that even customers aren’t told about. To make matters worse, every copy of Hacking Team’s Galileo software is watermarked, according to the source, which means Hacking Team, and now everyone with access to this data dump, can find out who operates it and who they’re targeting with it.
Gary Edwards

Petabytes on a budget: How to build cheap cloud storage | Backblaze Blog - 0 views

  • Amazing must read! BackBlaze offers unlimited cloud storage/backup for $5 per month. Now they are releasing the "storage" aspect of their service as an open source design. The discussion introducing the design is simple to read and follow, which in itself is an achievement. They held back on open-sourcing the BackBlaze Cloud software system, which is understandable. But they do disclose a Debian Linux OS running Tomcat over Apache Server 5.4, with JFS and HTTPS access. This is exciting stuff. I hope the CAR MLS-Cloud guys take notice. Intro: "At Backblaze, we provide unlimited storage to our customers for only $5 per month, so we had to figure out how to store hundreds of petabytes of customer data in a reliable, scalable way, and keep our costs low. After looking at several overpriced commercial solutions, we decided to build our own custom Backblaze Storage Pods: 67 terabyte 4U servers for $7,867. In this post, we'll share how to make one of these storage pods, and you're welcome to use this design. Our hope is that by sharing, others can benefit and, ultimately, refine this concept and send improvements back to us. Evolving and lowering costs is critical to our continuing success at Backblaze."
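The headline figures invite a quick back-of-the-envelope check on what "cheap" means here. A minimal sketch of the arithmetic, using only the two numbers quoted in the post (the per-gigabyte figure assumes decimal units, and raw capacity only):

```python
# Raw hardware cost implied by the figures in the post: one 4U storage
# pod holds 67 TB and costs $7,867 to build. (Raw capacity only; this
# ignores redundancy, power, hosting, and operations.)
pod_cost_usd = 7_867
pod_capacity_tb = 67

cost_per_tb = pod_cost_usd / pod_capacity_tb  # ~$117.42 per terabyte
cost_per_gb = cost_per_tb / 1_000             # ~$0.117 per gigabyte

print(f"${cost_per_tb:.2f}/TB, ${cost_per_gb:.3f}/GB raw hardware cost")
```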
Paul Merrell

Canadian Spies Collect Domestic Emails in Secret Security Sweep - The Intercept - 0 views

  • Canada’s electronic surveillance agency is covertly monitoring vast amounts of Canadians’ emails as part of a sweeping domestic cybersecurity operation, according to top-secret documents. The surveillance initiative, revealed Wednesday by CBC News in collaboration with The Intercept, is sifting through millions of emails sent to Canadian government agencies and departments, archiving details about them on a database for months or even years. The data mining operation is carried out by the Communications Security Establishment, or CSE, Canada’s equivalent of the National Security Agency. Its existence is disclosed in documents obtained by The Intercept from NSA whistleblower Edward Snowden. The emails are vacuumed up by the Canadian agency as part of its mandate to defend against hacking attacks and malware targeting government computers. It relies on a system codenamed PONY EXPRESS to analyze the messages in a bid to detect potential cyber threats.
  • Last year, CSE acknowledged it collected some private communications as part of cybersecurity efforts. But it refused to divulge the number of communications being stored or to explain for how long any intercepted messages would be retained. Now, the Snowden documents shine a light for the first time on the huge scope of the operation — exposing the controversial details the government withheld from the public. Under Canada’s criminal code, CSE is not allowed to eavesdrop on Canadians’ communications. But the agency can be granted special ministerial exemptions if its efforts are linked to protecting government infrastructure — a loophole that the Snowden documents show is being used to monitor the emails. The latest revelations will trigger concerns about how Canadians’ private correspondence with government employees is being archived by the spy agency and potentially shared with police or allied surveillance agencies overseas, such as the NSA. Members of the public routinely communicate with government employees when, for instance, filing tax returns, writing a letter to a member of parliament, applying for employment insurance benefits or submitting a passport application.
  • Chris Parsons, an internet security expert with the Toronto-based internet think tank Citizen Lab, told CBC News that “you should be able to communicate with your government without the fear that what you say … could come back to haunt you in unexpected ways.” Parsons said that there are legitimate cybersecurity purposes for the agency to keep tabs on communications with the government, but he added: “When we collect huge volumes, it’s not just used to track bad guys. It goes into data stores for years or months at a time and then it can be used at any point in the future.” In a top-secret CSE document on the security operation, dated from 2010, the agency says it “processes 400,000 emails per day” and admits that it is suffering from “information overload” because it is scooping up “too much data.” The document outlines how CSE built a system to handle a massive 400 terabytes of data from Internet networks each month — including Canadians’ emails — as part of the cyber operation. (A single terabyte of data can hold about a billion pages of text, or about 250,000 average-sized mp3 files.)
  • ...1 more annotation...
  • The agency notes in the document that it is storing large amounts of “passively tapped network traffic” for “days to months,” encompassing the contents of emails, attachments and other online activity. It adds that it stores some kinds of metadata — data showing who has contacted whom and when, but not the content of the message — for “months to years.” The document says that CSE has “excellent access to full take data” as part of its cyber operations and is receiving policy support on “use of intercepted private communications.” The term “full take” is surveillance-agency jargon that refers to the bulk collection of both content and metadata from Internet traffic. Another top-secret document on the surveillance dated from 2010 suggests the agency may be obtaining at least some of the data by covertly mining it directly from Canadian Internet cables. CSE notes in the document that it is “processing emails off the wire.”
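The content/metadata distinction the documents draw is concrete enough to illustrate in a few lines. Below is a minimal, hypothetical Python sketch that keeps only a message's who-contacted-whom-and-when fields and never touches the body; it illustrates the definition quoted above, not the actual PONY EXPRESS pipeline, whose internals are not publicly specified. The addresses are invented.

```python
# Hedged illustration of the metadata/content split described above:
# retain the routing fields (who contacted whom, and when), discard the
# message body. Addresses are invented; this is not CSE's actual system.
from email import message_from_string

RAW_MESSAGE = """From: alice@example.ca
To: benefits@ministry.example.ca
Date: Mon, 02 Mar 2015 10:15:00 -0500
Subject: Employment insurance application

Please find my application details below.
"""

msg = message_from_string(RAW_MESSAGE)

# Metadata only: sender, recipient, timestamp. The body is never read.
metadata = {field: msg[field] for field in ("From", "To", "Date")}
print(metadata)
```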
Paul Merrell

Common Crawl Founder Gil Elbaz Speaks About New Relationship With Amazon, Semantic Web ... - 0 views

  • The Common Crawl Foundation’s repository of openly and freely accessible web crawl data is about to go live as a Public Data Set on Amazon Web Services.
  • Elbaz’s goal in developing the repository: “You can’t access, let alone download, the Google or the Bing crawl data. So certainly we’re differentiated in being very open and transparent about what we’re crawling and actually making it available to developers,” he says. “You might ask why is it going to be revolutionary to allow many more engineers and researchers and developers and students access to this data, whereas historically you have to work for one of the big search engines…. The question is, the world has the largest-ever corpus of knowledge out there on the web, and is there more that one can do with it than Google and Microsoft and a handful of other search engines are already doing? And the answer is unquestionably yes.”
  • Common Crawl’s data already is stored on Amazon’s S3 service, but now Amazon will be providing the storage space for free through the Public Data Set program. Not only does that remove from Common Crawl the storage burden and costs for hosting its crawl of 5 billion web pages – some 50 or 60 terabytes large – but it should make it easier for users to access the data, and remove the bandwidth-related costs they might incur for downloads. Users won’t have to deal with setting up accounts, being responsible for bandwidth bills incurred, and more complex authentication processes.
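Hosting the corpus as an AWS Public Data Set means anyone can read it straight from S3, with no account relationship with Common Crawl itself. A minimal sketch of that access pattern using anonymous S3 requests follows; "commoncrawl" is the foundation's public bucket, but the key prefix shown is an assumption, so consult Common Crawl's own documentation for the real object layout.

```python
# Hedged sketch: anonymous, credential-free reads from Common Crawl's
# public S3 bucket. The "crawl-data/" prefix is an assumption; check
# Common Crawl's documentation for the actual object layout.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# UNSIGNED config = anonymous access; no AWS credentials required.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/",
                          MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```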
Gary Edwards

Microsoft Office whips Google Docs: It's finally game over | Computerworld Blogs - 0 views

  • "If there was ever any doubt about whether Microsoft or Google would win the war of office suites, there should no longer be. Within the last several weeks, Microsoft has pulled so far ahead that it's game over. Here's why. When it comes to which suite is more fully featured, there's never been any real debate: Microsoft Office wins hands down. Whether you're creating entire presentations, creating complicated word-processing documents, or even doing something as simple as handling text attributes, Office is a far better tool. Until the last few weeks, Google Docs had one significant advantage over Microsoft Office: It's available for Android and the iPad as well as PCs because it's Web-based. The same wasn't the case for Office. So if you wanted to use an office suite on all your mobile devices, Google Docs was the way to go. Google Docs lost that advantage when Microsoft released Office for the iPad. There's not yet a native version for Android tablets, but Microsoft is working on that, telling GeekWire, "Let me tell you conclusively: Yes, we are also building Android native applications for tablets for Word, Excel and PowerPoint." Google Docs is still superior to Office's Web-based version, but that's far less important than it used to be. There's no need to go with a Web-based office suite if a superior suite is available as native apps on all platforms, mobile or otherwise. And Office's collaboration capabilities are quite considerable now. Of course, there's always the question of price. Google Docs is free. Microsoft Office isn't. But at $100 a year for up to five devices, or $70 a year for two, no one will be going broke paying for Microsoft Office. It's worth paying that relatively small price for a much better office suite. Google Docs won't die. It'll be around as second fiddle for a long time. But that's what it will always remain: a second fiddle to the better Microsoft Office."
  • Google acquired "Writely", a small company in Portola Valley that pioneered document editing in a browser. Writely was perhaps the first cloud computing editor to go beyond simple HTML, eventually crafting some really cool CSS-JavaScript-JSON document layout and editing methods. But it can't edit native MSOffice documents. It converts them. There are more than a few problems with the Google Docs approach to editing advanced "compound" documents, but two stick out and are certain to give pause to anyone making the great transition from local workgroup computing to highly mobile, always-connected cloud computing. The first problem, certain to become a show stopper, is that Google converts documents to their native on-line format for editing and collaboration. And then they convert back. To many this isn't a problem. But if the document is part of a workflow or business process, conversion is a killer. There is an old saw affectionately known as "Reuters Law", dating back to the ODF-OXML document wars, that emphatically states: "Conversion breaks documents." The breakage includes both the visual layout of the document and the "compound" aspects and data connections that are internal to the document. Think of it this way: a business document that is part of a legacy Windows Workgroup workflow is opened up in gDocs. Google converts the document for editing purposes. The data and the workflow internals that bind the document to the local business system are broken on conversion. The look of the document is also visually shredded as the gDocs layout engine is applied. For all practical purposes, no matter what magic editing and collaboration value is added, a broken document means a broken business process. Let me say that again, with the emphasis of having witnessed this first hand during the year-long ODF transition trials the Commonwealth of Massachusetts conducted in 2005 and 2006: the business process broke every time a conversion was conducted "on a busines
Paul Merrell

ISPs say the "massive cost" of Snooper's Charter will push up UK broadband bills | Ars ... - 0 views

  • How much extra will you have to pay for the privilege of being spied on?
  • UK ISPs have warned MPs that the costs of implementing the Investigatory Powers Bill (aka the Snooper's Charter) will be much greater than the £175 million the UK government has allotted for the task, and that broadband bills will need to rise as a result. Representatives from ISPs and software companies told the House of Commons Science and Technology Committee that the legislation greatly underestimates the "sheer quantity" of data generated by Internet users these days. They also pointed out that distinguishing content from metadata is a far harder task than the government seems to assume. Matthew Hare, the chief executive of ISP Gigaclear, said with "a typical 1 gigabit connection to someone's home, over 50 terabytes of data per year [are] passing over it. If you say that a proportion of that is going to be the communications data—the record of who you communicate with, when you communicate or what you communicate—there would be the most massive and enormous amount of data that in future an access provider would be expected to keep. The indiscriminate collection of mass data across effectively every user of the Internet in this country is going to have a massive cost."
  • Moreover, the larger the cache of stored data, the more worthwhile it will be for criminals and state-backed actors to gain access and download that highly-revealing personal information for fraud and blackmail. John Shaw, the vice president of product management at British security firm Sophos, told the MPs: "There would be a huge amount of very sensitive personal data that could be used by bad guys."
  • ...2 more annotations...
  • The ISPs also challenged the government's breezy assumption that separating the data from the (equally revealing) metadata would be simple, not least because an Internet connection is typically being used for multiple services simultaneously, with data packets mixed together in a completely contingent way. Hare described a typical usage scenario for a teenager on their computer at home, where they are playing a game communicating with their friends using Steam; they are broadcasting the game using Twitch; and they may also be making a voice call at the same time too. "All those applications are running simultaneously," Hare said. "They are different applications using different servers with different services and different protocols. They are all running concurrently on that one machine." Even accessing a Web page is much more complicated than the government seems to believe, Hare pointed out. "As a webpage is loading, you will see that that webpage is made up of tens, or many tens, of individual sessions that have been created across the Internet just to load a single webpage. Bluntly, if you want to find out what someone is doing you need to be tracking all of that data all the time."
  • Hare raised another major issue. "If I was a software business ... I would be very worried that my customers would not buy my software any more if it had anything to do with security at all. I would be worried that a backdoor was built into the software by the [Investigatory Powers] Bill that would allow the UK government to find out what information was on that system at any point they wanted in the future." As Ars reported last week, the ability to demand that backdoors are added to systems, and a legal requirement not to reveal that fact under any circumstances, are two of the most contentious aspects of the new Investigatory Powers Bill. The latest comments from industry experts add to concerns that the latest version of the Snooper's Charter would inflict great harm on civil liberties in the UK, and also make security research well-nigh impossible here. To those fears can now be added undermining the UK software industry, as well as forcing the UK public to pay for the privilege of having their ISP carry out suspicionless surveillance.
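Hare's figure of 50 terabytes per year over a 1 gigabit connection is easy to sanity-check, and the arithmetic supports his point: that volume corresponds to a tiny average utilization of the line, so faster connections leave enormous headroom for the data volumes to keep growing. A quick sketch (the 50 TB and 1 Gbit/s figures come from the quote above; the rest is plain unit conversion):

```python
# Sanity check: what sustained average throughput does 50 TB/year imply
# on a 1 Gbit/s line? (Figures from the quote above; decimal units.)
SECONDS_PER_YEAR = 365 * 24 * 3600      # 31,536,000 seconds
volume_bytes = 50e12                    # 50 TB per year

avg_bytes_per_sec = volume_bytes / SECONDS_PER_YEAR  # ~1.59 MB/s
avg_mbit_per_sec = avg_bytes_per_sec * 8 / 1e6       # ~12.7 Mbit/s

line_rate_mbit = 1_000                               # 1 Gbit/s
utilization = avg_mbit_per_sec / line_rate_mbit      # ~1.3%

print(f"~{avg_mbit_per_sec:.1f} Mbit/s average, "
      f"~{utilization:.1%} of line rate")
```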