VirgoLab: Group items tagged "data"

Roger Chen

Data Randomization - 0 views

  •  
    Attacks that exploit memory errors are still a serious problem. We present data randomization, a new technique that provides probabilistic protection against these attacks by xoring data with random masks. Data randomization uses static analysis to partition instruction operands into equivalence classes: it places two operands in the same class if they may refer to the same object in an execution that does not violate memory safety. Then it assigns a random mask to each class and it generates code instrumented to xor data read from or written to memory with the mask of the memory operand's class. Therefore, attacks that violate the results of the static analysis have unpredictable results. We implemented a data randomization prototype that compiles programs without modifications and can prevent many attacks with low overhead. Our prototype prevents all the attacks in our benchmarks while introducing an average runtime overhead of 11% (0% to 27%) and an average space overhead below 1%.
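For intuition, here is a minimal Python sketch of the xor-masking idea; the real technique is applied by a compiler that instruments loads and stores, and the class and method names below are purely illustrative, not from the paper.

```python
# Illustrative sketch of data randomization's xor masking: a tiny simulated
# "memory" applies a per-equivalence-class random mask on every access.
import os

class RandomizedMemory:
    def __init__(self, size, num_classes):
        self.mem = bytearray(size)
        # One random byte mask per equivalence class, chosen at load time.
        self.masks = [os.urandom(1)[0] for _ in range(num_classes)]

    def store(self, addr, value, cls):
        # Instrumented write: xor the value with its class mask before storing.
        self.mem[addr] = value ^ self.masks[cls]

    def load(self, addr, cls):
        # Instrumented read: xor again with the same mask to recover the value.
        return self.mem[addr] ^ self.masks[cls]

mem = RandomizedMemory(size=16, num_classes=2)
mem.store(0, 0x41, cls=0)
print(hex(mem.load(0, cls=0)))  # 0x41: access through the correct class
print(hex(mem.load(0, cls=1)))  # unpredictable: wrong class, wrong mask
```

An access that the static analysis did not predict ends up using the wrong mask, so the attacker reads or writes garbage rather than the intended value.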
Roger Chen

Data mining is not just a data recovery tool | Styx online - 0 views

  • Data Mining is a process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using statistical, data analysis and mathematical techniques
  • Data mining is the crucial process that helps companies better comprehend their customers. Data mining can be defined as ‘the nontrivial extraction of implicit, previously unknown, and potentially useful information from data’ and also as ‘the science of extracting useful information from large sets or databases’.
Roger Chen

The End Of The Scientific Method… Wha….? « Life as a Physicist - 0 views

  • His basic thesis is that when you have so much data you can map out every connection, every correlation, then the data becomes the model. No need to derive or understand what is actually happening — you have so much data that you can already make all the predictions that a model would let you do in the first place. In short — you no longer need to develop a theory or hypothesis - just map the data!
  • First, in order for this to work you need to have millions and millions and millions of data points. You need, basically, every single outcome possible, with all possible other factors. Huge amounts of data. That does not apply to all branches of science.
  • The second problem with this approach is you will never discover anything new. The problem with new things is there is no data on them!
  • ...3 more annotations...
  • Correlations are a way of catching a scientist’s attention, but the models and mechanisms that explain them are how we make the predictions that not only advance science, but generate practical applications. One only needs to look at a promising field that lacks a strong theoretical foundation—high-temperature superconductivity springs to mind—to see how badly the lack of a theory can impact progress
  • Anderson is right — we are entering a new age where the ability to mine these large amounts of data is going to open up whole new levels of understanding
  • This is a new tool, and it will open up all sorts of doors for us. But the end of the scientific method? No — because that implies an end of discovery. An end of new things.
Roger Chen

Analysis: data mining doesn't work for spotting terrorists - 0 views

  • Automated identification of terrorists through data mining (or any other known methodology) is neither feasible as an objective nor desirable as a goal of technology development efforts.
  • criminal prosecutors and judges are concerned with determining the guilt or innocence of a suspect in the wake of an already-committed crime; counter-terror officials are concerned with preventing crimes from occurring by identifying suspects before they've done anything wrong.
  • The problem: preventing a crime by someone with no criminal record
  • ...3 more annotations...
  • In fact, most terrorists have no criminal record of any kind that could bring them to the attention of authorities or work against them in court.
  • As the NRC report points out, not only is the training data lacking, but the input data that you'd actually be mining has been purposely corrupted by the terrorists themselves.
  • So this application of data mining bumps up against the classic GIGO (garbage in, garbage out) problem in computing, with the terrorists deliberately feeding the system garbage.
Roger Chen

KNIME - Konstanz Information Miner - 0 views

shared by Roger Chen on 01 Aug 08
  •  
    KNIME, pronounced [naim], is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.
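KNIME itself is a Java-based graphical tool, so the snippet below is not KNIME code; it is only a minimal Python sketch, under invented names, of the underlying idea of a data flow as a chain of nodes in which any prefix can be executed and inspected on its own.

```python
# Not KNIME code: a minimal illustration of a pipeline as a chain of nodes,
# where each node consumes the previous node's output and any prefix of the
# flow can be run and inspected separately.
def load(_):
    return [1, 5, 3, 8, 2]

def filter_large(rows):
    return [r for r in rows if r > 2]

def summarize(rows):
    return {"count": len(rows), "mean": sum(rows) / len(rows)}

pipeline = [load, filter_large, summarize]

def run(pipeline, upto=None, data=None):
    # Execute the flow up to a chosen node and return the intermediate result.
    for node in pipeline[:upto]:
        data = node(data)
    return data

print(run(pipeline, upto=2))  # inspect the filtered rows: [5, 3, 8]
print(run(pipeline))          # run the full flow
```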
Roger Chen

The End of Theory: The Data Deluge Makes the Scientific Method Obsolete - 0 views

  • Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database.
  • Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough.
  • The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
  • ...6 more annotations...
  • Peter Norvig, Google's research director, offered an update to George Box's maxim: "All models are wrong, and increasingly you can succeed without them."
  • Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
    • Roger Chen
       
      That's what Chris Anderson thinks is old-school.
  • But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete.
    • Roger Chen
       
      Come to that conclusion? I don't think so.
  • There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
  • What can science learn from Google?
  • This kind of thinking is poised to go mainstream.
    • Roger Chen
       
      ???
  •  
    "All models are wrong, and increasing you can succeed without them."
Roger Chen

Lorcan Dempsey's weblog: Recommendation and Ranganathan - 0 views

  • Now, typically library catalogs use traditional information retrieval techniques over professionally produced metadata. This is not a lot of data to play with! We have just begun to see interesting things being done with the other types of data as libraries explore the use of transactional data for recommendations and look to incorporate contributed data.
  • Google, Amazon and other sites license professionally produced metadata. But in different ways they also use the other types of data.
  • Suggestion, or recommendation, is becoming increasingly a part of our everyday web experience, and improving the quality of suggestion has become an important goal for many services. Clearly, there are commercial interests riding on this.
  • ...2 more annotations...
  • "The 20th century was about sorting out supply," Potter says. "The 21st is going to be about sorting out demand." The Internet makes everything available, but mere availability is meaningless if the products remain unknown to potential buyers.
  • When I get good recommendations, I spend my time and money differently. Even better recommendations will dramatically increase the value of that time and money.
Roger Chen

Paper: MapReduce: Simplified Data Processing on Large Clusters | High Scalability - 0 views

  • Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.
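As a reminder of the programming model those statistics refer to, here is the canonical word-count example from the paper expressed as plain Python map and reduce functions (a sketch of the model only, not Google's C++ API or its distributed runtime).

```python
# Word count in the MapReduce style: map emits (key, value) pairs,
# the framework groups values by key, and reduce combines each group.
from collections import defaultdict

def map_fn(document):
    # Emit (word, 1) for every word in one input document.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # Combine all counts emitted for one key.
    return word, sum(counts)

def map_reduce(documents):
    grouped = defaultdict(list)
    for doc in documents:                  # map phase
        for key, value in map_fn(doc):
            grouped[key].append(value)     # shuffle/group by key
    return dict(reduce_fn(k, v) for k, v in grouped.items())  # reduce phase

print(map_reduce(["big data", "big clusters", "data data"]))
# {'big': 2, 'data': 3, 'clusters': 1}
```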
Roger Chen

Semantic Library » Zotero and semantic principles - 0 views

  • Our Zotero Server, connected to the client, will enable all kinds of new collaboration opportunities and data-mining of aggregated collections. We also plan to provide hooks into high-performance computing projects like the SEASR text-mining project based at UIUC
  • Data mining is becoming a major trend in eResearch as computing power increases and more and more researchers have direct access to open data sets. In the future, we won’t just be citing articles, figures, images, movies, and books, we’ll also be citing specific data points.
Roger Chen

Datawocky: Are Machine-Learned Models Prone to Catastrophic Errors? - 0 views

  • Taleb makes a convincing case that most real-world phenomena we care about actually inhabit Extremistan rather than Mediocristan. In these cases, you can make quite a fool of yourself by assuming that the future looks like the past.
  • The current generation of machine learning algorithms can work well in Mediocristan but not in Extremistan.
  • It has long been known that Google's search algorithm actually works at two levels: an offline phase that extracts "signals" (such as PageRank) from a massive web crawl and usage data, and an online phase that responds to user queries. The offline computations are time-consuming because they analyze massive amounts of data, and because the signals are extracted without reference to any particular query, they are necessarily query-independent; you can think of them as tags on the documents in the index. There are about 200 such signals. In the online phase, a subset of documents is identified based on the presence of the user's keywords, and these documents are then ranked by a very fast algorithm that combines the 200 signals in-memory using a proprietary formula. [A toy sketch of these two phases appears after this item's annotations.]
  • ...2 more annotations...
  • This raises a fundamental philosophical question. If Google is unwilling to trust machine-learned models for ranking search results, can we ever trust such models for more critical things, such as flying an airplane, driving a car, or algorithmic stock market trading? All machine learning models assume that the situations they encounter in use will be similar to their training data. This, however, exposes them to the well-known problem of induction in logic.
  • My hunch is that humans have evolved to use decision-making methods that are less likely to blow up on unforeseen events (although not always, as the mortgage crisis shows).
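The two-phase design described in the annotation above can be pictured with a toy Python sketch; the signal names, weights, and documents here are invented for illustration, since the real signals and ranking formula are proprietary.

```python
# Toy illustration of query-independent signals computed offline plus a fast
# online ranking that combines them. All names and numbers are made up.

# Offline phase: precompute per-document signals (e.g. a link-based score)
# and store them alongside the index.
documents = {
    "doc1": {"text": "data mining tutorial",
             "signals": {"link_score": 0.9, "freshness": 0.2}},
    "doc2": {"text": "mining equipment catalog",
             "signals": {"link_score": 0.4, "freshness": 0.8}},
}

WEIGHTS = {"link_score": 0.7, "freshness": 0.3}  # stand-in for the proprietary formula

def rank(query):
    # Online phase: select documents containing the query terms, then rank
    # them by combining their precomputed signals in memory.
    terms = query.lower().split()
    candidates = [name for name, doc in documents.items()
                  if all(t in doc["text"] for t in terms)]
    score = lambda name: sum(w * documents[name]["signals"][s]
                             for s, w in WEIGHTS.items())
    return sorted(candidates, key=score, reverse=True)

print(rank("mining"))  # ['doc1', 'doc2']
```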
Roger Chen

Data & Knowledge Engineering (0169-023X) - ACM Guide to Computing Literature - 0 views

  •  
    Data & Knowledge Engineering (0169-023X)
Roger Chen

Current Approaches to Data Mining Blogs - ESIWiki - 0 views

  •  
    Summary of the current direction of blog research using data mining.
Roger Chen

Data Mining Source Code Newsletter - Blogs - 0 views

  •  
    Download Free Data Mining Source Code In C/C++, C#, Visual Basic, Visual Basic.NET, Java, and other programming languages
Roger Chen

Many Eyes - 0 views

  •  
    Many Eyes is an IBM site with a goal of making data visualization algorithms and data sets widely available. It is a fantastic place to spend a few hours.
Roger Chen

Data Mining in Course Management Systems: A Moodle Case Study - 0 views

  •  
    Data mining in course management systems: Moodle case study and tutorial, by Cristobal Romero, Sebastian Ventura, and Enrique Garcia. Computers & Education, In Press (2007), Corrected Proof.
Roger Chen

Why the cloud cannot obscure the scientific method - 0 views

  • Overall, the foundation of the argument for a replacement for science is correct: the data cloud is changing science, and leaving us in many cases with a Google-level understanding of the connections between things. Where Anderson stumbles is in his conclusions about what this means for science. The fact is that we couldn't have even reached this Google-level understanding without the models and mechanisms that he suggests are doomed to irrelevance.
  • Anderson appears to take the position that the new research part of the equation has become superfluous; simply having a good algorithm that recognizes the correlation is enough.
  • Correlations are a way of catching a scientist's attention, but the models and mechanisms that explain them are how we make the predictions that not only advance science, but generate practical applications.
  • ...1 more annotation...
  • without the testable predictions made by the theory, we'll never be able to tell how precisely it is wrong
  •  
    This article is a response to Chris Anderson's article "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" - http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
Roger Chen

ACM SIGKDD - 0 views

  •  
    ACM Special Interest Group on Knowledge Discovery and Data Mining