VirgoLab: Group items tagged "data"

Roger Chen

Data Randomization - 0 views

  •  
    Attacks that exploit memory errors are still a serious problem. We present data randomization, a new technique that provides probabilistic protection against these attacks by xoring data with random masks. Data randomization uses static analysis to partition instruction operands into equivalence classes: it places two operands in the same class if they may refer to the same object in an execution that does not violate memory safety. Then it assigns a random mask to each class and it generates code instrumented to xor data read from or written to memory with the mask of the memory operand's class. Therefore, attacks that violate the results of the static analysis have unpredictable results. We implemented a data randomization prototype that compiles programs without modifications and can prevent many attacks with low overhead. Our prototype prevents all the attacks in our benchmarks while introducing an average runtime overhead of 11% (0% to 27%) and an average space overhead below 1%.
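For intuition, here is a minimal Python sketch of the xor-masking idea; the real technique is applied by a compiler that instruments loads and stores, and the class and method names below are purely illustrative, not from the paper.

```python
# Illustrative sketch of data randomization's xor masking: a tiny simulated
# "memory" applies a per-equivalence-class random mask on every access.
import os

class RandomizedMemory:
    def __init__(self, size, num_classes):
        self.mem = bytearray(size)
        # One random byte mask per equivalence class, chosen at load time.
        self.masks = [os.urandom(1)[0] for _ in range(num_classes)]

    def store(self, addr, value, cls):
        # Instrumented write: xor the value with its class mask before storing.
        self.mem[addr] = value ^ self.masks[cls]

    def load(self, addr, cls):
        # Instrumented read: xor again with the same mask to recover the value.
        return self.mem[addr] ^ self.masks[cls]

mem = RandomizedMemory(size=16, num_classes=2)
mem.store(0, 0x41, cls=0)
print(hex(mem.load(0, cls=0)))  # 0x41: access through the correct class
print(hex(mem.load(0, cls=1)))  # unpredictable: wrong class, wrong mask
```

An access that the static analysis did not predict ends up using the wrong mask, so the attacker reads or writes garbage rather than the intended value.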
Roger Chen

Data mining is not just a data recovery tool | Styx online - 0 views

  • Data Mining is a process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using statistical, data analysis and mathematical techniques
  • Data mining is the crucial process that helps companies better comprehend their customers. Data mining can be defined as ‘the nontrivial extraction of implicit, previously unknown, and potentially useful information from data’ and also as ‘the science of extracting useful information from large sets or databases’.
Roger Chen

The End Of The Scientific Method… Wha….? « Life as a Physicist - 0 views

  • His basic thesis is that when you have so much data you can map out every connection, every correlation, then the data becomes the model. No need to derive or understand what is actually happening — you have so much data that you can already make all the predictions that a model would let you do in the first place. In short — you no longer need to develop a theory or hypothesis - just map the data!
  • First, in order for this to work you need to have millions and millions and millions of data points. You need, basically, every single outcome possible, with all possible other factors. Huge amounts of data. That does not apply to all branches of science.
  • The second problem with this approach is you will never discover anything new. The problem with new things is there is no data on them!
  • ...3 more annotations...
  • Correlations are a way of catching a scientist’s attention, but the models and mechanisms that explain them are how we make the predictions that not only advance science, but generate practical applications. One only needs to look at a promising field that lacks a strong theoretical foundation—high-temperature superconductivity springs to mind—to see how badly the lack of a theory can impact progress
  • Anderson is right — we are entering a new age where the ability to mine these large amounts of data is going to open up whole new levels of understanding
  • This is a new tool, and it will open up all sorts of doors for us. But the end of the scientific method? No — because that implies an end of discovery. An end of new things.
Roger Chen

Analysis: data mining doesn't work for spotting terrorists - 0 views

  • Automated identification of terrorists through data mining (or any other known methodology) is neither feasible as an objective nor desirable as a goal of technology development efforts.
  • criminal prosecutors and judges are concerned with determining the guilt or innocence of a suspect in the wake of an already-committed crime; counter-terror officials are concerned with preventing crimes from occurring by identifying suspects before they've done anything wrong.
  • The problem: preventing a crime by someone with no criminal record
  • ...3 more annotations...
  • In fact, most terrorists have no criminal record of any kind that could bring them to the attention of authorities or work against them in court.
  • As the NRC report points out, not only is the training data lacking, but the input data that you'd actually be mining has been purposely corrupted by the terrorists themselves.
  • So this application of data mining bumps up against the classic GIGO (garbage in, garbage out) problem in computing, with the terrorists deliberately feeding the system garbage.
Roger Chen

KNIME - Konstanz Information Miner - 0 views

shared by Roger Chen on 01 Aug 08
  •  
    KNIME, pronounced [naim], is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.
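KNIME itself is a Java-based graphical tool, so the snippet below is not KNIME code; it is only a minimal Python sketch, under invented names, of the underlying idea of a data flow as a chain of nodes in which any prefix can be executed and inspected on its own.

```python
# Not KNIME code: a minimal illustration of a pipeline as a chain of nodes,
# where each node consumes the previous node's output and any prefix of the
# flow can be run and inspected separately.
def load(_):
    return [1, 5, 3, 8, 2]

def filter_large(rows):
    return [r for r in rows if r > 2]

def summarize(rows):
    return {"count": len(rows), "mean": sum(rows) / len(rows)}

pipeline = [load, filter_large, summarize]

def run(pipeline, upto=None, data=None):
    # Execute the flow up to a chosen node and return the intermediate result.
    for node in pipeline[:upto]:
        data = node(data)
    return data

print(run(pipeline, upto=2))  # inspect the filtered rows: [5, 3, 8]
print(run(pipeline))          # run the full flow
```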
Roger Chen

The End of Theory: The Data Deluge Makes the Scientific Method Obsolete - 0 views

  • Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database.
  • Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough.
  • The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
  • ...6 more annotations...
  • Peter Norvig, Google's research director, offered an update to George Box's maxim: "All models are wrong, and increasingly you can succeed without them."
  • Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
    • Roger Chen
       
      That's what Chris Anderson thinks is old-school.
  • But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete.
    • Roger Chen
       
      Come to that conclusion? I don't think so.
  • There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
  • What can science learn from Google?
  • This kind of thinking is poised to go mainstream.
    • Roger Chen
       
      ???
  •  
    "All models are wrong, and increasing you can succeed without them."
Roger Chen

Lorcan Dempsey's weblog: Recommendation and Ranganathan - 0 views

  • Now, typically library catalogs use traditional information retrieval techniques over professionally produced metadata. This is not a lot of data to play with! We have just begun to see interesting things being done with the other types of data as libraries explore the use of transactional data for recommendations and look to incorporate contributed data.
  • Google, Amazon and other sites license professionally produced metadata. But in different ways they also use the other types of data.
  • Suggestion, or recommendation, is becoming increasingly a part of our everyday web experience, and improving the quality of suggestion has become an important goal for many services. Clearly, there are commercial interests riding on this.
  • ...2 more annotations...
  • "The 20th century was about sorting out supply," Potter says. "The 21st is going to be about sorting out demand." The Internet makes everything available, but mere availability is meaningless if the products remain unknown to potential buyers.
  • When I get good recommendations, I spend my time and money differently. Even better recommendations will dramatically increase the value of that time and money.
Roger Chen

Paper: MapReduce: Simplified Data Processing on Large Clusters | High Scalability - 0 views

  • Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.
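As a reminder of the programming model those statistics refer to, here is the canonical word-count example from the paper expressed as plain Python map and reduce functions (a sketch of the model only, not Google's C++ API or its distributed runtime).

```python
# Word count in the MapReduce style: map emits (key, value) pairs,
# the framework groups values by key, and reduce combines each group.
from collections import defaultdict

def map_fn(document):
    # Emit (word, 1) for every word in one input document.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # Combine all counts emitted for one key.
    return word, sum(counts)

def map_reduce(documents):
    grouped = defaultdict(list)
    for doc in documents:                  # map phase
        for key, value in map_fn(doc):
            grouped[key].append(value)     # shuffle/group by key
    return dict(reduce_fn(k, v) for k, v in grouped.items())  # reduce phase

print(map_reduce(["big data", "big clusters", "data data"]))
# {'big': 2, 'data': 3, 'clusters': 1}
```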
Roger Chen

Semantic Library » Zotero and semantic principles - 0 views

  • Our Zotero Server, connected to the client, will enable all kinds of new collaboration opportunities and data-mining of aggregated collections. We also plan to provide hooks into high-performance computing projects like the SEASR text-mining project based at UIUC
  • Data mining is becoming a major trend in eResearch as computing power increases and more and more researchers have direct access to open data sets. In the future, we won’t just be citing articles, figures, images, movies, and books, we’ll also be citing specific data points.
Roger Chen

Datawocky: Are Machine-Learned Models Prone to Catastrophic Errors? - 0 views

  • Taleb makes a convincing case that most real-world phenomena we care about actually inhabit Extremistan rather than Mediocristan. In these cases, you can make quite a fool of yourself by assuming that the future looks like the past.
  • The current generation of machine learning algorithms can work well in Mediocristan but not in Extremistan.
  • It has long been known that Google's search algorithm actually works at two levels: an offline phase that extracts "signals" (such as PageRank) from a massive web crawl and usage data, and an online phase that responds to user queries. The offline computations are time-consuming because they analyze massive amounts of data, and because the signals are extracted without reference to any particular query, they are necessarily query-independent; you can think of them as tags on the documents in the index. There are about 200 such signals. In the online phase, a subset of documents is identified based on the presence of the user's keywords, and these documents are then ranked by a very fast algorithm that combines the 200 signals in-memory using a proprietary formula. [A toy sketch of these two phases appears after this item's annotations.]
  • ...2 more annotations...
  • This raises a fundamental philosophical question. If Google is unwilling to trust machine-learned models for ranking search results, can we ever trust such models for more critical things, such as flying an airplane, driving a car, or algorithmic stock market trading? All machine learning models assume that the situations they encounter in use will be similar to their training data. This, however, exposes them to the well-known problem of induction in logic.
  • My hunch is that humans have evolved to use decision-making methods that are less likely to blow up on unforeseen events (although not always, as the mortgage crisis shows).
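The two-phase design described in the annotation above can be pictured with a toy Python sketch; the signal names, weights, and documents here are invented for illustration, since the real signals and ranking formula are proprietary.

```python
# Toy illustration of query-independent signals computed offline plus a fast
# online ranking that combines them. All names and numbers are made up.

# Offline phase: precompute per-document signals (e.g. a link-based score)
# and store them alongside the index.
documents = {
    "doc1": {"text": "data mining tutorial",
             "signals": {"link_score": 0.9, "freshness": 0.2}},
    "doc2": {"text": "mining equipment catalog",
             "signals": {"link_score": 0.4, "freshness": 0.8}},
}

WEIGHTS = {"link_score": 0.7, "freshness": 0.3}  # stand-in for the proprietary formula

def rank(query):
    # Online phase: select documents containing the query terms, then rank
    # them by combining their precomputed signals in memory.
    terms = query.lower().split()
    candidates = [name for name, doc in documents.items()
                  if all(t in doc["text"] for t in terms)]
    score = lambda name: sum(w * documents[name]["signals"][s]
                             for s, w in WEIGHTS.items())
    return sorted(candidates, key=score, reverse=True)

print(rank("mining"))  # ['doc1', 'doc2']
```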
Roger Chen

Data & Knowledge Engineering (0169-023X) - ACM Guide to Computing Literature - 0 views

  •  
    Data & Knowledge Engineering (0169-023X)
Roger Chen

Current Approaches to Data Mining Blogs - ESIWiki - 0 views

  •  
    Summary of the current direction of blog research using data mining.
Roger Chen

Data Mining Source Code Newsletter - Blogs - 0 views

  •  
    Download Free Data Mining Source Code In C/C++, C#, Visual Basic, Visual Basic.NET, Java, and other programming languages
Roger Chen

Many Eyes - 0 views

  •  
    Many Eyes is an IBM site with a goal of making data visualization algorithms and data sets widely available. It is a fantastic place to spend a few hours.
Roger Chen

Data Mining in Course Management Systems: A Moodle Case Study - 0 views

  •  
    Data mining in course management systems: Moodle case study and tutorial, by Cristobal Romero, Sebastian Ventura, and Enrique Garcia. Computers & Education, In Press (2007), Corrected Proof.
Roger Chen

Why the cloud cannot obscure the scientific method - 0 views

  • Overall, the foundation of the argument for a replacement for science is correct: the data cloud is changing science, and leaving us in many cases with a Google-level understanding of the connections between things. Where Anderson stumbles is in his conclusions about what this means for science. The fact is that we couldn't have even reached this Google-level understanding without the models and mechanisms that he suggests are doomed to irrelevance.
  • Anderson appears to take the position that the new research part of the equation has become superfluous; simply having a good algorithm that recognizes the correlation is enough.
  • Correlations are a way of catching a scientist's attention, but the models and mechanisms that explain them are how we make the predictions that not only advance science, but generate practical applications.
  • ...1 more annotation...
  • without the testable predictions made by the theory, we'll never be able to tell how precisely it is wrong
  •  
    This article is a response to Chris Anderson's article "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" - http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
Roger Chen

ACM SIGKDD - 0 views

  •  
    ACM Special Interest Group on Knowledge Discovery and Data Mining