It's usually impossible to get "better" data, and you have no alternative but
to work with the data at hand.
The most meaningful definition I've heard: "big data" is when the size of
the data itself becomes part of the problem.
Precision has an allure, but in most data-driven applications outside of
finance, that allure is deceptive. Most data analysis is comparative: what
matters is whether one quantity is meaningfully larger, or growing faster,
than another, not the third decimal place.
Storing data is only part of building a data platform, though. Data is only
useful if you can do something with it, and enormous datasets present
computational problems.
Hadoop has been instrumental in enabling "agile" data analysis. In software
development, "agile practices" are associated with faster product cycles, closer
interaction between developers and consumers, and testing.
Faster computations make it easier to test different assumptions, different
datasets, and different algorithms.
It's easier to consult with clients to figure out whether you're asking the right
questions, and it's possible to pursue intriguing possibilities that you'd
otherwise have to drop for lack of time.
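To make that iteration loop concrete, here is a minimal sketch of a Hadoop
Streaming job written in Python; the streaming interface, the event-counting
task, and the script names are my own illustration, not something described
above. The mapper emits a key and a count for each input record, and the
reducer sums the counts per key.

```python
#!/usr/bin/env python
# mapper.py -- emit one (key, 1) pair per input record.
# Hadoop Streaming feeds records on stdin and collects
# tab-separated key/value pairs from stdout.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print(f"{fields[0]}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts for each key.
# Hadoop sorts mapper output by key before it reaches the reducer,
# so equal keys arrive on consecutive lines.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print(f"{current_key}\t{count}")
```

Because both halves are ordinary scripts, changing the question usually means
editing a few lines and resubmitting the job with the hadoop-streaming jar's
-mapper and -reducer options, which is much of what makes this style of
analysis feel agile.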
Machine learning is another essential tool for the data scientist.
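As one small illustration of what that tool looks like in practice, here is a
self-contained sketch using scikit-learn and its bundled iris dataset; the
library, the model, and the dataset are my own choices for the example, not
anything named here.

```python
# Machine-learning sketch: fit a classifier on part of the data
# and measure how well it generalizes to the rest.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(f"held-out accuracy: {accuracy_score(y_test, predictions):.2f}")
```

The pattern, fit on one slice of the data and evaluate on another, is the core
discipline regardless of which model you choose.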
According to Mike Driscoll (@dataspora), statistics is the "grammar
of data science." It is crucial to "making data speak coherently."
Data science isn't just about the existence of data, or making guesses about
what that data might mean; it's about testing hypotheses and making sure that
the conclusions you're drawing from the data are valid.
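As a small illustration of that point, here is a sketch of a two-sample
t-test with made-up numbers; the library (SciPy) and the scenario are
assumptions for the example, not part of the text above. Instead of eyeballing
two averages, you ask whether the difference between them is larger than
chance alone would explain.

```python
# Hypothesis-testing sketch: is group B genuinely lower than group A,
# or is the difference plausibly just noise?
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)  # e.g. control
group_b = rng.normal(loc=9.6, scale=2.0, size=200)   # e.g. variant

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"mean A = {group_a.mean():.2f}, mean B = {group_b.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value says the observed difference would be unlikely if the
# two means were really equal; a large one says the data don't support
# a firm conclusion either way.
```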
The problem with most data analysis algorithms is that they generate a set of
numbers. To understand what the numbers mean, the stories they are really
telling, you need to generate a graph.
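A minimal sketch of that step, assuming matplotlib and some invented monthly
figures (both my choices for illustration): the same numbers that are hard to
read as a list become an obvious trend once plotted.

```python
# Visualization sketch: turn a column of numbers into a picture.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 160, 175, 210]  # invented figures for illustration

plt.plot(months, sales, marker="o")
plt.title("Monthly sales")
plt.ylabel("Units sold")
plt.tight_layout()
plt.show()
```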
Visualization is crucial to each stage of the data scientist's work. It is
also frequently the first step in analysis.
Casey Reas' and Ben Fry's Processing is the
state of the art, particularly if you need to create animations that show how
things change over time.
Making data tell its story isn't just a matter of presenting results; it
involves making connections, then going back to other data sources to verify
them.
Physicists have strong mathematical backgrounds and computing skills, and they
come from a discipline in which survival depends on getting the most from the data.
They have to think about the big picture, the big problem. When you've just
spent a lot of grant money generating data, you can't just throw the data out if
it isn't as clean as you'd like. You have to make it tell its story. You need
some creativity when the story the data is telling isn't the one you expected.
It was an agile, flexible process that built toward its goal incrementally,
rather than tackling a huge mountain of data all at once.
We're entering the era of products that are built on data.
We don't yet know what those products are, but we do know that the winners will
be the people, and the companies, that find those products.
They can think outside the box to come up with new ways to view the problem, or
to work with very broadly defined problems: "here's a lot of data, what can you
make from it?"