Skip to main content

Home/ Open Web/ Group items tagged Hadoop_Hive

Rss Feed Group items tagged

Gary Edwards

Matt On Stuff: Hadoop For The Rest Of Us - 0 views

  •  
    Excellent Hadoop/Hive explanation.  Hat tip to Matt Asay for the link.  I eft a comment on Matt's blog questioning the consequences of the Oracle vs. Google Android lawsuit, and the possible enforcement of the Java API copyright claim against Hadoop/Hive.  Based on this explanation of Hadoop/Hive, i'm wondering if Oracle is making a move to claim the entire era of Big Data Cloud Computing?  To understand why, it's first necessary to read Matt the Hadoople's explanation.   kill shot excerpt: "You've built your Hadoop job, and have successfully processed the data. You've generated some structured output, and that resides on HDFS. Naturally you want to run some reports, so you load your data into a MySQL or an Oracle database. Problem is, the data is large. In fact it's so large that when you try to run a query against the table you've just created, your database begins to cry. If you listen to its sobs, you'll probably hear "I was built to process Megabytes, maybe Gigabytes of data. Not Terabytes. Not Perabytes. That's not my job. I was built in the 80's and 90's, back when floppy drives were used. Just leave me alone". "This is where Hive comes to the rescue. Hive lets you run an SQL statement against structured data stored on HDFS. When you issue an SQL query, it parses it, and translates it into a Java Map/Reduce job, which is then executed on your data. Although Hive does some optimizations, in general it just goes record by record against all your data. This means that it's relatively slow - a typical Hive query takes 5 or 10 minutes to complete, depending on how much data you have. However, that's what makes it effective. Unlike a relational database, you don't waste time on query optimization, adding indexes, etc. Instead, what keeps the processing time down is the fact that the query is run on all machines in your Hadoop cluster, and the scalability is taken care of for you." "Hive is extremely useful in data-warehousing kind of scenarios. You would
1 - 1 of 1
Showing 20 items per page