# Using Java-Written MapReduce Jobs in Hive

```sql
FROM (
  FROM src
  MAP value, key
  USING 'java -cp hive-contrib-0.9.0.jar org.apache.hadoop.hive.contrib.mr.example.IdentityMapper'
  AS k, v
  CLUSTER BY k
) map_output
REDUCE k, v
USING 'java -cp hive-contrib-0.9.0.jar org.apache.hadoop.hive.contrib.mr.example.WordCountReduce'
AS k, v;
```
The map task and reduce task invoke the map() and reduce() methods of GenericMR, respectively.
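GenericMR is a Java API, but the contract between Hive and the child process is plain text: rows arrive on stdin as tab-separated fields (grouped by key after CLUSTER BY), and results are written back as tab-separated rows. As an illustration of that streaming contract only, not the actual hive-contrib Java class, the behavior of a word-count reducer could be sketched in Python like this:

```python
import sys
from itertools import groupby

def word_count_reduce(lines):
    # Input: 'word<TAB>count' lines already grouped by word (thanks to CLUSTER BY k).
    # Output: one 'word<TAB>total' line per distinct word.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__":
    for row in word_count_reduce(sys.stdin):
        print(row)
```

Because the protocol is just tab-separated text over stdin/stdout, a script like this could be plugged into the same query via TRANSFORM/USING in place of the Java command.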
# Computing Cogroups
To JOIN multiple datasets and then process the result with TRANSFORM, you can combine UNION ALL with CLUSTER BY to implement the common COGROUP operation. (Pig provides COGROUP as a native operation.)
```sql
FROM (
  FROM (
    FROM order_log ol
    -- User id, order id, and timestamp:
    SELECT ol.userid AS uid, ol.orderid AS id, ol.ts AS ts
    UNION ALL
    FROM clicks_log cl
    SELECT cl.userid AS uid, cl.id AS id, cl.ts AS ts
  ) union_msgs
  SELECT union_msgs.uid, union_msgs.id, union_msgs.ts
  CLUSTER BY union_msgs.uid, union_msgs.ts
) map
INSERT OVERWRITE TABLE log_analysis
SELECT TRANSFORM(map.uid, map.id, map.ts) USING 'reduce_script'
AS (uid, id, ...);
```
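The query leaves 'reduce_script' unspecified, and its output columns are elided (`AS (uid, id, ...)`). As a hypothetical sketch, such a script receives tab-separated (uid, id, ts) rows on stdin, already grouped by uid and ordered by ts because of the CLUSTER BY clause, so it can aggregate each user's merged order and click events in a single pass; the output columns below are illustrative:

```python
import sys
from itertools import groupby

def reduce_script(lines):
    # Rows arrive as 'uid<TAB>id<TAB>ts', grouped by uid and sorted by ts
    # (guaranteed by CLUSTER BY union_msgs.uid, union_msgs.ts).
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for uid, group in groupby(rows, key=lambda r: r[0]):
        events = list(group)
        # Emit one row per user: event count plus first and last timestamp.
        yield f"{uid}\t{len(events)}\t{events[0][2]}\t{events[-1][2]}"

if __name__ == "__main__":
    for row in reduce_script(sys.stdin):
        print(row)
```

This is exactly the cogroup pattern: events from both source tables land in the same per-uid group, so the script sees each user's orders and clicks together.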
# File and Record Formats
Hive distinguishes file formats from record formats: the file format determines how records, as byte streams, are stored in a file, while the record format determines how the bytes within a record are encoded into columns. Each file format choice is paired with a corresponding record format; Hive's default is a text file format with delimited records.
## Specifying the File Storage Format
The relevant syntax consists of the STORED AS SEQUENCEFILE, ROW FORMAT DELIMITED, SERDE, INPUTFORMAT, and OUTPUTFORMAT clauses. For example, STORED AS SEQUENCEFILE is equivalent to specifying INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' and OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'.
## File Formats
### SequenceFile
Specified with STORED AS SEQUENCEFILE, a SequenceFile is a binary file containing key-value pairs. When Hive converts a query into MapReduce jobs, it chooses the appropriate key-value pairs to use for each record.
### Avro

An Avro-backed table uses the Avro SerDe together with the Avro container input and output formats; the schema can be supplied inline through the avro.schema.literal table property:

```sql
CREATE TABLE doctors
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "doctors",
  "type": "record",
  "fields": [
    { "name": "number",     "type": "int",    "doc": "Order of playing the role" },
    { "name": "first_name", "type": "string", "doc": "first name of actor playing role" },
    { "name": "last_name",  "type": "string", "doc": "last name of actor playing role" }
  ]
}');
```
Running `DESCRIBE doctors` shows the columns derived from the Avro schema.
### Defining the Schema from a URL
```sql
CREATE TABLE doctors
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///test.schema');
```