11.HCatalog

HCatalog

  • HCatalog是让Hive的元数据存储为基于Hadoop的其他工具所公用。它为MapReduce和Pig提供了连接器,用户可以从Hive数据仓库中读取和写入数据。

MapRedcuce

读数据

  • MR从InputFormat读取数据,然后分片成不同的map任务并行处理,公共一个RecordReader从输入数据源中读取记录转换成键值对来供map任务进行处理。

  • Hcatalog 提供HCatInputFormat类来供MapReduce用户从Hive的数据仓库中读取数据 。其允许用户只读需要的表分区和字段,并且提供一种列表格式来展示 。

  • 初始化HCatInputFormat时,首先要读取指定的表,通过创建InputJobInfo类,然后指定数据库、表和分区过滤条件就可以达到这个目的。

//数据库名称 
private final String databaseName; 
//表名 
private final String tableName; 
//表信息 
private HCatTableInfo tableInfo; 
//分区过滤条件 
private String filter; 
//全部分区 
transient private List<PartInfo> partitions; 
/** implementation specific job properties */ 
private Properties properties; 
  • 配置job读取hive字段

HCatSchema baseSchema = HCatBaseInputFormat.getOutputSchema(context); 
List<HCatFieldSchema> fields = new List<HCatFieldSchema>(2); 
fields.add(baseSchema.get("user")); 
fields.add(baseSchema.get("url")); 
HCatBaseInputFormat.setOutputSchema(job, new HCatSchema(fields)); 

写数据

OutputJobInfo属性

/** The db and table names. */ 
private final String databaseName; 
private final String tableName; 
/** The table info provided by user. */ 
private HCatTableInfo tableInfo; 
/** The output schema. This is given to us by user.  This wont contain any 
 * partition columns ,even if user has specified them. 
 * */ 
private HCatSchema outputSchema; 
/** The location of the partition being written */ 
private String location; 
/** The root location of custom dynamic partitions being written */ 
private String customDynamicRoot; 
/** The relative path of custom dynamic partitions being written */ 
private String customDynamicPath; 
//用户期望创建的分区名称 
/** The partition values to publish to, if used for output*/ 
private Map<String, String> partitionValues; 
private List<Integer> posOfPartCols; 
private List<Integer> posOfDynPartCols; 
private Properties properties; 
private int maxDynamicPartitions; 
/** List of keys for which values were not specified at write setup time, to be infered at write time */ 
private List<String> dynamicPartitioningKeys; 
private boolean harRequested; 

命令行

hcat使用

  • 进入HIVE_HOME/hcatalog/bin中

hcat -e "create table test(name string);" 

架构

HCatalog架构

  • HcatInputFormat通过和Hive元数据存储进行同学来获取要读取的表和分区的信息。包括获取表的模式,也包括获取每个分区的模式。

最后更新于