Impala: Attaching an External Partitioned Table to an HDFS Directory Structure


Tutorial Overview

This tutorial shows how to set up a directory tree in HDFS, put data files into the lowest-level subdirectories, and then use an Impala external table to query the data files from their original locations.

The tutorial uses a table of web log data, with separate subdirectories for the year, month, day, and host. For simplicity, we use a tiny amount of CSV data, loading the same data into each partition.

Procedure

First, we make an Impala partitioned table for CSV data, and look at the underlying HDFS directory structure so that we can recreate the same directory structure elsewhere in HDFS. The columns field1, field2, and field3 correspond to the contents of the CSV data files. The year, month, day, and host columns are all represented as subdirectories within the table structure, and are not part of the CSV files. We use STRING for each of these columns so that we can produce consistent subdirectory names, with leading zeros for a consistent length.

create database external_partitions;
use external_partitions;
create table logs (field1 string, field2 string, field3 string)
  partitioned by (year string, month string, day string, host string)
  row format delimited fields terminated by ',';
insert into logs partition (year="2013", month="07", day="28", host="host1") values ("foo","foo","foo");
insert into logs partition (year="2013", month="07", day="28", host="host2") values ("foo","foo","foo");
insert into logs partition (year="2013", month="07", day="29", host="host1") values ("foo","foo","foo");
insert into logs partition (year="2013", month="07", day="29", host="host2") values ("foo","foo","foo");
insert into logs partition (year="2013", month="08", day="01", host="host1") values ("foo","foo","foo");
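As a side note, the reason the partition key columns above are declared as STRING rather than INT is the zero-padded naming convention. The following Python sketch (not part of the tutorial) illustrates how formatting the values with leading zeros yields subdirectory names of consistent length, so that paths also sort lexically in date order:

```python
# Illustrative sketch only: build a partition subdirectory path using the
# same zero-padded string values as the INSERT statements above.
def partition_path(year: int, month: int, day: int, host: str) -> str:
    # Leading zeros keep every month/day component two characters wide.
    return f"year={year:04d}/month={month:02d}/day={day:02d}/host={host}"

print(partition_path(2013, 7, 28, "host1"))
# -> year=2013/month=07/day=28/host=host1
```

With INT partition columns, month 7 would produce `month=7`, and `month=10` would sort before it lexically; the STRING convention avoids that.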

Back in the Linux shell, we examine the HDFS directory structure. (Your Impala data directory might be in a different location; for historical reasons, it is sometimes under the HDFS path /user/hive/warehouse.) We use the hdfs dfs -ls command to examine the nested subdirectories corresponding to each partitioning column, with separate subdirectories at each level (with = in their names) representing the different values for each partitioning column. When we get to the lowest level of subdirectory, we use the hdfs dfs -cat command to examine the data file and see the CSV-formatted data produced by the INSERT statements in Impala.

$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db
Found 1 items
drwxrwxrwt   - impala hive          0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs
Found 1 items
drwxr-xr-x   - impala hive          0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs/year=2013
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013
Found 2 items
drwxr-xr-x   - impala hive          0 2013-08-07 12:23 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07
drwxr-xr-x   - impala hive          0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=08
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07
Found 2 items
drwxr-xr-x   - impala hive          0 2013-08-07 12:22 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28
drwxr-xr-x   - impala hive          0 2013-08-07 12:23 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=29
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28
Found 2 items
drwxr-xr-x   - impala hive          0 2013-08-07 12:21 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1
drwxr-xr-x   - impala hive          0 2013-08-07 12:22 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host2
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1
Found 1 items
-rw-r--r--   3 impala hive         12 2013-08-07 12:21 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1/3981726974111751120--8907184999369517436_822630111_data.0
$ hdfs dfs -cat /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/\
host=host1/3981726974111751120--8907184999369517436_822630111_data.0
foo,foo,foo
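The listing above shows that each partition key/value pair becomes a `key=value` path component. A small Python sketch (a hypothetical helper, not from the tutorial) shows how the partition column values can be recovered from such a path:

```python
# Hypothetical helper: recover partition column values from an HDFS path
# that uses the key=value naming convention shown in the listing above.
def parse_partitions(path: str) -> dict:
    parts = {}
    for component in path.split("/"):
        if "=" in component:
            key, _, value = component.partition("=")
            parts[key] = value
    return parts

p = parse_partitions(
    "/user/impala/warehouse/external_partitions.db/logs"
    "/year=2013/month=07/day=28/host=host1")
print(p)  # {'year': '2013', 'month': '07', 'day': '28', 'host': 'host1'}
```

This is the same convention Impala itself relies on to map directories to partition values.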

Still in the Linux shell, we use hdfs dfs -mkdir to create several data directories outside the HDFS directory tree that Impala controls (/user/impala/warehouse in this example; yours might be different). Depending on your configuration, you might need to log in as a user with permission to write into this HDFS directory tree; for example, the commands shown here were run while logged in as the hdfs user.

$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host1
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host2
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=29/host=host1
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=08/day=01/host=host1

We make a tiny CSV file, with values different from the INSERT statements used earlier, and put a copy within each subdirectory that we will use as an Impala partition.

$ cat >dummy_log_data
bar,baz,bletch
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host1
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host2
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=29/host=host1
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=08/day=01/host=host1
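The mkdir and put commands above repeat the same path pattern once per partition. When there are many partitions, it can be less error-prone to generate the commands from a list; the following Python sketch (illustrative only, using this tutorial's four partitions) shows the idea:

```python
# Illustrative sketch only: generate the repetitive hdfs commands above
# from a list of (year, month, day, host) tuples instead of typing each one.
partitions = [
    ("2013", "07", "28", "host1"),
    ("2013", "07", "28", "host2"),
    ("2013", "07", "29", "host1"),
    ("2013", "08", "01", "host1"),
]
base = "/user/impala/data/logs"
for year, month, day, host in partitions:
    path = f"{base}/year={year}/month={month}/day={day}/host={host}"
    # Emit one mkdir and one put command per partition directory.
    print(f"hdfs dfs -mkdir -p {path}")
    print(f"hdfs dfs -put dummy_log_data {path}")
```

The printed lines could be piped to a shell, or the same loop could call the commands directly with subprocess.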

Back in the impala-shell interpreter, we move the original Impala-managed table aside, and create a new external table with a LOCATION clause pointing to the directory under which we have set up all the partition subdirectories and data files.

use external_partitions;
alter table logs rename to logs_original;
create external table logs (field1 string, field2 string, field3 string)
  partitioned by (year string, month string, day string, host string)
  row format delimited fields terminated by ','
  location '/user/impala/data/logs';

Because partition subdirectories and data files come and go during the data lifecycle, you must identify each of the partitions through an ALTER TABLE statement before Impala recognizes the data files they contain.

alter table logs add partition (year="2013", month="07", day="28", host="host1");
alter table logs add partition (year="2013", month="07", day="28", host="host2");
alter table logs add partition (year="2013", month="07", day="29", host="host1");
alter table logs add partition (year="2013", month="08", day="01", host="host1");
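Typing one ALTER TABLE per partition by hand invites copy-paste mistakes in the partition values. As a hypothetical sketch (not part of the tutorial), the statements can be generated from the same partition list used when creating the directories:

```python
# Hypothetical sketch: generate the ALTER TABLE ... ADD PARTITION statements
# from partition specs, so each key="value" pair is built mechanically.
def add_partition_stmt(table: str, part: dict) -> str:
    # Render the spec as key="value" pairs in insertion order.
    spec = ", ".join(f'{k}="{v}"' for k, v in part.items())
    return f"alter table {table} add partition ({spec});"

partitions = [
    {"year": "2013", "month": "07", "day": "28", "host": "host1"},
    {"year": "2013", "month": "07", "day": "28", "host": "host2"},
    {"year": "2013", "month": "07", "day": "29", "host": "host1"},
    {"year": "2013", "month": "08", "day": "01", "host": "host1"},
]
for part in partitions:
    print(add_partition_stmt("logs", part))
```

The generated statements match the four shown above and could be fed to impala-shell in a script.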

We issue a REFRESH statement for the table, which is always a safe practice when data files have been manually added, removed, or changed. Then the data is ready to query. The SELECT * statement illustrates that the data from our trivial CSV file was recognized in each of the partitions where we copied it. Although in this case there are only a few rows, we include a LIMIT clause on this test query in case there is more data than we expect.

refresh logs;
select * from logs limit 100;
+--------+--------+--------+------+-------+-----+-------+
| field1 | field2 | field3 | year | month | day | host  |
+--------+--------+--------+------+-------+-----+-------+
| bar    | baz    | bletch | 2013 | 07    | 28  | host1 |
| bar    | baz    | bletch | 2013 | 08    | 01  | host1 |
| bar    | baz    | bletch | 2013 | 07    | 29  | host1 |
| bar    | baz    | bletch | 2013 | 07    | 28  | host2 |
+--------+--------+--------+------+-------+-----+-------+