Working with Parquet Files in Impala


Overview

As data pipelines start to include more aspects such as NoSQL or loosely specified schemas, you may encounter data files, particularly in Parquet format, whose precise table definition you do not know.

This tutorial shows how to build an Impala table around data that comes from non-Impala, or even non-SQL, sources, where you do not control the table layout and may not be familiar with the characteristics of the data.

The data used in this tutorial represents airline on-time arrival statistics from October 1987 through April 2008. For details, see the 2009 ASA Data Expo.

Download the Data Files into HDFS

First, we download and unpack the data files.

The package contains 8 files totalling 1.4 GB.

$ wget -O airlines_parquet.tar.gz https://home.apache.org/~arodoni/airlines_parquet.tar.gz
$ wget https://home.apache.org/~arodoni/airlines_parquet.tar.gz.sha512
$ shasum -a 512 -c airlines_parquet.tar.gz.sha512
airlines_parquet.tar.gz: OK
$ tar xvzf airlines_parquet.tar.gz
$ cd airlines_parquet/
$ du -kch *.parq
253M   4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq
14M    4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq
253M   4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq
64M    4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq
184M   4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq
241M   4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq
212M   4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq
152M   4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq
1.4G   total

Next, we put the Parquet data files into HDFS, all together in a single directory, so that the impala user can read them.

After unpacking the archive, we see that the largest Parquet file is 253 MB.

When copying Parquet files into HDFS for Impala to use, make sure each file resides in a single HDFS data block to get the best query performance.

To do this, we specify a block size slightly larger than any single file, using the -Ddfs.block.size=253m argument on the hdfs dfs -put command.

$ sudo -u hdfs hdfs dfs -mkdir -p /user/impala/staging/airlines
$ sudo -u hdfs hdfs dfs -Ddfs.block.size=253m -put *.parq /user/impala/staging/airlines
$ sudo -u hdfs hdfs dfs -ls /user/impala/staging
Found 1 items
$ sudo -u hdfs hdfs dfs -ls /user/impala/staging/airlines
Found 8 items

Create the Database and Table

With the files in an accessible location in HDFS, we create a database and a table that use the data in those files:

  • The CREATE EXTERNAL syntax and the LOCATION clause point Impala at the appropriate HDFS directory.
  • The LIKE PARQUET 'path_to_any_parquet_file' clause means we skip the list of column names and types; Impala automatically gets the column names and data types straight from the data files. (Currently, this technique only works for Parquet files.)
  • Ignore the warning about the lack of READ_WRITE access to the files in HDFS; the impala user can still read the files, which is sufficient for us to experiment with queries and copy the data into other tables.
$ impala-shell
> CREATE DATABASE airlines_data;
  USE airlines_data;
  CREATE EXTERNAL TABLE airlines_external
  LIKE PARQUET 'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
  STORED AS PARQUET LOCATION 'hdfs:staging/airlines';
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://myhost.com:8020/user/impala/staging'

Examine the Physical and Logical Schema

With the table created, we examine its physical and logical characteristics to confirm that the data really is there and is in a format and shape we can work with.

  • The SHOW TABLE STATS statement gives a high-level summary of the table, showing how many files and how much total data it contains.
  • The SHOW FILES statement confirms that the data in the table has the expected number, names, and sizes of the original Parquet files.
  • The DESCRIBE statement (or its abbreviation DESC) confirms the names and types of the columns that Impala automatically created after reading the metadata from the Parquet file.
  • The DESCRIBE FORMATTED statement prints some extra detail along with the column definitions. The pieces that matter for this exercise are:
    • The database containing the table.
    • The location of the associated data files in HDFS.
    • The fact that the table is an external table, so Impala will not delete the HDFS files when we finish the experiments and drop the table.
    • The fact that the table is set up to work exclusively with files in the Parquet format.
> SHOW TABLE STATS airlines_external;
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats |
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| -1    | 8      | 1.34GB | NOT CACHED   | NOT CACHED        | PARQUET | false             |
+-------+--------+--------+--------------+-------------------+---------+-------------------+
> SHOW FILES IN airlines_external;
+----------------------------------------------------------------------------------------+----------+-----------+
| path                                                                                   | size     | partition |
+----------------------------------------------------------------------------------------+----------+-----------+
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq | 252.99MB |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq | 13.43MB  |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq  | 252.84MB |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq  | 63.92MB  |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq | 183.64MB |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq | 240.04MB |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq | 211.35MB |           |
| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq | 151.46MB |           |
+----------------------------------------------------------------------------------------+----------+-----------+
> DESCRIBE airlines_external;
+---------------------+--------+---------------------------------------------------+
| name                | type   | comment                                           |
+---------------------+--------+---------------------------------------------------+
| year                | int    | Inferred from Parquet file.                       |
| month               | int    | Inferred from Parquet file.                       |
| day                 | int    | Inferred from Parquet file.                       |
| dayofweek           | int    | Inferred from Parquet file.                       |
| dep_time            | int    | Inferred from Parquet file.                       |
| crs_dep_time        | int    | Inferred from Parquet file.                       |
| arr_time            | int    | Inferred from Parquet file.                       |
| crs_arr_time        | int    | Inferred from Parquet file.                       |
| carrier             | string | Inferred from Parquet file.                       |
| flight_num          | int    | Inferred from Parquet file.                       |
| tail_num            | int    | Inferred from Parquet file.                       |
| actual_elapsed_time | int    | Inferred from Parquet file.                       |
| crs_elapsed_time    | int    | Inferred from Parquet file.                       |
| airtime             | int    | Inferred from Parquet file.                       |
| arrdelay            | int    | Inferred from Parquet file.                       |
| depdelay            | int    | Inferred from Parquet file.                       |
| origin              | string | Inferred from Parquet file.                       |
| dest                | string | Inferred from Parquet file.                       |
| distance            | int    | Inferred from Parquet file.                       |
| taxi_in             | int    | Inferred from Parquet file.                       |
| taxi_out            | int    | Inferred from Parquet file.                       |
| cancelled           | int    | Inferred from Parquet file.                       |
| cancellation_code   | string | Inferred from Parquet file.                       |
| diverted            | int    | Inferred from Parquet file.                       |
| carrier_delay       | int    | Inferred from Parquet file.                       |
| weather_delay       | int    | Inferred from Parquet file.                       |
| nas_delay           | int    | Inferred from Parquet file.                       |
| security_delay      | int    | Inferred from Parquet file.                       |
| late_aircraft_delay | int    | Inferred from Parquet file.                       |
+---------------------+--------+---------------------------------------------------+
> DESCRIBE FORMATTED airlines_external;
+------------------------------+-------------------------------
| name                         | type
+------------------------------+-------------------------------
...
| # Detailed Table Information | NULL
| Database:                    | airlines_data
| Owner:                       | impala
...
| Location:                    | /user/impala/staging/airlines
| Table Type:                  | EXTERNAL_TABLE
...
| # Storage Information        | NULL
| SerDe Library:               | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
| InputFormat:                 | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
| OutputFormat:                | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
...

Analyze the Data

Now that we are confident the connection between the Impala table and the underlying Parquet files is solid, we run some initial queries to understand the characteristics of the data: the overall number of rows, and the ranges and numbers of distinct values in certain columns.

> SELECT COUNT(*) FROM airlines_external;
+-----------+
| count(*)  |
+-----------+
| 123534969 |
+-----------+

The NDV() function returns the number of distinct values in a column. For performance reasons this is an estimate when the column has many distinct values, but it is exact when the cardinality is below 16 K. Use NDV() for this kind of exploration rather than COUNT(DISTINCT colname), because Impala can evaluate multiple NDV() functions in a single query, but only a single instance of COUNT(DISTINCT).
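
For comparison, the exact figure for a single column could be obtained with COUNT(DISTINCT); a hypothetical one-column check (not part of the original session) would look like:

> SELECT COUNT(DISTINCT carrier) FROM airlines_external;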

> SELECT NDV(carrier), NDV(flight_num), NDV(tail_num),
  NDV(origin), NDV(dest) FROM airlines_external;
+--------------+-----------------+---------------+-------------+-----------+
| ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
+--------------+-----------------+---------------+-------------+-----------+
| 29           | 8463            | 3             | 342         | 349       |
+--------------+-----------------+---------------+-------------+-----------+
> SELECT tail_num, COUNT(*) AS howmany FROM airlines_external
  GROUP BY tail_num;
+----------+-----------+
| tail_num | howmany   |
+----------+-----------+
| NULL     | 123122001 |
| 715      | 1         |
| 0        | 406405    |
| 112      | 6562      |
+----------+-----------+
> SELECT DISTINCT dest FROM airlines_external
  WHERE dest NOT IN (SELECT origin FROM airlines_external);
+------+
| dest |
+------+
| CBM  |
| SKA  |
| LAR  |
| RCA  |
| LBF  |
+------+
> SELECT DISTINCT dest FROM airlines_external
  WHERE dest NOT IN (SELECT DISTINCT origin FROM airlines_external);
+------+
| dest |
+------+
| CBM  |
| SKA  |
| LAR  |
| RCA  |
| LBF  |
+------+
> SELECT DISTINCT origin FROM airlines_external
  WHERE origin NOT IN (SELECT DISTINCT dest FROM airlines_external);
Fetched 0 row(s) in 2.63s

The queries above show that there are modest numbers of different airlines, flight numbers, and origin and destination airports. Two things jump out: the number of tail_num values is much smaller than we expected, and there are more destination airports than origin airports.

Note: The first SELECT DISTINCT dest query takes almost 40 seconds. We expect all queries on such a small data set (under 2 GB) to take at most a few seconds. The reason is that the expression NOT IN (SELECT origin FROM airlines_external) produces an intermediate result set of 123 million rows, and 123 million comparisons are then run on each data node against the small set of destination airports.
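
As an aside, the same anti-membership check can be written with Impala's LEFT ANTI JOIN syntax instead of NOT IN; a hypothetical rewrite (not part of the original tutorial, and with no timing claim) might look like:

> SELECT DISTINCT d.dest FROM airlines_external d
  LEFT ANTI JOIN airlines_external o ON d.dest = o.origin;

Note that an anti-join and NOT IN handle NULLs differently, so the two forms are only interchangeable when the compared columns contain no NULLs.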

Next, we do a simple calculation with the results broken down by year. This reveals that some years have no data at all in the airtime column, which means we can use that column only in queries covering certain date ranges.

When doing initial exploration of a data set, the question of whether a column contains any NULL values, and if so how many there are, what proportion they make up, and how they are distributed, comes up again and again; here we ask it of the airtime column.

> SELECT year, SUM(airtime) FROM airlines_external
  GROUP BY year ORDER BY year DESC;
+------+--------------+
| year | sum(airtime) |
+------+--------------+
| 2008 | 713050445    |
| 2007 | 748015545    |
| 2006 | 720372850    |
| 2005 | 708204026    |
| 2004 | 714276973    |
| 2003 | 665706940    |
| 2002 | 549761849    |
| 2001 | 590867745    |
| 2000 | 583537683    |
| 1999 | 561219227    |
| 1998 | 538050663    |
| 1997 | 536991229    |
| 1996 | 519440044    |
| 1995 | 513364265    |
| 1994 | NULL         |
| 1993 | NULL         |
| 1992 | NULL         |
| 1991 | NULL         |
| 1990 | NULL         |
| 1989 | NULL         |
| 1988 | NULL         |
| 1987 | NULL         |
+------+--------------+

With the notion of NULL values in mind, let's come back to the tail_num column, which we found to contain many NULLs. To understand it better, let's quantify the NULL and non-NULL values in that column. First, we just count the total number of rows versus the non-NULL values in that column. The initial result suggests relatively few non-NULL values, but we can break it down more clearly in a single query. Once we have the COUNT(*) and COUNT(colname) figures, we can wrap that initial query in a WITH clause and run a follow-on query that performs several arithmetic operations on those values. Seeing that only about a third of one percent of all rows have a non-NULL tail_num value makes it clear that this column is of little use:

> SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'non-null tail numbers'
  FROM airlines_external;
+-----------+-----------------------+
| rows      | non-null tail numbers |
+-----------+-----------------------+
| 123534969 | 412968                |
+-----------+-----------------------+
> WITH t1 AS
  (SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'nonnull'
  FROM airlines_external)
SELECT `rows`, `nonnull`, `rows` - `nonnull` AS 'nulls',
  (`nonnull` / `rows`) * 100 AS 'percentage non-null'
FROM t1;
+-----------+---------+-----------+---------------------+
| rows      | nonnull | nulls     | percentage non-null |
+-----------+---------+-----------+---------------------+
| 123534969 | 412968  | 123122001 | 0.3342923897119365  |
+-----------+---------+-----------+---------------------+

By examining other columns with these techniques, we can form a mental picture of how the data is distributed throughout the table, and which columns matter most for query purposes. For this tutorial we focus mostly on fields that are likely to hold discrete values, rather than columns such as actual_elapsed_time whose names suggest they hold measurements.

Once we have a clear idea of which questions are worth asking and what kinds of trends to look for, we can dig deeper into those parts of the data. As the final piece of this initial exploration, let's look at the year column. A simple GROUP BY query shows that it has a well-defined range, a manageable number of distinct values, and a relatively even distribution of rows across the different years:

> SELECT MIN(year), MAX(year), NDV(year) FROM airlines_external;
+-----------+-----------+-----------+
| min(year) | max(year) | ndv(year) |
+-----------+-----------+-----------+
| 1987      | 2008      | 22        |
+-----------+-----------+-----------+
> SELECT year, COUNT(*) howmany FROM airlines_external
  GROUP BY year ORDER BY year DESC;
+------+---------+
| year | howmany |
+------+---------+
| 2008 | 7009728 |
| 2007 | 7453215 |
| 2006 | 7141922 |
| 2005 | 7140596 |
| 2004 | 7129270 |
| 2003 | 6488540 |
| 2002 | 5271359 |
| 2001 | 5967780 |
| 2000 | 5683047 |
| 1999 | 5527884 |
| 1998 | 5384721 |
| 1997 | 5411843 |
| 1996 | 5351983 |
| 1995 | 5327435 |
| 1994 | 5180048 |
| 1993 | 5070501 |
| 1992 | 5092157 |
| 1991 | 5076925 |
| 1990 | 5270893 |
| 1989 | 5041200 |
| 1988 | 5202096 |
| 1987 | 1311826 |
+------+---------+

We could go quite far with the data in this initial raw format. If the data set proves useful and worth keeping in Impala for extensive querying, we can copy it into an internal table and let Impala manage the data files.

In the next stage of the analysis, we copy the original data into a partitioned table, still in Parquet format.

Partitioning on the year column lets us run queries with clauses such as WHERE year = 2001 or WHERE year BETWEEN 1989 AND 1999, which can dramatically reduce I/O by ignoring all the data from years outside the desired range.

Rather than reading all the data and then deciding which rows fall in the matching years, Impala can zero in on only the data files from the relevant year partitions.

To make this possible, Impala physically reorganizes the data files, putting the rows for each year into a separate HDFS directory for each year value.

The first step is to create a new table with a layout very similar to the original airlines_external table. We do that by reverse-engineering a CREATE TABLE statement for the first table, then tweaking it slightly to add a PARTITIONED BY clause for year and to drop the tail_num column. The SHOW CREATE TABLE statement gives us the starting point.

Although we could edit that output into a new SQL statement, the ASCII box characters make such editing inconvenient. To get a more stripped-down CREATE TABLE statement to start from, we restart impala-shell with the -B option, which turns off the box-drawing behavior:

$ impala-shell -i localhost -B -d airlines_data;
> SHOW CREATE TABLE airlines_external;
"CREATE EXTERNAL TABLE airlines_data.airlines_external (
  year INT COMMENT 'inferred from: optional int32 year',
  month INT COMMENT 'inferred from: optional int32 month',
  day INT COMMENT 'inferred from: optional int32 day',
  dayofweek INT COMMENT 'inferred from: optional int32 dayofweek',
  dep_time INT COMMENT 'inferred from: optional int32 dep_time',
  crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time',
  arr_time INT COMMENT 'inferred from: optional int32 arr_time',
  crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time',
  carrier STRING COMMENT 'inferred from: optional binary carrier',
  flight_num INT COMMENT 'inferred from: optional int32 flight_num',
  tail_num INT COMMENT 'inferred from: optional int32 tail_num',
  actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time',
  crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time',
  airtime INT COMMENT 'inferred from: optional int32 airtime',
  arrdelay INT COMMENT 'inferred from: optional int32 arrdelay',
  depdelay INT COMMENT 'inferred from: optional int32 depdelay',
  origin STRING COMMENT 'inferred from: optional binary origin',
  dest STRING COMMENT 'inferred from: optional binary dest',
  distance INT COMMENT 'inferred from: optional int32 distance',
  taxi_in INT COMMENT 'inferred from: optional int32 taxi_in',
  taxi_out INT COMMENT 'inferred from: optional int32 taxi_out',
  cancelled INT COMMENT 'inferred from: optional int32 cancelled',
  cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code',
  diverted INT COMMENT 'inferred from: optional int32 diverted',
  carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay',
  weather_delay INT COMMENT 'inferred from: optional int32 weather_delay',
  nas_delay INT COMMENT 'inferred from: optional int32 nas_delay',
  security_delay INT COMMENT 'inferred from: optional int32 security_delay',
  late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay'
)
STORED AS PARQUET
LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines'
TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false',
  'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0',
  'rawDataSize'='-1')"

After copying and pasting the CREATE TABLE statement into a text editor for fine-tuning, we quit and restart impala-shell without the -B option, to switch back to regular output.

Next, we run the CREATE TABLE statement adapted from the SHOW CREATE TABLE output. We keep the STORED AS PARQUET clause because we want to rearrange the data somewhat while keeping it in the high-performance Parquet format. The LOCATION and TBLPROPERTIES clauses are not relevant to the new table, so we edit them out. Because we are going to partition the new table on the year column, we move that column name (and its type) into a new PARTITIONED BY clause:

> CREATE TABLE airlines_data.airlines
 (month INT,
  day INT,
  dayofweek INT,
  dep_time INT,
  crs_dep_time INT,
  arr_time INT,
  crs_arr_time INT,
  carrier STRING,
  flight_num INT,
  actual_elapsed_time INT,
  crs_elapsed_time INT,
  airtime INT,
  arrdelay INT,
  depdelay INT,
  origin STRING,
  dest STRING,
  distance INT,
  taxi_in INT,
  taxi_out INT,
  cancelled INT,
  cancellation_code STRING,
  diverted INT,
  carrier_delay INT,
  weather_delay INT,
  nas_delay INT,
  security_delay INT,
  late_aircraft_delay INT)
PARTITIONED BY (year INT)
STORED AS PARQUET;

Next, we copy all the rows from the original table into this new one with an INSERT statement. (We edited the CREATE TABLE statement into an INSERT statement with the column names in the same order.) The only changes are to add a PARTITION (year) clause and to move the year column to the very end of the SELECT list of the INSERT statement. Specifying PARTITION (year), rather than a fixed value such as PARTITION (year=2000), means that Impala determines the partition value for each row from the value of the last column in the SELECT list. This is the first SQL statement that legitimately takes a substantial amount of time, because rows from different years are shuffled around the cluster; the rows destined for each partition are collected on one node before being written to one or more new data files:

> INSERT INTO airlines_data.airlines
  PARTITION (year)
  SELECT
    month,
    day,
    dayofweek,
    dep_time,
    crs_dep_time,
    arr_time,
    crs_arr_time,
    carrier,
    flight_num,
    actual_elapsed_time,
    crs_elapsed_time,
    airtime,
    arrdelay,
    depdelay,
    origin,
    dest,
    distance,
    taxi_in,
    taxi_out,
    cancelled,
    cancellation_code,
    diverted,
    carrier_delay,
    weather_delay,
    nas_delay,
    security_delay,
    late_aircraft_delay,
    year
  FROM airlines_data.airlines_external;

Once partitioning or join queries come into play, it is important to have statistics that Impala can use to optimize queries on the corresponding tables. The COMPUTE INCREMENTAL STATS statement is the way to collect statistics for a partitioned table. The SHOW TABLE STATS statement then confirms that the statistics are in place for each partition, and also shows how many files and how much raw data each partition contains:

> COMPUTE INCREMENTAL STATS airlines;
+-------------------------------------------+
| summary                                   |
+-------------------------------------------+
| Updated 22 partition(s) and 27 column(s). |
+-------------------------------------------+
> SHOW TABLE STATS airlines;
+-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+
| year  | #Rows     | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                                                 |
+-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+
| 1987  | 1311826   | 1      | 11.75MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1987 |
| 1988  | 5202096   | 1      | 44.04MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1988 |
| 1989  | 5041200   | 1      | 46.07MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1989 |
| 1990  | 5270893   | 1      | 46.25MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1990 |
| 1991  | 5076925   | 1      | 46.77MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1991 |
| 1992  | 5092157   | 1      | 48.21MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1992 |
| 1993  | 5070501   | 1      | 47.46MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1993 |
| 1994  | 5180048   | 1      | 47.47MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1994 |
| 1995  | 5327435   | 1      | 62.40MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1995 |
| 1996  | 5351983   | 1      | 62.93MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1996 |
| 1997  | 5411843   | 1      | 65.05MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1997 |
| 1998  | 5384721   | 1      | 62.21MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1998 |
| 1999  | 5527884   | 1      | 65.10MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=1999 |
| 2000  | 5683047   | 1      | 67.68MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2000 |
| 2001  | 5967780   | 1      | 74.03MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2001 |
| 2002  | 5271359   | 1      | 74.00MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2002 |
| 2003  | 6488540   | 1      | 99.35MB  | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2003 |
| 2004  | 7129270   | 1      | 123.29MB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2004 |
| 2005  | 7140596   | 1      | 120.72MB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2005 |
| 2006  | 7141922   | 1      | 121.88MB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2006 |
| 2007  | 7453215   | 1      | 130.87MB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2007 |
| 2008  | 7009728   | 1      | 123.14MB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://myhost.com:8020/user/hive/warehouse/airline_data.db/airlines/year=2008 |
| Total | 123534969 | 22     | 1.55GB   | 0B           |                   |         |                   |                                                                                                          |
+-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+

At this point, we sanity check the partitioning we did. Every partition has exactly one file, which is on the low side. A query that includes a WHERE year = 2004 clause reads only a single data block, and that block is read and processed by a single data node.

As a result, for a query targeting a single year, all the other nodes in the cluster sit idle while all the work happens on one machine. It is even possible, purely by chance (depending on the HDFS replication factor and the way data blocks happen to be distributed across the cluster), that multiple year partitions selected by a filter such as WHERE year BETWEEN 1999 AND 2001 could all be read and processed by the same data node.

The more data files each partition has, the more parallelism you can get and the smaller the chance of "hotspots" on particular nodes, so a bigger cluster delivers a bigger performance boost. However, the more data files there are, the less data goes into each one: if each node reads only a few megabytes, the overhead of dividing the work in a parallel query may not be worth it.

50 or 100 MB is a decent size for a Parquet data block; 9 or 37 MB is on the small side. In other words, the data distribution we end up with under this partitioning scheme is on the borderline between sensible (reasonably large files) and suboptimal (few files per partition).

Note: if each partition turns out to hold only a few tens of megabytes, consider making the partitioning less granular.

For example, instead of partitioning by year, month, and day, partition by year and month, or even just by year. The ideal layout for distributing work efficiently in a parallel query is many tens, or even hundreds, of megabytes per Parquet file, with the number of Parquet files in each partition somewhat higher than the number of data nodes.
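
To see how the files for a single year landed after the INSERT, we could list the files of one partition; a hypothetical check (not part of the original session) uses SHOW FILES with a PARTITION clause:

> SHOW FILES IN airlines PARTITION (year=2004);

The SUM(airtime) queries below then compare the partitioned table against the original external table, confirming that the copied data is complete: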

> SELECT SUM(airtime) FROM airlines_external;
+--------------+
| sum(airtime) |
+--------------+
| 8662859484   |
+--------------+
> SELECT SUM(airtime) FROM airlines;
+--------------+
| sum(airtime) |
+--------------+
| 8662859484   |
+--------------+
> SELECT SUM(airtime) FROM airlines_external WHERE year = 2005;
+--------------+
| sum(airtime) |
+--------------+
| 708204026    |
+--------------+
> SELECT SUM(airtime) FROM airlines WHERE year = 2005;
+--------------+
| sum(airtime) |
+--------------+
| 708204026    |
+--------------+

Now we can analyze this data set, which came from raw data files whose columns we did not even know at the outset.

First, let's see whether a flight's airtime differs depending on the day of the week. The average is slightly higher on day 6; perhaps Saturday is a busy flying day and planes have to circle longer at the destination airport before landing:

> SELECT dayofweek, AVG(airtime) FROM airlines
  GROUP BY dayofweek ORDER BY dayofweek;
+-----------+-------------------+
| dayofweek | avg(airtime)      |
+-----------+-------------------+
| 1         | 102.1560425016671 |
| 2         | 102.1582931538807 |
| 3         | 102.2170009256653 |
| 4         | 102.37477661846   |
| 5         | 102.2697358763511 |
| 6         | 105.3627448363705 |
| 7         | 103.4144351202054 |
+-----------+-------------------+

To see whether this apparent trend holds up over time, let's do the same breakdown by day of week, but also split it by year. Now we can see that day 6 consistently has a higher average airtime in each year. We can also see that average airtime increased across the board over time. The NULL values for this column in the years 1987 through 1994 also show that queries involving this column need to be restricted to a date range of 1995 and later:

> SELECT year, dayofweek, AVG(airtime) FROM airlines
  GROUP BY year, dayofweek ORDER BY year DESC, dayofweek;
+------+-----------+-------------------+
| year | dayofweek | avg(airtime)      |
+------+-----------+-------------------+
| 2008 | 1         | 103.1821651651355 |
| 2008 | 2         | 103.2149301386094 |
| 2008 | 3         | 103.0585076622796 |
| 2008 | 4         | 103.4671383539038 |
| 2008 | 5         | 103.5575385182659 |
| 2008 | 6         | 107.4006306562128 |
| 2008 | 7         | 104.8648851041755 |
| 2007 | 1         | 102.2196114337825 |
| 2007 | 2         | 101.9317791906348 |
| 2007 | 3         | 102.0964767689043 |
| 2007 | 4         | 102.6215927201686 |
| 2007 | 5         | 102.4289399000661 |
| 2007 | 6         | 105.1477448215756 |
| 2007 | 7         | 103.6305945644095 |
...
| 1996 | 1         | 99.33860750862108 |
| 1996 | 2         | 99.54225446396656 |
| 1996 | 3         | 99.41129336113134 |
| 1996 | 4         | 99.5110373340348  |
| 1996 | 5         | 99.22120745027595 |
| 1996 | 6         | 101.1717447111921 |
| 1996 | 7         | 99.95410136133704 |
| 1995 | 1         | 96.93779698300494 |
| 1995 | 2         | 96.93458674589712 |
| 1995 | 3         | 97.00972311337051 |
| 1995 | 4         | 96.90843832024412 |
| 1995 | 5         | 96.78382115425562 |
| 1995 | 6         | 98.70872826057003 |
| 1995 | 7         | 97.85570478374616 |
| 1994 | 1         | NULL              |
| 1994 | 2         | NULL              |
| 1994 | 3         | NULL              |
...
| 1987 | 5         | NULL              |
| 1987 | 6         | NULL              |
| 1987 | 7         | NULL              |
+------+-----------+-------------------+