Impala 交叉连接和笛卡尔积


最初,Impala限制了连接查询,因此它们必须在连接运算符每一侧的表列之间至少包含一个相等比较

对于使用Impala来处理数据量庞大的表,任何产生完整笛卡尔积连接结果集的查询都可能消耗大量集群资源

在Impala1.2.2及更高版本中,当您CROSSJOIN在查询中使用运算符时会取消此限制。您仍然无法WHERE从查询中删除所有子句,例如SELECT*FROMt1JOINt2从两个表中生成所有行组合。但是您可以使用CROSSJOIN运算符显式请求这样的笛卡尔积。

该操作适用于较小的表,其中,结果集仍适合单个Impala节点的内存。

下面的示例设置数据:数据描述了漫画中的一系列角色。

使用equijoin进行查询,含义为:允许来自同一时间段同一星球的角色相遇。

[localhost:21000] > create table heroes (name string, era string, planet string);
[localhost:21000] > create table villains (name string, era string, planet string);
[localhost:21000] > insert into heroes values
                  > ('Tesla','20th century','Earth'),
                  > ('Pythagoras','Antiquity','Earth'),
                  > ('Zopzar','Far Future','Mars');
Inserted 3 rows in 2.28s
[localhost:21000] > insert into villains values
                  > ('Caligula','Antiquity','Earth'),
                  > ('John Dillinger','20th century','Earth'),
                  > ('Xibulor','Far Future','Venus');
Inserted 3 rows in 1.93s
[localhost:21000] > select concat(heroes.name,' vs. ',villains.name) as battle
                  > from heroes join villains
                  > where heroes.era = villains.era and heroes.planet = villains.planet;
+--------------------------+
| battle                   |
+--------------------------+
| Tesla vs. John Dillinger |
| Pythagoras vs. Caligula  |
+--------------------------+
Returned 2 row(s) in 0.47s

之后,加入时间旅行空间旅行的要素,让任一角色都能与任一反派相遇。

在Impala1.2.2之前,这种类型的查询是不可能的,因为所有连接都必须引用两个表之间的匹配值:

[localhost:21000] > -- Cartesian product not possible in Impala 1.1.
                  > select concat(heroes.name,' vs. ',villains.name) as battle from heroes join villains;
ERROR: NotImplementedException: Join between 'heroes' and 'villains' requires at least one conjunctive equality predicate between the two tables

在Impala1.2.2中,我们稍微重写查询以使用CROSSJOIN而不是JOIN,现在结果集包括所有组合:

[localhost:21000] > -- Cartesian product available in Impala 1.2.2 with the CROSS JOIN syntax.
                  > select concat(heroes.name,' vs. ',villains.name) as battle from heroes cross join villains;
+-------------------------------+
| battle                        |
+-------------------------------+
| Tesla vs. Caligula            |
| Tesla vs. John Dillinger      |
| Tesla vs. Xibulor             |
| Pythagoras vs. Caligula       |
| Pythagoras vs. John Dillinger |
| Pythagoras vs. Xibulor        |
| Zopzar vs. Caligula           |
| Zopzar vs. John Dillinger     |
| Zopzar vs. Xibulor            |
+-------------------------------+
Returned 9 row(s) in 0.33s

来自两个表的行的完整组合称为笛卡尔积,这种类型的结果集通常用于创建网格数据结构

您还可以通过包含WHERE未显式比较两个表之间的列的子句来过滤结果集。

以下示例显示如何生成用于图表的年份季度组合列表,然后生成一个仅包含选定季度的较短列表。

[localhost:21000] > create table x_axis (x int);
[localhost:21000] > create table y_axis (y int);
[localhost:21000] > insert into x_axis values (1),(2),(3),(4);
Inserted 4 rows in 2.14s
[localhost:21000] > insert into y_axis values (2010),(2011),(2012),(2013),(2014);
Inserted 5 rows in 1.32s
[localhost:21000] > select y as year, x as quarter from x_axis cross join y_axis;
+------+---------+
| year | quarter |
+------+---------+
| 2010 | 1       |
| 2011 | 1       |
| 2012 | 1       |
| 2013 | 1       |
| 2014 | 1       |
| 2010 | 2       |
| 2011 | 2       |
| 2012 | 2       |
| 2013 | 2       |
| 2014 | 2       |
| 2010 | 3       |
| 2011 | 3       |
| 2012 | 3       |
| 2013 | 3       |
| 2014 | 3       |
| 2010 | 4       |
| 2011 | 4       |
| 2012 | 4       |
| 2013 | 4       |
| 2014 | 4       |
+------+---------+
Returned 20 row(s) in 0.38s
[localhost:21000] > select y as year, x as quarter from x_axis cross join y_axis where x in (1,3);
+------+---------+
| year | quarter |
+------+---------+
| 2010 | 1       |
| 2011 | 1       |
| 2012 | 1       |
| 2013 | 1       |
| 2014 | 1       |
| 2010 | 3       |
| 2011 | 3       |
| 2012 | 3       |
| 2013 | 3       |
| 2014 | 3       |
+------+---------+
Returned 10 row(s) in 0.39s