Kylin: Deploying Kylin 4 Without a Hadoop Environment


Compared with Kylin 3.x, Kylin 4.0 implements a brand-new Spark build engine and Parquet storage, which makes it possible to deploy Kylin without a Hadoop environment. Compared with deploying Kylin 3.x on top of AWS EMR, deploying Kylin 4.0 directly on AWS EC2 instances has the following advantages:

  • Lower cost. AWS EC2 nodes cost less than AWS EMR nodes.
  • More flexibility. On EC2 nodes, users are free to choose which services and components to install and deploy.
  • Hadoop-free. The Hadoop ecosystem is fairly heavyweight and takes real effort to maintain; removing Hadoop brings the deployment closer to cloud-native.

After implementing support for building and querying in Spark Standalone mode, we tried a Hadoop-free deployment of Kylin 4.0 on AWS EC2 instances, and successfully built a cube and ran queries against it.

1 Environment Preparation

  • Provision AWS EC2 Linux instances as needed
  • Create an Amazon RDS for MySQL instance as the metastore database for Kylin and Hive
  • Use S3 as Kylin's storage (a CLI sketch for creating these resources follows)
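
If you prefer the command line to the AWS console, the bucket and the metastore database can also be created with the AWS CLI. This is a minimal sketch, assuming the CLI is configured with sufficient permissions; the bucket name, instance identifier, instance class, and password below are placeholders, not values from the original setup:
# Create the S3 bucket used as Kylin's working storage
aws s3 mb s3://my-kylin-bucket
# Create a small RDS for MySQL instance for the Kylin and Hive metadata
aws rds create-db-instance \
  --db-instance-identifier kylin-metastore \
  --db-instance-class db.t3.medium \
  --engine mysql \
  --master-username admin \
  --master-user-password 'choose-a-password' \
  --allocated-storage 20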

2 Component Versions

The versions listed here are the ones we used during testing. If you want to deploy with other versions, feel free to substitute them, as long as the component versions are compatible with each other.

  • JDK 1.8
  • Hive 2.3.9
  • Zookeeper 3.4.13
  • Kylin 4.0 for Spark 3
  • Spark 3.1.1
  • Hadoop 3.2.0 (does not need to be started)

3 Installation Steps

1) Configure environment variables
  • Configure the environment variables and make them take effect
vim /etc/profile

# Append the following at the end of the profile file
export JAVA_HOME=/usr/local/java/jdk1.8.0_291
export JRE_HOME=${JAVA_HOME}/jre
export HADOOP_HOME=/etc/hadoop/hadoop-3.2.0
export HIVE_HOME=/etc/hadoop/hive
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$HIVE_HOME/bin:$HIVE_HOME/conf:${HADOOP_HOME}/bin:${JAVA_HOME}/bin:$PATH

# After saving the file, run the following command to apply the changes
source /etc/profile
2) Install JDK 1.8

Download JDK 1.8 to the prepared EC2 instance and extract it into /usr/local/java:

mkdir /usr/local/java
tar -xvf java-1.8.0-openjdk.tar -C /usr/local/java
3) Configure Hadoop
  • Download Hadoop and extract it
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
mkdir /etc/hadoop
tar -xvf hadoop-3.2.0.tar.gz -C /etc/hadoop
  • Copy the jars required for connecting to S3 onto Hadoop's classpath; otherwise ClassNotFound-style errors may occur
cd /etc/hadoop
cp hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar hadoop-3.2.0/share/hadoop/common/lib/
cp hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar hadoop-3.2.0/share/hadoop/common/lib/
  • Edit core-site.xml to configure the AWS account information and the endpoint; example content follows
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>SESSION-ACCESS-KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>SESSION-SECRET-KEY</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.$REGION.amazonaws.com</value>
  </property>
</configuration>
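
Before continuing, it may be worth verifying that Hadoop's S3A client can reach the bucket with the credentials configured above; the bucket name below is a placeholder:
# Should list the bucket contents without ClassNotFound or credential errors
$HADOOP_HOME/bin/hadoop fs -ls s3a://my-kylin-bucket/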
4) Install Hive
  • Download Hive and extract it
wget https://downloads.apache.org/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
tar -xvf apache-hive-2.3.9-bin.tar.gz -C /etc/hadoop
mv /etc/hadoop/apache-hive-2.3.9-bin /etc/hadoop/hive
  • Edit the Hive configuration file (vim ${HIVE_HOME}/conf/hive-site.xml). Start the Amazon RDS for MySQL database in advance and obtain its connection URI, username, and password.

Note: Configure the VPC and security group correctly so that the EC2 instance can reach the database; a quick connectivity check follows the example file.

Example hive-site.xml content:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Hive Execution Parameters -->
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://host-name:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>admin</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>
      Enforce metastore schema version consistency.
      True: Verify that version information stored in the metastore matches the one from the Hive jars. Also disable automatic
      schema migration attempts. Users are required to manually migrate the schema after a Hive upgrade, which ensures
      proper metastore schema migration. (Default)
      False: Warn if the version information stored in the metastore doesn't match the one from the Hive jars.
    </description>
  </property>
</configuration>
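
Before initializing the metastore in the next step, it is worth confirming that the EC2 instance can actually reach the RDS database. A minimal check, assuming a MySQL client is installed and that host-name and admin match the values configured above:
# Prompts for the password configured on the RDS instance
mysql -h host-name -P 3306 -u admin -p -e "SELECT VERSION();"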
  • Initialize the Hive metastore
# Download the mysql-jdbc jar and place it under $HIVE_HOME/lib
cp mysql-connector-java-5.1.47.jar $HIVE_HOME/lib
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
mkdir $HIVE_HOME/logs
nohup $HIVE_HOME/bin/hive --service metastore >> $HIVE_HOME/logs/hivemetastorelog.log 2>&1 &

Note: If the following error appears during this step:

java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V

This is caused by a mismatch between the guava version shipped with Hive 2 and the one shipped with Hadoop 3. Replace the guava jar in $HIVE_HOME/lib with the guava jar from $HADOOP_HOME/share/hadoop/common/lib/.
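
The replacement amounts to something like the following sketch; the exact guava version numbers depend on the Hive and Hadoop releases in use, so treat the globbed file names as examples:
# Hive 2 ships an old guava (14.x) that lacks the method Hadoop 3 calls
rm $HIVE_HOME/lib/guava-*.jar
cp $HADOOP_HOME/share/hadoop/common/lib/guava-*.jar $HIVE_HOME/lib/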

  • To prevent jar conflicts in later steps, move some Spark- and Scala-related jars out of Hive's classpath
mkdir $HIVE_HOME/spark_jar
mv $HIVE_HOME/lib/spark-* $HIVE_HOME/spark_jar
rm $HIVE_HOME/lib/jackson-module-scala_2.11-2.6.5.jar

Note: Only the conflicting jars we encountered during testing are listed here. If you run into similar jar conflicts, use the classpath to work out which jars clash and remove them. When the same jar is present in conflicting versions, we recommend keeping the version found on the Spark classpath.
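
Before moving on to Spark, a quick smoke test of Hive can catch metastore problems early:
# Should list the default database without NoSuchMethodError or connection errors
$HIVE_HOME/bin/hive -e "show databases;"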

5) Deploy Spark Standalone
  • Download Spark 3.1.1 and extract it
wget http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
tar -xvf spark-3.1.1-bin-hadoop3.2.tgz -C /etc/hadoop
mv /etc/hadoop/spark-3.1.1-bin-hadoop3.2 /etc/hadoop/spark
export SPARK_HOME=/etc/hadoop/spark
  • Copy the jars required for connecting to S3 (the same aws-java-sdk-bundle and hadoop-aws jars used in the Hadoop step)
cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar $SPARK_HOME/jars/
cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar $SPARK_HOME/jars/
  • Copy the Hive configuration file and the mysql-jdbc jar
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf
cp mysql-connector-java-5.1.47.jar $SPARK_HOME/jars/
  • Start the Spark master and worker
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://hostname:7077
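
To confirm the standalone cluster is healthy before pointing Kylin at it, you can submit the SparkPi example bundled with the distribution; hostname is the same master address used above:
# A successful run prints a line like "Pi is roughly 3.14..."
$SPARK_HOME/bin/spark-submit \
  --master spark://hostname:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.1.1.jar 100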
6) Deploy a Zookeeper pseudo-cluster
  • Download the Zookeeper package and extract it
wget http://archive.apache.org/dist/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
tar -xvf zookeeper-3.4.13.tar.gz -C /etc/hadoop
mv /etc/hadoop/zookeeper-3.4.13 /etc/hadoop/zookeeper
  • Edit the Zookeeper configuration files to start a three-node pseudo-cluster
cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo1.cfg
cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo2.cfg
cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo3.cfg
  • Edit each of the three configuration files in turn, adding the following content (shown for zoo1.cfg; in zoo2.cfg and zoo3.cfg, change dataDir/dataLogDir to the zk2/zk3 paths and clientPort to 2182 and 2183 respectively):
server.1=localhost:2287:3387
server.2=localhost:2288:3388
server.3=localhost:2289:3389
dataDir=/tmp/zookeeper/zk1/data
dataLogDir=/tmp/zookeeper/zk1/log
clientPort=2181
  • Create the required directories and files
mkdir -p /tmp/zookeeper/zk1/data
mkdir -p /tmp/zookeeper/zk1/log
mkdir -p /tmp/zookeeper/zk2/data
mkdir -p /tmp/zookeeper/zk2/log
mkdir -p /tmp/zookeeper/zk3/data
mkdir -p /tmp/zookeeper/zk3/log
# Write each node's id into its myid file
echo 1 > /tmp/zookeeper/zk1/data/myid
echo 2 > /tmp/zookeeper/zk2/data/myid
echo 3 > /tmp/zookeeper/zk3/data/myid
  • Start the Zookeeper cluster
/etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo1.cfg
/etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo2.cfg
/etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo3.cfg
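
To verify the pseudo-cluster, check that each node is up and that a leader has been elected:
# Each command reports its node's Mode: one leader, two followers
/etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo1.cfg
/etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo2.cfg
/etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo3.cfg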
7) Start Kylin
  • Download the Kylin 4.0 binary package and extract it
wget https://mirror-hk.koddos.net/apache/kylin/apache-kylin-4.0.0/apache-kylin-4.0.0-bin.tar.gz
tar -xvf apache-kylin-4.0.0-bin.tar.gz -C /etc/hadoop
export KYLIN_HOME=/etc/hadoop/apache-kylin-4.0.0-bin
mkdir $KYLIN_HOME/ext
cp mysql-connector-java-5.1.47.jar $KYLIN_HOME/ext
  • Edit the configuration file: vim $KYLIN_HOME/conf/kylin.properties
kylin.metadata.url=kylin_metadata@jdbc,url=jdbc:mysql://hostname:3306/kylin,username=root,password=password,maxActive=10,maxIdle=10
kylin.env.zookeeper-connect-string=hostname
kylin.engine.spark-conf.spark.master=spark://hostname:7077
kylin.engine.spark-conf.spark.submit.deployMode=client
kylin.env.hdfs-working-dir=s3://bucket/kylin
kylin.engine.spark-conf.spark.eventLog.dir=s3://bucket/kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=s3://bucket/kylin/spark-history
kylin.query.spark-conf.spark.master=spark://hostname:7077
  • Run $KYLIN_HOME/bin/kylin.sh start
  • Kylin may report ClassNotFound-style errors at startup; they can be resolved as follows, after which restart Kylin:
# Download commons-collections-3.2.2.jar
cp commons-collections-3.2.2.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
# Download commons-configuration-1.3.jar
cp commons-configuration-1.3.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/aws-java-sdk-bundle-1.11.375.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/hadoop-aws-3.2.0.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
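
Once Kylin starts cleanly, a simple end-to-end check is to load the sample cube that ships with Kylin and query it from the web UI:
# Load the bundled sample model and cube
$KYLIN_HOME/bin/sample.sh
# Then open http://<ec2-public-ip>:7070/kylin (default login ADMIN/KYLIN),
# build the sample cube, and run a test query against it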