Apache Druid Data Formats


Druid can ingest denormalized data in JSON, CSV, or a delimited format such as TSV, and it also supports custom formats.

Although most of the documentation uses JSON, you can configure Druid to ingest a custom format as well.

1 Data formats currently supported by Druid

1)JSON

{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}

2)CSV

2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true","false","false","article","North America","United States","Bay Area","San Francisco",57,200,-143

3)TSV

2013-08-31T01:02:33Z    "Gypsy Danger"  "en"    "nuclear"   "true"  "true"  "false" "false" "article"   "North America" "United States" "Bay Area"  "San Francisco" 57  200 -143

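Note: CSV and TSV files must not contain a header row when Druid ingests them this way; the column names are supplied in the ingestion spec instead.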
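Because the CSV and TSV rows carry no header, the field names have to come from somewhere else. The following is a minimal Python sketch of that idea using only the standard library. The `COLUMNS` list is copied from the sample record above, and `parse_delimited_row` is a hypothetical helper written for this illustration, not part of Druid.

```python
import csv
import io

# Column names in the order the values appear in each row (the data has no header line).
COLUMNS = ["timestamp", "page", "language", "user", "unpatrolled", "newPage",
           "robot", "anonymous", "namespace", "continent", "country", "region",
           "city", "added", "deleted", "delta"]

CSV_ROW = ('2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true",'
           '"false","false","article","North America","United States",'
           '"Bay Area","San Francisco",57,200,-143')


def parse_delimited_row(line, columns=COLUMNS, delimiter=","):
    """Pair each value of a headerless delimited row with its column name."""
    values = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    return dict(zip(columns, values))


record = parse_delimited_row(CSV_ROW)
print(record["page"], record["delta"])  # Gypsy Danger -143
```

Passing `delimiter="\t"` handles a tab-separated row the same way. Note that every value comes back as a string here; type conversion is left out of the sketch.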

2 Custom formats

Druid supports defining custom data formats with either a regular-expression (regex) parser or a JavaScript parser.
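To illustrate what a regex-based parser does, here is a minimal Python sketch in which each capture group of a pattern is assigned, in order, to a column name. The line layout, the pattern, and the column names are all invented for this example; the code mimics the idea only and is not Druid's parser.

```python
import re

# Hypothetical custom format: "<timestamp> | <page> | <user> | <added>"
PATTERN = re.compile(r"^(\S+) \| (.+?) \| (\S+) \| (-?\d+)$")
COLUMNS = ["timestamp", "page", "user", "added"]


def parse_line(line):
    """Return a column-to-value dict, or None when the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    # Capture groups are positional, so they line up with COLUMNS one to one.
    return dict(zip(COLUMNS, match.groups()))


row = parse_line("2013-08-31T01:02:33Z | Gypsy Danger | nuclear | 57")
print(row)
```

Returning `None` for non-matching lines is a design choice of this sketch; a real ingestion pipeline would have to decide whether to skip or report such rows.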

Configuring the schema for data ingestion

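The data schema describes the data to be ingested: its data types, data columns, metric columns, dimension columns, time granularity, and related settings.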

Example: a schema for CSV data
"parseSpec": {
  "format": "csv",
  "timestampSpec": {
    "column": "timestamp"
  },
  "columns": ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
  "dimensionsSpec": {
    "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
  }
}

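parseSpec: specifies how the source data is parsed

format: specifies that the data is in CSV format

timestampSpec: specifies the name of the timestamp column

columns: lists the names of the data columns, in the order they appear in each row

dimensionsSpec: specifies which columns are dimensions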
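How these fields work together can be sketched in Python. The snippet below is an illustration under assumptions, not Druid code: `split_row` is a hypothetical helper, and treating every column that is neither the timestamp nor a dimension as a metric candidate is a simplification made for this example.

```python
import csv
import io

# The parseSpec from the example above, written as a Python dict.
PARSE_SPEC = {
    "format": "csv",
    "timestampSpec": {"column": "timestamp"},
    "columns": ["timestamp", "page", "language", "user", "unpatrolled",
                "newPage", "robot", "anonymous", "namespace", "continent",
                "country", "region", "city", "added", "deleted", "delta"],
    "dimensionsSpec": {
        "dimensions": ["page", "language", "user", "unpatrolled", "newPage",
                       "robot", "anonymous", "namespace", "continent",
                       "country", "region", "city"]
    },
}

CSV_ROW = ('2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true",'
           '"false","false","article","North America","United States",'
           '"Bay Area","San Francisco",57,200,-143')


def split_row(line, spec=PARSE_SPEC):
    """Split one headerless CSV row into timestamp, dimensions, and the rest."""
    values = next(csv.reader(io.StringIO(line)))
    row = dict(zip(spec["columns"], values))
    ts_column = spec["timestampSpec"]["column"]
    dim_names = spec["dimensionsSpec"]["dimensions"]
    dimensions = {name: row[name] for name in dim_names}
    # Assumption of this sketch: whatever is left over is a metric candidate.
    others = {name: value for name, value in row.items()
              if name != ts_column and name not in dim_names}
    return row[ts_column], dimensions, others


timestamp, dimensions, others = split_row(CSV_ROW)
print(timestamp)  # 2013-08-31T01:02:33Z
print(others)     # {'added': '57', 'deleted': '200', 'delta': '-143'}
```

The leftover columns `added`, `deleted`, and `delta` are the numeric fields of the sample record, which is why they are not listed under `dimensions`.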