Druid Data Ingestion: Data Formats

Druid can ingest denormalized data in JSON, CSV, delimited formats such as TSV, or any custom format.

This tutorial lists all of the default and core-extension data formats supported by Druid.

1 Formatted Data

  • JSON
{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "cancer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
  • CSV
2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true","false","false","article","North America","United States","Bay Area","San Francisco",57,200,-143
2013-08-31T03:32:45Z,"Striker Eureka","en","speed","false","true","true","false","wikipedia","Australia","Australia","Cantebury","Syndey",459,129,330
2013-08-31T07:11:21Z,"Cherno Alpha","ru","masterYi","false","true","true","false","article","Asia","Russia","Oblast","Moscow",123,12,111
2013-08-31T11:58:39Z,"Crimson Typhoon","zh","triplets","true","false","true","false","wikipedia","Asia","China","Shanxi","Taiyuan",905,5,900
2013-08-31T12:41:27Z,"Coyote Tango","ja","cancer","true","false","true","false","wikipedia","Asia","Japan","Kanto","Tokyo",1,10,-9
  • TSV
2013-08-31T01:02:33Z  "Gypsy Danger"  "en"  "nuclear" "true"  "true"  "false" "false" "article" "North America" "United States" "Bay Area"  "San Francisco" 57  200 -143
2013-08-31T03:32:45Z  "Striker Eureka"  "en"  "speed" "false" "true"  "true"  "false" "wikipedia" "Australia" "Australia" "Cantebury" "Syndey"  459 129 330
2013-08-31T07:11:21Z  "Cherno Alpha"  "ru"  "masterYi"  "false" "true"  "true"  "false" "article" "Asia"  "Russia"  "Oblast"  "Moscow"  123 12  111
2013-08-31T11:58:39Z  "Crimson Typhoon" "zh"  "triplets"  "true"  "false" "true"  "false" "wikipedia" "Asia"  "China" "Shanxi"  "Taiyuan" 905 5 900
2013-08-31T12:41:27Z  "Coyote Tango"  "ja"  "cancer"  "true"  "false" "true"  "false" "wikipedia" "Asia"  "Japan" "Kanto" "Tokyo" 1 10  -9

Note that the CSV and TSV data files do not contain a header row. In addition to text formats, Druid also supports binary formats such as ORC and Parquet.

2 Custom Formats

Druid supports custom data formats. Files in these formats can be parsed with the Regex parser or the JavaScript parser.

1) InputFormat

All forms of Druid ingestion require a schema object.

The format of the data to be ingested is specified by the inputFormat field in ioConfig:

  • JSON

An example inputFormat for loading JSON-formatted data:

"ioConfig": {
  "inputFormat": {
    "type": "json"
  },
  ...
}

The JSON inputFormat has the following components:

| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Set to json | yes |
| flattenSpec | JSON object | Specifies flattening configuration for nested JSON data. | no |
| featureSpec | JSON object | JSON parse features supported by the Jackson library. These features are applied when parsing the input JSON data. | no |
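
The featureSpec entry is not shown in the example above. As a rough sketch (not taken from the official examples), assuming featureSpec is a map from Jackson JsonParser.Feature names to booleans, enabling comments and unquoted field names in the input JSON might look like this:

"ioConfig": {
  "inputFormat": {
    "type": "json",
    "featureSpec": {
      "ALLOW_COMMENTS": true,
      "ALLOW_UNQUOTED_FIELD_NAMES": true
    }
  },
  ...
}

The feature names here come from Jackson's JsonParser.Feature enum; treat the exact keys and the map shape as an assumption and verify them against the Jackson and Druid documentation.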

  • CSV

An example inputFormat for loading CSV-formatted data:

"ioConfig": {
  "inputFormat": {
    "type": "csv",
    "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"]
  },
  ...
}

The CSV inputFormat has the following components:

| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Set to csv | yes |
| listDelimiter | String | A custom delimiter for multi-value dimensions | no (default Ctrl+A) |
| columns | JSON array | Specifies the columns of the data. The order of the columns must match the order of the columns in the data. | yes, if findColumnsFromHeader is false or missing |
| findColumnsFromHeader | Boolean | If set, the task finds the column names from the header row. Note that skipHeaderRows is applied before the column names are read from the header: for example, if skipHeaderRows is set to 2 and findColumnsFromHeader is set to true, the task skips the first two rows and then extracts the column information from the third row. columns is ignored if this is set to true. | no (defaults to false if columns is set; otherwise null) |
| skipHeaderRows | Integer | If set, the task skips the first skipHeaderRows rows | no (default 0) |
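
If the CSV file does contain a header row (unlike the sample data above), the columns list can be omitted and read from the header instead. The sketch below, based on the skipHeaderRows/findColumnsFromHeader description in the table, skips two leading rows and then takes the column names from the third row:

"ioConfig": {
  "inputFormat": {
    "type": "csv",
    "findColumnsFromHeader": true,
    "skipHeaderRows": 2
  },
  ...
}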

  • TSV

An example inputFormat for loading TSV-formatted data:

"ioConfig": {
  "inputFormat": {
    "type": "tsv",
    "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
    "delimiter":"|"
  },
  ...
}

The TSV inputFormat has the following components:

| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Set to tsv | yes |
| delimiter | String | A custom delimiter for data values | no (default \t) |
| listDelimiter | String | A custom delimiter for multi-value dimensions | no (default Ctrl+A) |
| columns | JSON array | Specifies the columns of the data. The order of the columns must match the order of the columns in the data. | yes, if findColumnsFromHeader is false or missing |
| findColumnsFromHeader | Boolean | If set, the task finds the column names from the header row. Note that skipHeaderRows is applied before the column names are read from the header: for example, if skipHeaderRows is set to 2 and findColumnsFromHeader is set to true, the task skips the first two rows and then extracts the column information from the third row. columns is ignored if this is set to true. | no (defaults to false if columns is set; otherwise null) |
| skipHeaderRows | Integer | If set, the task skips the first skipHeaderRows rows | no (default 0) |
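
The listDelimiter option is easiest to see with a concrete sketch. Assuming a hypothetical multi-value cell such as en|ja in the language column, the configuration below keeps the tab delimiter between columns explicit and uses | to split values within a single column:

"ioConfig": {
  "inputFormat": {
    "type": "tsv",
    "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
    "delimiter": "\t",
    "listDelimiter": "|"
  },
  ...
}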

  • ORC

An example inputFormat for loading ORC-formatted data:

"ioConfig": {
  "inputFormat": {
    "type": "orc",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        {
          "type": "path",
          "name": "nested",
          "expr": "$.path.to.nested"
        }
      ]
    },
    "binaryAsString": false
  },
  ...
}

The ORC inputFormat has the following components:

| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Set to orc | yes |
| flattenSpec | JSON object | Specifies flattening configuration for nested ORC data. | no |
| binaryAsString | Boolean | Specifies whether binary ORC columns that are not logically marked as strings should be treated as UTF-8 encoded strings. | no (default false) |

  • Parquet

An example inputFormat for loading Parquet-formatted data:

"ioConfig": {
  "inputFormat": {
    "type": "parquet",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        {
          "type": "path",
          "name": "nested",
          "expr": "$.path.to.nested"
        }
      ]
    },
    "binaryAsString": false
  },
  ...
}

The Parquet inputFormat has the following components:

| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Set to parquet | yes |
| flattenSpec | JSON object | Defines a flattenSpec to extract nested values from a Parquet file. Note that only "path" expressions are supported ("jq" is unavailable). | no (by default, root-level properties are auto-discovered) |
| binaryAsString | Boolean | Specifies whether binary Parquet columns that are not logically marked as strings should be treated as UTF-8 encoded strings. | no (default false) |

  • FlattenSpec

The flattenSpec, located at inputFormat -> flattenSpec, bridges the gap between potentially nested input data (such as JSON or Avro) and Druid's flat data model.

An example flattenSpec:

"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "name": "baz", "type": "root" },
    { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" },
    { "name": "first_food", "type": "jq", "expr": ".thing.food[1]" }
  ]
}

The flattenSpec has the following components:

| Field | Description | Default |
|---|---|---|
| useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by timestampSpec, transformSpec, dimensionsSpec, and metricsSpec. If false, only explicitly specified fields are available for use. | true |
| fields | Specifies the fields of interest and how they are accessed. | [] |
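
To make the placement concrete, here is a sketch (assembled from the snippets above, not an official example) that embeds this flattenSpec inside a JSON inputFormat within ioConfig:

"ioConfig": {
  "inputFormat": {
    "type": "json",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        { "name": "baz", "type": "root" },
        { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" },
        { "name": "first_food", "type": "jq", "expr": ".thing.food[1]" }
      ]
    }
  },
  ...
}

After flattening, baz, foo_bar, and first_food can be referenced like ordinary top-level fields in timestampSpec, transformSpec, dimensionsSpec, and metricsSpec.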

For more custom data formats, refer to the official Druid website.