Transferring Data - noonecare/opensourcebigdatatools GitHub Wiki

flume

Flume is designed to move streaming data into HDFS.

A Flume configuration file describes the sources, sinks, and channels of an agent. Point flume-ng at the configuration file and start the agent, for example:

flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

Flume then starts collecting data from the sources and writing it to the sinks.

Below is the first configuration file I got working; it reads data from nc (netcat) and writes it to HDFS.

example.conf

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /user/wangmeng/flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
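
With the agent from example.conf running, test events can be pushed into the netcat source from another terminal. This is a minimal sketch; it assumes nc is available and that the HDFS path above is writable:

# start the agent (same command as above)
flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

# in another terminal, send a test event to the netcat source
echo "hello flume" | nc localhost 44444

# events should show up under the configured HDFS path
hdfs dfs -ls /user/wangmeng/flume/events/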

An important concept in Flume is the event: the unit of data that flows through an agent, consisting of a byte payload plus an optional set of string headers.

The main work in Flume is configuring sources, sinks, and channels. Commonly used sources and sinks are listed below.

sources

  • log file (via the exec or taildir source; see the sketch after this list)
  • netcat
  • kafka
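
A minimal sketch of a log-file source and a Kafka source. The file path, Kafka broker, and topic below are hypothetical, and the Kafka property names follow Flume 1.7+ (older releases use different names):

# exec source tailing a log file (replace the path with your own)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/myapp/app.log
a1.sources.r1.channels = c1

# Kafka source
a1.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r2.kafka.bootstrap.servers = localhost:9092
a1.sources.r2.kafka.topics = my-topic
a1.sources.r2.channels = c1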

sink

  • hdfs
  • console (the logger sink; see the sketch after this list)
  • file (the file_roll sink)
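
A minimal sketch of the logger sink (events go to the agent's log, which shows on the console with -Dflume.root.logger=INFO,console) and the file_roll sink; the local directory below is hypothetical:

# logger sink
a1.sinks.k2.type = logger
a1.sinks.k2.channel = c1

# file_roll sink: writes events to files in a local directory
a1.sinks.k3.type = file_roll
a1.sinks.k3.sink.directory = /tmp/flume-out
a1.sinks.k3.channel = c1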

channel

The memory channel used in example.conf is fast but loses buffered events if the agent process dies; the file channel persists events on disk at the cost of some throughput.
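
A minimal file channel sketch, assuming the local directories below exist and are writable (the paths are hypothetical):

a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data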

sqoop

Sqoop is designed to move bulk data, typically from relational databases, into HDFS.
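
A minimal sketch of a Sqoop import; the MySQL connection string, credentials, and table name are hypothetical:

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username wangmeng \
  -P \
  --table orders \
  --target-dir /user/wangmeng/sqoop/orders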