Analytics/Kraken/Firehose
This is a place to brainstorm ways to reliably get data into Kraken HDFS from the udp2log multicast firehose stream in the short term. We want data nooowww!
This does not replace our desired final Kraken architecture, nor the proposal outlined at Request Logging. This is meant to be a place to list the ways we have tried to get data into Kraken reliably, and other ways we have yet to try.
Pipeline Components
Our current goal is to get large UDP webrequest stream(s) into HDFS. There are a bunch of components we can use to build a pipeline to do so.
Sources / Producers
- udp2log
- Flume UDPSource (custom)
- Flume Spooling Directory Source
- KafkaProducer Shell (kafka-console-producer)
- Ori's UDPKafka
Agents / Buffers / Brokers
- Flume Memory Channel (volatile)
- Flume File Channel
- KafkaBroker
- plain old files
Sinks / Consumers
- Flume HDFS Sink
- kafka-hadoop-consumer (3rd party, has Zookeeper support)
- Kafka HadoopConsumer (ships with Kafka, no Zookeeper support)
- plain old cron jobs + hadoop fs -put
Possible Pipelines
udp2log -> KafkaProducer shell -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
This is our main solution and works most of the time, but it drops data. udp2log and producers are currently running on an03, an04, an05, and an06, and kafka-hadoop-consumer runs as a cron job on an02.
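As a rough sketch, the producer side amounts to a udp2log filter that pipes each log line into the console producer, with the consumer run periodically from cron. The filter syntax follows udp2log's config, but the producer flags, hostnames, topic name, and the consumer invocation below are assumptions (producer flags also differ between Kafka versions):

 # udp2log filter: pipe the unsampled stream into the Kafka console producer.
 # ZooKeeper host, topic, and flags are illustrative; newer Kafka versions
 # take --broker-list instead of --zookeeper.
 pipe 1 /usr/bin/kafka-console-producer.sh --zookeeper an-zookeeper:2181 --topic webrequest

 # On an02, a cron entry runs the 3rd-party kafka-hadoop-consumer periodically;
 # the exact arguments depend on that tool and are omitted here.
 15 * * * * hdfs /usr/local/bin/kafka-hadoop-consumer ... >> /var/log/kafka-hadoop-consumer.log 2>&1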
Flume UDPSource -> HDFS
udp2log -> files + logrotate -> Flume Spooling Directory Source -> HDFS
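If we went this route, the Flume side would look roughly like the agent below. The spooldir/file/hdfs component types and property names are standard Flume 1.x; the agent name, directories, and HDFS path are illustrative assumptions.

 # Hypothetical Flume 1.x agent: spooling-directory source -> file channel -> HDFS sink.
 # logrotate must move only *completed* files into the spool directory.
 agent1.sources  = spool1
 agent1.channels = ch1
 agent1.sinks    = sink1

 # Watch the directory logrotate writes finished files into (assumed path).
 agent1.sources.spool1.type     = spooldir
 agent1.sources.spool1.spoolDir = /var/spool/udp2log
 agent1.sources.spool1.channels = ch1

 # File channel gives durable on-disk buffering between source and sink.
 agent1.channels.ch1.type = file

 # Write events to HDFS as plain text; the target path is illustrative.
 agent1.sinks.sink1.type          = hdfs
 agent1.sinks.sink1.hdfs.path     = hdfs://namenode/wmf/raw/webrequest
 agent1.sinks.sink1.hdfs.fileType = DataStream
 agent1.sinks.sink1.channel       = ch1

 # Started with something like:
 #   flume-ng agent -n agent1 -c /etc/flume/conf -f /etc/flume/conf/spool-to-hdfs.properties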
udp2log -> files + logrotate -> cron hadoop fs -put -> HDFS
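The simplest fallback needs nothing but cron and the Hadoop CLI. A minimal sketch, assuming logrotate leaves gzipped files in an archive directory; all paths, the schedule, and the helper script name are assumptions.

 # Hypothetical crontab entry: push rotated files into HDFS once an hour.
 # 0 * * * * hdfs /usr/local/bin/put-webrequest-logs.sh

 #!/bin/bash
 # put-webrequest-logs.sh (sketch): upload each rotated file, then move it
 # aside so the next run does not upload it again.
 set -e
 SRC=/var/log/udp2log/archive   # where logrotate leaves finished files (assumed)
 DEST=/wmf/raw/webrequest       # HDFS target directory (assumed)
 mkdir -p "$SRC/done"
 for f in "$SRC"/*.gz; do
     [ -e "$f" ] || continue    # no rotated files yet
     hadoop fs -put "$f" "$DEST/" && mv "$f" "$SRC/done/"
 done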
UDPKafka -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
Storm Pipeline
The ideal pipeline is probably still the originally proposed architecture, which involves modifying frontend production nodes as well as using Storm.
Native KafkaProducers -> Loadbalancer (LVS?) -> KafkaBroker -> Storm Kafka Spout -> Storm ETL -> Storm HDFS writer -> HDFS
Links
- Kafka Spout - a Kafka consumer that emits Storm tuples.
- Storm State - Storm libraries for continually persisting a collection (map or list) to HDFS. I don't think this would fit our needs for writing to Hadoop, as I believe it wants to save serialized Java objects.
- HDFS API Docs for FileSystem.append()