Analytics/Kraken/Firehose
This is a place to brainstorm ways to reliably get data into Kraken HDFS from the udp2log multicast firehose stream in the short term. We want data nooowww!
This does not replace our desired final Kraken architecture, nor the proposal outlined at Request Logging. This is meant to be a place to list the ways we have tried to get data into Kraken reliably, and other ways we have yet to try.
Pipeline Components
Our current goal is to get large UDP webrequest stream(s) into HDFS. There are a bunch of components we can use to build a pipeline to do so.
Sources / Producers
- udp2log
- Flume UDPSource (custom)
- Flume Spooling Directory Source
- KafkaProducer Shell (kafka-console-producer)
- Ori's UDPKafka
Agents / Buffers / Brokers
- Flume Memory Channel (volatile)
- Flume File Channel
- KafkaBroker
- plain old files
Sinks / Consumers
- Flume HDFS Sink
- kafka-hadoop-consumer (3rd party, has Zookeeper support)
- Kafka HadoopConsumer (ships with Kafka, no Zookeeper support)
- plain old cron jobs + hadoop fs -put
Possible Pipelines
udp2log -> KafkaProducer shell -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
This is our main solution and works most of the time, but it drops data. udp2log and producers are currently running on an03, an04, an05, and an06, and kafka-hadoop-consumer runs as a cron job on an02.
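As a rough sketch, the producer side amounts to a udp2log filter that pipes each log line into the console producer, with the consumer run periodically from cron. The filter syntax follows udp2log's config, but the producer flags, hostnames, topic name, and the consumer invocation below are assumptions (producer flags also differ between Kafka versions):

 # udp2log filter: pipe the unsampled stream into the Kafka console producer.
 # ZooKeeper host, topic, and flags are illustrative; newer Kafka versions
 # take --broker-list instead of --zookeeper.
 pipe 1 /usr/bin/kafka-console-producer.sh --zookeeper an-zookeeper:2181 --topic webrequest

 # On an02, a cron entry runs the 3rd-party kafka-hadoop-consumer periodically;
 # the exact arguments depend on that tool and are omitted here.
 15 * * * * hdfs /usr/local/bin/kafka-hadoop-consumer ... >> /var/log/kafka-hadoop-consumer.log 2>&1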
Flume UDPSource -> HDFS
udp2log -> files + logrotate -> Flume Spooling Directory Source -> HDFS
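If we went this route, the Flume side would look roughly like the agent below. The spooldir/file/hdfs component types and property names are standard Flume 1.x; the agent name, directories, and HDFS path are illustrative assumptions.

 # Hypothetical Flume 1.x agent: spooling-directory source -> file channel -> HDFS sink.
 # logrotate must move only *completed* files into the spool directory.
 agent1.sources  = spool1
 agent1.channels = ch1
 agent1.sinks    = sink1

 # Watch the directory logrotate writes finished files into (assumed path).
 agent1.sources.spool1.type     = spooldir
 agent1.sources.spool1.spoolDir = /var/spool/udp2log
 agent1.sources.spool1.channels = ch1

 # File channel gives durable on-disk buffering between source and sink.
 agent1.channels.ch1.type = file

 # Write events to HDFS as plain text; the target path is illustrative.
 agent1.sinks.sink1.type          = hdfs
 agent1.sinks.sink1.hdfs.path     = hdfs://namenode/wmf/raw/webrequest
 agent1.sinks.sink1.hdfs.fileType = DataStream
 agent1.sinks.sink1.channel       = ch1

 # Started with something like:
 #   flume-ng agent -n agent1 -c /etc/flume/conf -f /etc/flume/conf/spool-to-hdfs.properties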
udp2log -> files + logrotate -> cron hadoop fs -put -> HDFS
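The simplest fallback needs nothing but cron and the Hadoop CLI. A minimal sketch, assuming logrotate leaves gzipped files in an archive directory; all paths, the schedule, and the helper script name are assumptions.

 # Hypothetical crontab entry: push rotated files into HDFS once an hour.
 # 0 * * * * hdfs /usr/local/bin/put-webrequest-logs.sh

 #!/bin/bash
 # put-webrequest-logs.sh (sketch): upload each rotated file, then move it
 # aside so the next run does not upload it again.
 set -e
 SRC=/var/log/udp2log/archive   # where logrotate leaves finished files (assumed)
 DEST=/wmf/raw/webrequest       # HDFS target directory (assumed)
 mkdir -p "$SRC/done"
 for f in "$SRC"/*.gz; do
     [ -e "$f" ] || continue    # no rotated files yet
     hadoop fs -put "$f" "$DEST/" && mv "$f" "$SRC/done/"
 done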
UDPKafka -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
Storm Pipeline
The ideal pipeline is probably still the originally proposed architecture, which involves modifying frontend production nodes as well as using Storm.
Native KafkaProducers -> Loadbalancer (LVS?) -> KafkaBroker -> Storm Kafka Spout -> Storm ETL -> Storm HDFS writer -> HDFS
Links
- Kafka Spout - a Kafka consumer that emits Storm tuples.
- Storm State - Storm libraries for continually persisting a collection (map or list) to HDFS. I don't think this would fit our needs for writing to Hadoop, as I believe it wants to save serialized Java objects.
- HDFS API Docs for FileSystem.append()