Analytics/Kraken/Hadoop Tools
This page is meant as a bucket for tips and notes on using the Hadoop toolchain.
Pig
Pig is a dataflow language adept at analyzing unstructured data, converting it into a regular structure.
Best Practices
- Push filters up even if you have to reprocess fields.
- Drop all unneeded fields as soon as you can.
- Syntax-check your script locally; use DESCRIBE to understand data shape. (See the sketch after this list.)
- Don't explicitly set PARALLEL unless you have a reason. It only affects reduce operations, and Pig has heuristics based on data size for that; you are unlikely to help performance, and forcing PARALLEL higher than necessary uses up more slots and increases job overhead.
- Don't Be Afraid To:
  - ...Run stupid, half-broken jobs from a grunt shell just to see the data shape or test your UDF-tinkertoys. (Just please save to your home dir.)
  - ...DUMP after a complex statement to check output shape.
  - ...Use Hive to check your data; its aggregation and filtering features are way more advanced, and SQL is easy.
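To make the first three tips concrete, here's a minimal sketch; the input path and schema are invented for illustration:

  -- Hypothetical request logs; path and field names are illustrative.
  logs = LOAD '/user/me/sample-logs' USING PigStorage('\t')
      AS (ts:chararray, url:chararray, status:int, bytes:long);

  -- Filter early, then drop every field the rest of the job doesn't need.
  ok   = FILTER logs BY status == 200;
  slim = FOREACH ok GENERATE url, bytes;
  DESCRIBE slim;    -- confirm the shape before building on it

  grouped = GROUP slim BY url;
  totals  = FOREACH grouped GENERATE group AS url, SUM(slim.bytes) AS total_bytes;
  STORE totals INTO '/user/me/url-totals';    -- note: no explicit PARALLEL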
Reuse & Metaprogramming
- Macros aren't really all that interesting, as they're severely limited in scope.
- Be careful with exec and run -- they're more flexible and powerful, but very expensive, as every call spawns more MR jobs!
- Parameters are inserted via literal string substitution, which lets you do some pretty wacky metaprogramming via workflows. This means you basically have meta-macros. See rollup.pig as an example, and the sketch below.
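The real rollup.pig isn't reproduced here, but a hedged sketch of the same idea: because $FIELD is pasted in as literal text, the caller picks which column the script groups on -- effectively a meta-macro:

  -- rollup-sketch.pig: $INPUT, $FIELD, and $OUTPUT arrive via literal substitution.
  data    = LOAD '$INPUT' AS (dt:chararray, country:chararray, hits:long);
  grouped = GROUP data BY $FIELD;
  rolled  = FOREACH grouped GENERATE group AS $FIELD, SUM(data.hits) AS hits;
  STORE rolled INTO '$OUTPUT';

Invoked as, e.g., pig -p INPUT=/user/me/counts -p FIELD=country -p OUTPUT=/user/me/rollup rollup-sketch.pig.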
Gotchas
- A MATCHES regexp must match the whole input (see the sketch after this list).
- Don't expect relations to stay sorted -- derived relations after a reduce-op (namely, GROUP) need re-ordering!
- Implicit coercion from a bag of single-tuples to a scalar ONLY works for relations! -- it does NOT work for grouped records or other actual bags.
- Escaping quotes in a parameter is impossible. I swear it. It's worse than several layers of `ssh box -- bash -lc "eval '$WAT'"`. I gave up; no combination of backslashes and quotes made any difference.
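The MATCHES gotcha in code form, reusing the hypothetical logs relation from the sketch above:

  wrong = FILTER logs BY url MATCHES 'wiki';       -- only matches urls that are exactly 'wiki'
  right = FILTER logs BY url MATCHES '.*wiki.*';   -- substring match needs explicit wildcards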
Oozie
[edit]Best Practices
- Test each layer, work outward; I always make a $JOB-wf.properties to test the workflow alone before moving on to the coordinator (with a test-$JOB-coord.properties and a $JOB-coord.properties).
- Everything can formally declare parameters using a <parameters> block at the beginning. DO IT! And avoid pointless defaults -- better to fail early. (See the sketch after this list.)
- Check the xmlns on your root element!
  - Coordinators: xmlns="uri:oozie:coordinator:0.4"
  - Workflows: xmlns="uri:oozie:workflow:0.4"
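A minimal workflow skeleton with formal parameters; the parameter names are illustrative. With no <value> defaults, a missing parameter fails at submission instead of mid-run:

  <workflow-app name="${jobName}" xmlns="uri:oozie:workflow:0.4">
    <!-- Declared up front, no defaults: omissions fail early. -->
    <parameters>
      <property><name>jobQueue</name></property>
      <property><name>outputDir</name></property>
    </parameters>
    <start to="done"/>
    <end name="done"/>
  </workflow-app>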
Workflows
- Know your <action> options: control flow, sub-workflow, fs, shell, java, and streaming all have uses!
- Sub-workflows are like functions -- compose and reuse!
- <prepare> should probably be in all our jobs -- delete the output dir before starting work to ensure you don't pointlessly fail due to temporary cruft.
- <global> allows you to set properties for all actions. All jobs should set job-tracker and name-node here. (Both <prepare> and <global> are sketched below.)
- job.xml(s) -- they cascade -- will be useful once we start profiling and tuning jobs. Save those tweaks together as job-confs for similarly structured jobs to reuse!
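A sketch combining <global> and <prepare>; the action name, script, and paths are illustrative:

  <workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
    <global>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
    </global>
    <start to="run-pig"/>
    <action name="run-pig">
      <pig>
        <prepare>
          <delete path="${outputDir}"/>  <!-- clear cruft from earlier failed runs -->
        </prepare>
        <script>myjob.pig</script>
      </pig>
      <ok to="done"/>
      <error to="fail"/>
    </action>
    <kill name="fail">
      <message>Pig action failed</message>
    </kill>
    <end name="done"/>
  </workflow-app>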
Coordinators
- <dataset> initial instances should always predate the job. This only restricts the possible valid results; it doesn't dictate anything about where the job starts.
- Always create coordinator parameters for jobStart, jobEnd, jobName, and jobQueue! This lets you easily fire off the job as a backfill in a different queue, run one-off instances of the job, etc. (See the sketch after this list.)
- datasets.xml lets you share <dataset> definitions. It's worth investigating as the number of jobs grows.
- Chaining datasets between coordinators is fussy. I haven't seen it worth the energy so far.
Gotchas
- Some workflow action elements are order-sensitive (!!). Ex: <configuration> must come before <script> in a <pig> action, and yes, the error message is oblique and unhelpful.
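For reference, the ordering that validates:

  <pig>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${jobQueue}</value>
      </property>
    </configuration>
    <script>myjob.pig</script>
    <!-- Put <script> above <configuration> and validation fails with that oblique error. -->
  </pig>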
Hadoop
- All jobs keep performance counters and stats. These can be extremely helpful for improving job speed.
- Be familiar with the hdfs shell tool -- it's a lot more expressive than you might expect.
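A few of the less obvious invocations (the paths and job id are made up):

  hdfs dfs -du -h /user/me                        # per-entry sizes, human-readable
  hdfs dfs -count -q /user/me                     # quota and usage summary
  hdfs dfs -text /user/me/part-r-00000.gz         # decompresses while printing
  hdfs dfs -getmerge /user/me/output merged.tsv   # concatenate part files to a local file
  mapred job -status job_201301011234_0001        # print a job's counters and stats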
Tutorials and Guides
[edit]- The Coalesce Workflow: Concatenate and rename job results
- The Rollup Workflow: Aggregate a result field into a rollup