Azkaban and Docker

February 23, 2017 | Engineering

At ShareThis, we don’t like creating extra work for ourselves. Managing terabytes and petabytes of data in real time is already hard enough. As such, we aggressively look for ways to work faster and more efficiently. This means evaluating different technologies and automating as much as possible. It also involves rapid prototyping and getting things up and running with as little code as possible.

Recently, I wanted to test out Azkaban to schedule job flows quickly and easily. However, I did not want to spend a lot of time bringing the system up and down while testing, and I wanted to keep track of all the changes so that I could build and destroy the system at will. This sounded like a great time to pull out my Docker hat and use docker-compose. It also let me brush up on bash and curl.

The first step was to write a Dockerfile that sets up a single-node application, giving me a first go at using the system:



FROM java:7

COPY azkaban-solo-server-2.5.0.tar.gz /azkaban-solo-server-2.5.0.tar.gz
RUN tar -xf /azkaban-solo-server-2.5.0.tar.gz
RUN apt-get update
RUN apt-get install -y zip
ADD flows /azkaban-solo-2.5.0/flows
ADD run.sh /
ADD jq /usr/bin/jq

CMD /azkaban-solo-2.5.0/bin/azkaban-solo-start.sh
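If you want to sanity-check the image without docker-compose, a plain build and run works too (this assumes the Dockerfile sits in the solo/ directory that the compose file below points at):

$> docker build -t azkaban-solo solo/
$> docker run -p 8081:8081 azkaban-solo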

Then docker-compose could be used to bring the whole thing up, so I wrote a dc-solo.yml file:



azkaban:
  build: solo/.
  ports:
    - 8081:8081

Now I can:

$> docker-compose -f dc-solo.yml build
$> docker-compose -f dc-solo.yml up
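Before poking at the UI, a quick curl confirms the server is answering (this assumes the 8081 port mapping above; it should print 200 once Azkaban is up):

$> curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8081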

Azkaban is running! I played with the UI and realized that if this was going to work, we would need flows saved in GitHub and auto-loaded. This led me to the second part of the process: getting flows from GitHub into Azkaban. To do this, I started hacking at Azkaban’s APIs using curl (I would never advise this for a finished project, but for iterating quickly, it works). I also got to know a nice JSON tool: jq.

First, let’s write a function that saves a session token to a variable called FCRED:

#
# getSession simply creates a session with default credentials.
#
getSession () {
    CRED=$(curl -k -X POST --data "action=login&username=azkaban&password=azkaban" $PROD)
    while [ $? -ne 0 ]; do
        sleep 1
        CRED=$(curl -k -X POST --data "action=login&username=azkaban&password=azkaban" $PROD)
    done
    FCRED=$(echo $CRED | jq '."session.id"' | sed s/\"//g)
}
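These snippets assume a PROD variable pointing at the Azkaban endpoint, which I never show being set; against the solo container above it would be something like:

PROD=http://localhost:8081
getSession
echo "session id: $FCRED"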

Then, let’s create a project in Azkaban:

#
# createProject creates a project in Azkaban.
#
# $1 The name of the project.
# $2 The description of the project.
#
createProject () {
    RESP=$(curl -k -X POST --data "session.id=$FCRED&name=${1}&description=$2" $PROD/manager?action=create)
}
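For example, creating a hypothetical project named myflow:

createProject myflow "My first Azkaban flow"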

You can upload a zip file with your flow to your project:

uploadZip () {
    RESP=$(curl -k -H "Content-Type: multipart/mixed" -X POST --form "session.id=$1" --form "ajax=upload" --form "file=@$2;type=application/zip" --form "project=$3" $PROD/manager)
    PROJECTID=$(echo $RESP | jq '.projectId' | sed s/\"//g)
}
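It takes the session id, the zip file, and the project name; with a hypothetical myflow.zip containing your .job files:

uploadZip $FCRED myflow.zip myflow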

Even schedule it:

schedule () {
    RESP=$(curl -k "localhost:8081/schedule?ajax=scheduleFlow&session.id=$1&projectName=$2&flow=$3&projectId=$4&scheduleTime=$5&scheduleDate=$6&is_recurring=on&period=$7")
    echo "scheduling: $?"
    echo $RESP
    echo $RESP | jq '.'
}
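The time, date, and period arguments follow Azkaban’s scheduleFlow formats; a sketch of a daily noon schedule with illustrative values (PROJECTID comes from uploadZip above):

schedule $FCRED myflow start $PROJECTID "12,00,pm,PDT" "02/23/2017" 1d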

I ended up just iterating through the directory tree, zipping the directories that had Azkaban job files and uploading them.

#
# uploadFlow will zip up the contents of each project directory and upload the flow.
#
# $1 The name of the project, which corresponds to a directory.
#
uploadFlow () {
    proj=$1
    rm -f $proj.zip
    zip $proj.zip $proj/*
    uploadZip $FCRED $proj.zip $proj
}

#
# Main Script
#
getSession
for dir in $(ls -d */); do
    proj=${dir%%/}
    desc=$(cat ${dir}description.txt)
    createProject $proj "$desc"
    uploadFlow $proj
done
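For this loop to work, each project directory needs a description.txt plus at least one Azkaban .job file; a minimal hypothetical layout and job definition:

flows/
  myflow/
    description.txt
    start.job

# start.job: a bare-bones command-type job
type=command
command=echo "hello from azkaban"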

This is not the end. These Azkaban flows are proving themselves useful in an MVP fashion, and I’ve started expanding the docker-compose recipe so that we are backed by Amazon’s RDS. Do you see how I’m saving myself work by not running the database myself? I love the cloud! Here’s my multi-node docker-compose (my working dc-full.yml for staging in a local environment; for RDS, I replace the mysql docker image with a real network endpoint):


mysql:
  image: mysql
  environment:
    - MYSQL_ROOT_PASSWORD=root
    - MYSQL_DATABASE=azkaban
    - MYSQL_USER=azkaban
    - MYSQL_PASSWORD=azkaban
  volumes:
    - /mnt/mysql/azkaban:/var/lib/mysql
executor:
  build: exec/.
  links:
    - mysql
  ports:
    - 12321
web:
  build: web/.
  links:
    - mysql
    - executor
  ports:
    - 8081:8081
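The same commands as before bring the full stack up, just pointed at the new file:

$> docker-compose -f dc-full.yml build
$> docker-compose -f dc-full.yml up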

That’s it for now. We’ll see if this proves out at ShareThis and then continue to iterate on it. One day, it might run all of our automated pipelines. If you like iterating quickly and hate processes that clog up dev time, then please join us!

