Back when I was working at Hortonworks as a Solutions Engineer, I had access to plenty of environments to test things out; that was great, and it helped a lot when working with customers on demanding projects and use cases.
However, I was often asked very specific, low-level questions about Hadoop's core; some related to HDFS and its CLI, others to YARN or MapReduce itself. The environments I was used to working with were full Hortonworks Data Platform (HDP) deployments, loaded with artifacts to showcase multiple use cases. The good ones lived in the cloud, while the local demo environments had so-so performance. This was not ideal and very limiting: to work on those customer requests, I had to either (1) have good connectivity to the cloud environments or (2) lose my nerves working with the local ones. (1) may sound obvious to most of you but, as an employee working in the “trenches”, I spent most of my time in train stations, airports or other places where connectivity was not reliable at all 😦
A few months back (late 2018) I was delivering a Hadoop course for the Master in Business Analytics & Big Data at IE, and I struggled to let my students play around with Hadoop as much as I wished. I thought many times: “wouldn’t it be nice if they could use Hadoop the way we use any local application on our laptops on a daily basis?”. We’re using computers with multiple cores and plenty of RAM; that should be doable.
A few weeks after this self-conversation (one of many :)), while playing around with Docker to learn some more about it, I came up with the idea of building a Hadoop Docker image to run containers with a vanilla version of Hadoop: HDFS, YARN and MapReduce. I gave it a try, spent many days building something usable and finally got it done. The outcome is available on the Internet as (1) an artifact on Docker Hub ready to be used right away, and (2) resources that let anyone interested in this topic understand, customize and build their own version.
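To give a feel for what building such an image involves, here is a minimal sketch of a Dockerfile for a single-node vanilla Hadoop image. This is an illustrative assumption on my part, not the actual published artifact: the base image, Hadoop version, paths and copied configuration are all placeholders.

```dockerfile
# Hypothetical sketch, NOT the published image: base image, version
# and paths are illustrative assumptions.
FROM openjdk:8-jdk

ENV HADOOP_VERSION=3.2.1
ENV HADOOP_HOME=/opt/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Download and unpack a vanilla Apache Hadoop release
RUN curl -fsSL https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz \
      | tar -xz -C /opt \
 && mv /opt/hadoop-${HADOOP_VERSION} ${HADOOP_HOME}

# Site configuration (core-site.xml, hdfs-site.xml, yarn-site.xml,
# mapred-site.xml) would be copied in here to wire HDFS, YARN and
# MapReduce together for single-node operation.
COPY conf/ ${HADOOP_HOME}/etc/hadoop/

# Expose the usual web UIs: NameNode (9870) and ResourceManager (8088)
EXPOSE 9870 8088
```

From there, something like `docker run -it -p 9870:9870 -p 8088:8088 <image>` would give you a local playground with the HDFS and YARN web UIs reachable from the host, no cloud connectivity required.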
I’ll be delivering the same course again this year, so I believe this will be good collateral for my students who want to keep playing with it at their own pace. Hadoop, believe it or not, is still a core Big Data platform for many organizations, and many other software vendors need to integrate with it. This material could also be useful to me when trying to make other technologies work with Hadoop, so it’s a great candidate for this blog 🙂
Self of the future, watch this video if you want to remember what you did back then! You might be impressed but, actually, it wasn’t that difficult when you did it 🙂
See you in the future