It's time to learn how to process data at scale. I've always wanted to learn Hadoop and the nifty Apache tools, most notably Zeppelin. So in my quest to try out a Zeppelin notebook, I first need to download and start the Hortonworks Data Platform, or HDP.

What is HDP?
The easiest way to try HDP is the Sandbox. The Sandbox is a straightforward, pre-configured learning environment that contains the latest developments from Apache Hadoop Enterprise, specifically the Hortonworks Data Platform (HDP) distribution. The Sandbox comes packaged in a virtual environment that can run in the cloud or on your personal machine. The Sandbox allows you to learn and explore HDP on your own.

Specifically, HDP contains HDFS (the Hadoop Distributed File System) and YARN, which coordinates the applications that process data in HDFS.

[Image: Hadoop and YARN]

HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.

When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms.

HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage. (Ref: https://hortonworks.com/apache/hdfs/)

Downloading and Installing
The first thing I did was download the HDP image (12 GB, so yeah, you need to allocate some space on your VM!) from here: https://hortonworks.com/downloads/#data-platform

The second thing I did was download Docker for Mac. I chose to run HDP in Docker instead of VirtualBox or VMware. I only chose Docker because I've done it with the others before, and Docker is kinda new to me, so might as well.

Once the image has downloaded, which will take a while, follow the steps in this tutorial. The tutorial recommends giving your VM 4 CPUs and 8.0 GB of memory.

https://hortonworks.com/tutorial/sandbox-deployment-and-install-guide/section/3/#for-mac

Note: The Sandbox system requirements call for a 64-bit OS with at least 8 GB of RAM and virtualization enabled in the BIOS. Ref here.

After that's installed, follow the command line instructions, such as checking that everything has been installed:

docker images 
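If you already have other images on your machine, the full list can be noisy. Here is a minimal sketch of a filter that looks for one image by name. The helper find_image is hypothetical (not from the tutorial), and the name sandbox-hdp is my assumption based on the start script's file name, so check it against your own docker images output.

```shell
#!/bin/sh
# find_image PATTERN
# Reads "repository:tag" lines on stdin and prints the ones containing PATTERN.
# Returns non-zero when nothing matches. (Hypothetical helper for illustration.)
find_image() {
    grep -F -- "$1"
}

# Usage (assumes the image name contains "sandbox-hdp"):
#   docker images --format '{{.Repository}}:{{.Tag}}' | find_image sandbox-hdp \
#       || echo "sandbox image not found"
```

The --format option tells docker images to print only the repository and tag, which keeps the match simple.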

Then download the .sh file into any folder, cd into that folder in Terminal or the Command Prompt, and run it:

sh start_sandbox-hdp.sh

This then gives you the following prompts, letting you know that everything is starting up:

[Screenshot: startup]

At this point, HDP is ready to go. Configure a few more things by following this tutorial:

https://hortonworks.com/tutorial/learning-the-ropes-of-the-hortonworks-sandbox/

Setting the Host
Following the tutorial, you need to add the sandbox host name to your hosts file using Terminal (or Command Prompt on Windows). On a Mac, type the command from the instructions into Terminal:

echo '{Host-Name} sandbox.hortonworks.com' | sudo tee -a /private/etc/hosts

Make sure you replace {Host-Name} with 127.0.0.1.
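One thing to watch: tee -a appends a new line every time you run the command, so running it twice leaves a duplicate entry. A minimal sketch of a guard is below. The helper missing_host_line is hypothetical (it is not part of the tutorial); it prints the hosts line only when the name is not already mapped in the file.

```shell
#!/bin/sh
# missing_host_line IP NAME FILE
# Prints "IP NAME" only if NAME does not already appear on a
# non-comment line of FILE. Prints nothing otherwise.
# (Hypothetical helper for illustration.)
missing_host_line() {
    ip="$1"; name="$2"; file="$3"
    # Ignore commented-out lines, then look for NAME as a whole word.
    if ! grep -v '^[[:space:]]*#' "$file" 2>/dev/null | grep -qwF -- "$name"; then
        printf '%s %s\n' "$ip" "$name"
    fi
}

# Usage on a Mac (same file as the command above):
#   missing_host_line 127.0.0.1 sandbox.hortonworks.com /private/etc/hosts \
#       | sudo tee -a /private/etc/hosts
```

When the entry already exists the helper prints nothing, so tee has nothing to append and the file is left unchanged.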

Now, all you need to do is open a browser window and enter:

127.0.0.1:8888
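If the page does not load right away, the sandbox may still be starting up. Rather than refreshing the browser, you could poll the address from Terminal. This is only a sketch: retry is a hypothetical helper, and the 30 attempts and 10 second pause in the usage line are arbitrary values I picked, not numbers from the tutorial.

```shell
#!/bin/sh
# retry TRIES DELAY CMD...
# Runs CMD until it succeeds, at most TRIES times, sleeping DELAY
# seconds between attempts. Returns 0 on success, 1 if every attempt fails.
# (Hypothetical helper for illustration.)
retry() {
    tries="$1"; delay="$2"; shift 2
    n=1
    while [ "$n" -le "$tries" ]; do
        if "$@"; then
            return 0
        fi
        n=$((n + 1))
        # Only pause when another attempt is coming.
        if [ "$n" -le "$tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage: poll the sandbox page quietly until it answers.
#   retry 30 10 curl -fsS -o /dev/null http://127.0.0.1:8888/ \
#       && echo "sandbox is up"
```

Keeping the probe command as an argument means the same helper works with any other check you prefer.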

[Screenshot: HDP]

And voilà. Off we go.