Getting Started with HDInsight - Part 1 - Introduction to HDInsight

By:   |   Updated: 2014-11-26   |   Comments (7)   |   Related: More > Big Data


Problem

I have read the previous tips on Big Data Basics series of tips and have an overview of Big Data, Hadoop, and the related concepts. My current employer is more of a Microsoft shop and we use Microsoft technologies day-in and day-out. My employer is planning to implement one of our upcoming projects using Hadoop. So I would like to know about Microsoft's Hadoop offering and get prepared before the start of the project.

Solution

In this tip, we will take a look at Microsoft's Hadoop offering and get an overview of Microsoft's Hadoop Distribution.

Microsoft has a Hadoop Distribution known as HDInsight. Like other distributions available in the market, HDInsight contains Apache Hadoop as its core engine.

Hadoop vs. HDInsight Architecture

Typical Hadoop Architecture

As we have seen in the Big Data Basics Tip Series, here is a typical architecture of Apache Hadoop.

Typical Hadoop Architecture

Source - Big Data Basics - Part 3 - Overview of Hadoop

Here are few highlights of Apache Hadoop Architecture:

  • Hadoop works in a master-worker / master-slave fashion.
  • HDFS (Hadoop Distributed File System) offers a highly reliable storage and ensures reliability, even on commodity hardware, by replicating the data across multiple nodes.
  • MapReduce offers an analysis system which can perform complex computation on large datasets.
  • Master contains the Namenode and Job Tracker components.
    • Namenode holds the information about all the other nodes in the Hadoop Cluster, files present in the cluster, constituent blocks of files and their locations in the cluster, and other information useful for the operation of Hadoop Cluster.
    • Job Tracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results.
  • Each Worker / Slave node contains the Task Tracker and Datanode components.
    • Task Tracker is responsible for running the task assigned to it.
    • Datanode is responsible for holding the data.
  • The nodes present in the cluster can be present in any location and there is no dependence on the location of the physical node.
  • Apache Hadoop is at the core of various Hadoop Distributions available on the market.

Source - Big Data Basics - Part 3 - Overview of Hadoop

Typical HDInsight Architecture

Here is a typical architecture of Microsoft HDInsight.

Typical HDInsight Architecture

Here are the highlights of Microsoft HDInsight Architecture:

  • HDInsight is built on top of Hortonworks Data Platform (HDP).
  • HDInsight is 100% compliant with Apache Hadoop.
  • HDInsight is offered in the cloud as Azure HDInsight on Microsoft's Azure Cloud.
  • HDInsight is tightly integrated with Azure Cloud and various other Microsoft Technologies.
  • HDInsight can be installed on the Windows operating system unlike the majority of the distributions which are based on the Linux operating system.
  • HDInsight also works in a master-slave fashion. Master / Control Node (Head node) controls the overall operation of the Cluster. A Secondary Head node is also included as part of Azure HDInsight deployments.
  • HDInsight can be configured to store the data either on HDFS within HDInsight cluster nodes or on Azure Blob Storage. The most common approach is to use Azure Storage to store the data, intermediate results, and the output and not store data on individual nodes.
  • User data (Data to be processed) and job metadata resides in Windows Azure Storage - Blob (WASB). WASB is an implementation of HDFS on Azure Blob Storage.
  • Individual nodes of the cluster offer MapReduce functionality.
  • The Master node reads the job metadata and the user data from Blob Storage and uses it to do the processing. The intermediate and the final results are stored in the Blob Storage.

Storing Data on Azure Blob Storage - Few Highlights

This unique architecture of HDInsight, which stores the data on Azure Blob Storage, offers various advantages over other Hadoop distributions. Here are few highlights of this unique architecture:

  • Cluster can be provisioned and destroyed as and when required and the data is still available on the Blob Storage even after the cluster is destroyed. This is highly cost effective as we don't need to keep the cluster active to access the data.
  • Storing the data on Blob Storage, which is a common storage, allows other tools/processes to access/use this data for other processing, reporting, etc.
  • Azure Storage Account (associated with the Cluster) and the HDInsight Cluster (to be created) should be located in the same data center.
  • Microsoft implements Azure Flat Network Storage technology to offer a high speed connectivity between WASB and compute nodes.

Different ways of deploying Hadoop on Microsoft's Azure Cloud

There are couple of different ways to deploy HDInsight in Microsoft Azure Cloud.

Manual Deployment

We can manually deploy and configure Hadoop on Microsoft's Azure Cloud. Here are few highlights of this approach:

  • Provision Virtual Machines on Azure (IaaS). VMs with any kind of operating system, depending upon the Hadoop distribution, can be provisioned.
  • Install Hadoop on those virtual machines. Core Hadoop or other Hadoop distributions can be installed.
  • Configure those VMs as a Hadoop cluster.
  • This manual process of deploying Hadoop involves a lot of time and effort and is prone to errors unless worked on by experienced Hadoop Administrators.

Deployment using Azure HDInsight

We can deploy Hadoop as Azure HDInsight in a few simple, proven, and tested steps defined by Microsoft, which takes care of all the necessary configurations, Hadoop version management, etc. Here are few highlights of this approach:

  • Easy to Deploy - We can deploy the cluster in less time with tested and proven methods / steps defined by Microsoft (via the Azure portal or through Windows PowerShell).
  • Provision a cluster with the required number of nodes, depending upon processing needs, and destroy the cluster when you are done. This way, you pay only for what you use.
  • By using Azure Storage to store the data, one can retain the data even after the cluster is destroyed unlike the other distributions where the data resides on the data nodes and cluster should be available to access that data.
  • The data stored in Azure storage can be accessed / used by other processes / applications as appropriate.
  • With Azure's flexibility, one can easily scale up / down the resources depending upon the usage / need.

Now that we have built the foundation by understanding the basics of Microsoft HDInsight offering, we will start exploring other aspects of this distributions in the next tips in this series. Stay tuned!

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Dattatrey Sindol Dattatrey Sindol has 8+ years of experience working with SQL Server BI, Power BI, Microsoft Azure, Azure HDInsight and more.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2014-11-26

Comments For This Article




Monday, December 8, 2014 - 1:51:36 AM - Dattatrey Sindol (Datta) Back To Top (35541)

Hi Sridevi,

If you can provide more details on your scenario, I would be able to provide some recommendations/inputs.

Best Regards,

Dattatrey Sindol (Datta)


Saturday, December 6, 2014 - 5:36:10 AM - Sridevi Back To Top (35535)

Hi Datta,

 

We are trying to create SSAS(data cube) using HDInsight environment. Have you came accross a scenario like this. If yes, your suggestions would really help us. Please share it.

 

Regards,

Sridevi.


Friday, December 5, 2014 - 11:55:06 PM - Sridevi Back To Top (35526)

Hi Datta,

 

Yes. Thank you..

 

Regards,

Sridevi.


Wednesday, December 3, 2014 - 1:46:07 PM - Dattatrey Sindol (Datta) Back To Top (35489)

Hi Sridevi,

In HDInsight cluster, when we store the data on Blob Storage, Storage Account's redundancy options are used by cluster as well.

In a typical HDInsight implementation, wherein data is stored on the Blob Storage, all your data nodes access the data from the same location - Windows Azure Storage Blob (WASB). However, if you plan to deploy HDInsight to use local stoage, in which case the individual compute nodes will store data on them and the regular hadoop principles apply.

Since the data is stored in WASB in a typical HDInsight implementation, data nodes access the data based on the location specified by the user and/or the metadata maintained by HDInsight.

Hope that answers your questions. Stay tuned!

Best Regards,

Dattatrey Sindol (Datta)


Monday, December 1, 2014 - 11:56:58 PM - Sridevi Back To Top (35452)

Hi Datta,

Thanks for the clarification.

In hadoop cluster, data will be replicated and stored accross cluster nodes for fault tolerance.

In HDInsight cluster what happens actually? 

Storage accounts have redundancy options. That is used by the cluster as well?

If i am using HDInsight cluster with 1 namenode and 2 datanodes. One storage account with a container in it. That one container will be accessed by all the datanodes while processing with Pig/Hive? Datanodes don't have storage seperately?

What about the blockreports all all those that hadoop namenode maintains - in case of HDInsight cluster?

Thanks,

Sridevi.

 


Monday, December 1, 2014 - 12:37:20 PM - Dattatrey Sindol (Datta) Back To Top (35449)

Hi Sridevi,

 

HDInsight requires one one container designated as its root container where it can store the system files. It uses this root container to store the data as well by default. However, if you want to store the data on different containers and access that data, you can do that as well. HDInsight allows us to access data present not only on different containers but also data present in different storage accounts.

Thank you and stay tuned for future tips.

 

Best Regards,

Dattatrey Sindol (Datta)


Monday, December 1, 2014 - 5:07:51 AM - Sridevi Back To Top (35436)

Hi

I have a doubt. While creating HDInsight cluster, only one container is created. That conainer will be accessed by all the datanodes in the cluster?

In the "typical architecture of HDinsight" you mentioned above, datanodes access data from different conatiners. Can you please explain me that?

Thanks,

Sridevi















get free sql tips
agree to terms