By: Fikrat Azizov | Updated: 2020-09-08 | Comments | Related: > Azure Synapse Analytics
Problem
Azure Synapse Analytics services is one of the latest innovations from Microsoft in Big Data and Analytics space. This service is a greatly enhanced successor of the SQL DW product. What are the main features Azure Synapse Analytics and how is it different from SQL DW?
Solution
A typical modern data warehouse is based on two data storage layers:
- Data Lakes - Data Lakes are designed to store unlimited data volumes, coming in different formats. This data can be structured (CSV, parquet, etc.), semi-structured (JSON, XML, etc.) or unstructured (audio/video files, text files, etc.). Data Lakes offer a low-cost storage solution, where we can store any data without worrying about its processing needs.
- Data Warehouses - Data Warehouses contain processed and structured data, ready to be consumed by business applications and reports.
Today's organizations face the following challenges, when building modern data warehouses:
- There is a need for scalable data warehousing solutions, capable of processing large amounts of data.
- Different data formats require a variety of processing and visualization tools, like SQL, Spark, Python, Power BI, etc.
- Different IT teams (like data engineers, analysts, data scientists) often work in a silo, without having visibility to each other's data and processes.
- Security teams need multiple tools to secure/audit data and manage permissions.
- Data residing in Data Lakes requires knowledge and cataloging of its schema, to explore and process.
Azure Synapse Analytics is designed to overcome many of these challenges.
Overview of the Azure Synapse Analytics
Azure Synapse Analytics is a platform as a service, designed to unite analytics, data integration and visualization experiences for users. It has the following main features, some of them are still in a preview:
- Analytics - As a successor of SQL DW technology, Azure Synapse Analytics has inherited its Massively Parallel Processing (MPP) capability, which allows storing and efficiently processing petabytes of data, using multi-node clusters. Microsoft constantly adds new features to the SQL core engine, like workload management, materialized views, data masking, etc., which makes this product even more attractive.
- Data Exploration - A newly introduced Synapse Studio (currently in preview) makes data exploration in Data Lakes, SQL engine and Spark very easy. Users now can easily browse data in SQL and Spark tables, as well as in the Data Lakes, without knowing its schema.
- Data Integration - The Synapse Studio has inherited Azure Data Factory's data movement and transformation components, which allows building complex ETL pipelines, without a single line of code.
- Development - Azure Synapse Analytic now supports languages and platforms, like Spark, Python, Scala, as well as Spark notebooks, in addition to the traditional SQL based scripts. The users can work on data engineering, analytics, and data science tasks collaboratively, using a single tool.
- Data Visualization - Using the Synapse Studio users can connect to Power BI workspace and get the same report development experience.
Create Azure Synapse Workspace
Most of the exciting features I mentioned above, require Azure Synapse Workspace and Synapse Studio, both are currently in preview.
Let me first clarify the term pool, which we will be using as part of the Synapse Workspace. The Synapse pool represents a database within your workspace. Here are some pool types available in Synapse:
- SQL on demand - This is a serverless service, allowing you to do light data explorations in Data Lake. It is provisioned automatically when creating a workspace and users have no control over it.
- SQL pool - This is a Synapse database, based on a multi-node cluster, which is also formerly known as SQL DW database. This component is optional and may or may not exist, depending on your Data Warehouse design.
- Spark pool - This is a Spark database, based on a multi-node cluster. This is also an optional component.
It is worth mentioning that every Synapse Workspace requires at least one Data Lake account associated with it, which is specified during its creation. Here are the steps to create a Synapse Workspace:
- Log into the Azure portal, create a new resource and specify the Azure Synapse Analytics (workspaces preview) type.
- Provide the resource group, workspace name, region, Data Lake storage account and file system name within that account. Alternatively, you can create storage and file system accounts if they do not exist. Here is a sample screenshot:
Figure 1
The workspace provisioning may take a few minutes, after which a Synapse Workspace home page opens, as follows:
Figure 2
The central panel workspace contains a list of your SQL and Spark pools. You can create pools from either this screen, or from the Synapse Studio, which I will describe a bit later. Let's launch Synapse Studio using top-right button shown above.
Here's the Synapse Studio home screen:
Figure 3
The left panel contains the main commands, as described below:
- Data - This command allows browsing available databases and Data Lake accounts.
- Develop - This command allows creating SQL scripts and Spark notebooks.
- Orchestrate - This command allows building ETL pipelines.
- Monitor - This command allows monitoring pipeline executions.
- Manage - This command allows managing pools, linked services, etc.
This concludes the overview of Azure Synapse Analytics. We will dive into the Azure Synapse Analytics exciting features and learn how to apply them for solving real-life analytics problems in upcoming tips.
Next Steps
- Read: What is Azure Synapse Analytics (formerly SQL DW)?
- Read: Quickstart: Create a Synapse workspace
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2020-09-08