19 November 2019

What is Azure Synapse Analytics (formerly SQL DW)?

What is Azure Synapse Analytics 

Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs

Vision Keynote with Satya Nadella | Microsoft Ignite 2019

Azure Synapse has four components:

  • SQL Analytics: Complete T-SQL based analytics – Generally Available
  • SQL pool (pay per DWU provisioned)
  • SQL on-demand (pay per TB processed) – (Preview)
  • Spark: Deeply integrated Apache Spark (Preview)
  • Data Integration: Hybrid data integration (Preview)
  • Studio: Unified user experience. (Preview)

To access the preview features of Azure Synapse, request access here. Microsoft will triage all requests and respond as soon as possible.

SQL Analytics and SQL pool in Azure Synapse

SQL Analytics refers to the enterprise data warehousing features that are generally available with Azure Synapse.

Azure Synapse Analytics - Next-gen Azure SQL Data Warehouse

SQL pool represents a collection of analytic resources that are being provisioned when using SQL Analytics. The size of SQL pool is determined by Data Warehousing Units (DWU).

Import big data with simple PolyBase T-SQL queries, and then use the power of MPP to run high-performance analytics. As you integrate and analyze, SQL Analytics will become the single version of truth your business can count on for faster and more robust insights.

Modern Data Warehouse overview | Azure SQL Data Warehouse

In a cloud data solution, data is ingested into big data stores from a variety of sources. Once in a big data store, Hadoop, Spark, and machine learning algorithms prepare and train the data. When the data is ready for complex analysis, SQL Analytics uses PolyBase to query the big data stores. PolyBase uses standard T-SQL queries to bring the data into SQL Analytics tables.

Azure data platform overview

SQL Analytics stores data in relational tables with columnar storage. This format significantly reduces the data storage costs, and improves query performance. Once data is stored, you can run analytics at massive scale. Compared to traditional database systems, analysis queries finish in seconds instead of minutes, or hours instead of days.

The analysis results can go to worldwide reporting databases or applications. Business analysts can then gain insights to make well-informed business decisions.

Azure Synapse Analytics (formerly SQL DW) architecture

Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs.

Azure Synapse has four components:

1) SQL Analytics : Complete T-SQL based analytics:

  • SQL pool (pay per DWU provisioned) – Generally Available
  • SQL on-demand (pay per TB processed) – (Preview)
2) Spark : Deeply integrated Apache Spark (Preview)
3) Data Integration : Hybrid data integration (Preview)
4) Studio : unified user experience. (Preview)

On November fourth, Microsoft announced Azure Synapse Analytics, the next evolution of Azure SQL Data Warehouse. Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs.

With Azure Synapse, data professionals can query both relational and non-relational data using the familiar SQL language. This can be done using either serverless on-demand queries for data exploration and ad hoc analysis or provisioned resources for your most demanding data warehousing needs. A single service for any workload.

In fact, it’s the first and only analytics system to have run all the TPC-H queries at petabyte-scale. For current SQL Data Warehouse customers, you can continue running your existing data warehouse workloads in production today with Azure Synapse and will automatically benefit from the new preview capabilities when they become generally available. You can sign up to preview new features like Serverless on-demand query, Azure Synapse studio, and Apache Spark™ integration.

Building a modern data warehouse

Taking SQL beyond data warehousing

A cloud native, distributed SQL processing engine is at the foundation of Azure Synapse and is what enables the service to support the most demanding enterprise data warehousing workloads. This week at Ignite we introduced a number of exciting features to make data warehousing with Azure Synapse easier and allow organizations to use SQL for a broader set of analytics use cases.

Unlock powerful insights faster from all data
Azure Synapse deeply integrates with Power BI and Azure Machine Learning to drive insights for all users, from data scientists coding with statistics to the business user with Power BI. And to make all types of analytics possible, we’re announcing native and built-in prediction support, as well as runtime level improvements to how Azure Synapse handles streaming data, parquet files, and Polybase. Let’s dive into more detail:

With the native PREDICT statement, you can score machine learning models within your data warehouse—avoiding the need for large and complex data movement. The PREDICT function (available in preview) relies on open model framework and takes user data as input to generate predictions. Users can convert existing models trained in Azure Machine Learning, Apache Spark™, or other frameworks into an internal format representation without having to start from scratch, accelerating time to insight.

Azure SQL Database & Azure SQL Data Warehouse

We’ve enabled direct streaming ingestion support and ability to execute analytical queries over streaming data. Capabilities such as: joins across multiple streaming inputs, aggregations within one or more streaming inputs, transform semi-structured data and multiple temporal windows are all supported directly in your data warehousing environment (available in preview). For streaming ingestion, customers can integrate with Event Hubs (including Event Hubs for Kafka) and IoT Hubs.

We’re also removing the barrier that inhibits securely and easily sharing data inside or outside your organization with Azure Data Share integration for sharing both data lake and data warehouse data.

Modern Data Warehouse Overview

By using new ParquetDirect technology, we are making interactive queries over the data lake a reality (in preview). It’s designed to access Parquet files with native support directly built into the engine. Through improved data scan rates, intelligent data caching and columnstore batch processing, we’ve improved Polybase execution by over 13x.

Introducing the modern data warehouse solution pattern with Azure SQL Data Warehouse

Workload isolation
To support customers as they democratize their data warehouses, we are announcing new features for intelligent workload management. The new Workload Isolation functionality allows you to manage the execution of heterogeneous workloads while providing flexibility and control over data warehouse resources. This leads to improved execution predictability and enhances the ability to satisfy predefined SLAs.

COPY statement
Analyzing petabyte-scale data requires ingesting petabyte-scale data. To streamline the data ingestion process, we are introducing a simple and flexible COPY statement. With only one command, Azure Synapse now enables data to be seamlessly ingested into a data warehouse in a fast and secure manner.

This new COPY statement enables using a single T-SQL statement to load data, parse standard CSV files, and more.

COPY statement sample code:

COPY INTO dbo.[FactOnlineSales] FROM ’https://contoso.blob.core.windows.net/Sales/’

Safe keeping for data with unmatched security
Azure has the most advanced security and privacy features in the market. These features are built into the fabric of Azure Synapse, such as automated threat detection and always-on data encryption. And for fine-grained access control businesses can ensure data stays safe and private using column-level security, native row-level security, and dynamic data masking (now generally available) to automatically protect sensitive data in real time.

To further enhance security and privacy, we are introducing Azure Private Link. It provides a secure and scalable way to consume deployed resources from your own Azure Virtual Network (VNet). A secure connection is established using a consent-based call flow. Once established, all data that flows between Azure Synapse and service consumers is isolated from the internet and stays on the Microsoft network. There is no longer a need for gateways, network addresses translation (NAT) devices, or public IP addresses to communicate with the service.

SQL Analytics MPP architecture components

SQL Analytics leverages a scale out architecture to distribute computational processing of data across multiple nodes. The unit of scale is an abstraction of compute power that is known as a data warehouse unit. Compute is separate from storage which enables you to scale compute independently of the data in your system.

AI for Intelligent Cloud and Intelligent Edge: Discover, Deploy, and Manage with Azure ML Services

SQL Analytics uses a node-based architecture. Applications connect and issue T-SQL commands to a Control node, which is the single point of entry for SQL Analytics. The Control node runs the MPP engine which optimizes queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel.

The Compute nodes store all user data in Azure Storage and run the parallel queries. The Data Movement Service (DMS) is a system-level internal service that moves data across the nodes as necessary to run queries in parallel and return accurate results.

With decoupled storage and compute, when using SQL Analytics one can:

  • Independently size compute power irrespective of your storage needs.
  • Grow or shrink compute power, within a SQL pool (data warehouse), without moving data.
  • Pause compute capacity while leaving data intact, so you only pay for storage.
  • Resume compute capacity during operational hours.

Data Warehousing And Big Data Analytics in Azure Basics Tutorial

Azure storage

SQL Analytics leverages Azure storage to keep your user data safe. Since your data is stored and managed by Azure storage, there is a separate charge for your storage consumption. The data itself is sharded into distributions to optimize the performance of the system. You can choose which sharding pattern to use to distribute the data when you define the table. These sharding patterns are supported:

  • Hash
  • Round Robin
  • Replicate
  • Control node

The Control node is the brain of the architecture. It is the front end that interacts with all applications and connections. The MPP engine runs on the Control node to optimize and coordinate parallel queries. When you submit a T-SQL query to SQL Analytics, the Control node transforms it into queries that run against each distribution in parallel.
Compute nodes

The Compute nodes provide the computational power. Distributions map to Compute nodes for processing. As you pay for more compute resources, SQL Analytics re-maps the distributions to the available Compute nodes. The number of compute nodes ranges from 1 to 60, and is determined by the service level for SQL Analytics.
Each Compute node has a node ID that is visible in system views. You can see the Compute node ID by looking for the node_id column in system views whose names begin with sys.pdw_nodes. For a list of these system views, see MPP system views.
Data Movement Service

Data Movement Service (DMS) is the data transport technology that coordinates data movement between the Compute nodes. Some queries require data movement to ensure the parallel queries return accurate results. When data movement is required, DMS ensures the right data gets to the right location.

Machine Learning and AI

A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. When SQL Analytics runs a query, the work is divided into 60 smaller queries that run in parallel.
Each of the 60 smaller queries runs on one of the data distributions. Each Compute node manages one or more of the 60 distributions. A SQL pool with maximum compute resources has one distribution per Compute node. A SQL pool with minimum compute resources has all the distributions on one compute node.

Hash-distributed tables

A hash distributed table can deliver the highest query performance for joins and aggregations on large tables.
To shard data into a hash-distributed table, SQL Analytics uses a hash function to deterministically assign each row to one distribution. In the table definition, one of the columns is designated as the distribution column. The hash function uses the values in the distribution column to assign each row to a distribution.

The following diagram illustrates how a full (non-distributed table) gets stored as a hash-distributed table.

Distributed table
Each row belongs to one distribution.
A deterministic hash algorithm assigns each row to one distribution.
The number of table rows per distribution varies as shown by the different sizes of tables.
There are performance considerations for the selection of a distribution column, such as distinctness, data skew, and the types of queries that run on the system.

Round-robin distributed tables
A round-robin table is the simplest table to create and delivers fast performance when used as a staging table for loads.
A round-robin distributed table distributes data evenly across the table but without any further optimization. A distribution is first chosen at random and then buffers of rows are assigned to distributions sequentially. It is quick to load data into a round-robin table, but query performance can often be better with hash distributed tables. Joins on round-robin tables require reshuffling data and this takes additional time.

Replicated Tables
A replicated table provides the fastest query performance for small tables.
A table that is replicated caches a full copy of the table on each compute node. Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation. Replicated tables are best utilized with small tables. Extra storage is required and there is additional overhead that is incurred when writing data which make large tables impractical.
The diagram below shows a replicated table which is cached on the first distribution on each compute node.

AI for an intelligent cloud and intelligent edge: Discover, deploy, and manage with Azure ML services

Compare price-performance of Azure Synapse Analytics and Google BigQuery

Azure Synapse Analytics (formerly Azure SQL Data Warehouse) outperforms Google BigQuery in all TPC-H and TPC-DS* benchmark queries. Azure Synapse Analytics consistently demonstrated better price-performance compared with BigQuery, and costs up to 94 percent less when measured against Azure Synapse Analytics clusters running TPC-H* benchmark queries.

*Performance and price-performance claims based on data from a study commissioned by Microsoft and conducted by GigaOm in January 2019 for the TPC-H benchmark report and March 2019 for the TPC-DS benchmark report. Analytics in Azure is up to 14 times faster and costs 94 percent less, according to the TPC-H benchmark, and is up to 12 times faster and costs 73 percent less, according to the TPC-DS benchmark. Benchmark data is taken from recognized industry standards, TPC Benchmark™ H (TPC-H) and TPC Benchmark™ DS (TPC-DS), and is based on query execution performance testing of 66 queries for TPC-H and 309 queries for TPC-DS, conducted by GigaOm in January 2019 and March 2019, respectively; testing commissioned by Microsoft. Price-performance is calculated by GigaOm as the TPC-H/TPC-DS metric of cost of ownership divided by composite query. Prices are based on publicly available US pricing as of January 2019 for TPC-H queries and March 2019 for TPC-DS queries. Actual performance and prices may vary. Learn more about the GigaOm benchmark study

QSSUG: Azure Cognitive Services – The Rise of the Machines

Forrester interviewed four customers and surveyed 364 others on their use of Azure analytics with Power BI. Of those surveyed customers, 85 percent agreed or strongly agreed that well-integrated analytics databases and storage, a data management stack, and business intelligence tools were beneficial to their organization. Customers also reported a 21.9 percent average reduction in the overall cost of Microsoft analytics and BI offerings when compared to alternative analytics solutions.

Based on the companies interviewed and surveyed, Forrester projects that a Microsoft analytics and business intelligence (BI) solution could provide:
Benefits of $22.1 million over three years versus costs of $6 million, resulting in a net present value of $16.1 million and a projected return on investment of 271 percent.
Reduced total cost of ownership by 25.7 percent.
Better overall analytics system performance with improved data security, enhanced decision making, and democratized data access.

Modern Data Warehousing with BigQuery (Cloud Next '19)

Analytics in Azure is up to 14x faster and costs 94% less than other cloud providers. Why go anywhere else?

Julia White Corporate Vice President, Microsoft Azure
It’s true. With the volume and complexity of data rapidly increasing, performance and security are critical requirements for analytics. But not all analytics services are built equal. And not all cloud storage is built for analytics.

Only Azure provides the most comprehensive set of analytics services from data ingestion to storage to data warehousing to machine learning and BI. Each of these services have been finely tuned to provide industry leading performance, security and ease of use, at unmatched value. In short, Azure has you covered.

Unparalleled price-performance

When it comes to analytics, price-performance is key. In July 2018, GigaOm published a study that showed that Azure SQL Data Warehouse was 67 percent faster and 23 percent cheaper than Amazon Web Service RedShift.

That was then. Today, we’re even better!

In the most recent study by GigaOm, they found that Azure SQL Data Warehouse is now outperforming the competition up to a whopping 14x times. No one else has produced independent, industry-accepted benchmarks like these. Not AWS Redshift or Google BigQuery. And the best part? Azure is up to 94 percent cheaper.

This industry leading price-performance extends to the rest of our analytics stack. This includes Azure Data Lake Storage, our cloud data storage service, and Azure Databricks, our big data processing service. Customers like Newell Brands – worldwide marketer of consumer and commercial products such as Rubbermaid, Mr. Coffee and Oster – recently moved their workload to Azure and realized significant improvements.

“Azure Data Lake Storage will streamline our analytics process and deliver better end to end performance with lower cost.” 
– Danny Siegel, Vice President of Information Delivery Systems, Newell Brands
Secure cloud analytics

All the price-performance in the world means nothing without security. Make the comparison and you will see Azure is the most trusted cloud in the market. Azure has the most comprehensive set of compliance offerings, including more certifications than any other cloud vendor combined with advanced identity governance and access management with Active Directory integration.

For analytics, we have developed additional capabilities to meet customers’ most stringent security requirements. Azure Data Lake Storage provides multi-layered security including POSIX compliant file and folder permissions and at-rest encryption. Similarly, Azure SQL Data Warehouse utilizes machine learning to provide the most comprehensive set of security capabilities across data protection, access control, authentication, network security, and automatic threat detection.

Insights for all

What’s the best compliment to Azure Analytics’ unmatched price-performance and security? The answer is Microsoft Power BI.

Power BI’s ease of use enables everyone in your organization to benefit from our analytics stack. Employees can get their insights in seconds from all enterprise data stored in Azure. And without limitations on concurrency, Power BI can be used across teams to create the most beautiful visualizations that deliver powerful insights.

Leveraging Microsoft’s Common Data Model, Power BI users can easily access and analyze enterprise data using a common data schema without needing complex data transformation. Customers looking for petabyte-scale analytics can leverage Power BI Aggregations with Azure SQL Data Warehouse for rapid query. Better yet, Power BI users can easily apply sophisticated AI models built with Azure. Powerful insights easily accessible to all.

Customers like Heathrow Airport, one of the busiest airports in the world, are empowering their employees with powerful insights:

“With Power BI, we can very quickly connect to a wide range of data sources with very little effort and use this data to run Heathrow more smoothly than ever before. Every day, we experience a huge amount of variability in our business. With Azure, we’re getting to the point where we can anticipate passenger flow and stay ahead of disruption that causes stress for passengers and employee.”
– Stuart Birrell, Chief Information Officer, Heathrow Airport
Code-free modern data warehouse using Azure SQL DW and Data Factory | Azure Friday


We continue to focus on making Azure the best place for your data and analytics. Our priority is to meet your needs for today and tomorrow.

So, we are excited to make the following announcements:

General availability of Azure Data Lake Storage: The first cloud storage that combines the best of hierarchical files system and blob storage.
General availability of Azure Data Explorer: A fast, fully managed service that simplifies ad hoc and interactive analysis over telemetry, time-series, and log data. This service, powering other Azure services like Log Analytics, App Insights, Time Series Insights, is useful to query streaming data to identify trends, detect anomalies, and diagnose problems.
Preview of new Mapping Data Flow capability in Azure Data Factory: Visual Flow provides a visual, zero-code experience to help data engineers to easily build data transformations. This complements the Azure Data Factory’s code-first experience to enable data engineers of all skill levels to collaborate and build powerful hybrid data transformation pipelines.
Azure provides the most comprehensive platform for analytics. With these updates, Azure solidifies its leadership in analytics.

More Information:












0 reacties:

Post a Comment