23 April 2018

Microsoft Azure Databricks

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform.

A fast, easy, and collaborative Apache Spark™ based analytics platform optimized for Azure

Designed in collaboration with Microsoft, Azure Databricks combines the best of Databricks and Azure to help customers accelerate innovation with one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

Bring teams together in an interactive workspace. From data gathering to model creation, use Databricks notebooks to unify the process and instantly deploy to production. Launch your new Spark environment with a single click. Integrate effortlessly with a wide variety of data stores and services such as Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Store, Azure Blob storage, and Azure Event Hubs. Add artificial intelligence (AI) capabilities instantly and share insights through rich integration with Power BI.

Protect your data and business with Azure Active Directory integration, role-based controls, and enterprise-grade SLAs. Get peace of mind with fine-grained user permissions, enabling secure access to Databricks notebooks, clusters, jobs and data.

Globally scale your analytics and data science projects. Build and innovate faster using machine learning capabilities. Add capacity instantly. Reduce cost and complexity with a fully managed, cloud-native platform. Target data and projects of any size using a complete set of analytics technologies including SQL, Streaming, MLlib, and Graph.

Introduction to Azure Databricks

Big-data company Databricks Inc. made its flagship analytics platform available as an integrated service within Microsoft Corp.’s Azure public cloud.

The service, called Microsoft Azure Databricks, is designed to help customers better process massive amounts of data stored in Microsoft’s cloud, the companies said.

Databricks has grown to become one of the most recognized players on the big-data scene. The company was formed by the creators of the Spark research project at the University of California at Berkeley, which later became the popular open-source big data processing framework called Apache Spark. Databricks was founded to commercialize that software through its Unified Analytics Platform, an analytics service based on Spark that is increasingly being used to power modern workloads such as artificial intelligence.

In a blog post, Microsoft Vice President of Azure Data Rohan Kumar and Databricks Chief Executive Officer Ali Ghodsi revealed that Azure Databricks was the fruit of more than two years of collaboration. The executives said the companies began working on the service in response to customer requests for a version of Databricks that’s compatible with Azure. The service, introduced in beta last November, is now being made generally available.

“We experienced a lot of interest and engagement in the preview from organizations in need of a high-performance analytics platform based on Spark,” Kumar said. “With Azure Databricks, deeply integrated with services like Azure SQL Data Warehouse, our customers are now positioned to increase productivity and collaboration and globally scale analytics and data science projects on a trusted, secure cloud environment.”

Azure Databricks has been designed to help make things easier for customers. Rather than doing all the heavy lifting that comes with deploying Databricks in their own data centers, customers can simply access the service via the Azure cloud. Azure Databricks also provides greater compatibility with Microsoft’s own services.

With Azure Databricks, it becomes possible to take data from other services, prepare it, and process it using machine learning algorithms. From there, the data can also be streamed to other services such as Cosmos DB and Power BI, the executives said.

Azure Databricks was chiefly designed to fulfill companies’ interest in using data to power their artificial intelligence systems. To that end, the service was built with three design principles in mind. The first is to enhance user productivity in developing Big Data applications and analytics pipelines. The second is to build a system that can scale almost without limit while keeping costs under control. The third is to ensure that the service meets strict security and compliance standards for enterprises.

“Azure Databricks protects customer data with enterprise-grade SLAs, simplified security and identity, and role-based access controls with Azure Active Directory integration,” the executives said. “As a result, organizations can safeguard their data without compromising productivity of their users.”

“This speaks to the increasing power of cloud services,” said Rob Enderle, principal analyst at the Enderle Group. “Databricks is analytics at scale and this effort should put the analysis engine far closer to the massive amounts of data already being placed on Azure. The result should be a combination of higher performance and lower cost for analytics at massive scale.”

Databricks, provider of the leading Unified Analytics Platform and founded by the team who created Apache Spark™, will showcase its Unified Analytics Platform as a Silver sponsor (booth #1111) at the Gartner Data & Analytics Summit 2018 held March 5-8 in Grapevine, Texas. Hundreds of organizations are leveraging Databricks’ Unified Analytics Platform as a simplified approach for data science and data engineering teams to accelerate innovation and make data-driven business decisions based on big data analytics and artificial intelligence (AI).  Databricks, recently named a Visionary in the Gartner Magic Quadrant for Data Science and Machine-Learning Platforms 2018, focuses on making Big Data and AI simple for enterprise organizations.

The Gartner Data & Analytics Summit will offer a holistic view of current trends and topics around data management, business intelligence (BI), and analytics, including innovative technologies such as AI, blockchain and IoT. Enterprises attend the Summit to learn about the shift toward a data-driven culture to lead the way to better business outcomes. Databricks’ Unified Analytics Platform directly addresses organizations’ issues associated with AI adoption and deployment, making this technology suitable for all businesses.

Databricks’ Unified Analytics Platform is a cloud-based platform powered by Apache Spark, the most popular open source technology for big data processing and machine learning workloads.

“Most data and analytics leaders realize that when it comes to embarking on new AI and Machine Learning initiatives, it’s still really about the data first and foremost.  Their teams need to figure out how you get a massive amount of data, often in real-time, to your model in a way that supports an iterative process and generates a meaningful business result,” said Rick Schultz, chief marketing officer at Databricks. “The Databricks Unified Analytics Platform addresses precisely this problem and, as such, we expect strong engagement from the attendees of Gartner Data & Analytics Summit, many of whom already use Spark.”

To expand the global reach of the Unified Analytics Platform, Databricks recently announced a joint product partnership with Microsoft. The new alliance addresses customer demand for Spark on Microsoft Azure by offering the Unified Analytics Platform as a First Party Service called Azure Databricks. This new integrated service makes it easier for organizations around the globe to derive value from their Big Data and realize the promise of AI. With Azure Databricks, customers can accelerate innovation with one-click setup and effortless integration with a wide variety of Microsoft data stores and services.

About Databricks
Databricks’ mission is to accelerate innovation for its customers by unifying Data Science, Engineering and Business. Founded by the team who created Apache Spark™, Databricks provides a Unified Analytics Platform for data science teams to collaborate with data engineering and lines of business to build data products. Users achieve faster time-to-value with Databricks by creating analytic workflows that go from ETL and interactive exploration to production. The company also makes it easier for its users to focus on their data by providing a fully managed, scalable, and secure cloud infrastructure that reduces operational complexity and total cost of ownership. Databricks, venture-backed by Andreessen Horowitz, NEA and Battery Ventures, among others, has a global customer base that includes Viacom, Shell and HP. For more information, visit www.databricks.com.

Microsoft Azure Databricks - Azure Power Lunch

Azure is the best place for Big Data & AI
We are excited to add Azure Databricks to the Azure portfolio of data services and have taken great care to integrate it with other Azure services to unlock key customer scenarios.

High-performance connectivity to Azure SQL Data Warehouse, an elastic, petabyte-scale cloud data warehouse, allows organizations to build Modern Data Warehouses that load and process any type of data at scale for enterprise reporting and visualization with Power BI. It also enables data science teams working in Azure Databricks notebooks to easily access high-value data from the warehouse to develop models.

Integration with Azure IoT Hub, Azure Event Hubs, and Azure HDInsight Kafka clusters enables enterprises to build scalable streaming solutions for real-time analytics scenarios such as recommendation engines, fraud detection, predictive maintenance, and many others.

Integration with Azure Blob Storage, Azure Data Factory, Azure Data Lake Store, Azure SQL Data Warehouse, and Azure Cosmos DB allows organizations to use Azure Databricks to clean, join, and aggregate data no matter where it sits.

We are committed to making Azure the best place for organizations to unlock the insights hidden in their data to accelerate innovation. With Azure Databricks and its native integration with other services, Azure is the one-stop destination to easily unlock powerful new analytics, machine learning, and AI scenarios.

Apache Spark + Databricks + Enterprise Cloud = Azure Databricks
Once you manage data at scale in the cloud, you open up massive possibilities for predictive analytics, AI, and real-time applications. Over the past five years, the platform of choice for building these applications has been Apache Spark: with a massive community at thousands of enterprises worldwide, Spark makes it possible to run powerful analytics algorithms at scale and in real time to drive business insights. Managing and deploying Spark at scale has remained challenging, however, especially for enterprise use cases with large numbers of users and strong security requirements.

Enter Databricks. Founded in 2013 by the team that started the Spark project, Databricks provides an end-to-end, managed Apache Spark platform optimized for the cloud. Featuring one-click deployment, autoscaling, and an optimized Databricks Runtime that can improve the performance of Spark jobs in the cloud by 10-100x, Databricks makes it simple and cost-efficient to run large-scale Spark workloads. Moreover, Databricks includes an interactive notebook environment, monitoring tools, and security controls that make it easy to leverage Spark in enterprises with thousands of users.

In Azure Databricks, we have gone one step beyond the base Databricks platform by integrating closely with Azure services through collaboration between Databricks and Microsoft. Azure Databricks features optimized connectors to Azure storage platforms (e.g. Data Lake and Blob Storage) for the fastest possible data access, and one-click management directly from the Azure console. This is the first time that an Apache Spark platform provider has partnered closely with a cloud provider to optimize data analytics workloads from the ground up.

Benefits for Data Engineers and Data Scientists
Why is Azure Databricks so useful for data scientists and engineers? Let’s look at some ways:

Azure Databricks is optimized from the ground up for performance and cost-efficiency in the cloud. The Databricks Runtime adds several key capabilities to Apache Spark workloads that can increase performance and reduce costs by as much as 10-100x when running on Azure:

  • High-speed connectors to Azure storage services such as Azure Blob Store and Azure Data Lake, developed together with the Microsoft teams behind these services.
  • Auto-scaling and auto-termination for Spark clusters to automatically minimize costs.
  • Performance optimizations including caching, indexing, and advanced query optimization, which can improve performance by as much as 10-100x over traditional Apache Spark deployments in cloud or on-premises environments.
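
To make the auto-scaling and auto-termination idea concrete, here is a minimal sketch in plain Python. It is not how the Databricks Runtime decides anything; the `ClusterPolicy` class, its field names, and the "tasks per worker" capacity model are all assumptions made up for illustration. It only shows the general shape of such a policy: size the cluster to the pending load within fixed bounds, and shut it down after a period of inactivity.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterPolicy:
    """Hypothetical scaling policy; every field name here is illustrative."""
    min_workers: int
    max_workers: int
    tasks_per_worker: int       # assumed capacity of a single worker
    idle_timeout_minutes: int   # terminate after this much inactivity


def target_workers(policy: ClusterPolicy, pending_tasks: int) -> int:
    """Workers needed for the pending load, clamped to the policy bounds."""
    needed = math.ceil(pending_tasks / policy.tasks_per_worker)
    return max(policy.min_workers, min(policy.max_workers, needed))


def should_terminate(policy: ClusterPolicy, idle_minutes: int) -> bool:
    """True once the cluster has been idle for the full timeout."""
    return idle_minutes >= policy.idle_timeout_minutes


policy = ClusterPolicy(min_workers=2, max_workers=8,
                       tasks_per_worker=4, idle_timeout_minutes=30)
for load in (0, 10, 100):
    print(load, "pending tasks ->", target_workers(policy, load), "workers")
```

The clamp is what keeps costs bounded: a burst of work can never push the cluster past `max_workers`, and an idle cluster falls back to `min_workers` before the idle timeout removes it entirely.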

Remember the jump in productivity when documents became truly multi-editable? Why can’t we have that for data engineering and data science? Azure Databricks brings exactly that. Notebooks on Databricks are live and shared, with real-time collaboration, so that everyone in your organization can work with your data. Dashboards enable business users to call an existing job with new parameters. And Databricks integrates closely with Power BI for interactive visualization. All this is possible because Azure Databricks is backed by Azure Database and other technologies that enable highly concurrent access, fast performance, and geo-replication.

Azure Databricks comes packaged with interactive notebooks that let you connect to common data sources, run machine learning algorithms, and learn the basics of Apache Spark to get started quickly. It also features an integrated debugging environment to let you analyze the progress of your Spark jobs from within interactive notebooks, and powerful tools to analyze past jobs. Finally, other common analytics libraries, such as the Python and R data science stacks, are preinstalled so that you can use them with Spark to derive insights. We really believe that big data can become 10x easier to use, and we are continuing the philosophy started in Apache Spark to provide a unified, end-to-end platform.

Architecture of Azure Databricks
So how is Azure Databricks put together? At a high level, the service launches and manages worker nodes in each Azure customer’s subscription, letting customers leverage existing management tools within their account.

Specifically, when a customer launches a cluster via Databricks, a “Databricks appliance” is deployed as an Azure resource in the customer’s subscription. The customer specifies the types of VMs to use and how many, but Databricks manages all other aspects. In addition to this appliance, a managed resource group is deployed into the customer’s subscription that we populate with a VNet, a security group, and a storage account. These are concepts Azure users are familiar with. Once these services are ready, users can manage the Databricks cluster through the Azure Databricks UI or through features such as autoscaling. All metadata (such as scheduled jobs) is stored in an Azure Database with geo-replication for fault tolerance.
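
The split described above, where the customer picks the VM type and count and the service handles the rest, can be pictured as a small cluster specification. The sketch below builds one as a Python dictionary. The key names (`cluster_name`, `node_type_id`, `autoscale`, `autotermination_minutes`) are modeled on Databricks cluster settings but should be checked against the current Clusters API reference before use; the helper `build_cluster_spec` and the example values are hypothetical and do not come from the original post.

```python
import json


def build_cluster_spec(name, node_type, min_workers, max_workers, idle_minutes):
    """Assemble the customer-chosen parts of a cluster definition.

    Hypothetical helper: it only validates and packages the inputs,
    it does not call any Azure or Databricks service.
    """
    if not 0 < min_workers <= max_workers:
        raise ValueError("expected 0 < min_workers <= max_workers")
    if idle_minutes <= 0:
        raise ValueError("idle_minutes must be positive")
    return {
        "cluster_name": name,
        "node_type_id": node_type,  # the VM type the customer selects
        "autoscale": {
            "min_workers": min_workers,
            "max_workers": max_workers,
        },
        "autotermination_minutes": idle_minutes,
    }


# Example values are placeholders, not recommendations.
spec = build_cluster_spec("nightly-etl", "Standard_DS3_v2", 2, 8, 30)
print(json.dumps(spec, indent=2))
```

Everything outside this specification (the VNet, the security group, the storage account in the managed resource group) is provisioned by the service, which is why the customer-facing definition can stay this short.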

Azure Databricks Architecture

For users, this design means two things. First, they can easily connect Azure Databricks to any storage resource in their account, e.g., an existing Blob Store subscription or Data Lake. Second, Databricks is managed centrally from the Azure control center, requiring no additional setup.
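
As a small illustration of the first point, storage in the customer's account is usually addressed by URI. The sketch below builds such URIs in plain Python, assuming the `wasbs://` scheme used by the Hadoop connector for Azure Blob storage and the `adl://` scheme used for Azure Data Lake Store. The helper names, the account name `mystore`, and the container name `logs` are invented for the example.

```python
def blob_uri(container: str, account: str, path: str = "") -> str:
    """URI for a path in an Azure Blob storage container (wasbs scheme)."""
    return (f"wasbs://{container}@{account}.blob.core.windows.net/"
            f"{path.lstrip('/')}")


def data_lake_uri(account: str, path: str = "") -> str:
    """URI for a path in an Azure Data Lake Store account (adl scheme)."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


# Hypothetical account and container names.
print(blob_uri("logs", "mystore", "/2018/04/events.json"))
print(data_lake_uri("mystore", "raw/events"))
```

In a notebook, a URI like this would typically be passed to a Spark reader such as `spark.read.json(...)`, provided the cluster has been given credentials for the storage account; that configuration is outside the scope of this sketch.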

Total Azure Integration
We are integrating Azure Databricks closely with all features of the Azure platform in order to provide the best of the platform to users. Here are some pieces we’ve done so far:

  • Diversity of VM types: Customers can use all existing VMs: F-series for machine learning scenarios, M-series for massive memory scenarios, D-series for general purpose, etc.
  • Security and Privacy:  In Azure, ownership and control of data is with the customer.  We have built Azure Databricks to adhere to these standards.  We aim for Azure Databricks to provide all the compliance certifications that the rest of Azure adheres to.
  • Flexibility in network topology: Customers have a diversity of network infrastructure needs.  Azure Databricks supports deployments in customer VNETs, which can control which sources and sinks can be accessed and how they are accessed.
  • Azure Storage and Azure Data Lake integration: these storage services are exposed to Databricks users via DBFS to provide caching and optimized analysis over existing data.
  • Azure Power BI: Users can connect Power BI directly to their Databricks clusters using JDBC in order to query data interactively at massive scale using familiar tools.
  • Azure Active Directory: AAD provides access control for resources and is already in use in most enterprises. Azure Databricks workspaces deploy in customer subscriptions, so AAD can naturally be used to control access to sources, results, and jobs.
  • Azure SQL Data Warehouse, Azure SQL DB and Azure CosmosDB: Azure Databricks easily and efficiently uploads results into these services for further analysis and real-time serving, making it simple to build end-to-end data architectures on Azure.
In addition to all the integration you can see, we have worked hard to integrate in ways that you can’t see – but can see the benefits of.

Internally, we use Azure Container Services to run the Azure Databricks control-plane and data-planes via containers.

  • Accelerated Networking provides the fastest virtualized network infrastructure in the cloud.   Azure Databricks utilizes this to further improve Spark performance.
  • The latest generation of Azure hardware (Dv3 VMs), with NVMe SSDs capable of 100 µs I/O latency, makes Databricks I/O performance even better.
  • We are just scratching the surface though!  As the service becomes GA and moves beyond that, we expect to add continued integrations with other upcoming Azure services.

Microsoft and Databricks are very excited to partner together to bring you Azure Databricks. For the first time, a leading cloud provider and leading analytics system provider have partnered to build a cloud analytics platform optimized from the ground up – from Azure’s storage and network infrastructure all the way to Databricks’s runtime for Apache Spark. We believe that Azure Databricks will greatly simplify building enterprise-grade production data applications, and we would love to hear your feedback as the service rolls out.
