
26 May 2017

KVM (Kernel Virtual Machine) or Xen? Choosing a Virtualization Platform

KVM versus Xen: which should you choose?

KVM (Kernel Virtual Machine)

KVM (for Kernel-based Virtual Machine) is a full virtualization solution for Linux on x86 hardware containing virtualization extensions (Intel VT or AMD-V). It consists of a loadable kernel module, kvm.ko, that provides the core virtualization infrastructure, and a processor-specific module, kvm-intel.ko or kvm-amd.ko.
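Because KVM depends on these hardware extensions, a quick way to see whether a Linux host advertises them is to look for the `vmx` (Intel VT-x) or `svm` (AMD-V) CPU flags in `/proc/cpuinfo`. The following is a minimal sketch of that check; the function names are illustrative, and a present flag does not guarantee that the feature is enabled in the firmware.

```python
from pathlib import Path

# CPU flags that indicate hardware virtualization support on x86:
# "vmx" is Intel VT-x, "svm" is AMD-V.
VIRT_FLAGS = {"vmx": "Intel VT-x", "svm": "AMD-V"}


def virtualization_extensions(cpuinfo_text):
    """Return the set of virtualization extensions named in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            for flag in value.split():
                if flag in VIRT_FLAGS:
                    found.add(VIRT_FLAGS[flag])
    return found


def host_supports_kvm(cpuinfo_path="/proc/cpuinfo"):
    """Best-effort check; returns False if the file cannot be read."""
    try:
        text = Path(cpuinfo_path).read_text()
    except OSError:
        return False
    return bool(virtualization_extensions(text))


if __name__ == "__main__":
    print("Hardware virtualization extensions present:", host_supports_kvm())
```

On a host without either flag, the kvm-intel.ko and kvm-amd.ko modules have nothing to drive, which is why the check looks only at the CPU flags.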

Virtualization Architecture & KVM

Using KVM, one can run multiple virtual machines running unmodified Linux or Windows images. Each virtual machine has private virtualized hardware: a network card, disk, graphics adapter, etc.

Virtualization Platform Smackdown: VMware vs. Microsoft vs. Red Hat vs. Citrix

KVM is open source software. The kernel component of KVM is included in mainline Linux, as of 2.6.20. The userspace component of KVM is included in mainline QEMU, as of 1.3.

Blogs from people active in KVM-related virtualization development are syndicated at http://planet.virt-tools.org/


The following is a possibly incomplete list of KVM features.

Hypervisors and Virtualization - VMware, Hyper-V, XenServer, and KVM

  • QMP - QEMU Monitor Protocol
  • KSM - Kernel Samepage Merging
  • KVM Paravirtual Clock - A paravirtual time source for KVM
  • CPU Hotplug support - Adding CPUs on the fly
  • PCI Hotplug support - Adding PCI devices on the fly
  • vmchannel - Communication channel between the host and guests
  • migration - Migrating virtual machines
  • vhost
  • SCSI disk emulation
  • Virtio devices
  • CPU clustering
  • HPET
  • Device assignment
  • PXE boot
  • iSCSI boot
  • x2APIC
  • Floppy
  • CDROM
  • USB
  • USB host device passthrough
  • Sound
  • Userspace irqchip emulation
  • Userspace PIT emulation
  • Balloon memory driver
  • Large pages support
  • Stable guest ABI
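Of the features listed above, QMP is the simplest to illustrate in a few lines. It is a JSON-based protocol: the client first negotiates capabilities with `qmp_capabilities`, then sends commands such as `query-status`, and each reply carries either a `return` or an `error` member. The sketch below only builds and parses messages. The helper names are hypothetical, the sample reply is hand-written for illustration, and opening the monitor socket is left out.

```python
import json


def qmp_command(name, arguments=None, command_id=None):
    """Serialize one QMP command as a single JSON line."""
    message = {"execute": name}
    if arguments is not None:
        message["arguments"] = arguments
    if command_id is not None:
        message["id"] = command_id
    return json.dumps(message) + "\n"


def parse_qmp_reply(line):
    """Return the 'return' payload, or raise if the reply reports an error."""
    reply = json.loads(line)
    if "error" in reply:
        err = reply["error"]
        raise RuntimeError(f"{err.get('class')}: {err.get('desc')}")
    return reply.get("return")


# A session starts with capability negotiation, then regular commands.
handshake = qmp_command("qmp_capabilities")
status_query = qmp_command("query-status", command_id="status-1")

# Hand-written example of what a successful reply could look like.
sample_reply = '{"return": {"status": "running", "running": true}, "id": "status-1"}'
print(parse_qmp_reply(sample_reply))
```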

Xen Hypervisor

The Xen hypervisor was first created by Keir Fraser and Ian Pratt as part of the Xenoserver research project at Cambridge University in the late 1990s. A hypervisor "forms the core of each Xenoserver node, providing the resource management, accounting and auditing that we require." The earliest web page dedicated to the Xen hypervisor is still available on Cambridge web servers.  The early Xen history can easily be traced through a variety of academic papers from Cambridge University. Controlling the XenoServer Open Platform is an excellent place to begin in understanding the origins of the Xen hypervisor and the XenoServer project. Other relevant research papers can be found at:

  • Xen and the Art of Virtualization - Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield. Published at SOSP 2003
  • Xen and the Art of Repeated Research - Bryan Clark, Todd Deshane, Eli Dow, Stephen Evanchik, Matthew Finlayson, Jason Herne, Jenna Neefe Matthews. Clarkson University. Presented at FREENIX 2004

  • Safe Hardware Access with the Xen Virtual Machine Monitor - Keir Fraser, Steven Hand, Rolf Neugebauer, Ian Pratt, Andrew Warfield, Mark Williamson. Published at OASIS ASPLOS 2004 Workshop
  • Live Migration of Virtual Machines - Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, Andrew Warfield. Published at NSDI 2005
  • Ottawa Linux Symposium 2004 Presentation
  • Linux World 2005 Virtualization BOF Presentation - Overview of Xen 2.0, Live Migration, and Xen 3.0 Roadmap
  • Xen Summit 3.0 Status Report - Cambridge 2005
  • Introduction to the Xen Virtual Machine - Rami Rosen, Linux Journal. Sept 1, 2005
  • Virtualization in Xen 3.0 - Rami Rosen, Linux Journal. March 2, 2006
  • Xen and the new processors - Rami Rosen, Lwn.net. May 2, 2006

Over the years, the Xen community has hosted several Xen Summit events where the global development community meets to discuss all things Xen. Many presentations and videos of those events are available here.

Why Xen Project?

The Xen Project team is a global open source community that develops the Xen Project Hypervisor and its associated subprojects. The name Xen (pronounced /’zɛn/) has its origins in the Ancient Greek term xenos (ξένος), which can refer to guest-friends whose relationship is constructed under the ritual of xenia ("guest-friendship"). This in turn is a wordplay on the idea of guest operating systems as well as a community of developers and users. The original website was created in 2003 to allow a global community of developers to contribute to and improve the hypervisor. Click on the link to find out more about the project's interesting history.

Virtualization and Hypervisors

The community supporting the project follows a number of principles: Openness, Transparency, Meritocracy and Consensus Decision Making. Find out more about how the community governs itself.

What Differentiates the Xen Project Software?

Xen and the art of embedded virtualization (ELC 2017)

There are several virtualization technologies available in the world today. Our Xen Project virtualization and cloud software includes many powerful features which make it an excellent choice for many organizations:

Supports multiple guest operating systems: Linux, Windows, NetBSD, FreeBSD. A virtualization technology which only supports a few guest operating systems essentially locks the organization into those choices for years to come. With our hypervisor, you have the flexibility to use what you need and add other operating system platforms as your needs dictate. You are in control.

VMware Alternative: Using Xen Server for Virtualization

Supports multiple Cloud platforms: CloudStack, OpenStack. A virtualization technology which only supports one Cloud technology locks you into that technology. With the world of the Cloud moving so quickly, it could be a mistake to commit to one Cloud platform too soon. Our software keeps your choices open as Cloud solutions continue to improve and mature.

Reliable technology with a solid track record. The hypervisor has been in production for many years and is the #1 Open Source hypervisor according to analysts such as Gartner. Conservative estimates show that Xen has an active user base of 10+ million: these are users, not merely hypervisor installations, which are an order of magnitude higher. Amazon Web Services alone runs ½ million virtualized Xen Project instances according to a recent study, and other cloud providers such as Rackspace and hosting companies use the hypervisor at extremely large scale. Companies such as Google and Yahoo use the hypervisor at scale for their internal infrastructure. Our software is the basis of successful commercial products such as Citrix XenServer and Oracle VM, which support an ecosystem of more than 2000 commercially certified partners today. It is clear that many major industry players regard our software as a safe virtualization platform for even the largest clouds.

Scalability. The hypervisor can scale up to 4,095 host CPUs with 16 TB of RAM. Using paravirtualization (PV), the hypervisor supports a maximum of 512 vCPUs with 512 GB of RAM per guest. Using hardware virtualization (HVM), it supports a maximum of 128 vCPUs with 1 TB of RAM per guest.
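To make the per-guest limits concrete, here is a small sketch that checks a planned guest against the figures quoted above. The `GuestSpec` type and `limit_violations` helper are hypothetical and not part of any Xen tool stack. The sketch also assumes 1 TB means 1,024 GB, and the limits themselves may differ between Xen releases.

```python
from dataclasses import dataclass

GB_PER_TB = 1024  # assumption: binary units

# Per-guest maxima quoted in the text: (vCPUs, RAM in GB).
GUEST_LIMITS = {
    "pv": (512, 512),
    "hvm": (128, 1 * GB_PER_TB),
}


@dataclass
class GuestSpec:
    mode: str     # "pv" or "hvm"
    vcpus: int
    ram_gb: int


def limit_violations(spec):
    """Return human-readable problems; an empty list means the spec fits."""
    max_vcpus, max_ram_gb = GUEST_LIMITS[spec.mode]
    problems = []
    if spec.vcpus > max_vcpus:
        problems.append(f"{spec.vcpus} vCPUs exceeds the {spec.mode} limit of {max_vcpus}")
    if spec.ram_gb > max_ram_gb:
        problems.append(f"{spec.ram_gb} GB RAM exceeds the {spec.mode} limit of {max_ram_gb} GB")
    return problems


print(limit_violations(GuestSpec("hvm", vcpus=256, ram_gb=64)))
print(limit_violations(GuestSpec("pv", vcpus=256, ram_gb=512)))
```

The first guest asks for more vCPUs than HVM allows and would have to be reconfigured, while the second fits within the PV limits.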

Performance. Xen tends to outperform other open source virtualization solutions in most configurations. Check out Ubuntu 15.10: KVM vs. Xen vs. VirtualBox Virtualization Performance (Phoronix, Oct 2015) for recent benchmarks of Xen 4.6.

High-Performance Virtualization for HPC Cloud on Xen - Jun Nakajima & Tianyu Lan, Intel Corp.

Security. Security is one of the major concerns when moving critical services to virtualization or cloud computing environments. The hypervisor provides a high level of security due to its modular architecture, which separates the hypervisor from the control and guest operating systems. The hypervisor itself is thin and thus provides a minimal attack surface. The software also contains the Xen Security Modules (XSM), which have been developed and contributed to the project by the NSA for ultra-secure use cases. XSM introduces a control policy that provides fine-grained control over domains and their interaction with each other and the outside world. And, of course, it is also possible to use the hypervisor with SELinux. In addition, Xen’s Virtual Machine Introspection (VMI) subsystems make it the best hypervisor for security applications. For more information, see Virtual Machine Introspection with Xen and VM Introspection: Practical Applications.

Live patching the xen project hypervisor

The Xen Project also has a dedicated security team, which handles security vulnerabilities in accordance with our Security Policy. Unlike almost all corporations and even most open source projects, the Xen Project properly discloses, via an advisory, every vulnerability discovered in supported configurations. We also often publish advisories about vulnerabilities in other relevant projects, such as Linux and QEMU.

Flexibility. Our hypervisor is the most flexible hypervisor on the market, enabling you to tailor your installation to your needs. There are lots of choices and trade-offs that you can make. For example, the hypervisor works on older hardware using paravirtualization, and on newer hardware using HVM or PV on HVM. Users can choose from three tool stacks (XL, XAPI, and libvirt), from an ecosystem of software complementing the project, and from the flavours of Linux and Unix operating systems most suitable for their needs. Further, the project's flexible architecture enables vendors to create Xen-based products and services for servers, cloud, and desktop, in particular for ultra-secure environments.

Modularity. Our architecture is uniquely modular, enabling a degree of scalability, robustness, and security suitable even for large, critical, and extremely secure environments. The control functionality in our control domain can be divided into small modular domains running a minimal kernel and a driver, control logic or other functionality: we call this approach Domain Disaggregation. Disaggregated domains are conceptually similar to processes in an operating system. They can be started and stopped on demand without affecting the rest of the system. Disaggregated domains reduce the attack surface and distribute bottlenecks. For example, this approach enables you to restart an unresponsive device driver without affecting your VMs.

Analysis of the Xen code review process: An example of software development analytics

VM Migration. The software supports Virtual Machine Migration. This allows you to react to changing loads on your servers, protecting your workloads.

Open Source. Open Source means that you have influence over the direction of the code. You are not at the mercy of some immovable external organization which may have priorities which do not align with your organization. You can participate and help ensure that your needs are heard in the process. And you never have to worry that some entity has decided to terminate the product for business reasons. An Open Source project will live as long as there are parties interested in advancing the software.

Multi-vendor support. The project enjoys support from a number of major software and service vendors. This gives end-users numerous places to find support, as well as numerous service providers to work with. With such a rich commercial ecosystem around the project, there is plenty of interest in keeping the project moving forward to ever greater heights.

KVM or Xen? Choosing a Virtualization Platform

When Xen was first released in 2002, the GPL'd hypervisor looked likely to take the crown as the virtualization platform for Linux. Fast forward to 2010, and the new kid in town, KVM, has displaced Xen as the virtualization platform of choice for Red Hat and lives in the mainline Linux kernel. Which one to choose? Read on for our look at the state of Xen vs. KVM.

Things in virtualization land move pretty fast. If you don't have time to keep up with developments in KVM or Xen, it can be confusing to decide which one (if either) you ought to choose. This is a quick look at the state of the market between Xen and KVM.

KVM and Xen

Xen is a hypervisor that supports x86, x86_64, Itanium, and ARM architectures, and can run Linux, Windows, Solaris, and some of the BSDs as guests on their supported CPU architectures. It's supported by a number of companies, primarily by Citrix, but also used by Oracle for Oracle VM, and by others. Xen can do full virtualization on systems that support virtualization extensions, but can also work as a hypervisor on machines that don't have the virtualization extensions.

KVM is a hypervisor that is in the mainline Linux kernel. Your host OS has to be Linux, obviously, but it supports Linux, Windows, Solaris, and BSD guests. It runs on x86 and x86-64 systems with hardware supporting virtualization extensions. This means that KVM isn't an option on older CPUs made before the virtualization extensions were developed, and it rules out newer CPUs (like Intel's Atom CPUs) that don't include virtualization extensions. For the most part, that isn't a problem for data centers that tend to replace hardware every few years anyway — but it means that KVM isn't an option on some of the niche systems like the SM10000 that are trying to utilize Atom CPUs in the data center.

If you want to run a Xen host, you need to have a supported kernel. Linux doesn't come with Xen host support out of the box, though Linux has been shipping with support to run natively as a guest since the 2.6.23 kernel. What this means is that you don't just use a stock Linux distro to run Xen guests. Instead, you need to choose a Linux distro that ships with Xen support, or build a custom kernel. Or go with one of the commercial solutions based on Xen, like Citrix XenServer. The problem is that those solutions are not entirely open source.

And many do build custom kernels, or look to their vendors to do so. Xen is running on quite a lot of servers, from low-cost Virtual Private Server (VPS) providers like Linode to big boys like Amazon with EC2. A TechTarget article demonstrates how providers that have invested heavily in Xen are not likely to switch lightly. Even if KVM surpasses Xen technically, they're not likely to rip and replace the existing solutions in order to take advantage of a slight technical advantage.

And KVM doesn't yet have the technical advantage anyway. Because Xen has been around a bit longer, it has also had more time to mature than KVM. You'll find some features in Xen that haven't yet appeared in KVM, though the KVM project has a lengthy TODO list that they're concentrating on. (The list isn't a direct match for parity with Xen, just a good idea of what the KVM folks are planning to work on.) KVM does have a slight advantage in the Linux camp of being the anointed mainline hypervisor. If you're getting a recent Linux kernel, you've already got KVM built in. Red Hat Enterprise Linux 5.4 included KVM support, and the company is dropping Xen support in favor of KVM in RHEL 6.

This is, in part, an endorsement of how far KVM has come technically. Not only does Red Hat have the benefit of employing much of the talent behind KVM, there's also the benefit of introducing friction to companies that have cloned Red Hat Enterprise Linux and invested heavily in Xen. By dropping Xen from the roadmap, Red Hat is forcing other companies either to drop Xen or to pick up maintenance of Xen and diverge from RHEL. That means extra engineering costs, more effort for ISV certifications, and so on.

KVM isn't entirely on par with Xen, though it's catching up quickly. It has matured enough that many organizations feel comfortable deploying it in production. So does that mean Xen is on the way out? Not so fast.

There Can Be Only One?

The choice of KVM vs. Xen is as likely to be dictated by your vendors as anything else. If you're going with RHEL over the long haul, bank on KVM. If you're running on Amazon's EC2, you're already using Xen, and so on. The major Linux vendors seem to be standardizing on KVM, but there's plenty of commercial support out there for Xen. Citrix probably isn't going away anytime soon.

It's tempting in the IT industry to look at technology as a zero sum game where one solution wins and another loses. The truth is that Xen and KVM are going to co-exist for years to come. The market is big enough to support multiple solutions, and there's enough backing behind both technologies to ensure that they both do well.

Containers vs. Virtualization: The new Cold War?

27 April 2017

Microsoft touts SQL Server 2017 as 'first RDBMS with built-in AI'

The 2017 Microsoft Product Roadmap

Many key Microsoft products reached significant milestones in 2016, with next-gen versions of SharePoint Server, SQL Server and Windows Server all being rolled out alongside major updates to the Dynamics portfolio and, of course, Windows. This year's product roadmap looks to be a bit less crowded, though major changes are on tap for Microsoft's productivity solutions, while Windows 10 is poised for another landmark update. Here's what to watch for in the coming months.

With a constantly changing and increasingly diverse IT landscape, particularly in terms of heterogeneous operating systems (Linux, Windows, etc.), IT organizations must contend with multiple data types, different development languages, and a mix of on-premises, cloud, and hybrid environments, and somehow simultaneously reduce operational costs. To enable you to choose the best platform for your data and applications, SQL Server is bringing its world-class RDBMS to Linux and Windows with SQL Server v.Next.

You will learn more about the SQL Server on Linux offering and how it provides a broader range of choice for all organizations, not just those who want to run SQL on Windows. It enables SQL Server to run in more private, public, and hybrid cloud ecosystems, to be used by developers regardless of programming languages, frameworks or tools, and further empowers ‘every person and every organization on the planet to achieve more.’

Bootcamp 2017 - SQL Server on Linux

Learn More about:

  • What’s next for SQL Server on Linux
  • The Evolution and Power of SQL Server 2016
  • Enabling DevOps practices such as Dev/Test and CI/CD  with containers
  • What is new with SQL Server 2016 SP1: Enterprise class features in every edition
  • How to determine which SQL Server edition to deploy based on operation need, not feature set

SQL Server on Linux: High Availability and security on Linux

Why Microsoft for your operational database management system?

When it comes to the systems you choose for managing your data, you want performance and security that won't get in the way of running your business. As an industry leader in operational database management systems (ODBMS), Microsoft continuously improves its offerings to help you get the most out of your ever-expanding data world.

Read Gartner’s assessment of the ODBMS landscape and learn about the Microsoft "cloud first" strategy. In its latest Magic Quadrant report for ODBMS, Gartner positioned the Microsoft DBMS furthest in completeness of vision and highest for ability to execute. Gartner Reprint of SQL Server 2017

Top Features Coming to SQL Server 2017
From Python to adaptive query optimization to the many cloud-focused changes (not to mention Linux!), Joey D'Antoni takes you through the major changes coming to SQL Server 2017.

Top three capabilities to get excited about in the next version of SQL Server

Microsoft announced the first public preview of SQL Server v.Next in November 2016, and since then we’ve had lots of customer interest, but a few key scenarios are generating the most discussion.

If you’d like to learn more about SQL Server v.Next on Linux and Windows, please join us for the upcoming Microsoft Data Amp online event on April 19 at 8 AM Pacific. It will showcase how data is the nexus between application innovation and intelligence—how data and analytics powered by the most trusted and intelligent cloud can help companies differentiate and out-innovate their competition.

In this blog, we discuss three top things that customers are excited to do with the next version of SQL Server.

Scenario 1: Give applications the power of SQL Server on the platform of your choice

With the upcoming availability of SQL Server v.Next on Linux, Windows, and Docker, customers will have the added flexibility to build and deploy more of their applications on SQL Server. In addition to Windows Server and Windows 10, SQL Server v.Next supports Red Hat Enterprise Linux (RHEL), Ubuntu, and SUSE Linux Enterprise Server (SLES). SQL Server v.Next also runs in Linux and Windows Docker containers, opening up even more possibilities to run on public and private cloud application platforms like Kubernetes, OpenShift, Docker Swarm, Mesosphere DC/OS, Azure Stack, and OpenStack. Customers will be able to continue to leverage existing tools, talents, and resources for more of their applications.

Some of the things customers are planning for SQL Server v.Next on Windows, Linux, and Docker include migrating existing applications from other databases on Linux to SQL Server; implementing new DevOps processes using Docker containers; developing locally on the dev machine of choice, including Windows, Linux, and macOS; and building new applications on SQL Server that can run anywhere—on Windows, Linux, or Docker containers, on-premises, and in the cloud.

SQL Server on Linux - March 2017

Scenario 2: Faster performance with minimal effort

SQL Server v.Next further expands the use cases supported by SQL Server’s in-memory capabilities, In-Memory OLTP and In-Memory ColumnStore. These capabilities can be combined on a single table delivering the best Hybrid Transactional and Analytical Processing (HTAP) performance available in any database system. Both in-memory capabilities can yield performance improvements of more than 30x, enabling the possibility to perform analytics in real time on operational data.

In v.Next, natively compiled stored procedures (In-Memory OLTP) now support JSON data as well as new query capabilities. For the column store, both building and rebuilding a nonclustered columnstore index can now be done online. Another critical addition to the column store is support for LOBs (Large Objects).

SQL Server on Linux 2017

With these additions, the parts of an application that can benefit from the extreme performance of SQL Server’s in-memory capabilities have been greatly expanded! We also introduced a new set of features that learn and adapt from an application’s query patterns over time without requiring actions from your DBA.

Scenario 3: Scale out your analytics

In preparation for the release of SQL Server v.Next, we are enabling the same High Availability (HA) and Disaster Recovery (DR) solutions on all platforms supported by SQL Server, including Windows and Linux. Always On Availability Groups is SQL Server’s flagship solution for HA and DR. Microsoft has released a preview of Always On Availability Groups for Linux in SQL Server v.Next Community Technology Preview (CTP) 1.3.

SQL Server Always On availability groups can have up to eight readable secondary replicas. Each of these secondary replicas can have its own replicas as well. When daisy-chained together, these readable replicas can create massive scale-out for analytics workloads. This scale-out scenario enables you to replicate around the globe, keeping read replicas close to your Business Analytics users. It is of particular interest to users with large data warehouse implementations. And it's also easy to set up.

In fact, you can now create availability groups that span Windows and Linux nodes, and scale out your analytics workloads across multiple operating systems.

In addition, a cross-platform availability group can be used to migrate a database from SQL Server on Windows to Linux or vice versa with minimal downtime. You can learn more about SQL Server HA and DR on Linux by reading the blog post SQL Server on Linux: Mission-critical HADR with Always On Availability Groups.

To find out more, you can watch our SQL Server on Linux webcast. Find instructions for acquiring and installing SQL Server v.Next on the operating system of your choice at http://www.microsoft.com/sqlserveronlinux. To get your SQL Server app on Linux faster, you can nominate your app for the SQL Server on Linux Early Adopter Program, or EAP. Sign up now to see if your application qualifies for technical support, workload validation, and help moving your application to production on Linux before general availability.

To find out more about SQL Server v.Next and get all the latest announcements, register now to attend Microsoft Data Amp, where data gets to work.

Microsoft announced the name and many of the new features in the next release of SQL Server at its Data Amp Virtual Event on Wednesday. While SQL Server 2017 may not have as comprehensive a feature set as SQL Server 2016, there is still some big news and some very interesting new features. The reason for the smaller feature set is simple -- the development cycle for SQL Server 2017 is much shorter than the SQL Server 2016 development cycle. The big news at Wednesday's event is the release of SQL Server 2017 later this year on both Windows and Linux operating systems.

Microsoft Data Platform Airlift 2017 Rui Quintino Machine Learning with SQL Server 2016 and R Services

I was able to quickly download the latest Linux release on Docker and have it up and running on my Mac during today's briefing. (I have previously written about the Linux release here.) That speed to development is one of the major benefits of Docker that Microsoft hopes developers will leverage when building new applications. Docker is just one of many open source trends we have seen Microsoft adopt in recent years with SQL Server. Wednesday's soft launch not only introduced SQL on Linux, but also includes Python support, a new graph engine and a myriad of other features.

First R, Now Python
One of the major features of SQL Server 2016 was the integration of R, an open source statistical analysis language, into the SQL Server database engine. Users can use the sp_execute_external_script stored procedure to run R code that takes advantage of parallelism in the database engine. Savvy users of this procedure might notice the first parameter of this stored procedure is @language. Microsoft designed this stored procedure to be open-ended, and now adds Python as the second language that it supports. Python combines powerful scripting with eminent readability and is broadly used by IT admins, developers, data scientists, and data analysts. Additionally, Python can leverage external statistical packages to perform data manipulation and statistical analysis. When you combine this capability with Transact-SQL (T-SQL), the result is powerful.
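As a rough illustration of the `@language` parameter described above, the sketch below uses Python to assemble the T-SQL batch that would ask SQL Server to run an embedded Python script. The helper `external_script_batch` and the table `dbo.Sales` are hypothetical. `InputDataSet` and `OutputDataSet` are the default data frame names the embedded script uses for its input and output. Sending the batch to a server would need a database driver and a server with the feature installed and enabled, neither of which is shown here.

```python
def external_script_batch(script, input_query, language="Python"):
    """Build a T-SQL batch that calls sp_execute_external_script."""

    def quote(text):
        # T-SQL escapes a single quote inside a string literal by doubling it.
        return "N'" + text.replace("'", "''") + "'"

    return (
        "EXEC sp_execute_external_script\n"
        f"    @language = {quote(language)},\n"
        f"    @script = {quote(script)},\n"
        f"    @input_data_1 = {quote(input_query)};"
    )


# The embedded script receives the query result as the data frame
# InputDataSet and returns rows by assigning a data frame to OutputDataSet.
embedded_script = "OutputDataSet = InputDataSet.head(10)"

# dbo.Sales is a made-up table used only for illustration.
batch = external_script_batch(embedded_script, "SELECT amount FROM dbo.Sales")
print(batch)
```

Passing `language="R"` produces the equivalent call for an R script, which is the open-ended design the parameter allows.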

SQL Server 2017: Advanced Analytics with Python
In this session you will learn how SQL Server 2017 takes in-database analytics to the next level with support for both Python and R; delivering unparalleled scalability and speed with new deep learning algorithms built in. Download SQL Server 2017: https://aka.ms/sqlserver17linuxyt

Big Changes to the Cloud
It is rare for a Microsoft launch event to omit news about cloud services, and Wednesday's event was no exception. Microsoft Azure SQL Database (formerly known as SQL Azure), which is the company's Database as a Service offering, has always lacked complete compatibility with the on-premises (or in an Azure VM) version of SQL Server. Over time, compatibility has gotten much better, but there are still gaps such as unsupported features like SQL CLR and cross-database query.

SQL Server 2017: Security on Linux

The new solution to this problem is a hybrid Platform as a Service (PaaS)/Infrastructure as a Service (IaaS) solution that is currently called Azure Managed Instances. Just as with Azure SQL Database, the Managed Instances administrator is not responsible for OS and patching operations. However, the Managed Instances solution supports many features and functions that are not currently supported in SQL Database. One such new feature is the cross-database query capability. In an on-premises environment, multiple databases commonly exist on the same instance, and a single query can reference separate databases by using database.schema.table notation. In SQL Database, it is not possible to reference multiple databases in one query, which has limited many migrations to the platform because of the amount of code that must be rewritten. Support for cross-database queries in Managed Instances simplifies the process of migrating applications to Azure PaaS offerings, and should thereby increase the number of independent software vendor (ISV) applications that can run in PaaS.

SQL Server 2017: HA and DR on Linux

SQL Server 2017: Adaptive Query Processing

Microsoft also showcased some of the data protection features in Azure SQL Database that are now generally available. Azure SQL Database Threat Detection detects SQL injection attacks, potential SQL injection vulnerabilities, and anomalous logins. It can be turned on at the SQL Database level by enabling auditing and configuring notifications. The administrator is then notified when the threat detection engine detects any anomalous behavior.

Graph Database
One of the things I was happiest to see in SQL Server 2017 was the introduction of a graph database within the core database engine. Despite the name, relational databases struggle to manage relationships between data objects. The simplest example of this struggle is hierarchy management. In a classic relational structure, an organizational chart can be a challenge to model -- who does the CEO report to? With graph database support in SQL Server, the concept of nodes and edges is introduced. Nodes represent entities, edges represent relationships between any two given nodes, and both nodes and edges can be associated with data properties. SQL Server 2017 also uses extensions in the T-SQL language to support join-less queries that use matching to return related values.
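The node-and-edge model is easy to picture with a toy in-memory graph. The sketch below is plain Python, not SQL Server's graph syntax: the `Graph` class, its method names, and the sample org chart are all invented for illustration. It shows nodes and edges carrying properties, and how the org-chart question is answered by following edges rather than joining a table to itself.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str   # name of the node the edge starts from
    label: str    # kind of relationship, e.g. "reports_to"
    target: str   # name of the node the edge points at
    properties: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, name, **properties):
        self.nodes[name] = Node(name, properties)

    def add_edge(self, source, label, target, **properties):
        self.edges.append(Edge(source, label, target, properties))

    def sources(self, label, target):
        """Names of nodes that have a `label` edge pointing at `target`."""
        return [e.source for e in self.edges
                if e.label == label and e.target == target]

    def follow(self, start, label):
        """Follow `label` edges from `start` until none remains."""
        path = [start]
        while True:
            step = [e.target for e in self.edges
                    if e.source == path[-1] and e.label == label]
            if not step or step[0] in path:
                return path
            path.append(step[0])


org = Graph()
for name, title in [("Ada", "CEO"), ("Bo", "VP Engineering"),
                    ("Cy", "Engineer"), ("Di", "Engineer")]:
    org.add_node(name, title=title)
org.add_edge("Bo", "reports_to", "Ada")
org.add_edge("Cy", "reports_to", "Bo", since=2015)
org.add_edge("Di", "reports_to", "Bo", since=2016)

print(org.sources("reports_to", "Bo"))   # direct reports of Bo
print(org.follow("Cy", "reports_to"))    # management chain above Cy
print(org.follow("Ada", "reports_to"))   # the CEO simply has no outgoing edge
```

In this model the CEO needs no placeholder manager or NULL foreign key: the absence of a `reports_to` edge is the answer.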

SQL Server 2017: Building applications using graph data
Graph extensions in SQL Server 2017 will facilitate users in linking different pieces of connected data to help gather powerful insights and increase operational agility. Graphs are well suited for applications where relationships are important, such as fraud detection, risk management, social networks, recommendation engines, predictive analysis, dependence analysis, and IoT applications. In this session we will demonstrate how you can use SQL Graph extensions to build your application using graph data. Download SQL Server 2017: Now on Windows, Linux, and Docker https://www.microsoft.com/en-us/sql-server/sql-server-vnext-including-Linux

Graph databases are especially useful in Internet of Things (IoT), social network, recommendation engine, and predictive analytics applications. Many vendors have invested in graph solutions in recent years; besides Microsoft, IBM and SAP have also released graph database features.

Adaptive Query Plans
One of the biggest challenges of a DBA is managing system performance over time. As data changes, the query optimizer generates new execution plans, which at times might be less than optimal. With Adaptive Query Optimization in SQL Server 2017, SQL Server can evaluate the runtime of a query and compare the current execution to the query's history, building on some of the technology that was introduced in the Query Store feature in SQL Server 2016. For the next run of the same query, Adaptive Query Optimization can then improve the execution plan.

Because a change to an execution plan that is based on one slow execution can have a dramatically damaging effect on system performance, the changes made by Adaptive Query Optimization are incremental and conservative. Over time, this feature handles the tuning a busy DBA may not have time to perform. This feature also benefits from Microsoft's management of Azure SQL Database because the development team monitors the execution data and the improvements that adaptive execution plans make in the cloud. They can then optimize the process and flow for adaptive execution plans in future versions of the on-premises product.
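To make the idea of conservative, history-based plan changes concrete, here is a toy sketch in Python. It is not how SQL Server implements the feature; the class name, the plan names, and the `min_runs` and `min_gain` thresholds are assumptions invented for illustration.

```python
from statistics import mean


class ConservativePlanChooser:
    """Toy model: switch plans only on repeated evidence, never on one slow run."""

    def __init__(self, current_plan, min_runs=3, min_gain=0.2):
        self.current_plan = current_plan
        self.min_runs = min_runs  # hypothetical: runs required before judging a plan
        self.min_gain = min_gain  # hypothetical: required relative improvement
        self.history = {}         # plan name -> list of runtimes in ms

    def record(self, plan, runtime_ms):
        """Store one observed runtime for a plan."""
        self.history.setdefault(plan, []).append(runtime_ms)

    def choose(self):
        """Return the plan to use for the next run."""
        current_runs = self.history.get(self.current_plan, [])
        if len(current_runs) < self.min_runs:
            return self.current_plan  # not enough history yet
        baseline = mean(current_runs)
        for plan, runs in self.history.items():
            if plan == self.current_plan or len(runs) < self.min_runs:
                continue
            if mean(runs) <= baseline * (1 - self.min_gain):
                self.current_plan = plan
                break
        return self.current_plan


chooser = ConservativePlanChooser("hash_join")
for ms in (100, 105, 400):            # the third run is a one-off slow execution
    chooser.record("hash_join", ms)
chooser.record("nested_loop", 60)
after_one_sample = chooser.choose()    # a single sample is not enough to switch

chooser.record("nested_loop", 62)
chooser.record("nested_loop", 58)
after_three_samples = chooser.choose() # consistent evidence now justifies a switch
```

The design choice mirrors the text: a single outlier never triggers a change, and an alternative must beat the averaged history by a margin before it is adopted.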

Are You a Business Intelligence Pro?
SQL Server includes much more than the database engine. Tools like Reporting Services (SSRS) and Analysis Services (SSAS) have long been a core part of the value proposition of SQL Server. Reporting Services benefited from a big overhaul in SQL Server 2016, and more improvements are coming in SQL Server 2017 with on-premises support for storing Power BI reports in an SSRS instance. This capability is big news to organizations that are cloud-averse for various reasons. In addition, SQL Server 2017 adds support for Power Query data sources in SSAS tabular models. This means tabular models can store data from a broader range of data sources than they currently support, such as Azure Blob Storage and web page data.

2017 OWASP SanFran March Meetup - Hacking SQL Server on Scale with PowerShell

And More...
Although it is only an incremental release, Microsoft has packed a lot of functionality into SQL Server 2017. I barely mentioned Linux in this article for a reason: From a database perspective SQL Server on Linux is simply SQL Server. Certainly, there are some changes in infrastructure, but your development experience in SQL Server, whether on Linux, Windows or Docker, is exactly the same.

Keep your environment always on with sql server 2016 sql bits 2017

From my perspective, the exciting news is not just the new features that are in this version, but also the groundwork for feature enhancements down the road. Adaptive query optimization will get better over time, as will the graph database feature which you can query by using standard SQL syntax. Furthermore, the enhancements to Azure SQL Database with managed instances should allow more organizations to consider adoption of the database as a service option. In general, I am impressed with Microsoft's ability to push the envelope on database technology so shortly after releasing SQL Server 2016.

Nordic infrastructure Conference 2017 - SQL Server on Linux Overview

You can get started with the CTP by downloading the package for Docker (https://hub.docker.com/r/microsoft/mssql-server-windows/) or Linux (https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-setup-red-hat), or you can download the Windows release from https://www.microsoft.com/evalcenter/evaluate-sql-server-vnext-ctp .

27 March 2017

IBM Power 9 CPU a Game Changer.

IBM Power 9 CPU

IBM is looking to take a bigger slice out of Intel’s lucrative server business with Power9, the company’s latest and greatest processor for the datacenter. Scheduled for initial release in 2017, the Power9 promises more cores and a hefty performance boost compared to its Power8 predecessor. The new chip was described at the Hot Chips event.

IBM Power9 CPU

The Power9 will end up in IBM’s own servers, and if the OpenPower gods are smiling, in servers built by other system vendors. Although none of these systems have been described in any detail, we already know that bushels of IBM Power9 chips will end up in Summit and Sierra, two 100-plus-petaflop supercomputers that the US Department of Energy will deploy in 2017-2018. In both cases, most of the FLOPS will be supplied by NVIDIA Volta GPUs, which will operate alongside IBM’s processors.

Power 9 Processor For The Cognitive Era

The Power9 will be offered in two flavors: one for single- or dual-socket servers for regular clusters, and the other for NUMA servers with four or more sockets, supporting much larger amounts of shared memory. IBM refers to the dual-socket version as the scale-out (SO) design and the multi-socketed version as the scale-up (SU) design. They basically correspond to the Xeon E5 (EP) and Xeon E7 (EX) processor lines, although Intel is apparently going to unify those lines post-Broadwell.

The SU Power9 is aimed at mission-critical enterprise work and other applications where large amounts of shared memory are desired. It has extra RAS features, buffered memory, and will tend to have fewer cores running at faster clock rates. As such, it carries on many of the traditions of the Power architecture through Power8. The SU Power9 is going to be released in 2018, well after the SO version hits the streets.

The SO Power9 is going after the Xeon dual-socket server market in a more straightforward manner. These chips will use direct attached memory (DDR4) with commodity DIMMs, instead of the buffered memory setup mentioned above. In general, this processor will adhere to commodity packaging so that Power9-based servers can utilize industry standard componentry. This is the platform destined for large cloud infrastructure and general enterprise computing, as well as HPC setups. It’s due for release sometime next year.

Distilling out the differences between the two varieties, here are the basics of the new Power9 (Power8 specs in parentheses for comparison):

  • 8 billion transistors (4.2 billion)
  • Up to 24 cores (Up to 12 cores)
  • Manufactured using 14nm FinFET (22nm SOI)
  • Supports PCIe Gen4 (PCIe Gen3)
  • 120 MB shared L3 cache (96 MB shared L3 cache)
  • 4-way and 8-way simultaneous multithreading (8-way simultaneous multithreading)
  • Memory bandwidth of 120 or 230 GB/sec (230 GB/sec)

From the looks of things, IBM spent most of the extra transistor budget it got from the 14nm shrink on extra cores and a little bit more L3 cache. New on-chip data links were also added, with an aggregate bandwidth of 7 TB/sec, which is used to feed each core at the rate of 256 GB/sec in a 12-core configuration. The bandwidth fans out in the other direction to supply data to memory, additional Power9 sockets, PCIe devices, and accelerators. Speaking of which, there is special support for NVIDIA GPUs in the form of NVLink 2.0 support, which promises much faster communication speeds than vanilla PCIe. An enhanced CAPI interface is also supported for accelerators that support that standard.

The accelerator story is one of the key themes of the Power9, which IBM is touting as “the premier platform for accelerated computing.” In that sense, IBM is taking a different tack than Intel, which is bringing accelerator technology on-chip and making discrete products out of them, as it has done with Xeon Phi and is in the process of doing with Altera FPGAs. By contrast, IBM has settled on the host-coprocessor model of acceleration, which offloads special-purpose processing to external devices. This has the advantage of flexibility; the Power9 can connect to virtually any type of accelerator or special-purpose coprocessor as long as it speaks PCIe, CAPI or NVLink.

Understanding the IBM Power Systems Advantage

Thus the Power9 sticks with an essentially general-purpose design. As a standalone processor it is designed for mainstream datacenter applications (assuming that phrase has meaning anymore). From the perspective of floating point performance, it is about 50 percent faster than Power8, but that doesn’t make it an HPC chip, and in fact, even a mid-range Broadwell Xeon (E5-2600 V4) would likely outrun a high-end Power9 processor on Linpack. Which is fine. That’s what the GPUs and NVLink support are for.

IBM Power Systems Update 1Q17

If there is any deviation from the general-purpose theme, it’s in the direction of data-intensive workloads, especially analytics, business intelligence, and the broad category of “cognitive computing” that IBM is so fond of talking about. Here the Power processors have had something of an historical advantage in that they offered much higher memory bandwidth than their Xeon counterparts: about two to four times higher, in fact. The SO Power9 supports 120 GB/sec of memory bandwidth; the SU version, 230 GB/sec. The Power9 also comes with a very large (120 MB) L3 cache, which is built with eDRAM technology that supports speeds of up to 256 GB/sec. All of which serves to greatly lessen the memory bottleneck for data-intensive applications.

IBM Power Systems Announcement Update

According to IBM, Power9 was about 2.2 times faster than Power8 for graph analytics workloads and about 1.9 times faster for business intelligence workloads. That’s on a per socket basis, comparing a 12-core Power9 to a 12-core Power8 at the same 4GHz clock frequency. That is a pretty impressive performance bump from one generation to the next, although it should be pointed out that IBM offered no comparisons against the latest Broadwell Xeon chips.

The official Power roadmap from IBM does not say much in terms of timing, but thanks to the “Summit” and “Sierra” supercomputers that IBM, Nvidia, and Mellanox Technologies are building for the U.S. Department of Energy, we knew Power9 was coming out in late 2017. Here is the official Power processor roadmap from late last year:

And here is the updated one from the OpenPower Foundation that shows how compute and networking technologies will be aligned:

IBM revealed that the Power9 SO chip will be etched in the 14 nanometer process from Globalfoundries and will have 24 cores, which is a big leap for Big Blue.

That doubling of cores in the Power9 SO is a big jump for IBM, but not unprecedented. IBM made a big jump from two cores in the Power6 and Power6+ generations to eight cores with the Power7 and Power7+ generations, and we have always thought that IBM wanted to do a process shrink and get to four cores on the Power6+ and that something went wrong. IBM ended up double-stuffing processor sockets with the Power6+, which gave it an effective four-core chip. It did the same thing with certain Power5+ machines and Power7+ machines, too.

The other big change with the Power9 SO chip is that IBM is going to allow the memory controllers on the die to reach out directly and control external DDR4 main memory rather than have to work through the “Centaur” memory buffer chip that is used with the Power8 chips. This memory buffering has allowed for very high memory bandwidth and a large number of memory slots as well as an L4 cache for the processors, but it is a hassle for entry systems designs and overkill for machines with one or two sockets. Hence, it is being dropped.

The Power9 SU processor, which will be used in IBM’s own high-end NUMA machines with four or more sockets, will be sticking with the buffered memory. IBM has not revealed what the core count will be on the Power9 SU chip, but when we suggested that based on the performance needs and thermal profiles of big iron that this chip would probably have fewer cores, possibly more caches, and high clock speeds, McCredie said these were all reasonable and good guesses without confirming anything about future products.

LINUX on Power

The Power9 chips will sport an enhanced NVLink interconnect (which we think will have more bandwidth and lower latency but not more aggregate ports on the CPUs or GPUs than is available on the Power8), and we think it is possible that the Power9 SU will not have NVLink ports at all. (Although we could make a case for having a big NUMA system with lots and lots of GPUs hanging off of it using lots of NVLink ports instead of using an InfiniBand interconnect to link multiple nodes in a cluster together.)

The Power9 chips with the SMT8 cores are aimed at analytics workloads that are wrestling with lots of data, in terms of both capacity and throughput. The 24 core variant of the Power9 with SMT8 has 512 KB L2 cache memory per core, and 120 MB of L3 cache is shared across the dies in 10 MB segments, one for each pair of cores. The on-chip switch fabric can move data in and out of the L3 cache at 256 GB/sec. Add in the various interconnects for memory controllers, PCI-Express 4.0 controllers, the “Bluelink” 25 Gb/sec ports that are used to attach accelerators to the processors (and that underpin the NVLink 2.0 protocol that will be added to next year’s “Volta” GV100 GPUs from Nvidia), and IBM’s own remote SMP links for creating NUMA clusters with more than four sockets, and you have an on-chip fabric with over 7 TB/sec of aggregate bandwidth.
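The cache figures quoted above are internally consistent, which a few lines of Python arithmetic can confirm. This is only a check of the numbers in the text; the constant names are made up for the sketch.

```python
CORES = 24             # core count of the 24-core variant quoted above
L2_PER_CORE_KB = 512   # L2 cache per core
L3_SEGMENT_MB = 10     # one L3 segment per pair of cores
CORES_PER_SEGMENT = 2

l3_segments = CORES // CORES_PER_SEGMENT
l3_total_mb = l3_segments * L3_SEGMENT_MB
l2_total_mb = CORES * L2_PER_CORE_KB / 1024

print(f"L3: {l3_segments} segments x {L3_SEGMENT_MB} MB = {l3_total_mb} MB")
print(f"L2: {CORES} cores x {L2_PER_CORE_KB} KB = {l2_total_mb:.0f} MB")
```

Twelve core pairs at 10 MB each account for the full 120 MB of L3, and the per-core L2 adds up to 12 MB across the chip.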

The Power9 chips will have 48 lanes of PCI-Express 4.0 peripheral I/O per socket, for an aggregate of 192 GB/sec of duplex bandwidth. In addition to this, the chip will support 48 lanes of 25 Gb/sec Bluelink bandwidth for other connectivity, with an aggregate bandwidth of 300 GB/sec. On the Power9 SU chips, 48 of the 25 Gb/sec lanes will be used for remote SMP links between quad-socket nodes to make a 16-socket machine, and the remaining 48 lanes of PCI-Express 4.0 will be used for PCI-Express peripherals and CAPI 2.0 accelerators. The Power9 chip has integrated 16 Gb/sec SMP links for gluelessly making the four-socket modules. In addition to the CAPI 2.0 coherent links running atop PCI-Express 4.0, there is a further enhanced CAPI protocol that runs atop the 25 Gb/sec Bluelink ports that is much more streamlined and we think is akin to something like NVM-Express for flash running over PCI-Express in that it eliminates a lot of protocol overhead from the PCI-Express bus. But that is just a hunch. It doesn’t look like the big bad boxes will be able to support this new CAPI or NVLink ports, by the way, since the Bluelink ports are eaten by NUMA expansion.
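The two aggregate bandwidth figures follow from simple lane arithmetic, sketched below in Python. One assumption is not stated in the text: PCI-Express 4.0 signals at 16 Gb/sec per lane per direction. Encoding overhead is ignored, so these are raw rates, and the function name is hypothetical.

```python
def duplex_gb_per_sec(lanes, gbit_per_lane):
    """Raw aggregate bandwidth in GB/sec, counting both directions."""
    one_way_gbit = lanes * gbit_per_lane  # Gb/sec in a single direction
    return 2 * one_way_gbit / 8           # two directions, 8 bits per byte


# Assumption: 16 Gb/sec per lane is the PCI-Express 4.0 signaling rate.
pcie4 = duplex_gb_per_sec(lanes=48, gbit_per_lane=16)
# 25 Gb/sec per lane is the Bluelink rate given in the text.
bluelink = duplex_gb_per_sec(lanes=48, gbit_per_lane=25)

print(f"PCI-Express 4.0: {pcie4:.0f} GB/sec duplex")
print(f"Bluelink:        {bluelink:.0f} GB/sec duplex")
```

Both results line up with the 192 GB/sec and 300 GB/sec figures quoted for the 48-lane configurations.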


21 February 2017

Why Cloudera's Hadoop and Oracle?

Oracle 12c & Hadoop: Optimal Store and Process of Big Data

How to use the Hadoop Ecosystem tools to extract data from an Oracle 12c database, use the Hadoop Framework to process and transform data and then load the data processed within Hadoop into an Oracle 12c database.

Oracle big data appliance and solutions

This blog covers basic concepts:

  • What is Big Data? Big Data is an amount of data that one single machine cannot store and process. The data comes in different formats (structured and unstructured), from different sources, and grows at great velocity.
  • What is Apache Hadoop? It is a framework allowing distributed processing of large data sets across many (possibly thousands of) machines. The ideas behind Hadoop were first described by Google in its papers on the Google File System and MapReduce. The Hadoop framework consists of HDFS and MapReduce.
  • What is HDFS? HDFS (Hadoop Distributed File System) is the Hadoop file system that enables storing large data sets across multiple machines.
  • What is MapReduce? It is the data processing component of the Hadoop framework, consisting of a Map phase and a Reduce phase.
  • What is Apache Sqoop? Apache Sqoop(TM) is a tool to transfer bulk data between Apache Hadoop and structured data stores such as relational databases. It is part of the Hadoop ecosystem.
  • What is Apache Hive? Hive is a tool to query and manage large datasets stored in Hadoop HDFS. It is also part of the Hadoop ecosystem.
  • Where Does Hadoop Fit In? We will use the Apache Hadoop ecosystem (Apache Sqoop) to extract data from an Oracle 12c database and store it in the Hadoop Distributed File System (HDFS). We will then use Apache Hive to transform the data and process it using MapReduce (Java programs can do the same). Apache Sqoop will then be used to load the data already processed within Hadoop into an Oracle 12c database. The following image describes where Hadoop fits in the process. This scenario represents a practical solution for processing big data that comes from an Oracle database as a source; the only condition is that the data source must be structured. Note that Hadoop can also process unstructured data such as videos, log files, etc.
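The Map and Reduce phases described in the list above can be sketched in plain Python. This runs in a single process and only mimics the shape of the model; it does not use Hadoop's Java API, and the function names and sample records are made up for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (key, 1) pair per word in the input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


records = ["Oracle Hadoop Hive", "hadoop sqoop", "Hive hadoop"]
mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map calls run on the machines holding each block of input and the grouped keys are distributed across reducers, but the contract of each phase is the same.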

Why Cloudera + Oracle?
For over 38 years Oracle has been the market leader in relational database management systems (RDBMS) and a major influencer of enterprise software and hardware technology. Besides leading the industry in database solutions, Oracle also develops tools for software development, enterprise resource planning, customer relationship management, supply chain management, business intelligence, and data warehousing. Cloudera has a long-standing relationship with Oracle and has worked closely with it to develop enterprise-class solutions that enable enterprise customers to quickly manage big data workloads.
As the leader in Apache Hadoop-based data platforms, Cloudera has the enterprise quality and expertise that make them the right choice to work with on Oracle Big Data Appliance.
— Andy Mendelson, Senior Vice President, Oracle Server Technologies
Joint Solution Overview
Oracle Big Data Appliance
The Oracle Big Data Appliance is an engineered system optimized for acquiring, organizing, and loading unstructured data into Oracle Database 12c. The Oracle Big Data Appliance includes CDH, Oracle NoSQL Database, Oracle Data Integrator with Application Adapter for Apache Hadoop, Oracle Loader for Hadoop, an open source distribution of R, Oracle Linux, and Oracle Java HotSpot Virtual Machine.

Extending Hortonworks with Oracle's Big Data Platform

Oracle Big Data Discovery
Oracle Big Data Discovery is the visual face of Hadoop that allows anyone to find, explore, transform, and analyze data in Hadoop. Discover new insights, then share results with big data project teams and business stakeholders.

Oracle Big Data SQL Part 1-4

Oracle NoSQL Database
Oracle NoSQL Database Enterprise Edition is a distributed, highly scalable, key-value database. Unlike competitive solutions, Oracle NoSQL Database is easy-to-install, configure and manage, supports a broad set of workloads, and delivers enterprise-class reliability backed by enterprise-class Oracle support.

Oracle Data Integrator Enterprise Edition
Oracle Data Integrator Enterprise Edition is a comprehensive data integration platform that covers all data integration requirements: from high-volume, high-performance batch loads, to event-driven, trickle-feed integration processes. Oracle Data Integrator Enterprise Edition (ODI EE) provides native Cloudera integration, allowing the use of the Cloudera Hadoop Cluster as the transformation engine for all data transformation needs. ODI EE utilizes Cloudera’s foundation of Impala, Hive, HBase, Sqoop, Pig, Spark as well as many others, to provide best in class performance and value. Oracle Data Integrator Enterprise Edition enhances productivity and provides a simple user interface for creating high-performance flows that load and transform data to and from Cloudera data stores.

Oracle Loader for Hadoop
Oracle Loader for Hadoop enables customers to use Hadoop MapReduce processing to create optimized data sets for efficient loading and analysis in Oracle Database 12c. Unlike other Hadoop loaders, it generates Oracle internal formats to load data faster and use less database system resources.

How the Oracle and Hortonworks Handle Petabytes of Data

Oracle R Enterprise
Oracle R Enterprise integrates the open-source statistical environment R with Oracle Database 12c. Analysts and statisticians can run existing R applications and use the R client directly against data stored in Oracle Database 12c, vastly increasing scalability, performance and security. The combination of Oracle Database 12c and R delivers an enterprise-ready deeply-integrated environment for advanced analytics.

Discover Data Insights and Build Rich Analytics with Oracle BI Cloud Service

Oracle NoSQL Database, Oracle Data Integrator Application Adapter for Hadoop, Oracle Loader for Hadoop, and Oracle R Enterprise will also be available as standalone software products, independent of the Oracle Big Data Appliance.

Learn More
Download details about the Oracle Big Data Appliance
Download the solution brief: Driving Innovation in Mobile Devices with Cloudera and Oracle

Oracle is the leader in developing software to address enterprise data management. Typically known as a database leader, it also develops and builds tools for software development, enterprise resource planning, customer relationship management, supply chain management, business intelligence, and data warehousing. Cloudera has a long-standing relationship with Oracle and has worked closely with it to develop enterprise-class solutions that enable end customers to get up and running with big data more quickly.

IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle Big Data Discovery

The Oracle Big Data SQL product will be of interest to anyone who saw our series of posts a few weeks ago about the updated Oracle Information Management Reference Architecture, where Hadoop now sits alongside traditional Oracle data warehouses to provide what’s termed a “data reservoir”. In this type of architecture, Hadoop and its underlying technologies HDFS, Hive and schema-on-read databases provide an extension to the more structured relational Oracle data warehouses, making it possible to store and analyse much larger sets of data with much more diverse data types and structures. The issue that customers face when trying to implement this architecture is that Hadoop is a bit of a “wild west” in terms of data access methods, security and metadata, making it difficult for enterprises to come up with a consistent, over-arching data strategy that works for both types of data store.

Bringing Self Service Data Preparation to the Cloud; Oracle Big Data Preparation Cloud Services

Oracle Big Data SQL attempts to address this issue by providing a SQL access layer over Hadoop, managed by the Oracle database and integrated with the regular SQL engine within the database. Where it differs from SQL-on-Hadoop technologies such as Apache Hive and Cloudera Impala is that there’s a single unified data dictionary, a single Oracle SQL dialect and the full management capabilities of the Oracle database over both sources. This gives you the ability to define access controls over both sources and to use full Oracle SQL (including analytic functions, complex joins and the like) without having to drop down into HiveQL or other Hadoop SQL dialects. Those of you who follow the blog or work with Oracle’s big data connector products probably know of a couple of current technologies that sound like this: Oracle Loader for Hadoop (OLH) is a bulk-unloader for Hadoop that copies Hive or HDFS data into an Oracle database, typically faster than a tool like Sqoop, whilst Oracle Direct Connector for HDFS (ODCH) gives the database the ability to define external tables over Hive or HDFS data, and then query that data using regular Oracle SQL.

Storytelling with Oracle Analytics Cloud

Where ODCH falls short is that it treats the HDFS and Hive data as a single stream, making it easy to read once but, like regular external tables, slow to access frequently as there’s no ability to define indexes over the Hadoop data; OLH is also good but you can only use it to bulk-load data into Oracle, you can’t use it to query data in-place. Oracle Big Data SQL uses an approach similar to ODCH but crucially, it uses some Exadata concepts to move processing down to the Hadoop cluster, just as Exadata moves processing down to the Exadata storage cells (so much so that the project was called “Project Exadoop” internally within Oracle up to the launch) - but also meaning that it's Exadata only, and not available for Oracle Databases running on non-Exadata hardware.

As explained by the launch blog post by Oracle’s Dan McClary (https://blogs.oracle.com/datawarehousing/entry/oracle_big_data_sql_one), Oracle Big Data SQL includes components that install on the Hadoop cluster nodes and provide the same “SmartScan” functionality that Exadata uses to reduce network traffic between storage servers and compute servers. In the case of Big Data SQL, this SmartScan functionality retrieves just the columns of data requested in the query (a process referred to as “column projection”), and also only sends back those rows that are requested by the query predicate.

Unifying Metadata

To unify metadata for planning and executing SQL queries, we require a catalog of some sort.  What tables do I have?  What are their column names and types?  Are there special options defined on the tables?  Who can see which data in these tables?

Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables.  Tables in Hadoop or NoSQL databases are defined as external tables in Oracle.  This makes sense, given that the data is external to the DBMS.

Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?

Yes, but what Big Data SQL provides as an external table is uniquely designed to preserve the valuable characteristics of Hadoop. The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data which is intended to be consumed through dynamically invoked readers. That both causes poor parallelism and removes the value of schema-on-read.

The external tables Big Data SQL presents are different. They leverage the Hive metastore or user definitions to determine both parallelism and read semantics. That means that if a file in HDFS is 100 blocks, Oracle database understands there are 100 units which can be read in parallel. If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read. Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS.
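The "100 blocks means 100 parallel read units" idea can be sketched in a few lines of Python. This is an illustration only, not Big Data SQL or Hadoop code: the 128 MB block size, the `plan_splits` and `read_split` helpers, and the thread pool are assumptions made for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 128 * 1024 * 1024  # assumption: 128 MB HDFS block size


def plan_splits(file_size, block_size=BLOCK_SIZE):
    """Return one (offset, length) read unit per block of the file."""
    count = (file_size + block_size - 1) // block_size  # integer ceiling
    return [(i * block_size, min(block_size, file_size - i * block_size))
            for i in range(count)]


def read_split(split):
    """Stand-in for a record reader: report how many bytes this unit covers."""
    offset, length = split
    return length


# A file of exactly 100 blocks yields 100 independent read units.
splits = plan_splits(file_size=100 * BLOCK_SIZE)
with ThreadPoolExecutor(max_workers=8) as pool:
    bytes_read = sum(pool.map(read_split, splits))

print(len(splits), "read units covering", bytes_read, "bytes")
```

Because every unit is described by its own offset and length, the units can be handed to as many workers as are available without any coordination between them.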

Once that data is read, we need only to join it with internal data and provide SQL on Hadoop and a relational database.

Optimizing Performance

Being able to join data from Hadoop with Oracle Database is a feat in and of itself.  However, given the size of data in Hadoop, it ends up being a lot of data to shift around.  In order to optimize performance, we must take advantage of what each system can do.

In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance.  Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself.  The solution the group proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.

The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

Big Data Analyics using Oracle Advanced Analytics12c and BigDataSQL

Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement.  It does this via something we call Smart Scan for Hadoop.

Oracle Exadata X6: Technical Deep Dive - Architecture and Internals

Moving the work to the data is straightforward.  Smart Scan for Hadoop introduces a new service into the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers.  Queries from the new external tables are sent to these services to ensure that reads are direct path and data-local.  Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

Smart Scan for Hadoop

Consider this: most queries don't select all columns, and most queries have some kind of predicate on them.  Moving unneeded columns and rows is, by definition, excess data movement that impedes performance.  Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

For example, suppose we were querying a set of JSON data stored in HDFS that runs to 100s of TB, but only cared about a few fields (email and status) and only wanted results from the state of Texas.
Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading.  It applies parsing functions to our JSON data and discards any documents which do not contain 'TX' for the state attribute.  Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data.  Rather than moving every field, for every document, we're able to cut down 100s of TB to 100s of GB.
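The Texas example can be sketched in Python to show the two reductions at work: predicate filtering drops rows, and column projection drops fields. This is a conceptual sketch, not Oracle's Smart Scan code; the `smart_scan` function and the sample documents are hypothetical.

```python
import json


def smart_scan(lines, predicate, columns):
    """Parse JSON documents, drop rows failing the predicate, keep only `columns`."""
    for line in lines:
        doc = json.loads(line)
        if predicate(doc):                                  # row filtering
            yield {name: doc.get(name) for name in columns}  # column projection


# Made-up documents standing in for data read from a DataNode.
raw = [
    '{"email": "a@example.com", "status": "active", "state": "TX", "notes": "long text"}',
    '{"email": "b@example.com", "status": "inactive", "state": "CA", "notes": "long text"}',
    '{"email": "c@example.com", "status": "active", "state": "TX", "notes": "long text"}',
]

result = list(smart_scan(raw, lambda d: d.get("state") == "TX", ["email", "status"]))
for row in result:
    print(row)
```

Only the matching rows and the two requested fields leave the function, which is the point of running this step next to the data rather than after shipping everything to the database.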

IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-Time and Predictive Analytics

The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.

Data Reduction in Data Base:

Oracle In-database MapReduce in 12c (big data)
There is some interest from the field in what the in-database MapReduce option is, and why and how it differs from a Hadoop solution.
I thought I would share my thoughts on it.

In-database MapReduce is an umbrella term that includes two features:

  • "SQL MapReduce", also called "SQL pattern matching".
  • An in-database container for Hadoop, to be released in a future release.

"SQL MapReduce": Oracle Database 12c introduced a new feature called PATTERN MATCHING, which uses the "MATCH_RECOGNIZE" clause in SQL. This is one of the latest ANSI SQL standards proposed and implemented by Oracle. The new SQL syntax helps to intuitively solve complex queries that are not easy to implement using 11g analytical functions alone. Some of the use cases are fraud detection, gene sequencing, time series calculation, and stock ticker pattern matching. I found that most of the use cases for Hadoop can be handled using MATCH_RECOGNIZE in the database on structured data. Since this is just a SQL enhancement, it is available in both the Enterprise and Standard Edition databases.

Big Data gets Real time with Oracle Fast Data
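To give a feel for the stock ticker use case mentioned for SQL pattern matching, here is a procedural sketch in Python that finds "V" shapes (a run of falling prices followed by a run of rising prices). It is not MATCH_RECOGNIZE syntax and makes no claim to reproduce its matching rules; the function name and the price series are made up.

```python
def find_v_shapes(prices):
    """Return (start, bottom, end) index triples for each fall-then-rise run."""
    matches = []
    i, n = 0, len(prices)
    while i < n - 1:
        start = i
        # One or more strictly falling steps.
        while i < n - 1 and prices[i + 1] < prices[i]:
            i += 1
        bottom = i
        # One or more strictly rising steps.
        while i < n - 1 and prices[i + 1] > prices[i]:
            i += 1
        if bottom > start and i > bottom:
            matches.append((start, bottom, i))
        elif i == start:
            i += 1  # flat step: no progress was made, so move on
    return matches


ticker = [10, 9, 8, 9, 11, 11, 7, 6, 8]
v_shapes = find_v_shapes(ticker)
print(v_shapes)
```

The appeal of a declarative pattern clause is that the fall-then-rise shape is stated once as a pattern, while the index bookkeeping written by hand here is left to the database.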

"In-database container for Hadoop (beta)": if your development team is more skilled at Hadoop than at SQL, or you want to implement some complex pre-packaged Hadoop algorithms, you could use the Oracle container for Hadoop (beta). It is a prototype implementation of the Hadoop APIs that runs within the Java virtual machine in the database.

Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)

It implements the Hadoop Java APIs and interfaces with the database using parallel table functions to read data in parallel. One interesting fact about parallel table functions is that they can run in parallel across a RAC cluster and can also route data to specific parallel processes. This functionality is the key to making Hadoop scale across clusters, and it has existed in the database for over 15 years now. The advantages of in-database Hadoop are:

  • No need to move data out of the database to run MapReduce functions, which saves time and resources.
  • More real-time data can be used.
  • Fewer redundant copies of data, and hence better security and less disk space used.
  • The servers can be used not just for MapReduce work but also to run the database, making better use of resources.
  • The output of MapReduce is immediately available to analytic tools, and this functionality can be combined with database features like the "in-memory option" (beta) to get near-real-time analysis of Big Data.
  • Database features for security, backup, auditing, and performance can be combined with the MapReduce API.
  • The ability to stream the output of one parallel table function as input to the next parallel table function means no intermediate stages need to be maintained.
  • Features like graph, text, spatial and semantic within Oracle Database can be used for further analysis.
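The point in the list above about streaming one stage's output straight into the next, with no intermediate results to maintain, can be sketched with Python generators. This is an analogy rather than Oracle's parallel table function API; the stage names and the sample sales rows are hypothetical.

```python
def scan_rows(rows):
    """Stage 1: stream rows out of a source one at a time."""
    for row in rows:
        yield row


def map_stage(rows):
    """Stage 2: emit (key, value) pairs without building an intermediate table."""
    for region, amount in rows:
        yield region.upper(), amount


def reduce_stage(pairs):
    """Stage 3: aggregate the streamed pairs per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


sales = [("tx", 10), ("ca", 5), ("tx", 7)]
# Each stage pulls from the previous one; nothing is written out in between.
totals = reduce_stage(map_stage(scan_rows(sales)))
print(totals)
```

Each row flows through all stages before the next one is read, so memory use stays flat no matter how many rows the source produces.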

In addition to this, Oracle 12c will support schema-less access using JSON. That will help NoSQL-style big data use cases run on data within Oracle Database as well.

Having these features will help to solve MapReduce challenges when the data is mostly within the database, reducing data movement and making better use of available resources.
If most of your data is outside the database, then the SQL connectors for Hadoop and Oracle Loader for Hadoop could be used.

More Information: