21 February 2017

Why Cloudera's hadoop and Oracle?

Oracle 12c & Hadoop: Optimal Store and Process of Big Data

How to use the Hadoop Ecosystem tools to extract data from an Oracle 12c database, use the Hadoop Framework to process and transform data and then load the data processed within Hadoop into an Oracle 12c database.

Oracle big data appliance and solutions

This blog covers basic concepts:

  • What is Big Data? Big Data is the amount of data that one single machine cannot store and process. Data comes with different formats (structured, non - structured) from different sources and with great velocity of grow. 
  • What is Apache Hadoop? It is a framework allowing distributed processing of large data sets across many (can be thousands) of machines. Hadoop concept was first introduced by Google. Hadoop framework consists of HDFS and MapReduce. 
  • What is HDFS? HDFS (Hadoop Distributed File System): the Hadoop File System that enables storing large data sets across multiple machines. 
  • What is Map Reduce? The data processing component of the Hadoop Framework that consists of Map phase and Reduce phase. 
  • What is Apache Sqoop? Apache Sqoop(TM) is a tool to transfer bulk data between Apache Hadoop and structured data stores such as relational databases. It is part or the Hadoop ecosystem. 
  • What is Apache Hive? Hive is a tool to query and manage large datasets stored in Hadoop HDFS. It is also part of the Hadoop ecosystem. 
  • Where Does Hadoop Fit In? We will use the Apache Hadoop Ecosystem (Apache Sqoop) to extract data from an Oracle 12c database and store it into the Hadoop Distributed File System (HDFS). We will then use the Apache Hadoop Ecosystem (Apache Hive) to transform data and process it using the Map Reduce (We can also use Java programs to do the same). Apache Sqoop will be used to load the data already processed within Hadoop into an Oracle 12c database. The following image describes where Hadoop fits in the process. This scenario represents a practical solution to processing big data coming from Oracle database as a source; the only condition is that data source must be structured. Note that Hadoop can also process non – structured data like videos, log files etc.

Why Cloudera + Oracle?
For over 38 years Oracle has been the market leader of RDMBS database systems and a major influencer of enterprise software and hardware technology. Besides leading the industry in database solutions, Oracle also develops tools for software development, enterprise resource planning, customer relationship management, supply chain management, business intelligence, and data warehousing.  Cloudera has a long standing relationship with Oracle and has worked closely to develop enterprise class solutions that enable enterprise customers to quickly manage with big data workloads.
As the leader in Apache Hadoop-based data platforms, Cloudera has the enterprise quality and expertise that make them the right choice to work with on Oracle Big Data Appliance.
— Andy Mendelson, Senior Vice President, Oracle Server Technologies
Joint Solution Overview
Oracle Big Data Appliance
The Oracle Big Data Appliance is an engineered system optimized for acquiring, organizing, and loading unstructured data into Oracle Database 12c. The Oracle Big Data Appliance includes CDH, Oracle NoSQL Database, Oracle Data Integrator with Application Adapter for Apache Hadoop, Oracle Loader for Hadoop, an open source distribution of R, Oracle Linux, and Oracle Java HotSpot Virtual Machine.

Extending Hortonworks with Oracle's Big Data Platform

Oracle Big Data Discovery
Oracle Big Data Discovery is the visual face of Hadoop that allows anyone to find, explore, transform, and analyze data in Hadoop. Discover new insights, then share results with big data project teams and business stakeholders.

Oracle Big Data SQL Part 1-4

Oracle NoSQL Database
Oracle NoSQL Database Enterprise Edition is a distributed, highly scalable, key-value database. Unlike competitive solutions, Oracle NoSQL Database is easy-to-install, configure and manage, supports a broad set of workloads, and delivers enterprise-class reliability backed by enterprise-class Oracle support.

Oracle Data Integrator Enterprise Edition
Oracle Data Integrator Enterprise Edition is a comprehensive data integration platform that covers all data integration requirements: from high-volume, high-performance batch loads, to event-driven, trickle-feed integration processes. Oracle Data Integrator Enterprise Edition (ODI EE) provides native Cloudera integration allowing the use of the Cloudera Hadoop Cluster as the transformation engine for all data transformation needs. ODI EE utilizes Cloudera’s foundation of Impala, Hive, HBase, Sqoop, Pig, Spark as well as many others, to provide best in class performance and value. Oracle Data Integrator Enterprise Edition enhances productivity and provides a simple user interface for creating high performance to load and transform data to and from Cloudera data stores.

Oracle Loader for Hadoop
Oracle Loader for Hadoop enables customers to use Hadoop MapReduce processing to create optimized data sets for efficient loading and analysis in Oracle Database 12c. Unlike other Hadoop loaders, it generates Oracle internal formats to load data faster and use less database system resources.

How the Oracle and Hortonworks Handle Petabytes of Data

Oracle R Enterprise
Oracle R Enterprise integrates the open-source statistical environment R with Oracle Database 12c. Analysts and statisticians can run existing R applications and use the R client directly against data stored in Oracle Database 12c, vastly increasing scalability, performance and security. The combination of Oracle Database 12c and R delivers an enterprise-ready deeply-integrated environment for advanced analytics.

Discover Data Insights and Build Rich Analytics with Oracle BI Cloud Service

Oracle NoSQL Database, Oracle Data Integrator Application Adapter for Hadoop, Oracle Loader for Hadoop, and Oracle R Enterprise will be available both as standalone software products independent of the Oracle Big Data Appliance.

Learn More
Download details about the Oracle Big Data Appliance
Download the solution brief: Driving Innovation in Mobile Devices with Cloudera and Oracle

Oracle is the leader in developing software to address a enterprise data management.  Typically known as a database leader, they also develop and build tools for software development, enterprise resource planning, customer relationship management, supply chain management, business intelligence, and data warehousing.  Cloudera has a long standing relationship with Oracle and have worked closely to develop enterprise class solutions that can enable end customers to more quickly get up and running with big data.

IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle Big Data Discovery

Oracle Big Data SQL product, will be of interest to anyone who saw our series of posts a few weeks ago about the updated Oracle Information Management Reference Architecture, where Hadoop now sits alongside traditional Oracle data warehouses to provide what’s termed a “data reservoir”. In this type of architecture, Hadoop and its underlying technologies HDFS, Hive and schema-on-read databases provide an extension to the more structured relational Oracle data warehouses, making it possible to store and analyse much larger sets of data with much more diverse data types and structures; the issue that customers face when trying to implement this architecture is that Hadoop is a bit of a “wild west” in terms of data access methods, security and metadata, making it difficult for enterprises to come up with a consistent, over-arching data strategy that works for both types of data store.

Bringing Self Service Data Preparation to the Cloud; Oracle Big Data Preparation Cloud Services

Oracle Big Data SQL attempts to address this issue by providing a SQL access layer over Hadoop, managed by the Oracle database and integrated in with the regular SQL engine within the database. Where it differs from SQL on Hadoop technologies such as Apache Hive and Cloudera Impala is that there’s a single unified data dictionary, single Oracle SQL dialect and the full management capabilities of the Oracle database over both sources, giving you the ability to define access controls over both sources, use full Oracle SQL (including analytic functions, complex joins and the like) without having to drop down into HiveQL or other Hadoop SQL dialects. Those of you who follow the blog or work with Oracle’s big data connector products probably know of a couple of current technologies that sound like this; Oracle Loader for Hadoop (OLH) is a bulk-unloader for Hadoop that copies Hive or HDFS data into an Oracle database typically faster than a tool like Sqoop, whilst Oracle Direct Connector for HDFS (ODCH) gives the database the ability to define external tables over Hive or HDFS data, and then query that data using regular Oracle SQL.

Storytelling with Oracle Analytics Cloud

Where ODCH falls short is that it treats the HDFS and Hive data as a single stream, making it easy to read once but, like regular external tables, slow to access frequently as there’s no ability to define indexes over the Hadoop data; OLH is also good but you can only use it to bulk-load data into Oracle, you can’t use it to query data in-place. Oracle Big Data SQL uses an approach similar to ODCH but crucially, it uses some Exadata concepts to move processing down to the Hadoop cluster, just as Exadata moves processing down to the Exadata storage cells (so much so that the project was called “Project Exadoop” internally within Oracle up to the launch) - but also meaning that it's Exadata only, and not available for Oracle Databases running on non-Exadata hardware.

As explained by the launch blog post by Oracle’s Dan McClary https://blogs.oracle.com/datawarehousing/entry/oracle_big_data_sql_one  , Oracle Big Data SQL includes components that install on the Hadoop cluster nodes that provide the same “SmartScan” functionality that Exadata uses to reduce network traffic between storage servers and compute servers. In the case of Big Data SQL, this SmartScan functionality retrieves just the columns of data requested in the query (a process referred to as “column projection”), and also only sends back those rows that are requested by the query predicate.

Unifying Metadata

To unify metadata for planning and executing SQL queries, we require a catalog of some sort.  What tables do I have?  What are their column names and types?  Are there special options defined on the tables?  Who can see which data in these tables?

Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables.  Tables in Hadoop or NoSQL databases are defined as external tables in Oracle.  This makes sense, given that the data is external to the DBMS.

Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?

 Yes, but Big Data SQL provides as an external table is uniquely designed to preserve the valuable characteristics of Hadoop.  The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data which is intended to be consumed through dynamically invoked readers.  That causes both poor parallelism and removes the value of schema-on-read.

  The external tables Big Data SQL presents are different.  They leverage the Hive metastore or user definitions to determine both parallelism and read semantics.  That means that if a file in HFDS is 100 blocks, Oracle database understands there are 100 units which can be read in parallel.  If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read.  Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS.

Once that data is read, we need only to join it with internal data and provide SQL on Hadoop and a relational database.

Optimizing Performance

Being able to join data from Hadoop with Oracle Database is a feat in and of itself.  However, given the size of data in Hadoop, it ends up being a lot of data to shift around.  In order to optimize performance, we must take advantage of what each system can do.

In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance.  Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself.  The solution the group proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.

The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

Big Data Analyics using Oracle Advanced Analytics12c and BigDataSQL

Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement.  It does this via something we call Smart Scan for Hadoop.

Oracle Exadata X6: Technical Deep Dive - Architecture and Internals

Moving the work to the data is straightforward.  Smart Scan for Hadoop introduces a new service into to the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers.  Queries from the new external tables are sent to these services to ensure that reads are direct path and data-local.  Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

Smart Scan for Hadoop

Consider this: most queries don't select all columns, and most queries have some kind of predicate on them.  Moving unneeded columns and rows is, by definition, excess data movement and impeding performance.  Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

For example, suppose we were querying a 100 of TB set of JSON data stored in HDFS, but only cared about a few fields -- email and status -- and only wanted results from the state of Texas.
Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading.  It applies parsing functions to our JSON data, discards any documents which do not contain 'TX' for the state attribute.  Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data.  Rather than moving every field, for every document, we're able to cut down 100s of TB to 100s of GB.

IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-Time and Predictive Analytics

The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.

Data Reduction in Data Base:

Oracle In-database MapReduce in 12c (big data)
There is some interest from the field about what is In-database map-reduce option and why and how it is different than hadoop solution.
I though I will share my thoughts on it.

 In-database map-reduce is an umbrella term that includes two features.
            "SQL Map-reduce" or  "SQL pattern matching".
             In database container for Hadoop.  to be released in future release.

"SQL MapReduce" : Oracle database 12c introduced a new feature called PATTERN MATCHING using "MATCH_RECOGNIZE" clause in SQL. This is one of the latest ANSI SQL standards proposed and implemented by Oracle. The new sql syntax helps to intuitively solve complex queries that are not easy to implement using 11g analytical functions alone. Some of the use cases are fraud detection, gene sequencing, time series calculation, stock ticker pattern matching . Etc.  I found most of the use case for Hadoop can be done using match_recognize in database on structured data. Since this is just a SQL enhancement , it is there in both Enterprise & Standard Edition database.

Big Data gets Real time with Oracle Fast Data

"In database container for Hadoop  (beta)" : if you have your development team more skilled at Hadoop and not SQL , or want to implement some complex pre-packaged Hadoop algorithms, you could use oracle container for Hadoop (beta). It is a Hadoop prototype APIs  which run within the java virtual machine in the database.

Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)

It implements Hadoop Java APIs and interfaces with database using parallel table functions to read data in parallel. One interesting fact about parallel table functions is that it can run in parallel across RAC cluster and also can also route data to a specific parallel processes . This functionality is the key in making Hadoop scale across clusters and  this functionality exited in database for over 15 years now.  Advantage of in-database Hadoop  is:

  • No need to move data out of database for running Mapreduce functions and hence save time and resources.
  •  More  real time data could be used.
  •  Less redundant copies of data and hence better security & less disk space used.
  •  The servers could be used for not just MapReduce work, but also used to run the database making better resource utilization,
  • The output of the MapReduce is immediately available for analytic tools and can combine this functionality along with database features like "in-memory option (beta) to get near real time analysis of Big Data. 
  • Combine db features for security. Backup, auditing, performance with MapReduce. API.
  • The ability to stream the output of one parallel table function as input to the next parallel table function has an advantage of not needing to maintain any intermediate stages.
  • Features like graphical, test, spacial and semantic within oracle database can be used for further analysts.

In addition to this, Oracle 12c will support schema less access using JSON protocol. That will help big data use cases of NOSQL to run on data within Oracle database as well.

Having these features will help to solve MapReduce challenges when the data is mostly within database and reduce data movement and make better use of available resources..
If Most of your data is outside the DB, then sql Connectors for hadoop and Oracle Loader for Hadoop could be used.

More Information:

















0 reacties:

Post a Comment