23 December 2016

Software Defined Storage and Ceph - What Is All the Fuss About?

Ceph: What It Is

Ceph is open source, software-defined distributed storage maintained by Red Hat since its acquisition of Inktank in April 2014.

The power of Ceph can transform your organization’s IT infrastructure and your ability to manage vast amounts of data. If your organization runs applications with different storage interface needs, Ceph is for you! Ceph’s foundation is the Reliable Autonomic Distributed Object Store (RADOS), which provides your applications with object, block, and file system storage in a single unified storage cluster—making Ceph flexible, highly reliable and easy for you to manage.
Ceph’s RADOS provides you with extraordinary data storage scalability—thousands of client hosts or KVMs accessing petabytes to exabytes of data. Each one of your applications can use the object, block or file system interfaces to the same RADOS cluster simultaneously, which means your Ceph storage system serves as a flexible foundation for all of your data storage needs. You can use Ceph for free, and deploy it on economical commodity hardware. Ceph is a better way to store data.

Ceph provides seamless access to objects using native language bindings or radosgw, a REST interface that’s compatible with applications written for S3 and Swift.


Ceph’s software libraries provide client applications with direct access to the RADOS object-based storage system, and also provide a foundation for some of Ceph’s advanced features, including RADOS Block Device (RBD), RADOS Gateway, and the Ceph File System.

The Ceph librados software libraries enable applications written in C, C++, Java, Python and PHP to access Ceph’s object storage system using native APIs. The librados libraries provide advanced features, including:
  • partial or complete reads and writes
  • snapshots
  • atomic transactions with features like append, truncate and clone range
  • object level key-value mappings

RADOS Gateway provides Amazon S3 and OpenStack Swift compatible interfaces to the RADOS object store.

Ceph’s RADOS Block Device (RBD) provides access to block device images that are striped and replicated across the entire storage cluster.


Ceph’s object storage system isn’t limited to native binding or RESTful APIs. You can mount Ceph as a thinly provisioned block device! When you write data to Ceph using a block device, Ceph automatically stripes and replicates the data across the cluster. Ceph’s RADOS Block Device (RBD) also integrates with Kernel-based Virtual Machines (KVMs), bringing Ceph’s virtually unlimited storage to KVMs running on your Ceph clients.

Ceph RBD interfaces with the same Ceph object storage system that provides the librados interface and the Ceph FS file system, and it stores block device images as objects. Since RBD is built on top of librados, RBD inherits librados capabilities, including read-only snapshots and revert to snapshot. By striping images across the cluster, Ceph improves read access performance for large block device images.

  • Thinly provisioned
  • Resizable images
  • Image import/export
  • Image copy or rename
  • Read-only snapshots
  • Revert to snapshots
  • Ability to mount with Linux or QEMU KVM clients!
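To make the snapshot and revert features above a little more concrete, here is a toy in-memory model. It is only a sketch: `ToyImage` and its methods are hypothetical names invented for illustration, not the rbd API, and a real cluster does this with copy-on-write objects rather than Python dictionaries.

```python
from typing import Dict, Optional


class ToyImage:
    """In-memory stand-in for a block device image. Hypothetical, not the rbd API."""

    def __init__(self, size_blocks: int, block_size: int = 512) -> None:
        self.size_blocks = size_blocks
        self.block_size = block_size
        self.blocks: Dict[int, bytes] = {}  # only blocks that were written
        self.snapshots: Dict[str, Dict[int, bytes]] = {}

    def _check(self, index: int) -> None:
        if not 0 <= index < self.size_blocks:
            raise IndexError("block index outside image")

    def write(self, index: int, data: bytes) -> None:
        self._check(index)
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.blocks[index] = bytes(data)

    def read(self, index: int, snap: Optional[str] = None) -> bytes:
        """Read from the live image, or from a named read-only snapshot."""
        self._check(index)
        source = self.blocks if snap is None else self.snapshots[snap]
        # Blocks that were never written read back as zeros.
        return source.get(index, bytes(self.block_size))

    def snapshot(self, name: str) -> None:
        # bytes objects are immutable, so a shallow copy freezes the block map.
        self.snapshots[name] = dict(self.blocks)

    def rollback(self, name: str) -> None:
        """Revert the live image to the state captured by a snapshot."""
        self.blocks = dict(self.snapshots[name])
```

Taking a snapshot, overwriting a block and then calling `rollback` returns the live image to the earlier contents, while the snapshot itself stays readable throughout.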

Ceph provides a POSIX-compliant network file system that aims for high performance, large data storage, and maximum compatibility with legacy applications.


Ceph’s object storage system offers a significant feature compared to many object storage systems available today: Ceph provides a traditional file system interface with POSIX semantics. Object storage systems are a significant innovation, but they complement rather than replace traditional file systems. As storage requirements grow for legacy applications, organizations can configure their legacy applications to use the Ceph file system too! This means you can run one storage cluster for object, block and file-based data storage.

Ceph’s file system runs on top of the same object storage system that provides object storage and block device interfaces. The Ceph metadata server cluster provides a service that maps the directories and file names of the file system to objects stored within RADOS clusters. The metadata server cluster can expand or contract, and it can rebalance the file system dynamically to distribute data evenly among cluster hosts. This ensures high performance and prevents heavy loads on specific hosts within the cluster.

The Ceph file system provides numerous benefits:
  • It provides stronger data safety for mission-critical applications.
  • It provides virtually unlimited storage to file systems.
  • Applications that use file systems can use Ceph FS with POSIX semantics. No integration or customization required!
  • Ceph automatically balances the file system to deliver maximum performance.

Red hat ceph storage customer presentation

It’s capable of block, object, and file storage, though only block and object are currently deployed in production. It is scale-out, meaning multiple Ceph storage nodes (servers) cooperate to present a single storage system that easily handles many petabytes (1 PB = 1,000 TB = 1,000,000 GB) and increases both performance and capacity at the same time. Ceph has many basic enterprise storage features including replication (or erasure coding), snapshots, thin provisioning, tiering (ability to shift data between flash and hard drives), and self-healing capabilities.
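To put rough numbers on the replication versus erasure coding choice, here is a small sketch of the capacity arithmetic. The function names and the 4+2 layout are illustrative assumptions rather than Ceph defaults, and the sums ignore filesystem overhead and the free-space headroom a real cluster needs.

```python
def usable_replicated(raw_tb: float, copies: int) -> float:
    """Usable capacity when every object is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return raw_tb / copies


def usable_erasure_coded(raw_tb: float, k: int, m: int) -> float:
    """Usable capacity when each object becomes k data chunks plus m coding chunks."""
    if k < 1 or m < 0:
        raise ValueError("need k >= 1 and m >= 0")
    return raw_tb * k / (k + m)


raw_tb = 1000.0  # 1 PB of raw disk, in the decimal units used above
print("3-way replication:", round(usable_replicated(raw_tb, 3), 1), "TB usable")
print("4+2 erasure coding:", round(usable_erasure_coded(raw_tb, 4, 2), 1), "TB usable")
```

With 1 PB of raw disk, three-way replication leaves about a third of it usable, while a 4+2 erasure-coded layout leaves about two thirds, at the cost of more work on every read and write.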

Why Ceph is HOT

In many ways Ceph is a unique animal. It’s the only storage solution that delivers four critical capabilities:
  • open-source
  • software-defined
  • enterprise-class
  • unified storage (object, block, file).
Many other storage products are open source or scale out or software-defined or unified or have enterprise features, and some let you pick two or three of these, but almost nothing else offers all four together.

Red Hat Ceph Storage: Past, Present and Future

  • Open source means lower cost
  • Software-defined means deployment flexibility, faster hardware upgrades, and lower cost
  • Scale-out means it’s less expensive to build large systems and easier to manage them
  • Block + Object means more flexibility (most other storage products are block only, file only, object only, or file+block; block+object is very rare)
  • Enterprise features mean a reasonable amount of efficiency and data protection

Quick and Easy Deployment of a Ceph Storage Cluster with SLES 

Ceph includes many basic enterprise storage features, including replication (or erasure coding), snapshots, thin provisioning, auto-tiering (the ability to shift data between flash and hard drives), and self-healing capabilities.

Red Hat Storage Day New York - What's New in Red Hat Ceph Storage

Despite all that Ceph has to offer there are still two camps: those that love it and those that dismiss it.

I Love Ceph!
The nature of Ceph means some of the storage world loves it, or at least has very high hopes for it. Generally server vendors love Ceph because it lets them sell servers as enterprise storage, without needing to develop and maintain complex storage software. The drive makers (of both spinners and SSDs) want to love Ceph because it turns their drive components into a storage system. It also lowers the cost of the software and controller components of storage, leaving more money to spend on drives and flash.

Ceph, Meh!
On the other hand, many established storage hardware and software vendors hope Ceph will fade into obscurity. Vendors who already developed richly featured software don’t like it because it’s cheaper competition and applies downward price pressure on their software. Those who sell tightly coupled storage hardware and software fear it because they can’t revise their hardware as quickly or sell it as cheaply as the commodity server vendors used by most Ceph customers.

Battle of the Titans – ScaleIO vs. Ceph at OpenStack Summit Tokyo 2015 (Full Video)

To be honest, Ceph isn’t perfect for everyone. It’s not the most efficient at using flash or CPU (but it’s getting better), the file storage feature isn’t fully mature yet, and it is missing key efficiency features like deduplication and compression. And some customers just aren’t comfortable with open-source or software-defined storage of any kind. But every release of Ceph adds new features and improved performance, while system integrators build turnkey Ceph appliances that make it easy to deploy and come with integrated hardware and software support.

What’s Next for Ceph?

EMC- Battle of the Titans: Real-time Demonstration of Ceph vs. ScaleIO Performance for Block Storage

Ceph continues to evolve, backed by both Red Hat (which acquired Inktank in 2014) and by a community of users and vendors who want to see it succeed. In every release it gets faster, gains new features, and becomes easier to manage.

The Future of Cloud Software Defined Storage with Ceph: Andrew Hatfield, Red Hat

Ceph is basically a fault-tolerant distributed clustered filesystem. If it works, that’s like a nirvana for shared storage: you have many servers, each one pitches in a few disks, and then there’s a filesystem that sits on top that’s visible to all servers in the cluster. If a disk fails, that’s okay too.

Those are really cool features, but it turns out that Ceph is really more than just that. To borrow a phrase, Ceph is like an onion – it’s got layers. The filesystem on top is nifty, but the coolest bits are below the surface.
If Ceph proves to be solid enough for use, we’ll need to train our sysadmins all about Ceph. That means pretty diagrams and explanations, which we thought would be more fun to share with you.

Building exascale active archives with Red Hat Ceph Storage

This is the logical diagram that we came up with while learning about Ceph. It might help to keep it open in another window as you read a description of the components and services.

Ceph components
We’ll start at the bottom of the stack and work our way up.

OSD stands for Object Storage Device, and roughly corresponds to a physical disk. An OSD is actually a directory that Ceph makes use of, residing on a regular filesystem, though it should be assumed to be opaque for the purposes of using it with Ceph.

Use of XFS or btrfs is recommended when creating OSDs, owing to their good performance, featureset (support for XATTRs larger than 4KiB) and data integrity.

We’re using btrfs for our testing.

Using RAIDed OSDs
A feature of Ceph is that it can tolerate the loss of OSDs. This means we can theoretically achieve fantastic utilisation of storage devices by obviating the need for RAID on every single device.

However, we’ve not yet determined whether this is awesome. At this stage we’re not using RAID, and just letting Ceph take care of block replication.

Placement Groups
Placement groups are also referred to as PGs. The official docs note that they help ensure performance and scalability, as tracking metadata for each individual object would be too costly.

A PG collects objects from the next layer up and manages them as a collection. It represents a mostly-static mapping to one or more underlying OSDs. Replication is done at the PG layer: the degree of replication (number of copies) is asserted higher, up at the Pool level, and all PGs in a pool will replicate stored objects into multiple OSDs.

As an example in a system with 3-way replication:

  • PG-1 might map to OSDs 1, 37 and 99
  • PG-2 might map to OSDs 4, 22 and 41
  • PG-3 might map to OSDs 18, 26 and 55
  • Etc.

Any object that happens to be stored on PG-1 will be written to all three OSDs (1,37,99). Any object stored in PG-2 will be written to its three OSDs (4,22,41). And so on.
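The example above can be sketched in a few lines. This is a toy model only: the PG-to-OSD table is copied from the list above, while `write_object`, `read_object` and the in-memory `osd_contents` store are hypothetical names made up for illustration.

```python
from collections import defaultdict

# PG id -> the three OSDs it maps to, taken from the example above.
PG_TO_OSDS = {1: [1, 37, 99], 2: [4, 22, 41], 3: [18, 26, 55]}

# osd id -> {object name: data}; stands in for the disks themselves.
osd_contents = defaultdict(dict)


def write_object(pg_id, name, data):
    """Store one copy of the object on every OSD behind the placement group."""
    for osd in PG_TO_OSDS[pg_id]:
        osd_contents[osd][name] = data
    return PG_TO_OSDS[pg_id]


def read_object(pg_id, name, failed_osds=frozenset()):
    """Return the object from the first OSD in the PG that is still alive."""
    for osd in PG_TO_OSDS[pg_id]:
        if osd not in failed_osds and name in osd_contents[osd]:
            return osd_contents[osd][name]
    raise KeyError(f"{name!r} is not readable from PG-{pg_id}")


write_object(1, "hello.txt", b"hello ceph")
# Two of the three OSDs behind PG-1 can fail and the object is still readable.
print(read_object(1, "hello.txt", failed_osds={1, 37}))
```

Only when all three OSDs behind a PG are gone does the read fail, which is the property that makes per-device RAID optional.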

A pool is the layer at which most user-interaction takes place. This is the important stuff like GET, PUT, DELETE actions for objects in a pool.

Pools contain a number of PGs, not shared with other pools (if you have multiple pools). The number of PGs in a pool is defined when the pool is first created. At the time of writing it can be raised later but not reduced, so it’s worth choosing carefully up front. You can think of PGs as providing a hash mapping for objects into OSDs, to ensure that the OSDs are filled evenly when adding objects to the pool.
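The hash mapping idea can be illustrated with a short sketch. SHA-256 and a plain modulo are stand-ins chosen for simplicity here and are assumptions, not the hash Ceph actually uses; `pg_for_object` is a hypothetical helper name.

```python
import hashlib
from collections import Counter


def pg_for_object(name: str, pg_num: int) -> int:
    """Map an object name to a placement group index in [0, pg_num)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % pg_num


# Spread 10,000 made-up object names over a pool with 64 PGs.
counts = Counter(pg_for_object(f"object-{i}", 64) for i in range(10000))
print("PGs used:", len(counts))
print("fewest objects in a PG:", min(counts.values()))
print("most objects in a PG:", max(counts.values()))
```

Because the same name always hashes to the same PG, any client can work out where an object lives without asking a central lookup table, and the spread across PGs stays close to even.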

The Future of Cloud Software Defined: Andrew Hatfield, Red Hat

CRUSH maps
CRUSH mappings are specified on a per-pool basis, and serve to skew the distribution of objects into OSDs according to administrator-defined policy. This is important for ensuring that replicas don’t end up on the same disk/host/rack/etc, which would break the entire point of having replica copies.

A CRUSH map is written by hand, then compiled and passed to the cluster.
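To show the kind of policy a CRUSH map expresses, here is a sketch of "one replica per host" placement. It is not the CRUSH algorithm: it uses rendezvous (highest-random-weight) hashing as a simple stand-in, and the cluster layout, host names and `place_pg` function are all hypothetical.

```python
import hashlib

# Hypothetical cluster layout: host name -> OSD ids on that host.
CLUSTER = {
    "host-a": [0, 1, 2],
    "host-b": [3, 4, 5],
    "host-c": [6, 7, 8],
    "host-d": [9, 10, 11],
}


def _score(*parts) -> int:
    """Deterministic pseudo-random score for a combination of values."""
    key = "/".join(str(p) for p in parts).encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def place_pg(pg_id: int, replicas: int, cluster=CLUSTER) -> list:
    """Pick `replicas` OSDs for a PG, never two on the same host."""
    if replicas > len(cluster):
        raise ValueError("not enough hosts for one replica per host")
    # Rank hosts by their score for this PG and keep the top `replicas`.
    hosts = sorted(cluster, key=lambda h: _score(pg_id, h), reverse=True)[:replicas]
    # Within each chosen host, take the OSD with the highest score.
    return [max(cluster[h], key=lambda o: _score(pg_id, h, o)) for h in hosts]


for pg_id in range(3):
    print(f"PG-{pg_id} ->", place_pg(pg_id, replicas=3))
```

The placement is computed rather than looked up, so every node that knows the cluster layout arrives at the same answer, and losing a whole host costs each PG at most one replica.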

Focus on: Red Hat Storage big data

Still confused?
This may not make much sense at the moment, and that’s completely understandable. Someone on the Ceph mailing list provided a brief summary of the components which we found helpful for clarifying things:

Ceph services
Now we’re into the good stuff. Pools full of objects are well and good, but what do you do with them now?

What the lower layers ultimately provide is a RADOS cluster: Reliable Autonomic Distributed Object Store. At a practical level this translates to storing opaque blobs of data (objects) in high performance shared storage.

Because RADOS is fairly generic, it’s ideal for building more complex systems on top. One of these is RBD.

Decoupling Storage from Compute in Apache Hadoop with Ceph

As the name suggests, a RADOS Block Device (RBD) is a block device stored in RADOS. RBD offers useful features on top of raw RADOS objects. From the official docs:

  • RBDs are striped over multiple PGs for performance
  • RBDs are resizable
  • Thin provisioning means on-disk space isn’t used until actually required

RBD also takes advantage of RADOS capabilities such as snapshotting and cloning, which would be very handy for applications like virtual machine disks.
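Striping and thin provisioning are easier to picture with a small model. The sketch below splits an image into fixed-size objects and only creates an object when something is written to it. `ThinImage`, the object naming scheme and the 4 MiB default object size are assumptions made for illustration, not the rbd implementation.

```python
class ThinImage:
    """Toy block image stored as fixed-size objects, allocated on first write."""

    def __init__(self, name, size_bytes, object_size=4 * 1024 * 1024):
        self.name = name
        self.size_bytes = size_bytes
        self.object_size = object_size
        self.objects = {}  # object name -> bytearray, created lazily

    def _object_name(self, index):
        # Hypothetical naming scheme: image name plus a hex object index.
        return f"{self.name}.{index:016x}"

    def _check(self, offset, length):
        if offset < 0 or length < 0 or offset + length > self.size_bytes:
            raise ValueError("access outside image")

    def write(self, offset, data):
        self._check(offset, len(data))
        pos = 0
        while pos < len(data):
            index, within = divmod(offset + pos, self.object_size)
            chunk = data[pos:pos + self.object_size - within]
            obj = self.objects.setdefault(
                self._object_name(index), bytearray(self.object_size)
            )
            obj[within:within + len(chunk)] = chunk
            pos += len(chunk)

    def read(self, offset, length):
        self._check(offset, length)
        out = bytearray()
        pos = 0
        while pos < length:
            index, within = divmod(offset + pos, self.object_size)
            n = min(self.object_size - within, length - pos)
            obj = self.objects.get(self._object_name(index))
            # Regions that were never written read back as zeros.
            out += obj[within:within + n] if obj is not None else bytes(n)
            pos += n
        return bytes(out)

    def allocated_bytes(self):
        return len(self.objects) * self.object_size
```

A freshly created image of any size reports zero allocated bytes, and each object it does create is an ordinary RADOS-style object that can land in a different PG, which is where the striping benefit comes from.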

Red Hat Storage Day Boston - Why Software-defined Storage Matters

CephFS is a POSIX-compliant clustered filesystem implemented on top of RADOS. This is very elegant because the lower layer features of the stack provide really awesome filesystem features (such as snapshotting), while the CephFS layer just needs to translate that into a usable filesystem.

CephFS isn’t considered ready for prime-time just yet, but RADOS and RBD are.

Kraken Ceph Dashboard

More Information:

Apache: Big Data North America 2016   https://www.youtube.com/watch?v=hTfIAWhd3qI&list=PLGeM09tlguZQ3ouijqG4r1YIIZYxCKsLp

DISTRIBUTED STORAGE PERFORMANCE FOR OPENSTACK CLOUDS: RED HAT STORAGE SERVER VS. CEPH STORAGE   http://docplayer.net/2905788-Distributed-storage-performance-for-openstack-clouds-red-hat-storage-server-vs-ceph-storage.html

Red Hat Announces Ceph Storage 2  http://www.storagereview.com/red_hat_announces_ceph_storage_2

Red Hat Ceph Storage

