OCFS2 - Oracle Cluster File System for Linux
OCFS2 (Oracle Cluster File System 2) is a free, open source,
general-purpose, extent-based clustered file system which Oracle developed
and contributed to the Linux community; it was accepted into the mainline
Linux kernel in version 2.6.16.
In addition to Oracle Linux (covered in the table below), OCFS2 is available in SUSE Linux Enterprise Server (where
it is the primary supported shared cluster file system), CentOS, Debian
GNU/Linux, and Ubuntu Server Edition. Oracle also provides packages for Red
Hat Enterprise Linux (RHEL).
Building on Oracle's long-term commitment to the Linux and open source
community, OCFS2 provides an open source, enterprise-class alternative to
proprietary cluster file systems, and provides both high performance and high
availability. OCFS2 provides local file system semantics, so it can be used
with any application. Cluster-aware applications can leverage cache-coherent
parallel I/O for higher performance, while other applications can use the
file system in a fail-over configuration to increase availability.
Most Complete, Open, Integrated Enterprise Software Stack for Linux
OCFS2 is among the many key Oracle value-added
technologies in Oracle Linux, and Oracle VM uses OCFS2 as its
cluster filesystem to host virtual machine images, as well as the OCFS2
heartbeat to handle high availability. Only Oracle delivers the most
complete, open, integrated enterprise software stack for Linux, including
database, middleware, applications, and virtualization.
The significance of this contribution, and how widely it is appreciated, is
reflected in this comment from Google:
"This is a vital contribution to the open source
community. The endorsement of OCFS2 by the Linux community represents a
significant milestone for Oracle and demonstrates how Oracle's continued
contributions are driving adoption of open source technologies." -
Andrew Morton, Linux Kernel Maintainer, Google
Availability/Support From Oracle

Oracle Linux 6.0: ocfs2 version 1.6.3 (using the Unbreakable Enterprise Kernel); ocfs2-tools version 1.6.4 (default)

Oracle Linux 5.6: ocfs2 version 1.6.3 (using the Unbreakable Enterprise Kernel) or ocfs2 version 1.4.8 (alternative kernel); ocfs2-tools version 1.6.3 (default)
Technical Specifications (OCFS2 version 1.6)

Supported Platforms: x86, x86_64
Networking: TCP/IP
Supported Operating Systems: Oracle Linux, RHEL (OCFS2 1.4.8 only)
Block Size: 512 bytes to 4KB
Cluster Size: 4KB to 1MB
Max File Size: 16TB
Max Number of Subdirectories: 32,000
Max Number of Addressable Clusters: 2^32 (a future release will increase the limit to 2^64)
Max Filesystem Size: 4PB* (with a 1MB cluster size)
Version Compatibility

OCFS2 strives for backwards compatibility with older versions. OCFS2 Release 1.6 is fully compatible with OCFS2 Release 1.4 or 1.2: a node running the new release can join a cluster of nodes running the older file system.

OCFS2 Release 1.6 is on-disk compatible with OCFS2 Release 1.2. However, not all new features are activated by default; users can enable and disable features as needed using tunefs.ocfs2. The latest version of ocfs2-tools supports all existing versions of the file system.
OCFS2 Features
Variable Block and Cluster Sizes
Supports block sizes ranging from 512 bytes to 4KB and cluster sizes ranging from 4KB to 1MB.
Extent-based Allocations
Tracks allocated space in ranges of clusters, making it especially efficient for storing large files.
Metadata Checksums
Ensures integrity by detecting silent corruption in metadata objects such as inodes and directories. The error correction code can fix single-bit errors automatically.
Extended Attributes
Supports attaching an unlimited number of name:value pairs to file system objects such as regular files, directories, and symbolic links.
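Because OCFS2 exposes extended attributes through the standard Linux xattr system calls, a minimal sketch with Python's os.setxattr/os.getxattr can illustrate the interface. The attribute name user.comment is purely illustrative, and the same calls work on any xattr-capable file system, not just OCFS2:

```python
import os
import tempfile

def set_and_read_xattr(path, name, value):
    """Attach a name:value pair to a file and read it back.

    Uses the generic Linux xattr syscalls; on an OCFS2 volume the
    attribute is stored by the file system like any other metadata.
    """
    os.setxattr(path, name, value)
    return os.getxattr(path, name)

with tempfile.NamedTemporaryFile() as f:
    try:
        # User-namespace attributes need a file system mounted with
        # xattr support (OCFS2 1.6, ext4, xfs, ...).
        result = set_and_read_xattr(f.name, "user.comment", b"hello ocfs2")
    except OSError:
        # Some file systems reject user.* attributes entirely.
        result = None
```

If the underlying file system rejects user-namespace attributes, the call fails with OSError rather than silently dropping the attribute.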
Advanced Security
Supports POSIX ACLs (Access Control Lists) and SELinux attributes in addition to the traditional file access permission/ownership model.
Quotas
Supports user and group quotas.
File Snapshots (REFLINK)
This feature allows a regular user to create multiple writable snapshots of regular files. The snapshot is a point-in-time image of the file that includes both the file data and all its attributes (including extended attributes). The file system creates a new inode with the same extent pointers as the original inode, so multiple inodes can share data extents. As a result, creating a REFLINK snapshot initially requires very little space; it grows only when the snapshot is modified, using a copy-on-write mechanism. REFLINK works across the cluster.
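As a sketch of the shared-extent cloning idea, the snippet below uses the generic FICLONE ioctl that current Linux kernels expose for reflinks. Note this is an assumption for illustration: OCFS2 1.6 shipped its own reflink tooling, and this generic interface came later. On a file system without reflink support the ioctl simply fails:

```python
import fcntl
import os
import tempfile

FICLONE = 0x40049409  # _IOW(0x94, 9, int): the generic clone/reflink ioctl

def try_reflink(src_path, dst_path):
    """Attempt a reflink (shared-extent) copy of src into dst.

    Returns True on success, False if the underlying file system
    does not support reflinks (the ioctl raises OSError).
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            return False

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "original")
    dst = os.path.join(d, "snapshot")
    with open(src, "wb") as f:
        f.write(b"x" * 4096)  # one block of data to share
    cloned = try_reflink(src, dst)
```

When the clone succeeds, the two files share data extents, and a later write to either file triggers copy-on-write, exactly as described above.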
Journaling
Supports both ordered and writeback data journaling modes to provide file system consistency in the event of power failure or system crash.

Ordered Journal Mode
This default journal mode (mount option data=ordered) forces the file system to flush file data to disk before committing the corresponding metadata. This flushing ensures that data written to newly allocated regions will not be lost due to a file system crash. While this feature removes the small probability of stale or null data appearing in a file after a crash, it does so at the expense of some performance. Users can revert to the older journal mode by mounting with the data=writeback mount option. It should be noted that file system metadata integrity is preserved by both journaling modes.
In-built Cluster Stack with DLM
Includes an easy-to-configure, in-kernel cluster stack with a distributed lock manager.
Buffered, Direct, Asynchronous, Splice and Memory-Mapped I/O
Supports all modes of I/O for maximum flexibility and performance.
Large Inodes
Block-sized inodes allow the file system to store small files in the inode itself.
Comprehensive Tools Support
Provides a familiar EXT3-style toolset that uses similar parameters for ease of use. The toolset is cluster-aware: it prevents users from formatting a volume from one node if the volume is in use on another node. Other tools, such as tunefs.ocfs2, detect active volumes and allow only operations that can be performed on a live volume.
Performance Enhancements
Enhances performance by either reducing the number of I/Os or by performing them asynchronously.

Indexed Directories - Allows quick lookups of a directory entry in a very large directory. It also results in faster creates and unlinks, and thus better overall performance.

Directory Readahead - Directory operations asynchronously read blocks that may be accessed in the future.

File Lookup - Improves cold-cache stat(2) times by cutting the required disk I/O in half.

File Remove and Rename - Replaces broadcast file system messages with DLM locks for unlink(2) and rename(2) operations. This improves node scalability, as the number of messages does not grow with the number of nodes in the cluster.
Splice I/O
Adds support for the splice(2) system call, which allows efficient copying between file descriptors by moving the data inside the kernel.
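A minimal sketch of splice(2) in action, assuming Python 3.10+ for os.splice (with a plain read/write fallback otherwise). Because splice requires a pipe on one side, the copy goes file, then pipe, then file, with the data never entering user space:

```python
import os
import tempfile

def splice_copy(src_fd, dst_fd, count):
    """Copy count bytes between two file descriptors through a pipe.

    splice(2) moves the pages inside the kernel; on interpreters
    without os.splice (pre-3.10) an ordinary read/write is used.
    """
    if hasattr(os, "splice"):
        r, w = os.pipe()
        try:
            remaining = count
            while remaining:
                moved = os.splice(src_fd, w, remaining)  # file -> pipe
                if moved == 0:
                    break  # hit EOF on the source
                left = moved
                while left:
                    left -= os.splice(r, dst_fd, left)   # pipe -> file
                remaining -= moved
        finally:
            os.close(r)
            os.close(w)
    else:
        os.write(dst_fd, os.read(src_fd, count))

with tempfile.TemporaryFile() as src, tempfile.TemporaryFile() as dst:
    src.write(b"hello splice")
    src.flush()
    src.seek(0)
    splice_copy(src.fileno(), dst.fileno(), 12)
    dst.seek(0)
    copied = dst.read()
```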
Access Time Updates
Access times are now updated consistently and are propagated throughout the cluster. Since such updates can have a negative performance impact, the file system allows users to tune the behavior via the following mount options:

atime_quantum=<seconds> - An OCFS2-specific option: the atime is updated only if the existing atime is at least this many seconds old (the default is 60).

noatime - This standard mount option turns off atime updates completely.

relatime - Another standard mount option (added in Linux v2.6.20) supported by OCFS2. Relative atime only updates the atime if the previous atime is older than the mtime or ctime. This is useful for applications that only need to know whether a file has been read since it was last modified.

Additionally, all time updates in the file system have nanosecond resolution.
Flexible Allocation
The file system now supports advanced features that give users more control over file data allocation. These features entail an on-disk change.

Sparse File Support - Adds the ability to support holes in files, which allows the ftruncate(2) system call to extend files efficiently. The file system can postpone allocating space until the user actually writes to those clusters.

Unwritten Extents - Lets an application request that a range of clusters be pre-allocated, but not initialized, within a file. Pre-allocation allows the file system to optimize the data layout with fewer, larger extents. It also provides a performance boost by delaying initialization until the user writes to the clusters. Users can access this feature via an ioctl(2), or via fallocate(2) on current kernels.

Punching Holes - Lets an application remove arbitrary allocated regions within a file, essentially creating holes. This can be more efficient when a user can avoid zeroing the data. Users can access this feature via an ioctl(2), or via fallocate(2) on later kernels.

Discontiguous Block Group - Allows the space allocated for inodes to grow in smaller, variable-sized chunks.
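Sparse files and pre-allocation are visible through generic system calls, so a small sketch can show the difference between a file's apparent size and its on-disk allocation. This uses ftruncate(2) (via file.truncate) and posix_fallocate(2); hole punching needs fallocate(2) flags that Python's standard library does not wrap, so it is omitted here:

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as f:
    # Extend the file to 10 MB without writing data: on a file system
    # with sparse-file support (OCFS2, ext4, xfs, ...) the hole
    # consumes no clusters until it is actually written.
    f.truncate(10 * 1024 * 1024)
    st = os.stat(f.name)
    apparent_size = st.st_size        # logical size: 10 MB
    allocated = st.st_blocks * 512    # on-disk allocation in bytes

    # Pre-allocate 1 MB at the start of the file. The file system can
    # satisfy this reservation with fewer, larger (unwritten) extents.
    os.posix_fallocate(f.fileno(), 0, 1024 * 1024)
    allocated_after = os.stat(f.name).st_blocks * 512
```

On a sparse-capable file system, `allocated` stays far below `apparent_size`, and the fallocate call raises the allocation without initializing the data.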
Shared Writable mmap(2)
Shared writable memory mappings are now fully supported on OCFS2, and they are cluster coherent: processes on different nodes can mmap(2) the same file and write to the memory region, fully expecting the writes to transparently show up on the other nodes.
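A single-node sketch of a shared writable mapping: writes made through the MAP_SHARED mapping become visible when the file is read back through its descriptor. On OCFS2 the same coherence extends across cluster nodes:

```python
import mmap
import os
import tempfile

with tempfile.NamedTemporaryFile() as f:
    # The file must be at least as large as the mapping.
    f.write(b"\0" * mmap.PAGESIZE)
    f.flush()

    # MAP_SHARED writes land in the shared page cache and reach the
    # file; OCFS2's cluster locking keeps such pages coherent so a
    # process on another node would observe the same bytes.
    with mmap.mmap(f.fileno(), mmap.PAGESIZE, mmap.MAP_SHARED) as m:
        m[0:5] = b"hello"
        m.flush()

    f.seek(0)
    data = f.read(5)
```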
Inline Data
This feature makes use of OCFS2's large inodes by storing the data of small files and directories in the inode block itself. This saves space and can have a positive impact on cold-cache directory and file operations. Data is transparently moved out to an extent when it no longer fits inside the inode block. This feature entails an on-disk change.
Online File System Resize
Users can now grow the file system without having to unmount it. This feature requires a compatible clustered logical volume manager; compatible volume managers will be announced when support is available.
Clustered flock(2)
The flock(2) system call is now cluster-aware: file locks taken on one node from user space interact with those taken on other nodes. All flock(2) options are supported, including the kernel's ability to cancel a lock request when an appropriate kill signal is received.
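The flock(2) semantics being extended cluster-wide are the familiar single-node ones, sketched here with Python's fcntl.flock: a second open file description cannot take a conflicting exclusive lock without blocking. On OCFS2 the same conflict would be observed from another node in the cluster:

```python
import fcntl
import tempfile

with tempfile.NamedTemporaryFile() as f:
    # The first open file description takes an exclusive lock...
    holder = open(f.name, "rb")
    fcntl.flock(holder, fcntl.LOCK_EX)

    # ...so a second one cannot acquire it without blocking: the
    # non-blocking attempt fails with EWOULDBLOCK.
    contender = open(f.name, "rb")
    try:
        fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
        conflict = False
    except BlockingIOError:
        conflict = True

    fcntl.flock(holder, fcntl.LOCK_UN)
    contender.close()
    holder.close()
```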
Endian and Architecture Neutral
Supports a cluster of nodes with mixed architectures. Allows concurrent mounts on 32-bit and 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64) architectures.
Linux Community Adoption
OCFS2 has been ported to many architectures, including ppc64, ia64, and s390x, and is integrated with many Linux distributions, including SLES, Ubuntu, openSUSE, Fedora Core, and Debian.
* Theoretical maximum. File systems up to 16TB have been tested.