13 September 2011

RAC Cache Fusion (All Rights Oracle)

Global Cache and Global Enqueue Service
Processes and Functions

Cache Fusion uses the most efficient communication possible to limit the amount of traffic on the interconnect. You don't need this level of detail to administer a RAC environment, but it sure helps to understand how RAC works when trying to diagnose problems. RAC appears to have one large buffer cache, but this is not the case: in reality the buffer caches of each node remain separate, and data blocks are shared through distributed locking and messaging operations. RAC copies data blocks across the interconnect to other instances because it is more efficient than reading from disk; yes, memory and networking together are faster than disk I/O.


The transfer of a data block from one instance's buffer cache to another instance's buffer cache is known as a ping. As mentioned already, when an instance requires a data block it sends a request to the lock master to obtain a lock in the desired mode; the notification that the instance currently holding the lock receives as a result is known as a blocking asynchronous trap (BAST). When an instance receives a BAST it downgrades the lock as soon as possible; however, it might first have to write the corresponding block to disk, an operation known as a disk ping or hard ping. Disk pings have been reduced in later versions of RAC, which rely more on block transfers, but there will always be a small amount of disk pinging. In the newer versions of RAC, when a BAST is received, sending the block or downgrading the lock may be deferred by tens of milliseconds. This extra time allows the holding instance to complete an active transaction and mark the block header appropriately, which removes the need for the receiving instance to check the status of the transaction immediately after receiving/reading the block. Checking the status of a transaction is an expensive operation that may require access to (and pinging of) the related undo segment header and undo data blocks as well. The parameter _gc_defer_time defines how long an instance defers downgrading a lock.

Past Image Blocks (PI)
Past Images (PIs) are copies of data blocks kept in the local buffer cache of an instance. When an instance sends a block it has recently modified to another instance, it preserves a copy of that block, marking it as a PI. The PI is kept until the block is written to disk by the current owner of the block. When the block is written to disk and is known to have a global role, indicating the presence of PIs in other instances' buffer caches, GCS informs the instances holding the PIs to discard them. When a checkpoint is required, the instance informs GCS of the write requirement; GCS is responsible for finding the most current block image and informing the instance holding that image to perform the block write. GCS then informs all holders of the global resource that they can release the buffers holding the PI copies of the block, allowing the global resource to be released. You can view the past image blocks present in the fixed table X$BH.
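
The PI lifecycle described above can be sketched as a small toy model. This is only an illustration of the bookkeeping, not how Oracle implements it; the class and method names (ToyGCS, ship_block, write_block) are hypothetical, and the model tracks a single block.

```python
class ToyGCS:
    """Toy model of past-image (PI) bookkeeping for one data block."""

    def __init__(self, owner):
        self.owner = owner        # instance holding the current image
        self.pi_holders = set()   # instances keeping a PI of the block
        self.dirty = False        # modified since the last disk write?

    def modify(self):
        """The current owner changes the block in its buffer cache."""
        self.dirty = True

    def ship_block(self, to_instance):
        """The current owner sends the block to another instance."""
        if self.dirty:
            # The sender preserves a copy of the modified block as a PI.
            self.pi_holders.add(self.owner)
        self.owner = to_instance

    def write_block(self):
        """The current owner writes the block to disk.

        Returns the instances that are told to discard their PIs.
        """
        told_to_discard = sorted(self.pi_holders)
        self.pi_holders.clear()
        self.dirty = False
        return told_to_discard


gcs = ToyGCS(owner="A")
gcs.modify()             # A modifies the block
gcs.ship_block("B")      # A keeps a PI, B now owns the current image
gcs.modify()             # B modifies the block
gcs.ship_block("C")      # B keeps a PI, C now owns the current image
print(gcs.write_block()) # ['A', 'B']
```

In this sketch the PIs at A and B survive until C, the current owner, writes the block, which mirrors the rule that a PI is kept until the block reaches disk.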

Cache Fusion I

Cache Fusion I is also known as the consistent read server and was introduced in Oracle 8.1.5. It keeps a list of recent transactions that have changed a block. The original data contained in the block is preserved in the undo segment, which can be used to provide consistent read versions of the block.

In a single instance the following happens when reading a block

·         When a reader reads a recently modified block, it might find an active transaction in the block
·         The reader will need to read the undo segment header to decide whether the transaction has been committed or not
·         If the transaction is not committed, the process creates a consistent read (CR) version of the block in the buffer cache using the data in the block and the data stored in the undo segment
·         If the undo segment shows the transaction is committed, the process has to revisit the block, clean it out (delayed block cleanout) and generate the redo for the changes.
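
The single-instance decision steps above can be sketched as follows. This is a simplified illustration, not Oracle code: the Block class, the read_block function and the two dictionaries standing in for the undo segment header and undo data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Block:
    value: str                    # current contents of the block
    txn_id: Optional[str] = None  # transaction still marked in the block, if any


def read_block(block: Block,
               undo_header: Dict[str, bool],
               undo_data: Dict[str, str]) -> Tuple[str, str]:
    """Return (value seen by the reader, action taken).

    undo_header maps a transaction id to True if it has committed.
    undo_data maps a transaction id to the original (pre-change) value.
    """
    if block.txn_id is None:
        return block.value, "read as-is"
    if undo_header[block.txn_id]:
        # Committed: clean out the block so later readers skip this check.
        block.txn_id = None
        return block.value, "delayed block cleanout"
    # Not committed: build a consistent read (CR) version from undo.
    return undo_data[block.txn_id], "built CR copy from undo"


blk = Block(value="new", txn_id="T1")
print(read_block(blk, {"T1": False}, {"T1": "old"}))  # ('old', 'built CR copy from undo')
print(read_block(blk, {"T1": True}, {"T1": "old"}))   # ('new', 'delayed block cleanout')
```

Real block cleanout also generates redo, which the sketch leaves out.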

In a RAC environment, if the process reading the block is on an instance other than the one that modified the block, the reader would have to read the following blocks from disk

·         data block to get the data and/or transaction ID and Undo Byte Address (UBA)
·         undo segment header block to find the last undo block used for the entire transaction
·         undo data block to get the actual record to construct a CR image

Before these blocks can be read, the instance that modified the block has to write those blocks to disk, resulting in 6 I/O operations (three writes followed by three reads). In RAC, the holding instance can instead construct a CR copy using the above blocks, provided they are still in memory, and send the CR copy over the interconnect, thus avoiding the 6 I/O operations.
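
The count of six can be checked with a few lines of arithmetic. The sketch below only restates the reasoning in the text; the function names are made up for illustration.

```python
BLOCKS_NEEDED = [
    "data block",
    "undo segment header block",
    "undo data block",
]


def disk_io_via_disk(blocks):
    """Disk I/Os when the reader must fetch everything from disk."""
    writes = len(blocks)  # the modifying instance writes each block out
    reads = len(blocks)   # the reading instance reads each block back
    return writes + reads


def disk_io_via_cr_copy():
    """Disk I/Os when the holder builds the CR copy from its own cache.

    Assumes all needed blocks are still in the holder's buffer cache,
    so the only cost is one transfer over the interconnect.
    """
    return 0


print(disk_io_via_disk(BLOCKS_NEEDED))  # 6
print(disk_io_via_cr_copy())            # 0
```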

From Oracle 8 onwards, a new background process called the Block Server Process performs the CR fabrication in the holder's cache and ships the CR version of the block across the interconnect; the sequence is detailed in the table below

While making a CR copy, the holding instance may refuse to do so if

·         Any of the blocks needed is missing from its buffer cache; it will not perform a disk read to make a CR copy for another instance
·         It is repeatedly asked to send a CR copy of the same block; after sending the CR copies four times it will voluntarily relinquish the lock, write the block to disk and let other instances get the block from disk. The number of copies it will serve before doing so is governed by the parameter _fairness_threshold
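
The two refusal rules can be sketched as a counter per block. This is a toy model under stated assumptions: the ToyHolder class is hypothetical, the default of 4 is taken from the text above as a stand-in for _fairness_threshold, and resetting the counter after the lock is relinquished is a simplification of my own.

```python
FAIRNESS_THRESHOLD = 4  # stand-in for _fairness_threshold, value from the text


class ToyHolder:
    """Toy model of an instance asked to build CR copies for others."""

    def __init__(self, cached_blocks, threshold=FAIRNESS_THRESHOLD):
        self.cached = set(cached_blocks)  # blocks in the local buffer cache
        self.threshold = threshold
        self.cr_served = {}               # block id -> CR copies sent so far

    def request_cr(self, block_id, needed_blocks):
        # Rule 1: never read from disk just to build a CR copy for someone else.
        if not set(needed_blocks) <= self.cached:
            return "refused: requester must read from disk"
        served = self.cr_served.get(block_id, 0)
        # Rule 2: after serving `threshold` copies, give up the lock instead.
        if served >= self.threshold:
            self.cr_served[block_id] = 0  # assumption: counting starts over
            return "lock relinquished: block written to disk"
        self.cr_served[block_id] = served + 1
        return "CR copy sent"


needed = ["data", "undo_header", "undo_data"]
holder = ToyHolder(cached_blocks=needed)
for _ in range(5):
    print(holder.request_cr("data", needed))
```

With the default threshold, the first four requests each get a CR copy and the fifth causes the holder to relinquish the lock.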

Cache Fusion II

Read/write contention was addressed in Cache Fusion I; Cache Fusion II addresses write/write contention.

Cache Fusion in Operation
A quick recap of GCS: a GCS resource can be local or global. If it is local, it can be acted upon without consulting other instances; if it is global, it cannot be acted upon without consulting or informing remote instances. GCS is used as a messaging agent to coordinate manipulation of a global resource. By default all resources are in NULL mode (remember null mode is used to convert from one type to another (share or exclusive)).

The table below denotes the different states of a resource

Modes: Null (N), Shared (S), Exclusive (X)

States (mode followed by role, L = local, G = global):

·         SL: it can serve a copy of the block to other instances and it can read the block from disk; since the block is not modified there is no need to write to disk
·         XL: it has sole ownership and interest in that resource, it has the exclusive right to modify the block, all changes to the block are in the local buffer cache and it can write the block to disk. If another instance wants the block it has to come via the GCS
·         NL: used to protect consistent read blocks; if an instance wants the block in X mode, the current instance will send the block to the requesting instance and downgrade its role to NL
·         SG: a block is present in one or more instances; an instance can read the block from disk and serve it to other instances
·         XG: a block can have one or more PIs; the instance with the XG role has the latest copy of the block and is the most likely candidate to write the block to disk. GCS can ask the instance to write the block and serve it to other instances
·         NG: after discarding PIs when instructed to by GCS, the block is kept in the buffer cache with the NG role; this serves only as the CR copy of the block.

Below are a number of common scenarios to help understand the following

·         reading from disk
·         reading from cache
·         getting the block from cache for update
·         performing an update on a block
·         performing an update on the same block
·         reading a block that was globally dirty
·         performing a rollback on a previously updated block
·         reading the block after commit

We will assume the following

·         A four-instance RAC environment (instances A, B, C and D)
·         Instance D is the master of the lock resource for the data block BL
·         We will only use one block and it will reside at SCN 987654
·         We will use a three-letter code for the lock states
o    The first letter indicates the lock mode - N = Null, S = Shared and X = Exclusive
o    The second letter indicates the lock role - G = Global, L = Local
o    The third letter indicates the PIs - 0 = no PIs, 1 = a PI of the block

For example, a code of SL0 means a local shared lock with no past images (PIs)
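
The three-letter code is easy to decode mechanically. The helper below is a hypothetical convenience for reading the scenarios, not an Oracle API; it follows the letter definitions given above.

```python
MODES = {"N": "Null", "S": "Shared", "X": "Exclusive"}
ROLES = {"L": "Local", "G": "Global"}


def decode_lock_state(code):
    """Decode a three-letter lock state such as 'SL0' or 'XG1'."""
    if len(code) != 3:
        raise ValueError(f"expected three characters, got {code!r}")
    mode, role, pi = code
    if mode not in MODES or role not in ROLES or pi not in ("0", "1"):
        raise ValueError(f"invalid lock state {code!r}")
    return {
        "mode": MODES[mode],   # first letter: lock mode
        "role": ROLES[role],   # second letter: lock role
        "has_pi": pi == "1",   # third letter: past image present?
    }


print(decode_lock_state("SL0"))  # {'mode': 'Shared', 'role': 'Local', 'has_pi': False}
print(decode_lock_state("XG1"))  # {'mode': 'Exclusive', 'role': 'Global', 'has_pi': True}
```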

The above sequence of events can be seen in the table below

 (Last overview picture of all RAC processes, all rights Julian Dyke)

