Abstract

Since the advent of ext3 over a decade ago, the journalling layer (jbd) has been a part of the Linux kernel. It helps filesystems maintain consistency by applying complex filesystem operations in the form of atomic transactions, which are guaranteed either to complete fully or not at all. Thus at any point in time the filesystem is consistent, even after crashes or unclean unmounts. In this article we walk through the internals of jbd and see how filesystem consistency is achieved using the jbd layer. Linux kernel 3.0.0 is used as the reference source code, taking ext3 as an example of a filesystem using jbd wherever required.

Terminology

Journal handle - Represents a single atomic filesystem operation. It tracks all the buffer modifications done as part of one atomic operation. Here atomicity is defined as all the operations required to move the filesystem from one consistent state to another.
Transaction - A collection of atomic operations which guarantees filesystem consistency. It can consist of a single journal handle or multiple handles batched together for efficiency.
Transaction commit - Writing the in-memory contents of a transaction to the appropriate blocks in the journal, followed by a commit record on disk in the journal.
Transaction checkpoint - Flushing the contents of the journal to their actual locations on disk. Since the journal is a circular array of a fixed number of blocks, this is done periodically to make journal space reusable.

Journal overview

Most of the time, a single filesystem operation consists of multiple sub-operations. For example, a write involves allocating new blocks, updating the inode contents, updating block accounting and so on. A crash in between any of these sub-operations leaves a half-baked write and the filesystem in an inconsistent state. In order to ensure that the filesystem is consistent at every point in time, it is important that the changes done as part of these sub-operations aren't reflected in the filesystem until all of them are complete. Journalling filesystems use a separate area known as the journal (which can either reside in the same filesystem or on an external device) where these modifications are recorded while the operations are in progress. Only when it is guaranteed that all the sub-operations have completed are these changes replayed to their original location on disk. So even if the filesystem crashes in the middle of the sub-operations, the filesystem stays consistent.

The journal is simply a fixed-size circular array of blocks used to store the filesystem modifications. Each modification is tagged with a transaction id and stored in the journal, where it can be identified by start and end markers. The first block in the journal contains the journal superblock (journal_superblock_t), which maintains various accounting and metadata information about the rest of the journal. See the references for a detailed journal layout.
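The write-ahead idea above can be condensed into a few lines of code. The following is a minimal, self-contained userspace sketch (not kernel code) of a fixed-size circular journal: modified blocks are first logged to the journal and only then copied to their home locations. Every structure and name in it is made up for illustration.

/*
 * Minimal userspace sketch of write-ahead logging with a circular journal.
 * All structures and names here are illustrative, not jbd's.
 */
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE    16
#define JOURNAL_SLOTS 8
#define FS_BLOCKS     32

struct journal_slot {
    int  home_block;               /* where the data really belongs   */
    char data[BLOCK_SIZE];         /* logged copy of the new contents */
};

static struct journal_slot journal[JOURNAL_SLOTS];
static int journal_head_idx;                 /* next free slot (circular) */
static char disk[FS_BLOCKS][BLOCK_SIZE];     /* the "real" filesystem     */

/* Log one modified block into the circular journal. */
static void journal_log_block(int home_block, const char *new_data)
{
    struct journal_slot *slot = &journal[journal_head_idx];

    slot->home_block = home_block;
    strncpy(slot->data, new_data, BLOCK_SIZE);
    journal_head_idx = (journal_head_idx + 1) % JOURNAL_SLOTS;
}

/* Checkpoint: copy logged blocks to their home locations, freeing journal space. */
static void journal_checkpoint(int nr_logged)
{
    for (int i = 0; i < nr_logged; i++) {
        int idx = (journal_head_idx - nr_logged + i + JOURNAL_SLOTS) % JOURNAL_SLOTS;
        struct journal_slot *slot = &journal[idx];

        memcpy(disk[slot->home_block], slot->data, BLOCK_SIZE);
    }
}

int main(void)
{
    /* One "atomic" operation touching two blocks: log both, then checkpoint. */
    journal_log_block(5,  "new inode data");
    journal_log_block(17, "new bitmap data");
    /* A real journal writes a commit record here before checkpointing. */
    journal_checkpoint(2);

    printf("block 5:  %s\nblock 17: %s\n", disk[5], disk[17]);
    return 0;
}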

Anatomy of a transaction

Journalling a filesystem operation broadly consists of the following three steps:
  • Starting a handle - journal_start(): As part of starting a journal handle, we need to specify the maximum number of filesystem blocks this operation can potentially modify. This is required to ensure that there will be enough space in the journal to completely write all the buffers which will be modified as part of this operation. The number of blocks required is the total number of blocks, including the data blocks which are going to change, metadata blocks, quota blocks if any, and so on. As an example, see EXT3_DATA_TRANS_BLOCKS. These blocks are called the buffer credits for the handle.
  • Update the handle: After getting a handle, the next step is to associate the modified blocks with the journal handle, so that the journal knows it has to write these blocks to the journal. This is done via the API journal_get_write_access(handle, bh), which tells the journal that this buffer is going to be modified. A buffer which is of interest to the journalling layer has BH_JBD set on it and has a non-zero b_count. At this point a journal_head is attached to the buffer. A journal_head can only be part of one transaction. A buffer is a "journalled" buffer only if it has a journal_head attached to it.
    journal_get_write_access(handle, bh) {
        jh = journal_add_journal_head(bh);   /* attach (or find) the journal_head */
        do_get_write_access(handle, jh, 0);  /* file the buffer on the transaction */
        journal_put_journal_head(jh);        /* drop the extra b_jcount reference */
    }

    journal_add_journal_head(bh) {
        jh = journal_alloc_journal_head();
        set_buffer_jbd(bh);                  /* mark buffer as journalled (BH_JBD) */
        bh->b_private = jh;                  /* link buffer_head and journal_head */
        jh->b_bh = bh;
        jh->b_jcount++;
    }
    A buffer is already part of a transaction if its journal_head's b_transaction or b_next_transaction is set. Most of the time only b_transaction is set. b_next_transaction will be set in case the buffer is being committed by the previous transaction while we are modifying it for the current transaction; it tells the journal that this buffer is going into the next transaction. In this case a copy-on-write is performed and the frozen copy is stored in jh->b_frozen_data. NB: a buffer's b_transaction will only be set if it is part of the running or committing transaction, and not if it resides on some other list like the checkpoint list. Metadata buffers are handled via the journal_dirty_metadata() call, which sets b_modified to 1, removes the buffer from the BJ_Reserved list, and puts it on the metadata buffer list (t_buffers).
  • Stop the handle - journal_stop(): As the name suggests, journal_stop() marks the completion of an operation with respect to the journal. It returns any left-over unused buffer credits to the transaction, drops the appropriate references and frees the handle pointer. If the filesystem requested this operation in sync mode, we also need to start committing the transaction to the journal on completion of the handle. However, the current code has some optimizations built around this to figure out whether it is beneficial to start writing to disk immediately or, based on the operation rate, to wait for some time and let a subsequent operation do it. A minimal usage sketch of this sequence follows the list.
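The three steps above line up with a small sequence of jbd calls. Below is a hedged sketch of how a filesystem such as ext3 might wrap one metadata update in a handle; example_update_metadata() and the credit count are illustrative, and error handling is trimmed. This is kernel-context code, shown only to make the call sequence concrete.

/*
 * Hedged sketch (kernel context): wrapping one metadata update in a jbd
 * handle.  example_update_metadata() and the credit count are illustrative.
 */
#include <linux/jbd.h>
#include <linux/err.h>

static int example_update_metadata(journal_t *journal, struct buffer_head *bh)
{
    handle_t *handle;
    int err;

    /* Reserve enough journal space for every block this op may touch. */
    handle = journal_start(journal, 1 /* buffer credits */);
    if (IS_ERR(handle))
        return PTR_ERR(handle);

    /* Tell jbd this buffer is about to be modified (attaches a journal_head). */
    err = journal_get_write_access(handle, bh);
    if (err)
        goto out;

    /* ... modify bh->b_data here ... */

    /* File the buffer on the running transaction's metadata list (t_buffers). */
    err = journal_dirty_metadata(handle, bh);
out:
    /* Return unused credits and drop the handle; the commit happens later. */
    journal_stop(handle);
    return err;
}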
Each journal_start/journal_stop pair, i.e. each handle, constitutes one atomic filesystem operation. Some filesystem operations may be atomic in themselves but still not sufficient to leave the filesystem in a consistent state. An example of such an operation is a write which requires a quota update. Nested journal handles are required to make such an operation atomic. For such operations, the typical journalling sequence (sketched in code after this list) would be:
  • Start journal handle for write.
  • Start journal handle for quota update.
  • Stop journal handle for quota update.
  • Stop journal handle for write.
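A sketch of the nested sequence above, again in kernel context: with jbd, a nested journal_start() on the same journal simply returns the already-active handle with its reference count bumped (and its credit argument ignored), so only the final journal_stop() completes the atomic operation. The example_* helpers are hypothetical placeholders for the filesystem's own journalled updates.

/* Kernel-context sketch of nested handles for a write plus quota update. */
#include <linux/jbd.h>
#include <linux/err.h>

/* Hypothetical placeholders for the filesystem's own journalled updates. */
static void example_do_write(handle_t *handle)     { (void)handle; /* ... */ }
static void example_update_quota(handle_t *handle) { (void)handle; /* ... */ }

static int example_write_with_quota(journal_t *journal)
{
    handle_t *outer, *inner;

    /* Outer handle reserves credits for both the write and the quota update. */
    outer = journal_start(journal, 20);
    if (IS_ERR(outer))
        return PTR_ERR(outer);

    example_do_write(outer);

    /* Nested start: returns the same handle with h_ref bumped;
     * the credit argument is ignored for a nested start. */
    inner = journal_start(journal, 2);
    example_update_quota(inner);
    journal_stop(inner);              /* only drops the extra reference */

    journal_stop(outer);              /* last stop: the op can now be committed */
    return 0;
}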
It is only after the last journal_stop that the transaction can be committed to disk.

A journal transaction consists of various lists on which buffers of interest can reside. Buffers end up on one of these lists depending on which flag/state they carry. Below is the mapping from buffer flag to transaction list. See the function __journal_file_buffer() to see how buffers are moved across lists.

BJ_SyncData => transaction->t_sync_datalist
BJ_Metadata => transaction->t_buffers (place for metadata buffers)
BJ_Forget => transaction->t_forget
BJ_IO => transaction->t_iobuf_list
BJ_Shadow => transaction->t_shadow_list
BJ_LogCtl => transaction->t_log_list
BJ_Reserved => transaction->t_reserved_list
BJ_Locked => transaction->t_locked_list
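As a rough userspace model of the mapping above (loosely following what __journal_file_buffer() does), the list a buffer is filed on can be expressed as a simple switch over its jlist state; here the lists are represented just by their names, not the kernel structures.

/* Userspace model of the flag-to-list mapping; not the kernel definitions. */
#include <stdio.h>

enum jlist { BJ_None, BJ_SyncData, BJ_Metadata, BJ_Forget, BJ_IO,
             BJ_Shadow, BJ_LogCtl, BJ_Reserved, BJ_Locked };

/* Name of the per-transaction list a buffer is filed on for each state. */
static const char *file_buffer_list(enum jlist state)
{
    switch (state) {
    case BJ_SyncData: return "t_sync_datalist";
    case BJ_Metadata: return "t_buffers";
    case BJ_Forget:   return "t_forget";
    case BJ_IO:       return "t_iobuf_list";
    case BJ_Shadow:   return "t_shadow_list";
    case BJ_LogCtl:   return "t_log_list";
    case BJ_Reserved: return "t_reserved_list";
    case BJ_Locked:   return "t_locked_list";
    default:          return "(none)";   /* BJ_None: not on any list */
    }
}

int main(void)
{
    /* e.g. a buffer moved to BJ_Metadata ends up on the t_buffers list */
    printf("BJ_Metadata -> %s\n", file_buffer_list(BJ_Metadata));
    return 0;
}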
Committing a transaction

The journal's transaction commit is the process of recording one or more atomic filesystem operations in the journal on disk. Only after a successful commit of a transaction can the filesystem operation be considered complete; it is then guaranteed that even in crash scenarios these operations will be replayed, resulting in a consistent filesystem. A transaction commit consists of 8 phases, with the journal's state transitions in each phase described below. The main function which does the journal commit is journal_commit_transaction(). When we decide to commit the transaction, the journal is already in the running state (T_RUNNING). Since a transaction can have multiple journal handles, we need to stop accepting any more handles for the committing transaction. The transaction's state is therefore changed to T_LOCKED so that no new handles can be attached, and the transaction waits for all the existing journal handle updates to complete before starting the commit. At this point any buffers belonging to the reserved list (t_reserved_list) are discarded. It also tries to free up some memory by getting rid of buffers sitting on the checkpoint list (t_checkpoint_list).

Transaction commit phase transitions:

Phase 1
  • Switch the revoke tables. The current revoke hash will be written down to disk and the alternate one will be available for the next running transaction to log.
  • Change transaction state to T_FLUSH
  • At this point there is no running transaction; it has become the current committing transaction. Any new handle requiring a transaction will start a new one.
Phase 2
Flushing starts now.
  • Data buffers are flushed first, i.e. the buffers sitting on t_sync_datalist. These buffers are written directly to their original location on disk.
  • Write out the revoke records from the revoke hash list and flush them to descriptor blocks in the journal.
  • Change transaction state to T_COMMIT
Phase 3
  • Flush metadata buffers (present on the t_buffers list). Since metadata is logged in the journal, we also need to store the original location where this data is supposed to go in the actual filesystem. The journal stores the mapping of journal block to actual filesystem block number in the form of tags in the descriptor blocks. See journal_write_metadata_buffer(). Since journal blocks have a magic number as an identifier at the start, if the data being logged happens to coincide with JFS_MAGIC_NUMBER at that offset, the commit code needs to specially mark such blocks so that the recovery code doesn't get confused into treating them as journal-internal metadata blocks. This is done by setting the value to 0 and putting the flag JFS_FLAG_ESCAPE in the journal tag for this block. The flag ensures that during recovery the value is restored back to its original value of JFS_MAGIC_NUMBER. A small sketch of this escaping appears after the phase list.
Phase 4
  • Wait for all the buffers submitted for IO above to complete.
  • Wait for the metadata buffers present on t_iobuf_list. The dummy buffer heads created for the metadata buffers are released. The original metadata buffer, which was put on the shadow list, is released from it and moved to the t_forget list.
Phase 5
  • Wait for the submitted revoke record and descriptor buffers to be written out. This is done by waiting on the buffers on t_log_list.
Phase 6
  • Flushing of the contents is complete; we now need to synchronously write a commit record.
  • Change transaction state to T_COMMIT_RECORD
  • IO for the data is complete now. Write the commit record in the journal. Completion of this marks one atomic transaction in the filesystem; recovery is possible in case we crash after this point.
Phase 7
  • If there are a number of transactions present in the journal, walk the committing transaction's t_forget list to get rid of buffers until there are no more buffers on it. As each buffer is examined, we check whether it was on the checkpoint IO list of a previous transaction. If it was, it is removed, and if required (in case it is dirty) it is transferred to the checkpoint list of the committing transaction. See __journal_insert_checkpoint().
Phase 8
We are done committing the transaction now.
  • Change transaction state to T_FINISHED.
  • Set committing transaction = NULL.
  • Calculate average commit time for future use.
  • Setup the checkpointing transaction.
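A small userspace sketch of the block escaping described in phase 3: if a data block happens to begin with the journal magic, those bytes are zeroed before logging and a flag is recorded in the block's tag so that recovery can restore them. The magic value matches JFS_MAGIC_NUMBER from the jbd on-disk format; the tag structure and the helpers are simplified for illustration.

/* Userspace sketch of journal block "escaping"; structures are simplified. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define JFS_MAGIC_NUMBER 0xc03b3998U   /* jbd on-disk magic */
#define FLAG_ESCAPE      0x1           /* stands in for JFS_FLAG_ESCAPE */

struct tag { uint32_t flags; };        /* reduced stand-in for a journal block tag */

/* Read the first four bytes of a block as a big-endian word,
 * matching how the on-disk magic is stored. */
static uint32_t first_word(const unsigned char *block)
{
    return ((uint32_t)block[0] << 24) | ((uint32_t)block[1] << 16) |
           ((uint32_t)block[2] << 8)  |  (uint32_t)block[3];
}

/* Before logging: if a data block happens to start with the journal magic,
 * zero those bytes and record FLAG_ESCAPE in its tag. */
static void escape_for_journal(unsigned char *block, struct tag *tag)
{
    if (first_word(block) == JFS_MAGIC_NUMBER) {
        memset(block, 0, 4);
        tag->flags |= FLAG_ESCAPE;
    }
}

/* During recovery: restore the magic before writing the block home. */
static void unescape_on_replay(unsigned char *block, const struct tag *tag)
{
    if (tag->flags & FLAG_ESCAPE) {
        block[0] = 0xc0; block[1] = 0x3b; block[2] = 0x39; block[3] = 0x98;
    }
}

int main(void)
{
    unsigned char block[16] = { 0xc0, 0x3b, 0x39, 0x98, 0xaa };
    struct tag tag = { 0 };

    escape_for_journal(block, &tag);
    printf("escaped:  flags=%x first bytes=%02x%02x%02x%02x\n",
           (unsigned)tag.flags, block[0], block[1], block[2], block[3]);

    unescape_on_replay(block, &tag);
    printf("restored: first bytes=%02x%02x%02x%02x\n",
           block[0], block[1], block[2], block[3]);
    return 0;
}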
Checkpointing a transaction

Transaction checkpointing is the act of flushing the journalled blocks down to their actual locations on disk in the filesystem. Since the journal is of a fixed, small size, this is required to reclaim journal space and make it available for future handles. There can be multiple transactions which require checkpointing at any point of time in a journal. Each of these transactions has a list of buffers which need to be flushed to disk.

The main functions involved in checkpointing are :
  • log_do_checkpoint()
  • __process_buffer()
  • __flush_batch()
  • __wait_cp_io()
log_do_checkpoint() picks up the first transaction on the checkpoint list and then iterates over all the buffers present in the transaction by calling __process_buffer() on each of them. As it traverses, it keeps accumulating them in a local array for batching the disk writes. As part of processing it also moves each buffer from the checkpoint_list to the checkpoint_io_list to indicate that IO is pending on it. Once the batch array is full, or there are no more buffers to process, __flush_batch() is called to send those buffers to disk for writing. After the buffers are submitted to disk, __wait_cp_io() is called to wait on each of the buffers for the write to complete. Once the buffers get cleaned, they are removed from the checkpoint_io_list. After all the buffers are freed, the transaction itself is freed.
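The batching described above can be sketched in a few lines of userspace C: buffers are accumulated in a small local array and submitted together once the array fills up or there is nothing left to process, mirroring log_do_checkpoint()/__flush_batch(). The batch size and the buffer type below are illustrative.

/* Userspace sketch of checkpoint batching; not the kernel implementation. */
#include <stdio.h>

#define BATCH_SIZE 8

struct buffer { int blocknr; };

static void submit_batch(struct buffer **batch, int count)
{
    /* Stand-in for __flush_batch(): submit all collected buffers for write. */
    printf("submitting %d buffers\n", count);
    for (int i = 0; i < count; i++)
        printf("  write block %d\n", batch[i]->blocknr);
}

static void checkpoint(struct buffer *buffers, int nr_buffers)
{
    struct buffer *batch[BATCH_SIZE];
    int batched = 0;

    for (int i = 0; i < nr_buffers; i++) {
        batch[batched++] = &buffers[i];   /* accumulate, like __process_buffer() */
        if (batched == BATCH_SIZE) {      /* batch full: flush it */
            submit_batch(batch, batched);
            batched = 0;
        }
    }
    if (batched)                          /* flush the partial final batch */
        submit_batch(batch, batched);
    /* A real checkpoint now waits for the IO (__wait_cp_io()) before
     * freeing the transaction. */
}

int main(void)
{
    struct buffer buffers[19];

    for (int i = 0; i < 19; i++)
        buffers[i].blocknr = 100 + i;
    checkpoint(buffers, 19);
    return 0;
}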


Revoked blocks
Revoke is a method of preventing the journal from corrupting the filesystem by not replaying operations corresponding to a deleted/freed block, which would overwrite the latest contents of a newer block. For example, consider the following sequence of steps when the filesystem is mounted in metadata-only journalling mode:
  • A metadata block 'B' is journalled and contents are copied to journal.
  • Later 'B' gets freed as part of some transaction.
  • Block 'B' is now reused to write contents of user data, and this operation is not journalled.
Now if we crash and replay the journal, we need to avoid replaying the journalled contents of block 'B' over the user data.

Revoke mechanism: During block frees, if the block is journalled, the filesystem directs the journal to insert the freed block into the revoke hash. When the transaction is committed, the list of all the blocks which were revoked is flushed to disk in the journal. This record of revoked blocks is used during journal recovery: the journal is scanned for revoked blocks before any op is replayed. If there are transactions for the block after the last revoke record of that block, those ops are safe to replay because they contain the latest copy of the block. Any transactions which appear before the revoke record aren't replayed. The basic idea is that you don't want to replay ops corresponding to a block which may have been freed. Also note that if there are multiple revoke records corresponding to a block in the journal, we only need to worry about the latest record, i.e. the one with the highest transaction id.
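The lookup rule described above can be modelled with a tiny userspace hash keyed by block number that remembers the highest transaction id (tid) which revoked each block; a logged copy of a block is replayed only if it comes from a transaction newer than that tid. The table layout and sizes below are illustrative, not jbd's actual revoke hash.

/* Userspace model of the revoke lookup used during replay. */
#include <stdio.h>

#define TABLE_SIZE 64

struct revoke_record { unsigned long blocknr; unsigned int tid; int used; };

static struct revoke_record table[TABLE_SIZE];

/* Insert/refresh a revoke record, keeping only the highest tid per block. */
static void insert_revoke(unsigned long blocknr, unsigned int tid)
{
    unsigned int i = blocknr % TABLE_SIZE;

    while (table[i].used && table[i].blocknr != blocknr)
        i = (i + 1) % TABLE_SIZE;               /* linear probing */
    if (!table[i].used || table[i].tid < tid) {
        table[i].blocknr = blocknr;
        table[i].tid = tid;
        table[i].used = 1;
    }
}

/* Should a copy of blocknr logged by transaction tid be skipped? */
static int block_revoked(unsigned long blocknr, unsigned int tid)
{
    unsigned int i = blocknr % TABLE_SIZE;

    while (table[i].used) {
        if (table[i].blocknr == blocknr)
            return table[i].tid >= tid;         /* revoked at or after this tid */
        i = (i + 1) % TABLE_SIZE;
    }
    return 0;                                   /* no revoke record: replay it */
}

int main(void)
{
    insert_revoke(1234, 7);                     /* block 1234 revoked in tid 7 */
    printf("tid 5 copy of 1234: %s\n", block_revoked(1234, 5) ? "skip" : "replay");
    printf("tid 9 copy of 1234: %s\n", block_revoked(1234, 9) ? "skip" : "replay");
    return 0;
}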

From file fs/jbd/revoke.c
* We can get interactions between revokes and new log data within a
* single transaction:
*
* Block is revoked and then journaled:
* The desired end result is the journaling of the new block, so we
* cancel the revoke before the transaction commits.
*
* Block is journaled and then revoked:
* The revoke must take precedence over the write of the block, so we
* need either to cancel the journal entry or to write the revoke
* later in the log than the log block. In this case, we choose the
* latter: journaling a block cancels any revoke record for that block
* in the current transaction, so any revoke for that block in the
* transaction must have happened after the block was journaled and so
* the revoke must take precedence.
*
* Block is revoked and then written as data:
* The data write is allowed to succeed, but the revoke is _not_
* cancelled. We still need to prevent old log records from
* overwriting the new data. We don't even need to clear the revoke
* bit here.

There are two hash tables to store the revoked entries. Two tables are required: one for the running transaction and one for the committing transaction (if any). As you can guess, new entries are always logged into the revoke table pointed to by the journal->j_revoke pointer, which points to the one corresponding to the running transaction. You can think of it as a double-buffering mechanism: the tables are switched alternately during commit from kjournald. Access to the hash table entries is protected by j_revoke_lock.

The buffer head maintains two sets of flags to indicate the revoke status of a buffer:
    • RevokeValid: The revoke status of this buffer is known and can be trusted. If this is not set we can't say much about the buffer and need to search for it in the hash.
    • Revoke{set/clear}: These flags make sense only when RevokeValid is set. They tell whether the block is revoked or not.
    Important functions:

    Initialize revoke hash : journal_init_revoke()
    Inserts in hash : insert_revoke_hash()
    Find in hash : find_revoke_record()
    Transfer the in-memory revoke table to ondisk journal : journal_write_revoke_records()

    NOTE: You need to revoke a block before freeing it in the bitmap, and not vice versa, to prevent races.
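A kernel-context sketch of the ordering this note requires: the block is revoked through the active handle before its bit is cleared in the allocation bitmap. journal_revoke() is jbd's real entry point; example_free_metadata_block() and example_clear_bitmap_bit() are hypothetical placeholders for the filesystem's own code.

/* Kernel-context sketch: revoke first, then clear the bitmap bit. */
#include <linux/jbd.h>

/* Hypothetical stand-in for the filesystem's own (journalled) bitmap update. */
static void example_clear_bitmap_bit(handle_t *handle, unsigned long blocknr)
{
    (void)handle; (void)blocknr; /* ... */
}

static int example_free_metadata_block(handle_t *handle, struct buffer_head *bh,
                                       unsigned long blocknr)
{
    int err;

    /* 1. Record the revoke in the running transaction's revoke table. */
    err = journal_revoke(handle, blocknr, bh);
    if (err)
        return err;

    /* 2. Only now free the block in the allocation bitmap. */
    example_clear_bitmap_bit(handle, blocknr);
    return 0;
}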

What if we crash?

In case the filesystem crashes or is not unmounted cleanly, the next mount of the filesystem will notice that there are entries in the journal which haven't been checkpointed to their appropriate blocks on disk, and that the journal needs to be replayed to ensure filesystem consistency before the filesystem can be marked as mounted and opened for operations to the external world. The journal recovery code resides in fs/jbd/recovery.c. It basically consists of the steps below.
    1. Readahead journal blocks in memory.
    2. Do a first pass of the journal (PASS_SCAN) to see if we need recovery: whether the journal is valid, which transactions need to be replayed, and other sanity checks. After this first pass an in-core data structure describing the journal (struct recovery_info) is populated, containing the information required for recovery. The main items of interest are the block number at which to start looking (identified by s_start) and the sequence number (s_sequence). For a journal which doesn't need recovery, s_start is always 0.
    3. Do a second pass (PASS_REVOKE). This traverses all the revoke blocks and builds the in-core hash of block numbers which have been revoked. This in-core hash is consulted before replaying any operation for a particular block, which ensures that we don't overwrite the newer contents of a block with stale journalled contents and thereby corrupt the filesystem.
    4. Do the third and final pass (PASS_REPLAY), which actually does the job of replaying the journal and copies the data from the journal to the real filesystem. Replaying an op simply consists of reading the corresponding block from the filesystem, copying the contents from the journal into the buffer, and then marking the buffer dirty so that it gets written back to its actual location in the filesystem.
    5. Once the replay is complete, the in-memory revoke hash is destroyed so that the revoke tables can be used during normal filesystem operations.
    6. Sync the blockdevice.
    7. Once the recovery is done, journal_reset() is called to setup the in-memory fields of journal, and journal is ready for business again.
    Steps (2), (3) and (4) are done through a common function do_one_pass()
Code flow

journal_recover()
|
--> do_one_pass(PASS_SCAN)
|
--> do_one_pass(PASS_REVOKE)
|
--> do_one_pass(PASS_REPLAY)
|
--> journal_clear_revoke() (destroy the incore revoke hash)
|
--> sync_blockdev() and journal_reset()
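To make the three-pass structure concrete, here is a userspace sketch in which the journal is modelled as a flat array of records and the same do_one_pass()-style routine is run once per pass: SCAN finds how far replay may go, REVOKE remembers revoked blocks, and REPLAY copies only non-revoked blocks from committed transactions. The record layout and all names are simplified stand-ins, not the on-disk jbd format.

/* Userspace sketch of the three recovery passes over a toy journal. */
#include <stdio.h>

enum pass  { PASS_SCAN, PASS_REVOKE, PASS_REPLAY };
enum rtype { REC_DATA, REC_REVOKE, REC_COMMIT };

struct record { enum rtype type; unsigned int tid; unsigned long blocknr; };

static struct record journal[] = {
    { REC_DATA,   11, 200 }, { REC_COMMIT, 11, 0 },   /* old copy of block 200   */
    { REC_DATA,   12, 100 }, { REC_REVOKE, 12, 200 }, { REC_COMMIT, 12, 0 },
    { REC_DATA,   13, 200 }, { REC_COMMIT, 13, 0 },   /* newer copy of block 200 */
    { REC_DATA,   14, 300 },                          /* no commit: not replayed */
};
#define NR_RECORDS (sizeof(journal) / sizeof(journal[0]))

static unsigned int  last_committed_tid;
static unsigned long revoked_block;               /* toy one-entry revoke "hash" */
static unsigned int  revoked_tid;

static void do_one_pass(enum pass pass)
{
    for (unsigned int i = 0; i < NR_RECORDS; i++) {
        struct record *r = &journal[i];

        if (pass == PASS_SCAN && r->type == REC_COMMIT) {
            last_committed_tid = r->tid;          /* how far replay may go */
        } else if (pass == PASS_REVOKE && r->type == REC_REVOKE
                   && r->tid <= last_committed_tid) {
            revoked_block = r->blocknr;           /* remember the revoked block */
            revoked_tid   = r->tid;
        } else if (pass == PASS_REPLAY && r->type == REC_DATA
                   && r->tid <= last_committed_tid
                   && !(r->blocknr == revoked_block && r->tid <= revoked_tid)) {
            printf("replay block %lu from tid %u\n", r->blocknr, r->tid);
        }
    }
}

int main(void)
{
    do_one_pass(PASS_SCAN);
    do_one_pass(PASS_REVOKE);
    do_one_pass(PASS_REPLAY);
    /* A real recovery would now clear the revoke hash and reset the journal. */
    return 0;
}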

References