Modifying Ring Partition Power

The ring partition power determines the on-disk location of data files and is selected when creating a new ring. In normal operation, it is a fixed value. This is because a different partition power results in a different on-disk location for all data files.

However, when the partition power is increased by 1, each object's new location is on the same disk as its old one. As a result, we can create hard links for both the new and old locations, avoiding data movement without impacting availability.
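
For illustration, here is a minimal sketch of what that hard-linking amounts to at the filesystem level, reusing the example paths from the Background section below (this is not the relinker's actual code):

import os

old = 'objects/16003/a38/fa0fcec07328d068e24ccbf2a62f2a38/1467658208.57179.data'
new = 'objects/32007/a38/fa0fcec07328d068e24ccbf2a62f2a38/1467658208.57179.data'

os.makedirs(os.path.dirname(new), exist_ok=True)
os.link(old, new)  # a second name for the same inode; no data is copied
assert os.stat(old).st_ino == os.stat(new).st_ino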

To enable a partition power change without interrupting user access, object servers need to be aware of it in advance. Therefore a partition power change needs to be done in multiple steps.

Note

Do not increase the partition power on account and container rings. Increasing the partition power is only supported for object rings. Trying to increase the part_power for account and container rings will result in unavailability, and possibly even data loss.

Caveats

Before increasing the partition power, consider the possible drawbacks. There are a few caveats when increasing the partition power:

  • Almost all diskfiles in the cluster need to be relinked then cleaned up, and all partition directories need to be rehashed. This imposes significant I/O load on object servers, which may impact client requests. Consider using cgroups, ionice, or even just the built-in --files-per-second rate-limiting to reduce client impact.
  • Object replicators and reconstructors will skip affected policies during the partition power increase. Replicators are not aware of hard links and would simply copy the content; this would result in heavy data movement and, in the worst case, all data being stored twice.
  • Because each object will temporarily be hard-linked from two locations, many more inodes will be used; expect around twice the usual amount. Check the free inode count before increasing the partition power. Even after the increase is complete and the extra hard links are cleaned up, expect increased inode usage, since there will be twice as many partition and suffix directories.
  • Also, object auditors might read each object twice before cleanup removes the second hard link.
  • The additional inodes require more memory to cache, so your object servers should have plenty of available memory to avoid running out of inode cache. Setting vfs_cache_pressure to 1 might help with that.
  • All nodes in the cluster must run Swift version 2.13.0 or later.

Due to these caveats you should only increase the partition power if really needed, i.e. if the number of partitions per disk is extremely low and the data is distributed unevenly across disks.
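
To gauge whether that applies to your cluster, the partitions-per-disk figure can be estimated with a quick calculation; all numbers below are illustrative assumptions, not values from this document:

part_power = 14
replicas = 3
num_disks = 1200

# Each replica of each of the 2**part_power partitions lands on some disk.
parts_per_disk = (2 ** part_power) * replicas / num_disks
print(parts_per_disk)  # ~41, well below the ~100 per disk commonly targeted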

1. Prepare partition power increase

The swift-ring-builder is used to prepare the ring for an upcoming partition power increase. It will store a new variable next_part_power with the current partition power + 1. Object servers recognize this, and hard links to the new location will be created (or deleted) on every PUT or DELETE. This will make it possible to access newly written objects using the future partition power:

swift-ring-builder <builder-file> prepare_increase_partition_power
swift-ring-builder <builder-file> write_ring

Now you need to copy the updated .ring.gz to all nodes. Already existing data needs to be relinked too; therefore an operator has to run a relinker command on all object servers in this phase:

swift-object-relinker relink

Note

Start relinking after all the servers have re-read the modified ring files, which normally happens within 15 seconds after writing a modified ring. Also, make sure the modified rings are pushed to all nodes running object services (replicators, reconstructors and reconcilers); they have to skip the policy during relinking.

Note

The relinking command must run as the same user as the daemon processes (usually swift). It will create files and directories that must be manipulable by the daemon processes (server, auditor, replicator, ...). If necessary, the --user option may be used to drop privileges.

Relinking might take some time; although no data is copied or actually moved, the tool still needs to walk the whole file system and create new hard links as required.
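
For example, to relink while dropping privileges and rate-limiting as discussed above (the rate value here is illustrative):

swift-object-relinker relink --user swift --files-per-second 50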

2. Increase partition power

Now that all existing data can be found using the new location, it's time to actually increase the partition power itself:

swift-ring-builder <builder-file> increase_partition_power
swift-ring-builder <builder-file> write_ring

Now you need to copy the updated .ring.gz again to all nodes. Object servers are now using the new, increased partition power and no longer create additional hard links.

Note

While the partition power increase is in progress, that is, between the prepare step and this one, the object servers create an additional hard link for each modified or new object, and this requires more inodes.

Note

If you decide you don't want to increase the partition power, you should cancel the increase before performing step 2; once the partition power has actually been increased, the operation cannot be reverted. To abort, execute the following commands, copy the updated .ring.gz files to all nodes and continue with 3. Cleanup afterwards:

swift-ring-builder <builder-file> cancel_increase_partition_power
swift-ring-builder <builder-file> write_ring

3. Cleanup

Existing hard links in the old locations need to be removed, and a cleanup tool is provided to do this. Run the following command on each storage node:

swift-object-relinker cleanup

Note

The cleanup must be finished within your object servers' reclaim_age period (one week by default). Otherwise, objects that were overwritten between step #1 and step #2 and deleted afterwards can't be cleaned up anymore. You may want to increase your reclaim_age before or during relinking.
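
For example, reclaim_age could be raised to two weeks in the object server configuration; the section shown below is illustrative, and since reclaim_age is read by several object services, it should be set consistently wherever it appears:

[DEFAULT]
# seconds; the default of 604800 corresponds to 1 week
reclaim_age = 1209600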

Afterwards, the rings need to be updated one last time to inform servers that all steps of the partition power increase are done and that replicators should resume their job:

swift-ring-builder <builder-file> finish_increase_partition_power
swift-ring-builder <builder-file> write_ring

Now you need to copy the updated .ring.gz again to all nodes.
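
Most deployments distribute rings with their configuration management tooling; a minimal manual equivalent, assuming SSH access and a hypothetical host list file, might look like:

for node in $(cat /etc/swift/all-nodes.txt); do
    scp /etc/swift/object.ring.gz "$node:/etc/swift/"
done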

Background

An existing object that is currently located on partition X will be placed either on partition 2*X or 2*X+1 after the partition power is increased. The reason for this is the Ring.get_part() method, which does a bitwise shift to the right.
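
Here is a minimal sketch of that calculation, mirroring Ring.get_part(); the helper below is illustrative and omits the per-cluster hash path prefix and suffix that Swift mixes into the hash:

import struct
from hashlib import md5

def get_part(part_power, account, container, obj):
    # Hash the object path and keep the top part_power bits of the
    # first four bytes.
    digest = md5(('/%s/%s/%s' % (account, container, obj)).encode()).digest()
    return struct.unpack_from('>I', digest)[0] >> (32 - part_power)

old_part = get_part(14, 'AUTH_test', 'c', 'o')
new_part = get_part(15, 'AUTH_test', 'c', 'o')
# One fewer bit is shifted away, so the partition either doubles or
# doubles plus one, depending on the next bit of the hash.
assert new_part in (2 * old_part, 2 * old_part + 1)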

To avoid actual data movement to different disks or even nodes, the allocation of partitions to devices needs to be changed. Due to the new partition scheme described above, the allocation is pairwise: partitions 2*X and 2*X+1 are assigned to the device that previously held partition X. Devices are therefore allocated like this, with the partition being the index and the value being the device id:

old         new
part  dev   part  dev
----  ---   ----  ---
0     0     0     0
            1     0
1     3     2     3
            3     3
2     7     4     7
            5     7
3     5     6     5
            7     5
4     2     8     2
            9     2
5     1     10    1
            11    1

There is a helper method to compute the new path, and the following example shows the mapping between old and new location:

>>> from swift.common.utils import replace_partition_in_path
>>> old='objects/16003/a38/fa0fcec07328d068e24ccbf2a62f2a38/1467658208.57179.data'
>>> replace_partition_in_path('', '/sda/' + old, 14)
'/sda/objects/16003/a38/fa0fcec07328d068e24ccbf2a62f2a38/1467658208.57179.data'
>>> replace_partition_in_path('', '/sda/' + old, 15)
'/sda/objects/32007/a38/fa0fcec07328d068e24ccbf2a62f2a38/1467658208.57179.data'

Using the original partition power (14), it returned the same path; after an increase to 15, it returns the new path. The new partition is 2*X+1 in this case: 2 * 16003 + 1 = 32007.