Operational procedures guide
This is the operational procedures guide that HPE used to operate and monitor their public Swift systems. It has been made publicly available. Change-Id: Iefb484893056d28beb69265d99ba30c3c84add2b
@@ -86,6 +86,7 @@ Administrator Documentation

   admin_guide
   replication_network
   logs
   ops_runbook/index

Object Storage v1 REST API Documentation
========================================
doc/source/ops_runbook/diagnose.rst (new file, 1031 lines)
File diff suppressed because it is too large.
doc/source/ops_runbook/general.rst (new file, 36 lines)
@@ -0,0 +1,36 @@

==================
General Procedures
==================

Getting swift account stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is specific to the HPE Helion Public Cloud. Go look at
   ``swiftly`` for an alternative; this is an example.

This procedure describes how you determine the swift usage for a given
swift account, that is, the number of containers, number of objects and
total bytes used. To do this you will need the project ID.

Log onto one of the swift proxy servers.

Use swift-direct to show this account's usage:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_redacted-9a11-45f8-aa1c-9e7b1c7904c8
   Status: 200
   Content-Length: 0
   Accept-Ranges: bytes
   X-Timestamp: 1379698586.88364
   X-Account-Bytes-Used: 67440225625994
   X-Account-Container-Count: 1
   Content-Type: text/plain; charset=utf-8
   X-Account-Object-Count: 8436776
   Status: 200
   name: my_container count: 8436776 bytes: 67440225625994

This account has 1 container. That container has 8436776 objects. The
total bytes used is 67440225625994.
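
``swift-direct`` is not available outside the HPE environment. If you only
have standard tooling, the same account totals can usually be read with the
``swift`` CLI from python-swiftclient. A minimal sketch; the auth URL and
credentials are placeholders:

.. code::

   $ swift --os-auth-url https://<keystone-endpoint>:5000/v3 \
           --os-project-name <project> --os-username <user> --os-password <password> \
           stat

``swift stat`` prints the Containers, Objects and Bytes totals for the
account, which correspond to the ``X-Account-Container-Count``,
``X-Account-Object-Count`` and ``X-Account-Bytes-Used`` headers shown above.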
doc/source/ops_runbook/index.rst (new file, 79 lines)
@@ -0,0 +1,79 @@

=================
Swift Ops Runbook
=================

This document contains operational procedures that Hewlett Packard
Enterprise (HPE) uses to operate and monitor the Swift system within the
HPE Helion Public Cloud. This document is an excerpt of a larger
product-specific handbook. As such, the material may appear incomplete.
The suggestions and recommendations made in this document are for our
particular environment, and may not be suitable for your environment or
situation. We make no representations concerning the accuracy, adequacy,
completeness or suitability of the information, suggestions or
recommendations. This document is provided for reference only. We are
not responsible for your use of any information, suggestions or
recommendations contained herein.

This document also contains references to certain tools that we use to
operate the Swift system within the HPE Helion Public Cloud.
Descriptions of these tools are provided for reference only, as the
tools themselves are not publicly available at this time.

- ``swift-direct``: This is similar to the ``swiftly`` tool.


.. toctree::
   :maxdepth: 2

   general.rst
   diagnose.rst
   procedures.rst
   maintenance.rst
   troubleshooting.rst

Is the system up?
~~~~~~~~~~~~~~~~~

If you have a report that Swift is down, perform the following basic
checks (example commands for the ``curl`` and ``swift-recon`` checks are
sketched below):

#. Run swift functional tests.

#. From a server in your data center, use ``curl`` to check
   ``/healthcheck``.

#. If you have a monitoring system, check your monitoring system.

#. Check your hardware load balancer infrastructure.

#. Run swift-recon on a proxy node.
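
A minimal sketch of the ``curl`` and ``swift-recon`` checks listed above;
the proxy address is a placeholder and the healthcheck middleware is
assumed to be enabled:

.. code::

   # A healthy proxy answers /healthcheck with 200 and the body "OK".
   $ curl -i http://<proxy-ip>/healthcheck

   # Quick cluster overview from a proxy node: replication, load and async pendings.
   $ sudo swift-recon -rla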

Run swift functional tests
--------------------------

We recommend that you set up your functional tests against your
production system.

A script for running the functional tests is located in
``swift/.functests``.
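
For example, assuming you have a checkout of the Swift source tree and a
``test.conf`` pointing at the cluster you want to exercise (both are
assumptions about your setup), the script is run from the top of the tree:

.. code::

   $ cd ~/swift        # path to your Swift source checkout (placeholder)
   $ ./.functests      # runs the functional test suite against the configured cluster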

External monitoring
-------------------

- We use pingdom.com to monitor the external Swift API. We suggest the
  following:

  - Do a GET on ``/healthcheck``

  - Create a container, make it public (``X-Container-Read:
    .r*,.rlistings``), create a small file in the container; do a GET
    on the object (see the sketch after this list)
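
A minimal sketch of that public-container check using the standard
``swift`` CLI; the container name, object name and endpoint are
placeholders, and authentication options are omitted:

.. code::

   # Make a world-readable container and upload a small test object.
   $ swift post monitor_container -r '.r:*,.rlistings'
   $ echo "ping" > ping.txt
   $ swift upload monitor_container ping.txt

   # The object should then be fetchable without a token.
   $ curl -i http://<proxy-endpoint>/v1/AUTH_<project>/monitor_container/ping.txt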

Reference information
~~~~~~~~~~~~~~~~~~~~~

Reference: Swift startup/shutdown
---------------------------------

- Use reload - not stop/start/restart.

- Try to roll sets of servers (especially proxy) in groups of less
  than 20% of your servers.
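
For example, a rolling reload of the proxy tier can be done one small
group at a time; ``swift-init`` is the standard Swift service manager and
the grouping is up to you:

.. code::

   # On each proxy node in turn (no more than ~20% of the fleet at once):
   $ sudo swift-init proxy reload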
doc/source/ops_runbook/maintenance.rst (new file, 322 lines)
@@ -0,0 +1,322 @@

==================
Server maintenance
==================

General assumptions
~~~~~~~~~~~~~~~~~~~

- It is assumed that anyone attempting to replace hardware components
  will have already read and understood the appropriate maintenance and
  service guides.

- It is assumed that where servers need to be taken off-line for
  hardware replacement, this will be done in series, bringing each
  server back on-line before taking the next off-line.

- It is assumed that the operations directed procedure will be used for
  identifying hardware for replacement.

Assessing the health of swift
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can run the swift-recon tool on a Swift proxy node to get a quick
check of how Swift is doing. Please note that the numbers below are
necessarily somewhat subjective. Sometimes parameters for which we
say 'low values are good' will have pretty high values for a time. Often
if you wait a while things get better.

For example:

.. code::

   sudo swift-recon -rla
   ===============================================================================
   [2012-03-10 12:57:21] Checking async pendings on 384 hosts...
   Async stats: low: 0, high: 1, avg: 0, total: 1
   ===============================================================================

   [2012-03-10 12:57:22] Checking replication times on 384 hosts...
   [Replication Times] shortest: 1.4113877813, longest: 36.8293570836, avg: 4.86278064749
   ===============================================================================

   [2012-03-10 12:57:22] Checking load avg's on 384 hosts...
   [5m load average] lowest: 2.22, highest: 9.5, avg: 4.59578125
   [15m load average] lowest: 2.36, highest: 9.45, avg: 4.62622395833
   [1m load average] lowest: 1.84, highest: 9.57, avg: 4.5696875
   ===============================================================================

In the example above we ask for information on replication times (-r),
load averages (-l) and async pendings (-a). This is a healthy Swift
system. Rules-of-thumb for 'good' recon output are:
- Nodes that respond are up and running Swift. If all nodes respond,
  that is a good sign. But some nodes may time out. For example:

  .. code::

     -> [http://<redacted>.29:6000/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
     -> [http://<redacted>.31:6000/recon/load:] <urlopen error timed out>

  That could be okay or could require investigation.

- Low values (say < 10 for high and average) for async pendings are
  good. Higher values occur when disks are down and/or when the system
  is heavily loaded. Many simultaneous PUTs to the same container can
  drive async pendings up. This may be normal, and may resolve itself
  after a while. If it persists, one way to track down the problem is
  to find a node with high async pendings (with ``swift-recon -av | sort
  -n -k4``), then check its Swift logs. Often async pendings are high
  because a node cannot write to a container on another node. Often
  this is because the node or disk is offline or bad. This may be okay
  if we know about it.

- Low values for replication times are good. These values rise when new
  rings are pushed, and when nodes and devices are brought back on
  line.

- Our 'high' load average values are typically in the 9-15 range. If
  they are a lot bigger it is worth having a look at the systems
  pushing the average up. Run ``swift-recon -av`` to get the individual
  averages. To sort the entries with the highest at the end,
  run ``swift-recon -av | sort -n -k4``.

For comparison here is the recon output for the same system above when
two entire racks of Swift are down:

.. code::

   [2012-03-10 16:56:33] Checking async pendings on 384 hosts...
   -> http://<redacted>.22:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/async: <urlopen error timed out>
   .........
   -> http://<redacted>.5:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.15:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/async: <urlopen error timed out>
   Async stats: low: 243, high: 659, avg: 413, total: 132275
   ===============================================================================
   [2012-03-10 16:57:48] Checking replication times on 384 hosts...
   -> http://<redacted>.22:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/replication: <urlopen error timed out>
   ............
   -> http://<redacted>.5:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.15:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/replication: <urlopen error timed out>
   [Replication Times] shortest: 1.38144306739, longest: 112.620954418, avg: 10.2859475361
   ===============================================================================
   [2012-03-10 16:59:03] Checking load avg's on 384 hosts...
   -> http://<redacted>.22:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/load: <urlopen error timed out>
   ............
   -> http://<redacted>.15:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/load: <urlopen error timed out>
   [5m load average] lowest: 1.71, highest: 4.91, avg: 2.486375
   [15m load average] lowest: 1.79, highest: 5.04, avg: 2.506125
   [1m load average] lowest: 1.46, highest: 4.55, avg: 2.4929375
   ===============================================================================

.. note::

   The replication times and load averages are within reasonable
   parameters, even with 80 object stores down. Async pendings, however,
   are quite high. This is because the containers on the servers which
   are down cannot be updated. When those servers come back up, async
   pendings should drop. If async pendings were at this level without an
   explanation, we would have a problem.

Recon examples
~~~~~~~~~~~~~~

Here is an example of noting and tracking down a problem with recon.

Running recon shows some async pendings:

.. code::

   bob@notso:~/swift-1.4.4/swift$ ssh -q <redacted>.132.7 sudo swift-recon -alr
   ===============================================================================
   [2012-03-14 17:25:55] Checking async pendings on 384 hosts...
   Async stats: low: 0, high: 23, avg: 8, total: 3356
   ===============================================================================
   [2012-03-14 17:25:55] Checking replication times on 384 hosts...
   [Replication Times] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066
   ===============================================================================
   [2012-03-14 17:25:56] Checking load avg's on 384 hosts...
   [5m load average] lowest: 2.35, highest: 8.88, avg: 4.45911458333
   [15m load average] lowest: 2.41, highest: 9.11, avg: 4.504765625
   [1m load average] lowest: 1.95, highest: 8.56, avg: 4.40588541667
   ===============================================================================

Why? Running recon again with ``-av`` (not shown here) tells us that
the node with the highest async pending count (23) is <redacted>.72.61.
Looking at the log files on <redacted>.72.61 we see:

.. code::

   souzab@<redacted>:~$ sudo tail -f /var/log/swift/background.log | grep -i ERROR
   Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:09 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:11 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:20 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:22 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}

That is why this node has a lot of async pendings: a bunch of disks that
are not mounted on <redacted> and <redacted>. There may be other issues,
but clearing this up will likely drop the async pendings a fair bit, as
other nodes will be having the same problem.
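
Unmounted drives such as these can also be spotted directly with recon's
unmounted-drive check; a minimal sketch, run from a proxy node:

.. code::

   # Ask every object server whether any of its configured devices are unmounted.
   $ sudo swift-recon -u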

Assessing the availability risk when multiple storage servers are down
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   This procedure will tell you if you have a problem. However, in
   practice you will find that you will not use this procedure
   frequently.

If three storage nodes (or, more precisely, three disks on three
different storage nodes) are down, there is a small but nonzero
probability that user objects, containers, or accounts will not be
available.

Procedure
---------

.. note::

   Swift has three rings: one each for objects, containers and accounts.
   This procedure should be run three times, each time specifying the
   appropriate ``*.builder`` file.

#. Determine whether all three nodes are in different Swift zones by
   running the ring builder on a proxy node to determine which zones
   the storage nodes are in. For example:

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder
      /etc/swift/object.builder, build version 1467
      2097152 partitions, 3 replicas, 5 zones, 1320 devices, 0.02 balance
      The minimum number of hours before a partition can be reassigned is 24
      Devices:  id  zone  ip address    port  name   weight   partitions  balance  meta
                 0     1  <redacted>.4  6000  disk0  1708.00        4259    -0.00
                 1     1  <redacted>.4  6000  disk1  1708.00        4260     0.02
                 2     1  <redacted>.4  6000  disk2  1952.00        4868     0.01
                 3     1  <redacted>.4  6000  disk3  1952.00        4868     0.01
                 4     1  <redacted>.4  6000  disk4  1952.00        4867    -0.01

#. Here, node <redacted>.4 is in zone 1. If two or more of the three
   nodes under consideration are in the same Swift zone, they do not
   have any ring partitions in common; there is little/no data
   availability risk if all three nodes are down.

#. If the nodes are in three distinct Swift zones, it is necessary to
   check whether the nodes have ring partitions in common. Run
   ``swift-ring-builder`` again, this time with the ``list_parts`` option
   and specify the nodes under consideration. For example (all on one
   line):

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2
      Partition   Matches
      91          2
      729         2
      3754        2
      3769        2
      3947        2
      5818        2
      7918        2
      8733        2
      9509        2
      10233       2

#. The ``list_parts`` option to the ring builder indicates how many ring
   partitions the nodes have in common. If, as in this case, the
   first entry in the list has a 'Matches' column of 2 or less, there
   is no data availability risk if all three nodes are down.

#. If the 'Matches' column has entries equal to 3, there is some data
   availability risk if all three nodes are down. The risk is generally
   small, and is proportional to the number of entries that have a 3 in
   the Matches column. For example:

   .. code::

      Partition   Matches
      26865       3
      362367      3
      745940      3
      778715      3
      797559      3
      820295      3
      822118      3
      839603      3
      852332      3
      855965      3
      858016      3

#. A quick way to count the number of rows with 3 matches is:

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l

      30

#. In this case the nodes have 30 out of a total of 2097152 partitions
   in common; about 0.001%. In this case the risk is small but nonzero.
   Recall that a partition is simply a portion of the ring mapping
   space, not actual data. So having partitions in common is a necessary
   but not sufficient condition for data unavailability.

   .. note::

      We should not bring down a node for repair if it shows
      Matches entries of 3 with other nodes that are also down.

      If three nodes that have 3 partitions in common are all down, there
      is a nonzero probability that data are unavailable and we should
      work to bring some or all of the nodes up ASAP.
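
Because the note above says to repeat this procedure for each of the
three rings, a small wrapper loop can save typing. This is only a sketch;
the node addresses are placeholders for the nodes under consideration:

.. code::

   # Count partitions with all three replicas on the candidate nodes, per ring.
   $ for ring in object container account; do
         echo "=== ${ring} ring ==="
         sudo swift-ring-builder /etc/swift/${ring}.builder \
             list_parts <node1> <node2> <node3> | grep "3$" | wc -l
     done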
doc/source/ops_runbook/procedures.rst (new file, 367 lines)
@@ -0,0 +1,367 @@

=================================
Software configuration procedures
=================================

Fix broken GPT table (broken disk partition)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- If a GPT table is broken, a message like the following is observed
  when the following command is run:

  .. code::

     $ sudo parted -l

  .. code::

     ...
     Error: The backup GPT table is corrupt, but the primary appears OK, so that will
     be used.
     OK/Cancel?

#. To fix this, first install the ``gdisk`` program:

   .. code::

      $ sudo aptitude install gdisk

#. Run ``gdisk`` for the particular drive with the damaged partition:

   .. code::

      $ sudo gdisk /dev/sd*a-l*
      GPT fdisk (gdisk) version 0.6.14

      Caution: invalid backup GPT header, but valid main header; regenerating
      backup header from main header.

      Warning! One or more CRCs don't match. You should repair the disk!

      Partition table scan:
        MBR: protective
        BSD: not present
        APM: not present
        GPT: damaged
      /dev/sd
      *****************************************************************************
      Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
      verification and recovery are STRONGLY recommended.
      *****************************************************************************

#. At the command prompt, type ``r`` (recovery and transformation
   options), followed by ``d`` (use main GPT header), ``v`` (verify disk)
   and finally ``w`` (write table to disk and exit). You will also need
   to enter ``Y`` when prompted in order to confirm actions.

   .. code::

      Command (? for help): r

      Recovery/transformation command (? for help): d

      Recovery/transformation command (? for help): v

      Caution: The CRC for the backup partition table is invalid. This table may
      be corrupt. This program will automatically create a new backup partition
      table when you save your partitions.

      Caution: Partition 1 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Caution: Partition 2 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Caution: Partition 3 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Identified 1 problems!

      Recovery/transformation command (? for help): w

      Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
      PARTITIONS!!

      Do you want to proceed, possibly destroying your data? (Y/N): Y

      OK; writing new GUID partition table (GPT).
      The operation has completed successfully.

#. Run the following command; it should now show that the partition is
   recovered and healthy again:

   .. code::

      $ sudo parted /dev/sd#

#. Finally, uninstall ``gdisk`` from the node:

   .. code::

      $ sudo aptitude remove gdisk

Procedure: Fix broken XFS filesystem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. A filesystem may be corrupt or broken if the following output is
   observed when checking its label:

   .. code::

      $ sudo xfs_admin -l /dev/sd#
        cache_node_purge: refcount was 1, not zero (node=0x25d5ee0)
        xfs_admin: cannot read root inode (117)
        cache_node_purge: refcount was 1, not zero (node=0x25d92b0)
        xfs_admin: cannot read realtime bitmap inode (117)
        bad sb magic # 0 in AG 1
        failed to read label in AG 1

#. Run the following commands to remove the broken/corrupt filesystem
   and replace it. (This example uses the filesystem ``/dev/sdb2``.)
   First, replace the partition:

   .. code::

      $ sudo parted
      GNU Parted 2.3
      Using /dev/sda
      Welcome to GNU Parted! Type 'help' to view a list of commands.
      (parted) select /dev/sdb
      Using /dev/sdb
      (parted) p
      Model: HP LOGICAL VOLUME (scsi)
      Disk /dev/sdb: 2000GB
      Sector size (logical/physical): 512B/512B
      Partition Table: gpt

      Number  Start   End     Size    File system  Name                       Flags
       1      17.4kB  1024MB  1024MB  ext3                                    boot
       2      1024MB  1751GB  1750GB  xfs          sw-aw2az1-object045-disk1
       3      1751GB  2000GB  249GB                lvm

      (parted) rm 2
      (parted) mkpart primary 2 -1
      Warning: You requested a partition from 2000kB to 2000GB.
      The closest location we can manage is 1024MB to 1751GB.
      Is this still acceptable to you?
      Yes/No? Yes
      Warning: The resulting partition is not properly aligned for best performance.
      Ignore/Cancel? Ignore
      (parted) p
      Model: HP LOGICAL VOLUME (scsi)
      Disk /dev/sdb: 2000GB
      Sector size (logical/physical): 512B/512B
      Partition Table: gpt

      Number  Start   End     Size    File system  Name     Flags
       1      17.4kB  1024MB  1024MB  ext3                  boot
       2      1024MB  1751GB  1750GB  xfs          primary
       3      1751GB  2000GB  249GB                lvm

      (parted) quit

#. Next, scrub the filesystem and format:

   .. code::

      $ sudo dd if=/dev/zero of=/dev/sdb2 bs=$((1024*1024)) count=1
      1+0 records in
      1+0 records out
      1048576 bytes (1.0 MB) copied, 0.00480617 s, 218 MB/s
      $ sudo /sbin/mkfs.xfs -f -i size=1024 /dev/sdb2
      meta-data=/dev/sdb2          isize=1024   agcount=4, agsize=106811524 blks
               =                   sectsz=512   attr=2, projid32bit=0
      data     =                   bsize=4096   blocks=427246093, imaxpct=5
               =                   sunit=0      swidth=0 blks
      naming   =version 2          bsize=4096   ascii-ci=0
      log      =internal log       bsize=4096   blocks=208616, version=2
               =                   sectsz=512   sunit=0 blks, lazy-count=1
      realtime =none               extsz=4096   blocks=0, rtextents=0

#. You should now label and mount your filesystem (a sketch is shown
   after this procedure).

#. You can now check that the filesystem is mounted using the command:

   .. code::

      $ mount
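
A minimal sketch of the label-and-mount step referenced above, assuming
the device is ``/dev/sdb2`` and that this node mounts it as ``disk2``
under ``/srv/node`` (the label, mount point and mount options should
match your own ring and fstab conventions):

.. code::

   # Label the new filesystem and mount it where the ring expects it.
   $ sudo xfs_admin -L disk2 /dev/sdb2
   $ sudo mkdir -p /srv/node/disk2
   $ sudo mount -t xfs -o noatime,nodiratime,logbufs=8 /dev/sdb2 /srv/node/disk2
   $ sudo chown swift:swift /srv/node/disk2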

Procedure: Checking if an account is okay
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is only available in the HPE Helion Public Cloud.
   Use ``swiftly`` as an alternative.

If you have a tenant ID you can check that the account is okay as
follows from a proxy:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show <Api-Auth-Hash-or-TenantId>

The response will either be similar to a swift list of the account
containers, or an error indicating that the resource could not be found.

In the latter case you can establish whether a backend database exists
for the tenant ID by running the following on a proxy:

.. code::

   $ sudo -u swift swift-get-nodes /etc/swift/account.ring.gz <Api-Auth-Hash-or-TenantId>

The response will list ssh commands that will list the replicated
account databases, if they exist.

Procedure: Revive a deleted account
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Swift accounts are normally not recreated. If a tenant unsubscribes from
Swift, the account is deleted. To re-subscribe to Swift, you can create
a new tenant (new tenant ID), and subscribe to Swift. This creates a
new Swift account with the new tenant ID.

However, until the unsubscribe/new tenant process is supported, you may
hit a situation where a Swift account is deleted and the user is locked
out of Swift.

Deleting the account database files
-----------------------------------

Here is one possible solution. The containers and objects may be lost
forever. The solution is to delete the account database files and
re-create the account. This may only be done once the containers and
objects are completely deleted. This process is untested, but could
work as follows:

#. Use swift-get-nodes to locate the account's database files (on three
   servers).

#. Rename the database files (on three servers).

#. Use ``swiftly`` to create the account (use the original name).

Renaming account database so it can be revived
----------------------------------------------

Get the locations of the database files that hold the account data:

.. code::

   sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-1856-44ae-97db-31242f7ad7a1

   Account    AUTH_redacted-1856-44ae-97db-31242f7ad7a1
   Container  None
   Object     None

   Partition  18914
   Hash       93c41ef56dd69173a9524193ab813e78

   Server:Port Device  15.184.9.126:6002 disk7
   Server:Port Device  15.184.9.94:6002 disk11
   Server:Port Device  15.184.9.103:6002 disk10
   Server:Port Device  15.184.9.80:6002 disk2   [Handoff]
   Server:Port Device  15.184.9.120:6002 disk2  [Handoff]
   Server:Port Device  15.184.9.98:6002 disk2   [Handoff]

   curl -I -XHEAD "http://15.184.9.126:6002/disk7/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.94:6002/disk11/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.103:6002/disk10/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.80:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"  # [Handoff]
   curl -I -XHEAD "http://15.184.9.120:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1" # [Handoff]
   curl -I -XHEAD "http://15.184.9.98:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"  # [Handoff]

   ssh 15.184.9.126 "ls -lah /srv/node/disk7/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.94 "ls -lah /srv/node/disk11/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.103 "ls -lah /srv/node/disk10/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"  # [Handoff]
   ssh 15.184.9.120 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]
   ssh 15.184.9.98 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"  # [Handoff]

Check that the handoff nodes do not have account databases:

.. code::

   $ ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ls: cannot access /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/: No such file or directory

If the handoff node has a database, wait for rebalancing to occur.

Procedure: Temporarily stop load balancers from directing traffic to a proxy server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can stop the load balancers sending requests to a proxy server as
follows. This can be useful when a proxy is misbehaving but you need
Swift running to help diagnose the problem. By removing it from the load
balancers, customers are not impacted by the misbehaving proxy.

#. Ensure that in ``/etc/swift/proxy-server.conf`` the ``disable_path``
   variable is set to ``/etc/swift/disabled-by-file`` (see the example
   snippet after this procedure).

#. Log onto the proxy node.

#. Shut down Swift as follows:

   .. code::

      sudo swift-init proxy shutdown

   .. note::

      Shutdown, not stop.

#. Create the ``/etc/swift/disabled-by-file`` file. For example:

   .. code::

      sudo touch /etc/swift/disabled-by-file

#. Optionally, restart Swift:

   .. code::

      sudo swift-init proxy start

This works because the healthcheck middleware looks for this file. If it
finds it, it returns a 503 error instead of 200/OK. This means the load
balancer should stop sending traffic to the proxy.

``/healthcheck`` will report
``FAIL: disabled by file`` if the ``disabled-by-file`` file exists.
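
For reference, a sketch of the healthcheck filter section in
``/etc/swift/proxy-server.conf``; ``disable_path`` is the relevant option
and the rest of the file is omitted:

.. code::

   [filter:healthcheck]
   use = egg:swift#healthcheck
   # When this file exists, /healthcheck returns 503 "FAIL: disabled by file".
   disable_path = /etc/swift/disabled-by-file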

Procedure: Ad-Hoc disk performance test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can get an idea of whether a disk drive is performing as expected as
follows:

.. code::

   sudo dd bs=1M count=256 if=/dev/zero conv=fdatasync of=/srv/node/disk11/remember-to-delete-this-later

You can expect ~600MB/sec. If you get a low number, repeat the test many
times, as Swift itself may also be reading or writing to the disk, hence
giving a lower number.
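
A corresponding read test is a simple variation; this sketch reads the
file written above back and then removes it (the page cache is dropped
first so the read actually hits the drive):

.. code::

   sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
   sudo dd bs=1M if=/srv/node/disk11/remember-to-delete-this-later of=/dev/null
   sudo rm /srv/node/disk11/remember-to-delete-this-later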
doc/source/ops_runbook/sec-furtherdiagnose.rst (new file, 177 lines)
@@ -0,0 +1,177 @@

==============================
Further issues and resolutions
==============================

.. note::

   The urgency levels in each **Action** column indicate whether or
   not it is required to take immediate action, or if the problem can be
   worked on during business hours.

.. list-table::
   :widths: 33 33 33
   :header-rows: 1

   * - **Scenario**
     - **Description**
     - **Action**
   * - ``/healthcheck`` latency is high.
     - The ``/healthcheck`` test does not tax the proxy very much so any drop in value is probably related to
       network issues, rather than the proxies being very busy. A very slow proxy might impact the average
       number, but it would need to be very slow to shift the number that much.
     - Check networks. Do a ``curl https://<ip-address>/healthcheck``, where ``<ip-address>`` is an individual proxy
       IP address, to see if you can pinpoint a problem in the network.

       Urgency: If there are other indications that your system is slow, you should treat
       this as an urgent problem.
   * - A swift process is not running.
     - You can use ``swift-init status`` to check if swift processes are running on any
       given server.
     - Run this command:

       .. code::

          sudo swift-init all start

       Examine messages in the swift log files to see if there are any
       error messages related to any of the swift processes since the time you
       ran the ``swift-init`` command.

       Take any corrective actions that seem necessary.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - ntpd is not running.
     - NTP is not running.
     - Configure and start NTP.

       Urgency: For proxy servers, this is vital.
   * - The host clock is not synced to an NTP server.
     - The node's time settings do not match the NTP server time.
       This may take some time to sync after a reboot.
     - Assuming NTP is configured and running, you have to wait until the times sync.
   * - A swift process has hundreds to thousands of open file descriptors.
     - May happen to any of the swift processes.
       Known to have happened with an rsyslogd restart and where ``/tmp`` was hanging.
     - Restart the swift processes on the affected node:

       .. code::

          % sudo swift-init all reload

       Urgency:
       If known performance problem: Immediate

       If system seems fine: Medium
   * - A swift process is not owned by the swift user.
     - If the UID of the swift user has changed, then the processes might not be
       owned by that UID.
     - Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Object, account or container files are not owned by the swift user.
     - This typically happens if, during a reinstall or a re-image of a server, the UID
       of the swift user was changed. The data files in the object, account and container
       directories are owned by the original swift UID. As a result, the current swift
       user does not own these files.
     - Correct the UID of the swift user to reflect that of the original UID. An alternative
       action is to change the ownership of every file on all file systems. This alternative
       action is often impractical and will take considerable time.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - A disk drive has a high IO wait or service time.
     - If high IO wait times are seen for a single disk, then the disk drive is the problem.
       If most/all devices are slow, the controller is probably the source of the problem.
       The controller cache may also be misconfigured, which will cause similar long
       wait or service times.
     - As a first step, if your controllers have a cache, check that it is enabled and that its battery/capacitor
       is working.

       Second, reboot the server.
       If the problem persists, file a DC ticket to have the drive or controller replaced.
       See `Diagnose: Slow disk devices` for how to check the drive wait or service times.

       Urgency: Medium
   * - The network interface is not up.
     - Use the ``ifconfig`` and ``ethtool`` commands to determine the network state.
     - You can try restarting the interface. However, generally the interface
       (or cable) is probably broken, especially if the interface is flapping.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - The network interface card (NIC) is not operating at the expected speed.
     - The NIC is running at a slower speed than its nominal rated speed.
       For example, it is running at 100 Mb/s and the NIC is a 1 Gb/s NIC.
     - 1. Try resetting the interface with:

          .. code::

             sudo ethtool -s eth0 speed 1000

          ... and then run:

          .. code::

             sudo lshw -class network

          See if the speed goes to the expected value. Failing
          that, check the hardware (NIC cable/switch port).

       2. If persistent, consider shutting down the server (especially if a proxy)
          until the problem is identified and resolved. If you leave this server
          running it can have a large impact on overall performance.

       Urgency: High
   * - The interface RX/TX error count is non-zero.
     - A value of 0 is typical, but counts of 1 or 2 do not indicate a problem.
     - 1. For low numbers (for example, 1 or 2), you can simply ignore it. Numbers in the range
          3-30 probably indicate that the error count has crept up slowly over a long time.
          Consider rebooting the server to remove the report from the noise.

          Typically, when a cable or interface is bad, the error count goes to 400+; that is,
          it stands out. There may be other symptoms such as the interface going up and down or
          not running at the correct speed. A server with a high error count should be watched.

       2. If the error count continues to climb, consider taking the server down until
          it can be properly investigated. In any case, a reboot should be done to clear
          the error count.

       Urgency: High, if the error count is increasing.

   * - In a swift log you see a message that a process has not replicated in over 24 hours.
     - The replicator has not successfully completed a run in the last 24 hours.
       This indicates that the replicator has probably hung.
     - Use ``swift-init`` to stop and then restart the replicator process.

       Urgency: Low; however, if you recently added or replaced disk drives
       then you should treat this urgently.
   * - Container Updater has not run in 4 hour(s).
     - The service may appear to be running; however, it may be hung. Examine the swift
       logs to see if there are any error messages relating to the container updater. This
       may potentially explain why the container updater is not running.
     - Urgency: Medium

       This may have been triggered by a recent restart of the rsyslog daemon.
       Restart the service with:

       .. code::

          sudo swift-init <service> reload
   * - Object replicator: Reports the remaining time and that time is more than 100 hours.
     - Each replication cycle the object replicator writes a log message to its log
       reporting statistics about the current cycle. This includes an estimate for the
       remaining time needed to replicate all objects. If this time is longer than
       100 hours, there is a problem with the replication process.
     - Urgency: Medium

       Restart the service with:

       .. code::

          sudo swift-init object-replicator reload

       Check that the remaining replication time is going down.
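
Several of the actions above start from ``swift-init``; a short sketch of
the status and reload pattern referenced in the table, run on the
affected node:

.. code::

   # Show which swift services are running on this node.
   $ sudo swift-init all status

   # Reload (graceful restart) a single hung service, for example the container updater.
   $ sudo swift-init container-updater reload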
doc/source/ops_runbook/troubleshooting.rst (new file, 264 lines)
@@ -0,0 +1,264 @@

====================
Troubleshooting tips
====================

Diagnose: Customer complains they receive an HTTP status 500 when trying to browse containers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This entry is prompted by a real customer issue and is exclusively
focused on how that problem was identified.
There are many reasons why an HTTP status of 500 could be returned. If
there are no obvious problems with the swift object store, then it may
be necessary to take a closer look at the user's transactions.
After finding the user's swift account, you can
search the swift proxy logs on each swift proxy server for
transactions from this user. The Linux ``bzgrep`` command can be used to
search all the proxy log files on a node, including the ``.bz2``
compressed files. For example:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139],<redacted>.132.[4-11,132-139] \
     'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*' | dshbak -c
   .
   .
   ----------------
   <redacted>.132.6
   ----------------
   Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132
   <redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af
   /%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - -
   tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130

This shows a ``GET`` operation on the user's account.
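
If ``pdsh`` is not available, the same search can be run on one proxy
node at a time using the account hash from the example above:

.. code::

   $ sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*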
|
.. note::
|
||||||
|
|
||||||
|
The HTTP status returned is 404, not found, rather than 500 as reported by the user.
|
||||||
|
|
||||||
|
Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3`` you can
|
||||||
|
search the swift object servers log files for this transaction ID:
|
||||||
|
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
$ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername>
|
||||||
|
|
||||||
|
-R ssh
|
||||||
|
-w <redacted>.72.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.204.[4-131| 4-131]
|
||||||
|
'sudo bzgrep tx429fc3be354f434ab7f9c6c4206c1dc3 /var/log/swift/server.log*'
|
||||||
|
| dshbak -c
|
||||||
|
.
|
||||||
|
.
|
||||||
|
\---------------\-
|
||||||
|
<redacted>.72.16
|
||||||
|
\---------------\-
|
||||||
|
Feb 29 08:51:57 sw-aw2az1-object013 account-server <redacted>.132.6 - -
|
||||||
|
|
||||||
|
[29/Feb/2012:08:51:57 +0000|] "GET /disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
|
||||||
|
404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-"
|
||||||
|
|
||||||
|
0.0016 ""
|
||||||
|
\---------------\-
|
||||||
|
<redacted>.31
|
||||||
|
\---------------\-
|
||||||
|
Feb 29 08:51:57 node-az2-object060 account-server <redacted>.132.6 - -
|
||||||
|
[29/Feb/2012:08:51:57 +0000|] "GET /disk6/198875/AUTH_redacted-4962-
|
||||||
|
4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0011 ""
|
||||||
|
\---------------\-
|
||||||
|
<redacted>.204.70
|
||||||
|
\---------------\-
|
||||||
|
|
||||||
|
Feb 29 08:51:57 sw-aw2az3-object0067 account-server <redacted>.132.6 - -
|
||||||
|
[29/Feb/2012:08:51:57 +0000|] "GET /disk6/198875/AUTH_redacted-4962-
|
||||||
|
4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0014 ""
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
The 3 GET operations to 3 different object servers that hold the 3
|
||||||
|
replicas of this users account. Each ``GET`` returns a HTTP status of 404,
|
||||||
|
not found.
|
||||||
|
|
||||||
|
Next, use the ``swift-get-nodes`` command to determine exactly where the
|
||||||
|
users account data is stored:
|
||||||
|
|
||||||
|
.. code::

   $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-4962-4692-98fb-52ddda82a5af
   Account AUTH_redacted-4962-4692-98fb-52ddda82a5af
   Container None
   Object None

   Partition 198875
   Hash 1846d99185f8a0edaf65cfbf37439696

   Server:Port Device <redacted>.31:6002 disk6
   Server:Port Device <redacted>.204.70:6002 disk6
   Server:Port Device <redacted>.72.16:6002 disk9
   Server:Port Device <redacted>.204.64:6002 disk11 [Handoff]
   Server:Port Device <redacted>.26:6002 disk11 [Handoff]
   Server:Port Device <redacted>.72.27:6002 disk11 [Handoff]

   curl -I -XHEAD "http://<redacted>.31:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.204.70:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.72.16:6002/disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.204.64:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]
   curl -I -XHEAD "http://<redacted>.26:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]
   curl -I -XHEAD "http://<redacted>.72.27:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]

   ssh <redacted>.31 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.70 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.72.16 "ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.64 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
   ssh <redacted>.26 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
   ssh <redacted>.72.27 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]

Check each of the primary servers, <redacted>.31, <redacted>.204.70 and
<redacted>.72.16, for this user's account. For example, on <redacted>.72.16:

.. code::

   $ ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/
   total 1.0M
   drwxrwxrwx 2 swift swift 98 2012-02-23 14:49 .
   drwxrwxrwx 3 swift swift 45 2012-02-03 23:28 ..
   -rw------- 1 swift swift 15K 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db
   -rw-rw-rw- 1 swift swift 0 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db.pending

So this user's account DB, an SQLite DB, is present. Use ``sqlite3`` to
examine the account:

.. code::

   $ sudo cp /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/1846d99185f8a0edaf65cfbf37439696.db /tmp
   $ sudo sqlite3 /tmp/1846d99185f8a0edaf65cfbf37439696.db
   sqlite> .mode line
   sqlite> select * from account_stat;
   account = AUTH_redacted-4962-4692-98fb-52ddda82a5af
   created_at = 1328311738.42190
   put_timestamp = 1330000873.61411
   delete_timestamp = 1330001026.00514
   container_count = 0
   object_count = 0
   bytes_used = 0
   hash = eb7e5d0ea3544d9def940b19114e8b43
   id = 2de8c8a8-cef9-4a94-a421-2f845802fe90
   status = DELETED
   status_changed_at = 1330001026.00514
   metadata =

.. note::

   The status is ``DELETED``, so this account was deleted. This explains why
   the ``GET`` operations are returning 404, Not Found.

Check the account deletion date/time:

.. code::

   $ python
   >>> import time
   >>> time.ctime(1330001026.00514)
   'Thu Feb 23 12:43:46 2012'

Next, try to find the ``DELETE`` operation for this account in the proxy
server logs:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh
   -w <redacted>.68.[4-11,132-139|4-11,132-139],<redacted>.132.[4-11,132-139|4-11,132-139]
   'sudo bzgrep AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log* | grep -w
   DELETE | awk "{print \$3,\$10,\$12}"' | dshbak -c
   .
   .
   Feb 23 12:43:46 sw-aw2az2-proxy001 proxy-server <redacted>.233.76 <redacted>.66.7
   23/Feb/2012/12/43/46 DELETE /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af/ HTTP/1.0
   204 - Apache-HttpClient/4.1.2%20%28java%201.5%29 <REDACTED>_4f458ee4e4b02a869c3aad02 - - -
   tx4471188b0b87406899973d297c55ab53 - 0.0086

From this you can see the operation that resulted in the account being
deleted.

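To confirm that this logged ``DELETE`` matches the ``delete_timestamp``
recorded in the account DB, a short script can compare the two values. This
is a minimal sketch, not part of the original runbook; it assumes the proxy
logs are in UTC (as the ``+0000`` offsets in the object-server lines suggest)
and reuses the values shown above:

.. code::

   from datetime import datetime, timezone

   # delete_timestamp taken from the account_stat row shown earlier
   delete_ts = 1330001026.00514

   # Date field as it appears in the proxy-server DELETE log line
   log_date = "23/Feb/2012/12/43/46"
   logged = datetime.strptime(log_date, "%d/%b/%Y/%H/%M/%S").replace(tzinfo=timezone.utc)
   recorded = datetime.fromtimestamp(delete_ts, tz=timezone.utc)

   print("logged:  ", logged.isoformat())
   print("recorded:", recorded.isoformat())
   print("difference (seconds): %.3f" % abs((logged - recorded).total_seconds()))

The difference should be well under a second, which confirms that this log
line is the operation that deleted the account.
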
Procedure: Deleting objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Simple case - deleting small number of objects and containers
--------------------------------------------------------------

.. note::

   ``swift-direct`` is specific to the Hewlett Packard Enterprise Helion
   Public Cloud. Use ``swiftly`` as an alternative.

.. note::

   Object and container names are in UTF-8. ``swift-direct`` accepts UTF-8
   directly, not URL-encoded UTF-8 (the REST API expects UTF-8 that is then
   URL-encoded). In practice, cutting and pasting foreign-language strings
   into a terminal window will produce the right result.

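To illustrate the note above, here is a minimal Python sketch (not part of
the original runbook; the object name is made up) showing the difference
between the raw UTF-8 form that ``swift-direct`` accepts and the URL-encoded
form that the REST API expects on the wire:

.. code::

   import urllib.parse

   # Hypothetical object name containing non-ASCII characters
   name = "résumé/février.txt"

   # Raw UTF-8: what you would paste into a swift-direct command
   print(name)

   # URL-encoded UTF-8: what appears in REST API request paths
   print(urllib.parse.quote(name))
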
Hint: Use the ``head`` command before any destructive commands.

To delete a small number of objects, log into any proxy node and proceed
as follows:

Examine the object in question:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct head 132345678912345 container_name obj_name

Check whether ``X-Object-Manifest`` or ``X-Static-Large-Object`` is set. If
either is set, this is the manifest object and the segment objects may be in
another container.

If the ``X-Object-Manifest`` attribute is set, the object is a DLO (Dynamic
Large Object) and you need to find the names of its segment objects. For
example, if ``X-Object-Manifest`` is ``container2/seg-blah``, list the
contents of the container ``container2`` as follows:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show 132345678912345 container2

Pick out the objects whose names start with ``seg-blah``.
Delete the segment objects as follows:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah01
   $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah02
   etc

If ``X-Static-Large-Object`` is set, the object is an SLO (Static Large
Object) and you need to read the contents of the manifest. Do this as
follows (a sketch for parsing the manifest appears after this list):

- Use ``swift-get-nodes`` to get the details of the object's location.
- Change the ``-X HEAD`` to ``-X GET`` and run ``curl`` against one copy.
- This returns a JSON body listing the containers and object names of the
  segments.
- Delete the segment objects as described above for DLO segments.

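The following is a minimal sketch (not from the original runbook) of pulling
the segment names out of the SLO manifest. It assumes the JSON body returned
by the ``curl`` ``GET`` has been saved to a file named ``manifest.json`` and
that each entry carries the segment path in a ``name`` field of the form
``/<container>/<object>``:

.. code::

   import json

   # Load the manifest body previously fetched with curl
   with open("manifest.json") as f:
       segments = json.load(f)

   # Print each segment's container and object name so they can be fed to
   # swift-direct delete commands, as shown above for DLO segments.
   for seg in segments:
       container, _, obj = seg["name"].lstrip("/").partition("/")
       print(container, obj)
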
Once the segments are deleted, you can delete the object using
``swift-direct`` as described above.

Finally, use ``swift-direct`` to delete the container.

Procedure: Decommissioning swift nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Should Swift nodes need to be decommissioned (for example, where they are
being re-purposed), it is very important to follow these steps:

#. In the case of object servers, follow the procedure for removing
   the node from the rings (a quick check that the node is no longer in
   any ring is sketched after this list).
#. In the case of swift proxy servers, have the network team remove
   the node from the load balancers.
#. Open a network ticket to have the node removed from network
   firewalls.
#. Make sure that you remove the ``/etc/swift`` directory and everything in it.

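As a final sanity check before powering the node off, you can confirm that it
no longer appears in any ring. This is a minimal sketch (not part of the
original runbook) using Swift's ring API; it assumes the Swift Python package
and the production rings are available where you run it, and the IP address
is a placeholder:

.. code::

   from swift.common.ring import Ring

   # Placeholder address of the node being decommissioned
   node_ip = "192.0.2.10"

   for ring_file in ("account.ring.gz", "container.ring.gz", "object.ring.gz"):
       ring = Ring("/etc/swift/" + ring_file)
       # ring.devs can contain None entries for devices that were removed
       devs = [d for d in ring.devs if d and d.get("ip") == node_ip]
       if devs:
           print("%s still references %s: %s"
                 % (ring_file, node_ip, [d["device"] for d in devs]))
       else:
           print("%s: no devices on %s" % (ring_file, node_ip))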