Operational procedures guide

This is the operational procedures guide that HPE used
to operate and monitor their public Swift systems.
It has been made publicly available.

Change-Id: Iefb484893056d28beb69265d99ba30c3c84add2b
asettle
2016-02-10 17:58:05 +10:00
committed by John Dickinson
parent 30624a866a
commit 3c61ab4678
8 changed files with 2277 additions and 0 deletions

@@ -86,6 +86,7 @@ Administrator Documentation
admin_guide
replication_network
logs
ops_runbook/index
Object Storage v1 REST API Documentation
========================================

File diff suppressed because it is too large.

@@ -0,0 +1,36 @@
==================
General Procedures
==================
Getting Swift account stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::
``swift-direct`` is specific to the HPE Helion Public Cloud. Use
``swiftly`` as an alternative; the procedure below is only an example.
This procedure describes how to determine the Swift usage for a given
Swift account, that is, the number of containers, the number of objects and
the total bytes used. To do this you will need the project ID.
Log onto one of the Swift proxy servers.
Use ``swift-direct`` to show this account's usage:
.. code::
$ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_redacted-9a11-45f8-aa1c-9e7b1c7904c8
Status: 200
Content-Length: 0
Accept-Ranges: bytes
X-Timestamp: 1379698586.88364
X-Account-Bytes-Used: 67440225625994
X-Account-Container-Count: 1
Content-Type: text/plain; charset=utf-8
X-Account-Object-Count: 8436776
Status: 200
name: my_container count: 8436776 bytes: 67440225625994
This account has 1 container. That container has 8436776 objects. The
total bytes used is 67440225625994.
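If you need these three figures in a script, the headers can be reduced to a single summary line. The following is a minimal sketch, not part of the HPE tooling: the function name ``account_usage`` is hypothetical, and it assumes the headers arrive on standard input in the ``Name: value`` form shown above.

```shell
# Summarise account usage from response headers read on stdin.
# Values are passed through as text, so large byte counts are not reformatted.
account_usage() {
  awk -F': ' '
    $1 == "X-Account-Container-Count" { containers = $2 }
    $1 == "X-Account-Object-Count"    { objects = $2 }
    $1 == "X-Account-Bytes-Used"      { bytes = $2 }
    END { print "containers=" containers " objects=" objects " bytes=" bytes }
  '
}

# Example using the header values from the output above.
account_usage <<'EOF'
X-Account-Bytes-Used: 67440225625994
X-Account-Container-Count: 1
X-Account-Object-Count: 8436776
EOF
```

In practice you would pipe the output of whichever account query tool you use into ``account_usage``.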

@@ -0,0 +1,79 @@
=================
Swift Ops Runbook
=================
This document contains operational procedures that Hewlett Packard Enterprise (HPE) uses to operate
and monitor the Swift system within the HPE Helion Public Cloud. This
document is an excerpt of a larger product-specific handbook. As such,
the material may appear incomplete. The suggestions and recommendations
made in this document are for our particular environment, and may not be
suitable for your environment or situation. We make no representations
concerning the accuracy, adequacy, completeness or suitability of the
information, suggestions or recommendations. This document is provided
for reference only. We are not responsible for your use of any
information, suggestions or recommendations contained herein.
This document also contains references to certain tools that we use to
operate the Swift system within the HPE Helion Public Cloud.
Descriptions of these tools are provided for reference only, as the tools themselves
are not publicly available at this time.
- ``swift-direct``: This is similar to the ``swiftly`` tool.
.. toctree::
:maxdepth: 2
general.rst
diagnose.rst
procedures.rst
maintenance.rst
troubleshooting.rst
Is the system up?
~~~~~~~~~~~~~~~~~
If you have a report that Swift is down, perform the following basic checks:
#. Run Swift functional tests.
#. From a server in your data center, use ``curl`` to check ``/healthcheck``.
#. If you have a monitoring system, check your monitoring system.
#. Check your hardware load balancer infrastructure.
#. Run ``swift-recon`` on a proxy node.
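The ``/healthcheck`` step in the list above can be scripted. This is a rough sketch under stated assumptions: the function names ``healthcheck_verdict`` and ``probe_healthcheck`` are hypothetical, the proxy URL is a placeholder, and the five-second timeout is an arbitrary choice. ``curl`` reports a status of ``000`` when it receives no HTTP response at all.

```shell
# Map an HTTP status code from /healthcheck to a short verdict.
healthcheck_verdict() {
  case "$1" in
    200) echo "up" ;;
    503) echo "disabled or unavailable" ;;
    000) echo "no response" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Probe one proxy. The argument is the base URL of the proxy (placeholder).
probe_healthcheck() {
  status=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$1/healthcheck")
  healthcheck_verdict "$status"
}

# Usage (replace the URL with one of your own proxies):
#   probe_healthcheck "https://proxy.example.com"
healthcheck_verdict 200
```

Keeping the verdict separate from the probe makes it easy to reuse the same mapping for results collected by a monitoring system.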
Run Swift functional tests
--------------------------
We recommend that you set up the functional tests to run against your production
system.
A script for running the functional tests is located in ``swift/.functests``.
External monitoring
-------------------
- We use pingdom.com to monitor the external Swift API. We suggest the
following:
- Do a GET on ``/healthcheck``.
- Create a container, make it public (``X-Container-Read:
.r:*,.rlistings``), create a small file in the container, and do a GET
on the object.
Reference information
~~~~~~~~~~~~~~~~~~~~~
Reference: Swift startup/shutdown
---------------------------------
- Use reload - not stop/start/restart.
- Try to roll sets of servers (especially proxy) in groups of less
than 20% of your servers.
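As a rough illustration of the 20% guideline above, the following sketch computes the largest group size that stays strictly below 20% of a fleet. The function name ``max_reload_batch`` is hypothetical and not part of Swift, and the floor of one server is our assumption so that very small clusters can still be rolled.

```shell
# Largest group size that is strictly less than 20% of the fleet.
# Falls back to 1 so that very small fleets can still be rolled (an assumption).
max_reload_batch() {
  total=$1
  batch=$(( (total * 20 - 1) / 100 ))
  if [ "$batch" -lt 1 ]; then
    batch=1
  fi
  echo "$batch"
}

max_reload_batch 384   # prints 76
```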

@@ -0,0 +1,322 @@
==================
Server maintenance
==================
General assumptions
~~~~~~~~~~~~~~~~~~~
- It is assumed that anyone attempting to replace hardware components
will have already read and understood the appropriate maintenance and
service guides.
- It is assumed that where servers need to be taken off-line for
hardware replacement, that this will be done in series, bringing the
server back on-line before taking the next off-line.
- It is assumed that the operations directed procedure will be used for
identifying hardware for replacement.
Assessing the health of swift
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can run the swift-recon tool on a Swift proxy node to get a quick
check of how Swift is doing. Please note that the numbers below are
necessarily somewhat subjective. Sometimes parameters for which we
say 'low values are good' will have pretty high values for a time. Often
if you wait a while things get better.
For example:
.. code::
sudo swift-recon -rla
===============================================================================
[2012-03-10 12:57:21] Checking async pendings on 384 hosts...
Async stats: low: 0, high: 1, avg: 0, total: 1
===============================================================================
[2012-03-10 12:57:22] Checking replication times on 384 hosts...
[Replication Times] shortest: 1.4113877813, longest: 36.8293570836, avg: 4.86278064749
===============================================================================
[2012-03-10 12:57:22] Checking load avg's on 384 hosts...
[5m load average] lowest: 2.22, highest: 9.5, avg: 4.59578125
[15m load average] lowest: 2.36, highest: 9.45, avg: 4.62622395833
[1m load average] lowest: 1.84, highest: 9.57, avg: 4.5696875
===============================================================================
In the example above we ask for information on replication times (-r),
load averages (-l) and async pendings (-a). This is a healthy Swift
system. Rules-of-thumb for 'good' recon output are:
- Nodes that respond are up and running Swift. If all nodes respond,
that is a good sign. But some nodes may time out. For example:
.. code::
-> [http://<redacted>.29:6000/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
-> [http://<redacted>.31:6000/recon/load:] <urlopen error timed out>
- That could be okay or could require investigation.
- Low values (say < 10 for high and average) for async pendings are
good. Higher values occur when disks are down and/or when the system
is heavily loaded. Many simultaneous PUTs to the same container can
drive async pendings up. This may be normal, and may resolve itself
after a while. If it persists, one way to track down the problem is
to find a node with high async pendings (with ``swift-recon -av | sort
-n -k4``), then check its Swift logs. Often async pendings are high
because a node cannot write to a container on another node. Often
this is because the node or disk is offline or bad. This may be okay
if we know about it.
- Low values for replication times are good. These values rise when new
rings are pushed, and when nodes and devices are brought back on
line.
- Our 'high' load average values are typically in the 9-15 range. If
they are a lot bigger it is worth having a look at the systems
pushing the average up. Run ``swift-recon -av`` to get the individual
averages. To sort the entries with the highest at the end,
run ``swift-recon -av | sort -n -k4``.
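The async pendings rule of thumb above (high and average both below 10) can be checked mechanically. This is a minimal sketch: the function name ``check_async`` is hypothetical, the threshold of 10 is simply the figure quoted above, and it assumes the ``Async stats:`` summary line has the format shown in the example output.

```shell
# Read swift-recon output on stdin and judge the "Async stats:" summary line.
check_async() {
  awk '
    /^Async stats:/ {
      gsub(/,/, "")          # drop commas so the numbers become clean fields
      high = $6; avg = $8    # Async stats: low: N high: N avg: N total: N
      if (high + 0 < 10 && avg + 0 < 10)
        print "async pendings ok (high=" high ", avg=" avg ")"
      else
        print "async pendings elevated (high=" high ", avg=" avg ")"
    }
  '
}

# Example with the summary line from the healthy system above.
check_async <<'EOF'
Async stats: low: 0, high: 1, avg: 0, total: 1
EOF
```

An "elevated" result is only a prompt to investigate; as noted above, high values can be normal for a while after disks or nodes go down.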
For comparison here is the recon output for the same system above when
two entire racks of Swift are down:
.. code::
[2012-03-10 16:56:33] Checking async pendings on 384 hosts...
-> http://<redacted>.22:6000/recon/async: <urlopen error timed out>
-> http://<redacted>.18:6000/recon/async: <urlopen error timed out>
-> http://<redacted>.16:6000/recon/async: <urlopen error timed out>
-> http://<redacted>.13:6000/recon/async: <urlopen error timed out>
-> http://<redacted>.30:6000/recon/async: <urlopen error timed out>
-> http://<redacted>.6:6000/recon/async: <urlopen error timed out>
.........
-> http://<redacted>.5:6000/recon/async: <urlopen error timed out>
-> http://<redacted>.15:6000/recon/async: <urlopen error timed out>
-> http://<redacted>.9:6000/recon/async: <urlopen error timed out>
-> http://<redacted>.27:6000/recon/async: <urlopen error timed out>
-> http://<redacted>.4:6000/recon/async: <urlopen error timed out>
-> http://<redacted>.8:6000/recon/async: <urlopen error timed out>
Async stats: low: 243, high: 659, avg: 413, total: 132275
===============================================================================
[2012-03-10 16:57:48] Checking replication times on 384 hosts...
-> http://<redacted>.22:6000/recon/replication: <urlopen error timed out>
-> http://<redacted>.18:6000/recon/replication: <urlopen error timed out>
-> http://<redacted>.16:6000/recon/replication: <urlopen error timed out>
-> http://<redacted>.13:6000/recon/replication: <urlopen error timed out>
-> http://<redacted>.30:6000/recon/replication: <urlopen error timed out>
-> http://<redacted>.6:6000/recon/replication: <urlopen error timed out>
............
-> http://<redacted>.5:6000/recon/replication: <urlopen error timed out>
-> http://<redacted>.15:6000/recon/replication: <urlopen error timed out>
-> http://<redacted>.9:6000/recon/replication: <urlopen error timed out>
-> http://<redacted>.27:6000/recon/replication: <urlopen error timed out>
-> http://<redacted>.4:6000/recon/replication: <urlopen error timed out>
-> http://<redacted>.8:6000/recon/replication: <urlopen error timed out>
[Replication Times] shortest: 1.38144306739, longest: 112.620954418, avg: 10.2859475361
===============================================================================
[2012-03-10 16:59:03] Checking load avg's on 384 hosts...
-> http://<redacted>.22:6000/recon/load: <urlopen error timed out>
-> http://<redacted>.18:6000/recon/load: <urlopen error timed out>
-> http://<redacted>.16:6000/recon/load: <urlopen error timed out>
-> http://<redacted>.13:6000/recon/load: <urlopen error timed out>
-> http://<redacted>.30:6000/recon/load: <urlopen error timed out>
-> http://<redacted>.6:6000/recon/load: <urlopen error timed out>
............
-> http://<redacted>.15:6000/recon/load: <urlopen error timed out>
-> http://<redacted>.9:6000/recon/load: <urlopen error timed out>
-> http://<redacted>.27:6000/recon/load: <urlopen error timed out>
-> http://<redacted>.4:6000/recon/load: <urlopen error timed out>
-> http://<redacted>.8:6000/recon/load: <urlopen error timed out>
[5m load average] lowest: 1.71, highest: 4.91, avg: 2.486375
[15m load average] lowest: 1.79, highest: 5.04, avg: 2.506125
[1m load average] lowest: 1.46, highest: 4.55, avg: 2.4929375
===============================================================================
.. note::
The replication times and load averages are within reasonable
parameters, even with 80 object stores down. Async pendings, however, are
quite high. This is because the containers on the servers
which are down cannot be updated. When those servers come back up, async
pendings should drop. If async pendings were at this level without an
explanation, we would have a problem.
Recon examples
~~~~~~~~~~~~~~
Here is an example of noting and tracking down a problem with recon.
Running recon shows some async pendings:
.. code::
bob@notso:~/swift-1.4.4/swift$ ssh -q <redacted>.132.7 sudo swift-recon -alr
===============================================================================
[2012-03-14 17:25:55] Checking async pendings on 384 hosts...
Async stats: low: 0, high: 23, avg: 8, total: 3356
===============================================================================
[2012-03-14 17:25:55] Checking replication times on 384 hosts...
[Replication Times] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066
===============================================================================
[2012-03-14 17:25:56] Checking load avg's on 384 hosts...
[5m load average] lowest: 2.35, highest: 8.88, avg: 4.45911458333
[15m load average] lowest: 2.41, highest: 9.11, avg: 4.504765625
[1m load average] lowest: 1.95, highest: 8.56, avg: 4.40588541667
===============================================================================
Why? Running recon again with ``-av`` (not shown here) tells us that
the node with the highest value (23) is <redacted>.72.61. Looking at the log
files on <redacted>.72.61 we see:
.. code::
souzab@<redacted>:~$ sudo tail -f /var/log/swift/background.log | grep -i ERROR
Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
Mar 14 17:28:09 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
Mar 14 17:28:11 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
Mar 14 17:28:20 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
Mar 14 17:28:22 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
That is why this node has a lot of async pendings: a bunch of disks that
are not mounted on <redacted> and <redacted>. There may be other issues,
but clearing this up will likely drop the async pendings a fair bit, as
other nodes will be having the same problem.
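When the log is long, it helps to tally which remote drives are being reported. The following is a sketch only: the function name ``unmounted_drives`` is hypothetical, the IP addresses in the example are made-up placeholders, and it assumes each logged device dictionary lists ``'ip'`` before ``'device'`` as in the excerpt above. It counts every device dictionary it sees, so filter for the error first if your logs print device entries in other messages.

```shell
# Count logged device dictionaries by IP and device from log text on stdin.
# Prints "ip device count", most frequent first.
unmounted_drives() {
  sed -n "s/.*'ip': '\([^']*\)'.*'device': '\([^']*\)'.*/\1 \2/p" |
    sort | uniq -c | sort -rn |
    awk '{ print $2, $3, $1 }'
}

# Example with placeholder addresses in the same format as the log above.
unmounted_drives <<'EOF'
{'zone': 5, 'weight': 1952.0, 'ip': '10.0.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
{'zone': 5, 'weight': 1952.0, 'ip': '10.0.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
{'zone': 5, 'weight': 1952.0, 'ip': '10.0.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
EOF
```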
Assessing the availability risk when multiple storage servers are down
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::
This procedure will tell you if you have a problem; however, in practice
you will not need to use it frequently.
If three storage nodes (or, more precisely, three disks on three
different storage nodes) are down, there is a small but nonzero
probability that user objects, containers, or accounts will not be
available.
Procedure
---------
.. note::
swift has three rings: one each for objects, containers and accounts.
This procedure should be run three times, each time specifying the
appropriate ``*.builder`` file.
#. Determine whether all three nodes are in different Swift zones by
running the ring builder on a proxy node to determine which zones
the storage nodes are in. For example:
.. code::
% sudo swift-ring-builder /etc/swift/object.builder
/etc/swift/object.builder, build version 1467
2097152 partitions, 3 replicas, 5 zones, 1320 devices, 0.02 balance
The minimum number of hours before a partition can be reassigned is 24
Devices: id zone ip address port name weight partitions balance meta
0 1 <redacted>.4 6000 disk0 1708.00 4259 -0.00
1 1 <redacted>.4 6000 disk1 1708.00 4260 0.02
2 1 <redacted>.4 6000 disk2 1952.00 4868 0.01
3 1 <redacted>.4 6000 disk3 1952.00 4868 0.01
4 1 <redacted>.4 6000 disk4 1952.00 4867 -0.01
#. Here, node <redacted>.4 is in zone 1. If two or more of the three
nodes under consideration are in the same Swift zone, they do not
have any ring partitions in common; there is little/no data
availability risk if all three nodes are down.
#. If the nodes are in three distinct Swift zones, it is necessary to
determine whether the nodes have ring partitions in common. Run
``swift-ring-builder`` again, this time with the ``list_parts`` option, and specify
the nodes under consideration. For example (all on one line):
.. code::
% sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2
Partition Matches
91 2
729 2
3754 2
3769 2
3947 2
5818 2
7918 2
8733 2
9509 2
10233 2
#. The ``list_parts`` option to the ring builder indicates how many ring
partitions the nodes have in common. If, as in this case, the
first entry in the list has a Matches column of 2 or less, there
is no data availability risk if all three nodes are down.
#. If the Matches column has entries equal to 3, there is some data
availability risk if all three nodes are down. The risk is generally
small, and is proportional to the number of entries that have a 3 in
the Matches column. For example:
.. code::
Partition Matches
26865 3
362367 3
745940 3
778715 3
797559 3
820295 3
822118 3
839603 3
852332 3
855965 3
858016 3
#. A quick way to count the number of rows with 3 matches is:
.. code::
% sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l
30
#. In this case the nodes have 30 out of a total of 2097152 partitions
in common; about 0.001%. In this case the risk is small but nonzero.
Recall that a partition is simply a portion of the ring mapping
space, not actual data. So having partitions in common is a necessary
but not sufficient condition for data unavailability.
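The count and the percentage from the steps above can be produced in one pass. This is a minimal sketch, assuming ``list_parts`` output in the two-column ``Partition Matches`` format shown above; the function name ``shared_partitions`` is hypothetical, and the total partition count must be supplied from the ring builder summary (2097152 in this example).

```shell
# Read list_parts output on stdin and report how many partitions all three
# nodes share, as a count and as a percentage of the total given in $1.
shared_partitions() {
  awk -v total="$1" '
    $2 == 3 { n++ }
    END { printf "%d of %d partitions (%.4f%%)\n", n, total, 100 * n / total }
  '
}

# Example with a few rows in the format shown above.
shared_partitions 2097152 <<'EOF'
Partition Matches
26865 3
362367 3
745940 3
EOF
```

With the 30 matching rows from the example above, the same calculation gives 30 / 2097152, or roughly 0.0014%.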
.. note::
We should not bring down a node for repair if it shows
Matches entries of 3 with other nodes that are also down.
If three nodes that have 3 partitions in common are all down, there is
a nonzero probability that data are unavailable and we should work to
bring some or all of the nodes up ASAP.

@@ -0,0 +1,367 @@
=================================
Software configuration procedures
=================================
Fix broken GPT table (broken disk partition)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- If a GPT table is broken, running the following command should produce
a message like the one shown below:
.. code::
$ sudo parted -l
.. code::
...
Error: The backup GPT table is corrupt, but the primary appears OK, so that will
be used.
OK/Cancel?
#. To fix this, first install the ``gdisk`` program:
.. code::
$ sudo aptitude install gdisk
#. Run ``gdisk`` for the particular drive with the damaged partition:
.. code::
$ sudo gdisk /dev/sd*a-l*
GPT fdisk (gdisk) version 0.6.14
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.
Warning! One or more CRCs don't match. You should repair the disk!
Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: damaged
/dev/sd
*****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
*****************************************************************************
#. On the command prompt, type ``r`` (recovery and transformation
options), followed by ``d`` (use main GPT header), ``v`` (verify disk)
and finally ``w`` (write table to disk and exit). You will also need to
enter ``Y`` when prompted in order to confirm actions.
.. code::
Command (? for help): r
Recovery/transformation command (? for help): d
Recovery/transformation command (? for help): v
Caution: The CRC for the backup partition table is invalid. This table may
be corrupt. This program will automatically create a new backup partition
table when you save your partitions.
Caution: Partition 1 doesn't begin on a 8-sector boundary. This may
result in degraded performance on some modern (2009 and later) hard disks.
Caution: Partition 2 doesn't begin on a 8-sector boundary. This may
result in degraded performance on some modern (2009 and later) hard disks.
Caution: Partition 3 doesn't begin on a 8-sector boundary. This may
result in degraded performance on some modern (2009 and later) hard disks.
Identified 1 problems!
Recovery/transformation command (? for help): w
Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!
Do you want to proceed, possibly destroying your data? (Y/N): Y
OK; writing new GUID partition table (GPT).
The operation has completed successfully.
#. Run the following command, which should now show that the partition
is recovered and healthy again:
.. code::
$ sudo parted /dev/sd#
#. Finally, uninstall ``gdisk`` from the node:
.. code::
$ sudo aptitude remove gdisk
Procedure: Fix broken XFS filesystem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#. A filesystem may be corrupt or broken if the following output is
observed when checking its label:
.. code::
$ sudo xfs_admin -l /dev/sd#
cache_node_purge: refcount was 1, not zero (node=0x25d5ee0)
xfs_admin: cannot read root inode (117)
cache_node_purge: refcount was 1, not zero (node=0x25d92b0)
xfs_admin: cannot read realtime bitmap inode (117)
bad sb magic # 0 in AG 1
failed to read label in AG 1
#. Run the following commands to remove the broken/corrupt filesystem and replace it.
(This example uses the filesystem ``/dev/sdb2``.) First, replace the partition:
.. code::
$ sudo parted
GNU Parted 2.3
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) select /dev/sdb
Using /dev/sdb
(parted) p
Model: HP LOGICAL VOLUME (scsi)
Disk /dev/sdb: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number Start End Size File system Name Flags
1 17.4kB 1024MB 1024MB ext3 boot
2 1024MB 1751GB 1750GB xfs sw-aw2az1-object045-disk1
3 1751GB 2000GB 249GB lvm
(parted) rm 2
(parted) mkpart primary 2 -1
Warning: You requested a partition from 2000kB to 2000GB.
The closest location we can manage is 1024MB to 1751GB.
Is this still acceptable to you?
Yes/No? Yes
Warning: The resulting partition is not properly aligned for best performance.
Ignore/Cancel? Ignore
(parted) p
Model: HP LOGICAL VOLUME (scsi)
Disk /dev/sdb: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number Start End Size File system Name Flags
1 17.4kB 1024MB 1024MB ext3 boot
2 1024MB 1751GB 1750GB xfs primary
3 1751GB 2000GB 249GB lvm
(parted) quit
#. The next step is to scrub the filesystem and format it:
.. code::
$ sudo dd if=/dev/zero of=/dev/sdb2 bs=$((1024*1024)) count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00480617 s, 218 MB/s
$ sudo /sbin/mkfs.xfs -f -i size=1024 /dev/sdb2
meta-data=/dev/sdb2 isize=1024 agcount=4, agsize=106811524 blks
= sectsz=512 attr=2, projid32bit=0
data = bsize=4096 blocks=427246093, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=208616, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
#. You should now label and mount your filesystem.
#. You can now check whether the filesystem is mounted using the command:
.. code::
$ mount
Procedure: Checking if an account is okay
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::
``swift-direct`` is only available in the HPE Helion Public Cloud.
Use ``swiftly`` as an alternative.
If you have a tenant ID, you can check that the account is okay as follows from a proxy:
.. code::
$ sudo -u swift /opt/hp/swift/bin/swift-direct show <Api-Auth-Hash-or-TenantId>
The response will either be similar to a swift list of the account
containers, or an error indicating that the resource could not be found.
In the latter case you can establish if a backend database exists for
the tenantId by running the following on a proxy:
.. code::
$ sudo -u swift swift-get-nodes /etc/swift/account.ring.gz <Api-Auth-Hash-or-TenantId>
The response will list ssh commands that will list the replicated
account databases, if they exist.
Procedure: Revive a deleted account
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Swift accounts are normally not recreated. If a tenant unsubscribes from
Swift, the account is deleted. To re-subscribe to Swift, you can create
a new tenant (new tenant ID), and subscribe to Swift. This creates a
new Swift account with the new tenant ID.
However, until the unsubscribe/new tenant process is supported, you may
hit a situation where a Swift account is deleted and the user is locked
out of Swift.
Deleting the account database files
-----------------------------------
Here is one possible solution. The containers and objects may be lost
forever. The solution is to delete the account database files and
re-create the account. This may only be done once the containers and
objects are completely deleted. This process is untested, but could
work as follows:
#. Use swift-get-nodes to locate the account's database file (on three
servers).
#. Rename the database files (on three servers).
#. Use ``swiftly`` to create the account (use original name).
Renaming account database so it can be revived
----------------------------------------------
Get the locations of the database files that hold the account data.
.. code::
sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-1856-44ae-97db-31242f7ad7a1
Account AUTH_redacted-1856-44ae-97db-31242f7ad7a1
Container None
Object None
Partition 18914
Hash 93c41ef56dd69173a9524193ab813e78
Server:Port Device 15.184.9.126:6002 disk7
Server:Port Device 15.184.9.94:6002 disk11
Server:Port Device 15.184.9.103:6002 disk10
Server:Port Device 15.184.9.80:6002 disk2 [Handoff]
Server:Port Device 15.184.9.120:6002 disk2 [Handoff]
Server:Port Device 15.184.9.98:6002 disk2 [Handoff]
curl -I -XHEAD "http://15.184.9.126:6002/disk7/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
curl -I -XHEAD "http://15.184.9.94:6002/disk11/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
curl -I -XHEAD "http://15.184.9.103:6002/disk10/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
curl -I -XHEAD "http://15.184.9.80:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1" # [Handoff]
curl -I -XHEAD "http://15.184.9.120:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1" # [Handoff]
curl -I -XHEAD "http://15.184.9.98:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1" # [Handoff]
ssh 15.184.9.126 "ls -lah /srv/node/disk7/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
ssh 15.184.9.94 "ls -lah /srv/node/disk11/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
ssh 15.184.9.103 "ls -lah /srv/node/disk10/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]
ssh 15.184.9.120 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]
ssh 15.184.9.98 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]
Check that the handoff nodes do not have account databases:
.. code::
$ ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
ls: cannot access /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/: No such file or directory
If the handoff node has a database, wait for rebalancing to occur.
Procedure: Temporarily stop load balancers from directing traffic to a proxy server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can stop the load balancers sending requests to a proxy server as
follows. This can be useful when a proxy is misbehaving but you need
Swift running to help diagnose the problem. By removing the proxy from the load
balancers, customers are not impacted by the misbehaving proxy.
#. Ensure that in ``proxy-server.conf`` the ``disable_path`` variable is set to
``/etc/swift/disabled-by-file``.
#. Log onto the proxy node.
#. Shut down Swift as follows:
.. code::
sudo swift-init proxy shutdown
.. note::
Use ``shutdown``, not ``stop``. ``shutdown`` lets in-flight requests finish before the proxy exits.
#. Create the ``/etc/swift/disabled-by-file`` file. For example:
.. code::
sudo touch /etc/swift/disabled-by-file
#. Optionally, restart Swift:
.. code::
sudo swift-init proxy start
This works because the healthcheck middleware looks for this file. If it
finds it, it returns a 503 error instead of 200/OK. This means the load balancer
should stop sending traffic to the proxy.
``/healthcheck`` will report
``FAIL: disabled by file`` if the ``disabled-by-file`` file exists.
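To confirm that a proxy is really out of rotation, you can probe ``/healthcheck`` yourself and read the status code the way a load balancer would. The following is a minimal Python sketch under stated assumptions: the function names are illustrative, and the address in the comment is a placeholder, not a value from this guide.

```python
import urllib.error
import urllib.request


def interpret_healthcheck(status):
    """Map a /healthcheck HTTP status to what a load balancer would conclude."""
    if status == 200:
        return "in service"
    if status == 503:
        return "out of service (possibly disabled by file)"
    return "unexpected status %d" % status


def probe(url, timeout=5):
    """Fetch a /healthcheck URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still available.
        return err.code


if __name__ == "__main__":
    print(interpret_healthcheck(200))
    print(interpret_healthcheck(503))
    # Against a real proxy (placeholder address):
    # print(interpret_healthcheck(probe("http://127.0.0.1:8080/healthcheck")))
```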
Procedure: Ad-Hoc disk performance test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can get an idea of whether a disk drive is performing well as follows:
.. code::
sudo dd bs=1M count=256 if=/dev/zero conv=fdatasync of=/srv/node/disk11/remember-to-delete-this-later
You can expect ~600MB/sec. If you get a low number, repeat the test several
times, because Swift itself may be reading from or writing to the disk at the
same time and so lower the result. Delete the test file when you are done.
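Because a single run can be skewed by Swift's own disk activity, it can help to repeat the measurement and take the median. The sketch below is a rough Python equivalent of the ``dd`` test, not a replacement for it; ``timed_write`` and ``median_rate`` are illustrative names, and the demo writes only a few megabytes to a temporary directory so that it can run anywhere.

```python
import os
import statistics
import tempfile
import time


def timed_write(path, total_bytes=256 * 1024 * 1024, block_size=1024 * 1024):
    """Write zeros to ``path``, flush them to disk, and return MB/s.

    The file is removed afterwards, so nothing is left to delete later.
    """
    block = b"\0" * block_size
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.monotonic()
        while written < total_bytes:
            written += os.write(fd, block[: total_bytes - written])
        os.fsync(fd)  # include the flush in the timing, like conv=fdatasync
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        os.remove(path)
    return written / 1e6 / max(elapsed, 1e-9)


def median_rate(path, runs=5, **kwargs):
    """Repeat the timed write and return the median MB/s."""
    return statistics.median(timed_write(path, **kwargs) for _ in range(runs))


if __name__ == "__main__":
    # Small demo; on a real node, point the path at a file under the disk's
    # mount point and keep the default 256 MiB size.
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "ddtest")
        print("%.1f MB/s" % median_rate(target, runs=3, total_bytes=4 * 1024 * 1024))
```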


@@ -0,0 +1,177 @@
==============================
Further issues and resolutions
==============================
.. note::
The urgency level in each **Action** column indicates whether you need to
take immediate action or whether the problem can be worked
on during business hours.
.. list-table::
:widths: 33 33 33
:header-rows: 1
* - **Scenario**
- **Description**
- **Action**
* - ``/healthcheck`` latency is high.
- The ``/healthcheck`` test does not tax the proxy very much, so any increase in latency is probably related to
network issues rather than the proxies being very busy. A very slow proxy might impact the average
number, but it would need to be very slow to shift the number that much.
- Check networks. Run ``curl https://<ip-address>/healthcheck``, where ``<ip-address>`` is the IP address
of an individual proxy, to see if you can pinpoint a problem in the network.
Urgency: If there are other indications that your system is slow, you should treat
this as an urgent problem.
* - Swift process is not running.
- You can use ``swift-init all status`` to check whether the swift processes are running on any
given server.
- Run this command:
.. code::
sudo swift-init all start
Examine messages in the swift log files to see if there are any
error messages related to any of the swift processes since the time you
ran the ``swift-init`` command.
Take any corrective actions that seem necessary.
Urgency: If this only affects one server, and you have more than one,
identifying and fixing the problem can wait until business hours.
If this same problem affects many servers, then you need to take corrective
action immediately.
* - ntpd is not running.
- NTP is not running.
- Configure and start NTP.
Urgency: For proxy servers, this is vital.
* - Host clock is not synced to an NTP server.
- Node time settings do not match the NTP server time.
This may take some time to sync after a reboot.
- Assuming NTP is configured and running, you have to wait until the times sync.
* - A swift process has hundreds to thousands of open file descriptors.
- May happen to any of the swift processes.
Known to have happened with a ``rsyslogd`` restart and where ``/tmp`` was hanging.
- Restart the swift processes on the affected node:
.. code::
sudo swift-init all reload
Urgency:
If known performance problem: Immediate
If system seems fine: Medium
* - A swift process is not owned by the swift user.
- If the UID of the swift user has changed, then the processes might not be
owned by that UID.
- Urgency: If this only affects one server, and you have more than one,
identifying and fixing the problem can wait until business hours.
If this same problem affects many servers, then you need to take corrective
action immediately.
* - Object, account or container files not owned by swift.
- This typically happens if the UID of the swift user was changed during a reinstall
or a re-image of a server. The data files in the object, account and container
directories are owned by the original swift UID. As a result, the current swift
user does not own these files.
- Correct the UID of the swift user to reflect that of the original UID. An alternate
action is to change the ownership of every file on all file systems. This alternate
action is often impractical and will take considerable time.
Urgency: If this only affects one server, and you have more than one,
identifying and fixing the problem can wait until business hours.
If this same problem affects many servers, then you need to take corrective
action immediately.
* - A disk drive has a high IO wait or service time.
- If high IO wait times are seen for a single disk, then the disk drive is the problem.
If most/all devices are slow, the controller is probably the source of the problem.
The controller cache may also be misconfigured, which will cause similar long
wait or service times.
- As a first step, if your controllers have a cache, check that it is enabled and that its battery/capacitor
is working.
Second, reboot the server.
If problem persists, file a DC ticket to have the drive or controller replaced.
See `Diagnose: Slow disk devices` on how to check the drive wait or service times.
Urgency: Medium
* - The network interface is not up.
- Use the ``ifconfig`` and ``ethtool`` commands to determine the network state.
- You can try restarting the interface. However, generally the interface
(or cable) is probably broken, especially if the interface is flapping.
Urgency: If this only affects one server, and you have more than one,
identifying and fixing the problem can wait until business hours.
If this same problem affects many servers, then you need to take corrective
action immediately.
* - Network interface card (NIC) is not operating at the expected speed.
- The NIC is running at a slower speed than its nominal rated speed.
For example, it is running at 100 Mb/s and the NIC is a 1 GbE NIC.
- 1. Try resetting the interface with:
.. code::
sudo ethtool -s eth0 speed 1000
... and then run:
.. code::
sudo lshw -class network
Check whether ``size`` in the output shows the expected speed. Failing
that, check the hardware (NIC, cable, or switch port).
2. If persistent, consider shutting down the server (especially if a proxy)
until the problem is identified and resolved. If you leave this server
running it can have a large impact on overall performance.
Urgency: High
* - The interface RX/TX error count is non-zero.
- A value of 0 is typical, but counts of 1 or 2 do not indicate a problem.
- 1. For low numbers (for example, 1 or 2), you can simply ignore the count. Numbers in the range
3-30 probably indicate that the error count has crept up slowly over a long time.
Consider rebooting the server to remove the report from the noise.
Typically, when a cable or interface is bad, the error count goes to 400+; that is,
it stands out. There may be other symptoms such as the interface going up and down or
not running at the correct speed. A server with a high error count should be watched.
2. If the error count continues to climb, consider taking the server down until
it can be properly investigated. In any case, a reboot should be done to clear
the error count.
Urgency: High, if the error count is increasing.
* - In a swift log you see a message that a process has not replicated in over 24 hours.
- The replicator has not successfully completed a run in the last 24 hours.
This indicates that the replicator has probably hung.
- Use ``swift-init`` to stop and then restart the replicator process.
Urgency: Low. However, if you recently added or replaced disk drives,
treat this as urgent.
* - Container Updater has not run in 4 hour(s).
- The service may appear to be running; however, it may be hung. Examine the swift
logs to see if there are any error messages relating to the container updater. This
may explain why the container updater is not running.
- Urgency: Medium
This may have been triggered by a recent restart of the rsyslog daemon.
Restart the service with:
.. code::
sudo swift-init <service> reload
* - Object replicator: Reports the remaining time and that time is more than 100 hours.
- Each replication cycle the object replicator writes a log message to its log
reporting statistics about the current cycle. This includes an estimate for the
remaining time needed to replicate all objects. If this time is longer than
100 hours, there is a problem with the replication process.
- Urgency: Medium
Restart the service with:
.. code::
sudo swift-init object-replicator reload
Check that the remaining replication time is going down.
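The RX/TX error-count guidance in the table above can be written down as a small triage function. This is only a sketch of the thresholds quoted in the table; the function name is illustrative, and the handling of counts between 31 and 399, which the table does not cover, is an assumption.

```python
def classify_error_count(count):
    """Triage an interface RX/TX error count using the table's thresholds."""
    if count <= 2:
        return "ignore"
    if count <= 30:
        return "crept up slowly; reboot to clear the counter"
    if count >= 400:
        return "likely bad cable or interface; investigate"
    # 31-399 is not covered by the table; watching the trend is an assumption.
    return "watch whether the count keeps climbing"


if __name__ == "__main__":
    for sample in (0, 2, 17, 120, 450):
        print(sample, "->", classify_error_count(sample))
```

Whatever the absolute number, the table treats a count that keeps increasing as the high-urgency case.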


@@ -0,0 +1,264 @@
====================
Troubleshooting tips
====================
Diagnose: Customer complains they receive a HTTP status 500 when trying to browse containers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This entry is prompted by a real customer issue and focuses exclusively on how
that problem was identified.
There are many reasons why an HTTP status of 500 could be returned. If
there are no obvious problems with the swift object store, then it may
be necessary to take a closer look at the user's transactions.
After finding the user's swift account, you can
search the swift proxy logs on each swift proxy server for
transactions from this user. The Linux ``bzgrep`` command can be used to
search all the proxy log files on a node, including the ``.bz2`` compressed
files. For example:
.. code::
$ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
-w <redacted>.68.[4-11,132-139],<redacted>.132.[4-11,132-139] \
'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*' \
| dshbak -c
.
.
----------------
<redacted>.132.6
----------------
Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132
<redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af/%3Fformat%3Djson
HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - -
tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130
This shows a ``GET`` operation on the user's account.
.. note::
The HTTP status returned is 404, not found, rather than 500 as reported by the user.
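For readers without ``bzgrep`` to hand, the per-node search can be approximated in Python. The sketch below scans plain and ``.bz2`` log files for a string; the function names are illustrative, and it does a plain substring match rather than the whole-word match that ``bzgrep -w`` performs.

```python
import bz2
import glob
import os
import tempfile


def open_log(path):
    """Open a log file as text, transparently handling .bz2 compression."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "rt", encoding="utf-8", errors="replace")


def grep_logs(glob_pattern, needle):
    """Return (path, line) pairs for every line that contains ``needle``."""
    matches = []
    for path in sorted(glob.glob(glob_pattern)):
        with open_log(path) as handle:
            for line in handle:
                if needle in line:
                    matches.append((path, line.rstrip("\n")))
    return matches


if __name__ == "__main__":
    # Demo with throwaway files; on a proxy node the pattern would be
    # something like /var/log/swift/proxy.log*
    with tempfile.TemporaryDirectory() as logdir:
        with open(os.path.join(logdir, "proxy.log"), "w") as fh:
            fh.write("GET /v1.0/AUTH_example 404\nGET /v1.0/AUTH_other 200\n")
        with bz2.open(os.path.join(logdir, "proxy.log.1.bz2"), "wt") as fh:
            fh.write("DELETE /v1.0/AUTH_example 204\n")
        for path, line in grep_logs(os.path.join(logdir, "proxy.log*"), "AUTH_example"):
            print(os.path.basename(path), line)
```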
Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3``, you can
search the swift object servers' log files for this transaction ID:
.. code::
$ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
-w <redacted>.72.[4-67],<redacted>.[4-67],<redacted>.[4-67],<redacted>.204.[4-131] \
'sudo bzgrep tx429fc3be354f434ab7f9c6c4206c1dc3 /var/log/swift/server.log*' \
| dshbak -c
.
.
----------------
<redacted>.72.16
----------------
Feb 29 08:51:57 sw-aw2az1-object013 account-server <redacted>.132.6 - -
[29/Feb/2012:08:51:57 +0000] "GET /disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-"
0.0016 ""
----------------
<redacted>.31
----------------
Feb 29 08:51:57 node-az2-object060 account-server <redacted>.132.6 - -
[29/Feb/2012:08:51:57 +0000] "GET /disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0011 ""
----------------
<redacted>.204.70
----------------
Feb 29 08:51:57 sw-aw2az3-object0067 account-server <redacted>.132.6 - -
[29/Feb/2012:08:51:57 +0000] "GET /disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0014 ""
.. note::
There are three ``GET`` operations, one to each of the three object servers
that hold the three replicas of this user's account. Each ``GET`` returns an
HTTP status of 404, not found.
Next, use the ``swift-get-nodes`` command to determine exactly where the
user's account data is stored:
.. code::
$ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-4962-4692-98fb-52ddda82a5af
Account AUTH_redacted-4962-4692-98fb-52ddda82a5af
Container None
Object None
Partition 198875
Hash 1846d99185f8a0edaf65cfbf37439696
Server:Port Device <redacted>.31:6002 disk6
Server:Port Device <redacted>.204.70:6002 disk6
Server:Port Device <redacted>.72.16:6002 disk9
Server:Port Device <redacted>.204.64:6002 disk11 [Handoff]
Server:Port Device <redacted>.26:6002 disk11 [Handoff]
Server:Port Device <redacted>.72.27:6002 disk11 [Handoff]
curl -I -XHEAD "http://<redacted>.31:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
curl -I -XHEAD "http://<redacted>.204.70:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
curl -I -XHEAD "http://<redacted>.72.16:6002/disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
curl -I -XHEAD "http://<redacted>.204.64:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]
curl -I -XHEAD "http://<redacted>.26:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]
curl -I -XHEAD "http://<redacted>.72.27:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]
ssh <redacted>.31 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
ssh <redacted>.204.70 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
ssh <redacted>.72.16 "ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
ssh <redacted>.204.64 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
ssh <redacted>.26 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
ssh <redacted>.72.27 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
Check each of the primary servers, <redacted>.31, <redacted>.204.70 and <redacted>.72.16, for
this user's account. For example, on <redacted>.72.16:
.. code::
$ ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/
total 1.0M
drwxrwxrwx 2 swift swift 98 2012-02-23 14:49 .
drwxrwxrwx 3 swift swift 45 2012-02-03 23:28 ..
-rw------- 1 swift swift 15K 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db
-rw-rw-rw- 1 swift swift 0 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db.pending
So this user's account database, an SQLite database, is present. Use ``sqlite3`` to
examine the account:
.. code::
$ sudo cp /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/1846d99185f8a0edaf65cfbf37439696.db /tmp
$ sudo sqlite3 /tmp/1846d99185f8a0edaf65cfbf37439696.db
sqlite> .mode line
sqlite> select * from account_stat;
account = AUTH_redacted-4962-4692-98fb-52ddda82a5af
created_at = 1328311738.42190
put_timestamp = 1330000873.61411
delete_timestamp = 1330001026.00514
container_count = 0
object_count = 0
bytes_used = 0
hash = eb7e5d0ea3544d9def940b19114e8b43
id = 2de8c8a8-cef9-4a94-a421-2f845802fe90
status = DELETED
status_changed_at = 1330001026.00514
metadata =
.. note::
The status is ``DELETED``. So this account was deleted. This explains
why the GET operations are returning 404, not found. Check the account
delete date/time:
.. code::
$ python
>>> import time
>>> time.ctime(1330001026.00514)
'Thu Feb 23 12:43:46 2012'
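``time.ctime`` prints the time in the local time zone of the machine it runs on. As a small sketch, the conversion can be pinned to UTC so that the result lines up with the log timestamps regardless of where it is run. The helper name is illustrative; the three values are taken from the ``account_stat`` output above.

```python
from datetime import datetime, timezone


def swift_timestamp_to_utc(value):
    """Convert a Swift epoch timestamp (string or float) to a UTC string."""
    moment = datetime.fromtimestamp(float(value), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    for field, value in [
        ("created_at", "1328311738.42190"),
        ("put_timestamp", "1330000873.61411"),
        ("delete_timestamp", "1330001026.00514"),
    ]:
        print(field, "=", swift_timestamp_to_utc(value), "UTC")
    # delete_timestamp prints 2012-02-23 12:43:46, matching the ctime result.
```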
Next, try to find the ``DELETE`` operation for this account in the proxy
server logs:
.. code::
$ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
-w <redacted>.68.[4-11,132-139],<redacted>.132.[4-11,132-139] \
'sudo bzgrep AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log* | grep -w DELETE | awk "{print \$3,\$10,\$12}"' \
| dshbak -c
.
.
Feb 23 12:43:46 sw-aw2az2-proxy001 proxy-server 15.203.233.76 <redacted>.66.7 23/Feb/2012/12/43/46 DELETE /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af/
HTTP/1.0 204 - Apache-HttpClient/4.1.2%20%28java%201.5%29 <REDACTED>_4f458ee4e4b02a869c3aad02 - - -
tx4471188b0b87406899973d297c55ab53 - 0.0086
From this you can see the operation that resulted in the account being deleted.
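When many log lines have to be checked, it can help to pull the method, path, status and transaction ID out of each one programmatically. The following Python sketch does this with two regular expressions; the patterns are assumptions based only on the sample proxy log lines in this section and may need adjusting for other log formats.

```python
import re

# Matches e.g. 'DELETE /v1.0/AUTH_x/ HTTP/1.0 204'
REQUEST_RE = re.compile(
    r"\b(?P<method>[A-Z]+) (?P<path>/\S*) HTTP/\d\.\d (?P<status>\d{3})\b"
)
# Transaction IDs in the samples are 'tx' followed by hex digits.
TXID_RE = re.compile(r"\btx[0-9a-f]+\b")


def parse_proxy_line(line):
    """Return method, path, status and transaction ID, or None if no match."""
    request = REQUEST_RE.search(line)
    if request is None:
        return None
    txid = TXID_RE.search(line)
    return {
        "method": request.group("method"),
        "path": request.group("path"),
        "status": int(request.group("status")),
        "txid": txid.group(0) if txid else None,
    }


if __name__ == "__main__":
    sample = (
        "Feb 23 12:43:46 sw-aw2az2-proxy001 proxy-server 15.203.233.76 "
        "<redacted>.66.7 23/Feb/2012/12/43/46 DELETE "
        "/v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af/ HTTP/1.0 204 - "
        "Apache-HttpClient/4.1.2%20%28java%201.5%29 "
        "<REDACTED>_4f458ee4e4b02a869c3aad02 - - - "
        "tx4471188b0b87406899973d297c55ab53 - 0.0086"
    )
    print(parse_proxy_line(sample))
```

The extracted transaction ID is the value to search for in the account, container and object server logs.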
Procedure: Deleting objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Simple case - deleting small number of objects and containers
-------------------------------------------------------------
.. note::
``swift-direct`` is specific to the Hewlett Packard Enterprise Helion Public Cloud.
Use ``swiftly`` as an alternative.
.. note::
Object and container names are in UTF-8. ``swift-direct`` accepts UTF-8
directly, not URL-encoded UTF-8 (the REST API expects names to be UTF-8
encoded and then URL-encoded). In practice, cutting and pasting foreign
language strings into a terminal window will produce the right result.
Hint: Use the ``head`` command before any destructive commands.
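To make the difference between the two forms concrete, the short Python sketch below shows the same name as raw UTF-8 text and as the URL-encoded form that the REST API expects. The object name is a made-up example.

```python
from urllib.parse import quote, unquote

# Made-up object name containing a non-ASCII character and a space.
name = "na\u00efve file.txt"

# The REST API form: UTF-8 encode, then URL-encode.
encoded = quote(name)
print(encoded)

# Decoding gives back the raw name.
assert unquote(encoded) == name
```

If a name copied from a proxy log contains ``%`` sequences, decode it first when a tool expects the raw UTF-8 form.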
To delete a small number of objects, log into any proxy node and proceed
as follows:
Examine the object in question:
.. code::
$ sudo -u swift /opt/hp/swift/bin/swift-direct head 132345678912345 container_name obj_name
If ``X-Object-Manifest`` or ``X-Static-Large-Object`` is set,
this is a manifest object and the segment objects may be in another
container.
If the ``X-Object-Manifest`` attribute is set, the object is a Dynamic Large
Object (DLO) and you need to find the names of its segment objects. For example,
if ``X-Object-Manifest`` is ``container2/seg-blah``, list the contents
of the container ``container2`` as follows:
.. code::
$ sudo -u swift /opt/hp/swift/bin/swift-direct show 132345678912345 container2
Pick out the objects whose names start with ``seg-blah``.
Delete the segment objects as follows:
.. code::
$ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah01
$ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah02
etc
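Picking out the segment names can be sketched in a few lines of Python: split the ``X-Object-Manifest`` value into a container and a prefix, then keep the listed names that start with that prefix. The function names and the sample listing are illustrative only.

```python
def split_manifest(value):
    """Split an X-Object-Manifest value into (container, prefix)."""
    container, _, prefix = value.partition("/")
    return container, prefix


def select_segments(names, prefix):
    """Keep only the object names that start with the segment prefix."""
    return [name for name in names if name.startswith(prefix)]


if __name__ == "__main__":
    container, prefix = split_manifest("container2/seg-blah")
    listing = ["other-object", "seg-blah01", "seg-blah02"]  # made-up listing
    print(container, select_segments(listing, prefix))
```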
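If ``X-Static-Large-Object`` is set, the object is a Static Large Object (SLO) and you need to read its contents. Do this by:
- Using ``swift-get-nodes`` to get the details of the object's location.
- Changing the ``-X HEAD`` to ``-X GET`` and dropping ``-I`` (with ``-I``, ``curl`` prints only the headers), then running ``curl`` against one copy.
- Reading the JSON body that is returned, which lists the containers and object names of the segments.
- Deleting the segment objects as described above for DLO segments.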
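Reading the segment list out of the JSON body can be sketched as follows. This is a hedged example: it assumes each manifest entry carries the segment location under a ``name`` key (or ``path``) in the form ``/container/object``, so check the keys in the body you actually receive. The sample body is made up.

```python
import json


def segment_locations(manifest_body):
    """Return (container, object) pairs from an SLO manifest JSON body.

    Assumes each entry has a "name" or "path" key of the form
    "/container/object"; entries without either key are skipped.
    """
    locations = []
    for entry in json.loads(manifest_body):
        full = entry.get("name") or entry.get("path")
        if not full:
            continue
        container, _, obj = full.lstrip("/").partition("/")
        locations.append((container, obj))
    return locations


if __name__ == "__main__":
    # Made-up manifest body for illustration.
    body = json.dumps([
        {"name": "/container2/seg-blah01", "bytes": 1048576},
        {"name": "/container2/seg-blah02", "bytes": 524288},
    ])
    for container, obj in segment_locations(body):
        print(container, obj)
```

Each pair is a container and object name that can be deleted in the same way as the DLO segments.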
Once the segments are deleted, you can delete the object using
``swift-direct`` as described above.
Finally, use ``swift-direct`` to delete the container.
Procedure: Decommissioning swift nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Should Swift nodes need to be decommissioned (for example, where they are being
repurposed), it is very important to follow these steps.
#. In the case of object servers, follow the procedure for removing
the node from the rings.
#. In the case of swift proxy servers, have the network team remove
the node from the load balancers.
#. Open a network ticket to have the node removed from network
firewalls.
#. Make sure that you remove the ``/etc/swift`` directory and everything in it.