Operational procedures guide
This is the operational procedures guide that HPE used to operate and monitor their public Swift systems. It has been made publicly available. Change-Id: Iefb484893056d28beb69265d99ba30c3c84add2b
@@ -86,6 +86,7 @@ Administrator Documentation

   admin_guide
   replication_network
   logs
   ops_runbook/index

Object Storage v1 REST API Documentation
========================================
doc/source/ops_runbook/diagnose.rst (new file, 1031 lines)
File diff suppressed because it is too large.
doc/source/ops_runbook/general.rst (new file, 36 lines)
@@ -0,0 +1,36 @@

==================
General Procedures
==================

Getting swift account stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is specific to the HPE Helion Public Cloud. Go look at
   ``swiftly`` for an alternative; this is an example.

This procedure describes how you determine the swift usage for a given
swift account, that is, the number of containers, number of objects and
total bytes used. To do this you will need the project ID.

Log onto one of the swift proxy servers.

Use swift-direct to show this account's usage:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_redacted-9a11-45f8-aa1c-9e7b1c7904c8
   Status: 200
   Content-Length: 0
   Accept-Ranges: bytes
   X-Timestamp: 1379698586.88364
   X-Account-Bytes-Used: 67440225625994
   X-Account-Container-Count: 1
   Content-Type: text/plain; charset=utf-8
   X-Account-Object-Count: 8436776
   Status: 200
   name: my_container count: 8436776 bytes: 67440225625994

This account has 1 container. That container has 8436776 objects. The
total bytes used is 67440225625994.
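
``swift-direct`` is not available outside the HPE environment. If you only
have standard tooling, the same account totals can usually be read with the
``swift`` CLI from python-swiftclient. A minimal sketch; the auth URL and
credentials are placeholders:

.. code::

   $ swift --os-auth-url https://<keystone-endpoint>:5000/v3 \
           --os-project-name <project> --os-username <user> --os-password <password> \
           stat

``swift stat`` prints the Containers, Objects and Bytes totals for the
account, which correspond to the ``X-Account-Container-Count``,
``X-Account-Object-Count`` and ``X-Account-Bytes-Used`` headers shown above.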
doc/source/ops_runbook/index.rst (new file, 79 lines)
@@ -0,0 +1,79 @@

=================
Swift Ops Runbook
=================

This document contains operational procedures that Hewlett Packard
Enterprise (HPE) uses to operate and monitor the Swift system within the
HPE Helion Public Cloud. This document is an excerpt of a larger
product-specific handbook. As such, the material may appear incomplete.
The suggestions and recommendations made in this document are for our
particular environment, and may not be suitable for your environment or
situation. We make no representations concerning the accuracy, adequacy,
completeness or suitability of the information, suggestions or
recommendations. This document is provided for reference only. We are
not responsible for your use of any information, suggestions or
recommendations contained herein.

This document also contains references to certain tools that we use to
operate the Swift system within the HPE Helion Public Cloud.
Descriptions of these tools are provided for reference only, as the
tools themselves are not publicly available at this time.

- ``swift-direct``: This is similar to the ``swiftly`` tool.


.. toctree::
   :maxdepth: 2

   general.rst
   diagnose.rst
   procedures.rst
   maintenance.rst
   troubleshooting.rst

Is the system up?
~~~~~~~~~~~~~~~~~

If you have a report that Swift is down, perform the following basic
checks (example commands for the ``curl`` and ``swift-recon`` checks are
sketched below):

#. Run swift functional tests.

#. From a server in your data center, use ``curl`` to check
   ``/healthcheck``.

#. If you have a monitoring system, check your monitoring system.

#. Check your hardware load balancer infrastructure.

#. Run swift-recon on a proxy node.
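
A minimal sketch of the ``curl`` and ``swift-recon`` checks listed above;
the proxy address is a placeholder and the healthcheck middleware is
assumed to be enabled:

.. code::

   # A healthy proxy answers /healthcheck with 200 and the body "OK".
   $ curl -i http://<proxy-ip>/healthcheck

   # Quick cluster overview from a proxy node: replication, load and async pendings.
   $ sudo swift-recon -rla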

Run swift functional tests
--------------------------

We recommend that you set up your functional tests against your
production system.

A script for running the functional tests is located in
``swift/.functests``.
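
For example, assuming you have a checkout of the Swift source tree and a
``test.conf`` pointing at the cluster you want to exercise (both are
assumptions about your setup), the script is run from the top of the tree:

.. code::

   $ cd ~/swift        # path to your Swift source checkout (placeholder)
   $ ./.functests      # runs the functional test suite against the configured cluster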

External monitoring
-------------------

- We use pingdom.com to monitor the external Swift API. We suggest the
  following:

  - Do a GET on ``/healthcheck``

  - Create a container, make it public (``X-Container-Read:
    .r*,.rlistings``), create a small file in the container; do a GET
    on the object (see the sketch after this list)
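
A minimal sketch of that public-container check using the standard
``swift`` CLI; the container name, object name and endpoint are
placeholders, and authentication options are omitted:

.. code::

   # Make a world-readable container and upload a small test object.
   $ swift post monitor_container -r '.r:*,.rlistings'
   $ echo "ping" > ping.txt
   $ swift upload monitor_container ping.txt

   # The object should then be fetchable without a token.
   $ curl -i http://<proxy-endpoint>/v1/AUTH_<project>/monitor_container/ping.txt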

Reference information
~~~~~~~~~~~~~~~~~~~~~

Reference: Swift startup/shutdown
---------------------------------

- Use reload - not stop/start/restart.

- Try to roll sets of servers (especially proxy) in groups of less
  than 20% of your servers.
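
For example, a rolling reload of the proxy tier can be done one small
group at a time; ``swift-init`` is the standard Swift service manager and
the grouping is up to you:

.. code::

   # On each proxy node in turn (no more than ~20% of the fleet at once):
   $ sudo swift-init proxy reload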
doc/source/ops_runbook/maintenance.rst (new file, 322 lines)
@@ -0,0 +1,322 @@

==================
Server maintenance
==================

General assumptions
~~~~~~~~~~~~~~~~~~~

- It is assumed that anyone attempting to replace hardware components
  will have already read and understood the appropriate maintenance and
  service guides.

- It is assumed that where servers need to be taken off-line for
  hardware replacement, this will be done in series, bringing each
  server back on-line before taking the next off-line.

- It is assumed that the operations directed procedure will be used for
  identifying hardware for replacement.

Assessing the health of swift
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can run the swift-recon tool on a Swift proxy node to get a quick
check of how Swift is doing. Please note that the numbers below are
necessarily somewhat subjective. Sometimes parameters for which we
say 'low values are good' will have pretty high values for a time. Often
if you wait a while things get better.

For example:

.. code::

   sudo swift-recon -rla
   ===============================================================================
   [2012-03-10 12:57:21] Checking async pendings on 384 hosts...
   Async stats: low: 0, high: 1, avg: 0, total: 1
   ===============================================================================

   [2012-03-10 12:57:22] Checking replication times on 384 hosts...
   [Replication Times] shortest: 1.4113877813, longest: 36.8293570836, avg: 4.86278064749
   ===============================================================================

   [2012-03-10 12:57:22] Checking load avg's on 384 hosts...
   [5m load average] lowest: 2.22, highest: 9.5, avg: 4.59578125
   [15m load average] lowest: 2.36, highest: 9.45, avg: 4.62622395833
   [1m load average] lowest: 1.84, highest: 9.57, avg: 4.5696875
   ===============================================================================

In the example above we ask for information on replication times (-r),
load averages (-l) and async pendings (-a). This is a healthy Swift
system. Rules-of-thumb for 'good' recon output are:
- Nodes that respond are up and running Swift. If all nodes respond,
  that is a good sign. But some nodes may time out. For example:

  .. code::

     -> [http://<redacted>.29:6000/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
     -> [http://<redacted>.31:6000/recon/load:] <urlopen error timed out>

  That could be okay or could require investigation.

- Low values (say < 10 for high and average) for async pendings are
  good. Higher values occur when disks are down and/or when the system
  is heavily loaded. Many simultaneous PUTs to the same container can
  drive async pendings up. This may be normal, and may resolve itself
  after a while. If it persists, one way to track down the problem is
  to find a node with high async pendings (with ``swift-recon -av | sort
  -n -k4``), then check its Swift logs. Often async pendings are high
  because a node cannot write to a container on another node. Often
  this is because the node or disk is offline or bad. This may be okay
  if we know about it.

- Low values for replication times are good. These values rise when new
  rings are pushed, and when nodes and devices are brought back on
  line.

- Our 'high' load average values are typically in the 9-15 range. If
  they are a lot bigger it is worth having a look at the systems
  pushing the average up. Run ``swift-recon -av`` to get the individual
  averages. To sort the entries with the highest at the end,
  run ``swift-recon -av | sort -n -k4``.

For comparison here is the recon output for the same system above when
two entire racks of Swift are down:

.. code::

   [2012-03-10 16:56:33] Checking async pendings on 384 hosts...
   -> http://<redacted>.22:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/async: <urlopen error timed out>
   .........
   -> http://<redacted>.5:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.15:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/async: <urlopen error timed out>
   Async stats: low: 243, high: 659, avg: 413, total: 132275
   ===============================================================================
   [2012-03-10 16:57:48] Checking replication times on 384 hosts...
   -> http://<redacted>.22:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/replication: <urlopen error timed out>
   ............
   -> http://<redacted>.5:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.15:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/replication: <urlopen error timed out>
   [Replication Times] shortest: 1.38144306739, longest: 112.620954418, avg: 10.2859475361
   ===============================================================================
   [2012-03-10 16:59:03] Checking load avg's on 384 hosts...
   -> http://<redacted>.22:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/load: <urlopen error timed out>
   ............
   -> http://<redacted>.15:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/load: <urlopen error timed out>
   [5m load average] lowest: 1.71, highest: 4.91, avg: 2.486375
   [15m load average] lowest: 1.79, highest: 5.04, avg: 2.506125
   [1m load average] lowest: 1.46, highest: 4.55, avg: 2.4929375
   ===============================================================================

.. note::

   The replication times and load averages are within reasonable
   parameters, even with 80 object stores down. Async pendings, however,
   are quite high. This is because the containers on the servers which
   are down cannot be updated. When those servers come back up, async
   pendings should drop. If async pendings were at this level without an
   explanation, we would have a problem.

Recon examples
~~~~~~~~~~~~~~

Here is an example of noting and tracking down a problem with recon.

Running recon shows some async pendings:

.. code::

   bob@notso:~/swift-1.4.4/swift$ ssh -q <redacted>.132.7 sudo swift-recon -alr
   ===============================================================================
   [2012-03-14 17:25:55] Checking async pendings on 384 hosts...
   Async stats: low: 0, high: 23, avg: 8, total: 3356
   ===============================================================================
   [2012-03-14 17:25:55] Checking replication times on 384 hosts...
   [Replication Times] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066
   ===============================================================================
   [2012-03-14 17:25:56] Checking load avg's on 384 hosts...
   [5m load average] lowest: 2.35, highest: 8.88, avg: 4.45911458333
   [15m load average] lowest: 2.41, highest: 9.11, avg: 4.504765625
   [1m load average] lowest: 1.95, highest: 8.56, avg: 4.40588541667
   ===============================================================================

Why? Running recon again with ``-av`` (not shown here) tells us that
the node with the highest async pending count (23) is <redacted>.72.61.
Looking at the log files on <redacted>.72.61 we see:

.. code::

   souzab@<redacted>:~$ sudo tail -f /var/log/swift/background.log | grep -i ERROR
   Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:09 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:11 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:20 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:22 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}

That is why this node has a lot of async pendings: a bunch of disks that
are not mounted on <redacted> and <redacted>. There may be other issues,
but clearing this up will likely drop the async pendings a fair bit, as
other nodes will be having the same problem.
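
Unmounted drives such as these can also be spotted directly with recon's
unmounted-drive check; a minimal sketch, run from a proxy node:

.. code::

   # Ask every object server whether any of its configured devices are unmounted.
   $ sudo swift-recon -u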

Assessing the availability risk when multiple storage servers are down
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   This procedure will tell you if you have a problem. However, in
   practice you will find that you will not use this procedure
   frequently.

If three storage nodes (or, more precisely, three disks on three
different storage nodes) are down, there is a small but nonzero
probability that user objects, containers, or accounts will not be
available.

Procedure
---------

.. note::

   Swift has three rings: one each for objects, containers and accounts.
   This procedure should be run three times, each time specifying the
   appropriate ``*.builder`` file.

#. Determine whether all three nodes are in different Swift zones by
   running the ring builder on a proxy node to determine which zones
   the storage nodes are in. For example:

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder
      /etc/swift/object.builder, build version 1467
      2097152 partitions, 3 replicas, 5 zones, 1320 devices, 0.02 balance
      The minimum number of hours before a partition can be reassigned is 24
      Devices:  id  zone  ip address    port  name   weight   partitions  balance  meta
                 0     1  <redacted>.4  6000  disk0  1708.00        4259    -0.00
                 1     1  <redacted>.4  6000  disk1  1708.00        4260     0.02
                 2     1  <redacted>.4  6000  disk2  1952.00        4868     0.01
                 3     1  <redacted>.4  6000  disk3  1952.00        4868     0.01
                 4     1  <redacted>.4  6000  disk4  1952.00        4867    -0.01

#. Here, node <redacted>.4 is in zone 1. If two or more of the three
   nodes under consideration are in the same Swift zone, they do not
   have any ring partitions in common; there is little/no data
   availability risk if all three nodes are down.

#. If the nodes are in three distinct Swift zones, it is necessary to
   check whether the nodes have ring partitions in common. Run
   ``swift-ring-builder`` again, this time with the ``list_parts`` option
   and specify the nodes under consideration. For example (all on one
   line):

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2
      Partition   Matches
      91          2
      729         2
      3754        2
      3769        2
      3947        2
      5818        2
      7918        2
      8733        2
      9509        2
      10233       2

#. The ``list_parts`` option to the ring builder indicates how many ring
   partitions the nodes have in common. If, as in this case, the
   first entry in the list has a 'Matches' column of 2 or less, there
   is no data availability risk if all three nodes are down.

#. If the 'Matches' column has entries equal to 3, there is some data
   availability risk if all three nodes are down. The risk is generally
   small, and is proportional to the number of entries that have a 3 in
   the Matches column. For example:

   .. code::

      Partition   Matches
      26865       3
      362367      3
      745940      3
      778715      3
      797559      3
      820295      3
      822118      3
      839603      3
      852332      3
      855965      3
      858016      3

#. A quick way to count the number of rows with 3 matches is:

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l

      30

#. In this case the nodes have 30 out of a total of 2097152 partitions
   in common; about 0.001%. In this case the risk is small but nonzero.
   Recall that a partition is simply a portion of the ring mapping
   space, not actual data. So having partitions in common is a necessary
   but not sufficient condition for data unavailability.

   .. note::

      We should not bring down a node for repair if it shows
      Matches entries of 3 with other nodes that are also down.

      If three nodes that have 3 partitions in common are all down, there
      is a nonzero probability that data are unavailable and we should
      work to bring some or all of the nodes up ASAP.
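
Because the note above says to repeat this procedure for each of the
three rings, a small wrapper loop can save typing. This is only a sketch;
the node addresses are placeholders for the nodes under consideration:

.. code::

   # Count partitions with all three replicas on the candidate nodes, per ring.
   $ for ring in object container account; do
         echo "=== ${ring} ring ==="
         sudo swift-ring-builder /etc/swift/${ring}.builder \
             list_parts <node1> <node2> <node3> | grep "3$" | wc -l
     done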
doc/source/ops_runbook/procedures.rst (new file, 367 lines)
@@ -0,0 +1,367 @@

=================================
Software configuration procedures
=================================

Fix broken GPT table (broken disk partition)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- If a GPT table is broken, a message like the following is observed
  when the following command is run:

  .. code::

     $ sudo parted -l

  .. code::

     ...
     Error: The backup GPT table is corrupt, but the primary appears OK, so that will
     be used.
     OK/Cancel?

#. To fix this, first install the ``gdisk`` program:

   .. code::

      $ sudo aptitude install gdisk

#. Run ``gdisk`` for the particular drive with the damaged partition:

   .. code::

      $ sudo gdisk /dev/sd*a-l*
      GPT fdisk (gdisk) version 0.6.14

      Caution: invalid backup GPT header, but valid main header; regenerating
      backup header from main header.

      Warning! One or more CRCs don't match. You should repair the disk!

      Partition table scan:
        MBR: protective
        BSD: not present
        APM: not present
        GPT: damaged
      /dev/sd
      *****************************************************************************
      Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
      verification and recovery are STRONGLY recommended.
      *****************************************************************************

#. At the command prompt, type ``r`` (recovery and transformation
   options), followed by ``d`` (use main GPT header), ``v`` (verify disk)
   and finally ``w`` (write table to disk and exit). You will also need
   to enter ``Y`` when prompted in order to confirm actions.

   .. code::

      Command (? for help): r

      Recovery/transformation command (? for help): d

      Recovery/transformation command (? for help): v

      Caution: The CRC for the backup partition table is invalid. This table may
      be corrupt. This program will automatically create a new backup partition
      table when you save your partitions.

      Caution: Partition 1 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Caution: Partition 2 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Caution: Partition 3 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Identified 1 problems!

      Recovery/transformation command (? for help): w

      Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
      PARTITIONS!!

      Do you want to proceed, possibly destroying your data? (Y/N): Y

      OK; writing new GUID partition table (GPT).
      The operation has completed successfully.

#. Run the following command; it should now show that the partition is
   recovered and healthy again:

   .. code::

      $ sudo parted /dev/sd#

#. Finally, uninstall ``gdisk`` from the node:

   .. code::

      $ sudo aptitude remove gdisk

Procedure: Fix broken XFS filesystem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. A filesystem may be corrupt or broken if the following output is
   observed when checking its label:

   .. code::

      $ sudo xfs_admin -l /dev/sd#
        cache_node_purge: refcount was 1, not zero (node=0x25d5ee0)
        xfs_admin: cannot read root inode (117)
        cache_node_purge: refcount was 1, not zero (node=0x25d92b0)
        xfs_admin: cannot read realtime bitmap inode (117)
        bad sb magic # 0 in AG 1
        failed to read label in AG 1

#. Run the following commands to remove the broken/corrupt filesystem
   and replace it. (This example uses the filesystem ``/dev/sdb2``.)
   First, replace the partition:

   .. code::

      $ sudo parted
      GNU Parted 2.3
      Using /dev/sda
      Welcome to GNU Parted! Type 'help' to view a list of commands.
      (parted) select /dev/sdb
      Using /dev/sdb
      (parted) p
      Model: HP LOGICAL VOLUME (scsi)
      Disk /dev/sdb: 2000GB
      Sector size (logical/physical): 512B/512B
      Partition Table: gpt

      Number  Start   End     Size    File system  Name                       Flags
       1      17.4kB  1024MB  1024MB  ext3                                    boot
       2      1024MB  1751GB  1750GB  xfs          sw-aw2az1-object045-disk1
       3      1751GB  2000GB  249GB                lvm

      (parted) rm 2
      (parted) mkpart primary 2 -1
      Warning: You requested a partition from 2000kB to 2000GB.
      The closest location we can manage is 1024MB to 1751GB.
      Is this still acceptable to you?
      Yes/No? Yes
      Warning: The resulting partition is not properly aligned for best performance.
      Ignore/Cancel? Ignore
      (parted) p
      Model: HP LOGICAL VOLUME (scsi)
      Disk /dev/sdb: 2000GB
      Sector size (logical/physical): 512B/512B
      Partition Table: gpt

      Number  Start   End     Size    File system  Name     Flags
       1      17.4kB  1024MB  1024MB  ext3                  boot
       2      1024MB  1751GB  1750GB  xfs          primary
       3      1751GB  2000GB  249GB                lvm

      (parted) quit

#. Next, scrub the filesystem and format:

   .. code::

      $ sudo dd if=/dev/zero of=/dev/sdb2 bs=$((1024*1024)) count=1
      1+0 records in
      1+0 records out
      1048576 bytes (1.0 MB) copied, 0.00480617 s, 218 MB/s
      $ sudo /sbin/mkfs.xfs -f -i size=1024 /dev/sdb2
      meta-data=/dev/sdb2          isize=1024   agcount=4, agsize=106811524 blks
               =                   sectsz=512   attr=2, projid32bit=0
      data     =                   bsize=4096   blocks=427246093, imaxpct=5
               =                   sunit=0      swidth=0 blks
      naming   =version 2          bsize=4096   ascii-ci=0
      log      =internal log       bsize=4096   blocks=208616, version=2
               =                   sectsz=512   sunit=0 blks, lazy-count=1
      realtime =none               extsz=4096   blocks=0, rtextents=0

#. You should now label and mount your filesystem (a sketch is shown
   after this procedure).

#. You can now check that the filesystem is mounted using the command:

   .. code::

      $ mount
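
A minimal sketch of the label-and-mount step referenced above, assuming
the device is ``/dev/sdb2`` and that this node mounts it as ``disk2``
under ``/srv/node`` (the label, mount point and mount options should
match your own ring and fstab conventions):

.. code::

   # Label the new filesystem and mount it where the ring expects it.
   $ sudo xfs_admin -L disk2 /dev/sdb2
   $ sudo mkdir -p /srv/node/disk2
   $ sudo mount -t xfs -o noatime,nodiratime,logbufs=8 /dev/sdb2 /srv/node/disk2
   $ sudo chown swift:swift /srv/node/disk2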

Procedure: Checking if an account is okay
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is only available in the HPE Helion Public Cloud.
   Use ``swiftly`` as an alternative.

If you have a tenant ID you can check that the account is okay as
follows from a proxy:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show <Api-Auth-Hash-or-TenantId>

The response will either be similar to a swift list of the account
containers, or an error indicating that the resource could not be found.

In the latter case you can establish whether a backend database exists
for the tenant ID by running the following on a proxy:

.. code::

   $ sudo -u swift swift-get-nodes /etc/swift/account.ring.gz <Api-Auth-Hash-or-TenantId>

The response will list ssh commands that will list the replicated
account databases, if they exist.

Procedure: Revive a deleted account
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Swift accounts are normally not recreated. If a tenant unsubscribes from
Swift, the account is deleted. To re-subscribe to Swift, you can create
a new tenant (new tenant ID), and subscribe to Swift. This creates a
new Swift account with the new tenant ID.

However, until the unsubscribe/new tenant process is supported, you may
hit a situation where a Swift account is deleted and the user is locked
out of Swift.

Deleting the account database files
-----------------------------------

Here is one possible solution. The containers and objects may be lost
forever. The solution is to delete the account database files and
re-create the account. This may only be done once the containers and
objects are completely deleted. This process is untested, but could
work as follows:

#. Use swift-get-nodes to locate the account's database files (on three
   servers).

#. Rename the database files (on three servers).

#. Use ``swiftly`` to create the account (use the original name).

Renaming account database so it can be revived
----------------------------------------------

Get the locations of the database files that hold the account data:

.. code::

   sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-1856-44ae-97db-31242f7ad7a1

   Account    AUTH_redacted-1856-44ae-97db-31242f7ad7a1
   Container  None
   Object     None

   Partition  18914
   Hash       93c41ef56dd69173a9524193ab813e78

   Server:Port Device  15.184.9.126:6002 disk7
   Server:Port Device  15.184.9.94:6002 disk11
   Server:Port Device  15.184.9.103:6002 disk10
   Server:Port Device  15.184.9.80:6002 disk2   [Handoff]
   Server:Port Device  15.184.9.120:6002 disk2  [Handoff]
   Server:Port Device  15.184.9.98:6002 disk2   [Handoff]

   curl -I -XHEAD "http://15.184.9.126:6002/disk7/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.94:6002/disk11/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.103:6002/disk10/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.80:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"  # [Handoff]
   curl -I -XHEAD "http://15.184.9.120:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1" # [Handoff]
   curl -I -XHEAD "http://15.184.9.98:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"  # [Handoff]

   ssh 15.184.9.126 "ls -lah /srv/node/disk7/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.94 "ls -lah /srv/node/disk11/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.103 "ls -lah /srv/node/disk10/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"  # [Handoff]
   ssh 15.184.9.120 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]
   ssh 15.184.9.98 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"  # [Handoff]

Check that the handoff nodes do not have account databases:

.. code::

   $ ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ls: cannot access /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/: No such file or directory

If the handoff node has a database, wait for rebalancing to occur.

Procedure: Temporarily stop load balancers from directing traffic to a proxy server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can stop the load balancers sending requests to a proxy server as
follows. This can be useful when a proxy is misbehaving but you need
Swift running to help diagnose the problem. By removing it from the load
balancers, customers are not impacted by the misbehaving proxy.

#. Ensure that in ``/etc/swift/proxy-server.conf`` the ``disable_path``
   variable is set to ``/etc/swift/disabled-by-file`` (see the example
   snippet after this procedure).

#. Log onto the proxy node.

#. Shut down Swift as follows:

   .. code::

      sudo swift-init proxy shutdown

   .. note::

      Shutdown, not stop.

#. Create the ``/etc/swift/disabled-by-file`` file. For example:

   .. code::

      sudo touch /etc/swift/disabled-by-file

#. Optionally, restart Swift:

   .. code::

      sudo swift-init proxy start

This works because the healthcheck middleware looks for this file. If it
finds it, it returns a 503 error instead of 200/OK. This means the load
balancer should stop sending traffic to the proxy.

``/healthcheck`` will report
``FAIL: disabled by file`` if the ``disabled-by-file`` file exists.
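
For reference, a sketch of the healthcheck filter section in
``/etc/swift/proxy-server.conf``; ``disable_path`` is the relevant option
and the rest of the file is omitted:

.. code::

   [filter:healthcheck]
   use = egg:swift#healthcheck
   # When this file exists, /healthcheck returns 503 "FAIL: disabled by file".
   disable_path = /etc/swift/disabled-by-file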

Procedure: Ad-Hoc disk performance test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can get an idea of whether a disk drive is performing as expected as
follows:

.. code::

   sudo dd bs=1M count=256 if=/dev/zero conv=fdatasync of=/srv/node/disk11/remember-to-delete-this-later

You can expect ~600MB/sec. If you get a low number, repeat the test many
times, as Swift itself may also be reading or writing to the disk, hence
giving a lower number.
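
A corresponding read test is a simple variation; this sketch reads the
file written above back and then removes it (the page cache is dropped
first so the read actually hits the drive):

.. code::

   sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
   sudo dd bs=1M if=/srv/node/disk11/remember-to-delete-this-later of=/dev/null
   sudo rm /srv/node/disk11/remember-to-delete-this-later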
doc/source/ops_runbook/sec-furtherdiagnose.rst (new file, 177 lines)
@@ -0,0 +1,177 @@

==============================
Further issues and resolutions
==============================

.. note::

   The urgency levels in each **Action** column indicate whether or
   not it is required to take immediate action, or if the problem can be
   worked on during business hours.

.. list-table::
   :widths: 33 33 33
   :header-rows: 1

   * - **Scenario**
     - **Description**
     - **Action**
   * - ``/healthcheck`` latency is high.
     - The ``/healthcheck`` test does not tax the proxy very much so any drop in value is probably related to
       network issues, rather than the proxies being very busy. A very slow proxy might impact the average
       number, but it would need to be very slow to shift the number that much.
     - Check networks. Do a ``curl https://<ip-address>/healthcheck``, where ``<ip-address>`` is an individual proxy
       IP address, to see if you can pinpoint a problem in the network.

       Urgency: If there are other indications that your system is slow, you should treat
       this as an urgent problem.
   * - A swift process is not running.
     - You can use ``swift-init status`` to check if swift processes are running on any
       given server.
     - Run this command:

       .. code::

          sudo swift-init all start

       Examine messages in the swift log files to see if there are any
       error messages related to any of the swift processes since the time you
       ran the ``swift-init`` command.

       Take any corrective actions that seem necessary.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - ntpd is not running.
     - NTP is not running.
     - Configure and start NTP.

       Urgency: For proxy servers, this is vital.
   * - The host clock is not synced to an NTP server.
     - The node's time settings do not match the NTP server time.
       This may take some time to sync after a reboot.
     - Assuming NTP is configured and running, you have to wait until the times sync.
   * - A swift process has hundreds to thousands of open file descriptors.
     - May happen to any of the swift processes.
       Known to have happened with an rsyslogd restart and where ``/tmp`` was hanging.
     - Restart the swift processes on the affected node:

       .. code::

          % sudo swift-init all reload

       Urgency:
       If known performance problem: Immediate

       If system seems fine: Medium
   * - A swift process is not owned by the swift user.
     - If the UID of the swift user has changed, then the processes might not be
       owned by that UID.
     - Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Object, account or container files are not owned by the swift user.
     - This typically happens if, during a reinstall or a re-image of a server, the UID
       of the swift user was changed. The data files in the object, account and container
       directories are owned by the original swift UID. As a result, the current swift
       user does not own these files.
     - Correct the UID of the swift user to reflect that of the original UID. An alternative
       action is to change the ownership of every file on all file systems. This alternative
       action is often impractical and will take considerable time.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - A disk drive has a high IO wait or service time.
     - If high IO wait times are seen for a single disk, then the disk drive is the problem.
       If most/all devices are slow, the controller is probably the source of the problem.
       The controller cache may also be misconfigured, which will cause similar long
       wait or service times.
     - As a first step, if your controllers have a cache, check that it is enabled and that its battery/capacitor
       is working.

       Second, reboot the server.
       If the problem persists, file a DC ticket to have the drive or controller replaced.
       See `Diagnose: Slow disk devices` for how to check the drive wait or service times.

       Urgency: Medium
   * - The network interface is not up.
     - Use the ``ifconfig`` and ``ethtool`` commands to determine the network state.
     - You can try restarting the interface. However, generally the interface
       (or cable) is probably broken, especially if the interface is flapping.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - The network interface card (NIC) is not operating at the expected speed.
     - The NIC is running at a slower speed than its nominal rated speed.
       For example, it is running at 100 Mb/s and the NIC is a 1 Gb/s NIC.
     - 1. Try resetting the interface with:

          .. code::

             sudo ethtool -s eth0 speed 1000

          ... and then run:

          .. code::

             sudo lshw -class network

          See if the speed goes to the expected value. Failing
          that, check the hardware (NIC cable/switch port).

       2. If persistent, consider shutting down the server (especially if a proxy)
          until the problem is identified and resolved. If you leave this server
          running it can have a large impact on overall performance.

       Urgency: High
   * - The interface RX/TX error count is non-zero.
     - A value of 0 is typical, but counts of 1 or 2 do not indicate a problem.
     - 1. For low numbers (for example, 1 or 2), you can simply ignore it. Numbers in the range
          3-30 probably indicate that the error count has crept up slowly over a long time.
          Consider rebooting the server to remove the report from the noise.

          Typically, when a cable or interface is bad, the error count goes to 400+; that is,
          it stands out. There may be other symptoms such as the interface going up and down or
          not running at the correct speed. A server with a high error count should be watched.

       2. If the error count continues to climb, consider taking the server down until
          it can be properly investigated. In any case, a reboot should be done to clear
          the error count.

       Urgency: High, if the error count is increasing.

   * - In a swift log you see a message that a process has not replicated in over 24 hours.
     - The replicator has not successfully completed a run in the last 24 hours.
       This indicates that the replicator has probably hung.
     - Use ``swift-init`` to stop and then restart the replicator process.

       Urgency: Low; however, if you recently added or replaced disk drives
       then you should treat this urgently.
   * - Container Updater has not run in 4 hour(s).
     - The service may appear to be running; however, it may be hung. Examine the swift
       logs to see if there are any error messages relating to the container updater. This
       may potentially explain why the container updater is not running.
     - Urgency: Medium

       This may have been triggered by a recent restart of the rsyslog daemon.
       Restart the service with:

       .. code::

          sudo swift-init <service> reload
   * - Object replicator: Reports the remaining time and that time is more than 100 hours.
     - Each replication cycle the object replicator writes a log message to its log
       reporting statistics about the current cycle. This includes an estimate for the
       remaining time needed to replicate all objects. If this time is longer than
       100 hours, there is a problem with the replication process.
     - Urgency: Medium

       Restart the service with:

       .. code::

          sudo swift-init object-replicator reload

       Check that the remaining replication time is going down.
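
Several of the actions above start from ``swift-init``; a short sketch of
the status and reload pattern referenced in the table, run on the
affected node:

.. code::

   # Show which swift services are running on this node.
   $ sudo swift-init all status

   # Reload (graceful restart) a single hung service, for example the container updater.
   $ sudo swift-init container-updater reload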
doc/source/ops_runbook/troubleshooting.rst (new file, 264 lines)
@@ -0,0 +1,264 @@

====================
Troubleshooting tips
====================

Diagnose: Customer complains they receive an HTTP status 500 when trying to browse containers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This entry is prompted by a real customer issue and is exclusively
focused on how that problem was identified.
There are many reasons why an HTTP status of 500 could be returned. If
there are no obvious problems with the swift object store, then it may
be necessary to take a closer look at the user's transactions.
After finding the user's swift account, you can
search the swift proxy logs on each swift proxy server for
transactions from this user. The Linux ``bzgrep`` command can be used to
search all the proxy log files on a node, including the ``.bz2``
compressed files. For example:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139],<redacted>.132.[4-11,132-139] \
     'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*' | dshbak -c
   .
   .
   ----------------
   <redacted>.132.6
   ----------------
   Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132
   <redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af
   /%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - -
   tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130

This shows a ``GET`` operation on the user's account.
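
If ``pdsh`` is not available, the same search can be run on one proxy
node at a time using the account hash from the example above:

.. code::

   $ sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*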
|
.. note::
|
||||||
|
|
||||||
|
The HTTP status returned is 404, not found, rather than 500 as reported by the user.
|
||||||
|
|
||||||
|
Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3`` you can
|
||||||
|
search the swift object servers log files for this transaction ID:
|
||||||
|
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
$ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername>
|
||||||
|
|
||||||
|
-R ssh
|
||||||
|
-w <redacted>.72.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.204.[4-131| 4-131]
|
||||||
|
'sudo bzgrep tx429fc3be354f434ab7f9c6c4206c1dc3 /var/log/swift/server.log*'
|
||||||
|
| dshbak -c
|
||||||
|
.
|
||||||
|
.
|
||||||
|
\---------------\-
|
||||||
|
<redacted>.72.16
|
||||||
|
\---------------\-
|
||||||
|
Feb 29 08:51:57 sw-aw2az1-object013 account-server <redacted>.132.6 - -
|
||||||
|
|
||||||
|
[29/Feb/2012:08:51:57 +0000|] "GET /disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
|
||||||
|
404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-"
|
||||||
|
|
||||||
|
0.0016 ""
|
||||||
|
\---------------\-
|
||||||
|
<redacted>.31
|
||||||
|
\---------------\-
|
||||||
|
Feb 29 08:51:57 node-az2-object060 account-server <redacted>.132.6 - -
|
||||||
|
[29/Feb/2012:08:51:57 +0000|] "GET /disk6/198875/AUTH_redacted-4962-
|
||||||
|
4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0011 ""
|
||||||
|
\---------------\-
|
||||||
|
<redacted>.204.70
|
||||||
|
\---------------\-
|
||||||
|
|
||||||
|
Feb 29 08:51:57 sw-aw2az3-object0067 account-server <redacted>.132.6 - -
|
||||||
|
[29/Feb/2012:08:51:57 +0000|] "GET /disk6/198875/AUTH_redacted-4962-
|
||||||
|
4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0014 ""
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
The 3 GET operations to 3 different object servers that hold the 3
|
||||||
|
replicas of this users account. Each ``GET`` returns a HTTP status of 404,
|
||||||
|
not found.
|
||||||
|
|
||||||
|
Next, use the ``swift-get-nodes`` command to determine exactly where the
|
||||||
|
users account data is stored:
|
||||||
|
|
||||||
|
.. code::

   $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-4962-4692-98fb-52ddda82a5af
   Account AUTH_redacted-4962-4692-98fb-52ddda82a5af
   Container None
   Object None

   Partition 198875
   Hash 1846d99185f8a0edaf65cfbf37439696

   Server:Port Device <redacted>.31:6002 disk6
   Server:Port Device <redacted>.204.70:6002 disk6
   Server:Port Device <redacted>.72.16:6002 disk9
   Server:Port Device <redacted>.204.64:6002 disk11 [Handoff]
   Server:Port Device <redacted>.26:6002 disk11 [Handoff]
   Server:Port Device <redacted>.72.27:6002 disk11 [Handoff]

   curl -I -XHEAD "http://<redacted>.31:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.204.70:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.72.16:6002/disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.204.64:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]
   curl -I -XHEAD "http://<redacted>.26:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]
   curl -I -XHEAD "http://<redacted>.72.27:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]

   ssh <redacted>.31 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.70 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.72.16 "ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.64 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
   ssh <redacted>.26 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
   ssh <redacted>.72.27 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]

Check each of the primary servers, <redacted>.31, <redacted>.204.70 and
<redacted>.72.16, for this user's account. For example, on <redacted>.72.16:

.. code::

   $ ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/
   total 1.0M
   drwxrwxrwx 2 swift swift 98 2012-02-23 14:49 .
   drwxrwxrwx 3 swift swift 45 2012-02-03 23:28 ..
   -rw------- 1 swift swift 15K 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db
   -rw-rw-rw- 1 swift swift 0 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db.pending

So this user's account DB, an SQLite DB, is present. Use ``sqlite3`` to
examine the account:

.. code::

   $ sudo cp /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/1846d99185f8a0edaf65cfbf37439696.db /tmp
   $ sudo sqlite3 /tmp/1846d99185f8a0edaf65cfbf37439696.db
   sqlite> .mode line
   sqlite> select * from account_stat;
   account = AUTH_redacted-4962-4692-98fb-52ddda82a5af
   created_at = 1328311738.42190
   put_timestamp = 1330000873.61411
   delete_timestamp = 1330001026.00514
   container_count = 0
   object_count = 0
   bytes_used = 0
   hash = eb7e5d0ea3544d9def940b19114e8b43
   id = 2de8c8a8-cef9-4a94-a421-2f845802fe90
   status = DELETED
   status_changed_at = 1330001026.00514
   metadata =

.. note::

   The status is ``DELETED``, so this account was deleted. This explains why
   the ``GET`` operations are returning 404, Not Found.

Check the account deletion date/time:

.. code::

   $ python
   >>> import time
   >>> time.ctime(1330001026.00514)
   'Thu Feb 23 12:43:46 2012'

Next, try to find the ``DELETE`` operation for this account in the proxy
server logs:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh
   -w <redacted>.68.[4-11,132-139|4-11,132-139],<redacted>.132.[4-11,132-139|4-11,132-139]
   'sudo bzgrep AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log* | grep -w
   DELETE | awk "{print \$3,\$10,\$12}"' | dshbak -c
   .
   .
   Feb 23 12:43:46 sw-aw2az2-proxy001 proxy-server <redacted>.233.76 <redacted>.66.7
   23/Feb/2012/12/43/46 DELETE /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af/ HTTP/1.0
   204 - Apache-HttpClient/4.1.2%20%28java%201.5%29 <REDACTED>_4f458ee4e4b02a869c3aad02 - - -
   tx4471188b0b87406899973d297c55ab53 - 0.0086

From this you can see the operation that resulted in the account being
deleted.

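To confirm that this logged ``DELETE`` matches the ``delete_timestamp``
recorded in the account DB, a short script can compare the two values. This
is a minimal sketch, not part of the original runbook; it assumes the proxy
logs are in UTC (as the ``+0000`` offsets in the object-server lines suggest)
and reuses the values shown above:

.. code::

   from datetime import datetime, timezone

   # delete_timestamp taken from the account_stat row shown earlier
   delete_ts = 1330001026.00514

   # Date field as it appears in the proxy-server DELETE log line
   log_date = "23/Feb/2012/12/43/46"
   logged = datetime.strptime(log_date, "%d/%b/%Y/%H/%M/%S").replace(tzinfo=timezone.utc)
   recorded = datetime.fromtimestamp(delete_ts, tz=timezone.utc)

   print("logged:  ", logged.isoformat())
   print("recorded:", recorded.isoformat())
   print("difference (seconds): %.3f" % abs((logged - recorded).total_seconds()))

The difference should be well under a second, which confirms that this log
line is the operation that deleted the account.
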
Procedure: Deleting objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Simple case - deleting small number of objects and containers
--------------------------------------------------------------

.. note::

   ``swift-direct`` is specific to the Hewlett Packard Enterprise Helion
   Public Cloud. Use ``swiftly`` as an alternative.

.. note::

   Object and container names are in UTF-8. ``swift-direct`` accepts UTF-8
   directly, not URL-encoded UTF-8 (the REST API expects UTF-8 that is then
   URL-encoded). In practice, cutting and pasting foreign-language strings
   into a terminal window will produce the right result.

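To illustrate the note above, here is a minimal Python sketch (not part of
the original runbook; the object name is made up) showing the difference
between the raw UTF-8 form that ``swift-direct`` accepts and the URL-encoded
form that the REST API expects on the wire:

.. code::

   import urllib.parse

   # Hypothetical object name containing non-ASCII characters
   name = "résumé/février.txt"

   # Raw UTF-8: what you would paste into a swift-direct command
   print(name)

   # URL-encoded UTF-8: what appears in REST API request paths
   print(urllib.parse.quote(name))
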
Hint: Use the ``head`` command before any destructive commands.

To delete a small number of objects, log into any proxy node and proceed
as follows:

Examine the object in question:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct head 132345678912345 container_name obj_name

Check whether ``X-Object-Manifest`` or ``X-Static-Large-Object`` is set. If
either is set, this is the manifest object and the segment objects may be in
another container.

If the ``X-Object-Manifest`` attribute is set, the object is a DLO (Dynamic
Large Object) and you need to find the names of its segment objects. For
example, if ``X-Object-Manifest`` is ``container2/seg-blah``, list the
contents of the container ``container2`` as follows:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show 132345678912345 container2

Pick out the objects whose names start with ``seg-blah``.
Delete the segment objects as follows:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah01
   $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah02
   etc

If ``X-Static-Large-Object`` is set, the object is an SLO (Static Large
Object) and you need to read the contents of the manifest. Do this as
follows (a sketch for parsing the manifest appears after this list):

- Use ``swift-get-nodes`` to get the details of the object's location.
- Change the ``-X HEAD`` to ``-X GET`` and run ``curl`` against one copy.
- This returns a JSON body listing the containers and object names of the
  segments.
- Delete the segment objects as described above for DLO segments.

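The following is a minimal sketch (not from the original runbook) of pulling
the segment names out of the SLO manifest. It assumes the JSON body returned
by the ``curl`` ``GET`` has been saved to a file named ``manifest.json`` and
that each entry carries the segment path in a ``name`` field of the form
``/<container>/<object>``:

.. code::

   import json

   # Load the manifest body previously fetched with curl
   with open("manifest.json") as f:
       segments = json.load(f)

   # Print each segment's container and object name so they can be fed to
   # swift-direct delete commands, as shown above for DLO segments.
   for seg in segments:
       container, _, obj = seg["name"].lstrip("/").partition("/")
       print(container, obj)
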
Once the segments are deleted, you can delete the object using
``swift-direct`` as described above.

Finally, use ``swift-direct`` to delete the container.

Procedure: Decommissioning swift nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Should Swift nodes need to be decommissioned (for example, where they are
being re-purposed), it is very important to follow these steps:

#. In the case of object servers, follow the procedure for removing
   the node from the rings (a quick check that the node is no longer in
   any ring is sketched after this list).
#. In the case of swift proxy servers, have the network team remove
   the node from the load balancers.
#. Open a network ticket to have the node removed from network
   firewalls.
#. Make sure that you remove the ``/etc/swift`` directory and everything in it.

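As a final sanity check before powering the node off, you can confirm that it
no longer appears in any ring. This is a minimal sketch (not part of the
original runbook) using Swift's ring API; it assumes the Swift Python package
and the production rings are available where you run it, and the IP address
is a placeholder:

.. code::

   from swift.common.ring import Ring

   # Placeholder address of the node being decommissioned
   node_ip = "192.0.2.10"

   for ring_file in ("account.ring.gz", "container.ring.gz", "object.ring.gz"):
       ring = Ring("/etc/swift/" + ring_file)
       # ring.devs can contain None entries for devices that were removed
       devs = [d for d in ring.devs if d and d.get("ip") == node_ip]
       if devs:
           print("%s still references %s: %s"
                 % (ring_file, node_ip, [d["device"] for d in devs]))
       else:
           print("%s: no devices on %s" % (ring_file, node_ip))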