
This is the operational procedures guide that HPE used to operate and monitor their public Swift systems. It has been made publicly available. Change-Id: Iefb484893056d28beb69265d99ba30c3c84add2b
11 KiB
Troubleshooting tips
Diagnose: Customer complains they receive a HTTP status 500 when trying to browse containers
This entry is prompted by a real customer issue and exclusively
focused on how that problem was identified. There are many reasons why a
http status of 500 could be returned. If there are no obvious problems
with the swift object store, then it may be necessary to take a closer
look at the users transactions. After finding the users swift account,
you can search the swift proxy logs on each swift proxy server for
transactions from this user. The linux bzgrep
command can
be used to search all the proxy log files on a node including the
.bz2
compressed files. For example:
$ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh
-w <redacted>.68.[4-11,132-139 4-11,132-139],<redacted>.132.[4-11,132-139
4-11,132-139] 'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log\*'
dshbak -c
.
.
\---------------\-
<redacted>.132.6
\---------------\-
Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132
<redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af
/%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - -
tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130
This shows a GET
operation on the users account.
Note
The HTTP status returned is 404, not found, rather than 500 as reported by the user.
Using the transaction ID,
tx429fc3be354f434ab7f9c6c4206c1dc3
you can search the swift
object servers log files for this transaction ID:
$ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername>
-R ssh
-w <redacted>.72.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.204.[4-131| 4-131]
'sudo bzgrep tx429fc3be354f434ab7f9c6c4206c1dc3 /var/log/swift/server.log*'
| dshbak -c
.
.
\---------------\-
<redacted>.72.16
\---------------\-
Feb 29 08:51:57 sw-aw2az1-object013 account-server <redacted>.132.6 - -
[29/Feb/2012:08:51:57 +0000|] "GET /disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-"
0.0016 ""
\---------------\-
<redacted>.31
\---------------\-
Feb 29 08:51:57 node-az2-object060 account-server <redacted>.132.6 - -
[29/Feb/2012:08:51:57 +0000|] "GET /disk6/198875/AUTH_redacted-4962-
4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0011 ""
\---------------\-
<redacted>.204.70
\---------------\-
Feb 29 08:51:57 sw-aw2az3-object0067 account-server <redacted>.132.6 - -
[29/Feb/2012:08:51:57 +0000|] "GET /disk6/198875/AUTH_redacted-4962-
4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0014 ""
Note
The 3 GET operations to 3 different object servers that hold the 3
replicas of this users account. Each GET
returns a HTTP
status of 404, not found.
Next, use the swift-get-nodes
command to determine
exactly where the users account data is stored:
$ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-4962-4692-98fb-52ddda82a5af
Account AUTH_redacted-4962-4692-98fb-52ddda82a5af
Container None
Object None
Partition 198875
Hash 1846d99185f8a0edaf65cfbf37439696
Server:Port Device <redacted>.31:6002 disk6
Server:Port Device <redacted>.204.70:6002 disk6
Server:Port Device <redacted>.72.16:6002 disk9
Server:Port Device <redacted>.204.64:6002 disk11 [Handoff]
Server:Port Device <redacted>.26:6002 disk11 [Handoff]
Server:Port Device <redacted>.72.27:6002 disk11 [Handoff]
curl -I -XHEAD "`http://<redacted>.31:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
<http://15.185.138.31:6002/disk6/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_
curl -I -XHEAD "`http://<redacted>.204.70:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
<http://15.185.204.70:6002/disk6/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_
curl -I -XHEAD "`http://<redacted>.72.16:6002/disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
<http://15.185.72.16:6002/disk9/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_
curl -I -XHEAD "`http://<redacted>.204.64:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
<http://15.185.204.64:6002/disk11/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_ # [Handoff]
curl -I -XHEAD "`http://<redacted>.26:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
<http://15.185.136.26:6002/disk11/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_ # [Handoff]
curl -I -XHEAD "`http://<redacted>.72.27:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
<http://15.185.72.27:6002/disk11/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_ # [Handoff]
ssh <redacted>.31 "ls \-lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
ssh <redacted>.204.70 "ls \-lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
ssh <redacted>.72.16 "ls \-lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
ssh <redacted>.204.64 "ls \-lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
ssh <redacted>.26 "ls \-lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
ssh <redacted>.72.27 "ls \-lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
Check each of the primary servers, <redacted>.31, <redacted>.204.70 and <redacted>.72.16, for this users account. For example on <redacted>.72.16:
$ ls \\-lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/
total 1.0M
drwxrwxrwx 2 swift swift 98 2012-02-23 14:49 .
drwxrwxrwx 3 swift swift 45 2012-02-03 23:28 ..
-rw-\\-----\\- 1 swift swift 15K 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db
-rw-rw-rw- 1 swift swift 0 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db.pending
So this users account db, an sqlite db is present. Use sqlite to checkout the account:
$ sudo cp /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/1846d99185f8a0edaf65cfbf37439696.db /tmp
$ sudo sqlite3 /tmp/1846d99185f8a0edaf65cfbf37439696.db
sqlite> .mode line
sqlite> select * from account_stat;
account = AUTH_redacted-4962-4692-98fb-52ddda82a5af
created_at = 1328311738.42190
put_timestamp = 1330000873.61411
delete_timestamp = 1330001026.00514
container_count = 0
object_count = 0
bytes_used = 0
hash = eb7e5d0ea3544d9def940b19114e8b43
id = 2de8c8a8-cef9-4a94-a421-2f845802fe90
status = DELETED
status_changed_at = 1330001026.00514
metadata =
Note
The status is DELETED
. So this account was deleted. This
explains why the GET operations are returning 404, not found. Check the
account delete date/time:
$ python
>>> import time
>>> time.ctime(1330001026.00514)
'Thu Feb 23 12:43:46 2012'
Next try and find the DELETE
operation for this account
in the proxy server logs:
$ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh -w <redacted>.68.[4-11,132-139 4-11,132-
139],<redacted>.132.[4-11,132-139|4-11,132-139] 'sudo bzgrep AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log\* | grep -w
DELETE |awk "{print \\$3,\\$10,\\$12}"' |- dshbak -c
.
.
Feb 23 12:43:46 sw-aw2az2-proxy001 proxy-server 15.203.233.76 <redacted>.66.7 23/Feb/2012/12/43/46 DELETE /v1.0/AUTH_redacted-4962-4692-98fb-
52ddda82a5af/ HTTP/1.0 204 - Apache-HttpClient/4.1.2%20%28java%201.5%29 <REDACTED>_4f458ee4e4b02a869c3aad02 - - -
tx4471188b0b87406899973d297c55ab53 - 0.0086
From this you can see the operation that resulted in the account being deleted.
Procedure: Deleting objects
Simple case - deleting small number of objects and containers
Note
swift-direct
is specific to the Hewlett Packard
Enterprise Helion Public Cloud. Use swiftly
as an
alternative.
Note
Object and container names are in UTF8. Swift direct accepts UTF8 directly, not URL-encoded UTF8 (the REST API expects UTF8 and then URL-encoded). In practice cut and paste of foreign language strings to a terminal window will produce the right result.
Hint: Use the head
command before any destructive
commands.
To delete a small number of objects, log into any proxy node and proceed as follows:
Examine the object in question:
$ sudo -u swift /opt/hp/swift/bin/swift-direct head 132345678912345 container_name obj_name
See if X-Object-Manifest
or
X-Static-Large-Object
is set, then this is the manifest
object and segment objects may be in another container.
If the X-Object-Manifest
attribute is set, you need to
find the name of the objects this means it is a DLO. For example, if
X-Object-Manifest
is container2/seg-blah
, list
the contents of the container container2 as follows:
$ sudo -u swift /opt/hp/swift/bin/swift-direct show 132345678912345 container2
Pick out the objects whose names start with seg-blah
.
Delete the segment objects as follows:
$ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah01
$ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah02
etc
If X-Static-Large-Object
is set, you need to read the
contents. Do this by:
- Using swift-get-nodes to get the details of the object's location.
- Change the
-X HEAD
to-X GET
and runcurl
against one copy. - This lists a json body listing containers and object names
- Delete the objects as described above for DLO segments
Once the segments are deleted, you can delete the object using
swift-direct
as described above.
Finally, use swift-direct
to delete the container.
Procedure: Decommissioning swift nodes
Should Swift nodes need to be decommissioned. For example, where they are being re-purposed, it is very important to follow the following steps.
- In the case of object servers, follow the procedure for removing the node from the rings.
- In the case of swift proxy servers, have the network team remove the node from the load balancers.
- Open a network ticket to have the node removed from network firewalls.
- Make sure that you remove the
/etc/swift
directory and everything in it.