 3c61ab4678
			
		
	
	3c61ab4678
	
	
	
		
			
			This is the operational procedures guide that HPE used to operate and monitor their public Swift systems. It has been made publicly available. Change-Id: Iefb484893056d28beb69265d99ba30c3c84add2b
		
			
				
	
	
	
		
			15 KiB
		
	
	
	
	
	
	
	
			
		
		
	
	Software configuration procedures
Fix broken GPT table (broken disk partition)
- If a GPT table is broken, a message like the following should be observed when the command... - $ sudo parted -l
- ... is run. - ... Error: The backup GPT table is corrupt, but the primary appears OK, so that will be used. OK/Cancel?
- To fix this, firstly install the - gdiskprogram to fix this:- $ sudo aptitude install gdisk
- Run - gdiskfor the particular drive with the damaged partition:- $ sudo gdisk /dev/sd*a-l* GPT fdisk (gdisk) version 0.6.14 - Caution: invalid backup GPT header, but valid main header; regenerating backup header from main header. - Warning! One or more CRCs don't match. You should repair the disk! - Partition table scan:
- 
MBR: protective BSD: not present APM: not present GPT: damaged 
 - /dev/sd- Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk verification and recovery are STRONGLY recommended. ***************************************************************************** 
- On the command prompt, type - r(recovery and transformation options), followed by- d(use main GPT header) ,- v(verify disk) and finally- w(write table to disk and exit). Will also need to enter- Ywhen prompted in order to confirm actions.- Command (? for help): r Recovery/transformation command (? for help): d Recovery/transformation command (? for help): v Caution: The CRC for the backup partition table is invalid. This table may be corrupt. This program will automatically create a new backup partition table when you save your partitions. Caution: Partition 1 doesn't begin on a 8-sector boundary. This may result in degraded performance on some modern (2009 and later) hard disks. Caution: Partition 2 doesn't begin on a 8-sector boundary. This may result in degraded performance on some modern (2009 and later) hard disks. Caution: Partition 3 doesn't begin on a 8-sector boundary. This may result in degraded performance on some modern (2009 and later) hard disks. Identified 1 problems! Recovery/transformation command (? for help): w Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING PARTITIONS!! Do you want to proceed, possibly destroying your data? (Y/N): Y OK; writing new GUID partition table (GPT). The operation has completed successfully.
- Running the command: - $ sudo parted /dev/sd#
- Should now show that the partition is recovered and healthy again. 
- Finally, uninstall - gdiskfrom the node:- $ sudo aptitude remove gdisk
Procedure: Fix broken XFS filesystem
- A filesystem may be corrupt or broken if the following output is observed when checking its label: - $ sudo xfs_admin -l /dev/sd# cache_node_purge: refcount was 1, not zero (node=0x25d5ee0) xfs_admin: cannot read root inode (117) cache_node_purge: refcount was 1, not zero (node=0x25d92b0) xfs_admin: cannot read realtime bitmap inode (117) bad sb magic # 0 in AG 1 failed to read label in AG 1
- Run the following commands to remove the broken/corrupt filesystem and replace. (This example uses the filesystem - /dev/sdb2) Firstly need to replace the partition:- $ sudo parted GNU Parted 2.3 Using /dev/sda Welcome to GNU Parted! Type 'help' to view a list of commands. (parted) select /dev/sdb Using /dev/sdb (parted) p Model: HP LOGICAL VOLUME (scsi) Disk /dev/sdb: 2000GB Sector size (logical/physical): 512B/512B Partition Table: gpt Number Start End Size File system Name Flags 1 17.4kB 1024MB 1024MB ext3 boot 2 1024MB 1751GB 1750GB xfs sw-aw2az1-object045-disk1 3 1751GB 2000GB 249GB lvm (parted) rm 2 (parted) mkpart primary 2 -1 Warning: You requested a partition from 2000kB to 2000GB. The closest location we can manage is 1024MB to 1751GB. Is this still acceptable to you? Yes/No? Yes Warning: The resulting partition is not properly aligned for best performance. Ignore/Cancel? Ignore (parted) p Model: HP LOGICAL VOLUME (scsi) Disk /dev/sdb: 2000GB Sector size (logical/physical): 512B/512B Partition Table: gpt Number Start End Size File system Name Flags 1 17.4kB 1024MB 1024MB ext3 boot 2 1024MB 1751GB 1750GB xfs primary 3 1751GB 2000GB 249GB lvm (parted) quit
- Next step is to scrub the filesystem and format: - $ sudo dd if=/dev/zero of=/dev/sdb2 bs=$((1024\*1024)) count=1 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.00480617 s, 218 MB/s $ sudo /sbin/mkfs.xfs -f -i size=1024 /dev/sdb2 meta-data=/dev/sdb2 isize=1024 agcount=4, agsize=106811524 blks = sectsz=512 attr=2, projid32bit=0 data = bsize=4096 blocks=427246093, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal log bsize=4096 blocks=208616, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0
- You should now label and mount your filesystem. 
- Can now check to see if the filesystem is mounted using the command: - $ mount
Procedure: Checking if an account is okay
Note
swift-direct is only available in the HPE Helion Public
Cloud. Use swiftly as an alternate.
If you have a tenant ID you can check the account is okay as follows from a proxy.
$ sudo -u swift  /opt/hp/swift/bin/swift-direct show <Api-Auth-Hash-or-TenantId>The response will either be similar to a swift list of the account containers, or an error indicating that the resource could not be found.
In the latter case you can establish if a backend database exists for the tenantId by running the following on a proxy:
$ sudo -u swift  swift-get-nodes /etc/swift/account.ring.gz  <Api-Auth-Hash-or-TenantId>The response will list ssh commands that will list the replicated account databases, if they exist.
Procedure: Revive a deleted account
Swift accounts are normally not recreated. If a tenant unsubscribes from Swift, the account is deleted. To re-subscribe to Swift, you can create a new tenant (new tenant ID), and subscribe to Swift. This creates a new Swift account with the new tenant ID.
However, until the unsubscribe/new tenant process is supported, you may hit a situation where a Swift account is deleted and the user is locked out of Swift.
Deleting the account database files
Here is one possible solution. The containers and objects may be lost forever. The solution is to delete the account database files and re-create the account. This may only be done once the containers and objects are completely deleted. This process is untested, but could work as follows:
- Use swift-get-nodes to locate the account's database file (on three servers).
- Rename the database files (on three servers).
- Use swiftlyto create the account (use original name).
Renaming account database so it can be revived
Get the locations of the database files that hold the account data.
sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-1856-44ae-97db-31242f7ad7a1 Account AUTH_redacted-1856-44ae-97db-31242f7ad7a1 Container None Object None Partition 18914 Hash 93c41ef56dd69173a9524193ab813e78 Server:Port Device 15.184.9.126:6002 disk7 Server:Port Device 15.184.9.94:6002 disk11 Server:Port Device 15.184.9.103:6002 disk10 Server:Port Device 15.184.9.80:6002 disk2 [Handoff] Server:Port Device 15.184.9.120:6002 disk2 [Handoff] Server:Port Device 15.184.9.98:6002 disk2 [Handoff] curl -I -XHEAD "`*http://15.184.9.126:6002/disk7/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"* <http://15.184.9.126:6002/disk7/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ curl -I -XHEAD "`*http://15.184.9.94:6002/disk11/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"* <http://15.184.9.94:6002/disk11/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ curl -I -XHEAD "`*http://15.184.9.103:6002/disk10/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"* <http://15.184.9.103:6002/disk10/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ curl -I -XHEAD "`*http://15.184.9.80:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"* <http://15.184.9.80:6002/disk2/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ # [Handoff] curl -I -XHEAD "`*http://15.184.9.120:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"* <http://15.184.9.120:6002/disk2/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ # [Handoff] curl -I -XHEAD "`*http://15.184.9.98:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"* <http://15.184.9.98:6002/disk2/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ # [Handoff] ssh 15.184.9.126 "ls -lah /srv/node/disk7/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" ssh 15.184.9.94 "ls -lah /srv/node/disk11/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" ssh 15.184.9.103 "ls -lah /srv/node/disk10/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff] ssh 15.184.9.120 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff] ssh 15.184.9.98 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff] $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH\_redacted-1856-44ae-97db-31242f7ad7a1Account AUTH_redacted-1856-44ae-97db- 31242f7ad7a1Container NoneObject NonePartition 18914Hash 93c41ef56dd69173a9524193ab813e78Server:Port Device 15.184.9.126:6002 disk7Server:Port Device 15.184.9.94:6002 disk11Server:Port Device 15.184.9.103:6002 disk10Server:Port Device 15.184.9.80:6002 disk2 [Handoff]Server:Port Device 15.184.9.120:6002 disk2 [Handoff]Server:Port Device 15.184.9.98:6002 disk2 [Handoff]curl -I -XHEAD "`*http://15.184.9.126:6002/disk7/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"*<http://15.184.9.126:6002/disk7/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ curl -I -XHEAD "`*http://15.184.9.94:6002/disk11/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"* <http://15.184.9.94:6002/disk11/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ curl -I -XHEAD "`*http://15.184.9.103:6002/disk10/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"* <http://15.184.9.103:6002/disk10/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ curl -I -XHEAD "`*http://15.184.9.80:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"* <http://15.184.9.80:6002/disk2/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ # [Handoff]curl -I -XHEAD "`*http://15.184.9.120:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"* <http://15.184.9.120:6002/disk2/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ # [Handoff]curl -I -XHEAD "`*http://15.184.9.98:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"* <http://15.184.9.98:6002/disk2/18914/AUTH_cc9ebdb8-1856-44ae-97db-31242f7ad7a1>`_ # [Handoff]ssh 15.184.9.126 "ls -lah /srv/node/disk7/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"ssh 15.184.9.94 "ls -lah /srv/node/disk11/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"ssh 15.184.9.103 "ls -lah /srv/node/disk10/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]ssh 15.184.9.120 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]ssh 15.184.9.98 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]
Check that the handoff nodes do not have account databases:
$ ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
ls: cannot access /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/: No such file or directoryIf the handoff node has a database, wait for rebalancing to occur.
Procedure: Temporarily stop load balancers from directing traffic to a proxy server
You can stop the load balancers sending requests to a proxy server as follows. This can be useful when a proxy is misbehaving but you need Swift running to help diagnose the problem. By removing from the load balancers, customer's are not impacted by the misbehaving proxy.
- Ensure that in proxyserver.com the - disable_pathvariable is set to- /etc/swift/disabled-by-file.
- Log onto the proxy node. 
- Shut down Swift as follows: - sudo swift-init proxy shutdown .. note:: Shutdown, not stop.
- Create the - /etc/swift/disabled-by-filefile. For example:- sudo touch /etc/swift/disabled-by-file
- Optional, restart Swift: - sudo swift-init proxy start
It works because the healthcheck middleware looks for this file. If it find it, it will return 503 error instead of 200/OK. This means the load balancer should stop sending traffic to the proxy.
/healthcheck will report
FAIL: disabled by file if the disabled-by-file
file exists.
Procedure: Ad-Hoc disk performance test
You can get an idea whether a disk drive is performing as follows:
sudo dd bs=1M count=256 if=/dev/zero conv=fdatasync of=/srv/node/disk11/remember-to-delete-this-laterYou can expect ~600MB/sec. If you get a low number, repeat many times as Swift itself may also read or write to the disk, hence giving a lower number.