From 3c61ab4678a7aa9ff256ace4bc97ab449607fd49 Mon Sep 17 00:00:00 2001 From: asettle Date: Wed, 10 Feb 2016 17:58:05 +1000 Subject: [PATCH] Operational procedures guide This is the operational procedures guide that HPE used to operate and monitor their public Swift systems. It has been made publicly available. Change-Id: Iefb484893056d28beb69265d99ba30c3c84add2b --- doc/source/index.rst | 1 + doc/source/ops_runbook/diagnose.rst | 1031 +++++++++++++++++ doc/source/ops_runbook/general.rst | 36 + doc/source/ops_runbook/index.rst | 79 ++ doc/source/ops_runbook/maintenance.rst | 322 +++++ doc/source/ops_runbook/procedures.rst | 367 ++++++ .../ops_runbook/sec-furtherdiagnose.rst | 177 +++ doc/source/ops_runbook/troubleshooting.rst | 264 +++++ 8 files changed, 2277 insertions(+) create mode 100644 doc/source/ops_runbook/diagnose.rst create mode 100644 doc/source/ops_runbook/general.rst create mode 100644 doc/source/ops_runbook/index.rst create mode 100644 doc/source/ops_runbook/maintenance.rst create mode 100644 doc/source/ops_runbook/procedures.rst create mode 100644 doc/source/ops_runbook/sec-furtherdiagnose.rst create mode 100644 doc/source/ops_runbook/troubleshooting.rst diff --git a/doc/source/index.rst b/doc/source/index.rst index 839de9c694..8f045cfb18 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -86,6 +86,7 @@ Administrator Documentation admin_guide replication_network logs + ops_runbook/index Object Storage v1 REST API Documentation ======================================== diff --git a/doc/source/ops_runbook/diagnose.rst b/doc/source/ops_runbook/diagnose.rst new file mode 100644 index 0000000000..d34b38c52b --- /dev/null +++ b/doc/source/ops_runbook/diagnose.rst @@ -0,0 +1,1031 @@ +================================== +Identifying issues and resolutions +================================== + +Diagnose: General approach +-------------------------- + +- Look at service status in your monitoring system. 
+ +- In addition to system monitoring tools and issue logging by users, + swift errors will often result in log entries in the ``/var/log/swift`` + files: ``proxy.log``, ``server.log`` and ``background.log`` (see: ``Swift + logs``). + +- Look at any logs your deployment tool produces. + +- Log files should be reviewed for error signatures (see below) that + may point to a known issue, or root cause issues reported by the + diagnostics tools, prior to escalation. + +Dependencies +^^^^^^^^^^^^ + +The Swift software is dependent on overall system health. Operating +system level issues with network connectivity, domain name resolution, +user management, hardware and system configuration and capacity in terms +of memory and free disk space, may result in secondary Swift issues. +System level issues should be resolved prior to diagnosis of swift +issues. + + +Diagnose: Swift-dispersion-report +--------------------------------- + +The swift-dispersion-report is a useful tool to gauge the general +health of the system. Configure the ``swift-dispersion`` report for +100% coverage. The dispersion report regularly monitors the dispersion +objects and containers and reports how many of them are still +available and how many copies of each exist. + +The dispersion-report output is logged on the first proxy of the first +AZ of each system (proxy with the monitoring role) under +``/var/log/swift/swift-dispersion-report.log``. + +Diagnose: Is swift running? +--------------------------- + +When you want to establish if a swift endpoint is running, run ``curl -k`` +against either: https://*[REPLACEABLE]*./healthcheck OR +https://*[REPLACEABLE]*.crossdomain.xml + + +Diagnose: Interpreting messages in ``/var/log/swift/`` files +------------------------------------------------------------ + +.. 
note:: + + In the Hewlett Packard Enterprise Helion Public Cloud we send logs to + ``proxy.log`` (proxy-server logs), ``server.log`` (object-server, + account-server, container-server logs), ``background.log`` (all + other servers [object-replicator, etc]). + +The following table lists known issues: + +.. list-table:: + :widths: 25 25 25 25 + :header-rows: 1 + + * - **Logfile** + - **Signature** + - **Issue** + - **Steps to take** + * - /var/log/syslog + - kernel: [] hpsa .... .... .... has check condition: unknown type: + Sense: 0x5, ASC: 0x20, ASC Q: 0x0 .... + - An unsupported command was issued to the storage hardware + - Understood to be a benign monitoring issue, ignore + * - /var/log/syslog + - kernel: [] sd .... [csbu:sd...] Sense Key: Medium Error + - Suggests disk surface issues + - Run swift diagnostics on the target node to check for disk errors, + repair disk errors + * - /var/log/syslog + - kernel: [] sd .... [csbu:sd...] Sense Key: Hardware Error + - Suggests storage hardware issues + - Run swift diagnostics on the target node to check for disk failures, + replace failed disks + * - /var/log/syslog + - kernel: [] .... I/O error, dev sd.... ,sector .... + - + - Run swift diagnostics on the target node to check for disk errors + * - /var/log/syslog + - pound: NULL get_thr_arg + - Multiple threads woke up + - Noise, safe to ignore + * - /var/log/swift/proxy.log + - .... ERROR .... ConnectionTimeout .... + - A storage node is not responding in a timely fashion + - Run swift diagnostics on the target node to check for node down, + node unconfigured, storage off-line or network issues between the + proxy and non responding node + * - /var/log/swift/proxy.log + - proxy-server .... HTTP/1.0 500 .... + - A proxy server has reported an internal server error + - Run swift diagnostics on the target node to check for issues + * - /var/log/swift/server.log + - .... ERROR .... ConnectionTimeout .... 
+ - A storage server is not responding in a timely fashion + - Run swift diagnostics on the target node to check for a node or + service, down, unconfigured, storage off-line or network issues + between the two nodes + * - /var/log/swift/server.log + - .... ERROR .... Remote I/O error: '/srv/node/disk.... + - A storage device is not responding as expected + - Run swift diagnostics and check the filesystem named in the error + for corruption (unmount & xfs_repair) + * - /var/log/swift/background.log + - object-server ERROR container update failed .... Connection refused + - Peer node is not responding + - Check status of the network and peer node + * - /var/log/swift/background.log + - object-updater ERROR with remote .... ConnectionTimeout + - + - Check status of the network and peer node + * - /var/log/swift/background.log + - account-reaper STDOUT: .... error: ECONNREFUSED + - Network connectivity issue + - Resolve network issue and re-run diagnostics + * - /var/log/swift/background.log + - .... ERROR .... ConnectionTimeout + - A storage server is not responding in a timely fashion + - Run swift diagnostics on the target node to check for a node + or service, down, unconfigured, storage off-line or network issues + between the two nodes + * - /var/log/swift/background.log + - .... ERROR syncing .... Timeout + - A storage server is not responding in a timely fashion + - Run swift diagnostics on the target node to check for a node + or service, down, unconfigured, storage off-line or network issues + between the two nodes + * - /var/log/swift/background.log + - .... ERROR Remote drive not mounted .... + - A storage server disk is unavailable + - Run swift diagnostics on the target node to check for a node or + service, failed or unmounted disk on the target, or a network issue + * - /var/log/swift/background.log + - object-replicator .... 
responded as unmounted + - A storage server disk is unavailable + - Run swift diagnostics on the target node to check for a node or + service, failed or unmounted disk on the target, or a network issue + * - /var/log/swift/\*.log + - STDOUT: EXCEPTION IN + - An unexpected error occurred + - Read the Traceback details; if they match known issues + (e.g. active network/disk issues), check for re-occurrences + after the primary issues have been resolved + * - /var/log/rsyncd.log + - rsync: mkdir "/disk....failed: No such file or directory.... + - A local storage server disk is unavailable + - Run swift diagnostics on the node to check for a failed or + unmounted disk + * - /var/log/swift* + - Exception: Could not bind to 0.0.0.0:600xxx + - Possible Swift process restart issue. This indicates an old swift + process is still running. + - Run swift diagnostics; if some swift services are reported down, + check if they left residual processes behind. + * - /var/log/rsyncd.log + - rsync: recv_generator: failed to stat "/disk....." (in object) + failed: Not a directory (20) + - Swift directory structure issues + - Run swift diagnostics on the node to check for issues + +Diagnose: Parted reports the backup GPT table is corrupt +-------------------------------------------------------- + +- If a GPT table is broken, a message like the following should be + observed when the following command is run: + + .. code:: + + $ sudo parted -l + + .. code:: + + Error: The backup GPT table is corrupt, but the primary appears OK, + so that will be used. + + OK/Cancel? + +To fix, go to: Fix broken GPT table (broken disk partition) + + +Diagnose: Drives diagnostic reports a FS label is not acceptable +---------------------------------------------------------------- + +If diagnostics reports something like "FS label: obj001dsk011 is not +acceptable", it indicates that a partition has a valid disk label, but an +invalid filesystem label. In such cases proceed as follows: + +#. 
Verify that the disk labels are correct: + + .. code:: + + FS=/dev/sd#1 + + sudo parted -l | grep object + +#. If partition labels are inconsistent, then resolve the disk label issues + before proceeding: + + .. code:: + + sudo parted -s ${FS} name ${PART_NO} ${PART_NAME} #Partition Label + #PART_NO is 1 for object disks and 3 for OS disks + #PART_NAME follows the convention seen in "sudo parted -l | grep object" + +#. If the filesystem label is missing then create it with care: + + .. code:: + + sudo xfs_admin -l ${FS} #Filesystem label (12 Char limit) + + #Check for the existence of a FS label + + OBJNO=<3 Length Object No.> + + #e.g. OBJNO for sw-stbaz3-object0007 would be 007 + + DISKNO=<3 Length Disk No.> + + #e.g. DISKNO for /dev/sdb would be 001, /dev/sdc would be 002 etc. + + sudo xfs_admin -L "obj${OBJNO}dsk${DISKNO}" ${FS} + + #Create a FS Label + +Diagnose: Failed LUNs +--------------------- + +.. note:: + + The HPE Helion Public Cloud uses direct attach Smart Array + controllers/drives. The information here is specific to that + environment. + +The ``swift_diagnostics`` mount checks may return a warning that a LUN has +failed, typically accompanied by DriveAudit check failures and device +errors. + +Such cases are typically caused by a drive failure, and if the drive check +also reports a failed status for the underlying drive, then follow +the procedure to replace the disk. + +Otherwise the LUN can be re-enabled as follows: + +#. Generate an hpssacli diagnostic report. This report allows the swift + team to troubleshoot potential cabling or hardware issues so it is + imperative that you run it immediately when troubleshooting a failed + LUN. You will come back later and grep this file for more details, but + just generate it for now. + + .. code:: + + sudo hpssacli controller all diag file=/tmp/hpacu.diag ris=on \ + xml=off zip=off + +Export the following variables using the instructions below before +proceeding further. + +#. 
Print a list of logical drives and their numbers and take note of the + failed drive's number and array value (example output: "array A + logicaldrive 1..." would be exported as LDRIVE=1): + + .. code:: + + sudo hpssacli controller slot=1 ld all show + +#. Export the number of the logical drive that was retrieved from the + previous command into the LDRIVE variable: + + .. code:: + + export LDRIVE= + +#. Print the array value and Port:Box:Bay for all drives and take note of + the Port:Box:Bay for the failed drive (example output: "array A + physicaldrive 2C:1:1..." would be exported as PBOX=2C:1:1). Match the + array value of this output with the array value obtained from the + previous command to be sure you are working on the same drive. Also, + the array value usually matches the device name (for example, /dev/sdc + in the case of "array c"), but we will run a different command to be sure + we are operating on the correct device. + + .. code:: + + sudo hpssacli controller slot=1 pd all show + +.. note:: + + Sometimes a LUN may appear to have failed because it is not mounted and + cannot be mounted, while the hpssacli/parted commands show no problems + with the LUNs/drives. In this case, the filesystem may be corrupt and it + may be necessary to run ``sudo xfs_check /dev/sd[a-l][1-2]`` to see if + there is an xfs issue. The results of running this command may require + that ``xfs_repair`` is run. + +#. Export the Port:Box:Bay for the failed drive into the PBOX variable: + + .. code:: + + export PBOX= + +#. Print the physical device information and take note of the Disk Name + (example output: "Disk Name: /dev/sdk" would be exported as + DEV=/dev/sdk): + + .. code:: + + sudo hpssacli controller slot=1 ld ${LDRIVE} show detail | \ + grep -i "Disk Name" + +#. Export the device name variable from the preceding command (example: + /dev/sdk): + + .. code:: + + export DEV= + +#. Export the filesystem variable. 
Disks that are split between the + operating system and data storage, typically sda and sdb, should only + have repairs done on their data filesystem, usually /dev/sda2 and + /dev/sdb2. Other data-only disks have just one partition on the device, + so the filesystem will be partition 1. In any case you should verify the + data filesystem by running ``df -h | grep /srv/node`` and using the listed + data filesystem for the device in question as the export. For example: + /dev/sdk1. + + .. code:: + + export FS= + +#. Verify that the LUN has failed and the device has not: + + .. code:: + + sudo hpssacli controller slot=1 ld all show + sudo hpssacli controller slot=1 pd all show + sudo hpssacli controller slot=1 ld ${LDRIVE} show detail + sudo hpssacli controller slot=1 pd ${PBOX} show detail + +#. Stop the swift and rsync services: + + .. code:: + + sudo service rsync stop + sudo swift-init shutdown all + +#. Unmount the problem drive, fix the LUN and the filesystem: + + .. code:: + + sudo umount ${FS} + +#. If umount fails, use lsof to search for processes using the mountpoint + and kill any lingering processes before repeating the unmount. Then + re-enable the LUN and repair the filesystem: + + .. code:: + + sudo hpacucli controller slot=1 ld ${LDRIVE} modify reenable + sudo xfs_repair ${FS} + +#. If ``xfs_repair`` complains about possible journal data, use the + ``xfs_repair -L`` option to zeroise the journal log. + +#. Once complete, test-mount the filesystem, and tidy up its lost and + found area. + + .. code:: + + sudo mount ${FS} /mnt + sudo rm -rf /mnt/lost+found/ + sudo umount /mnt + +#. Mount the filesystem and restart swift and rsync. + +#. Run the following to determine if a DC ticket is needed to check the + cables on the node: + + .. code:: + + grep -i media.exchanged /tmp/hpacu.diag + grep -i hot.plug.count /tmp/hpacu.diag + +#. If the output reports any non-0x00 values, it suggests that the cables + should be checked. For example, log a DC ticket to check the SAS cables + between the drive and the expander. 
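The data filesystem check used when exporting FS in the procedure above (``df -h | grep /srv/node``) can be narrowed to a single device. The following is a minimal sketch, not part of the original runbook: the helper name ``data_fs_for_dev`` is hypothetical, and it assumes that mount points contain no spaces.

```shell
# Hypothetical helper (not part of the runbook): read `df` output on stdin and
# print every filesystem on the given device that is mounted under /srv/node.
# Assumes mount points contain no spaces, so the last field is the mount point.
data_fs_for_dev() {
    awk -v dev="$1" '
        # Keep rows whose filesystem name starts with the device name
        # (e.g. /dev/sdk matches /dev/sdk1) and whose mount point is
        # below /srv/node.
        index($1, dev) == 1 && $NF ~ /^\/srv\/node\// { print $1 }
    '
}

# Example usage on a storage node, with DEV exported as described above:
#   df -h | data_fs_for_dev "${DEV}"
```

If the helper prints nothing, or more than one line, review the full ``df -h`` output and set FS by hand.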
+ +Diagnose: Slow disk devices +--------------------------- + +.. note:: + + collectl is an open-source performance gathering/analysis tool. + +If the diagnostics report a message such as ``sda: drive is slow``, you +should log onto the node and run the following command: + +.. code:: + + $ /usr/bin/collectl -s D -c 1 + waiting for 1 second sample... + # DISK STATISTICS (/sec) + # <---------reads---------><---------writes---------><--------averages--------> Pct + #Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util + sdb 204 0 33 6 43 0 4 11 6 1 7 6 23 + sda 84 0 13 6 108 21 6 18 10 1 7 7 13 + sdc 100 0 16 6 0 0 0 0 6 1 7 6 9 + sdd 140 0 22 6 22 0 2 11 6 1 9 9 22 + sde 76 0 12 6 255 0 52 5 5 1 2 1 10 + sdf 276 0 44 6 0 0 0 0 6 1 11 8 38 + sdg 112 0 17 7 18 0 2 9 6 1 7 7 13 + sdh 3552 0 73 49 0 0 0 0 48 1 9 8 62 + sdi 72 0 12 6 0 0 0 0 6 1 8 8 10 + sdj 112 0 17 7 22 0 2 11 7 1 10 9 18 + sdk 120 0 19 6 21 0 2 11 6 1 8 8 16 + sdl 144 0 22 7 18 0 2 9 6 1 9 7 18 + dm-0 0 0 0 0 0 0 0 0 0 0 0 0 0 + dm-1 0 0 0 0 60 0 15 4 4 0 0 0 0 + dm-2 0 0 0 0 48 0 12 4 4 0 0 0 0 + dm-3 0 0 0 0 0 0 0 0 0 0 0 0 0 + dm-4 0 0 0 0 0 0 0 0 0 0 0 0 0 + dm-5 0 0 0 0 0 0 0 0 0 0 0 0 0 + ... + (repeats -- type Ctrl/C to stop) + +Look at the ``Wait`` and ``SvcTim`` values. It is not normal for +these values to exceed 50 msec. This is known to impact customer +performance (upload/download). For a controller problem, many/all drives +will show long wait and service times. A reboot may correct the problem; +otherwise hardware replacement is needed. + +Another way to look at the data is as follows: + +.. 
code:: + + $ /opt/hp/syseng/disk-anal.pl -d + Disk: sda Wait: 54580 371 65 25 12 6 6 0 1 2 0 46 + Disk: sdb Wait: 54532 374 96 36 16 7 4 1 0 2 0 46 + Disk: sdc Wait: 54345 554 105 29 15 4 7 1 4 4 0 46 + Disk: sdd Wait: 54175 553 254 31 20 11 6 6 2 2 1 53 + Disk: sde Wait: 54923 66 56 15 8 7 7 0 1 0 2 29 + Disk: sdf Wait: 50952 941 565 403 426 366 442 447 338 99 38 97 + Disk: sdg Wait: 50711 689 808 562 642 675 696 185 43 14 7 82 + Disk: sdh Wait: 51018 668 688 483 575 542 692 275 55 22 9 87 + Disk: sdi Wait: 51012 1011 849 672 568 240 344 280 38 13 6 81 + Disk: sdj Wait: 50724 743 770 586 662 509 684 283 46 17 11 79 + Disk: sdk Wait: 50886 700 585 517 633 511 729 352 89 23 8 81 + Disk: sdl Wait: 50106 617 794 553 604 504 532 501 288 234 165 216 + Disk: sda Time: 55040 22 16 6 1 1 13 0 0 0 3 12 + + Disk: sdb Time: 55014 41 19 8 3 1 8 0 0 0 3 17 + Disk: sdc Time: 55032 23 14 8 9 2 6 1 0 0 0 19 + Disk: sdd Time: 55022 29 17 12 6 2 11 0 0 0 1 14 + Disk: sde Time: 55018 34 15 11 12 1 9 0 0 0 2 12 + Disk: sdf Time: 54809 250 45 7 1 0 0 0 0 0 1 1 + Disk: sdg Time: 55070 36 6 2 0 0 0 0 0 0 0 0 + Disk: sdh Time: 55079 33 2 0 0 0 0 0 0 0 0 0 + Disk: sdi Time: 55074 28 7 2 0 0 2 0 0 0 0 1 + Disk: sdj Time: 55067 35 10 0 1 0 0 0 0 0 0 1 + Disk: sdk Time: 55068 31 10 3 0 0 1 0 0 0 0 1 + Disk: sdl Time: 54905 130 61 7 3 4 1 0 0 0 0 3 + +This shows the historical distribution of the wait and service times +over a day. This is how you read it: + +- sda did 54580 operations with a short wait time, 371 operations with + a longer wait time and 65 with an even longer wait time. + +- sdl did 50106 operations with a short wait time, but as you can see + many took longer. + +There is a clear pattern that sdf to sdl have a problem. Actually, sda +to sde would more normally have lots of zeros in their data. But maybe +this is a busy system. In this example it is worth changing the +controller as the individual drives may be ok. 
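The reading of the ``disk-anal.pl`` output above can be sketched as a small filter. This is an illustrative sketch only, not part of the HPE tooling: the function name ``slow_wait_summary`` is hypothetical, and it assumes, following the description above, that the first numeric column counts the operations with the shortest wait time and that every later column counts progressively longer waits.

```shell
# Hypothetical helper (not part of disk-anal.pl): summarise the "Wait:" rows
# of `disk-anal.pl -d` style output read from stdin.
# For each disk it prints: name, operations outside the first (shortest wait)
# bucket, total operations, and the share of operations outside that bucket.
slow_wait_summary() {
    awk '$1 == "Disk:" && $3 == "Wait:" {
        total = 0
        # Fields 4..NF are the per-bucket operation counts.
        for (i = 4; i <= NF; i++) total += $i
        slow = total - $4
        printf "%s %d %d %.2f%%\n", $2, slow, total, (total ? 100 * slow / total : 0)
    }'
}

# Example usage on a storage node:
#   /opt/hp/syseng/disk-anal.pl -d | slow_wait_summary
```

For the sample output above, this reports 534 of 55114 operations outside the first bucket for sda, against 4162 of 55114 for sdf, which makes the sdf to sdl pattern easier to spot.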
+ +After the controller is changed, use collectl -s D as described above to +see if the problem has cleared. disk-anal.pl will continue to show +historical data. You can look at recent data as follows. This example only +looks at data from 13:15 to 14:15. As you can see, this is a relatively clean +system (few if any long wait or service times): + +.. code:: + + $ /opt/hp/syseng/disk-anal.pl -d -t 13:15-14:15 + Disk: sda Wait: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdb Wait: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdc Wait: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdd Wait: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sde Wait: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdf Wait: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdg Wait: 3594 6 0 0 0 0 0 0 0 0 0 0 + Disk: sdh Wait: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdi Wait: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdj Wait: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdk Wait: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdl Wait: 3599 1 0 0 0 0 0 0 0 0 0 0 + Disk: sda Time: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdb Time: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdc Time: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdd Time: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sde Time: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdf Time: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdg Time: 3594 6 0 0 0 0 0 0 0 0 0 0 + Disk: sdh Time: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdi Time: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdj Time: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdk Time: 3600 0 0 0 0 0 0 0 0 0 0 0 + Disk: sdl Time: 3599 1 0 0 0 0 0 0 0 0 0 0 + +For long wait times where the service time appears normal, check +the logical drive cache status. While the cache may be enabled, it can +be disabled on a per-drive basis. + +Diagnose: Slow network link - Measuring network performance +----------------------------------------------------------- + +Network faults can cause performance between Swift nodes to degrade. The +following tests are recommended. Other methods (such as copying large +files) may also work, but can produce inconclusive results. 
+ +Use netperf on all production systems. Install it on any system where it is +not already installed. The UFW rules for its control port should already be +in place. However, there are no pre-opened ports for netperf's data +connection. Pick a port number. In this example, 12866 is used because it is +one higher than netperf's default control port number, 12865. If you get very +strange results including zero values, you may not have gotten the data +port opened in UFW at the target or may have gotten the netperf +command-line wrong. + +Pick a ``source`` and ``target`` node. The source is often a proxy node +and the target is often an object node. Using the same source proxy you +can test communication to different object nodes in different AZs to +identify possible bottlenecks. + +Running tests +^^^^^^^^^^^^^ + +#. Prepare the ``target`` node as follows: + + .. code:: + + sudo iptables -I INPUT -p tcp -j ACCEPT + + Or, do: + + .. code:: + + sudo ufw allow 12866/tcp + +#. On the ``source`` node, run the following command to check + throughput. Note the double-dash before the -P option. + The command takes 10 seconds to complete. + + .. code:: + + $ netperf -H .72.4 -- -P 12866 + MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12866 AF_INET to + .72.4 (.72.4) port 12866 AF_INET : demo + Recv Send Send + Socket Socket Message Elapsed + Size Size Size Time Throughput + bytes bytes bytes secs. 10^6bits/sec + 87380 16384 16384 10.02 923.69 + +#. On the ``source`` node, run the following command to check latency: + + .. code:: + + $ netperf -H .72.4 -t TCP_RR -- -P 12866 + MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12866 + AF_INET to .72.4 (.72.4) port 12866 AF_INET : demo + : first burst 0 + Local Remote Socket Size Request Resp. Elapsed Trans. + Send Recv Size Size Time Rate + bytes Bytes bytes bytes secs. per sec + 16384 87380 1 1 10.00 11753.37 + 16384 87380 + +Expected results +^^^^^^^^^^^^^^^^ + +Faults will show up as differences between different pairs of nodes. 
+However, for reference, here are some expected numbers: + +- For throughput, proxy to proxy, expect ~9300 Mbit/sec (proxies have + a 10GbE link). + +- For throughput, proxy to object, expect ~920 Mbit/sec (at the time of + writing, object nodes have a 1GbE link). + +- For throughput, object to object, expect ~920 Mbit/sec. + +- For latency (all types), expect ~11000 transactions/sec. + +Diagnose: Remapping sectors experiencing UREs +--------------------------------------------- + +#. Find the bad sector, device, and filesystem in ``kern.log``. + +#. Set the environment variables SEC, DEV & FS, for example: + + .. code:: + + SEC=2930954256 + DEV=/dev/sdi + FS=/dev/sdi1 + +#. Verify that the sector is bad: + + .. code:: + + sudo dd if=${DEV} of=/dev/null bs=512 count=1 skip=${SEC} + +#. If the sector is bad this command will output an input/output error: + + .. code:: + + dd: reading `/dev/sdi`: Input/output error + 0+0 records in + 0+0 records out + +#. Prevent chef from attempting to re-mount the filesystem while the + repair is in progress: + + .. code:: + + sudo mv /etc/chef/client.pem /etc/chef/xx-client.xx-pem + +#. Stop the swift and rsync services: + + .. code:: + + sudo service rsync stop + sudo swift-init shutdown all + +#. Unmount the problem drive: + + .. code:: + + sudo umount ${FS} + +#. Overwrite/remap the bad sector: + + .. code:: + + sudo dd_rescue -d -A -m8b -s ${SEC}b ${DEV} ${DEV} + +#. This command should report an input/output error the first time + it is run. Run the command a second time; if it successfully remapped + the bad sector it should not report an input/output error. + +#. Verify the sector is now readable: + + .. code:: + + sudo dd if=${DEV} of=/dev/null bs=512 count=1 skip=${SEC} + +#. If the sector is now readable this command should not report an + input/output error. + +#. If more than one problem sector is listed, set the SEC environment + variable to the next sector in the list: + + .. code:: + + SEC=123456789 + +#. 
Repeat from step 8. + +#. Repair the filesystem: + + .. code:: + + sudo xfs_repair ${FS} + +#. If ``xfs_repair`` reports that the filesystem has valuable metadata + changes: + + .. code:: + + sudo xfs_repair ${FS} + Phase 1 - find and verify superblock... + Phase 2 - using internal log + - zero log... + ERROR: The filesystem has valuable metadata changes in a log which + needs to be replayed. + Mount the filesystem to replay the log, and unmount it before + re-running xfs_repair. + If you are unable to mount the filesystem, then use the -L option to + destroy the log and attempt a repair. Note that destroying the log may + cause corruption -- please attempt a mount of the filesystem before + doing this. + +#. You should attempt to mount the filesystem, and clear the lost+found + area: + + .. code:: + + sudo mount $FS /mnt + sudo rm -rf /mnt/lost+found/* + sudo umount /mnt + +#. If the filesystem fails to mount then you will need to use the + ``xfs_repair -L`` option to force log zeroing. + Repeat step 11. + +#. If ``xfs_repair`` reports that an additional input/output error has been + encountered, get the sector details as follows: + + .. code:: + + sudo grep "I/O error" /var/log/kern.log | grep sector | tail -1 + +#. If a new input/output error is reported then set the SEC environment + variable to the problem sector number: + + .. code:: + + SEC=234567890 + +#. Repeat from step 8. + + +#. Remount the filesystem and restart swift and rsync. + + - If all UREs in the kern.log have been fixed and you are still unable + to have xfs_repair repair the disk, it is possible that the UREs have + corrupted the filesystem or possibly destroyed the drive altogether. + In this case, the first step is to re-format the filesystem and if + this fails, get the disk replaced. + + +Diagnose: High system latency +----------------------------- + +.. note:: + + The latency measurements described here are specific to the HPE + Helion Public Cloud. + +- A bad NIC on a proxy server. 
However, as explained above, this + usually causes the peak to rise, but the average should remain near + normal parameters. A quick fix is to shut down the proxy. + +- A stuck memcache server. Accepts connections, but then will not respond. + Expect to see timeout messages in ``/var/log/swift/proxy.log`` (port 11211). + Swift Diags will also report this as a failed node/port. A quick fix + is to shut down the proxy server. + +- A bad/broken object server can also cause problems if the accounts + used by the monitor program happen to live on the bad object server. + +- A general network problem within the data center. Compare the results + with the Pingdom monitors to see if they also have a problem. + +Diagnose: Interface reports errors +---------------------------------- + +Should a network interface on a Swift node begin reporting network +errors, it may well indicate a cable, switch, or network issue. + +Get an overview of the interface with: + +.. code:: + + sudo ifconfig eth{n} + sudo ethtool eth{n} + +The ``Link Detected:`` indicator will read ``yes`` if the NIC is +cabled. + +Establish the adapter type with: + +.. code:: + + sudo ethtool -i eth{n} + +Gather the interface statistics with: + +.. code:: + + sudo ethtool -S eth{n} + +If the NIC supports self test, this can be performed with: + +.. code:: + + sudo ethtool -t eth{n} + +Self tests should read ``PASS`` if the NIC is operating correctly. + +NIC module drivers can be re-initialised by carefully removing and +re-installing the modules. A case in point is the Mellanox drivers on +Swift Proxy servers, which use a two-part driver, mlx4_en and +mlx4_core. To reload these you must carefully remove the mlx4_en +(ethernet) then the mlx4_core modules, and reinstall them in the +reverse order. + +As the interface will be disabled while the modules are unloaded, you +must be very careful not to lock the interface out. 
A +script can be used to reload the Mellanox drivers; as a side effect, this +resets error counts on the interface. + + +Diagnose: CorruptDir diagnostic reports corrupt directories +----------------------------------------------------------- + +From time to time Swift data structures may become corrupted by +misplaced files in filesystem locations where swift would normally place +a directory. This causes issues for swift: when directory creation is +attempted at that location, it may fail because of the pre-existing file. If +the CorruptDir diagnostic reports corrupt directories, they should be +checked to see if they exist. + +Checking existence of entries +----------------------------- + +Swift data filesystems are located under the ``/srv/node/disk`` +mountpoints and contain accounts, containers and objects +subdirectories which in turn contain partition number subdirectories. +The partition number directories contain md5 hash subdirectories. md5 +hash directories contain md5sum subdirectories. md5sum directories +contain the Swift data payload as either a database (.db), for +accounts and containers, or a data file (.data) for objects. +If the entries reported in diagnostics correspond to a partition +number, md5 hash or md5sum directory, check the entry with ``ls +-ld *entry*``. +If it turns out to be a file rather than a directory, it should be +carefully removed. + +.. note:: + + Please do not ``ls`` the partition level directory contents, as + this (*especially for objects*) may take a lot of time and system + resources. If you need to check the contents, use: + + .. code:: + + echo /srv/node/disk#/type/partition#/ + +Diagnose: Hung swift object replicator +-------------------------------------- + +The swift diagnostic message ``Object replicator: remaining exceeds +100hrs:`` may indicate that the swift ``object-replicator`` is stuck and not +making progress. 
Another useful way to check this is with the +``swift-recon -r`` command on a swift proxy server: + +.. code:: + + sudo swift-recon -r + =============================================================================== + + --> Starting reconnaissance on 384 hosts + =============================================================================== + [2013-07-17 12:56:19] Checking on replication + http://.72.63:6000/recon/replication: + [replication_time] low: 2, high: 80, avg: 28.8, total: 11037, Failed: 0.0%, no_result: 0, reported: 383 + Oldest completion was 2013-06-12 22:46:50 (12 days ago) by .31:6000. + Most recent completion was 2013-07-17 12:56:19 (5 seconds ago) by .204.113:6000. + =============================================================================== + +The ``Oldest completion`` line in this example indicates that the +object-replicator on swift object server .31 has not completed +the replication cycle in 12 days. This replicator is stuck. The object +replicator cycle is generally less than 1 hour, though a replicator +cycle of 15-20 hours can occur if nodes are added to the system and a +new ring has been deployed. + +You can further check if the object replicator is stuck by logging on +to the object server and checking the object replicator progress with +the following command: + +.. 
code:: + + # sudo grep object-rep /var/log/swift/background.log | grep -e "Starting object replication" -e "Object replication complete" -e "partitions rep" + Jul 16 06:25:46 object-replicator 15344/16450 (93.28%) partitions replicated in 69018.48s (0.22/sec, 22h remaining) + Jul 16 06:30:46 object-replicator 15344/16450 (93.28%) partitions replicated in 69318.58s (0.22/sec, 22h remaining) + Jul 16 06:35:46 object-replicator 15344/16450 (93.28%) partitions replicated in 69618.63s (0.22/sec, 23h remaining) + Jul 16 06:40:46 object-replicator 15344/16450 (93.28%) partitions replicated in 69918.73s (0.22/sec, 23h remaining) + Jul 16 06:45:46 object-replicator 15348/16450 (93.30%) partitions replicated in 70218.75s (0.22/sec, 24h remaining) + Jul 16 06:50:47 object-replicator 15348/16450 (93.30%) partitions replicated in 70518.85s (0.22/sec, 24h remaining) + Jul 16 06:55:47 object-replicator 15348/16450 (93.30%) partitions replicated in 70818.95s (0.22/sec, 25h remaining) + Jul 16 07:00:47 object-replicator 15348/16450 (93.30%) partitions replicated in 71119.05s (0.22/sec, 25h remaining) + Jul 16 07:05:47 object-replicator 15348/16450 (93.30%) partitions replicated in 71419.15s (0.21/sec, 26h remaining) + Jul 16 07:10:47 object-replicator 15348/16450 (93.30%) partitions replicated in 71719.25s (0.21/sec, 26h remaining) + Jul 16 07:15:47 object-replicator 15348/16450 (93.30%) partitions replicated in 72019.27s (0.21/sec, 27h remaining) + Jul 16 07:20:47 object-replicator 15348/16450 (93.30%) partitions replicated in 72319.37s (0.21/sec, 27h remaining) + Jul 16 07:25:47 object-replicator 15348/16450 (93.30%) partitions replicated in 72619.47s (0.21/sec, 28h remaining) + Jul 16 07:30:47 object-replicator 15348/16450 (93.30%) partitions replicated in 72919.56s (0.21/sec, 28h remaining) + Jul 16 07:35:47 object-replicator 15348/16450 (93.30%) partitions replicated in 73219.67s (0.21/sec, 29h remaining) + Jul 16 07:40:47 object-replicator 15348/16450 (93.30%) partitions 
replicated in 73519.76s (0.21/sec, 29h remaining) + +The above status is output every 5 minutes to ``/var/log/swift/background.log``. + +.. note:: + + The 'remaining' time is increasing as time goes on; normally the + time remaining should be decreasing. Also note the number of partitions + replicated. For example, 15344 remains the same for several status lines. + Eventually the object replicator detects the hang and attempts to make + progress by killing the problem thread. The replicator then progresses to + the next partition but quite often it again gets stuck on the same partition. + +One of the reasons for the object replicator hanging like this is +filesystem corruption on the drive. The following is a typical log entry +of a corrupted filesystem detected by the object replicator: + +.. code:: + + # sudo bzgrep "Remote I/O error" /var/log/swift/background.log* | grep srv | tail -1 + Jul 12 03:33:30 object-replicator STDOUT: ERROR:root:Error hashing suffix#012Traceback (most recent call last):#012 File + "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 199, in get_hashes#012 hashes[suffix] = hash_suffix(suffix_dir, + reclaim_age)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 84, in hash_suffix#012 path_contents = + sorted(os.listdir(path))#012OSError: [Errno 121] Remote I/O error: '/srv/node/disk4/objects/1643763/b51' + +An ``ls`` of the problem file or directory usually shows something like the following: + +.. code:: + + # ls -l /srv/node/disk4/objects/1643763/b51 + ls: cannot access /srv/node/disk4/objects/1643763/b51: Remote I/O error + +If no entry with ``Remote I/O error`` occurs in the ``background.log``, it is +not possible to determine why the object-replicator is hung. It may be +that the ``Remote I/O error`` entry is older than 7 days and so has been +rotated out of the logs. In this scenario it may be best to simply +restart the object-replicator. + +#. Stop the object-replicator: + + ..
code:: + + # sudo swift-init object-replicator stop + +#. Make sure the object replicator has stopped. If it has hung, the stop + command will not stop the hung process: + + .. code:: + + # ps auxww | grep swift-object-replicator + +#. If the previous ps shows the object-replicator is still running, kill + the process: + + .. code:: + + # kill -9 + +#. Start the object-replicator: + + .. code:: + + # sudo swift-init object-replicator start + +If the above grep did find a ``Remote I/O error`` then it may be possible +to repair the problem filesystem. + +#. Stop swift and rsync: + + .. code:: + + # sudo swift-init all shutdown + # sudo service rsync stop + +#. Make sure all swift processes have stopped: + + .. code:: + + # ps auxww | grep swift | grep python + +#. Kill any swift processes still running. + +#. Unmount the problem filesystem: + + .. code:: + + # sudo umount /srv/node/disk4 + +#. Repair the filesystem: + + .. code:: + + # sudo xfs_repair -P /dev/sde1 + +#. If the ``xfs_repair`` fails then it may be necessary to re-format the + filesystem. See Procedure: fix broken XFS filesystem. If the + ``xfs_repair`` is successful, re-enable chef using the following command + and replication should commence again. + + +Diagnose: High CPU load +----------------------- + +The CPU load average on an object server, as shown with the +``uptime`` command, is typically under 10 when the server is +lightly to moderately loaded: + +.. code:: + + $ uptime + 07:59:26 up 99 days, 5:57, 1 user, load average: 8.59, 8.39, 8.32 + +During times of increased activity, due to user transactions or object +replication, the CPU load average can increase to around 30. + +However, sometimes the CPU load average can increase significantly. The +following is an example of an object server that has extremely high CPU +load: + +.. code:: + + $ uptime + 07:44:02 up 18:22, 1 user, load average: 407.12, 406.36, 404.59 + +..
toctree:: + :maxdepth: 2 + + sec-furtherdiagnose.rst diff --git a/doc/source/ops_runbook/general.rst b/doc/source/ops_runbook/general.rst new file mode 100644 index 0000000000..60d19badee --- /dev/null +++ b/doc/source/ops_runbook/general.rst @@ -0,0 +1,36 @@ +================== +General Procedures +================== + +Getting swift account stats +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. note:: + + ``swift-direct`` is specific to the HPE Helion Public Cloud. Look at + ``swiftly`` for an alternative; this is an example. + +This procedure describes how you determine the swift usage for a given +swift account, that is, the number of containers, number of objects and +total bytes used. To do this you will need the project ID. + +Log on to one of the swift proxy servers. + +Use swift-direct to show this account's usage: + +.. code:: + + $ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_redacted-9a11-45f8-aa1c-9e7b1c7904c8 + Status: 200 + Content-Length: 0 + Accept-Ranges: bytes + X-Timestamp: 1379698586.88364 + X-Account-Bytes-Used: 67440225625994 + X-Account-Container-Count: 1 + Content-Type: text/plain; charset=utf-8 + X-Account-Object-Count: 8436776 + Status: 200 + name: my_container count: 8436776 bytes: 67440225625994 + +This account has 1 container. That container has 8436776 objects. The +total bytes used is 67440225625994. \ No newline at end of file diff --git a/doc/source/ops_runbook/index.rst b/doc/source/ops_runbook/index.rst new file mode 100644 index 0000000000..6fdb9c8c90 --- /dev/null +++ b/doc/source/ops_runbook/index.rst @@ -0,0 +1,79 @@ +================= +Swift Ops Runbook +================= + +This document contains operational procedures that Hewlett Packard Enterprise (HPE) uses to operate +and monitor the Swift system within the HPE Helion Public Cloud. This +document is an excerpt of a larger product-specific handbook. As such, +the material may appear incomplete.
The suggestions and recommendations +made in this document are for our particular environment, and may not be +suitable for your environment or situation. We make no representations +concerning the accuracy, adequacy, completeness or suitability of the +information, suggestions or recommendations. This document is provided +for reference only. We are not responsible for your use of any +information, suggestions or recommendations contained herein. + +This document also contains references to certain tools that we use to +operate the Swift system within the HPE Helion Public Cloud. +Descriptions of these tools are provided for reference only, as the tools themselves +are not publicly available at this time. + +- ``swift-direct``: This is similar to the ``swiftly`` tool. + + +.. toctree:: + :maxdepth: 2 + + general.rst + diagnose.rst + procedures.rst + maintenance.rst + troubleshooting.rst + +Is the system up? +~~~~~~~~~~~~~~~~~ + +If you have a report that Swift is down, perform the following basic checks: + +#. Run swift functional tests. + +#. From a server in your data center, use ``curl`` to check ``/healthcheck``. + +#. If you have a monitoring system, check your monitoring system. + +#. Check on your hardware load balancer infrastructure. + +#. Run swift-recon on a proxy node. + +Run swift functional tests +-------------------------- + +We recommend that you set up your functional tests to run against your +production system. + +A script for running the functional tests is located in ``swift/.functests``. + + +External monitoring +------------------- + +- We use pingdom.com to monitor the external Swift API.
We suggest the + following: + + - Do a GET on ``/healthcheck`` + + - Create a container, make it public (x-container-read: + .r\*,.rlistings), create a small file in the container; do a GET + on the object + +Reference information +~~~~~~~~~~~~~~~~~~~~~ + +Reference: Swift startup/shutdown +--------------------------------- + +- Use reload - not stop/start/restart. + +- Try to roll sets of servers (especially proxy) in groups of less + than 20% of your servers. + diff --git a/doc/source/ops_runbook/maintenance.rst b/doc/source/ops_runbook/maintenance.rst new file mode 100644 index 0000000000..b3c9e582ac --- /dev/null +++ b/doc/source/ops_runbook/maintenance.rst @@ -0,0 +1,322 @@ +================== +Server maintenance +================== + +General assumptions +~~~~~~~~~~~~~~~~~~~ + +- It is assumed that anyone attempting to replace hardware components + will have already read and understood the appropriate maintenance and + service guides. + +- It is assumed that where servers need to be taken off-line for + hardware replacement, that this will be done in series, bringing the + server back on-line before taking the next off-line. + +- It is assumed that the operations directed procedure will be used for + identifying hardware for replacement. + +Assessing the health of swift +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can run the swift-recon tool on a Swift proxy node to get a quick +check of how Swift is doing. Please note that the numbers below are +necessarily somewhat subjective. Sometimes parameters for which we +say 'low values are good' will have pretty high values for a time. Often +if you wait a while things get better. + +For example: + +.. code:: + + sudo swift-recon -rla + =============================================================================== + [2012-03-10 12:57:21] Checking async pendings on 384 hosts... 
+ Async stats: low: 0, high: 1, avg: 0, total: 1 + =============================================================================== + + [2012-03-10 12:57:22] Checking replication times on 384 hosts... + [Replication Times] shortest: 1.4113877813, longest: 36.8293570836, avg: 4.86278064749 + =============================================================================== + + [2012-03-10 12:57:22] Checking load avg's on 384 hosts... + [5m load average] lowest: 2.22, highest: 9.5, avg: 4.59578125 + [15m load average] lowest: 2.36, highest: 9.45, avg: 4.62622395833 + [1m load average] lowest: 1.84, highest: 9.57, avg: 4.5696875 + =============================================================================== + +In the example above we ask for information on replication times (-r), +load averages (-l) and async pendings (-a). This is a healthy Swift +system. Rules-of-thumb for 'good' recon output are: + +- Nodes that respond are up and running Swift. If all nodes respond, + that is a good sign. But some nodes may time out. For example: + + .. code:: + + -> http://.29:6000/recon/load: + -> http://.31:6000/recon/load: + +- That could be okay or could require investigation. + +- Low values (say < 10 for high and average) for async pendings are + good. Higher values occur when disks are down and/or when the system + is heavily loaded. Many simultaneous PUTs to the same container can + drive async pendings up. This may be normal, and may resolve itself + after a while. If it persists, one way to track down the problem is + to find a node with high async pendings (with ``swift-recon -av | sort + -n -k4``), then check its Swift logs. Often async pendings are high + because a node cannot write to a container on another node. Often + this is because the node or disk is offline or bad. This may be okay + if we know about it. + +- Low values for replication times are good. These values rise when new + rings are pushed, and when nodes and devices are brought back on + line.
+ +- Our 'high' load average values are typically in the 9-15 range. If + they are a lot bigger it is worth having a look at the systems + pushing the average up. Run ``swift-recon -av`` to get the individual + averages. To sort the entries with the highest at the end, + run ``swift-recon -av | sort -n -k4``. + +For comparison here is the recon output for the same system above when +two entire racks of Swift are down: + +.. code:: + + [2012-03-10 16:56:33] Checking async pendings on 384 hosts... + -> http://.22:6000/recon/async: + -> http://.18:6000/recon/async: + -> http://.16:6000/recon/async: + -> http://.13:6000/recon/async: + -> http://.30:6000/recon/async: + -> http://.6:6000/recon/async: + ......... + -> http://.5:6000/recon/async: + -> http://.15:6000/recon/async: + -> http://.9:6000/recon/async: + -> http://.27:6000/recon/async: + -> http://.4:6000/recon/async: + -> http://.8:6000/recon/async: + Async stats: low: 243, high: 659, avg: 413, total: 132275 + =============================================================================== + [2012-03-10 16:57:48] Checking replication times on 384 hosts... + -> http://.22:6000/recon/replication: + -> http://.18:6000/recon/replication: + -> http://.16:6000/recon/replication: + -> http://.13:6000/recon/replication: + -> http://.30:6000/recon/replication: + -> http://.6:6000/recon/replication: + ............ + -> http://.5:6000/recon/replication: + -> http://.15:6000/recon/replication: + -> http://.9:6000/recon/replication: + -> http://.27:6000/recon/replication: + -> http://.4:6000/recon/replication: + -> http://.8:6000/recon/replication: + [Replication Times] shortest: 1.38144306739, longest: 112.620954418, avg: 10.285 + 9475361 + =============================================================================== + [2012-03-10 16:59:03] Checking load avg's on 384 hosts... 
+ -> http://.22:6000/recon/load: + -> http://.18:6000/recon/load: + -> http://.16:6000/recon/load: + -> http://.13:6000/recon/load: + -> http://.30:6000/recon/load: + -> http://.6:6000/recon/load: + ............ + -> http://.15:6000/recon/load: + -> http://.9:6000/recon/load: + -> http://.27:6000/recon/load: + -> http://.4:6000/recon/load: + -> http://.8:6000/recon/load: + [5m load average] lowest: 1.71, highest: 4.91, avg: 2.486375 + [15m load average] lowest: 1.79, highest: 5.04, avg: 2.506125 + [1m load average] lowest: 1.46, highest: 4.55, avg: 2.4929375 + =============================================================================== + +.. note:: + + The replication times and load averages are within reasonable + parameters, even with 80 object stores down. Async pendings, however, are + quite high. This is because the containers on the servers + which are down cannot be updated. When those servers come back up, async + pendings should drop. If async pendings were at this level without an + explanation, we would have a problem. + +Recon examples +~~~~~~~~~~~~~~ + +Here is an example of noting and tracking down a problem with recon. + +Running recon shows some async pendings: + +.. code:: + + bob@notso:~/swift-1.4.4/swift$ ssh -q .132.7 sudo swift-recon -alr + =============================================================================== + [2012-03-14 17:25:55] Checking async pendings on 384 hosts... + Async stats: low: 0, high: 23, avg: 8, total: 3356 + =============================================================================== + [2012-03-14 17:25:55] Checking replication times on 384 hosts... + [Replication Times] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066 + =============================================================================== + [2012-03-14 17:25:56] Checking load avg's on 384 hosts...
+ [5m load average] lowest: 2.35, highest: 8.88, avg: 4.45911458333 + [15m load average] lowest: 2.41, highest: 9.11, avg: 4.504765625 + [1m load average] lowest: 1.95, highest: 8.56, avg: 4.40588541667 + =============================================================================== + +Why? Running recon again with -av swift (not shown here) tells us that +the node with the highest (23) is .72.61. Looking at the log +files on .72.61 we see: + +.. code:: + + souzab@:~$ sudo tail -f /var/log/swift/background.log | grep -i ERROR + Mar 14 17:28:06 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001} + Mar 14 17:28:06 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001} + Mar 14 17:28:09 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001} + Mar 14 17:28:11 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001} + Mar 14 17:28:13 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001} + Mar 14 17:28:13 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001} + Mar 14 17:28:15 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001} + Mar 14 17:28:15 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001} + Mar 14 17:28:19 container-replicator
ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001} + Mar 14 17:28:19 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001} + Mar 14 17:28:20 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001} + Mar 14 17:28:21 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001} + Mar 14 17:28:21 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001} + Mar 14 17:28:22 container-replicator ERROR Remote drive not mounted + {'zone': 5, 'weight': 1952.0, 'ip': '.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001} + +That is why this node has a lot of async pendings: a bunch of disks that +are not mounted on and . There may be other issues, +but clearing this up will likely drop the async pendings a fair bit, as +other nodes will be having the same problem. + +Assessing the availability risk when multiple storage servers are down +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. note:: + + This procedure will tell you if you have a problem, however, in practice + you will find that you will not use this procedure frequently. + +If three storage nodes (or, more precisely, three disks on three +different storage nodes) are down, there is a small but nonzero +probability that user objects, containers, or accounts will not be +available. + +Procedure +--------- + +.. note:: + + swift has three rings: one each for objects, containers and accounts. + This procedure should be run three times, each time specifying the + appropriate ``*.builder`` file. + +#. 
Determine whether all three nodes are in different Swift zones by + running the ring builder on a proxy node to determine which zones + the storage nodes are in. For example: + + .. code:: + + % sudo swift-ring-builder /etc/swift/object.builder + /etc/swift/object.builder, build version 1467 + 2097152 partitions, 3 replicas, 5 zones, 1320 devices, 0.02 balance + The minimum number of hours before a partition can be reassigned is 24 + Devices: id zone ip address port name weight partitions balance meta + 0 1 .4 6000 disk0 1708.00 4259 -0.00 + 1 1 .4 6000 disk1 1708.00 4260 0.02 + 2 1 .4 6000 disk2 1952.00 4868 0.01 + 3 1 .4 6000 disk3 1952.00 4868 0.01 + 4 1 .4 6000 disk4 1952.00 4867 -0.01 + +#. Here, node .4 is in zone 1. If two or more of the three + nodes under consideration are in the same Swift zone, they do not + have any ring partitions in common; there is little or no data + availability risk if all three nodes are down. + +#. If the nodes are in three distinct Swift zones, it is necessary to + determine whether the nodes have ring partitions in common. Run + ``swift-ring-builder`` again, this time with the ``list_parts`` option and + specify the nodes under consideration. For example (all on one line): + + .. code:: + + % sudo swift-ring-builder /etc/swift/object.builder list_parts .8 .15 .72.2 + Partition Matches + 91 2 + 729 2 + 3754 2 + 3769 2 + 3947 2 + 5818 2 + 7918 2 + 8733 2 + 9509 2 + 10233 2 + +#. The ``list_parts`` option to the ring builder indicates how many ring + partitions the nodes have in common. If, as in this case, the + first entry in the list has a 'Matches' column of 2 or less, there + is no data availability risk if all three nodes are down. + +#. If the 'Matches' column has entries equal to 3, there is some data + availability risk if all three nodes are down. The risk is generally + small, and is proportional to the number of entries that have a 3 in + the Matches column. For example: + + ..
code:: + + Partition Matches + 26865 3 + 362367 3 + 745940 3 + 778715 3 + 797559 3 + 820295 3 + 822118 3 + 839603 3 + 852332 3 + 855965 3 + 858016 3 + +#. A quick way to count the number of rows with 3 matches is: + + .. code:: + + % sudo swift-ring-builder /etc/swift/object.builder list_parts .8 .15 .72.2 | grep "3$" | wc -l + + 30 + +#. In this case the nodes have 30 out of a total of 2097152 partitions + in common; about 0.001%. In this case the risk is small but nonzero. + Recall that a partition is simply a portion of the ring mapping + space, not actual data. So having partitions in common is a necessary + but not sufficient condition for data unavailability. + + .. note:: + + We should not bring down a node for repair if it shows + Matches entries of 3 with other nodes that are also down. + + If three nodes that share partitions (Matches entries of 3) are all down, + there is a nonzero probability that data are unavailable and we should work to + bring some or all of the nodes up ASAP. diff --git a/doc/source/ops_runbook/procedures.rst b/doc/source/ops_runbook/procedures.rst new file mode 100644 index 0000000000..899df6d694 --- /dev/null +++ b/doc/source/ops_runbook/procedures.rst @@ -0,0 +1,367 @@ +================================= +Software configuration procedures +================================= + +Fix broken GPT table (broken disk partition) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +- If a GPT table is broken, a message like the following should be + observed when the command... + + .. code:: + + $ sudo parted -l + +- ... is run. + + .. code:: + + ... + Error: The backup GPT table is corrupt, but the primary appears OK, so that will + be used. + OK/Cancel? + +#. To fix this, first install the ``gdisk`` program: + + .. code:: + + $ sudo aptitude install gdisk + +#. Run ``gdisk`` for the particular drive with the damaged partition: + + ..
code:: + + $ sudo gdisk /dev/sd*a-l* + GPT fdisk (gdisk) version 0.6.14 + + Caution: invalid backup GPT header, but valid main header; regenerating + backup header from main header. + + Warning! One or more CRCs don't match. You should repair the disk! + + Partition table scan: + MBR: protective + BSD: not present + APM: not present + GPT: damaged + /dev/sd + ***************************************************************************** + Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk + verification and recovery are STRONGLY recommended. + ***************************************************************************** + +#. On the command prompt, type ``r`` (recovery and transformation + options), followed by ``d`` (use main GPT header), ``v`` (verify disk) + and finally ``w`` (write table to disk and exit). You will also need to + enter ``Y`` when prompted in order to confirm actions. + + .. code:: + + Command (? for help): r + + Recovery/transformation command (? for help): d + + Recovery/transformation command (? for help): v + + Caution: The CRC for the backup partition table is invalid. This table may + be corrupt. This program will automatically create a new backup partition + table when you save your partitions. + + Caution: Partition 1 doesn't begin on a 8-sector boundary. This may + result in degraded performance on some modern (2009 and later) hard disks. + + Caution: Partition 2 doesn't begin on a 8-sector boundary. This may + result in degraded performance on some modern (2009 and later) hard disks. + + Caution: Partition 3 doesn't begin on a 8-sector boundary. This may + result in degraded performance on some modern (2009 and later) hard disks. + + Identified 1 problems! + + Recovery/transformation command (? for help): w + + Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING + PARTITIONS!! + + Do you want to proceed, possibly destroying your data? (Y/N): Y + + OK; writing new GUID partition table (GPT).
+ The operation has completed successfully. + +#. Running the command: + + .. code:: + + $ sudo parted /dev/sd# + +#. Should now show that the partition is recovered and healthy again. + +#. Finally, uninstall ``gdisk`` from the node: + + .. code:: + + $ sudo aptitude remove gdisk + +Procedure: Fix broken XFS filesystem +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +#. A filesystem may be corrupt or broken if the following output is + observed when checking its label: + + .. code:: + + $ sudo xfs_admin -l /dev/sd# + cache_node_purge: refcount was 1, not zero (node=0x25d5ee0) + xfs_admin: cannot read root inode (117) + cache_node_purge: refcount was 1, not zero (node=0x25d92b0) + xfs_admin: cannot read realtime bitmap inode (117) + bad sb magic # 0 in AG 1 + failed to read label in AG 1 + +#. Run the following commands to remove the broken/corrupt filesystem and replace. + (This example uses the filesystem ``/dev/sdb2``) Firstly need to replace the partition: + + .. code:: + + $ sudo parted + GNU Parted 2.3 + Using /dev/sda + Welcome to GNU Parted! Type 'help' to view a list of commands. + (parted) select /dev/sdb + Using /dev/sdb + (parted) p + Model: HP LOGICAL VOLUME (scsi) + Disk /dev/sdb: 2000GB + Sector size (logical/physical): 512B/512B + Partition Table: gpt + + Number Start End Size File system Name Flags + 1 17.4kB 1024MB 1024MB ext3 boot + 2 1024MB 1751GB 1750GB xfs sw-aw2az1-object045-disk1 + 3 1751GB 2000GB 249GB lvm + + (parted) rm 2 + (parted) mkpart primary 2 -1 + Warning: You requested a partition from 2000kB to 2000GB. + The closest location we can manage is 1024MB to 1751GB. + Is this still acceptable to you? + Yes/No? Yes + Warning: The resulting partition is not properly aligned for best performance. + Ignore/Cancel? 
Ignore + (parted) p + Model: HP LOGICAL VOLUME (scsi) + Disk /dev/sdb: 2000GB + Sector size (logical/physical): 512B/512B + Partition Table: gpt + + Number Start End Size File system Name Flags + 1 17.4kB 1024MB 1024MB ext3 boot + 2 1024MB 1751GB 1750GB xfs primary + 3 1751GB 2000GB 249GB lvm + + (parted) quit + +#. Next step is to scrub the filesystem and format: + + .. code:: + + $ sudo dd if=/dev/zero of=/dev/sdb2 bs=$((1024\*1024)) count=1 + 1+0 records in + 1+0 records out + 1048576 bytes (1.0 MB) copied, 0.00480617 s, 218 MB/s + $ sudo /sbin/mkfs.xfs -f -i size=1024 /dev/sdb2 + meta-data=/dev/sdb2 isize=1024 agcount=4, agsize=106811524 blks + = sectsz=512 attr=2, projid32bit=0 + data = bsize=4096 blocks=427246093, imaxpct=5 + = sunit=0 swidth=0 blks + naming =version 2 bsize=4096 ascii-ci=0 + log =internal log bsize=4096 blocks=208616, version=2 + = sectsz=512 sunit=0 blks, lazy-count=1 + realtime =none extsz=4096 blocks=0, rtextents=0 + +#. You should now label and mount your filesystem. + +#. Can now check to see if the filesystem is mounted using the command: + + .. code:: + + $ mount + +Procedure: Checking if an account is okay +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. note:: + + ``swift-direct`` is only available in the HPE Helion Public Cloud. + Use ``swiftly`` as an alternate. + +If you have a tenant ID you can check the account is okay as follows from a proxy. + +.. code:: + + $ sudo -u swift /opt/hp/swift/bin/swift-direct show + +The response will either be similar to a swift list of the account +containers, or an error indicating that the resource could not be found. + +In the latter case you can establish if a backend database exists for +the tenantId by running the following on a proxy: + +.. code:: + + $ sudo -u swift swift-get-nodes /etc/swift/account.ring.gz + +The response will list ssh commands that will list the replicated +account databases, if they exist. 
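Both checks above appear to take the account name as their final argument. The following is a minimal sketch, assuming the account name is the tenant ID prefixed with ``AUTH_`` (this matches the sample commands elsewhere in this runbook, but your reseller prefix may differ). The function name ``account_check_commands`` is hypothetical; it only prints the two commands for review and does not run them.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the two account checks for a tenant ID.
# Assumption: account name = "AUTH_" + tenant ID, as in this runbook's samples.
account_check_commands() {
    tenant_id="$1"
    account="AUTH_${tenant_id}"
    # swift-direct is HPE-specific; the path is the one used in this runbook.
    printf 'sudo -u swift /opt/hp/swift/bin/swift-direct show %s\n' "$account"
    # swift-get-nodes takes the ring file followed by the account name.
    printf 'sudo -u swift swift-get-nodes /etc/swift/account.ring.gz %s\n' "$account"
}
```

Printing the commands first makes it easy to confirm the account name before anything is run against the cluster.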
+ +Procedure: Revive a deleted account +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Swift accounts are normally not recreated. If a tenant unsubscribes from +Swift, the account is deleted. To re-subscribe to Swift, you can create +a new tenant (new tenant ID), and subscribe to Swift. This creates a +new Swift account with the new tenant ID. + +However, until the unsubscribe/new tenant process is supported, you may +hit a situation where a Swift account is deleted and the user is locked +out of Swift. + +Deleting the account database files +----------------------------------- + +Here is one possible solution. The containers and objects may be lost +forever. The solution is to delete the account database files and +re-create the account. This may only be done once the containers and +objects are completely deleted. This process is untested, but could +work as follows: + +#. Use swift-get-nodes to locate the account's database file (on three + servers). + +#. Rename the database files (on three servers). + +#. Use ``swiftly`` to create the account (use original name). + +Renaming account database so it can be revived +---------------------------------------------- + +Get the locations of the database files that hold the account data. + + .. 
code::
+
+    $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-1856-44ae-97db-31242f7ad7a1
+
+    Account      AUTH_redacted-1856-44ae-97db-31242f7ad7a1
+    Container    None
+    Object       None
+
+    Partition    18914
+    Hash         93c41ef56dd69173a9524193ab813e78
+
+    Server:Port Device    15.184.9.126:6002 disk7
+    Server:Port Device    15.184.9.94:6002 disk11
+    Server:Port Device    15.184.9.103:6002 disk10
+    Server:Port Device    15.184.9.80:6002 disk2     [Handoff]
+    Server:Port Device    15.184.9.120:6002 disk2    [Handoff]
+    Server:Port Device    15.184.9.98:6002 disk2     [Handoff]
+
+    curl -I -XHEAD "http://15.184.9.126:6002/disk7/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
+    curl -I -XHEAD "http://15.184.9.94:6002/disk11/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
+    curl -I -XHEAD "http://15.184.9.103:6002/disk10/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
+    curl -I -XHEAD "http://15.184.9.80:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1" # [Handoff]
+    curl -I -XHEAD "http://15.184.9.120:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1" # [Handoff]
+    curl -I -XHEAD "http://15.184.9.98:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1" # [Handoff]
+
+    ssh 15.184.9.126 "ls -lah /srv/node/disk7/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
+    ssh 15.184.9.94 "ls -lah /srv/node/disk11/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
+    ssh 15.184.9.103 "ls -lah /srv/node/disk10/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
+    ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]
+    ssh 15.184.9.120 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]
+    ssh 15.184.9.98 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]
+
+Check that the handoff nodes do not have account databases:
+
+.. code::
+
+    $ ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
+    ls: cannot access /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/: No such file or directory
+
+If the handoff node has a database, wait for rebalancing to occur.
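The handoff check above can be scripted. The following is a minimal Python sketch, not part of Swift: it assumes the ``swift-get-nodes`` output format shown above, and the helper names ``parse_nodes`` and ``handoff_check_commands`` are hypothetical. It only builds the ``ssh`` commands for the handoff nodes; it does not run them.

```python
import re

# Matches node lines in the swift-get-nodes output shown above, for example:
#   "Server:Port Device 15.184.9.80:6002 disk2 [Handoff]"
NODE_LINE = re.compile(
    r"Server:Port Device\s+(?P<ip>\S+):(?P<port>\d+)\s+(?P<device>\S+)"
    r"(?P<handoff>\s+\[Handoff\])?"
)


def parse_nodes(output):
    """Return (ip, device, is_handoff) tuples parsed from swift-get-nodes text."""
    nodes = []
    for line in output.splitlines():
        match = NODE_LINE.search(line)
        if match:
            nodes.append((match.group("ip"), match.group("device"),
                          match.group("handoff") is not None))
    return nodes


def handoff_check_commands(output, partition, account_hash, root="/srv/node"):
    """Build one 'ls' command per handoff node for the account database directory."""
    # In the paths shown above, the directory above the hash is its last 3 characters.
    suffix = account_hash[-3:]
    commands = []
    for ip, device, is_handoff in parse_nodes(output):
        if is_handoff:
            path = "%s/%s/accounts/%s/%s/%s/" % (
                root, device, partition, suffix, account_hash)
            commands.append('ssh %s "ls -lah %s"' % (ip, path))
    return commands
```

Each returned command should report ``No such file or directory`` on a handoff node that holds no account database, as in the example above.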
+
+Procedure: Temporarily stop load balancers from directing traffic to a proxy server
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can stop the load balancers sending requests to a proxy server as
+follows. This can be useful when a proxy is misbehaving but you need
+Swift running to help diagnose the problem. By removing the proxy from the
+load balancers, customers are not impacted by the misbehaving proxy.
+
+#. Ensure that in ``proxy-server.conf`` the ``disable_path`` variable is set to
+   ``/etc/swift/disabled-by-file``.
+
+#. Log onto the proxy node.
+
+#. Shut down Swift as follows:
+
+   .. code::
+
+      sudo swift-init proxy shutdown
+
+   .. note::
+
+      Shutdown, not stop.
+
+#. Create the ``/etc/swift/disabled-by-file`` file. For example:
+
+   .. code::
+
+      sudo touch /etc/swift/disabled-by-file
+
+#. Optionally, restart Swift:
+
+   .. code::
+
+      sudo swift-init proxy start
+
+This works because the healthcheck middleware looks for this file. If it
+finds it, it returns a 503 error instead of 200/OK, so the load balancer
+should stop sending traffic to the proxy.
+
+``/healthcheck`` will report
+``FAIL: disabled by file`` if the ``disabled-by-file`` file exists.
+
+Procedure: Ad-Hoc disk performance test
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can get an idea of how well a disk drive is performing as follows:
+
+.. code::
+
+    sudo dd bs=1M count=256 if=/dev/zero conv=fdatasync of=/srv/node/disk11/remember-to-delete-this-later
+
+You can expect ~600MB/sec. If you get a low number, repeat the test several
+times, because Swift itself may also be reading or writing to the disk, which
+gives a lower number.
diff --git a/doc/source/ops_runbook/sec-furtherdiagnose.rst b/doc/source/ops_runbook/sec-furtherdiagnose.rst
new file mode 100644
index 0000000000..dd8154a3d9
--- /dev/null
+++ b/doc/source/ops_runbook/sec-furtherdiagnose.rst
@@ -0,0 +1,177 @@
+==============================
+Further issues and resolutions
+==============================
+
+.. note::
+
+   The urgency levels in each **Action** column indicate whether
+   immediate action is required or whether the problem can be worked
+   on during business hours.
+
+.. list-table::
+   :widths: 33 33 33
+   :header-rows: 1
+
+   * - **Scenario**
+     - **Description**
+     - **Action**
+   * - ``/healthcheck`` latency is high.
+     - The ``/healthcheck`` test does not tax the proxy very much, so any degradation is probably related to
+       network issues, rather than the proxies being very busy. A very slow proxy might impact the average
+       number, but it would need to be very slow to shift the number that much.
+     - Check networks. Do a ``curl https://<ip-address>/healthcheck``, where ``<ip-address>`` is an individual proxy
+       IP address, to see if you can pinpoint a problem in the network.
+
+       Urgency: If there are other indications that your system is slow, you should treat
+       this as an urgent problem.
+   * - Swift process is not running.
+     - You can use ``swift-init`` status to check if swift processes are running on any
+       given server.
+     - Run this command:
+
+       .. code::
+
+          sudo swift-init all start
+
+       Examine messages in the swift log files to see if there are any
+       error messages related to any of the swift processes since the time you
+       ran the ``swift-init`` command.
+
+       Take any corrective actions that seem necessary.
+
+       Urgency: If this only affects one server, and you have more than one,
+       identifying and fixing the problem can wait until business hours.
+       If this same problem affects many servers, then you need to take corrective
+       action immediately.
+   * - ntpd is not running.
+     - NTP is not running.
+     - Configure and start NTP.
+
+       Urgency: For proxy servers, this is vital.
+
+   * - Host clock is not synced to an NTP server.
+     - Node time settings do not match NTP server time.
+       This may take some time to sync after a reboot.
+     - Assuming NTP is configured and running, you have to wait until the times sync.
+   * - A swift process has hundreds to thousands of open file descriptors.
+     - May happen to any of the swift processes.
+       Known to have happened with an ``rsyslogd`` restart and where ``/tmp`` was hanging.
+
+     - Restart the swift processes on the affected node:
+
+       .. code::
+
+          sudo swift-init all reload
+
+       Urgency:
+       If known performance problem: Immediate
+
+       If system seems fine: Medium
+   * - A swift process is not owned by the swift user.
+     - If the UID of the swift user has changed, then the processes might not be
+       owned by that UID.
+     - Urgency: If this only affects one server, and you have more than one,
+       identifying and fixing the problem can wait until business hours.
+       If this same problem affects many servers, then you need to take corrective
+       action immediately.
+   * - Object, account or container files not owned by swift.
+     - This typically happens if the UID of the swift user was changed during a
+       reinstall or a re-image of a server. The data files in the object, account and container
+       directories are owned by the original swift UID. As a result, the current swift
+       user does not own these files.
+     - Correct the UID of the swift user to reflect that of the original UID. An alternate
+       action is to change the ownership of every file on all file systems. This alternate
+       action is often impractical and will take considerable time.
+
+       Urgency: If this only affects one server, and you have more than one,
+       identifying and fixing the problem can wait until business hours.
+       If this same problem affects many servers, then you need to take corrective
+       action immediately.
+   * - A disk drive has a high IO wait or service time.
+     - If high IO wait times are seen for a single disk, then the disk drive is the problem.
+       If most/all devices are slow, the controller is probably the source of the problem.
+       The controller cache may also be misconfigured, which will cause similar long
+       wait or service times.
+     - As a first step, if your controllers have a cache, check that it is enabled and that its battery/capacitor
+       is working.
+
+       Second, reboot the server.
+       If the problem persists, file a DC ticket to have the drive or controller replaced.
+       See `Diagnose: Slow disk devices` on how to check the drive wait or service times.
+
+       Urgency: Medium
+   * - The network interface is not up.
+     - Use the ``ifconfig`` and ``ethtool`` commands to determine the network state.
+     - You can try restarting the interface. However, generally the interface
+       (or cable) is probably broken, especially if the interface is flapping.
+
+       Urgency: If this only affects one server, and you have more than one,
+       identifying and fixing the problem can wait until business hours.
+       If this same problem affects many servers, then you need to take corrective
+       action immediately.
+   * - Network interface card (NIC) is not operating at the expected speed.
+     - The NIC is running at a slower speed than its nominal rated speed.
+       For example, it is running at 100 Mb/s and the NIC is a 1GbE NIC.
+     - 1. Try resetting the interface with:
+
+          .. code::
+
+             sudo ethtool -s eth0 speed 1000
+
+          ... and then run:
+
+          .. code::
+
+             sudo lshw -class
+
+          See if size goes to the expected speed. Failing
+          that, check hardware (NIC cable/switch port).
+
+       2. If persistent, consider shutting down the server (especially if a proxy)
+          until the problem is identified and resolved. If you leave this server
+          running it can have a large impact on overall performance.
+
+       Urgency: High
+   * - The interface RX/TX error count is non-zero.
+     - A value of 0 is typical, but counts of 1 or 2 do not indicate a problem.
+     - 1. For low numbers (for example, 1 or 2), you can simply ignore them. Numbers in the range
+          3-30 probably indicate that the error count has crept up slowly over a long time.
+          Consider rebooting the server to remove the report from the noise.
+
+          Typically, when a cable or interface is bad, the error count goes to 400+; that is,
+          it stands out. There may be other symptoms such as the interface going up and down or
+          not running at the correct speed. A server with a high error count should be watched.
+
+       2. If the error count continues to climb, consider taking the server down until
+          it can be properly investigated. In any case, a reboot should be done to clear
+          the error count.
+
+       Urgency: High, if the error count is increasing.
+
+   * - In a swift log you see a message that a process has not replicated in over 24 hours.
+     - The replicator has not successfully completed a run in the last 24 hours.
+       This indicates that the replicator has probably hung.
+     - Use ``swift-init`` to stop and then restart the replicator process.
+
+       Urgency: Low. However, if you recently added or replaced disk drives,
+       treat this as urgent.
+   * - Container Updater has not run in 4 hour(s).
+     - The service may appear to be running; however, it may be hung. Examine the swift
+       logs to see if there are any error messages relating to the container updater. This
+       may potentially explain why the container updater is not running.
+     - Urgency: Medium
+
+       This may have been triggered by a recent restart of the rsyslog daemon.
+       Restart the service with:
+
+       .. code::
+
+          sudo swift-init reload
+   * - Object replicator: Reports the remaining time and that time is more than 100 hours.
+     - Each replication cycle the object replicator writes a log message to its log
+       reporting statistics about the current cycle. This includes an estimate for the
+       remaining time needed to replicate all objects. If this time is longer than
+       100 hours, there is a problem with the replication process.
+     - Urgency: Medium
+
+       Restart the service with:
+
+       .. code::
+
+          sudo swift-init object-replicator reload
+
+       Check that the remaining replication time is going down.
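The RX/TX error-count guidance in the table above can be expressed as a small classifier. This is a minimal Python sketch under stated assumptions, not an official tool: the function names are hypothetical, the label for counts between 31 and 399 is an assumption because the runbook gives no guidance for that range, and ``read_error_counts`` assumes a Linux host that exposes interface counters under ``/sys/class/net``.

```python
def classify_error_count(count):
    """Map an interface RX/TX error count to the runbook's guidance."""
    if count < 0:
        raise ValueError("error count cannot be negative")
    if count == 0:
        return "typical"
    if count <= 2:
        return "ignore"
    if count <= 30:
        return "crept up slowly; consider a reboot to clear the count"
    if count >= 400:
        return "probably a bad cable or interface; watch the server"
    # The runbook gives no guidance for 31-399; this label is an assumption.
    return "not covered by the runbook; watch whether the count keeps climbing"


def read_error_counts(interface, sysfs_root="/sys/class/net"):
    """Read rx_errors and tx_errors for one interface from Linux sysfs."""
    counts = {}
    for name in ("rx_errors", "tx_errors"):
        path = "%s/%s/statistics/%s" % (sysfs_root, interface, name)
        with open(path) as handle:
            counts[name] = int(handle.read().strip())
    return counts
```

For example, ``classify_error_count(read_error_counts("eth0")["rx_errors"])`` classifies the receive errors of ``eth0``, which is only an example interface name. Comparing two readings taken some time apart shows whether the count is still increasing.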
diff --git a/doc/source/ops_runbook/troubleshooting.rst b/doc/source/ops_runbook/troubleshooting.rst
new file mode 100644
index 0000000000..d097ce0673
--- /dev/null
+++ b/doc/source/ops_runbook/troubleshooting.rst
@@ -0,0 +1,264 @@
+====================
+Troubleshooting tips
+====================
+
+Diagnose: Customer complains they receive an HTTP status 500 when trying to browse containers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This entry is prompted by a real customer issue and is exclusively focused on how
+that problem was identified.
+There are many reasons why an HTTP status of 500 could be returned. If
+there are no obvious problems with the swift object store, then it may
+be necessary to take a closer look at the user's transactions.
+After finding the user's swift account, you can
+search the swift proxy logs on each swift proxy server for
+transactions from this user. The Linux ``bzgrep`` command can be used to
+search all the proxy log files on a node including the ``.bz2`` compressed
+files. For example:
+
+.. code::
+
+    $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l -R ssh \
+      -w .68.[4-11,132-139 4-11,132-139],.132.[4-11,132-139 4-11,132-139] \
+      'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*' \
+      | dshbak -c
+    .
+    .
+    ----------------
+    .132.6
+    ----------------
+    Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server .16.132
+    .66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af/%3Fformat%3Djson
+    HTTP/1.0 404 - - _4f4d50c5e4b064d88bd7ab82 - - -
+    tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130
+
+This shows a ``GET`` operation on the user's account.
+
+.. note::
+
+    The HTTP status returned is 404, not found, rather than 500 as reported by the user.
+
+Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3``, you can
+search the swift object servers' log files for this transaction ID:
+
+.. code::
+
+    $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l -R ssh \
+      -w .72.[4-67|4-67],.[4-67|4-67],.[4-67|4-67],.204.[4-131| 4-131] \
+      'sudo bzgrep tx429fc3be354f434ab7f9c6c4206c1dc3 /var/log/swift/server.log*' \
+      | dshbak -c
+    .
+    .
+    ----------------
+    .72.16
+    ----------------
+    Feb 29 08:51:57 sw-aw2az1-object013 account-server .132.6 - -
+    [29/Feb/2012:08:51:57 +0000|] "GET /disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
+    404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0016 ""
+    ----------------
+    .31
+    ----------------
+    Feb 29 08:51:57 node-az2-object060 account-server .132.6 - -
+    [29/Feb/2012:08:51:57 +0000|] "GET /disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
+    404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0011 ""
+    ----------------
+    .204.70
+    ----------------
+    Feb 29 08:51:57 sw-aw2az3-object0067 account-server .132.6 - -
+    [29/Feb/2012:08:51:57 +0000|] "GET /disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
+    404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0014 ""
+
+.. note::
+
+    This shows 3 ``GET`` operations to the 3 different object servers that hold the 3
+    replicas of this user's account. Each ``GET`` returns an HTTP status of 404,
+    not found.
+
+Next, use the ``swift-get-nodes`` command to determine exactly where the
+user's account data is stored:
+
+.. code::
+
+    $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-4962-4692-98fb-52ddda82a5af
+    Account      AUTH_redacted-4962-4692-98fb-52ddda82a5af
+    Container    None
+    Object       None
+
+    Partition    198875
+    Hash         1846d99185f8a0edaf65cfbf37439696
+
+    Server:Port Device    .31:6002 disk6
+    Server:Port Device    .204.70:6002 disk6
+    Server:Port Device    .72.16:6002 disk9
+    Server:Port Device    .204.64:6002 disk11    [Handoff]
+    Server:Port Device    .26:6002 disk11        [Handoff]
+    Server:Port Device    .72.27:6002 disk11     [Handoff]
+
+    curl -I -XHEAD "http://.31:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
+    curl -I -XHEAD "http://.204.70:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
+    curl -I -XHEAD "http://.72.16:6002/disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
+    curl -I -XHEAD "http://.204.64:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]
+    curl -I -XHEAD "http://.26:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]
+    curl -I -XHEAD "http://.72.27:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]
+
+    ssh .31 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
+    ssh .204.70 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
+    ssh .72.16 "ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
+    ssh .204.64 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
+    ssh .26 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
+    ssh .72.27 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
+
+Check each of the primary servers, .31, .204.70 and .72.16, for
+this user's account. For example on .72.16:
+
+.. code::
+
+    $ ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/
+    total 1.0M
+    drwxrwxrwx 2 swift swift 98 2012-02-23 14:49 .
+    drwxrwxrwx 3 swift swift 45 2012-02-03 23:28 ..
+    -rw------- 1 swift swift 15K 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db
+    -rw-rw-rw- 1 swift swift 0 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db.pending
+
+So this user's account db, an sqlite db, is present. Use sqlite to
+check out the account:
+
+.. code::
+
+    $ sudo cp /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/1846d99185f8a0edaf65cfbf37439696.db /tmp
+    $ sudo sqlite3 /tmp/1846d99185f8a0edaf65cfbf37439696.db
+    sqlite> .mode line
+    sqlite> select * from account_stat;
+    account = AUTH_redacted-4962-4692-98fb-52ddda82a5af
+    created_at = 1328311738.42190
+    put_timestamp = 1330000873.61411
+    delete_timestamp = 1330001026.00514
+    container_count = 0
+    object_count = 0
+    bytes_used = 0
+    hash = eb7e5d0ea3544d9def940b19114e8b43
+    id = 2de8c8a8-cef9-4a94-a421-2f845802fe90
+    status = DELETED
+    status_changed_at = 1330001026.00514
+    metadata =
+
+.. note::
+
+    The status is ``DELETED``. So this account was deleted. This explains
+    why the ``GET`` operations are returning 404, not found. Check the account
+    delete date/time:
+
+    .. code::
+
+        $ python
+
+        >>> import time
+        >>> time.ctime(1330001026.00514)
+        'Thu Feb 23 12:43:46 2012'
+
+Next, try to find the ``DELETE`` operation for this account in the proxy
+server logs:
+
+.. code::
+
+    $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l -R ssh \
+      -w .68.[4-11,132-139 4-11,132-139],.132.[4-11,132-139|4-11,132-139] \
+      'sudo bzgrep AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log* | grep -w DELETE | awk "{print \$3,\$10,\$12}"' \
+      | dshbak -c
+    .
+    .
+    Feb 23 12:43:46 sw-aw2az2-proxy001 proxy-server 15.203.233.76 .66.7 23/Feb/2012/12/43/46
+    DELETE /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af/ HTTP/1.0 204 -
+    Apache-HttpClient/4.1.2%20%28java%201.5%29 _4f458ee4e4b02a869c3aad02 - - -
+    tx4471188b0b87406899973d297c55ab53 - 0.0086
+
+From this you can see the operation that resulted in the account being deleted.
+
+Procedure: Deleting objects
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Simple case - deleting a small number of objects and containers
+---------------------------------------------------------------
+
+.. note::
+
+    ``swift-direct`` is specific to the Hewlett Packard Enterprise Helion Public Cloud.
+    Use ``swiftly`` as an alternative.
+
+.. note::
+
+    Object and container names are in UTF8. ``swift-direct`` accepts UTF8
+    directly, not URL-encoded UTF8 (the REST API expects UTF8 that is then
+    URL-encoded). In practice, cut and paste of foreign language strings to
+    a terminal window will produce the right result.
+
+    Hint: Use the ``head`` command before any destructive commands.
+
+To delete a small number of objects, log into any proxy node and proceed
+as follows:
+
+Examine the object in question:
+
+.. code::
+
+    $ sudo -u swift /opt/hp/swift/bin/swift-direct head 132345678912345 container_name obj_name
+
+Check whether ``X-Object-Manifest`` or ``X-Static-Large-Object`` is set.
+If so, this is the manifest object and the segment objects may be in another
+container.
+
+If the ``X-Object-Manifest`` attribute is set, the object is a DLO (dynamic
+large object) and you need to find the names of its segment objects. For example,
+if ``X-Object-Manifest`` is ``container2/seg-blah``, list the contents
+of the container container2 as follows:
+
+.. code::
+
+    $ sudo -u swift /opt/hp/swift/bin/swift-direct show 132345678912345 container2
+
+Pick out the objects whose names start with ``seg-blah``.
+Delete the segment objects as follows:
+
+.. code::
+
+    $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah01
+    $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah02
+    etc
+
+If ``X-Static-Large-Object`` is set, you need to read the contents of the
+manifest object. Do this by:
+
+- Using ``swift-get-nodes`` to get the details of the object's location.
+- Changing the ``-X HEAD`` to ``-X GET`` and running ``curl`` against one copy.
+- Reading the returned JSON body, which lists container and object names.
+- Deleting the objects as described above for DLO segments.
+
+Once the segments are deleted, you can delete the object using
+``swift-direct`` as described above.
+
+Finally, use ``swift-direct`` to delete the container.
+
+Procedure: Decommissioning swift nodes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Should Swift nodes need to be decommissioned (for example, where they are being
+re-purposed), it is very important to follow these steps.
+
+#. In the case of object servers, follow the procedure for removing
+   the node from the rings.
+#. In the case of swift proxy servers, have the network team remove
+   the node from the load balancers.
+#. Open a network ticket to have the node removed from network
+   firewalls.
+#. Make sure that you remove the ``/etc/swift`` directory and everything in it.