HOW TO: Configure Shared Diagnostic Partition on VMware ESX host

I have meant to write this article for a very long time, since before I started this blog. I even took the screenshots and made a note of the commands but never found the time to do it. Well, better late than never! Besides, you don’t do it very often. Most likely only once, when you configure your ESX cluster. Unless the shared diagnostic partition needs to be migrated to another storage array, gets deleted or is re-formatted as VMFS… :)

A Diagnostic Partition, or Dump Partition, is used to store core dumps for debugging and technical support. There is plenty of information on capturing VMware ESX diagnostic information after a PSOD (Purple Screen Of Death) on the Internet and in the VMware Knowledge Base. Here is some light reading for you:

  • Collecting diagnostic information from an ESX or ESXi host that experiences a purple diagnostic screen. VMware KB 1004128;
  • Configuring an ESX/ESXi host to capture a VMkernel coredump from a purple diagnostic screen. VMware KB 1000328;
  • Collecting diagnostic information for VMware ESX/ESXi using the vSphere Client. VMware KB 653;
  • Collecting diagnostic information for VMware ESX/ESXi using the vm-support command. VMware KB 1010705.

OK, let’s get started:

  1. Create a LUN big enough to store VMkernel coredumps from all hosts in your cluster. A VMkernel coredump is 110 MB in size, so do the maths;
  2. Make a note of the LUN’s Unique ID.
    I am working on EMC VNX, screenshot is from EMC Unisphere:
  3. Present the LUN to all hosts; it is a ‘shared’ diagnostic partition after all. Make a note of the Host LUN ID – the LUN ID under which this LUN will be known to all hosts;
  4. In vSphere Client right click on the cluster, select Rescan for Datastores and tick ‘Scan for New Storage Devices’ only, click OK;
  5. Select any host in the cluster, go to Configuration, Storage, Devices and check that the host can see the LUN;
  6. Click on Datastores and select Add Storage…;
  7. Select Diagnostic, click Next;
  8. Select Shared SAN storage, click Next;
  9. Select the LUN from the list, make sure LUN identifier and LUN ID are correct and click Next;
  10. In theory, you should get a confirmation that the shared diagnostic partition has been created and that would be the end of the story… BUT! I received an error message:
    The object or item referred to could not be found
    Call "HostDiagnosticSystem.QueryPartitionCreateDesc" for object "diagnosticSystem-3869" on vCenter Server "vCenter.vStrong.info" failed.


    I got the same error message when I tried to configure a Shared Diagnostic Partition in two vCenters which, although configured in a similar way, use two different VNX storage arrays. I logged a call with VMware but they could not get it sorted and suggested configuring the Shared Diagnostic Partition via the CLI.
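
For the "do the maths" part of step 1, here is a minimal sketch of the sizing arithmetic. It assumes one 110 MB coredump per host, as stated above; the function name and the 16-host cluster are only illustrations, and the result is a lower bound that you would normally round up.

```shell
#!/bin/sh
# Rough lower bound for the size of a shared diagnostic LUN.
# Assumption: one 110 MB coredump per host (figure taken from step 1).

COREDUMP_MB=110

# lun_size_mb HOSTS -> space in MB needed to hold one coredump per host
lun_size_mb() {
    echo $(( $1 * COREDUMP_MB ))
}

# Example: a hypothetical 16-host cluster
lun_size_mb 16
```

Leave some headroom on top of this figure, since the first 2048 sectors of the disk are not part of the partition.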

OK, let’s configure the Shared Diagnostic Partition via the CLI. This is a much more interesting way of doing it and the REAL reason for writing this article.
There are two VMware Knowledge Base articles that can help us:

  • Configuring an ESXi 5.0 host to capture a VMkernel coredump from a purple diagnostic screen to a diagnostic partition, KB 2004299;
  • Using the partedUtil command line utility on ESX and ESXi, KB 1036609.

Here is how to do it:

  1. On any cluster host run the following commands.
    List all disks / LUNs:

    ~ # cd /vmfs/devices/disks
    /dev/disks # ls

    Make sure you can see the disk with the LUN’s Unique ID in the list, or run the same command and filter the output with grep:

    /dev/disks # ls | grep 6006016055d02c00c0815b17cc78e111
    naa.6006016055d02c00c0815b17cc78e111
    vml.02002f00006006016055d02c00c0815b17cc78e111565241494420
  2. Let’s have a look if the disk has any partitions:
    partedUtil getptbl "/vmfs/devices/disks/mpx.vmhba0:C0:T0:L0"
    There are no partitions configured:

    /dev/disks # partedUtil getptbl /vmfs/devices/disks/naa.6006016055d02c00c0815b17cc78e111
    gpt
    1305 255 63 20971520

    1305 255 63 20971520
    |     |   |  |
    |     |   |  ----- quantity of sectors
    |     |   -------- quantity of sectors per track
    |     ------------ quantity of heads
    ------------------ quantity of cylinders

  3. Create a diagnostic partition:
    partedUtil setptbl "/vmfs/devices/disks/DeviceName" DiskLabel ["partNum startSector endSector type/guid attribute"]
    DiskLabel – common labels are bsd, dvh, gpt, loop, mac, msdos, pc98, sun. ESXi 5.0 and higher supports both the msdos and gpt label and partitioning schemes;
    partNum – partition number, start from 1;
    startSector and endSector – specify how much contiguous disk space a partition occupies;
    type/GUID – identifies the purpose of a partition:

    /dev/disks # partedUtil showGuids
    Partition Type       GUID
    vmfs                 AA31E02A400F11DB9590000C2911D1B8
    vmkDiagnostic        9D27538040AD11DBBF97000C2911D1B8 - VMware Diagnostic Partition GUID
    VMware Reserved      9198EFFC31C011DB8F78000C2911D1B8
    Basic Data           EBD0A0A2B9E5443387C068B6B72699C7
    Linux Swap           0657FD6DA4AB43C484E50933C84B4F4F
    Linux Lvm            E6D6D379F50744C2A23C238F2A3DF928
    Linux Raid           A19D880F05FC4D3BA006743F0F84911E
    Efi System           C12A7328F81F11D2BA4B00A0C93EC93B
    Microsoft Reserved   E3C9E3160B5C4DB8817DF92DF00215AE
    Unused Entry         00000000000000000000000000000000

    attribute – common attribute is 128 = 0x80, which indicates that the partition is bootable. Otherwise, most partitions have an attribute value of 0 (zero).

    How to calculate startSector and endSector?
    Well, startSector is easy – it is 2048. endSector = startSector + (partition_size_in_MB * 1024 * 1024 / 512).
    WOW, too many calculations! But on the plus side you can get really creative! The size of the coredump partition for a single server is 110 MB, which is 225280 sectors of 512 bytes.
    You can slice the LUN into multiple diagnostic partitions and then configure hosts to use them individually!

    Host        Partition       startSector        endSector
    ESXi01         1            2048               227328
    ESXi02         2            227329             452609
    etc etc...

    It would be a nightmare to manage a setup like that though!…
    Luckily, ESX hosts are happy to share a diagnostic partition and will not overwrite each other’s coredumps, see screenshots below.

    To make your life easier, let’s just create a new VMFS datastore on this LUN, make a note of the startSector and endSector, then delete it and re-create the partition as vmkDiagnostic.

    Here is the same disk we were looking at before with a VMFS partition:

    /dev/disks # partedUtil getptbl /vmfs/devices/disks/naa.6006016055d02c00c0815b17cc78e111
    gpt
    1305 255 63 20971520
    1 2048 20971486 AA31E02A400F11DB9590000C2911D1B8 vmfs 0

    1 2048 20971486   ^^^^  0
    | |       |         |   |
    | |       |         |   --- attribute
    | |       |         ------- type/GUID
    | |       ----------------- ending sector
    | ------------------------- starting sector
    --------------------------- partition number

    OK, we are ready:

    /dev/disks # partedUtil setptbl "/vmfs/devices/disks/naa.6006016055d02c00c0815b17cc78e111" gpt "1 2048 20971486 9D27538040AD11DBBF97000C2911D1B8 0"
    gpt
    0 0 0 0
    1 2048 20971486 9D27538040AD11DBBF97000C2911D1B8 0
  4. Make sure the partition was created:
    /dev/disks # partedUtil getptbl /vmfs/devices/disks/naa.6006016055d02c00c0815b17cc78e111
    gpt
    1305 255 63 20971520
    1 2048 20971486 9D27538040AD11DBBF97000C2911D1B8 vmkDiagnostic 0
  5. Now we can configure the host to use Shared Diagnostic Partition:
    1. Let’s check if a coredump partition is currently configured:
      Usage: esxcli system coredump partition list

      ~ # esxcli system coredump partition list
      Name                                    Path                                                        Active  Configured
      --------------------------------------  ----------------------------------------------------------  ------  ----------
      naa.6006016055d02c00c0815b17cc78e111:1  /vmfs/devices/disks/naa.6006016055d02c00c0815b17cc78e111:1   false       false
      naa.6006016055d02c00acc1158a7e31e111:7  /vmfs/devices/disks/naa.6006016055d02c00acc1158a7e31e111:7    true        true

      In this example the coredump partition was configured during server installation on local storage (in fact it is a partition on a SAN boot LUN)

    2. Configure host to use shared diagnostic partition:
      Usage: esxcli system coredump partition set [cmd options]
      Description:
      set                   Set the specific VMkernel dump partition for this system. This will configure the dump partition for the next boot. This command will change the active dump partition to the partition specified.
      cmd options:
      -e|--enable           Enable or disable the VMkernel dump partition. This option cannot be specified when setting or unconfiguring the dump partition.
      -p|--partition=<str>  The name of the partition to use. This should be a device name with a partition number at the end. Example: naa.xxxxx:1
      -s|--smart            This flag can be used only with --enable=true. It will cause the best available partition to be selected using the smart selection algorithm.
      -u|--unconfigure      Set the dump partition into an unconfigured state. This will remove the current configured dump partition for the next boot. This will result in the smart activate algorithm being used at the next boot.

      esxcli system coredump partition set --partition="naa.6006016055d02c00c0815b17cc78e111:1"
      esxcli system coredump partition set --enable=true
      esxcli system coredump partition list
      Name                                    Path                                                        Active  Configured
      --------------------------------------  ----------------------------------------------------------  ------  ----------
      naa.6006016055d02c00c0815b17cc78e111:1  /vmfs/devices/disks/naa.6006016055d02c00c0815b17cc78e111:1    true        true
      naa.6006016055d02c00acc1158a7e31e111:7  /vmfs/devices/disks/naa.6006016055d02c00acc1158a7e31e111:7   false       false

      You need to run these commands on each host.
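
If you would rather not do the sector arithmetic by hand, the steps above can be sketched as a small script. This is only an illustration under a few assumptions: 512-byte sectors, a start sector of 2048 and the endSector formula from step 3. The function names are mine, and the script only prints the command lines, it does not touch any disk.

```shell
#!/bin/sh
# Sketch: print the partedUtil/esxcli command lines for a diagnostic partition.
# Nothing is executed against a device; review the output before using it.

SECTOR_BYTES=512
START_SECTOR=2048
SECTORS_PER_MB=$(( 1024 * 1024 / SECTOR_BYTES ))
DIAG_GUID=9D27538040AD11DBBF97000C2911D1B8   # vmkDiagnostic, from partedUtil showGuids

# end_sector SIZE_MB -> endSector = startSector + size in sectors
# (the formula used in step 3; endSector itself is part of the partition)
end_sector() {
    echo $(( START_SECTOR + $1 * SECTORS_PER_MB ))
}

# print_commands DEVICE SIZE_MB -> the three command lines used in this article
print_commands() {
    dev=$1
    end=$(end_sector "$2")
    echo "partedUtil setptbl \"/vmfs/devices/disks/$dev\" gpt \"1 $START_SECTOR $end $DIAG_GUID 0\""
    echo "esxcli system coredump partition set --partition=\"$dev:1\""
    echo "esxcli system coredump partition set --enable=true"
}

# Example: a 110 MB partition on the LUN used in this article
print_commands naa.6006016055d02c00c0815b17cc78e111 110
```

The partedUtil line only needs to be run once, on one host; the two esxcli lines are the ones to repeat on every host in the cluster.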

OK, let’s see how it works:

Q: How to manually initiate PSOD (Purple Screen Of Death)?
A: Run the following command: vsish -e set /reliability/crashMe/Panic

The host crashes and saves the coredump to a partition on a local disk, hence ‘Slot 1 of 1’:

The next two screenshots are PSODs from servers configured to use the Shared Diagnostic Partition.
In this example a 2 GB shared diagnostic partition was configured for all hosts in the cluster. As the coredump size is less than 110 MB, there are 20 slots on the 2 GB partition.
Host 1:

Host 2:
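
As a quick sanity check on the slot count mentioned above, this sketch divides the partition size by the number of slots. The function name is hypothetical, and I am not claiming this is how ESXi itself sizes the slots; it only shows that the numbers are consistent with a coredump of less than 110 MB.

```shell
#!/bin/sh
# mb_per_slot PARTITION_MB SLOTS -> whole MB available to each slot
mb_per_slot() {
    echo $(( $1 / $2 ))
}

# The 2 GB partition with 20 slots from the example above
mb_per_slot 2048 20
```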

Restart/cold boot the host and, when it comes back online, use the vSphere Client or vm-support command to copy diagnostic information from the host. For more information, see Collecting diagnostic information for VMware ESX/ESXi using the vSphere Client (653) or Collecting diagnostic information for VMware ESX/ESXi using the vm-support command (1010705). It is possible to re-copy the contents of the Dump Partition to a vmkernel-zdump-* coredump file. This may be necessary if the automatically-generated file is deleted. For more information, see Manually regenerating core dump files in VMware ESX and ESXi (1002769).

UPDATE:

I had to rebuild a couple of hosts in a cluster where a Shared Diagnostic Partition had been created and configured (a LUN with a vmkDiagnostic GUID partition, masked to a host). Even though the host was configured to boot from SAN and therefore had a ‘local’ coredump partition, it ‘chose’ to use the shared diagnostic partition!
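
To see quickly which partition a host has actually picked, you can filter the output of esxcli system coredump partition list. The helper below is a sketch of my own (the function name is hypothetical); it assumes the four-column layout shown earlier in this article and that awk is available in your shell.

```shell
#!/bin/sh
# active_dump_partition: reads `esxcli system coredump partition list` output
# on stdin and prints the Name of every row whose Active column is "true".
# Assumes the columns are Name, Path, Active, Configured, after two header lines.
active_dump_partition() {
    awk 'NR > 2 && $3 == "true" { print $1 }'
}

# Example with the output captured earlier in this article
active_dump_partition <<'EOF'
Name                                    Path                                                        Active  Configured
--------------------------------------  ----------------------------------------------------------  ------  ----------
naa.6006016055d02c00c0815b17cc78e111:1  /vmfs/devices/disks/naa.6006016055d02c00c0815b17cc78e111:1    true        true
naa.6006016055d02c00acc1158a7e31e111:7  /vmfs/devices/disks/naa.6006016055d02c00acc1158a7e31e111:7   false       false
EOF
```

On a host you would pipe the live output into it: esxcli system coredump partition list | active_dump_partition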
