SHIFT

--- Sjoerd Hooft's InFormation Technology ---

WIKI Disclaimer: As with most other things on the Internet, the content on this wiki is not supported. It was contributed by me and is published “as is”. It has worked for me, and might work for you.
Also note that any view or statement expressed anywhere on this site are strictly mine and not the opinions or views of my employer.



VM Performance Troubleshooting

Business case: Overall performance of the virtual infrastructure is fine, but one VM is not performing as expected, based on either past results or the vendor's specs.
Step one, defining an incident and performing standard troubleshooting, has failed. Activities in this step may include but are not limited to:

  • Determine what has changed
  • Carry out vendor best practices regarding a virtual environment and the application
  • Define reservations CPU/RAM
  • Assign more resources CPU/RAM
  • Assign more shares CPU/RAM/DISK
  • Migrate the VM to a different host
  • Migrate the VM to a different datastore
  • Check limits CPU/RAM
  • Check traffic shaping policies on port group
  • Check Guest OS swap file (1.5 x configured memory)
  • Check Guest OS file compression (should be turned off)
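
The swap file rule of thumb from the checklist above can be written down as a small calculation. This is only a sketch of that 1.5 x rule; the function name and the choice of megabytes as the unit are assumptions made for illustration.

```python
def recommended_swap_mb(configured_memory_mb, factor=1.5):
    """Return the swap/pagefile size suggested by the 1.5 x rule of thumb.

    configured_memory_mb: memory configured for the VM, in MB.
    factor: multiplier from the checklist (1.5 by default).
    """
    if configured_memory_mb < 0:
        raise ValueError("configured memory cannot be negative")
    return configured_memory_mb * factor


# Example: a VM configured with 8 GB (8192 MB) of memory.
print(recommended_swap_mb(8192))  # 12288.0
```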

Defining a Problem

According to ITIL, you can create a problem if you can't find the solution to, or the underlying cause of, one or more incidents. Defining a problem should allow you to assign more resources to reach a solution: more resources for the VM, more time for troubleshooting, new hardware and so on. I prefer to start by assigning time so the underlying cause can be found. If you know the cause, you can prevent the problem from occurring in the future, and you know for sure that the solution you eventually come up with deals with that specific cause.
So, assuming the time is assigned, how do you tackle this particular VM? By following these guidelines you should be able to find the bottleneck:

  1. First, find a timeframe where the specific problem does not occur, for example during the day, the evening or the weekend.
  2. Second, find a timeframe where the specific problem does occur. Again, this could be during the day, the evening or the weekend.
  3. Then start monitoring the VM during the first timeframe to determine if there are any bottlenecks. To do so, monitor the counters below all at the same time.
  4. Now determine if anything can be fixed to make the VM perform even better. Remember, this is the timeframe where everything goes well and you want it to be perfect; otherwise you could draw the wrong conclusion at the end.
  5. After solving the bottlenecks found in the previous step, monitor the exact same counters during the second timeframe.
  6. Now comes the hard part: combine all the information gathered and determine the exact underlying cause of the problem. There's no shortcut for that; it takes years of experience and knowledge.
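
Comparing the two timeframes can be partly mechanised. The sketch below assumes you have exported the counter samples for each timeframe into a dictionary of lists; the counter names, the 25% threshold and both function names are hypothetical. It only points at the counters that moved the most, and interpreting them is still up to you.

```python
from statistics import mean


def summarize(samples):
    """Average every counter over one timeframe.

    samples: dict mapping a counter name to a list of numeric samples.
    """
    return {name: mean(values) for name, values in samples.items() if values}


def compare_timeframes(baseline, problem, threshold=0.25):
    """List counters whose average differs by more than `threshold`
    (as a fraction of the baseline average) between the two timeframes."""
    base_avg = summarize(baseline)
    prob_avg = summarize(problem)
    flagged = []
    # Only counters that were sampled in both timeframes can be compared.
    for name in sorted(base_avg.keys() & prob_avg.keys()):
        base, prob = base_avg[name], prob_avg[name]
        if base == 0:
            change = float("inf") if prob != 0 else 0.0
        else:
            change = (prob - base) / abs(base)
        if abs(change) > threshold:
            flagged.append((name, base, prob, change))
    # Largest relative change first.
    flagged.sort(key=lambda row: abs(row[3]), reverse=True)
    return flagged


# Hypothetical samples for the "good" and the "bad" timeframe.
good = {"cpu.ready_ms": [100, 100], "mem.active_kb": [1000, 1000]}
bad = {"cpu.ready_ms": [400, 400], "mem.active_kb": [1100, 1100]}

for name, base, prob, change in compare_timeframes(good, bad):
    print(f"{name}: {base} -> {prob} ({change:+.0%})")
```

With these sample values only the CPU ready counter is reported, because memory activity changed by less than the threshold.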


There might be a catch: the above activities will tell you the root of the problem, that is, your bottleneck, but not your solution. Your solution might, for example, be found in:

  • Splitting up processes to different VMs
  • Assigning more resources (CPU/MEM)
  • or any other solution


It could also be the case that you need to investigate even further:

  • Network-related problems can be analyzed using Wireshark
  • Sometimes you need to assign specific processes to specific CPUs (on VM (CPU affinity) or guest level)
  • Process monitoring on the guest
  • FC paths (MPIO)
  • Windows perfmon – average disk queue length. This contains both active and queued commands.
  • Linux - top
  • ESX - esxtop - qstats
  • Windows pagefile (1.5 times the memory size)
  • or any other activity

Counters to Monitor

  • VM Counters
    • Per CPU
      • Ready Time
      • Usage in MHz
    • Memory
      • Consumed
      • Active
      • Overhead
      • Swap in
      • Swap out
      • Balloon
      • Usage
    • Network usage per NIC
      • Usage
      • Data transmit rate
      • Data receive rate
    • Disk per datastore
      • Read Rate
      • Write Rate
      • Commands Issued
      • Command Aborts
  • Host Counters
    • CPU
      • Utilization
      • Reserved Capacity
    • Memory
      • Balloon
      • Swap Out
      • Swap In
      • Active
      • Shared Common
      • Reserved Capacity
      • Granted
      • Used by VMkernel
    • Network usage per NIC
      • Usage
      • Data transmit rate
      • Data receive rate
    • Disk (only select the disks that are in use by the VM)
      • Command Latency
      • Write latency
      • Read latency
      • Maximum queue depth
      • Commands issued
      • Average commands issued per second
  • OS Counters
    • Open process monitoring to see the processes running
    • Use Resource Monitor to see the Queue Length for the disks
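
The CPU Ready Time counter is reported by the vSphere performance charts as a summation in milliseconds per sample interval, which is hard to judge on its own. A common way to read it is to convert it to a percentage of the interval. The sketch below assumes the 20-second interval of the real-time charts; other chart ranges use longer intervals, so pass the interval that matches your data. The function name is made up for this example.

```python
def cpu_ready_percent(ready_ms, interval_s=20):
    """Convert a CPU ready summation value (ms) to a percentage.

    ready_ms: ready time summed over one sample interval, in milliseconds.
    interval_s: length of the sample interval in seconds
                (20 is assumed here, as used by the real-time charts).
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    # Ready time divided by the interval length in ms, as a percentage.
    return ready_ms * 100 / (interval_s * 1000)


# 1000 ms of ready time within a 20-second sample.
print(cpu_ready_percent(1000))  # 5.0
```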

Data Sheet

You can use this data sheet to make a summary of the VM you're about to troubleshoot.

vmperformancetroubleshooting.txt · Last modified: 2021/09/24 00:25 (external edit)