Friday, February 28, 2020

Asset Management with Power BI and vRops

I was recently reading a blog post by Bob Plankers on the official vSphere Blog about setting up asset management with some free-to-use tools.  This time last year I was in a heated discussion about how our biggest security hole was a lack of asset management.  Now, we weren't without it entirely.  I track our servers with a spreadsheet.  We also own vRops and various other tools that help track assets.  There are a few problems.
  1. Unreliable Data:  We long ago outgrew the "Server Spreadsheet" as a good means of tracking servers.  We are a large company, but not an enterprise, with over 2000 servers all over the world.  My boss is a firm believer that anything entered manually has a chance of being unreliable.  Items may be missed, and changes may occur and not be tracked.  I was good at entering data as I made changes, but I was not the only person making changes.  
  2. Painful Reporting:  We have plenty of management tools with good data.  But reports took a long time to generate because they all had to be formatted for viewing.  I was spending hours on every report just formatting it to show things like servers being replicated per team or patches not installed by team.
  3. Lack of Application Data:  The server team was pretty good at updating information in our "spreadsheet", but sometimes the applications team makes changes that we aren't told about, and those changes affect the data in the "spreadsheet".  We had no idea what the servers did outside of a quick description that we entered when we originally built them.
Why didn't we buy proper Asset Management software?  Let's see if you've heard these before:
  • Why do we need it, you have a spreadsheet.
  • That's a lot of money, don't we have something already for this?
  • Who is going to manage it?  Not us.
  • We don't need it, you guys are doing fine.
  • We don't have time to take on a project of that scale.
What did I have?
  • A "Server Spreadsheet" with server names, brief description and application owners.  Plus a lot of other data that was unreliable due to changes.
  • A server environment where 98% of it is virtualized on VMware.  It means most of the server's configuration can be tracked back to VMware and only a few need to be tracked manually.
  • vRealize Operations Manager.  This contains all the VM and host configuration and performance data.  It is "reliable" data as it is updated dynamically.
  • A heavy Microsoft Windows environment.  This includes Active Directory and popular Microsoft Applications. 
  • A large assortment of common management tools with SQL databases (This is important).
  • An On-Prem version of Power BI.  I was lucky here.  Some of the queries I do would be a lot more difficult to do in the O365 version of it.
How did I get here?  I was complaining to a co-worker about my need for a better way to track assets and mentioned that I was about to just set up a larger spreadsheet with vlookups.  Then he showed me what he was doing in Power BI.

What is Power BI?  The description says it's a business analytics service by Microsoft that provides interactive visualizations.

How do I use something like that for Asset Management?  The secret sauce behind Power BI is Power Query and the built-in ability to link data sources together.  

I can pull in multiple data sources:
  • Let's say a spreadsheet with server names, application description and application team owner
  • Then I pull a vRops report with server names, vCPU's, Memory, disk size and power state. (Plus a ton of other data, just think about what vRops can get).
I put that data together:
  • The server names are the same on both reports.  In Power BI, I link the two columns together.  Power BI then knows that the data rows in each data source are related.

  • What if the server names are not exactly the same?  Like one uses FQDN and the other is a simple name.  That's where Power Query is your friend.  It changes the data dynamically as it pulls it from the data source.  For example, you can drop the "" from the FQDN and leave the rest.  In the same way, you can capitalize it as well.  Lots of ways to solve problems like this.

  • The Power BI visualization allows me to make a list with whatever data I want together, as long as it is linked.
  • Once uploaded to a Power BI server, it can be set to automatically update every day with new data.  If your queries are set up right, the server compiles all the data automatically.
  • Now, I have a server name and description side by side with the current vCPU's, memory and disk space.  The virtual configuration is as current as the latest vRops report.
  • The built-in visualizations allow for filters (slicers), so you can easily filter based on location of the servers, owners, OS versions, whatever you can imagine and\or fit on a page.
  • The list automatically removes those servers that are powered off, based on the power state from vRops.
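
The linking steps above can be sketched outside Power BI in plain Python.  This is only an illustration of the idea, not the Power Query behind the actual report: the field names, the sample rows, the "example.local" domain and the `normalize_name` helper are all made up for the example.

```python
def normalize_name(name):
    """Reduce a server name to an upper-case short name so both sources match."""
    # Keep only the part before the first dot (drops any domain suffix).
    return name.strip().split(".")[0].upper()


def build_asset_list(spreadsheet_rows, vrops_rows):
    """Join the two sources on the normalized name, skipping powered-off VMs."""
    vrops_by_name = {normalize_name(r["name"]): r for r in vrops_rows}
    assets = []
    for row in spreadsheet_rows:
        key = normalize_name(row["server"])
        vm = vrops_by_name.get(key)
        if vm is None or vm["power_state"] != "Powered On":
            continue  # no vRops match, or the VM is not powered on
        assets.append({
            "server": key,
            "description": row["description"],
            "owner": row["owner"],
            "vcpus": vm["vcpus"],
            "memory_gb": vm["memory_gb"],
        })
    return assets


# Hypothetical sample data standing in for the spreadsheet and the vRops report.
spreadsheet = [
    {"server": "web01", "description": "Intranet web server", "owner": "Web Team"},
    {"server": "sql01", "description": "Reporting database", "owner": "DBA Team"},
]
vrops = [
    {"name": "WEB01.example.local", "vcpus": 4, "memory_gb": 16,
     "power_state": "Powered On"},
    {"name": "sql01.example.local", "vcpus": 8, "memory_gb": 64,
     "power_state": "Powered Off"},
]

for asset in build_asset_list(spreadsheet, vrops):
    print(asset)
```

In Power BI the same work is split between Power Query (the name clean-up) and the relationship between the two columns (the join), so none of this has to be scripted by hand.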

Is that all?  Hell No!  Now that I have this data formatted, what else can I do with it?

  • Do you want to track the removal of Server 2008 from your environment?  That's an easy page with the data I've already uploaded.

  • The page filters the Server OS to Server 2008 and filters out those that are "powered off".  Lists the totals by teams.  All the values are interactive.  If you click on a slice of the pie, it filters the list below to that team only.
  • All lists are exportable to CSV for those that don't have time to click on a web page and a tab.  I suggest formatting these lists in a way that is easy to read for management.  
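
For anyone who thinks better in code, the logic of that Server 2008 page is roughly the following.  It is a sketch only: the field names and sample servers are invented, and in Power BI this is done with page filters and a visual, not a script.

```python
from collections import Counter


def server_2008_by_team(assets):
    """Count servers running Server 2008 per owning team, ignoring powered-off ones."""
    return Counter(
        a["owner"]
        for a in assets
        if "Server 2008" in a["os"] and a["power_state"] != "Powered Off"
    )


# Hypothetical rows, shaped like the merged spreadsheet and vRops data.
assets = [
    {"server": "WEB01", "os": "Windows Server 2008 R2", "owner": "Web Team",
     "power_state": "Powered On"},
    {"server": "WEB02", "os": "Windows Server 2008", "owner": "Web Team",
     "power_state": "Powered On"},
    {"server": "SQL01", "os": "Windows Server 2008 R2", "owner": "DBA Team",
     "power_state": "Powered Off"},
    {"server": "APP01", "os": "Windows Server 2016", "owner": "App Team",
     "power_state": "Powered On"},
]

for team, total in server_2008_by_team(assets).items():
    print(team, total)
```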
Other Examples:

  • VM Tools & VM Hardware versions:

  • Host Hardware\OS Version:

  • Do you want to track which OU the server is located in within AD?  Add Active Directory as a data source.  It's already built in to Power BI.
  • Do you want to track if a specific patch is installed?  Add your CMDB or Patching server DB as a source.

The Best Part!!!  Self Service REPORTS!

I save hours of work a week by building common reports in Power BI.  Instead of spending the time gathering the data manually, it does it for me.  It updates every day dynamically from multiple sources.  No longer do I need to pull information from VMware and merge it with physical server information; it's already there.

Does a director keep asking for the same information?  Tell him to visit the page and get it himself, anytime.

Future Projects:

  • Build an application page, like the server asset page and link the two together.  This is a struggle because the application information is not mine.  I have to work with other teams to get this information updated and formatted the way we need.
  • Use the application data to check on VMware configurations like DRS anti-affinity rules to make sure we are properly set up for High Availability.  We want to make sure that paired Web servers are on separate hosts and datastores.  This will prove tricky since this data is not in vRops yet.  I've asked for it.
  • Add performance data.  I don't want to re-create the graphs or dashboards that vRops already has.  But I would like people to be able to get to vRops from here to look at the performance data.  One dashboard to rule them all!
  • Add Costing in Power BI.  The costing in vRops is very helpful.  But I doubt it will be complete for my environment.  I can export and manipulate the costing from vRops and make it look the way I want within Power BI.
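
The paired web server check in the list above could look something like this once the placement data is available.  This is a rough sketch under assumptions: the pairs would come from the application data, the host and datastore per VM would have to come from somewhere other than vRops for now, and every name below is hypothetical.

```python
def find_placement_conflicts(pairs, placement):
    """Return pairs of VMs that share a host or a datastore.

    pairs:     list of (vm_a, vm_b) tuples that should be kept apart
    placement: dict mapping VM name -> (host, datastore)
    """
    conflicts = []
    for vm_a, vm_b in pairs:
        host_a, ds_a = placement[vm_a]
        host_b, ds_b = placement[vm_b]
        shared = []
        if host_a == host_b:
            shared.append("host")
        if ds_a == ds_b:
            shared.append("datastore")
        if shared:
            conflicts.append((vm_a, vm_b, shared))
    return conflicts


# Hypothetical application pairs and current placement.
pairs = [("WEB01", "WEB02"), ("APP01", "APP02")]
placement = {
    "WEB01": ("esx-a", "ds-1"),
    "WEB02": ("esx-a", "ds-2"),   # same host as WEB01
    "APP01": ("esx-b", "ds-1"),
    "APP02": ("esx-c", "ds-3"),
}

for vm_a, vm_b, shared in find_placement_conflicts(pairs, placement):
    print(f"{vm_a} and {vm_b} share: {', '.join(shared)}")
```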

Friday, November 29, 2019

HPE ILO 4 SD card issues

I've been a big fan of HPE server hardware for years.  I've been using them for as long as I have been in IT.  We switched to blades very early in their development.  It hasn't always been a fun ride, but no server hardware is perfect through the years.  Most recently we had a mix of HPE blade servers, Gen 8 and Gen 9, as our main VMware hosts in our data centers.  Then we purchased some DL160 Gen 9 servers as a very cheap vSAN 2-node ROBO solution for some of our remote locations.  They worked great.  All of these servers were built without disks for the OS; the blades had no disks in them at all.  They all booted from SD Cards.  We knew that with our redundancy, we could withstand a failure and rebuild quickly if necessary.

Then we started to see the ESXi errors on writing to hardware.  When we rebooted the hosts, they would fail to find the boot disk or SD card or it would look like a corrupted ESXi instance.  After messing with support for a while, we tried replacing the SD Cards, but they still wouldn't work.  Replacing the Motherboard did work.

Turns out there was an issue where the NAND memory on the ilo (which is where the SD card controller is located) would become corrupted.

We had only a few issues for a while but once we had a project to upgrade ESXi from 6.0 to 6.5 and upgrade firmware (BIOS requirement for Spectre\Meltdown) we started to see these issues en masse.  Almost every one of the hosts we tried to upgrade saw these issues.  It was a huge pain and slowed our migration down significantly.

HPE released many different advisories and ilo firmware in an attempt to fix the issue.

This is the latest version of that Advisory.

The Simple Procedure for blades:

  1. Upgrade the firmware to 2.61+. (When I started this it was 2.50.  There were many versions that changed the behavior throughout the year.  Some versions were better than others.)
  2. Run ilo command to Format the NAND Memory.  You can get the Force_Format.xml details from the advisory.
    1. You can run it from a Windows host with the Ilo configuration utility.
    2. You can run it from SSH session from the Enclosure OA.
    3. NOTE:  This will “format” the NAND memory.  It will not erase anything.  Just resets the memory on the ilo that the SD card data runs from.  This can be run while ESXi is online or not, but it is preferable to shut the server off.  It will not format the SD-Card.
  3. Then reset the bay via the e-fuse command. 
    1. From SSH session on OA, run “Show server list”  To view the blade statuses.  Confirm the bay that you want to reset is correct.  This will reset the bay you input, very easy to make a mistake.
    2. Run “Reset Bay XX” Change XX to the Bay number.  Then Type Yes to continue.  Can’t stress enough that the blade will be reset immediately.
    3. Run “Show server list”  to monitor the status of the reset.
  4. After the blade is back up, it should boot automatically.  This usually fixes it.  Sometimes you need to reset it again.
In firmware version 2.51 (I think) HPE added a GUI button for this procedure.  It only appears if the SD-card controller experiences the error.  Once the error disappears, the GUI button for this procedure disappears too.  See the advisory for details.

For DL class servers (not blades), you do not have the ability to reset the e-fuse in order to reset the ilo.  You have to use an AUX Power cycle command.  This is only available via the HPE Restful utility.

For the most part this worked.  Sometimes, a simple power off and power on fixes some boot issues.

Recently, with ilo version 2.70, we've had a couple of failures where the SD-card appears to fail.  I don't believe that the card has actually failed; it seems that the NAND format fixes the ilo, but the ilo then fails to recognize the SD card.  Switching the motherboard did not resolve it, but swapping SD-cards did, although we had to rebuild the host.  I do not have a solution for this yet.  Maybe we'll figure something out.

Suffice to say, this was one main reason we did not re-buy HPE hardware.  There were others though, but this contributed to a couple of really stressful years of upgrades.

Navigating Spectre Meltdown

Today, I catch up on long overdue blog posts.  I've been meaning to post a couple this year, but I've found it very difficult to balance work with vCommunity blogs.  Let's hope this blog helps break the ice.

First up, Spectre\Meltdown.  I did a presentation at the Pittsburgh VMUG earlier this year in February.  I promised to upload the presentation and here it is.  Ignore the fact that it is months late, and let's just celebrate the fact that it made it onto my blog at all.

Download PowerPoint Presentation on Github.

The vBrownbag video of my presentation.

I just wanted to add to this presentation and why I wanted to present on it at all. 

My company tends to live on the bleeding edge of technology. We are not a large enterprise, but we have the need to be up to date and nimble.  Recently we've put a lot of effort into securing our infrastructure via patching, discovering vulnerabilities and removing them.  Our security team was really pushing the patching around the same time that Intel released the Speculative Execution Side-channel vulnerabilities.

It got a lot of attention very quickly.  I mean have you seen the cute and scary mascots?  I had to explain our patching plan to the CIO and Director of IT Security.  So I had to figure it out quickly.  It didn't take long to discover that it was not as simple as normal patching.  It was going to take some time to do it properly.  I had to wade through all the scary discussions and discover the exact process to make it work.

I was told by outside IT comrades that very few VMware\Windows admins actually put as much effort into understanding and explaining the procedures, and that my knowledge would be helpful.  Often they would patch the Windows and\or ESXi hosts but not perform the VM hardware piece, which is essential to tie it all together.  Hence the presentation.

Since early 2018 and the time of this presentation in February 2019, we have seen a regular release of patches for CPU related vulnerabilities.  They all have impressive names and various risk ratings.  Each comes with different procedures to patch.  But with any CPU related patch, there are always multiple levels.

  • OS - Windows\Linux patch.  With Windows, Microsoft had just switched to an all in one cumulative patch.  At the time, they hadn't anticipated a need to install a patch without activating it.  But these CPU patches remove CPU abilities in order to secure the system, which slows the system down.  
  • Windows Registry - So Microsoft had to inject a way to turn on or turn off the mitigation.  So they used a registry key to activate or not.  Desktop systems automatically activate the patch.  Server systems do not.  If you don't add the registry key, your system is not mitigated.
  • vCenter - The ESXi patches require changes to microcode and passing this microcode to the VMs.  In order to pull this off, you need to patch vCenter to be able to control this function.
  • ESXi - Of course there is a patch for ESXi.  Sometimes it contains the necessary CPU microcode.
  • BIOS\CPU Microcode.  The CPU needs patched too.  This changes the CPU instructions.
  • VM hardware - Finally, this new CPU Code needs to be passed to the VM's.  If you are running a cluster with EVC mode enabled (you should), you will need to patch all of them before completing these steps.  Once they are all patched, then you need to perform a cold power cycle of each VM (with VM hardware version 9 at least) to pass on the CPU instruction.
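
Because every one of these levels has to be done before a VM is actually mitigated, it helps to track them as a checklist per VM.  The sketch below is just one way to model that for a Windows server VM; the field names and sample records are hypothetical, and nothing here queries vCenter or Windows.

```python
# One flag per level from the list above (names are made up for this sketch).
REQUIRED_LAYERS = [
    "os_patch",             # Windows\Linux patch installed
    "registry_key",         # mitigation enabled (servers do not enable it by default)
    "vcenter_patch",
    "esxi_patch",
    "bios_microcode",
    "vm_cold_power_cycle",  # done after all hosts in the EVC cluster were patched
]
MIN_VM_HARDWARE_VERSION = 9


def missing_layers(vm):
    """Return the levels that still need work before this VM counts as mitigated."""
    missing = [layer for layer in REQUIRED_LAYERS if not vm.get(layer, False)]
    if vm.get("hardware_version", 0) < MIN_VM_HARDWARE_VERSION:
        missing.append("vm_hardware_version")
    return missing


# Hypothetical records: one fully done, one patched but never activated or power cycled.
done_vm = {layer: True for layer in REQUIRED_LAYERS}
done_vm["hardware_version"] = 13

partial_vm = dict(done_vm, registry_key=False, vm_cold_power_cycle=False)

print(missing_layers(done_vm))
print(missing_layers(partial_vm))
```

The second record is the case described above: the patches are installed, but without the registry key and the cold power cycle the VM is still not mitigated.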

The Reality...  This can be done over time.  But what I have found is that it is really difficult pulling this off in a production data center with hundreds of hosts and thousands of VMs.  All of them have different change windows and expectations.  I've found that by the time I develop a plan to patch for one vulnerability, the next one has come out. The real trick is to keep the bad actors out of your environment.

My team is currently working through ways of automating some of these functions and patching.  I will reserve that for another blog post.

Friday, January 25, 2019

Spectre Meltdown Logos

It's not really a vulnerability until it gets its own logo.  And these are awesome.

Spectre is kinda cute and playful. I mean, it's a smiling ghost with a stick.  What's it going to do, poke you?

Meltdown on the other hand is a bit scary.  A melting shield?

That is all for now.  I just wanted to fill this space with something.