VPLEX for VDI: Will It Work?

The Setup

Our team worked closely over the summer with a number of partners, including VMware, Cisco, VCE, EMC, and partner GreenPages, to develop a response to an RFP issued by one of our mutual customers.  This customer was looking for a VDI solution based on some general specifications: IOPS, disk size, number of users, persistent and non-persistent users, etc.   Another piece of the client's request was that while the solution would start with a certain number of users today, it needed to be robust enough to scale out and be replicated to handle a user base that could potentially triple or quadruple in the coming years. After reviewing the requirements from the customer, a Vblock 300GX was sized for their use.  A second identical array was also configured with RecoverPoint replication at their DR site.  Because half of their users would be running Persistent desktops, those desktops had to be replicated to the DR site, while the users running Non-Persistent desktops would have their profiles and user data replicated via the Enterprise NAS solution, outside of the scope of our response.

Fully configured with B230 M2 blades, FAST Cache, and FAST, our Vblock solution was presented to the customer in late June.  The solution included licensing from VMware and Liquidware, hardware from VCE (EMC and Cisco), and design and implementation expertise from EMC and GreenPages. 

The Challenge

A few weeks after the oral presentation of the response, the customer came back to the RFP team and asked for a change:  How can we add a level of High Availability ABOVE the Vblock?  The first response of the RFP team was to ensure that the customer was aware of all of the redundancies built into the Vblock.  Every component, from the networking, switching, cabling, storage networking, and service processors down to the last wire, was already fully redundant, and this had been a core tenet of our RFP design.

The response from the customer was that they viewed the Vblock itself as a Unit of Failure, and for this critical project wanted to have some way to handle the following situations:

  1.  During a brown-out or a period of “degraded” performance on the Vblock, where performance rather than availability was affected, how could they migrate the workload OFF of that Vblock and onto another Vblock?
  2. In the event of a total Vblock outage (or the outage of a primary system), how could we add VMware-like High Availability to the design?
  3. The process to fail over from one Vblock to a second Vblock, whether automated or manual, had to be operationally practical. 

The Process

Our first course of action was to define what the customer meant by “HA”.  We had to talk about RTO (Recovery Time Objective) and RPO (Recovery Point Objective) in order to distinguish between Disaster Recovery solutions and High Availability.

In Disaster Recovery solutions, you typically have some form of replication to a DR site of both the data and the configuration of the system itself.  DR solutions also include automation software (or complicated run-books) to enable the system affected by disaster to be brought back online in the secondary site within the boundaries of the defined RTO and RPO.

In a High Availability design, the failover time and loss of data are significantly less than in a Disaster Recovery solution.  High Availability is also talked about in the same circles as fault tolerance. 

In the case of VMware, HA is an immediate and automatic restart, on surviving hosts, of all the virtual workloads from a failed host; the loss of data is limited to what was running or unsaved on the user’s desktop at the time of the outage.  In the case of a storage array, HA and fault tolerance are built in through multiple storage processors that can automatically or proactively “fail over” between each other, or disk groups configured to withstand the loss of a single drive or even multiple drives without incurring downtime or loss of data.
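To make the RTO/RPO distinction concrete, here is a minimal Python sketch.  The threshold values are illustrative assumptions for the sketch, not numbers from the RFP or from any product specification:

```python
# Illustrative only: the RTO/RPO thresholds below are assumptions chosen
# to show the shape of the distinction, not values from the RFP.
def classify_solution(rto_seconds: float, rpo_seconds: float) -> str:
    """Rough bucketing of a design by its recovery objectives."""
    # HA-class designs restart workloads within minutes with near-zero
    # committed data loss; DR-class designs accept a longer failover
    # window and some amount of data loss.
    if rto_seconds <= 15 * 60 and rpo_seconds == 0:
        return "High Availability"
    return "Disaster Recovery"

# A synchronous-mirror design: automatic restart, zero committed data loss.
print(classify_solution(rto_seconds=5 * 60, rpo_seconds=0))
# Async replication with run-book failover: hours of RTO, minutes of RPO.
print(classify_solution(rto_seconds=4 * 3600, rpo_seconds=300))
```

The point of the sketch is simply that HA and DR live on the same RTO/RPO axes; they differ in where the objectives land, not in kind.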

With some of these definitions and examples in mind, the RFP team looked over the portfolio of solutions that could be brought to bear on the problem and came up with the following analysis:

  1.  RecoverPoint replication is an obvious place to start, as it is already a component of the solution and part of the VCE product stack.  Pretty quickly the team realized that while RecoverPoint could replicate all of the data to a second array, by itself it would not provide the RAPID failover that is one of the trademarks of HA.  The process to migrate from one Vblock to another with RecoverPoint would not be a simple or practical task.  In addition, scaling out the replication to multiple Vblocks down the road as more users came online would be a design challenge that would add inefficiency and cost to the design.
  2.  Avamar backup is another great tool that is often used in Vblock and virtualization environments and can perform and maintain backups of the desktops without impacting performance on the hosts.  A great solution for BACKUP, but the RESTORE time to get the data OFF of the Avamar grid would take far too long to be considered HA.
  3. VPLEX was the last alternative we considered.  As a way to “replicate” data while enabling the customer to use standard VMware and Cisco tools to manage the HA process, it also gives them the ability to gracefully handle a “brown-out” or to take an entire Vblock offline without incurring any downtime or interruption to the end users. 

More about VPLEX

VPLEX is a storage virtualization solution that allows us to place multiple storage arrays behind a highly available platform.  Each of the arrays provides the storage backend to service the storage requests of our hosts, and VPLEX offers the capability to create a “mirror” between the LUNs created on each of the arrays.  By virtualizing the storage, we can ensure that there are 2 copies of each LUN/Datastore, presented from multiple arrays to each of our hosts.

As the Vblocks will live in the same production datacenter, high-bandwidth / low-latency connections are readily available at both the Ethernet and Fibre Channel layers.  This is going to be crucial as we are going to use virtualization to bring 2 Vblocks together to act as one.

  • Storage layer (virtualized with VPLEX)
  • Compute layer (virtualized with vSphere)
  • Network layer (virtualized with Nexus 1000v)

What does this accomplish?

In the diagram below, the design at a high level has been optimized to provide the highest level of HA for each of the major components.

  • Cluster nodes are spread across chassis and Vblocks with full HA and DRS enabled
  • Datastores are spread across the DAEs and SPs in each array for optimal HA
  • VPLEX mirrors each LUN synchronously between the multiple arrays

How does this level of redundancy affect the design?

Physical Capacity

The first area where this new functionality affects the design is capacity.  In order to make this work and support the required level of performance, the capacity of physical resources has to DOUBLE.  In the diagram, there are 2 fully redundant and identical Vblocks.  In the event that a Vblock fails (however unlikely) or needs to be gracefully removed from servicing production users, the remaining capacity must be able to handle the peak user demand. 
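The doubling requirement falls out of simple arithmetic: if the surviving Vblocks must carry the full peak demand, total deployed capacity is peak demand times N/(N-1) for N Vblocks.  A quick Python sketch (the function name and the demand figure are illustrative, not from the RFP):

```python
def required_capacity(peak_demand: float, vblocks: int = 2) -> float:
    """Total deployed capacity needed so the surviving Vblocks can absorb
    the full peak demand after one whole Vblock is lost."""
    surviving = vblocks - 1
    # Each Vblock is sized for peak_demand / surviving, so the total
    # deployed capacity is peak_demand * vblocks / surviving.
    return peak_demand * vblocks / surviving

# With 2 Vblocks, deployed capacity is exactly double the peak demand.
print(required_capacity(1000, vblocks=2))  # 2000.0
```

Note that the same formula shows where the efficiency comes from later as more Vblocks are added: the N/(N-1) overhead shrinks as N grows.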

Cluster Sizes and Compute Capacity

The design affects the cluster sizing as well.  Depending on the architecture of the solution and the tools used to optimize the storage (Composer, MCS, PVS, etc.), there are limitations and caveats to cluster sizes that must be adhered to.  This will most likely result in cluster sizes of 8 to 10 nodes under normal production circumstances, leaving 4 to 5 remaining nodes when the failover condition is in effect.  Those “remaining nodes” are the sizing we then need to use as the basis for our compute layer.  In this example, the core calculations would most likely factor in the loss of one whole side of the cluster, as well as a single node among the “remaining nodes” on the active Vblock.  This design then requires us to cut our desktops per core in half…which results in the doubling of the compute layer.
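The worst case described above (one whole side of the cluster plus one more node on the surviving side) can be sketched in a few lines of Python.  The function names and the desktop count are illustrative assumptions:

```python
def surviving_nodes(cluster_size: int) -> int:
    """Worst case in the 2-Vblock design: lose half the cluster (one
    Vblock) plus one additional failed node on the surviving side."""
    return cluster_size // 2 - 1

def desktops_per_node(total_desktops: int, cluster_size: int) -> float:
    """Size each node so the worst-case survivors alone can run every
    desktop -- this is what cuts desktops-per-core roughly in half."""
    return total_desktops / surviving_nodes(cluster_size)

# An 8-node cluster plans around only 3 surviving nodes in the worst case.
print(surviving_nodes(8))            # 3
print(desktops_per_node(600, 8))     # 200.0 desktops per surviving node
```

This is why sizing against the “remaining nodes” rather than the full cluster more than doubles the per-node headroom required.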

In the example diagram, there are 4 active clusters, each with 8 nodes.  Represented by different colors to make the diagram easier to read, the Red cluster has nodes in each of the chassis. In the event that a node, or an entire chassis, were to go offline, the worst case would be the loss of 2 nodes.  In the event of a total failure of a Vblock (or a planned outage), the worst-case outage would still leave the cluster with 50% of its capacity.  Following the logic that all of our physical resources are doubled, a 50% loss of resources should not affect our performance levels.

Storage Capacity and Optimization

At the storage layer, we are mirroring the storage between the arrays on the two Vblocks.  Each LUN on Array 1 is mirrored on Array 2.  Because VPLEX depends on receiving write acknowledgements from both arrays, the performance and topology of the 2 arrays have to be the same.  While we could TECHNICALLY use a SAS-based LUN on Array 1 and an NL-SAS LUN on Array 2…the performance would sink to that of the slowest-performing disk.  In the same way, we need to ensure that VPLEX-mirrored LUNs from both arrays use the same FAST and FAST Cache policies.  If the slowest array determines the performance, we need to ensure that both arrays are equally optimized, and to some degree, equally utilized.
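The “sinks to the slowest disk” behavior is just the acknowledgement rule in action: a synchronous mirror cannot acknowledge a write until every leg has.  A tiny sketch (latency figures are made-up illustrative numbers, not measurements):

```python
def mirrored_write_latency_ms(leg_latencies_ms: list[float]) -> float:
    """A synchronous mirror acknowledges a write only after EVERY leg has
    acknowledged it, so the slowest leg sets the pace for the whole LUN."""
    return max(leg_latencies_ms)

# Matched SAS-backed LUNs on both arrays: latencies stay close.
print(mirrored_write_latency_ms([2.0, 2.2]))   # 2.2
# Mismatched SAS vs NL-SAS legs: every write waits on the slow leg.
print(mirrored_write_latency_ms([2.0, 9.0]))   # 9.0
```

This is the whole argument for keeping disk types, FAST policies, and utilization symmetric across the two arrays.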

Once again, in our 2-Vblock design, each LUN is mirrored on BOTH Vblocks.  If one of the Red LUNs goes down, or for that matter an entire array, there would be no disruption to the end users, and once the LUNs become available again, the re-sync process will begin automatically.

Network Design

The design ramifications for the IP and storage networking when integrating 2 Vblocks are a bit more precarious at first.  In the case of the storage networking, the VSANs on each of the Vblocks have to match in order for each host and network path to be accessible to the VPLEX, and then from the VPLEX to the hosts themselves.

On the IP networking side, the Nexus 1000v virtual switches need to be “connected” so that the configuration of the networking port groups on each host is identical, in order to enable transparent vMotion between all nodes in the clusters, regardless of which Vblock they currently reside on.

Also, keep in mind that the VLANs created to host the VMs and the host servers also have to be available all the way through the network.

How does it work?

The operational side of the solution was a key requirement from the customer.  The solution has to be manageable without pages and pages of run-books that make migration and failover operations impractical to perform. 

The solution also has to perform the failover quickly and with as much automation as possible.  Have we got a solution for you:

Operational on the compute side…

On the compute side, the cluster nodes are distributed throughout the UCS chassis on EACH of the Vblocks…as equally as possible.  Failing over the compute layer from one Vblock to another is accomplished using standard vSphere vMotion/DRS capabilities.  Place all of the nodes in Vblock #1 into maintenance mode, and let the migrations begin.  Once all of the nodes are clear, your compute now resides on the other Vblock.  Because each Vblock sees the same storage presented from the VPLEX, there is really no difference between a compute node on one Vblock or the other. 

HA on the compute side works the same way.  Using a standard HA/DRS cluster, if one or all of the nodes in a cluster fail, the “remaining nodes” will pick up the desktops that were running when the failure occurred.  If this happens to an entire Vblock, then ½ of the nodes in your cluster are lost, and according to our new capacity guidelines, all of the workload will now fall onto the remaining nodes in the remaining Vblock.

Operational on the storage side…

On the storage side, each of the hosts is actually connected to the VPLEX for all of its storage needs.  This should also include the UCS boot LUNs, as well as the LUNs servicing the VDI desktops.  The inclusion of the Boot LUNs ensures that the compute nodes will stay up and running in the event of a total outage at the array level.  If the Boot LUNs are not included, the compute nodes will crash if the array that they are hosted on is removed.  By putting them onto the VPLEX, that condition is removed and HA is achieved at yet another level in the design.

In the event that a storage array needs to be removed gracefully from production, this can be done with zero impact to the end user desktops.  While there is no “Maintenance Mode” button in VPLEX today to take one of the arrays out of service, this can be easily achieved by editing the zone set in the UCS Fabric Manager to disconnect the VPLEX from the backend array.  Once disconnected, the VPLEX will continue to service storage requests to the hosts using the remaining array while the offline array is worked on.  Once the zone set is re-enabled, a full re-sync of the mirror will commence automatically and bring the multi-array mirror back into production.

The HA portion of the solution functions in exactly the same way.  Should the entire Vblock fail and go offline, the VPLEX will continue to service storage requests and wait for the failed array to come back online.

What we tested…

In testing the solution at the VCE Labs in Marlborough, MA, we successfully tested the following scenarios with the customer:

  • VMware vSphere Maintenance Mode Cluster-wide
    • Removed all of the blades from Vblock #2 from production and demonstrated the DRS/vMotion effect of moving all of the workloads to the blades on Vblock #1.  Zero end user downtime or impact.
  • Cisco UCS Blade failure
    • Forcibly removed a blade from the vSphere cluster and demonstrated the HA migration of the active virtual desktops to other blades in the cluster. 
    • Only users on that blade are impacted and the only data-loss is what was open on the desktop and had not yet been saved.
      • HA restart enables persistent user access in approximately 5 minutes. 
      • Non-persistent users are able to log back in to waiting desktops immediately.
  • VNX Storage Processor failure
Pulled the power cables on one of the Storage Processors in one of the Vblocks and demonstrated the failover capabilities inherent in the VNX architecture.  The expected result is that a portion of the users currently serviced by that Storage Processor may see a brief pause; however, that will be masked by the read cache available through the VPLEX.
  • Cisco Fabric Interconnect failure
    • Removed the power cables from one of the Cisco 6140 Fabric Interconnect modules, demonstrating the HA capabilities of the Cisco UCS platform.  Zero end user downtime or impact.
  • Cisco MDS Storage Switch failure
    • Removed the power cables from one of the Cisco MDS switches.  Zero end user downtime or impact.
  • Cisco UCS Chassis Fabric Interconnect cable failure
    • Removed a single fabric interconnect cable from the back of a UCS Chassis.  Zero end user downtime or impact.
  • Cisco Nexus 5xxx Switch failure
    • Removed the power cords from a Cisco 5xxx switch and witnessed Zero end user downtime or impact.
  • Cisco UCS Chassis Power Supply failure
Removed the power supplies from a UCS chassis.  Similar to removing a single blade, VMware HA initiates a failover for all of the failed blades and moves their workload to the remaining compute nodes.  Only users on the blades in the affected chassis are affected.  All other users see zero downtime or impact.
  • EMC Unisphere LUN creation process
    • An operational demonstration to show the customer how easy it is to create a new block-based LUN for use with VPLEX through Unisphere.
  • EMC VPLEX Volume and Mirror creation process
    • An operational demonstration to show the customer how easy it is to create a new multi-array VPLEX mirrored LUN presented to existing VMware hosts.
  • EMC VNX manual failover
    • In this test, we modified the zoneset from the UCS fabric manager to demonstrate a manual removal of an array from behind the VPLEX, and also to simulate what an unplanned outage of the array would look like to the end users and the VPLEX.  Zero end user downtime or impact.
  • Catastrophic Vblock failure
    • In this test, we performed an extreme simulation and forcibly switched off the power at the back of the Compute and Storage racks on one of the Vblocks.  As expected, VMware HA initiated failovers of all of the active workloads to the remaining nodes in the remaining Vblock, and after several minutes, had started each of the virtual desktops back up.  All of the storage resources remained available and unaffected to the hosts during this process.
  • Vblock return from catastrophic failure
    • The last test from the customer’s perspective was to see the “failed” Vblock return automatically to production and resume servicing desktops without requiring any intervention.  In demonstrating this functionality, the power was reintroduced to the compute and storage cabinets in the Vblock that had been powered down, and once the system booted, the blades returned to an active state, and DRS began to migrate desktops over to the newly returned compute resources as expected.  The VPLEX initiated a full re-sync of the mirror automatically in reestablishing storage services.

Now what?

As was mentioned in the opening requirements, the solution presented has to be able to scale beyond the initial use cases.  This design was specifically put together to support that requirement and to increase the efficiencies of the multi-Vblock approach as more Vblocks are added to it. 

In the 2-Vblock design, the Compute and Storage layers were doubled from a capacity standpoint to ensure that the loss of a single Vblock would not result in performance degradation.  As we consider adding a third Vblock to the mix, our distribution of the compute and storage layers can potentially change to offer more efficiency.  In the compute layer, we had to double the resources, as we potentially had to account for the loss of 50% of them.  If we add a 3rd Vblock to the design, we can cut the number of nodes in each cluster from 8 to 6 (or increase to 12 as the solution design permits), as our worst-case loss would be no bigger than 2 blades…maintaining our performance profile of 4 blades in the cluster.
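The compute-side efficiency gain can be checked with a small sketch: stripe a cluster's nodes evenly across N Vblocks, lose one Vblock, and see what survives.  The function name and return shape are illustrative assumptions:

```python
def failover_sizing(nodes_per_cluster: int, vblocks: int) -> dict:
    """Worst-case cluster state after losing one entire Vblock, assuming
    cluster nodes are striped evenly across all Vblocks."""
    # Losing one Vblock removes that Vblock's share of the cluster.
    lost = nodes_per_cluster // vblocks
    remaining = nodes_per_cluster - lost
    return {
        "lost": lost,
        "remaining": remaining,
        # How much compute must be over-provisioned relative to the
        # survivors that actually carry the load.
        "overprovision_factor": nodes_per_cluster / remaining,
    }

# 2 Vblocks, 8-node cluster: lose 4, keep 4 -- resources doubled (2.0x).
print(failover_sizing(8, 2))
# 3 Vblocks, 6-node cluster: lose only 2, keep 4 -- overhead drops to 1.5x.
print(failover_sizing(6, 3))
```

Both scenarios leave 4 surviving blades per cluster, which is why the 3-Vblock layout preserves the performance profile while shrinking the over-provisioning overhead.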

The storage is a bit different.  In order to provide seamless active-active failover, we always have to DOUBLE the storage and mirror each LUN to a second Vblock.  Where we can gain some efficiency is in spreading the LUNs across the storage infrastructure to reduce the risk of one failure impacting a single array from a performance standpoint.  The relationship between the compute and storage layers is so well abstracted by the VPLEX that we can arrange and balance the LUNs actively over time on the arrays as groups of users and use cases demonstrate their different performance requirements.