HyperFlex

  1. HyperFlex
    1. Controller VM (CVM)
    2. IOVisor
    3. StorFS
      1. Deduplication
      2. Compression
    4. Cluster
  2. Stretched cluster
  3. HX Edge
  4. Backup in HX

HyperFlex

  • abstracts DAS across several hosts into a single distributed datastore
  • ESXi, Hyper-V, K8s
  • NFS interface for ESXi, SMB for Hyper-V; not accessible from outside
  • external storage support: FCoE, iSCSI, NFS
  • I/O is always distributed: VM data is not copied to a local datastore when a VM moves
    • synchronous write to SSD
    • synchronous replication, many-to-many
  • cache: SSD in front of HDD (hybrid nodes)
  • mirroring instead of parity: faster rebuild; reads and rebuilds run in parallel across nodes
  • up to 64 nodes in cluster: 32 HX + 32 compute-only
    • compute:converged ratio – 1:1 or 2:1 (depends on license)
  • logical availability zone (LAZ)
    • availability group, ≈ SRLG
    • single failure within zone – whole zone is considered to have failed
    • no more than one copy of the data per zone
    • requires at least 8 nodes in cluster
  • write I/O (see the sketch after this list):
    1. write to cache
    2. compress
    3. ack I/O to VM
    4. deduplicate
  • authentication
    • Kerberos + AD
    • NTP required
  • hardware management: UCSM or Intersight
    • requires jumbo frames on the upstream network
  • min cluster size:
    • LAZ: 8 nodes
    • RF3: 5 nodes (to sustain double failure)
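
A minimal sketch of the write ordering above, with invented names and structures (not HX code; zlib stands in for the Snappy compression HX actually uses):

    import zlib

    class WriteCache:
        def __init__(self, mirrors):
            self.active = []        # active cache segment: receives new I/O
            self.mirrors = mirrors  # cache copies held on secondary CVMs

        def write(self, block: bytes) -> str:
            self.active.append(block)               # 1. write to the SSD cache segment
            self.active[-1] = zlib.compress(block)  # 2. compress inline (zlib stand-in for Snappy)
            for m in self.mirrors:                  # synchronous mirroring to secondary CVMs
                m.append(self.active[-1])
            # 4. deduplication is deferred until destage (see Deduplication)
            return "ack"                            # 3. acknowledge the I/O to the VM

    cache = WriteCache(mirrors=[[], []])     # two mirror copies ≈ RF3
    print(cache.write(b"guest data" * 100))  # -> ack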

Controller VM (CVM)

  • Ubuntu-based
  • provides access to physical storage
  • hosts HX Connect

IOVisor

  • controls hypervisor access to storage, distributes I/O over CVMs in cluster
  • presents the NFS mount point to the hypervisor
  • VIB ≡ driver
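
A rough sketch of how an I/O distribution layer like IOVisor could spread guest I/O across CVMs, assuming simple hash-based chunk placement; the function and CVM names are hypothetical, and the real IOVisor/StorFS placement logic is not reproduced here:

    import hashlib

    CVMS = ["cvm-1", "cvm-2", "cvm-3"]  # hypothetical controller VMs
    CHUNK = 32 * 1024 * 1024            # 32 MB chunks, as described under StorFS

    def owner_cvm(vmdk: str, offset: int) -> str:
        """Pick which CVM handles the I/O for a given chunk of a VMDK."""
        chunk_id = offset // CHUNK
        digest = hashlib.sha1(f"{vmdk}:{chunk_id}".encode()).digest()
        return CVMS[int.from_bytes(digest[:4], "big") % len(CVMS)]

    print(owner_cvm("vm01.vmdk", 0))           # every host computes the same owner,
    print(owner_cvm("vm01.vmdk", 40 * 2**20))  # so I/O spreads over the whole cluster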

StorFS

  • HXDP file system
    • 70% usable: the rest is required for HX operation, otherwise performance drops
      • 8% for StorFS operation
      • above 70% utilization, cluster upgrade and expansion are not allowed
    • circular buffer: data is always written to head
      • no seek penalty on HDD (writes are sequential)
      • even wear across SSD cells
    • cleanup process marks unneeded blocks as free
    • 32 MB chunks
    • stripe ≡ 8 chunks – atomic unit for HX operation
  • read latency is less predictable on hybrid nodes: reads should be served from cache, misses hit the HDD backend
  • cache:
    • read: hybrid nodes only (All-Flash does not need it – SSDs are read directly at full speed)
    • write
      • hybrid and All-Flash
      • active segment – I/O, passive segment – destaging
      • 3 levels: resiliency in case the primary fails during destaging
        • level 1 – master, on the primary CVM
        • levels 2 and 3 – slaves, mirror the data on secondary CVMs
        • destaging:
          • primary CVM performs deduplication and then mirrors level 1 passive cache to secondary CVMs
          • fresh data is copied to read cache
    • when the cache is full (see the sketch after this list):
      • destage data to backend
      • swap active and passive roles for cache segments
  • faster rebuild than RAID: data is simply re-copied, no parity recalculation
  • more resilient than RAID: with RAID, the server/controller is a single point of failure
  • native snapshots change only metadata
    • more efficient than the hypervisor's native snapshot chain (RoW) – lower read latency
  • self-healing delay
    • disk failure: 1 minute
    • node failure: 2 hours
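
A simplified sketch of the active/passive write-cache segments and the destage-and-swap behaviour described above; names and structures are invented, and the capacity tier is modelled as a plain append-only log:

    class CacheSegments:
        def __init__(self, capacity: int, backend: list):
            self.capacity = capacity
            self.backend = backend  # capacity tier, log-structured: append only
            self.active = []        # segment receiving new I/O
            self.passive = []       # segment being destaged in the background

        def write(self, block: bytes) -> None:
            if len(self.active) >= self.capacity:
                self._destage_and_swap()
            self.active.append(block)

        def _destage_and_swap(self) -> None:
            # destage: flush the passive segment to the capacity tier,
            # always appending at the head of the log (circular buffer)
            self.backend.extend(self.passive)
            self.passive.clear()
            # swap roles: the full active segment drains next
            self.active, self.passive = self.passive, self.active

    backend: list = []
    cache = CacheSegments(capacity=4, backend=backend)
    for i in range(10):
        cache.write(f"block-{i}".encode())
    print(len(cache.active), len(cache.passive), len(backend))  # 2 4 4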

Deduplication

  • per VMDK
  • block-level, uses fingerprint for comparison
  • during destaging: after the write to the SSD cache, before the write to HDD
  • modifies pointers in the inode
  • inline
  • cannot be disabled
  • useful for ReadyClones ≡ VDI; driven by access frequency: the more often data is accessed, the more likely it is to be deduplicated
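
A minimal sketch of fingerprint-based block deduplication at destage time; the hash choice and data structures are assumptions for illustration, not HX internals:

    import hashlib

    capacity_tier = {}  # fingerprint -> physical block (stored once)
    inode = []          # per-VMDK list of fingerprints (pointers)

    def destage(block: bytes) -> None:
        fp = hashlib.sha256(block).hexdigest()  # fingerprint of the block
        if fp not in capacity_tier:             # new data: store the block once
            capacity_tier[fp] = block
        inode.append(fp)                        # duplicates only add a pointer

    for b in (b"A" * 4096, b"B" * 4096, b"A" * 4096):
        destage(b)
    print(len(inode), "logical blocks,", len(capacity_tier), "physical blocks")  # 3 logical, 2 physical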

Compression

  • inline
  • cannot be disabled
  • on cache level, algorithm – Google Snappy
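
Snappy itself is easy to try from Python via the python-snappy binding; this only demonstrates the algorithm's behaviour (fast, moderate ratio), not how HX invokes it:

    import snappy  # pip install python-snappy

    data = b"hyperflex " * 1000
    packed = snappy.compress(data)
    assert snappy.decompress(packed) == data
    print(f"{len(data)} -> {len(packed)} bytes")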

Cluster

  • must match across all nodes
    • type
    • count and type of HDDs
    • server type
  • single UCS domain
  • no support: single FI, FEX, breakout cables
  • direct mode connection
  • each HX server must connect to the same port number on both FIs
  • split-brain “protection”: use single pair of FIs for the whole cluster
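
A hypothetical pre-deployment check for the homogeneity rules above (field names and values are invented, this is not a Cisco tool):

    def homogeneous(nodes: list[dict]) -> bool:
        """All nodes must agree on type, disk count/type and server model."""
        keys = ("node_type", "disk_count", "disk_type", "server_model")
        reference = {k: nodes[0][k] for k in keys}
        return all({k: n[k] for k in keys} == reference for n in nodes)

    nodes = [
        {"node_type": "hybrid", "disk_count": 12, "disk_type": "HDD", "server_model": "HX240c M5"},
        {"node_type": "hybrid", "disk_count": 12, "disk_type": "HDD", "server_model": "HX240c M5"},
    ]
    print(homogeneous(nodes))  # True: safe to form or expand the cluster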

Stretched cluster

  • RF4: RF2 + RF2
  • uses witness VM to determine which site is online
    • Intersight can act as witness
    • witness link requirements: 100 Mbps, ≤ 200 ms RTT
    • if the witness is unreachable when a site fails – cluster shutdown
  • 2 sites only
  • inter-site requirement: 100 Mbps, 5 ms RTT
  • active/active (unlike native replication, which is active/standby)
  • storage is erased if the stretched cluster is created from a standalone cluster
  • VMware only
  • read: local site
  • write: local and remote site
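
A toy illustration of why the witness matters: the surviving site plus the witness forms a majority, while a site that has also lost the witness cannot, so the cluster shuts down rather than risk split-brain (invented logic, not HX code):

    def site_stays_online(site_up: bool, other_site_up: bool, witness_up: bool) -> bool:
        votes = int(site_up) + int(other_site_up) + int(witness_up)
        return site_up and votes >= 2  # majority of {site A, site B, witness}

    print(site_stays_online(True, False, True))   # True: survivor + witness have quorum
    print(site_stays_online(True, False, False))  # False: witness lost too -> shutdown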

HX Edge

  • 2-4 nodes
    • 2-node: Intersight for quorum, RF2 only
    • 4-node: in case of 2+2 partition becomes read-only (cannot reach quorum)
  • VMware only
  • cannot be converted to a full-featured cluster later
  • cannot be expanded
  • hardware management – Intersight
  • single-CPU (one socket) servers are allowed
  • mLOM only, no VIC/NIC
  • does not support SED, NVMe
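
A toy quorum check for the 2-node and 4-node cases above (invented helper, not HX code):

    def has_quorum(reachable_votes: int, total_votes: int) -> bool:
        return reachable_votes > total_votes // 2  # strict majority required

    # 2-node cluster: Intersight contributes a third vote as tie-breaker
    print(has_quorum(reachable_votes=2, total_votes=3))  # surviving node + Intersight -> True
    # 4-node cluster split 2+2: neither half reaches a majority -> read-only
    print(has_quorum(reachable_votes=2, total_votes=4))  # False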

Backup in HX

  • the VM is briefly suspended during VMDK operations (VM stun) to ensure consistency
  • Veeam
    • uses HX datastore directly via CVM
    • NFSv3 is not used because of a potential ESXi performance hit: NFS locking when Veeam and the VM are on different hosts