Nutanix

  1. Distributed storage fabric (DSF)
  2. Acropolis hypervisor (AHV)
  3. CVM
  4. Stargate
  5. Cassandra
  6. Zookeeper
  7. Curator
  8. Cerebro
  9. Prism
  10. Oplog
  11. Information lifecycle management (ILM)
  12. Metadata
  13. Unified cache
  14. High availability
  15. Deduplication
  16. Compression
  17. Erasure coding (EC)
  18. Snapshot
  19. Shadow clone
  20. Acropolis block service (ABS)
  21. Acropolis file services (AFS)
  22. Calm
  23. Disk layout
  24. I/O path

Distributed storage fabric (DSF)

  • uses CVM on every node
  • supports ESXi, Hyper-V, XenServer, AHV
  • node limitation
    • AHV: ∞
    • ESXi: 64
  • components
    • storage pool: HDD, SSD
    • container: ≈ datastore
    • vDisk: any file > 512 KB, consists of extents
    • extent: 1 MB of logically contiguous data, consists of slices
    • slice: 4-8 KB, atomic unit for read/write/modify (offset-to-slice mapping sketched below)
    • extent group: 1-4 MB of physically contiguous data stored on disk
  • per-vDisk lock to avoid race conditions (simultaneous modification of the same data)
  • every node can access every cache, SSD, HDD within cluster
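  • a minimal sketch (Python; illustrative only – the function name, the fixed 8 KB slice size, and leaving extent groups out are assumptions) of how a logical vDisk offset maps onto extents and slices with the sizes above:

    EXTENT_SIZE = 1 * 1024 * 1024  # 1 MB of logically contiguous data
    SLICE_SIZE = 8 * 1024          # slices are 4-8 KB; 8 KB assumed here

    def locate(offset: int) -> tuple[int, int]:
        """Return (extent index, slice index within that extent) for a vDisk byte offset."""
        extent_index = offset // EXTENT_SIZE
        slice_index = (offset % EXTENT_SIZE) // SLICE_SIZE
        return extent_index, slice_index

    # a read at offset 5 MB + 20 KB lands in extent 5, slice 2
    assert locate(5 * 1024 * 1024 + 20 * 1024) == (5, 2)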

Acropolis hypervisor (AHV)

  • free
  • based on CentOS KVM
  • AHV Turbo: optimized I/O path; RDMA for CVM-to-CVM communication, support for fast media such as 3D XPoint

CVM

  • supports NFS, iSCSI, SMB
  • prefers data locality when it does not add latency; otherwise reads remotely
  • data locality: hot data is kept on the node running the VM; when the VM moves, its data migrates closer to it

Stargate

  • I/O manager
  • processes hypervisor requests, received via NFS, SMB, iSCSI
  • potentially uses write-in-place ⇒ HDD fragmentation, SSD flash wear

Cassandra

  • NoSQL metadata store
  • key-value in ring-like structure

Zookeeper

  • configuration store: hosts, disks, IPs

Curator

  • MapReduce to distribute load
  • disk balancing, data scrubbing, data tiering, data rebuild after failure

Cerebro

  • replication, DR manager
  • schedule snapshots per PD
  • data migration and failover
  • protection domain (PD): group of VMs that must be replicated together; consists of CGs (modeled in the sketch below)
  • consistency group (CG): group of VMs and files within a PD that are snapshotted together, so a restore is crash-consistent
  • replication topology: node-to-node, many-to-one, many-to-many
  • selects one master CVM per site that makes the decisions
    • master and slaves instruct Stargate for replication
    • inter-site communication is between masters
  • async replication: 1h RPO
  • sync replication: requires ≤ 5 ms RTT and a 10 Gbps link
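  • a toy model (Python; class and field names are invented, not the Cerebro API) of the PD/CG hierarchy above – a PD groups CGs on one replication schedule, a CG lists the VMs/files snapshotted together:

    from dataclasses import dataclass, field

    @dataclass
    class ConsistencyGroup:
        name: str
        vms: list[str] = field(default_factory=list)    # VMs snapshotted together
        files: list[str] = field(default_factory=list)  # files in the crash-consistent set

    @dataclass
    class ProtectionDomain:
        name: str
        groups: list[ConsistencyGroup] = field(default_factory=list)
        rpo_minutes: int = 60  # async replication above targets a 1 h RPO

        def members(self) -> list[str]:
            """All VMs across the PD's consistency groups (replicated on one schedule)."""
            return [vm for cg in self.groups for vm in cg.vms]

    pd = ProtectionDomain("pd-app1", [ConsistencyGroup("cg-db", vms=["db-vm1", "db-vm2"])])
    assert pd.members() == ["db-vm1", "db-vm2"]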

Prism

  • monitoring
  • management per cluster via CLI, HTML5 UI, and API (Prism Central manages several clusters)

Oplog

  • buffer for random write
    • coalesce before drain
    • allocation – per vDisk, 6 GB max
  • sequential writes (≈ 1.5 MB of outstanding I/O from the OS) bypass the Oplog and go straight to the extent store (sketched below)
  • synchronous replication to other node before ACK
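  • a simplified sketch (Python; the bypass rule, the replicate() stub, and all names are assumptions) of the write path above – random writes are buffered per vDisk and replicated before ACK, large sequential writes skip the Oplog:

    OPLOG_MAX_BYTES = 6 * 1024**3              # per-vDisk Oplog cap from the notes
    SEQUENTIAL_THRESHOLD = int(1.5 * 1024**2)  # ≈ 1.5 MB of outstanding write I/O

    class VDiskWritePath:
        def __init__(self):
            self.oplog: list[bytes] = []
            self.oplog_bytes = 0

        def write(self, data: bytes, outstanding_io: int) -> str:
            if outstanding_io >= SEQUENTIAL_THRESHOLD or self.oplog_bytes + len(data) > OPLOG_MAX_BYTES:
                return "extent store"   # sequential stream or full Oplog: skip the buffer
            self.oplog.append(data)     # buffer the random write for coalescing before drain
            self.oplog_bytes += len(data)
            self.replicate(data)        # synchronous copy to another node's Oplog before ACK
            return "oplog"

        def replicate(self, data: bytes) -> None:
            pass  # placeholder: real replication is CVM-to-CVM over the network

    path = VDiskWritePath()
    assert path.write(b"x" * 4096, outstanding_io=4096) == "oplog"
    assert path.write(b"x" * 4096, outstanding_io=2 * 1024**2) == "extent store"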

Information lifecycle management (ILM)

  • classifies data as hot or cold based on access frequency (sketched below)
  • affects pre-loading of data into the cache and deduplication
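  • a minimal sketch (Python; the one-hour window and 3-access threshold are invented) of frequency-based hot/cold classification:

    import time

    class AccessTracker:
        def __init__(self, window_s: float = 3600.0, hot_threshold: int = 3):
            self.window_s = window_s
            self.hot_threshold = hot_threshold
            self.accesses: dict[str, list[float]] = {}

        def record(self, block_id: str) -> None:
            self.accesses.setdefault(block_id, []).append(time.time())

        def is_hot(self, block_id: str) -> bool:
            now = time.time()
            recent = [t for t in self.accesses.get(block_id, []) if now - t <= self.window_s]
            self.accesses[block_id] = recent          # drop stale accesses
            return len(recent) >= self.hot_threshold  # hot data stays in SSD / gets cached

    tracker = AccessTracker()
    for _ in range(3):
        tracker.record("extent-42")
    assert tracker.is_hot("extent-42")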

Metadata

  • checksum: verified on every read

Unified cache

  • SSD + CVM RAM
  • 4 KB granularity
  • if data is still in the Oplog (not yet drained), reads are served from the Oplog
  • eviction is based on LRU (least recently used); sketched below
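  • a sketch of LRU eviction (Python; capacity and key scheme are arbitrary, not the actual unified cache code):

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity_entries: int):
            self.capacity = capacity_entries
            self.entries: OrderedDict[tuple[str, int], bytes] = OrderedDict()

        def get(self, vdisk: str, chunk_4k: int) -> bytes | None:
            key = (vdisk, chunk_4k)               # cached at 4 KB granularity
            if key not in self.entries:
                return None
            self.entries.move_to_end(key)         # mark as most recently used
            return self.entries[key]

        def put(self, vdisk: str, chunk_4k: int, data: bytes) -> None:
            key = (vdisk, chunk_4k)
            self.entries[key] = data
            self.entries.move_to_end(key)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry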

High availability

  • block ≡ chassis holding up to 4 nodes; block awareness requires at least 3 blocks
  • replicas are placed on different blocks (sketched below)
  • replication is many-to-many across the cluster
  • after the failed CVM recovers, I/O is redirected back to the local CVM
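  • a toy placement function (Python; the greedy strategy and all names are assumptions) for "replicas go to different blocks":

    def place_replicas(nodes_by_block: dict[str, list[str]], rf: int, source_node: str) -> list[str]:
        """Pick rf-1 replica nodes, each from a block not used yet."""
        source_block = next(b for b, nodes in nodes_by_block.items() if source_node in nodes)
        used_blocks = {source_block}
        replicas: list[str] = []
        for block, nodes in nodes_by_block.items():
            if len(replicas) == rf - 1:
                break
            if block not in used_blocks and nodes:
                replicas.append(nodes[0])
                used_blocks.add(block)
        if len(replicas) < rf - 1:
            raise ValueError("not enough blocks for block-aware placement")
        return replicas

    cluster = {"block-A": ["A1", "A2"], "block-B": ["B1"], "block-C": ["C1"]}
    assert place_replicas(cluster, rf=2, source_node="A1") == ["B1"]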

Deduplication

  • post-process on write
  • inline on read
  • data larger than 64 KB is fingerprinted immediately on write; smaller data is fingerprinted later, post-process in the extent store
  • fingerprints are SHA-1 hashes over 16 KB chunks (sketched below)
  • can be disabled
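  • a sketch of the fingerprinting above (Python; the dict index and names are stand-ins, not the real metadata store):

    import hashlib

    CHUNK = 16 * 1024  # fingerprints are computed over 16 KB units

    def fingerprints(data: bytes) -> list[str]:
        return [hashlib.sha1(data[i:i + CHUNK]).hexdigest() for i in range(0, len(data), CHUNK)]

    index: dict[str, str] = {}  # fingerprint -> where the single physical copy lives

    def ingest(data: bytes, location: str) -> int:
        """Store only chunks whose fingerprint is unseen; return bytes actually written."""
        written = 0
        for chunk_idx, fp in enumerate(fingerprints(data)):
            if fp not in index:
                index[fp] = f"{location}+{chunk_idx * CHUNK}"
                written += min(CHUNK, len(data) - chunk_idx * CHUNK)
        return written

    assert ingest(b"A" * CHUNK * 2, "egroup-1") == CHUNK  # second identical 16 KB chunk deduplicated
    assert ingest(b"A" * CHUNK, "egroup-2") == 0          # already fingerprinted: nothing new written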

Compression

  • through capacity optimization engine
  • inline or post-process (default)
  • can be disabled

Erasure coding (EC)

  • post-process: no sooner than 1h after write, only in extent store
  • per VM

Snapshot

  • each snapshot keeps its own complete block map ⇒ no snapshot chain to traverse on read, unlike chained redirect-on-write implementations (sketched below)
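  • a toy illustration (Python; names and the dict-based map are assumptions) – the block map is copied at snapshot time, so resolving a block never walks a chain of deltas:

    class VDisk:
        def __init__(self):
            self.block_map: dict[int, str] = {}   # logical block -> physical location
            self.snapshots: dict[str, dict[int, str]] = {}

        def write(self, block: int, location: str) -> None:
            self.block_map[block] = location      # new writes go to new locations

        def snapshot(self, name: str) -> None:
            self.snapshots[name] = dict(self.block_map)  # full copy: no parent pointer, no chain

        def read_from_snapshot(self, name: str, block: int) -> str | None:
            return self.snapshots[name].get(block)  # single lookup, regardless of snapshot count

    vd = VDisk()
    vd.write(0, "egroup-1+0")
    vd.snapshot("snap-1")
    vd.write(0, "egroup-2+0")                     # live vDisk diverges after the snapshot
    assert vd.read_from_snapshot("snap-1", 0) == "egroup-1+0"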

Shadow clone

  • local caching of a base vDisk by each CVM for data locality
  • condition: the vDisk is read by the local CVM plus ≥ 2 remote CVMs, and all access is read I/O (sketched below)
  • if the base vDisk changes, all shadow clones are dropped
  • useful for VDI
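  • a one-function sketch (Python; function and argument names are assumptions) of the trigger condition above:

    def should_shadow_clone(readers: set[str], local_cvm: str, writes_seen: bool) -> bool:
        """True when the local CVM plus >= 2 remote CVMs read the vDisk and no writes occurred."""
        remote_readers = readers - {local_cvm}
        return (not writes_seen) and (local_cvm in readers) and (len(remote_readers) >= 2)

    assert should_shadow_clone({"cvm-1", "cvm-2", "cvm-3"}, local_cvm="cvm-1", writes_seen=False)
    assert not should_shadow_clone({"cvm-1", "cvm-2"}, local_cvm="cvm-1", writes_seen=False)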

Acropolis block service (ABS)

  • volume group (VG): set of vDisks presented as LUNs
  • no client-side MPIO for iSCSI
    • instead, the initiator logs in to the data services IP and is redirected to an active Stargate (sketched below)
    • a new login (and redirect) is required after a failure
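  • a plain-Python stand-in (no real iSCSI; names are assumptions) for the login flow above – log in to the data services IP, get redirected to a healthy Stargate, log in again after a failure:

    class DataServicesIP:
        def __init__(self, stargates: list[str]):
            self.stargates = stargates
            self.healthy = set(stargates)

        def login(self) -> str:
            """Redirect the initiator to some currently healthy Stargate portal."""
            for sg in self.stargates:
                if sg in self.healthy:
                    return sg
            raise RuntimeError("no healthy Stargate available")

        def fail(self, stargate: str) -> None:
            self.healthy.discard(stargate)

    dsip = DataServicesIP(["stargate-1", "stargate-2", "stargate-3"])
    portal = dsip.login()   # initial login: redirected to stargate-1
    dsip.fail(portal)       # that Stargate dies mid-session
    portal = dsip.login()   # client logs in again and is redirected to stargate-2
    assert portal == "stargate-2"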

Acropolis file services (AFS)

  • NAS
  • file server ≡ at least 3 file server VMs (FSVMs), each with 4 vCPU and 12 GB RAM

Calm

  • app-level orchestration
  • describes deployments as blueprints
  • users can request services via a marketplace
  • RBAC
  • private and public cloud

Disk layout

I/O path