Nutanix

  1. Distributed storage fabric (DSF)
  2. Acropolis hypervisor (AHV)
  3. CVM
  4. Stargate
  5. Cassandra
  6. Zookeeper
  7. Curator
  8. Cerebro
  9. Prism
  10. Oplog
  11. Information lifecycle management (ILM)
  12. Metadata
  13. Unified cache
  14. High availability
  15. Deduplication
  16. Compression
  17. Erasure coding (EC)
  18. Snapshot
  19. Shadow clone
  20. Acropolis block service (ABS)
  21. Acropolis file services (AFS)
  22. Calm
  23. Disk layout
  24. I/O path

Distributed storage fabric (DSF)

  • uses CVM on every node
  • supports ESXi, Hyper-V, XenServer, AHV
  • node limitation
    • AHV: ∞
    • ESXi: 64
  • components
    • storage pool: HDD, SSD
    • container: ≈ datastore
    • vDisk: any file > 512 KB, consists of extents
    • extent: 1 MB of logically contiguous data, consists of slices
    • slice: 4-8 KB, atomic unit for read/write/modify (offset-to-slice mapping sketched below)
    • extent group: 1-4 MB of physically contiguous data stored on disk
  • per-vDisk lock to avoid race conditions (simultaneous modification of the same data)
  • every node can access every cache, SSD, HDD within cluster
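  • a minimal sketch (Python; illustrative only – the function name, the fixed 8 KB slice size, and leaving extent groups out are assumptions) of how a logical vDisk offset maps onto extents and slices with the sizes above:

    EXTENT_SIZE = 1 * 1024 * 1024  # 1 MB of logically contiguous data
    SLICE_SIZE = 8 * 1024          # slices are 4-8 KB; 8 KB assumed here

    def locate(offset: int) -> tuple[int, int]:
        """Return (extent index, slice index within that extent) for a vDisk byte offset."""
        extent_index = offset // EXTENT_SIZE
        slice_index = (offset % EXTENT_SIZE) // SLICE_SIZE
        return extent_index, slice_index

    # a read at offset 5 MB + 20 KB lands in extent 5, slice 2
    assert locate(5 * 1024 * 1024 + 20 * 1024) == (5, 2)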

Acropolis hypervisor (AHV)

  • free
  • based on CentOS KVM
  • AHV Turbo: optimized I/O path; RDMA for CVM-to-CVM communication, support for fast media such as 3D XPoint

CVM

  • supports NFS, iSCSI, SMB
  • prefers data locality when it does not add latency; otherwise reads remotely
  • data locality: hot data is kept on the node running the VM; when the VM moves, its data migrates closer to it

Stargate

  • I/O manager
  • processes hypervisor requests, received via NFS, SMB, iSCSI
  • potentially uses write-in-place ⇒ HDD fragmentation, SSD flash wear

Cassandra

  • NoSQL metadata store
  • key-value in ring-like structure

Zookeeper

  • configuration store: hosts, disks, IPs

Curator

  • MapReduce to distribute load
  • disk balancing, data scrubbing, data tiering, data rebuild after failure

Cerebro

  • replication, DR manager
  • schedule snapshots per PD
  • data migration and failover
  • protection domain (PD): group of VMs that must be replicated together; consists of CGs (modeled in the sketch below)
  • consistency group (CG): group of VMs and files within a PD that are snapshotted together, so a restore is crash-consistent
  • replication topology: node-to-node, many-to-one, many-to-many
  • selects one master CVM per site that makes the decisions
    • master and slaves instruct Stargate for replication
    • inter-site communication is between masters
  • async replication: 1h RPO
  • sync replication: requires ≤ 5 ms RTT and a 10 Gbps link
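  • a toy model (Python; class and field names are invented, not the Cerebro API) of the PD/CG hierarchy above – a PD groups CGs on one replication schedule, a CG lists the VMs/files snapshotted together:

    from dataclasses import dataclass, field

    @dataclass
    class ConsistencyGroup:
        name: str
        vms: list[str] = field(default_factory=list)    # VMs snapshotted together
        files: list[str] = field(default_factory=list)  # files in the crash-consistent set

    @dataclass
    class ProtectionDomain:
        name: str
        groups: list[ConsistencyGroup] = field(default_factory=list)
        rpo_minutes: int = 60  # async replication above targets a 1 h RPO

        def members(self) -> list[str]:
            """All VMs across the PD's consistency groups (replicated on one schedule)."""
            return [vm for cg in self.groups for vm in cg.vms]

    pd = ProtectionDomain("pd-app1", [ConsistencyGroup("cg-db", vms=["db-vm1", "db-vm2"])])
    assert pd.members() == ["db-vm1", "db-vm2"]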

Prism

  • monitoring
  • management per cluster via CLI, HTML5 UI, and API (Prism Central manages several clusters)

Oplog

  • buffer for random write
    • coalesce before drain
    • allocation – per vDisk, 6 GB max
  • sequential writes (≈ 1.5 MB of outstanding I/O from the OS) bypass the Oplog and go straight to the extent store (sketched below)
  • synchronous replication to other node before ACK
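  • a simplified sketch (Python; the bypass rule, the replicate() stub, and all names are assumptions) of the write path above – random writes are buffered per vDisk and replicated before ACK, large sequential writes skip the Oplog:

    OPLOG_MAX_BYTES = 6 * 1024**3              # per-vDisk Oplog cap from the notes
    SEQUENTIAL_THRESHOLD = int(1.5 * 1024**2)  # ≈ 1.5 MB of outstanding write I/O

    class VDiskWritePath:
        def __init__(self):
            self.oplog: list[bytes] = []
            self.oplog_bytes = 0

        def write(self, data: bytes, outstanding_io: int) -> str:
            if outstanding_io >= SEQUENTIAL_THRESHOLD or self.oplog_bytes + len(data) > OPLOG_MAX_BYTES:
                return "extent store"   # sequential stream or full Oplog: skip the buffer
            self.oplog.append(data)     # buffer the random write for coalescing before drain
            self.oplog_bytes += len(data)
            self.replicate(data)        # synchronous copy to another node's Oplog before ACK
            return "oplog"

        def replicate(self, data: bytes) -> None:
            pass  # placeholder: real replication is CVM-to-CVM over the network

    path = VDiskWritePath()
    assert path.write(b"x" * 4096, outstanding_io=4096) == "oplog"
    assert path.write(b"x" * 4096, outstanding_io=2 * 1024**2) == "extent store"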

Information lifecycle management (ILM)

  • classifies data as hot or cold based on access frequency (sketched below)
  • affects pre-loading of data into the cache and deduplication
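  • a minimal sketch (Python; the one-hour window and 3-access threshold are invented) of frequency-based hot/cold classification:

    import time

    class AccessTracker:
        def __init__(self, window_s: float = 3600.0, hot_threshold: int = 3):
            self.window_s = window_s
            self.hot_threshold = hot_threshold
            self.accesses: dict[str, list[float]] = {}

        def record(self, block_id: str) -> None:
            self.accesses.setdefault(block_id, []).append(time.time())

        def is_hot(self, block_id: str) -> bool:
            now = time.time()
            recent = [t for t in self.accesses.get(block_id, []) if now - t <= self.window_s]
            self.accesses[block_id] = recent          # drop stale accesses
            return len(recent) >= self.hot_threshold  # hot data stays in SSD / gets cached

    tracker = AccessTracker()
    for _ in range(3):
        tracker.record("extent-42")
    assert tracker.is_hot("extent-42")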

Metadata

  • checksum: verified on every read

Unified cache

  • SSD + CVM RAM
  • 4 KB granularity
  • if data is still in the Oplog (not yet drained), reads are served from the Oplog
  • eviction is based on LRU (least recently used); sketched below
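  • a sketch of LRU eviction (Python; capacity and key scheme are arbitrary, not the actual unified cache code):

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity_entries: int):
            self.capacity = capacity_entries
            self.entries: OrderedDict[tuple[str, int], bytes] = OrderedDict()

        def get(self, vdisk: str, chunk_4k: int) -> bytes | None:
            key = (vdisk, chunk_4k)               # cached at 4 KB granularity
            if key not in self.entries:
                return None
            self.entries.move_to_end(key)         # mark as most recently used
            return self.entries[key]

        def put(self, vdisk: str, chunk_4k: int, data: bytes) -> None:
            key = (vdisk, chunk_4k)
            self.entries[key] = data
            self.entries.move_to_end(key)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry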

High availability

  • block ≡ chassis holding up to 4 nodes; block awareness requires at least 3 blocks
  • replicas are placed on different blocks (sketched below)
  • replication is many-to-many across the cluster
  • after the failed CVM recovers, I/O is redirected back to the local CVM
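  • a toy placement function (Python; the greedy strategy and all names are assumptions) for "replicas go to different blocks":

    def place_replicas(nodes_by_block: dict[str, list[str]], rf: int, source_node: str) -> list[str]:
        """Pick rf-1 replica nodes, each from a block not used yet."""
        source_block = next(b for b, nodes in nodes_by_block.items() if source_node in nodes)
        used_blocks = {source_block}
        replicas: list[str] = []
        for block, nodes in nodes_by_block.items():
            if len(replicas) == rf - 1:
                break
            if block not in used_blocks and nodes:
                replicas.append(nodes[0])
                used_blocks.add(block)
        if len(replicas) < rf - 1:
            raise ValueError("not enough blocks for block-aware placement")
        return replicas

    cluster = {"block-A": ["A1", "A2"], "block-B": ["B1"], "block-C": ["C1"]}
    assert place_replicas(cluster, rf=2, source_node="A1") == ["B1"]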

Deduplication

  • post-process on write
  • inline on read
  • data larger than 64 KB is fingerprinted immediately on write; smaller data is fingerprinted later, post-process in the extent store
  • fingerprints are SHA-1 hashes over 16 KB chunks (sketched below)
  • can be disabled
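  • a sketch of the fingerprinting above (Python; the dict index and names are stand-ins, not the real metadata store):

    import hashlib

    CHUNK = 16 * 1024  # fingerprints are computed over 16 KB units

    def fingerprints(data: bytes) -> list[str]:
        return [hashlib.sha1(data[i:i + CHUNK]).hexdigest() for i in range(0, len(data), CHUNK)]

    index: dict[str, str] = {}  # fingerprint -> where the single physical copy lives

    def ingest(data: bytes, location: str) -> int:
        """Store only chunks whose fingerprint is unseen; return bytes actually written."""
        written = 0
        for chunk_idx, fp in enumerate(fingerprints(data)):
            if fp not in index:
                index[fp] = f"{location}+{chunk_idx * CHUNK}"
                written += min(CHUNK, len(data) - chunk_idx * CHUNK)
        return written

    assert ingest(b"A" * CHUNK * 2, "egroup-1") == CHUNK  # second identical 16 KB chunk deduplicated
    assert ingest(b"A" * CHUNK, "egroup-2") == 0          # already fingerprinted: nothing new written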

Compression

  • through capacity optimization engine
  • inline or post-process (default)
  • can be disabled

Erasure coding (EC)

  • post-process: no sooner than 1h after write, only in extent store
  • per VM

Snapshot

  • each snapshot keeps its own complete block map ⇒ no snapshot chain to traverse on read, unlike chained redirect-on-write implementations (sketched below)
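  • a toy illustration (Python; names and the dict-based map are assumptions) – the block map is copied at snapshot time, so resolving a block never walks a chain of deltas:

    class VDisk:
        def __init__(self):
            self.block_map: dict[int, str] = {}   # logical block -> physical location
            self.snapshots: dict[str, dict[int, str]] = {}

        def write(self, block: int, location: str) -> None:
            self.block_map[block] = location      # new writes go to new locations

        def snapshot(self, name: str) -> None:
            self.snapshots[name] = dict(self.block_map)  # full copy: no parent pointer, no chain

        def read_from_snapshot(self, name: str, block: int) -> str | None:
            return self.snapshots[name].get(block)  # single lookup, regardless of snapshot count

    vd = VDisk()
    vd.write(0, "egroup-1+0")
    vd.snapshot("snap-1")
    vd.write(0, "egroup-2+0")                     # live vDisk diverges after the snapshot
    assert vd.read_from_snapshot("snap-1", 0) == "egroup-1+0"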

Shadow clone

  • local caching of a base vDisk by each CVM for data locality
  • condition: the vDisk is read by the local CVM plus ≥ 2 remote CVMs, and all access is read I/O (sketched below)
  • if the base vDisk changes, all shadow clones are dropped
  • useful for VDI
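  • a one-function sketch (Python; function and argument names are assumptions) of the trigger condition above:

    def should_shadow_clone(readers: set[str], local_cvm: str, writes_seen: bool) -> bool:
        """True when the local CVM plus >= 2 remote CVMs read the vDisk and no writes occurred."""
        remote_readers = readers - {local_cvm}
        return (not writes_seen) and (local_cvm in readers) and (len(remote_readers) >= 2)

    assert should_shadow_clone({"cvm-1", "cvm-2", "cvm-3"}, local_cvm="cvm-1", writes_seen=False)
    assert not should_shadow_clone({"cvm-1", "cvm-2"}, local_cvm="cvm-1", writes_seen=False)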

Acropolis block service (ABS)

  • volume group (VG): set of vDisks presented as LUNs
  • no client-side MPIO for iSCSI
    • instead, the initiator logs in to the data services IP and is redirected to an active Stargate (sketched below)
    • a new login (and redirect) is required after a failure
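  • a plain-Python stand-in (no real iSCSI; names are assumptions) for the login flow above – log in to the data services IP, get redirected to a healthy Stargate, log in again after a failure:

    class DataServicesIP:
        def __init__(self, stargates: list[str]):
            self.stargates = stargates
            self.healthy = set(stargates)

        def login(self) -> str:
            """Redirect the initiator to some currently healthy Stargate portal."""
            for sg in self.stargates:
                if sg in self.healthy:
                    return sg
            raise RuntimeError("no healthy Stargate available")

        def fail(self, stargate: str) -> None:
            self.healthy.discard(stargate)

    dsip = DataServicesIP(["stargate-1", "stargate-2", "stargate-3"])
    portal = dsip.login()   # initial login: redirected to stargate-1
    dsip.fail(portal)       # that Stargate dies mid-session
    portal = dsip.login()   # client logs in again and is redirected to stargate-2
    assert portal == "stargate-2"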

Acropolis file services (AFS)

  • NAS
  • file server ≡ at least 3 file server VMs (FSVMs), each with 4 vCPU and 12 GB RAM

Calm

  • app-level orchestration
  • describes deployments as blueprints
  • users can request services via a marketplace
  • RBAC
  • private and public cloud

Disk layout

I/O path