- Distributed storage fabric (DSF)
- Acropolis hypervisor (AHV)
- CVM
- Stargate
- Cassandra
- Zookeeper
- Curator
- Cerebro
- Prism
- Oplog
- Information lifecycle management (ILM)
- Metadata
- Unified cache
- High availability
- Deduplication
- Compression
- Erasure coding (EC)
- Snapshot
- Shadow clone
- Acropolis block service (ABS)
- Acropolis file services (AFS)
- Calm
- Disk layout
- I/O path
Distributed storage fabric (DSF)
- uses CVM on every node
- supports ESXi, Hyper-V, XenServer, AHV
- node limitation
- AHV: ∞
- ESXi: 64
- components
- storage pool: HDD, SSD
- container: ≈ datastore
- vDisk: any file > 512 KB, consists of extents
- extent: 1 MB of logically contiguous data, consists of slices
- slice: 4-8 KB, atomic unit for read/write/modify
- extent group: 1-4 MB of physically contiguous data on disk
- per-vDisk lock prevents race conditions (simultaneous data modification)
- every node can access every cache, SSD, HDD within cluster
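The vDisk, extent, and slice hierarchy above can be read as plain address arithmetic. The sketch below is only an illustration: the function name `locate` is hypothetical, and the 4 KB slice size is an assumption picked from the 4-8 KB range in the notes.

```python
EXTENT_SIZE = 1024 * 1024  # 1 MB extent, as stated in the notes
SLICE_SIZE = 4 * 1024      # assumption: 4 KB slices (notes give 4-8 KB)


def locate(offset: int) -> tuple[int, int, int]:
    """Map a byte offset inside a vDisk to (extent, slice, byte-in-slice)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    extent, within_extent = divmod(offset, EXTENT_SIZE)
    slice_idx, within_slice = divmod(within_extent, SLICE_SIZE)
    return extent, slice_idx, within_slice


# Offset 5 MB + 10,000 bytes lands in extent 5, slice 2, byte 1808.
print(locate(5 * 1024 * 1024 + 10_000))  # (5, 2, 1808)
```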
Acropolis hypervisor (AHV)
- free
- based on CentOS KVM
- AHV Turbo: RDMA, 3D XPoint for CVM-to-CVM communication
CVM
- supports NFS, iSCSI, SMB
- prefers data locality if latency does not suffer, otherwise remote read
- data locality: hot data kept local, data follows the VM after migration
Stargate
- I/O manager
- processes hypervisor requests, received via NFS, SMB, iSCSI
- potentially uses write-in-place ⇒ HDD fragmentation, SSD flash wear
Cassandra
- NoSQL metadata store
- key-value in ring-like structure
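A ring-like key-value layout is commonly built with consistent hashing. The sketch below shows that general technique, not Cassandra's or Nutanix's actual implementation; the `Ring` class, the node names, and the use of SHA-256 for ring positions are all assumptions made for illustration.

```python
import bisect
import hashlib


class Ring:
    """Toy consistent-hash ring: each key belongs to the next node clockwise."""

    def __init__(self, nodes):
        # Place every node on the ring at the position given by its hash.
        self._points = sorted((self._hash(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._points]

    @staticmethod
    def _hash(value: str) -> int:
        # Assumption: any stable hash works for the sketch; SHA-256 is used here.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def owner(self, key: str) -> str:
        # Find the first node position after the key, wrapping around the ring.
        i = bisect.bisect_right(self._positions, self._hash(key))
        return self._points[i % len(self._points)][1]


ring = Ring(["cvm-1", "cvm-2", "cvm-3"])  # hypothetical node names
print(ring.owner("vdisk:42"))
```

The lookup is deterministic, so any node can compute which peer holds a given metadata key without asking a central directory.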
Zookeeper
- configuration store: hosts, disks, IPs
Curator
- MapReduce to distribute load
- disk balancing, data scrubbing, data tiering, data rebuild after failure
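Disk balancing expressed as MapReduce could look roughly like this. It is a single-process sketch of the idea only: the record format `(disk, size_mb)`, the function names, and the tolerance threshold are hypothetical and not taken from Curator.

```python
from collections import defaultdict


def map_phase(extent_groups):
    """Map: emit one (disk, size_mb) pair per extent group record."""
    for disk, size_mb in extent_groups:
        yield disk, size_mb


def reduce_phase(pairs):
    """Reduce: sum the emitted sizes per disk."""
    totals = defaultdict(int)
    for disk, size_mb in pairs:
        totals[disk] += size_mb
    return dict(totals)


def classify(usage, tolerance_mb=1):
    """Split disks into those above and below the mean usage, outside a tolerance."""
    mean = sum(usage.values()) / len(usage)
    over = sorted(d for d, u in usage.items() if u > mean + tolerance_mb)
    under = sorted(d for d, u in usage.items() if u < mean - tolerance_mb)
    return over, under


records = [("disk-a", 4), ("disk-a", 4), ("disk-a", 4), ("disk-b", 4), ("disk-c", 2)]
usage = reduce_phase(map_phase(records))
print(usage, classify(usage))
```

A real balancer would then schedule moves from the overloaded disks to the underloaded ones; that step is left out here.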
Cerebro
- replication, DR manager
- schedule snapshots per PD
- data migration and failover
- protection domain (PD): group of VMs that must be replicated together, consists of CGs
- consistency group (CG): group of VMs and files that provide crash-consistent state on restore
- replication topology: node-to-node, many-to-one, many-to-many
- selects one master CVM per site that makes the decisions
- master and slaves instruct Stargate for replication
- inter-site communication is between masters
- async replication: 1h RPO
- sync replication: needs ≤ 5 ms RTT, 10 Gbps link
Prism
- monitoring
- management: CLI, HTML5, API per cluster (Prism Central – several clusters)
Oplog
- buffer for random write
- coalesce before drain
- allocation – per vDisk, 6 GB max
- sequential write ≡ 1.5 MB I/O from OS
- synchronous replication to other node before ACK
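"Coalesce before drain" can be pictured as merging buffered random writes into contiguous runs before they are flushed. The sketch below assumes writes are `(offset, bytes)` pairs and that a later write wins on overlap; the function name and data layout are illustrative, not the Oplog's actual format.

```python
def coalesce(writes):
    """Merge buffered (offset, data) writes into contiguous runs; newest data wins."""
    # Replay writes in arrival order so overlapping bytes resolve to the latest write.
    image = {}
    for offset, data in writes:
        for i, byte in enumerate(data):
            image[offset + i] = byte

    # Walk the touched positions in order and glue adjacent bytes into one run.
    runs = []
    for pos in sorted(image):
        if runs and runs[-1][0] + len(runs[-1][1]) == pos:
            runs[-1][1].append(image[pos])
        else:
            runs.append((pos, bytearray([image[pos]])))
    return [(start, bytes(buf)) for start, buf in runs]


buffered = [(8, b"cc"), (0, b"aaaa"), (4, b"bbbb"), (2, b"ZZ")]
print(coalesce(buffered))  # [(0, b'aaZZbbbbcc')]
```

Four small random writes leave the buffer as one sequential write, which is the point of draining through a coalescing step.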
Information lifecycle management (ILM)
- decides whether the data are hot or cold, based on frequency of access to data
- affects data pre-loading to cache, deduplication
Metadata
- checksum: verified on every read
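Verify-on-read can be sketched as follows. CRC32 from Python's `zlib` stands in for whatever checksum the platform really uses (the notes do not say), and the dict-backed store and function names are hypothetical.

```python
import zlib


class ChecksumError(Exception):
    """Raised when stored data no longer matches its recorded checksum."""


def write_block(store: dict, key: str, data: bytes) -> None:
    # Record the checksum next to the data at write time.
    store[key] = (data, zlib.crc32(data))


def read_block(store: dict, key: str) -> bytes:
    # Recompute and compare on every read, as the notes describe.
    data, expected = store[key]
    if zlib.crc32(data) != expected:
        raise ChecksumError(f"checksum mismatch for {key!r}")
    return data


store = {}
write_block(store, "extent-1", b"payload")
print(read_block(store, "extent-1"))
```

On a mismatch, a replicated system would typically fetch a healthy copy from another replica instead of returning the corrupted bytes.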
Unified cache
- SSD + CVM RAM
- 4 KB granularity
- if data are in Oplog and not yet drained: read from Oplog
- data freshness decided by LRU (least recently used)
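An LRU cache with 4 KB granularity can be sketched with `collections.OrderedDict`. This is a single-tier toy under stated assumptions: the class name and the capacity measured in blocks are hypothetical, and the real cache spans SSD and CVM RAM.

```python
from collections import OrderedDict

BLOCK = 4096  # 4 KB granularity, as in the notes


class LRUCache:
    """Toy block cache that evicts the least recently used 4 KB block."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self._blocks = OrderedDict()  # block index -> data, oldest first

    def get(self, offset: int):
        key = offset // BLOCK
        if key not in self._blocks:
            return None  # cache miss
        self._blocks.move_to_end(key)  # mark as most recently used
        return self._blocks[key]

    def put(self, offset: int, data: bytes) -> None:
        key = offset // BLOCK
        self._blocks[key] = data
        self._blocks.move_to_end(key)
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict least recently used


cache = LRUCache(capacity_blocks=2)
cache.put(0, b"A")
cache.put(4096, b"B")
cache.get(0)          # touch block 0, so block 1 becomes the oldest
cache.put(8192, b"C")  # evicts block 1
print(cache.get(4096), cache.get(100))
```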
High availability
- block ≡ 4-node chassis with at least 3 nodes
- replication – over different blocks
- many-to-many replication
- after restore, I/O goes back to primary CVM
Deduplication
- post-process on write
- inline on read
- if data > 64 KB: hash immediately; if smaller: hash later in extent store
- SHA-1 hash over 16 KB
- can be disabled
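Fingerprint-based deduplication with SHA-1 over 16 KB chunks can be sketched as below. The chunk store as a plain dict and the function names are assumptions for illustration; only the hash and chunk size come from the notes.

```python
import hashlib

CHUNK = 16 * 1024  # SHA-1 is computed over 16 KB chunks, as in the notes


def dedup_write(data: bytes, chunk_store: dict) -> list:
    """Store each unique chunk once; return the fingerprints that describe data."""
    refs = []
    for start in range(0, len(data), CHUNK):
        chunk = data[start:start + CHUNK]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        chunk_store.setdefault(fingerprint, chunk)  # keep only the first copy
        refs.append(fingerprint)
    return refs


def dedup_read(refs: list, chunk_store: dict) -> bytes:
    """Rebuild the original data from its fingerprint list."""
    return b"".join(chunk_store[fp] for fp in refs)


store = {}
payload = b"A" * CHUNK + b"B" * CHUNK + b"A" * CHUNK
refs = dedup_write(payload, store)
print(len(refs), "chunks referenced,", len(store), "chunks stored")
```

Three logical chunks are referenced but only two are stored, because the first and third chunk share a fingerprint.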
Compression
- through capacity optimization engine
- inline or post-process (default)
- can be disabled
Erasure coding (EC)
- post-process: no sooner than 1h after write, only in extent store
- per VM
Snapshot
- own block map for every snapshot ⇒ no chaining (like in redirect-on-write)
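The "own block map per snapshot" idea can be sketched like this: a snapshot copies the current map, and later writes go to new locations, so a read through the snapshot never has to walk a chain of parents. The `VDisk` class and its fields are hypothetical and heavily simplified.

```python
class VDisk:
    """Toy vDisk: logical blocks map to append-only physical locations."""

    def __init__(self):
        self.block_map = {}   # logical block -> physical location
        self.storage = {}     # physical location -> data
        self._next_location = 0

    def write(self, block: int, data: bytes) -> None:
        # Never overwrite in place: new data gets a fresh location.
        location = self._next_location
        self._next_location += 1
        self.storage[location] = data
        self.block_map[block] = location

    def snapshot(self) -> dict:
        # The snapshot owns a full copy of the map, so reads need no chain traversal.
        return dict(self.block_map)

    def read(self, block: int, block_map: dict = None) -> bytes:
        active = self.block_map if block_map is None else block_map
        return self.storage[active[block]]


disk = VDisk()
disk.write(0, b"v1")
snap = disk.snapshot()
disk.write(0, b"v2")
print(disk.read(0), disk.read(0, snap))
```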
Shadow clone
- local caching of vDisk within CVM for data locality purpose
- condition: vDisk is read by the local CVM and ≥ 2 remote CVMs, and all access is read I/O
- if base vDisk changes, all shadow clones are deleted
- useful for VDI
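The trigger condition above reduces to a small predicate. The function and parameter names below are hypothetical; the thresholds are the ones given in the notes.

```python
def should_shadow_clone(local_reader: bool, remote_readers: int, has_writes: bool) -> bool:
    """True when the vDisk is read locally, by at least 2 remote CVMs, and never written."""
    return local_reader and remote_readers >= 2 and not has_writes


# A VDI base image read by many nodes qualifies; a vDisk with writes does not.
print(should_shadow_clone(True, 3, False), should_shadow_clone(True, 3, True))
```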
Acropolis block service (ABS)
- volume group (VG): set of vDisks presented as LUNs
- no MPIO required for iSCSI
- instead, login to the data services IP is redirected to an active Stargate
- new login required on failure
Acropolis file services (AFS)
- NAS
- file server ≡ 3 FS VMs (4 vCPU, 12 GB RAM each)
Calm
- app-level orchestration
- describes deployment in blueprint
- user can request service via marketplace
- RBAC
- private and public cloud