- Architecture
- Storage types
- Workload
- Disk interfaces
- Controller models
- Virtualization
- vStorage API for array integration (VAAI)
- Offloaded data transfer (ODX)
- vVols
Architecture
Shared storage model for virtualization (SSM)
- level 4: application
- level 3: file/record
- DB
- FS
- level 2: block aggregation
- host: LVM
- network
- device: array controller
- level 1: storage device
- LBA in HDD/SSD
Enterprise grade
- more overprovisioned capacity
- more cache
- better/more complex controller and firmware
- more channels
- capacitors for powering cache: provide time to save data during a power outage
- batteries for powering cache: keep the cache powered longer during an outage
- longer warranties
Storage drive
- IOPS: usually quoted for 512-byte sectors, because that maximizes the figure
- IOPS ≠ latency
- vendors usually do not list latency, because IOPS is measured under artificial conditions
- logical block addressing (LBA): abstracts internal addressing as a contiguous address space
- SMART: self-monitoring, analysis and reporting technology
- standard to predict failure
- write modes
- write-through: write to backend, then acknowledge
- write-back: write to cache, then acknowledge, then write to backend
- redirect-on-write
- cache mirror: only for write (data for read is on the backend, available at the same speed)
- data scrubbing:
- usually low priority background process
- detect and fix (e.g., via CRC32) bit rot: physical medium degradation due to inactivity (demagnetization, capacitor/cell charge leakage)
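A minimal Python sketch of the scrubbing idea, assuming per-sector CRC32 checksums and a redundant copy to repair from (both hypothetical; real arrays use their own checksums and parity/mirror data):

```python
import zlib

# Scrubbing sketch: each "sector" is stored with the CRC32 computed at write
# time; the background scrubber re-reads sectors, recomputes the checksum,
# and repairs mismatches from a redundant copy (here: a second dict).

def write_sector(store, lba, data):
    store[lba] = (data, zlib.crc32(data))

def scrub(store, mirror):
    repaired = []
    for lba, (data, crc) in store.items():
        if zlib.crc32(data) != crc:        # bit rot detected
            store[lba] = mirror[lba]        # fix from the redundant copy
            repaired.append(lba)
    return repaired

primary, mirror = {}, {}
write_sector(primary, 0, b"payload")
write_sector(mirror, 0, b"payload")
primary[0] = (b"paylaod", primary[0][1])    # simulate silent corruption
print(scrub(primary, mirror))               # -> [0]
```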
Tape
- high capacity, sequential performance, low power
- media degrades
- not suited to every backup type (e.g., bad for differential backups)
- linear tape-open (LTO) format:
- a drive can read tapes down to its own LTO level − 2 (inclusive)
- a drive can write tapes down to its own LTO level − 1 (inclusive)
- shoe-shining:
- host data rate < tape streaming speed ⇒ drive repeatedly stops, rewinds and restarts
- slows down backup, wears the tape and the drive
- multiplexing (interleaving several backup streams) can avoid shoe-shining, but slows down restore because data are interleaved
- virtual tape library (VTL):
- emulates tape, uses HDD on backend
- native deduplication
HDD
- zoned data recording (ZDR): outer tracks have more sectors than inner tracks ⇒ I/O speed on edge is higher ⇒ controller tries to write to outer tracks
- short stroking: use only part of capacity closer to the edge
- factory format ≡ low-level format
- physical creation of tracks and sectors + control (servo) marking
- queuing is required to optimize I/O: rearranging reads/writes reduces rotational latency and seek time (see the elevator sketch after this list)
- 3.5-inch form factor ≡ same size as a 3.5-inch floppy drive
- 7200 RPM platter size – 3.7 inch
- 10K/15K RPM has smaller diameter
- head thrashing ≡ excessive seeking
- copying from one disk to another is faster than copying within the same disk – no seeking between source and destination (relevant e.g. for backup)
- sustained transfer rate (STR): I/O speed for sequential data on different tracks
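Sketch of the queuing idea referenced above: an elevator (SCAN-style) reorder of pending requests by LBA so the head sweeps in one direction. The LBAs and starting head position are made up.

```python
# Elevator (SCAN-style) reordering: service requests ahead of the head in
# ascending order, then the remaining ones behind it on the way back.
def elevator_order(pending_lbas, head_pos):
    ahead = sorted(lba for lba in pending_lbas if lba >= head_pos)
    behind = sorted((lba for lba in pending_lbas if lba < head_pos), reverse=True)
    return ahead + behind

print(elevator_order([95, 3, 40, 180, 11], head_pos=50))  # [95, 180, 40, 11, 3]
```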
SSD
- NAND, NOR (rare)
- cell → page (4k, 8k, 16k) → block (128-512k)
- a new disk has all cells initialized to 1
- write (program): per page
- erase: per block ⇒ rewrite (read + erase + write) is slow (relative cost ≈ 1/10/100)
- P/E: program/erase
- write cliff: shortage of pre-erased blocks to redirect incoming writes into ⇒ in-place rewrite (read + erase + write) is required
- cell types
- SLC
- 0-1
- 100k P/E
- lowest power consumption
- MLC
- 00-11
- 10k P/E
- 2 bits per cell require 4 distinct voltage levels
- eMLC
- enterprise MLC
- better ECC
- more cell overprovision to replace dead cells and distribute load
- TLC
- 000-111
- 5k P/E
- SLC cache
- required for write coalescing: reduces cell wear
- write combine: perform only the last write instead of one flash write per host write
- not required for read
- garbage collection
- improves performance, reduces P/E
- erases unused blocks in the background
- repackages partially full blocks
- wear leveling
- distribute load over all cells
- types
- background
- in-flight
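A toy flash translation layer sketch tying together out-of-place writes, garbage collection and wear tracking; the sizes and in-memory "flash" are invented for illustration, and real FTLs are far more involved:

```python
# Toy FTL sketch: out-of-place writes via a logical-to-physical map, a garbage
# collector that copies still-valid pages out and erases dirty blocks, and
# per-block erase counters as raw wear-leveling data.

PAGES_PER_BLOCK = 4

class ToyFTL:
    def __init__(self, n_blocks):
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(n_blocks)]
        self.erase_count = [0] * n_blocks      # wear per block
        self.l2p = {}                          # logical page -> (block, page)
        self.cursor = (0, 0)                   # next free physical page

    def _advance(self):
        b, p = self.cursor
        self.cursor = (b, p + 1) if p + 1 < PAGES_PER_BLOCK else (b + 1, 0)

    def write(self, lpn, data):
        b, p = self.cursor
        self.blocks[b][p] = (lpn, data)        # out-of-place write
        self.l2p[lpn] = (b, p)                 # any old copy is now stale
        self._advance()

    def gc(self):
        # copy still-valid pages out, erase every used block, rewrite compactly
        live = [self.blocks[b][p] for b, p in self.l2p.values()]
        for b, block in enumerate(self.blocks):
            if any(block):
                self.blocks[b] = [None] * PAGES_PER_BLOCK
                self.erase_count[b] += 1
        self.cursor, self.l2p = (0, 0), {}
        for lpn, data in live:
            self.write(lpn, data)

ftl = ToyFTL(n_blocks=8)
for i in range(6):
    ftl.write(lpn=0, data=f"v{i}")             # repeated rewrites leave stale pages
ftl.gc()
print(ftl.l2p, ftl.erase_count)                # {0: (0, 0)} [1, 1, 0, 0, 0, 0, 0, 0]
```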
NVMe
- for SSD instead of SCSI
- more command queues: 64k queues
- deeper queues: 64k commands per queue
- small command set (only ~13 commands)
- memory-mapped over PCIe rather than a traditional I/O protocol:
- no separate storage controller (HBA) needed
- point-to-point PCIe instead of a shared SCSI bus
- OS → storage on PCIe, replaces AHCI+SCSI
NVMe-oF
- carries NVMe commands over a network transport (FC, IB, RoCEv2, iWARP)
I/O blender
- several VMs each doing sequential I/O on the same host ⇒ the storage sees random I/O
- solved by flash storage
Provisioning
Thick volume
- pre-provision capacity
- good for sequential reads from HDD (no fragmentation)
Thin volume
- on-demand physical capacity
- extent ≡ minimal size of volume growth
- causes fragmentation ⇒ longer seek time ⇒ bad for sequential read
- write performance overhead because of time to allocate physical blocks on the fly
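Minimal sketch of thin provisioning with on-demand extent allocation; the extent size and "volume" below are hypothetical:

```python
# Thin volume sketch: physical extents are allocated only on first write.
EXTENT = 4                                   # extent = minimal growth unit (blocks)

class ThinVolume:
    def __init__(self, logical_blocks):
        self.logical_blocks = logical_blocks # advertised (logical) size
        self.extents = {}                    # extent index -> list of blocks

    def write(self, block, data):
        ext = block // EXTENT
        if ext not in self.extents:          # allocate physical space on demand
            self.extents[ext] = [None] * EXTENT
        self.extents[ext][block % EXTENT] = data

    def allocated(self):
        return len(self.extents) * EXTENT

vol = ThinVolume(logical_blocks=1_000_000)
vol.write(10, b"x")
print(vol.allocated())                       # 4 blocks backed, not 1,000,000
```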
Space reclamation
- fill with zeros
- create file
- fill file with zeros
- delete file
- call array script
- OS integration: SCSI UNMAP
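A hedged sketch of the "fill with zeros" method above; the path and sizes are made up, and an array-side script or SCSI UNMAP would then reclaim the zeroed blocks:

```python
import os

# Create a temporary file, fill free space with zero blocks, then delete it.
# Do not run this on a nearly full volume.
def zero_fill(path, total_bytes, chunk=1024 * 1024):
    zeros = b"\0" * chunk
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            f.write(zeros[: min(chunk, total_bytes - written)])
            written += chunk
        f.flush()
        os.fsync(f.fileno())                 # make sure the zeros hit the backend
    os.remove(path)                          # free the logical space again

zero_fill("/tmp/zero.fill", total_bytes=16 * 1024 * 1024)
```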
Data locality
- advantages
- lower delay (≈ 0.5 ms)
- lower network load
- disadvantages
- delay may not be relevant due to other slower bottlenecks: HDD (10 ms), SSD (1 ms), NVMe (CPU)
- limits the storage accessible to vMotion
- contradicts distributed nature
- contradicts features: deduplication
- not suitable for containers
Storage types
Block storage
- server boot, VM boot
- DB, transactions
- random I/O
- direct access
- max performance, low delay
- no concurrent access to a LUN (without a clustered file system)
- relies on HA of every level
Object storage
- flexible due to metadata
- usually read-intensive, unstructured ⇒ suitable for media
- can be geographically dispersed
- ID ≠ object name
- flat namespace
- mount not needed, access via REST/SOAP/XAM
- in-place update is often not available (e.g., ID includes hash)
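Sketch of object access over REST in a flat namespace; the endpoint and object ID are invented, and real object stores (S3, Swift, …) add authentication and request signing:

```python
import urllib.request

BASE = "http://objectstore.example.com/bucket1"   # hypothetical endpoint

def put_object(object_id, data):
    # Whole-object PUT: in-place partial update is often not available
    req = urllib.request.Request(f"{BASE}/{object_id}", data=data, method="PUT")
    return urllib.request.urlopen(req).status

def get_object(object_id):
    return urllib.request.urlopen(f"{BASE}/{object_id}").read()

# put_object("5f2a9c0e", b"media payload")   # calls commented: endpoint is fictional
# data = get_object("5f2a9c0e")
```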
Software-defined storage (SDS)
- software does not depend on hardware (contrast: controller firmware)
- layers
- orchestration
- data service
- hardware
Tiering
- sub-LUN auto-tier: move only part of LUN
- policy
- exclusion: e.g., backup never uses flash and is restricted to NL-SAS
- limit
- priority
- replication copies only write I/O ⇒ after a DR failover, auto-tiering data placement may not match the source (what was in cache / on a hot tier is now on the backend)
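A toy sub-LUN auto-tiering pass illustrating the exclusion and threshold idea; tier names, thresholds and the per-extent stats are illustrative only:

```python
# Promote hot extents to flash, demote cold ones to NL-SAS, skip excluded LUNs.
PROMOTE_IOPS, DEMOTE_IOPS = 100, 10

def retier(extents, excluded_luns=frozenset({"backup"})):
    moves = []
    for ext in extents:
        if ext["lun"] in excluded_luns:
            continue                                  # exclusion policy: stays on NL-SAS
        if ext["iops"] >= PROMOTE_IOPS and ext["tier"] != "flash":
            moves.append((ext["id"], "flash"))
        elif ext["iops"] <= DEMOTE_IOPS and ext["tier"] != "nl-sas":
            moves.append((ext["id"], "nl-sas"))
    return moves

extents = [
    {"id": 1, "lun": "db",     "tier": "nl-sas", "iops": 500},
    {"id": 2, "lun": "backup", "tier": "nl-sas", "iops": 500},
    {"id": 3, "lun": "db",     "tier": "flash",  "iops": 2},
]
print(retier(extents))    # [(1, 'flash'), (3, 'nl-sas')]
```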
Workload
- application signature
- read/write %
- random/sequential access
- data/metadata %
- compressibility
- block/chunk size
- metadata frequency
- async/compound commands
- sensitivity to delays
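One way to record such a signature as a checklist; the field values below describe a hypothetical OLTP database and are purely illustrative:

```python
# Application I/O signature fields, mirroring the list above.
oltp_signature = {
    "read_write_pct": (70, 30),
    "access_pattern": "random",
    "data_metadata_pct": (95, 5),
    "compressibility": "medium",
    "block_size_kib": 8,
    "metadata_ops_frequency": "low",
    "async_compound_commands": True,
    "latency_sensitive": True,
}
```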
Deduplication
- uses pointers to data instead of storing several copies
- flash has fast reads ⇒ inline deduplication on primary storage becomes feasible
- deduplication proper works at block level only
- file-level deduplication: the whole file must match ≡ single instancing (links)
- fixed block length: single block change causes change of subsequent blocks ⇒ breaks deduplication
- variable block length: searches for patterns, using them to segment into blocks
- HDD: increases fragmentation
- SSD: reduces flash wear
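Minimal fixed-block deduplication sketch: split data into fixed-size blocks, hash each block, store unique blocks once and reference duplicates by pointer. The block size is arbitrary and hash collisions are ignored.

```python
import hashlib

BLOCK = 4

def dedup_write(data, store, pointers):
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:              # new unique block: store it once
            store[digest] = block
        pointers.append(digest)              # duplicates become pointers only

store, pointers = {}, []
dedup_write(b"AAAABBBBAAAACCCC", store, pointers)
print(len(pointers), len(store))             # 4 pointers, 3 unique blocks stored
```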
Compression
- types
- inline
- requires large cache, but low I/O on backend
- slows down read and write
- post-process
- runs on already written data, in the background after the write
- no performance hit
- algorithms
- LZO: fast, but modest compression ratio
- block level only
- if compression were file level, reading part of a file would require decompressing the whole file
- low effectiveness for audio/video due to inherent compression
- order matters: deduplicate first, then compress; compressing first breaks deduplication (compressed output hides duplicate patterns)
- not suitable for encrypted data: high entropy, output can even be larger than the input
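A small demonstration of the entropy point: repetitive data compresses well, while random bytes (standing in for encrypted data) do not and may even grow slightly because of the compression framing overhead.

```python
import os, zlib

text = b"storage storage storage " * 1000     # highly repetitive data
noise = os.urandom(len(text))                  # stand-in for encrypted data

print(len(zlib.compress(text)), "<", len(text))       # much smaller
print(len(zlib.compress(noise)), ">=", len(noise))    # roughly the same or larger
```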
Disk interfaces
Serial advanced technology attachment (SATA)
- block-level protocol, electrical interface, physical interface
- 512-byte sector
- ATA
- smaller command set than in SCSI
- ATA packet interface (ATAPI) – in software
- near-line SAS (NL-SAS)
- SAS interface
- SAS command set
- SATA-class drive mechanics (7,200 RPM)
- native command queuing (NCQ)
- advanced host controller interface (AHCI): API between OS and controller
- serial vs parallel
- parallel has lower bandwidth because of crosstalk
- serial cables can be longer, and are simpler and cheaper
- parallel ATA is a shared bus with up to 2 devices (master/slave); SATA is point-to-point
- integrated drive electronics (IDE): ATA-1 (parallel interface)
- cheap
Serial attached SCSI
- SAS controller supports SATA II disks and newer
- different sector sizes: 512/520, 4k
- end-to-end data protection (EDP) support: add data integrity field (T10 DIF) to sector data (512 + 8 = 520)
- higher component quality, better queuing implementation
- higher MTBF than in ATA
- more features
- higher performance, capacity
- dual ported drive: active/passive connection to different controllers
- command tag queuing (CTQ)
Controller models
Array
- active/passive controller per LUN ≡ asymmetric logical unit access (ALUA)
- if one of the two controllers fails – write-through: data in a cache that is no longer mirrored cannot be acknowledged
- if a LUN is accessed through the passive controller – the request is redirected within the array to the owning controller ≡ extra latency
- scaling: only disk addition is possible (scale-up)
- massive array of idle disks (MAID)
- disks switch to low-power mode if not used
- useful for archives
Grid
- active/active
- scaling: disks, CPU, cache, BW (scale-up), nodes (scale-out)
Virtualization
- presents the capacity of external arrays as if it were internal disks
- controller features + capacity of a simpler virtualized array ⇒ suitable for auto-tiering
- the virtualizing controller appears to the external array as an ordinary host
vStorage API for array integration (VAAI)
- not an actual API
- a set of SCSI primitives that the hypervisor uses to offload operations to the array
- atomic test and set (ATS)
- instead of SCSI LOCK on the whole volume
- extent level
- lock is offloaded to array
- block zeroing: zero a whole range with one command instead of per-block writes; useful for eager zeroed thick (EZT) disks
- copy offload for storage vMotion
- thin provisioning stun: when no capacity is left to allocate, the VM is paused instead of crashing
- zero space reclamation – with SCSI UNMAP
Offloaded data transfer (ODX)
- Microsoft, ≈ VAAI
- copy offload
- bulk zeroing
vVols
- object-based FS over SAN, instead of LUNs
- gives the array per-VMDK visibility instead of per-datastore (datastore ≡ LUN)
- snapshot, replication, encryption offload from vCenter to array
- vCenter manages objects (vVols) OOB using 128-bit GUID ⇒ no need for VMFS (in contrast to block storage)
- storage policy-based management (SPBM)
- data in vVol is accessed in-band over FC/NFS/… via protocol endpoint (PE)
- types
- config vVol: VM metadata, .vmx, .nvram, logs
- data vVol: VMDK
- mem vVol: for snapshots with RAM image
- swap vVol
- other vVol: vendor/solution specific