Storage

  1. Architecture
    1. Shared storage model for virtualization (SSM)
    2. Enterprise grade
    3. Storage drive
    4. Tape
    5. HDD
    6. SSD
    7. NVMe
      1. NVMe-oF
    8. I/O blender
    9. Provisioning
      1. Thick volume
      2. Thin volume
    10. Space reclamation
    11. Data locality
  2. Storage types
    1. Block storage
    2. Object storage
    3. Software-defined storage (SDS)
    4. Tiering
  3. Workload
    1. Deduplication
    2. Compression
  4. Disk interfaces
    1. Serial advanced technology attachment (SATA)
    2. Serial attached SCSI
  5. Controller models
    1. Array
    2. Grid
  6. Virtualization
  7. vStorage API for array integration (VAAI)
  8. Offloaded data transfer (ODX)
  9. vVols

Architecture

Shared storage model for virtualization (SSM)

  • level 4: application
  • level 3: file/record
    • DB
    • FS
  • level 2: block aggregation
    • host: LVM (logical volume manager)
    • network
    • device: array controller
  • level 1: storage device
    • LBA in HDD/SSD

Enterprise grade

  1. more overprovisioned capacity
  2. more cache
  3. better/more complex controller and firmware
  4. more channels
  5. capacitors for powering cache: provide enough time to flush cached data during a power outage
  6. batteries for powering cache: keep the cache powered longer during an outage
  7. longer warranties

Storage drive

  • IOPS: usually quoted for 512-byte sectors, because that gives the maximum figure
    • IOPS ≠ latency
    • vendors usually do not list latency, because IOPS is measured under artificial conditions
  • logical block addressing (LBA): abstracts internal addressing as a contiguous address space
  • SMART: self-monitoring, analysis and reporting technology
    • standard to predict failure
  • write modes (see the sketch after this list)
    • write-through: write to the backend, then acknowledge
    • write-back: write to cache, then acknowledge, then write to the backend
    • redirect-on-write: write to a new location, then switch the pointers
  • cache mirror: needed only for writes (read data already lives on the backend, available at the same speed)
  • data scrubbing:
    • usually a low-priority background process
    • detects and fixes (CRC32) bit rot: physical medium degradation due to inactivity (demagnetization, charge leak in flash cells)
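  • sketch (Python, illustrative): a minimal write-through vs write-back cache; the class and method names are assumptions, not a real controller

      # Minimal sketch of write-through vs write-back caching (illustrative only).

      class CachedDrive:
          def __init__(self, mode="write-back"):
              self.mode = mode
              self.cache = {}      # fast, volatile (why enterprise drives add batteries/capacitors)
              self.backend = {}    # slow, persistent
              self.dirty = set()   # blocks acknowledged but not yet persisted

          def write(self, lba, data):
              if self.mode == "write-through":
                  self.backend[lba] = data    # persist first ...
                  return "ack"                # ... acknowledge after
              self.cache[lba] = data          # write-back: acknowledge from cache
              self.dirty.add(lba)
              return "ack"                    # backend is updated later by flush()

          def flush(self):
              # background destaging of dirty cache blocks to the backend
              for lba in self.dirty:
                  self.backend[lba] = self.cache[lba]
              self.dirty.clear()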

Tape

  • high capacity, sequential performance, low power
  • media degrades
  • not suited to every backup type (e.g., a poor fit for differential backups)
  • linear tape-open (LTO) format (see the example after this list):
    • a drive can read tapes down to its own LTO generation − 2 (inclusive)
    • a drive can write tapes down to its own LTO generation − 1 (inclusive)
  • shoe-shining:
    • data speed ≠ tape speed
    • slows down backup, wears the tape and drive
    • multiplexing can solve shoe-shining but slows down restores – the data streams are interleaved
  • virtual tape library (VTL):
    • emulates tape, uses HDD on backend
    • native deduplication
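  • example (Python, illustrative): a generation-compatibility check following the read −2 / write −1 rule above

      # LTO backward compatibility per the rule above: a generation-N drive reads
      # tapes of generation N-2 and newer, and writes tapes of generation N-1 and newer.

      def can_read(drive_gen: int, tape_gen: int) -> bool:
          return drive_gen - 2 <= tape_gen <= drive_gen

      def can_write(drive_gen: int, tape_gen: int) -> bool:
          return drive_gen - 1 <= tape_gen <= drive_gen

      assert can_read(6, 4) and not can_read(6, 3)      # LTO-6 drive, LTO-4 tape: readable
      assert can_write(6, 5) and not can_write(6, 4)    # LTO-4 tape: not writable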

HDD

  • zoned data recording (ZDR): outer tracks have more sectors than inner tracks ⇒ I/O speed on edge is higher ⇒ controller tries to write to outer tracks
  • short stroking: use only the part of the capacity closest to the outer edge (shorter seeks, higher transfer rate)
  • factory format
    • low-level format
    • physical creation of tracks and sectors + control marking
  • queuing is required to optimize I/O: reorder reads/writes ⇒ lower rotational latency and seek time (see the sketch after this list)
  • 3.5 inch form factor ≡ size of a 3.5-inch floppy drive
    • 7200 RPM platter diameter – about 3.7 inch
    • 10K/15K RPM drives use platters of smaller diameter
  • head thrashing ≡ excessive seeking
  • copying from one disk to another is faster than copying within a single disk – no seeking between source and destination (relevant, e.g., for backup)
  • sustained transfer rate (STR): I/O speed for sequential data on different tracks
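  • sketch (Python, illustrative): a toy elevator (SCAN) scheduler – one way a drive or controller can reorder queued requests to cut seek time; the function name and track numbers are assumptions

      # Toy elevator (SCAN) scheduling: serve requests in track order while the head
      # sweeps in one direction, then sweep back - fewer long seeks than FIFO order.

      def elevator_order(pending_tracks, head_position):
          ahead = sorted(t for t in pending_tracks if t >= head_position)
          behind = sorted((t for t in pending_tracks if t < head_position), reverse=True)
          return ahead + behind        # sweep up first, then come back down

      print(elevator_order([98, 183, 37, 122, 14, 124, 65, 67], head_position=53))
      # [65, 67, 98, 122, 124, 183, 37, 14]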

SSD

  • NAND, NOR (rare)
  • cell → page (4k, 8k, 16k) → block (128-512k)
  • a new disk has all cells initialized to 1
  • write (program): per page
  • erase: per block ⇒ a rewrite (read + erase + write) is slow (rough cost ratio read : write : erase ≈ 1 : 10 : 100)
    • P/E: program/erase cycle
  • write cliff: shortage of pre-erased blocks to redirect I/O into ⇒ full rewrites become necessary and performance drops
  • cell types
    • SLC
      • 0-1
      • 100k P/E
      • lowest power consumption
    • MLC
      • 00-11
      • 10k P/E
      • 2 bits encoded as 4 voltage levels per cell
    • eMLC
      • enterprise MLC
      • better ECC
      • more cell overprovision to replace dead cells and distribute load
    • TLC
      • 000-111
      • 5k P/E
  • cache
    • required for write coalescing: reduces cell wear
      • write combining: only the last write to a location is flushed, instead of one flash program per write
    • not required for read
  • garbage collection
    • improves performance, reduces P/E
    • erases unused blocks in the background
    • repackages partially full blocks
  • wear leveling (see the sketch after this list)
    • distributes wear over all cells
    • types
      • background
      • in-flight
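  • sketch (Python, illustrative): a toy wear-leveling allocator that always reuses the free block with the lowest erase count; the class is an assumption, not a real FTL (which also handles mapping, GC, ECC)

      # Toy wear leveling: pick the least-erased free block for the next write.

      import heapq

      class WearLeveler:
          def __init__(self, num_blocks: int):
              # min-heap of (erase_count, block_id): least-worn block comes out first
              self.free = [(0, b) for b in range(num_blocks)]
              heapq.heapify(self.free)

          def allocate_block(self) -> int:
              erase_count, block = heapq.heappop(self.free)
              return block

          def erase_and_release(self, block: int, prev_erase_count: int) -> None:
              # erasing consumes one P/E cycle; return the block with its new count
              heapq.heappush(self.free, (prev_erase_count + 1, block))

      wl = WearLeveler(num_blocks=4)
      b = wl.allocate_block()                       # the block with the fewest erases
      wl.erase_and_release(b, prev_erase_count=0)   # goes back with erase count 1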

NVMe

  • for SSD instead of SCSI
  • more command queues: 64k queues
  • deeper queues: 64k commands per queue
  • only 13 commands
  • not an I/O-port protocol but memory-mapped queues over PCIe:
    • no separate storage controller (HBA) needed
    • point-to-point PCIe instead of a shared SCSI bus
  • OS talks to storage directly over PCIe; replaces the AHCI + SCSI stack

NVMe-oF

  • carries NVMe commands between host and storage over a network transport (FC, IB, RoCEv2, iWARP)

I/O blender

  • several VMs each issuing sequential I/O on the same host ⇒ the host issues random I/O to storage
  • largely solved by flash storage (no seek penalty for random I/O)

Provisioning

Thick volume

  • capacity is pre-provisioned (allocated up front)
  • better for sequential reads from HDD (no fragmentation)

Thin volume

  • physical capacity is allocated on demand
  • extent ≡ minimum unit of volume growth (see the sketch after this list)
  • causes fragmentation ⇒ longer seek times ⇒ bad for sequential reads
  • write performance overhead: physical blocks are allocated on the fly
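  • sketch (Python, illustrative): thin allocation in extent-sized units; the class, extent size, and volume sizes are assumptions

      # Toy thin-provisioned volume: physical extents are allocated only on first write.

      EXTENT = 1024 * 1024              # 1 MiB growth unit (illustrative)

      class ThinVolume:
          def __init__(self, logical_size: int):
              self.logical_size = logical_size
              self.extent_map = {}      # logical extent index -> physical extent id
              self.next_physical = 0

          def write(self, offset: int, length: int) -> None:
              first = offset // EXTENT
              last = (offset + length - 1) // EXTENT
              for idx in range(first, last + 1):
                  if idx not in self.extent_map:        # allocate lazily ...
                      self.extent_map[idx] = self.next_physical
                      self.next_physical += 1           # ... which adds write latency

          def physical_usage(self) -> int:
              return len(self.extent_map) * EXTENT

      vol = ThinVolume(logical_size=100 * 1024**3)      # 100 GiB logical
      vol.write(offset=0, length=4096)
      assert vol.physical_usage() == EXTENT             # only one extent consumed so far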

Space reclamation

  • fill with zeros (see the sketch after this list)
    1. create a file
    2. fill the file with zeros
    3. delete the file
    4. run the array's space-reclamation script
  • OS integration: SCSI UNMAP (the OS tells the array which blocks are free)
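  • sketch (Python, illustrative): the zero-fill steps above as a script; the path, chunk size, and the final array-side step are assumptions/vendor-specific

      # Zero-fill free space so the array can detect and reclaim it (illustrative).

      import os

      def zero_fill_free_space(mount_point: str, chunk: int = 1024 * 1024) -> None:
          path = os.path.join(mount_point, "zerofill.tmp")
          zeros = b"\x00" * chunk
          try:
              with open(path, "wb") as f:
                  while True:
                      f.write(zeros)        # write zeros until the filesystem is full
          except OSError:
              pass                          # "no space left on device" ends the loop
          finally:
              if os.path.exists(path):
                  os.remove(path)           # free the logical space again
          # after this, the array's reclamation script can return the zeroed blocks
          # to the free pool; SCSI UNMAP automates the same thing from the OS side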

Data locality

  • advantages
    • lower latency (≈ 0.5 ms)
    • lower network load
  • disadvantages
    • the latency gain may be irrelevant next to slower bottlenecks: HDD (~10 ms), SSD (~1 ms), NVMe (CPU-bound)
    • limits the storage accessible to vMotion
    • contradicts the distributed nature of the platform
    • conflicts with features such as deduplication
    • not suitable for containers

Storage types

Block storage

  • server boot, VM boot
  • DB, transactions
  • random I/O
  • direct access
  • max performance, low latency
  • no concurrent access to a LUN (without a cluster-aware file system)
  • relies on HA at every layer

Object storage

  • flexible due to metadata
  • usually read-intensive, unstructured ⇒ suitable for media
  • can be geographically dispersed
  • ID ≠ object name
  • flat namespace
  • mount not needed, access via REST/SOAP/XAM
  • in-place update is often unavailable (e.g., when the object ID includes a content hash – see the sketch below)
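  • sketch (Python, illustrative): a toy content-addressed object store with a flat namespace; class and field names are assumptions

      # Toy object store: flat namespace, IDs derived from content, metadata attached.

      import hashlib

      class ObjectStore:
          def __init__(self):
              self.objects = {}    # object_id -> (data, metadata); no hierarchy

          def put(self, data: bytes, metadata: dict) -> str:
              object_id = hashlib.sha256(data).hexdigest()   # ID ≠ name
              self.objects[object_id] = (data, metadata)
              return object_id

          def get(self, object_id: str):
              return self.objects[object_id]

      store = ObjectStore()
      oid1 = store.put(b"photo v1", {"name": "cat.jpg", "owner": "alice"})
      oid2 = store.put(b"photo v2", {"name": "cat.jpg", "owner": "alice"})
      assert oid1 != oid2      # an "update" creates a new object, not an in-place change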

Software-defined storage (SDS)

  • software does not depend on hardware (contrast: controller firmware)
  • layers
    • orchestration
    • data service
    • hardware

Tiering

  • sub-LUN auto-tiering: move only part of a LUN between tiers
  • policy
    • exclusion: e.g., backup data never lands on flash and is restricted to NL-SAS
    • limit
    • priority
  • with replication only write I/O is copied ⇒ after a DR failover, auto-tiering placement may not match the source (data that was on a fast tier/in cache may now sit on the backend)

Workload

  • application signature
    • read/write %
    • random/sequential access
    • data/metadata %
    • compressibility
    • block/chunk size
    • metadata frequency
    • async/compound commands
    • sensitivity to delays

Deduplication

  • stores pointers to data instead of several copies
  • flash has fast reads ⇒ inline deduplication on primary storage becomes feasible
  • block-level deduplication (see the sketch after this list)
    • file-level deduplication ≡ single instancing (links): the whole file must match
  • fixed block length: an insertion shifts all subsequent block boundaries ⇒ breaks deduplication
  • variable block length: content-defined patterns mark the block boundaries, so a change stays local
  • HDD: increases fragmentation
  • SSD: reduces flash wear
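  • sketch (Python, illustrative): fixed-length block deduplication with a hash index, showing how a one-byte insertion shifts every block boundary and defeats matching; names and sizes are assumptions

      # Toy fixed-block deduplication: each unique block is stored once, by hash.

      import hashlib, os

      BLOCK = 4096                               # fixed block length (illustrative)

      def dedup_store(data: bytes, store: dict) -> list:
          """Split into fixed blocks, keep unique blocks, return the pointer list."""
          pointers = []
          for i in range(0, len(data), BLOCK):
              block = data[i:i + BLOCK]
              digest = hashlib.sha256(block).hexdigest()
              store.setdefault(digest, block)    # store the block only if it is new
              pointers.append(digest)            # the "file" is now a list of pointers
          return pointers

      store = {}
      original = os.urandom(16 * BLOCK)
      dedup_store(original, store)
      print(len(store))                          # 16 unique blocks

      dedup_store(original, store)               # an identical copy adds nothing
      print(len(store))                          # still 16

      dedup_store(b"X" + original, store)        # one-byte insertion shifts every boundary
      print(len(store))                          # ~33: almost nothing matches any more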

Compression

  • types
    • inline
      • requires large cache, but low I/O on backend
      • slows down read and write
    • post-process
      • compresses data after it has been written (in the background)
      • no front-end performance hit
  • algorithms
    • LZO: fast, modest compression ratio
  • block level only
    • with file-level compression, reading part of a file requires decompressing the whole file
  • low effectiveness for audio/video due to their inherent compression
  • dedup → compress works; compress → dedup does not (compression destroys the duplicate patterns)
  • not suitable for encrypted data: high entropy, compression can even increase the size (see the example after this list)
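  • example (Python, illustrative): why high-entropy data does not compress; zlib stands in for whatever algorithm the array uses

      # Compressibility depends on entropy: repetitive data shrinks, random data
      # (a stand-in for encrypted or already-compressed media) does not.

      import os, zlib

      text_like = b"the quick brown fox jumps over the lazy dog " * 1000
      random_like = os.urandom(len(text_like))

      print(len(text_like), len(zlib.compress(text_like)))        # ~45000 -> a few hundred bytes
      print(len(random_like), len(zlib.compress(random_like)))    # slightly LARGER than the input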

Disk interfaces

Serial advanced technology attachment (SATA)

  • block-level protocol, electrical interface, physical interface
  • 512-byte sector
  • ATA
    • smaller command set than in SCSI
    • ATA packet interface (ATAPI): SCSI-style command packets carried over ATA (handled in software)
  • near-line SAS (NL-SAS)
    • SAS interface
    • SAS command set
    • SATA-class (nearline) drive mechanics
  • native command queuing (NCQ): the drive reorders queued commands
  • advanced host controller interface (AHCI): API between OS and controller
  • serial vs parallel
    • parallel has lower bandwidth because of crosstalk between the lines
    • a long serial cable is simpler and cheaper
  • parallel ATA is a bus with up to 2 devices (master/slave); SATA is point-to-point
  • integrated drive electronics (IDE): ATA-1 (parallel interface)
  • cheap

Serial attached SCSI

  • SAS controller supports SATA II disks and newer
  • different sector sizes: 512/520, 4k
  • end-to-end data protection (EDP) support: add data integrity field (T10 DIF) to sector data (512 + 8 = 520)
  • higher component quality and a better queuing implementation
    • higher MTBF than in ATA
    • more features
    • higher performance, capacity
  • dual ported drive: active/passive connection to different controllers
  • command tag queuing (CTQ)

Controller models

Array

  • active/passive controller per LUN ≡ asymmetric logical unit access (ALUA)
  • if one of the 2 controllers fails, the survivor switches to write-through: it cannot acknowledge cached data that is no longer mirrored
  • if a LUN is accessed through the passive controller, the request is redirected inside the array to the owning controller ⇒ extra latency
  • scaling: only disk addition is possible (scale-up)
  • massive array of idle disks (MAID)
    • disks switch to low-power mode if not used
    • useful for archives

Grid

  • active/active
  • scaling: disks, CPU, cache, BW (scale-up), nodes (scale-out)

Virtualization

  • present an external array's capacity as if it were internal disks
  • the virtualizing controller's features + the capacity of a simpler virtualized array ⇒ well suited to auto-tiering
  • to the virtualized array, the virtualizing controller looks like an ordinary host

vStorage API for array integration (VAAI)

  • not an actual API
  • lets the hypervisor offload operations to the array via standard SCSI primitives
  • atomic test and set (ATS) – see the sketch after this list
    • instead of SCSI LOCK on the whole volume
    • extent level
    • lock is offloaded to array
  • block zeroing: zero a whole range with one command instead of per-block writes – useful for eager-zeroed thick (EZT) volumes
  • copy offload for storage vMotion
  • thin-provisioning stun: when the array runs out of capacity to allocate, the VM is paused instead of crashing
  • zero space reclamation – with SCSI UNMAP
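  • sketch (Python, illustrative): the idea behind ATS as a compare-and-write on a single extent instead of a lock on the whole volume; the class and the thread lock are assumptions, not the actual SCSI mechanism

      # ATS idea: atomically "compare the current value of one extent and write the
      # new value only if it still matches" - no lock on the whole LUN.

      import threading

      class Extent:
          def __init__(self, value: bytes = b""):
              self._value = value
              self._lock = threading.Lock()    # stands in for the array's internal atomicity

          def compare_and_write(self, expected: bytes, new: bytes) -> bool:
              with self._lock:                 # serializes only this extent, not the volume
                  if self._value != expected:
                      return False             # another host won the race; caller retries
                  self._value = new
                  return True

      extent = Extent(b"owner=none")
      assert extent.compare_and_write(b"owner=none", b"owner=hostA")      # hostA acquires
      assert not extent.compare_and_write(b"owner=none", b"owner=hostB")  # hostB must retry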

Offloaded data transfer (ODX)

  • Microsoft, ≈ VAAI
  • copy offload
  • bulk zeroing

vVols

  • object-based FS over SAN, instead of LUNs
  • gives the array per-VMDK visibility instead of per-datastore (≡ LUN) visibility
  • snapshot, replication, encryption offload from vCenter to array
  • vCenter manages objects (vVols) OOB using 128-bit GUID ⇒ no need for VMFS (in contrast to block storage)
    • storage policy-based management (SPBM)
  • data in vVol is accessed in-band over FC/NFS/… via protocol endpoint (PE)
  • types
    • config vVol: VM metadata, .vmx, .nvram, logs
    • data vVol: VMDK
    • mem vVol: for snapshots with RAM image
    • swap vVol
    • other vVol: vendor/solution specific