Storage

  1. Architecture
    1. Shared storage model for virtualization (SSM)
    2. Enterprise grade
    3. Storage drive
    4. Tape
    5. HDD
    6. SSD
    7. NVMe
      1. NVMe-oF
    8. I/O blender
    9. Provisioning
      1. Thick volume
      2. Thin volume
    10. Space reclamation
    11. Data locality
  2. Storage types
    1. Block storage
    2. Object storage
    3. Software-defined storage (SDS)
    4. Tiering
  3. Workload
    1. Deduplication
    2. Compression
  4. Disk interfaces
    1. Serial advanced technology attachment (SATA)
    2. Serial attached SCSI
  5. Controller models
    1. Array
    2. Grid
  6. Virtualization
  7. vStorage API for array integration (VAAI)
  8. Offloaded data transfer (ODX)
  9. vVols

Architecture

Shared storage model for virtualization (SSM)

  • level 4: application
  • level 3: file/record
    • DB
    • FS
  • level 2: block aggregation
    • host: LVM (logical volume manager)
    • network
    • device: array controller
  • level 1: storage device
    • LBA in HDD/SSD

Enterprise grade

  1. more overprovisioned capacity
  2. more cache
  3. better/more complex controller and firmware
  4. more channels
  5. capacitors for powering cache: provide enough time to flush cached data during a power outage
  6. batteries for powering cache: keep the cache powered longer during an outage
  7. longer warranties

Storage drive

  • IOPS: usually quoted for 512-byte sectors, because that gives the maximum figure
    • IOPS ≠ latency
    • vendors usually do not list latency, because IOPS is measured under artificial conditions
  • logical block addressing (LBA): abstracts internal addressing as a contiguous address space
  • SMART: self-monitoring, analysis and reporting technology
    • standard to predict failure
  • write modes (see the sketch after this list)
    • write-through: write to the backend, then acknowledge
    • write-back: write to cache, then acknowledge, then write to the backend
    • redirect-on-write: write to a new location, then switch the pointers
  • cache mirror: needed only for writes (read data already lives on the backend, available at the same speed)
  • data scrubbing:
    • usually a low-priority background process
    • detects and fixes (CRC32) bit rot: physical medium degradation due to inactivity (demagnetization, charge leak in flash cells)
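  • sketch (Python, illustrative): a minimal write-through vs write-back cache; the class and method names are assumptions, not a real controller

      # Minimal sketch of write-through vs write-back caching (illustrative only).

      class CachedDrive:
          def __init__(self, mode="write-back"):
              self.mode = mode
              self.cache = {}      # fast, volatile (why enterprise drives add batteries/capacitors)
              self.backend = {}    # slow, persistent
              self.dirty = set()   # blocks acknowledged but not yet persisted

          def write(self, lba, data):
              if self.mode == "write-through":
                  self.backend[lba] = data    # persist first ...
                  return "ack"                # ... acknowledge after
              self.cache[lba] = data          # write-back: acknowledge from cache
              self.dirty.add(lba)
              return "ack"                    # backend is updated later by flush()

          def flush(self):
              # background destaging of dirty cache blocks to the backend
              for lba in self.dirty:
                  self.backend[lba] = self.cache[lba]
              self.dirty.clear()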

Tape

  • high capacity, sequential performance, low power
  • media degrades
  • not suited to every backup type (e.g., a poor fit for differential backups)
  • linear tape-open (LTO) format (see the example after this list):
    • a drive can read tapes down to its own LTO generation − 2 (inclusive)
    • a drive can write tapes down to its own LTO generation − 1 (inclusive)
  • shoe-shining:
    • data speed ≠ tape speed
    • slows down backup, wears the tape and drive
    • multiplexing can solve shoe-shining but slows down restores – the data streams are interleaved
  • virtual tape library (VTL):
    • emulates tape, uses HDD on backend
    • native deduplication
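  • example (Python, illustrative): a generation-compatibility check following the read −2 / write −1 rule above

      # LTO backward compatibility per the rule above: a generation-N drive reads
      # tapes of generation N-2 and newer, and writes tapes of generation N-1 and newer.

      def can_read(drive_gen: int, tape_gen: int) -> bool:
          return drive_gen - 2 <= tape_gen <= drive_gen

      def can_write(drive_gen: int, tape_gen: int) -> bool:
          return drive_gen - 1 <= tape_gen <= drive_gen

      assert can_read(6, 4) and not can_read(6, 3)      # LTO-6 drive, LTO-4 tape: readable
      assert can_write(6, 5) and not can_write(6, 4)    # LTO-4 tape: not writable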

HDD

  • zoned data recording (ZDR): outer tracks have more sectors than inner tracks ⇒ I/O speed on edge is higher ⇒ controller tries to write to outer tracks
  • short stroking: use only the part of the capacity closest to the outer edge (shorter seeks, higher transfer rate)
  • factory format
    • low-level format
    • physical creation of tracks and sectors + control marking
  • queuing is required to optimize I/O: reorder reads/writes ⇒ lower rotational latency and seek time (see the sketch after this list)
  • 3.5 inch form factor ≡ size of a 3.5-inch floppy drive
    • 7200 RPM platter diameter – about 3.7 inch
    • 10K/15K RPM drives use platters of smaller diameter
  • head thrashing ≡ excessive seeking
  • copying from one disk to another is faster than copying within a single disk – no seeking between source and destination (relevant, e.g., for backup)
  • sustained transfer rate (STR): I/O speed for sequential data on different tracks
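  • sketch (Python, illustrative): a toy elevator (SCAN) scheduler – one way a drive or controller can reorder queued requests to cut seek time; the function name and track numbers are assumptions

      # Toy elevator (SCAN) scheduling: serve requests in track order while the head
      # sweeps in one direction, then sweep back - fewer long seeks than FIFO order.

      def elevator_order(pending_tracks, head_position):
          ahead = sorted(t for t in pending_tracks if t >= head_position)
          behind = sorted((t for t in pending_tracks if t < head_position), reverse=True)
          return ahead + behind        # sweep up first, then come back down

      print(elevator_order([98, 183, 37, 122, 14, 124, 65, 67], head_position=53))
      # [65, 67, 98, 122, 124, 183, 37, 14]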

SSD

  • NAND, NOR (rare)
  • cell → page (4k, 8k, 16k) → block (128-512k)
  • a new disk has all cells initialized to 1
  • write (program): per page
  • erase: per block ⇒ a rewrite (read + erase + write) is slow (rough cost ratio read : write : erase ≈ 1 : 10 : 100)
    • P/E: program/erase cycle
  • write cliff: shortage of pre-erased blocks to redirect I/O into ⇒ full rewrites become necessary and performance drops
  • cell types
    • SLC
      • 0-1
      • 100k P/E
      • lowest power consumption
    • MLC
      • 00-11
      • 10k P/E
      • 2 bits encoded as 4 voltage levels per cell
    • eMLC
      • enterprise MLC
      • better ECC
      • more cell overprovision to replace dead cells and distribute load
    • TLC
      • 000-111
      • 5k P/E
  • cache
    • required for write coalescing: reduces cell wear
      • write combining: only the last write to a location is flushed, instead of one flash program per write
    • not required for read
  • garbage collection
    • improves performance, reduces P/E
    • erases unused blocks in the background
    • repackages partially full blocks
  • wear leveling (see the sketch after this list)
    • distributes wear over all cells
    • types
      • background
      • in-flight
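  • sketch (Python, illustrative): a toy wear-leveling allocator that always reuses the free block with the lowest erase count; the class is an assumption, not a real FTL (which also handles mapping, GC, ECC)

      # Toy wear leveling: pick the least-erased free block for the next write.

      import heapq

      class WearLeveler:
          def __init__(self, num_blocks: int):
              # min-heap of (erase_count, block_id): least-worn block comes out first
              self.free = [(0, b) for b in range(num_blocks)]
              heapq.heapify(self.free)

          def allocate_block(self) -> int:
              erase_count, block = heapq.heappop(self.free)
              return block

          def erase_and_release(self, block: int, prev_erase_count: int) -> None:
              # erasing consumes one P/E cycle; return the block with its new count
              heapq.heappush(self.free, (prev_erase_count + 1, block))

      wl = WearLeveler(num_blocks=4)
      b = wl.allocate_block()                       # the block with the fewest erases
      wl.erase_and_release(b, prev_erase_count=0)   # goes back with erase count 1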

NVMe

  • for SSD instead of SCSI
  • more command queues: 64k queues
  • deeper queues: 64k commands per queue
  • only 13 commands
  • not an I/O-port protocol but memory-mapped queues over PCIe:
    • no separate storage controller (HBA) needed
    • point-to-point PCIe instead of a shared SCSI bus
  • OS talks to storage directly over PCIe; replaces the AHCI + SCSI stack

NVMe-oF

  • carries NVMe commands between host and storage over a network transport (FC, IB, RoCEv2, iWARP)

I/O blender

  • several VMs each issuing sequential I/O on the same host ⇒ the host issues random I/O to storage
  • largely solved by flash storage (no seek penalty for random I/O)

Provisioning

Thick volume

  • capacity is pre-provisioned (allocated up front)
  • better for sequential reads from HDD (no fragmentation)

Thin volume

  • physical capacity is allocated on demand
  • extent ≡ minimum unit of volume growth (see the sketch after this list)
  • causes fragmentation ⇒ longer seek times ⇒ bad for sequential reads
  • write performance overhead: physical blocks are allocated on the fly
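  • sketch (Python, illustrative): thin allocation in extent-sized units; the class, extent size, and volume sizes are assumptions

      # Toy thin-provisioned volume: physical extents are allocated only on first write.

      EXTENT = 1024 * 1024              # 1 MiB growth unit (illustrative)

      class ThinVolume:
          def __init__(self, logical_size: int):
              self.logical_size = logical_size
              self.extent_map = {}      # logical extent index -> physical extent id
              self.next_physical = 0

          def write(self, offset: int, length: int) -> None:
              first = offset // EXTENT
              last = (offset + length - 1) // EXTENT
              for idx in range(first, last + 1):
                  if idx not in self.extent_map:        # allocate lazily ...
                      self.extent_map[idx] = self.next_physical
                      self.next_physical += 1           # ... which adds write latency

          def physical_usage(self) -> int:
              return len(self.extent_map) * EXTENT

      vol = ThinVolume(logical_size=100 * 1024**3)      # 100 GiB logical
      vol.write(offset=0, length=4096)
      assert vol.physical_usage() == EXTENT             # only one extent consumed so far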

Space reclamation

  • fill with zeros (see the sketch after this list)
    1. create a file
    2. fill the file with zeros
    3. delete the file
    4. run the array's space-reclamation script
  • OS integration: SCSI UNMAP (the OS tells the array which blocks are free)
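  • sketch (Python, illustrative): the zero-fill steps above as a script; the path, chunk size, and the final array-side step are assumptions/vendor-specific

      # Zero-fill free space so the array can detect and reclaim it (illustrative).

      import os

      def zero_fill_free_space(mount_point: str, chunk: int = 1024 * 1024) -> None:
          path = os.path.join(mount_point, "zerofill.tmp")
          zeros = b"\x00" * chunk
          try:
              with open(path, "wb") as f:
                  while True:
                      f.write(zeros)        # write zeros until the filesystem is full
          except OSError:
              pass                          # "no space left on device" ends the loop
          finally:
              if os.path.exists(path):
                  os.remove(path)           # free the logical space again
          # after this, the array's reclamation script can return the zeroed blocks
          # to the free pool; SCSI UNMAP automates the same thing from the OS side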

Data locality

  • advantages
    • lower latency (≈ 0.5 ms)
    • lower network load
  • disadvantages
    • the latency gain may be irrelevant next to slower bottlenecks: HDD (~10 ms), SSD (~1 ms), NVMe (CPU-bound)
    • limits the storage accessible to vMotion
    • contradicts the distributed nature of the platform
    • conflicts with features such as deduplication
    • not suitable for containers

Storage types

Block storage

  • server boot, VM boot
  • DB, transactions
  • random I/O
  • direct access
  • max performance, low latency
  • no concurrent access to a LUN (without a cluster-aware file system)
  • relies on HA at every layer

Object storage

  • flexible due to metadata
  • usually read-intensive, unstructured ⇒ suitable for media
  • can be geographically dispersed
  • ID ≠ object name
  • flat namespace
  • mount not needed, access via REST/SOAP/XAM
  • in-place update is often unavailable (e.g., when the object ID includes a content hash – see the sketch below)
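  • sketch (Python, illustrative): a toy content-addressed object store with a flat namespace; class and field names are assumptions

      # Toy object store: flat namespace, IDs derived from content, metadata attached.

      import hashlib

      class ObjectStore:
          def __init__(self):
              self.objects = {}    # object_id -> (data, metadata); no hierarchy

          def put(self, data: bytes, metadata: dict) -> str:
              object_id = hashlib.sha256(data).hexdigest()   # ID ≠ name
              self.objects[object_id] = (data, metadata)
              return object_id

          def get(self, object_id: str):
              return self.objects[object_id]

      store = ObjectStore()
      oid1 = store.put(b"photo v1", {"name": "cat.jpg", "owner": "alice"})
      oid2 = store.put(b"photo v2", {"name": "cat.jpg", "owner": "alice"})
      assert oid1 != oid2      # an "update" creates a new object, not an in-place change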

Software-defined storage (SDS)

  • software does not depend on hardware (contrast: controller firmware)
  • layers
    • orchestration
    • data service
    • hardware

Tiering

  • sub-LUN auto-tiering: move only part of a LUN between tiers
  • policy
    • exclusion: e.g., backup data never lands on flash and is restricted to NL-SAS
    • limit
    • priority
  • with replication only write I/O is copied ⇒ after a DR failover, auto-tiering placement may not match the source (data that was on a fast tier/in cache may now sit on the backend)

Workload

  • application signature
    • read/write %
    • random/sequential access
    • data/metadata %
    • compressibility
    • block/chunk size
    • metadata frequency
    • async/compound commands
    • sensitivity to delays

Deduplication

  • stores pointers to data instead of several copies
  • flash has fast reads ⇒ inline deduplication on primary storage becomes feasible
  • block-level deduplication (see the sketch after this list)
    • file-level deduplication ≡ single instancing (links): the whole file must match
  • fixed block length: an insertion shifts all subsequent block boundaries ⇒ breaks deduplication
  • variable block length: content-defined patterns mark the block boundaries, so a change stays local
  • HDD: increases fragmentation
  • SSD: reduces flash wear
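  • sketch (Python, illustrative): fixed-length block deduplication with a hash index, showing how a one-byte insertion shifts every block boundary and defeats matching; names and sizes are assumptions

      # Toy fixed-block deduplication: each unique block is stored once, by hash.

      import hashlib, os

      BLOCK = 4096                               # fixed block length (illustrative)

      def dedup_store(data: bytes, store: dict) -> list:
          """Split into fixed blocks, keep unique blocks, return the pointer list."""
          pointers = []
          for i in range(0, len(data), BLOCK):
              block = data[i:i + BLOCK]
              digest = hashlib.sha256(block).hexdigest()
              store.setdefault(digest, block)    # store the block only if it is new
              pointers.append(digest)            # the "file" is now a list of pointers
          return pointers

      store = {}
      original = os.urandom(16 * BLOCK)
      dedup_store(original, store)
      print(len(store))                          # 16 unique blocks

      dedup_store(original, store)               # an identical copy adds nothing
      print(len(store))                          # still 16

      dedup_store(b"X" + original, store)        # one-byte insertion shifts every boundary
      print(len(store))                          # ~33: almost nothing matches any more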

Compression

  • types
    • inline
      • requires large cache, but low I/O on backend
      • slows down read and write
    • post-process
      • compresses data after it has been written (in the background)
      • no front-end performance hit
  • algorithms
    • LZO: fast, modest compression ratio
  • block level only
    • with file-level compression, reading part of a file requires decompressing the whole file
  • low effectiveness for audio/video due to their inherent compression
  • dedup → compress works; compress → dedup does not (compression destroys the duplicate patterns)
  • not suitable for encrypted data: high entropy, compression can even increase the size (see the example after this list)
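  • example (Python, illustrative): why high-entropy data does not compress; zlib stands in for whatever algorithm the array uses

      # Compressibility depends on entropy: repetitive data shrinks, random data
      # (a stand-in for encrypted or already-compressed media) does not.

      import os, zlib

      text_like = b"the quick brown fox jumps over the lazy dog " * 1000
      random_like = os.urandom(len(text_like))

      print(len(text_like), len(zlib.compress(text_like)))        # ~45000 -> a few hundred bytes
      print(len(random_like), len(zlib.compress(random_like)))    # slightly LARGER than the input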

Disk interfaces

Serial advanced technology attachment (SATA)

  • block-level protocol, electrical interface, physical interface
  • 512-byte sector
  • ATA
    • smaller command set than in SCSI
    • ATA packet interface (ATAPI): SCSI-style command packets carried over ATA (handled in software)
  • near-line SAS (NL-SAS)
    • SAS interface
    • SAS command set
    • SATA-class (nearline) drive mechanics
  • native command queuing (NCQ): the drive reorders queued commands
  • advanced host controller interface (AHCI): API between OS and controller
  • serial vs parallel
    • parallel has lower bandwidth because of crosstalk between the lines
    • a long serial cable is simpler and cheaper
  • parallel ATA is a bus with up to 2 devices (master/slave); SATA is point-to-point
  • integrated drive electronics (IDE): ATA-1 (parallel interface)
  • cheap

Serial attached SCSI

  • SAS controller supports SATA II disks and newer
  • different sector sizes: 512/520, 4k
  • end-to-end data protection (EDP) support: add data integrity field (T10 DIF) to sector data (512 + 8 = 520)
  • higher component quality and a better queuing implementation
    • higher MTBF than in ATA
    • more features
    • higher performance, capacity
  • dual ported drive: active/passive connection to different controllers
  • command tag queuing (CTQ)

Controller models

Array

  • active/passive controller per LUN ≡ asymmetric logical unit access (ALUA)
  • if one of the 2 controllers fails, the survivor switches to write-through: it cannot acknowledge cached data that is no longer mirrored
  • if a LUN is accessed through the passive controller, the request is redirected inside the array to the owning controller ⇒ extra latency
  • scaling: only disk addition is possible (scale-up)
  • massive array of idle disks (MAID)
    • disks switch to low-power mode if not used
    • useful for archives

Grid

  • active/active
  • scaling: disks, CPU, cache, BW (scale-up), nodes (scale-out)

Virtualization

  • present an external array's capacity as if it were internal disks
  • the virtualizing controller's features + the capacity of a simpler virtualized array ⇒ well suited to auto-tiering
  • to the virtualized array, the virtualizing controller looks like an ordinary host

vStorage API for array integration (VAAI)

  • not an actual API
  • lets the hypervisor offload operations to the array via standard SCSI primitives
  • atomic test and set (ATS) – see the sketch after this list
    • instead of SCSI LOCK on the whole volume
    • extent level
    • lock is offloaded to array
  • block zeroing: zero a whole range with one command instead of per-block writes – useful for eager-zeroed thick (EZT) volumes
  • copy offload for storage vMotion
  • thin-provisioning stun: when the array runs out of capacity to allocate, the VM is paused instead of crashing
  • zero space reclamation – with SCSI UNMAP
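  • sketch (Python, illustrative): the idea behind ATS as a compare-and-write on a single extent instead of a lock on the whole volume; the class and the thread lock are assumptions, not the actual SCSI mechanism

      # ATS idea: atomically "compare the current value of one extent and write the
      # new value only if it still matches" - no lock on the whole LUN.

      import threading

      class Extent:
          def __init__(self, value: bytes = b""):
              self._value = value
              self._lock = threading.Lock()    # stands in for the array's internal atomicity

          def compare_and_write(self, expected: bytes, new: bytes) -> bool:
              with self._lock:                 # serializes only this extent, not the volume
                  if self._value != expected:
                      return False             # another host won the race; caller retries
                  self._value = new
                  return True

      extent = Extent(b"owner=none")
      assert extent.compare_and_write(b"owner=none", b"owner=hostA")      # hostA acquires
      assert not extent.compare_and_write(b"owner=none", b"owner=hostB")  # hostB must retry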

Offloaded data transfer (ODX)

  • Microsoft, ≈ VAAI
  • copy offload
  • bulk zeroing

vVols

  • object-based FS over SAN, instead of LUNs
  • gives the array per-VMDK visibility instead of per-datastore (≡ LUN) visibility
  • snapshot, replication, encryption offload from vCenter to array
  • vCenter manages objects (vVols) OOB using 128-bit GUID ⇒ no need for VMFS (in contrast to block storage)
    • storage policy-based management (SPBM)
  • data in vVol is accessed in-band over FC/NFS/… via protocol endpoint (PE)
  • types
    • config vVol: VM metadata, .vmx, .nvram, logs
    • data vVol: VMDK
    • mem vVol: for snapshots with RAM image
    • swap vVol
    • other vVol: vendor/solution specific