[OSDI'20] Assise: Distributed FS w/ NVM Colocation

Assise: Performance and Availability via NVM Colocation in a Distributed File System

Thomas E. Anderson, Marco Canini, Jongyul Kim, Dejan Kostic, Youngjin Kwon, Simon Peter, Waleed Reda, Henry N. Schuh, Emmett Witchel

Background on NVM (Non-Volatile Memory)

Background on Disaggregation

  • Monolithic servers => network-attached resource pools

Why is disaggregation possible now?

  • Network bandwidth/latency is sufficient to support current applications [1,2].

Current distributed file system design

  Current design paradigm: storage disaggregation
  • Storage servers are separated from clients (file data lives physically apart from clients).
  • The client's main memory is treated as a volatile cache.

But this design has disadvantages:

  • On a cache miss, it takes multiple network RTTs to consult metadata servers and fetch the data.
  • On failure, client data and metadata caches must be rebuilt from scratch (long recovery time and heavy network utilization during recovery).
  • I/O sizes smaller than a page/block degrade performance.


Start with measuring NVM (key points):

  • Accessing NVM via RDMA is still faster than a local SSD.
  • NVM access speed varies from fast to slow with locality: compared to direct local access, NVM accessed via RDMA (NVM-RDMA), via loads and stores to another CPU socket (NVM-NUMA), or via system calls on the same socket (NVM-kernel) can be an order of magnitude slower in both latency and bandwidth.
  • NVM-RDMA write latency is about twice that of reads, because RDMA writes have to invoke the remote CPU (to flush caches for persistence).

Assise IO path

Write:

  • Writes first go to a process-local cache (the update log) in NVM. (W)
  • On fsync/dsync, the update log is replicated to the cache replicas (remote NVM). (A1, A2)
  • The last replica sends back an ack, at which point the sync call returns.
  • When the update log fills, a digest is initiated. (D) (See the sketch below.)
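
A minimal sketch of this write path, under simplified assumptions: the NVM update log is modeled as an ordinary in-memory buffer, and chain replication is stubbed out with a print. The names (log_append, chain_replicate, digest, my_fsync) are illustrative, not Assise's actual API.

```c
#include <stdio.h>
#include <string.h>

#define LOG_CAP 4096
#define DIGEST_THRESHOLD (LOG_CAP * 3 / 4)

typedef struct {
    char buf[LOG_CAP];   /* stands in for a process-private NVM region */
    size_t head;         /* bytes appended since the last digest */
    size_t synced;       /* bytes already replicated */
} update_log_t;

/* W: writes go to the local NVM update log first; no network I/O here. */
static int log_append(update_log_t *log, const void *data, size_t len) {
    if (log->head + len > LOG_CAP) return -1;  /* caller must sync + digest */
    memcpy(log->buf + log->head, data, len);
    log->head += len;
    return 0;
}

/* A1/A2: on fsync, ship the unsynced log suffix to the cache replicas
 * (in the real system, via RDMA writes in log order); stubbed here. */
static void chain_replicate(const char *data, size_t len) {
    (void)data;
    printf("replicating %zu bytes to cache replicas\n", len);
    /* real system: write to replica 1, which forwards down the chain;
     * the last replica acks, and only then does fsync return. */
}

/* D: when the log fills past a threshold, digest it into SharedFS. */
static void digest(update_log_t *log) {
    printf("digesting %zu bytes into the shared area\n", log->head);
    log->head = log->synced = 0;
}

static void my_fsync(update_log_t *log) {
    chain_replicate(log->buf + log->synced, log->head - log->synced);
    log->synced = log->head;
    if (log->head > DIGEST_THRESHOLD) digest(log);
}

int main(void) {
    update_log_t log = {0};
    log_append(&log, "hello", 5);
    my_fsync(&log);   /* returns only after the last replica acks */
    return 0;
}
```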

Read:

  • First check the local DRAM cache. (R1)
  • If not found, check the hot shared area in SharedFS. (R2)
  • If not found, check the remote replica (via RDMA) and the local SSD in parallel. (R3, R4)
  • Fetched data is also prefetched into local DRAM; evicted DRAM cache entries spill to the NVM update log. (E) (A sketch of this lookup order follows.)
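
A minimal sketch of the read lookup order, with each cache layer stubbed as a hit/miss function. The names are illustrative, and R3/R4 are shown sequentially rather than in parallel for brevity.

```c
#include <stdbool.h>
#include <stdio.h>

typedef unsigned long blk_t;

static bool dram_cache_lookup(blk_t b)   { (void)b; return false; } /* R1 */
static bool sharedfs_hot_lookup(blk_t b) { (void)b; return false; } /* R2 */
static bool remote_nvm_read(blk_t b)     { (void)b; return true;  } /* R3 */
static bool local_ssd_read(blk_t b)      { (void)b; return false; } /* R4 */

static void read_block(blk_t b) {
    if (dram_cache_lookup(b)) return;    /* R1: local DRAM cache */
    if (sharedfs_hot_lookup(b)) return;  /* R2: hot shared area in NVM */
    /* R3/R4: in the real system these two are issued in parallel and the
     * first hit wins; shown sequentially here for brevity. */
    if (remote_nvm_read(b) || local_ssd_read(b)) {
        /* prefetch into DRAM; evicted DRAM entries spill to the NVM
         * update log (E), keeping them close by. */
        printf("block %lu cached in DRAM\n", b);
    }
}

int main(void) { read_block(42); return 0; }
```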


CC-NVM (crash-consistent cache coherence for NVM)

  • Crash consistency with prefix semantics
  • Given a sequence W1, o1, W2, o2, ..., Wn, on (writes Wi interleaved with fsyncs oi).
  • If a crash occurs after fsync oi, the file system recovers to a state with exactly W1, W2, ..., Wi applied.
  • Achieved via the ordering guarantees of (R)DMA: log entries are written to the replicas in order. (A recovery sketch follows.)
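
A toy sketch of prefix-semantics recovery, assuming a simplified log of write and fsync records (the format is mine, not Assise's): replay stops at the last durable fsync marker, so any write after the final fsync is discarded.

```c
#include <stdio.h>

typedef enum { OP_WRITE, OP_FSYNC } kind_t;
typedef struct { kind_t kind; int val; } rec_t;

static void recover(const rec_t *log, int n) {
    int prefix = 0;                      /* records up to the last fsync */
    for (int i = 0; i < n; i++)
        if (log[i].kind == OP_FSYNC) prefix = i + 1;
    for (int i = 0; i < prefix; i++)     /* apply W1..Wi only */
        if (log[i].kind == OP_WRITE)
            printf("apply W%d\n", log[i].val);
}

int main(void) {
    /* W1, o1, W2 (crash before o2): recovery applies W1 but discards W2 */
    rec_t log[] = { {OP_WRITE, 1}, {OP_FSYNC, 0}, {OP_WRITE, 2} };
    recover(log, 3);
    return 0;
}
```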

Sharing with linearizability via leases

  • Leases are similar to reader-writer locks, but can be revoked and expire after a timeout (they can be re-acquired).
  • A lease grants shared read or exclusive write access.
  • LibFS can acquire a lease from SharedFS via a syscall.
    • SharedFS enforces that the process's private update log and cache entries are clean and replicated before a lease is transferred.
    • Lease transfers are also logged and replicated in NVM.
  • Hierarchical access: LibFS -> SharedFS -> cluster manager. (See the sketch after this list.)
    • If the lease is held by another SharedFS, wait for it to be released or to expire.
    • The hierarchy minimizes network communication and thus lease-delegation overhead.
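
A minimal sketch of write-lease acquisition under simplified assumptions: lease state lives in one struct, revocation through the cluster manager is collapsed into a single call, and all names are mine rather than Assise's.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { LEASE_FREE, LEASE_READ, LEASE_WRITE } lease_t;

typedef struct {
    lease_t state;
    int holder;          /* id of the SharedFS currently holding it */
} lease_entry_t;

/* Before a lease moves, the holder's private update log and dirty cache
 * entries must be clean and replicated; the transfer itself is logged
 * and replicated in NVM so it survives crashes. */
static void flush_and_log_transfer(int from, int to) {
    printf("flush holder %d, log lease transfer %d -> %d in NVM\n",
           from, from, to);
}

static bool acquire_write_lease(lease_entry_t *e, int me) {
    if (e->state == LEASE_FREE || e->holder == me) {
        e->state = LEASE_WRITE; e->holder = me;
        return true;     /* local grant: no network communication needed */
    }
    /* Held by another SharedFS: go up the hierarchy (cluster manager),
     * then wait for the holder to flush, or for the lease to time out. */
    flush_and_log_transfer(e->holder, me);
    e->state = LEASE_WRITE; e->holder = me;
    return true;
}

int main(void) {
    lease_entry_t e = { LEASE_WRITE, /*holder=*/1 };
    acquire_write_lease(&e, /*me=*/2);
    return 0;
}
```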


Assise: Recovery and fail-over strategy

  • LibFS recovery:
    • SharedFS evicts the dead process's update log, expires its leases, and restarts the process.
    • The DRAM cache is rebuilt.
  • SharedFS recovery:
    • Restart from a checkpoint; LibFS state can be recovered from the SharedFS log in NVM.
  • Cache-replica fail-over:
    • On power failure, fail over to a cache replica.
    • The replica's SharedFS takes over lease management.
    • Writes during fail-over invalidate the cached data on the failed node; this is tracked with a per-file block bitmap kept on an epoch basis.
  • Node recovery:
    • First recover SharedFS, then invalidate all blocks written since the crash. (See the sketch below.)
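
A toy sketch of the epoch-based invalidation step, assuming a simple one-bit-per-block bitmap; the layout and names are illustrative, not Assise's on-NVM format.

```c
#include <stdio.h>

#define NBLOCKS 64

typedef struct {
    unsigned epoch;
    unsigned char written[NBLOCKS / 8];  /* 1 bit per file block */
} epoch_bitmap_t;

/* Record a write that happened while the failed node was down. */
static void mark_written(epoch_bitmap_t *bm, unsigned blk) {
    bm->written[blk / 8] |= 1u << (blk % 8);
}

/* Node recovery: after SharedFS restarts from its NVM log, drop every
 * cached block that was written since the crash epoch. */
static void invalidate_stale(const epoch_bitmap_t *bm) {
    for (unsigned b = 0; b < NBLOCKS; b++)
        if (bm->written[b / 8] & (1u << (b % 8)))
            printf("invalidate cached block %u (epoch %u)\n", b, bm->epoch);
}

int main(void) {
    epoch_bitmap_t bm = { .epoch = 7 };
    mark_written(&bm, 3);   /* a write that landed during fail-over */
    invalidate_stale(&bm);
    return 0;
}
```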

Key take-aways

  • For a distributed file system, NVM's unique characteristics call for caching and managing data on co-located persistent memory. (This seems to go against the trend of resource disaggregation, but it makes sense because accessing NVM locally is much faster than accessing it via RDMA.)


[1] Is Memory Disaggregation Feasible? A Case Study with Spark SQL. ANCS '16.

[2] Network Requirements for Resource Disaggregation. OSDI '16.
