In this paper, we investigate the roles of replication versus repair in
achieving durability in large-scale distributed storage systems. Specifically, we
address two fundamental questions: How does the lifetime of an object depend on
the degree of replication and the rate of repair, and how is lifetime maximized
when there is a constraint on resources? In addition, in real systems, when a
node becomes unavailable, it is uncertain whether the outage is temporary or
permanent; we analyze the use of timeouts as a mechanism to make this
determination. Finally, we explore the importance of memory in repair
mechanisms, and show that, under certain cost conditions, memoryless systems,
which are inherently less complex, perform just as well as systems with memory.
Pre-2018 CSE ID: CS2007-0900