Home Storage Costs
I built a NAS, now was that worth it?
1. Background
It is incredibly easy to delude yourself into believing that something you want to be true actually is. I want assembling a NAS with way more storage than I realistically need (even for my whole family) to be a sensible decision, but let's interrogate whether it's actually worth it.

To compare with other cloud storage offerings, the calculation is really quite straightforward:
\[ \text{\$/TB/year} = \frac{\text{running cost (AUD)}}{\text{TB} \times \text{period (in years)}} \]
The tricky bit is working out an accurate running cost and the relevant period. To do this we need to consider both fixed and ongoing costs. For the sake of this calculation, I'll consider all non-drive hardware to be fixed (up to a point), and drives and electricity to be ongoing costs.
This is simplified by my choice to use a NAS enclosure, leaving just boot/cache drives and memory as the only hardware to acquire beside bulk storage.
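To make the comparison mechanical, here's a trivial helper implementing the formula above; the figures in the example call are placeholders, not my actual build.
julia
# $/TB/year given the total running cost (AUD), usable capacity (TB), and period (years).
cost_per_tb_year(cost_aud, capacity_tb, years) = cost_aud / (capacity_tb * years)

cost_per_tb_year(3000, 72, 5)  # placeholder figures: a $3000 outlay, 72 TB usable, over 5 years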
2. Drive lifespan
The WD Ultrastar HC550 WUH721818ALE6L4 drives that I'll be using today belong to a well-known series of enterprise drives. According to WD, they have an AFR of 0.35% and an MTBF of 2.5 million hours (about 285 years). From some cursory reading on the internet, I've been cautioned that an MTBF of around 300 years should not be interpreted as an individual drive lasting 300 years, but rather that in an array of 300 drives you would expect one drive to fail per year on average.
Attribute | Value |
---|---|
AFR | 0.35% |
MTBF | 285 years |
Workload | 550 TB/year |
Power consumption | 6W |
Operating temperature | 5°C to 60°C |
On that basis (dividing the MTBF by the number of drives in the array), we might expect around 48 years until the first drive failure (with new drives), but I think we're too close to the single-drive case for the MTBF to be meaningful here. Instead, we can resort to the AFR. In fact, due to the popularity of this drive series, we can use some data from Backblaze.
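For the record, the arithmetic behind that figure looks like this, assuming a hypothetical six-drive array (which is what reproduces the 48-year number):
julia
mtbf_hours = 2.5e6
mtbf_years = mtbf_hours / (24 * 365)  # ≈ 285 years
mtbf_years / 6                        # ≈ 48 years until the first failure across six drives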
They added the 16TB sibling (the WUH721816ALE6L4) to their drive pool in 2022, and observed an AFR of 0.36% in that year (at an average age of 5 months), followed by an AFR of 0.33% (12 months in), and 0.30% (21 months in). So far this seems reasonably close to WD's stated AFR of 0.35%.
It is not realistic to expect the AFR to stay around 0.35% though, since we know that failure rates go up over time. The older 14TB sibling (WUH721414ALE6L4) had an AFR of 0.45% at an average age of 48 months. Backblaze also makes reference to failure rates generally ticking up after about five years. This is often referred to as a "bathtub curve" (though in 2021 Backblaze was finding a shape more like a hockey stick). This bathtub is simplistically modelled as three Weibull curves with scale \(\lambda = \text{MTBF}\) (so that the constant middle section of the curve sits at \(1/\text{MTBF}\)): a \(k_1 < 1\) curve for initial failures, a \(k_2 = 1\) curve for random failures during the standard lifetime, and a \(k_3 > 1\) curve for wear-out failures.
\begin{equation*} \tag{Weibull trio} f(t) = \begin{cases} \frac{k_1}{\lambda_1} \cdot {\left( \frac{t}{\lambda_1} \right)}^{k_1-1} \quad & 0 < t < \tau_1 \\ \frac{k_2}{\lambda_2} & \tau_1 \leq t < \tau_2 \\ \frac{k_3}{\lambda_3} \cdot {\left( \frac{t - \tau_2}{\lambda_3} \right)}^{k_3-1} & \tau_2 \leq t \end{cases} \end{equation*}

If I look for recent literature on drive failures, I see paper titles like New Weibull Log-Logistic grey forecasting model for a hard disk drive failures2 (with an abstract that talks about multiple interaction effects and dependent variables with a non-linear relationship), which suggests to me that while one certainly can do better, the Weibull is still the appropriate basic distribution to use.
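To get a feel for the shape, here's a small sketch of that piecewise curve in Julia. The breakpoints \(\tau_1, \tau_2\) and the shapes \(k_1, k_3\) are illustrative guesses rather than fitted values, and a single scale of 285 years is used for all three pieces.
julia
# Piecewise "bathtub" failure-rate curve following the Weibull trio above.
# τ1, τ2, k1 and k3 are illustrative guesses, not fitted values.
function bathtub_rate(t; λ = 285.0, τ1 = 1.0, τ2 = 5.0, k1 = 0.5, k3 = 2.0)
    if t < τ1
        (k1 / λ) * (t / λ)^(k1 - 1)         # early "infant mortality" failures
    elseif t < τ2
        1 / λ                               # constant random-failure region (k₂ = 1)
    else
        (k3 / λ) * ((t - τ2) / λ)^(k3 - 1)  # wear-out failures
    end
end

bathtub_rate.(0.5:0.5:8)  # failure rate (per drive-year) over the first eight years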
Instead of this "Weibull trio" formulation though, I think we can do a bit better by using the Weibull log-logistic mixture from The Weibull log-logistic mixture distributions: Model, theory and application to lifetime data, which uses the form:
\begin{align*} f(t, \psi) &= p \, f_1(t, \psi_1) \; + \; (1 - p) \, f_2(t, \psi_2) \qquad && 0 < p < 1 \\ f_1(t, \psi_1) &= \left( \frac{\beta_1}{\alpha_1} \right) {\left( \frac{t}{\alpha_1} \right)}^{\beta_1 - 1} e^{- {\left( \frac{t}{\alpha_1} \right)}^{\beta_1}} \qquad && 0 < t, \; 0 < \alpha_1, \; 0 < \beta_1 \\ f_2(t, \psi_2) &= \frac{\left( \frac{\beta_2}{\alpha_2} \right) {\left( \frac{t}{\alpha_2} \right)}^{\beta_2 - 1}}{{\left( 1 + {\left( \frac{t}{\alpha_2} \right)}^{\beta_2} \right)}^2} \qquad && 0 < t, \; 0 < \alpha_2, \; 0 < \beta_2 \end{align*}

The parameters \(p, \alpha_1, \beta_1, \alpha_2, \beta_2\) can be numerically estimated using MLE together with a root-finding method. In the paper, the 2013–2019 Backblaze data is used to fit a censored version of the Weibull log-logistic mixture model (with \(t\) as the drive age in years), arriving at these parameter estimates:
- \(\hat{p} = 0.31\)
- \(\hat{\beta_1} = 5.24\)
- \(\hat{\alpha_1} = 5.96\)
- \(\hat{\beta_2} = 4.54\)
- \(\hat{\alpha_2} = 2.22\)
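Since we'll want to evaluate this density later, here's a minimal Julia transcription of the formulas above, plugged with the paper's estimates (the function and variable names are my own):
julia
# Weibull and log-logistic component densities, matching the formulas above.
weibull_pdf(t, α, β)     = (β / α) * (t / α)^(β - 1) * exp(-(t / α)^β)
loglogistic_pdf(t, α, β) = (β / α) * (t / α)^(β - 1) / (1 + (t / α)^β)^2

# The mixture density f(t, ψ).
mixture_pdf(t; p, α1, β1, α2, β2) =
    p * weibull_pdf(t, α1, β1) + (1 - p) * loglogistic_pdf(t, α2, β2)

# Evaluated with the paper's 2013–2019 estimates over the first ten years.
paper_params = (p = 0.31, α1 = 5.96, β1 = 5.24, α2 = 2.22, β2 = 4.54)
mixture_pdf.(0.5:0.5:10; paper_params...)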
To consider whether the 2013–2019 parameter estimates will be pessimistic, realistic, or optimistic, we can once again return to Backblaze. In that same 2021 report, Backblaze also found that newer drives were becoming more reliable: while in 2013 only 50% of drives lasted more than six years, by 2021 this had increased to 88%. We can also see that in 2024 WDC drives had the lowest failure rate of the various brands Backblaze monitors. The oldest low-failure-rate drives that Backblaze had were from HGST (an AFR of 0.39% at an average age of 96 months for the HMS5C4040BLE640), who produced the Ultrastar line (the one I've purchased) before WD bought them out. Backblaze's 80-month-old 8TB Ultrastars (HUH728080ALE600) had an AFR of 1.14%, while their 43-month-old 12TB Ultrastars (HUH721212ALE604) had an AFR of 1.35%. Considering all of this, I expect that using the 2013–2019 parameters will severely under-estimate the lifespan of my drives.
We can use more recent Backblaze data to re-estimate these parameters, and then adjust the scale parameters (\(\alpha_1, \; \alpha_2\)) using the WDC/HGST helium drive data (assuming that each cohort of hard-drive technology shares the same failure-distribution shape, with the main difference being in scale).
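To sketch what that re-estimation could look like, here's a rough censored maximum-likelihood fit using Optim.jl (rather than the root-finding approach of the paper). It reuses the weibull_pdf/loglogistic_pdf/mixture_pdf definitions from above, and assumes the processed data ends up as two vectors of drive ages in years: failure_ages for drives that failed, and censored_ages for drives still running (or removed) at the end of observation.
julia
using Optim

# Survival functions for the two components (the densities were defined earlier).
weibull_sf(t, α, β)     = exp(-(t / α)^β)
loglogistic_sf(t, α, β) = 1 / (1 + (t / α)^β)
mixture_sf(t; p, α1, β1, α2, β2) =
    p * weibull_sf(t, α1, β1) + (1 - p) * loglogistic_sf(t, α2, β2)

# Right-censored negative log-likelihood: failed drives contribute the density,
# still-running drives contribute the survival function. The parameters are
# optimised on an unconstrained scale (logit for p, log for the rest).
function negloglik(θ, failure_ages, censored_ages)
    ψ = (p = 1 / (1 + exp(-θ[1])), α1 = exp(θ[2]), β1 = exp(θ[3]),
         α2 = exp(θ[4]), β2 = exp(θ[5]))
    -(sum(t -> log(mixture_pdf(t; ψ...)), failure_ages) +
      sum(t -> log(mixture_sf(t; ψ...)), censored_ages))
end

# fit = optimize(θ -> negloglik(θ, failure_ages, censored_ages), zeros(5), NelderMead())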
2.1. Data processing
In order to fit the model, we'll first need to pull all the relevant information from Backblaze. Thankfully we can find data going all the way back to 2013, split by quarter at https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data.
There's a good package for managing data in Julia that I happen to be familiar with called DataToolkit.jl, which we can use to create the relevant datasets for us.
julia
using DataToolkit

# Create a data collection backed by Data.toml, with the cache, defaults,
# memorise, and versions plugins enabled, and automatic checksums by default.
dcol = DataToolkit.create!(DataCollection, "HDD Lifespans", "Data.toml",
                           plugins = ["cache", "defaults", "memorise", "versions"])
DataToolkit.config_set!(dcol, "config.defaults.storage._.checksum" => "auto")

# Backblaze published yearly archives for 2013 and 2014, then quarterly ones.
dataperiods = vcat([(2013, 0), (2014, 0)],
                   collect(Iterators.flatten(tuple.(y, 1:4) for y in 2015:2024)),
                   [(2025, 1)])

# One versioned dataset per snapshot, fetched from Backblaze and loaded via drivestatuses.jl.
dsets = DataSet[]
for (year, quarter) in dataperiods
    dset = DataToolkit.create!(dcol, DataSet, "BackblazeDriveSnapshot",
                               "description" => "Backblaze HDD reliability data",
                               "version" => "$year.$quarter")
    url = if quarter == 0
        "https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_$year.zip"
    else
        "https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q$(quarter)_$(year).zip"
    end
    DataToolkit.create!(dset, DataToolkit.DataStorage, :web, "url" => url)
    DataToolkit.create!(dset, DataToolkit.DataLoader, :julia,
                        "input" => DataToolkit.FilePath, "path" => "drivestatuses.jl")
    push!(dsets, dset)
end

# Bundle all of the per-period datasets into a single convenience dataset.
let dall = DataToolkit.create!(dcol, DataSet, "BackblazeDriveSnapshots")
    DataToolkit.create!(dall, DataToolkit.DataStorage, :raw, "value" => dsets)
    DataToolkit.create!(dall, DataToolkit.DataLoader, :passthrough)
end

DataToolkit.save!(dcol)
julia
# drivestatuses.jl: the :julia loader script referenced above. It evaluates to
# an anonymous function that takes a snapshot zip and returns the drives first
# seen, failed, and (silently) removed over that period.
using ZipArchives
using Mmap
using CSV
using Dates

function(zipfile::String)
    zreader = ZipReader(read(zipfile))
    knowndrives = Dict{UInt, @NamedTuple{firstseen::Date, model::String, serial_number::String}}()
    faileddrives = Dict{UInt, @NamedTuple{failed::Date, model::String, serial_number::String}}()
    removeddrives = Dict{UInt, @NamedTuple{removed::Date, model::String, serial_number::String}}()
    for dayfile in zip_names(zreader)
        daydata = CSV.File(zip_readentry(zreader, dayfile), stringtype = String)
        seendrives = Set{UInt}()
        for row in daydata
            # Identify each drive by a hash of its serial number and model.
            driveid = hash(row.serial_number, hash(row.model))
            push!(seendrives, driveid)
            if !iszero(row.failure)
                faileddrives[driveid] = (failed = row.date, model = row.model, serial_number = row.serial_number)
                continue
            end
            # A drive that reappears was evidently not removed after all.
            if haskey(removeddrives, driveid)
                delete!(removeddrives, driveid)
            end
            haskey(knowndrives, driveid) && continue
            knowndrives[driveid] = (firstseen = row.date, model = row.model, serial_number = row.serial_number)
        end
        # Known drives absent from today's report (and not failed) are
        # provisionally treated as removed on this day.
        day = parse(Date, first(splitext(basename(dayfile))))
        for (driveid, (; model, serial_number)) in knowndrives
            driveid ∈ seendrives && continue
            haskey(faileddrives, driveid) && continue
            removeddrives[driveid] = (; removed = day, model, serial_number)
        end
    end
    (known = sort!(collect(values(knowndrives)), by = d -> d.firstseen),
     failures = sort!(collect(values(faileddrives)), by = d -> d.failed),
     removals = sort!(collect(values(removeddrives)), by = d -> d.removed))
end
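As a quick sanity check outside of DataToolkit, the script can be exercised directly on a single downloaded snapshot, since include returns the anonymous function the file evaluates to (the zip name below is just an example):
julia
process_snapshot = include("drivestatuses.jl")   # the file evaluates to a function
snapshot = process_snapshot("data_Q1_2025.zip")  # any locally-downloaded Backblaze archive
length(snapshot.known), length(snapshot.failures), length(snapshot.removals)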
2.2. Model fitting
3. Acronyms
- AFR: Annualised Failure Rate
- MTBF: Mean Time Between Failures
Footnotes:
The state of DDR5 memory is really not great at the moment: ECC costs twice as much as non-ECC (and ECC would make sense here, since the NAS uses an 8845HS PRO chip, which supports it), and I can get second-hand DDR4 ECC for half the price of non-ECC DDR5. 🥲
For the bulk storage itself, I'll be using second-hand WD drives. Between full-price drives for ~$700 (5y warranty), refurbished drives for ~$450 (3y warranty), and pulled drives for ~$320 (1y warranty, 2-3y old), I have a feeling that the lifespan-for-price trade-off works out in favour of the pulled drives. We'll see if this is actually borne out in the calculations later.
Without applying the modelling described in that paper, I think it's worth mentioning Tables 2 and 3, which found an association between certain SMART attributes being non-zero and the probability of disk failure.
Non-zero SMART | Name | Share of live drives | Share of failed drives |
---|---|---|---|
SMART5 | Reallocated Sectors Count | 1.1% | 42.2% |
SMART187 | Reported Uncorrectable Errors | 0.5% | 43.5% |
SMART188 | Command Timeout | 4.8% | 44.8% |
SMART197 | Current Pending Sector Count | 0.7% | 43.1% |
SMART198 | Uncorrectable Sector Count | 0.3% | 33.0% |
Notably, when all of these attributes are non-zero the chance of failure is around 77%.
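Those attributes are straightforward to watch for in the daily Backblaze CSVs. A small sketch, assuming the standard smart_<id>_raw column naming and using DataFrames.jl for convenience (the file name is just an example):
julia
using CSV, DataFrames

# SMART attributes associated with elevated failure probability (per the table above).
watchlist = [:smart_5_raw, :smart_187_raw, :smart_188_raw, :smart_197_raw, :smart_198_raw]

day = CSV.read("2025-01-01.csv", DataFrame)  # one daily file from a Backblaze snapshot
suspect = filter(row -> any(c -> !ismissing(row[c]) && row[c] > 0, watchlist), day)
select(suspect, :serial_number, :model, watchlist...)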