1 Why Multivariate?

imputeTestbench currently excels at benchmarking imputation methods on univariate time series. But real data is rarely univariate. Weather stations record temperature, humidity, and pressure simultaneously; financial datasets include multiple correlated indicators.

To evaluate imputation algorithms realistically, we need to:

Simulate missingness that can affect several variables at once (e.g., a sensor array failure)
Test methods that use cross-variable relationships (e.g., MICE, VAR imputation, mtsdi)
Assess performance per variable — which sensors are hardest to impute?

Adding multivariate support will make imputeTestbench useful for a much wider range of applications.

2 Current Univariate Architecture

The existing pipeline for a single time series (vector or ts) is:

sample_dat() — takes a complete series and returns a list of repetition corrupted copies, each with exactly a given percentage b of NAs (MCAR or block MAR).
impute_errors() — loops over:
- Missing percentages (e.g., 10%, 20%, …)
- Imputation methods (e.g., na.approx, na.interp)
- Repetitions (each corrupted copy from sample_dat)
For each repetition:
- Apply the method to fill NAs → filled
- Compute an error metric (RMSE, MAE, …) between original and filled
- Store the single error value

Storage: Raw errors are kept in errall[[method]][[percentage_index]] as a vector of length repetition.

Final output: An errprof object — a list with Parameter, MissingPercent, and for each method a vector of average errors.

Everything is designed for vectors. To support matrices (rows = time, columns = variables), we must extend it without breaking existing code.

3 Proposed Multivariate Extensions

We will extend the existing functions — not create separate *_mv versions. This keeps the API clean and uses all current features (custom methods, extra arguments, user-defined error functions).

3.1 Input Detection

At the top of both sample_dat and impute_errors:

if (is.matrix(dataIn) || is.data.frame(dataIn)) {
  multivariate <- TRUE
  dataIn <- as.matrix(dataIn)
  n_vars <- ncol(dataIn)
  if (n_vars == 1) multivariate <- FALSE   # treat single column as univariate
} else {
  multivariate <- FALSE
}

All subsequent logic branches on multivariate.

3.2 Missing Data Generation — Extending `sample_dat()`

For multivariate input, sample_dat will return a list of corrupted matrices (one per repetition). New arguments control the missingness mechanisms:

Argument	Description
`mv_mechanism`	One of `"mcar"`, `"mar"`, `"block"`, `"system_blackout"`. Default `"mcar"`.
`mv_mar_func`	A function that takes the full data matrix and returns a probability matrix of the same dimensions (for flexible MAR).
`mv_block_size`	Block length (can be a scalar or a vector of length `n_vars`).
`mv_n_blocks`	Number of blocks per variable (for independent block missingness).
`mv_synchronize`	Logical; if `TRUE`, blocks occur at the same rows for all variables (simulates a system-wide outage).
`mv_protect_ends`	Logical (default `TRUE`) to keep first and last time points observed, matching the univariate behaviour.

3.2.1 How It Works Internally

For each repetition, we generate a logical mask — a matrix of the same dimensions as the data, with TRUE where we want NA. Then we apply it:

corrupted <- data
corrupted[mask] <- NA

The mask is created by helper functions specific to each mechanism:

# MCAR: each cell independently missing with probability b/100
mask <- matrix(runif(n * n_vars) < b / 100, nrow = n, ncol = n_vars)

# System blackout: all variables missing for same contiguous block
mask <- matrix(FALSE, n, n_vars)
start <- sample(2:(n - block_size), 1)   # protect ends
mask[start:(start + block_size - 1), ] <- TRUE

# Independent blocks: each variable gets its own blocks
mask <- matrix(FALSE, n, n_vars)
for (j in 1:n_vars) {
  for (k in 1:n_blocks[j]) {
    start <- sample(1:(n - block_size[j]), 1)
    mask[start:(start + block_size[j] - 1), j] <- TRUE
  }
}

# MAR: user function provides probabilities
prob_matrix <- mv_mar_func(data)        # user-defined
mask <- matrix(runif(n * n_vars) < prob_matrix, nrow = n, ncol = n_vars)

3.2.2 Extended `sample_dat()` Skeleton

sample_dat <- function(datin, ..., mv_mechanism = "mcar",
                       mv_mar_func = NULL,
                       mv_block_size = 10,
                       mv_n_blocks = 1,
                       mv_synchronize = FALSE,
                       mv_protect_ends = TRUE) {

  if (!multivariate) {
    # existing univariate code ...

  } else {
    out <- vector("list", repetition)

    for (i in 1:repetition) {
      mask <- switch(mv_mechanism,
        mcar = matrix(runif(n * n_vars) < b / 100, n, n_vars),
        block = generate_block_mask(
          n, n_vars, mv_block_size, mv_n_blocks,
          mv_synchronize, mv_protect_ends
        ),
        system_blackout = generate_system_blackout_mask(
          n, n_vars, mv_block_size, mv_protect_ends
        ),
        mar = generate_mar_mask(datin, mv_mar_func, ...)
      )

      corrupted <- datin
      corrupted[mask] <- NA
      out[[i]] <- corrupted
    }

    return(out)
  }
}

The helper functions (generate_block_mask, etc.) are implemented in separate internal files with unit tests.

3.3 Integration with `impute_errors()`

The three nested loops remain unchanged. The critical modifications are inside the innermost repetition loop and in the error storage.

3.3.1 Per-Variable Error Computation

errs <- lapply(out, function(y) {   # y is a corrupted matrix
  filled <- eval(parse(text = toeval))  # imputed matrix (same dimensions)

  if (multivariate) {
    # Compute error per variable using existing vector error functions
    err_vec <- numeric(n_vars)
    for (j in 1:n_vars) {
      err_vec[j] <- eval(
        parse(text = paste0(errorParameter, "(dataIn[,j], filled[,j])"))
      )
    }
    return(err_vec)   # vector of length n_vars

  } else {
    # univariate: single error value
    return(eval(parse(text = paste0(errorParameter, "(dataIn, filled)"))))
  }
})

3.3.2 Matrix Storage

After processing all repetitions for a given method and percentage, we combine the list of error vectors into a matrix:

err_matrix <- do.call(rbind, errs)   # dimensions: repetition x n_vars
errall[[method]][[x]] <- err_matrix

Each cell of errall now stores a matrix instead of a vector.

3.4 Final Output Structure

The final summary (the returned errprof object) is built by averaging over repetitions per variable:

out <- lapply(errall, function(method_list) {
  avg_per_perc <- lapply(method_list, colMeans)   # list of vectors
  do.call(rbind, avg_per_perc)                    # matrix: percentages x variables
})

names(out) <- methods
out <- c(list(Parameter = errorParameter, MissingPercent = percs), out)
class(out) <- c("errprof", "list")
attr(out, "errall") <- errall

For each method, the user now gets a matrix where:

Rows = missing percentages
Columns = variables

This immediately shows how a method performs on each variable separately.

3.5 Backward Compatibility

All changes are conditional on input type. If dataIn is a vector or ts, multivariate = FALSE and the original univariate code path is used.
New mv_* arguments are simply ignored for univariate input, so existing scripts continue to work unchanged.
Existing unit tests for univariate data must pass without modification.

4 Performance Optimisation

Multivariate benchmarking can be computationally heavy. We will add an optional parallel argument to impute_errors() (default FALSE). When TRUE, we use the future and future.apply packages to parallelise the outer loop over percentages. This is safe because percentages are independent.

5 Why This Design Is Better

Feature	Benefit
Seamless integration	Users don’t need to learn new functions
Handles repetitions correctly	Returns a list of corrupted matrices, matching the univariate contract
Per-variable error storage	Matrices inside `errall` are a natural extension of vectors
Realistic missingness	Multiple blocks, system blackouts, flexible MAR
Performance built-in	Optional parallelisation directly in the loop
Backward compatible	Existing univariate code unchanged

6 Future Extensions: True Multivariate Imputation Methods

The current proposal focuses on the infrastructure for multivariate missing data simulation and evaluation. The imputation methods themselves (e.g., na.approx, na.interp) are still applied column-wise and do not exploit correlations between variables.

However, the framework we build will be ready to accommodate truly multivariate methods that use cross-variable relationships, such as:

mtsdi — EM-based imputation for multivariate time series
mice with time series adaptations
Amelia — bootstrapping + EM for time-series cross-sectional data
missMDA — PCA-based imputation
VAR models — vector autoregressive imputation

Structured Missing Data Simulation for Multivariate Time Series

Extending imputeTestbench for Multivariate Support

2026-03-14

1 Why Multivariate?

2 Current Univariate Architecture

3 Proposed Multivariate Extensions

3.1 Input Detection

3.2 Missing Data Generation — Extending `sample_dat()`

3.2.1 How It Works Internally

3.2.2 Extended `sample_dat()` Skeleton

3.3 Integration with `impute_errors()`

3.3.1 Per-Variable Error Computation

3.3.2 Matrix Storage

3.4 Final Output Structure

3.5 Backward Compatibility

4 Performance Optimisation

5 Why This Design Is Better

6 Future Extensions: True Multivariate Imputation Methods

Structured Missing Data Simulation for Multivariate Time Series

Extending imputeTestbench for Multivariate Support

2026-03-14

1 Why Multivariate?

2 Current Univariate Architecture

3 Proposed Multivariate Extensions

3.1 Input Detection

3.2 Missing Data Generation — Extending sample_dat()

3.2.1 How It Works Internally

3.2.2 Extended sample_dat() Skeleton

3.3 Integration with impute_errors()

3.3.1 Per-Variable Error Computation

3.3.2 Matrix Storage

3.4 Final Output Structure

3.5 Backward Compatibility

4 Performance Optimisation

5 Why This Design Is Better

6 Future Extensions: True Multivariate Imputation Methods

3.2 Missing Data Generation — Extending `sample_dat()`

3.2.2 Extended `sample_dat()` Skeleton

3.3 Integration with `impute_errors()`