Data Deprecation in an R Package

2020-05-08

After my recent article on marking deprecated code in R, I ran into the same problem with the data of an R package. Unfortunately, I was not able to find a convenient out-of-the-box solution (see this question on SO).

So, after a first draft, I settled on a process that seems to be a reasonable, good-enough solution.

Move the data file

The first step is to move the data file from its default location, ./data, to another location so that it is no longer loaded automatically, even lazily.

Note: For this article I have created a dataset called my_data that I will deprecate as an example (see the Example section for a guideline on how to create it).

The target location is ./data-raw, a directory used by convention to store the scripts and raw data needed to update or reproduce the exported dataset. More on that in the Data chapter of the book R Packages.

By convention, I use a leg_ prefix to flag the file as a legacy dataset.

$ mv ./data/my_data.rda ./data-raw/leg_my_data.rda
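The same move can be done from R, which is handy if you want the step recorded in a script. This is a minimal sketch using base R only; the helper name `move_to_legacy()` and its arguments are hypothetical and not part of any package.

```r
# Hypothetical helper: move data/<name>.rda to data-raw/leg_<name>.rda.
move_to_legacy <- function(name, pkg_dir = ".") {
  from   <- file.path(pkg_dir, "data", paste0(name, ".rda"))
  to_dir <- file.path(pkg_dir, "data-raw")
  to     <- file.path(to_dir, paste0("leg_", name, ".rda"))

  stopifnot(file.exists(from))
  # Create data-raw/ if the package does not have it yet
  dir.create(to_dir, showWarnings = FALSE, recursive = TRUE)
  # Refuse to silently overwrite an existing legacy file
  if (file.exists(to)) stop("Legacy file already exists: ", to)

  if (!file.rename(from, to)) stop("Could not move ", from)
  invisible(to)
}

# Usage, run from the package root:
# move_to_legacy("my_data")
```

The check on the target file is a design choice: a legacy file is meant to be a frozen copy, so overwriting it by accident would defeat its purpose.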

Write a script to transform the dataset

The code used to transform the dataset from its legacy to its new format is stored along with the legacy dataset in ./data-raw/my_data.R. This will make the whole process reproducible.

# my_data new version

library(tidyverse)

# Load legacy data -----
load("data-raw/leg_my_data.rda")
leg_my_data <- my_data

# Create the new dataset -----
# Apply every required change here
my_data <- leg_my_data %>%
  rename(categ = cat) %>%  # rename() takes new_name = old_name
  arrange(categ)

# Write the new dataset ----
usethis::use_data(my_data, overwrite = TRUE, compress = 'xz')

Source the file and you’re good: the new version is live!

source('./data-raw/my_data.R', echo=TRUE)

# ✓ Saving 'my_data' to 'data/my_data.rda'
# ● Document your data (see 'https://r-pkgs.org/data.html')

my_data

# A tibble: 10 x 2
#   categ   val
#   <fct> <int>
# 1 a         9
# 2 a         6
# 3 a         4

Secret sauce

In the ./R/my_package-package.R file, create a legacy_mode function. It gives users a way to load the previous (legacy) version of the datasets if they need them for compatibility reasons.

#' Load legacy version of datasets.
#'
#' Load legacy (previous) version of all the datasets for compatibility reason.
#' The environment where data will be loaded can be chosen.
#'
#' @param envdir the environment where the data should be loaded.
#' @param verbose should item names be printed during loading?
#'
#' @export
#'
#' @examples
#' \dontrun{
#' # Default version
#' head(my_data, 3)
#'
#' # A tibble: 3 x 2
#' # categ   val
#' # <fct> <int>
#' # 1 a         9
#' # 2 a         6
#' # 3 a         4
#'
#' # Activate the compatibility mode
#' legacy_mode()
#'
#' # Loading objects:
#' #  my_data
#' # Warning message:
#' # This function replaces datasets with the previous (legacy) version for compatibility reason
#'
#' # Back to legacy (previous) version
#' head(my_data, 3)
#'
#' # A tibble: 3 x 2
#' # cat     val
#' # <fct> <int>
#' # 1 a         9
#' # 2 c         2
#' # 3 b         3
#' }
legacy_mode <- function(envdir = parent.frame(), verbose = TRUE) {
  .Deprecated(msg = "This function replaces datasets with the previous (legacy) version for compatibility reason")
  # TODO: To be improved to load a subset of datasets
  paths <- sort(Sys.glob(c("data-raw/leg_*.rda", "data-raw/leg_*.RData")))
  for (i in seq_along(paths)) {  # seq_along() is safe when no file is found
    load(paths[i], envir = envdir, verbose = verbose)
  }
}
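One thing to keep in mind: the `data-raw/` paths above are relative to the working directory, so the function finds the files when it is called from the package source directory. A variant that takes the directory as an argument makes this explicit and also copes with the case where no legacy file is found. This is only a sketch: `load_legacy_files()` and its `dir` argument are hypothetical names, and shipping the legacy files inside the installed package (for example under `inst/legacy`, retrieved with `system.file()`) is an assumption, not something the setup described here does.

```r
# Hypothetical variant of legacy_mode() with an explicit directory.
load_legacy_files <- function(dir = "data-raw",
                              envir = parent.frame(),
                              verbose = TRUE) {
  force(envir)  # resolve the caller's environment before going further
  # Match leg_*.rda and leg_*.RData in the given directory
  paths <- sort(list.files(dir, pattern = "^leg_.*\\.(rda|RData)$",
                           full.names = TRUE))
  if (length(paths) == 0) {
    warning("No legacy dataset found in '", dir, "'")
    return(invisible(character(0)))
  }
  # load() returns the names of the objects it created
  loaded <- unlist(lapply(paths, load, envir = envir, verbose = verbose))
  invisible(loaded)
}

# Possible call for an installed package, assuming the files were
# copied to inst/legacy (an assumption, see above):
# load_legacy_files(system.file("legacy", package = "my_package"))
```

Returning the names of the loaded objects invisibly lets the caller check what was replaced without cluttering the console.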

Result

You now have access to both versions: the new version of the dataset is available by default, and the legacy version can be loaded when needed for compatibility reasons. When the legacy data is loaded, a proper deprecation message is displayed.

# The current version
head(my_data, 3)

# A tibble: 3 x 2
#   categ   val
#   <fct> <int>
# 1 a         9
# 2 a         6
# 3 a         4

# Activation of the legacy mode
legacy_mode()

# Loading objects:
#   my_data
# Warning message:
# This function replaces datasets with the previous (legacy) version for compatibility reason

# Legacy version
head(my_data, 3)

# A tibble: 3 x 2
#   cat     val
#   <fct> <int>
# 1 a         9
# 2 c         2
# 3 b         3
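Because `legacy_mode()` accepts an `envdir` argument, the legacy version does not have to replace the current one. The sketch below shows the underlying idea with base R only, loading a legacy file into its own environment. The temporary file and the small data frames are made up so that the example runs on its own.

```r
# Made-up legacy version, saved to a temporary file
my_data <- data.frame(cat = c("a", "c"), val = c(9L, 2L))
legacy_file <- tempfile(fileext = ".rda")
save(my_data, file = legacy_file)

# Made-up current version, which replaces the object in the workspace
my_data <- data.frame(categ = c("a", "c"), val = c(9L, 2L))

# Load the legacy object into a dedicated environment
legacy <- new.env()
load(legacy_file, envir = legacy)

names(my_data)         # current version: "categ" "val"
names(legacy$my_data)  # legacy version:  "cat"   "val"
```

With the function from this article, the equivalent call would be `legacy_mode(envdir = legacy)`, after which both versions can be compared side by side.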

Do not forget to document your changes by updating the dataset documentation in R/my_data.R. You can mention the legacy mode in a note.
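As an illustration, the updated documentation could look like the following sketch. The wording of the column description and of the note is only a suggestion; the `@note` tag and the `\code{\link{}}` markup are standard roxygen2/Rd features.

```r
#' An example dataset
#'
#' This is an example dataset to illustrate the deprecation process.
#'
#' @docType data
#'
#' @format A tibble with 10 rows and 2 columns
#' \describe{
#'   \item{categ}{A dummy category (named \code{cat} in the legacy version)}
#'   \item{val}{A dummy value}
#' }
#'
#' @note The previous version of this dataset can be restored with
#'   \code{\link{legacy_mode}} for compatibility reasons.
#'
#' @rdname my_data
#'
"my_data"
```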

Notes

Example

Here is a way to create the example dataset used in this article.

Create and save the dataset

# A test data frame
library(tibble)
set.seed(2)
my_data <- tibble(cat = factor(sample(c("a", "b", "c", "d"), 10, replace = TRUE)),
                  val = sample(1:10))

# Writing the dataset
usethis::use_data(my_data, overwrite = TRUE, compress = 'xz')

# ✓ Saving 'my_data' to 'data/my_data.rda'
# ● Document your data (see 'https://r-pkgs.org/data.html')

Document and export the dataset

Document the dataset by creating the following R script R/my_data.R.

#' An example dataset
#'
#' This is an example dataset to illustrate the deprecation process.
#'
#' @docType data
#'
#' @format A tibble with 10 rows and 2 columns
#' \describe{
#'   \item{cat}{A dummy category}
#'   \item{val}{A dummy value}
#' }
#'
#' @rdname my_data
#'
#' @examples
#' head(my_data)
#'
"my_data"

Document and export it.

devtools::document(roclets = c('rd', 'collate', 'namespace'))

# Writing my_data.Rd
# Writing NAMESPACE

It can now be seen in the package and used directly.

data(package="my_package")

# Data sets in package ‘my_package’:
# my_data         An example dataset

my_data

# A tibble: 10 x 2
#   cat     val
#   <fct> <int>
# 1 a         9
# 2 c         2