Let’s set-up disk.frame

library(disk.frame)

# set-up disk.frame to use multiple workers
if(interactive()) {
  setup_disk.frame()
  # highly recommended, however it is pun into interactive() for CRAN because
  # change user options are not allowed on CRAN
  options(future.globals.maxSize = Inf)  
} else {
  setup_disk.frame(2)
}

One of the most important tasks to perform before using the disk.frame package is to make some disk.frames! There are a few functions to help you do that.

Convert a data.frame to disk.frame

Firstly there is as.disk.frame() which allows you to make a disk.frame from a data.frame, e.g.

flights.df = as.disk.frame(nycflights13::flights)

will convert the nycflights13::flights data.frame to a disk.frame somewhere in tempdir(). To find out the location of the disk.frame use:

attr(flights.df, "path")

You can also specify a location to output the disk.frame to using outdir

flights.df = as.disk.frame(nycflights13::flights, outdir = "some/path.df")

it is recommended that you use .df as the extension for a disk.frame, however this is not an enforced requirement.

However, one of the reasons for disk.frame to exist is to handle larger-than-RAM files, hence as.disk.frame is not all that useful because it can only convert data that can fit into RAM. disk.frame comes with a couple more ways to create disk.frame.

Creating disk.frame from CSVs

The function csv_to_disk.frame can convert CSV files to disk.frame. The most basic usage is

some.df = csv_to_disk.frame("some/path.csv", outdir = "some.df")

this will convert the CSV file "some/path.csv" to a disk.frame.

Multiple CSV files

However, sometimes we have multiple CSV files that you want to read in and row-bind into one large disk.frame. You can do so by supplying a vector of file paths e.g. from the result of list.files

some.df = csv_to_disk.frame(c("some/path/file1.csv", "some/path/file2.csv"))

# or
some.df = csv_to_disk.frame(list.files("some/path"))

Ingesting CSV files chunk-wise

The csv_to_disk.frame(path, ...) function reads the file located at path in full into RAM but sometimes the CSV file may be too large to read in one go, as that would require loading the whole file into RAM. In that case, you can read the files chunk-by-chunk by using the in_chunk_size argument which controls how many rows you read in per chunk

# to read in 1 million (=1e6) rows per chunk
csv_to_disk.frame(path, in_chunk_size = 1e6)

When in_chunk_size is specified, the input file is split into many smaller files using bigreadr’s split file functions. This is generally the fastest way to ingest large CSVs, as the split files can be processed in parallel using all CPU cores. But the disk space requirement is doubled because the split files are as large as the original file. If you run out of disk space, then you must clean R’s temporary folder at tempdir() and choose another chunk_reader e.g. csv_to_disk.frame(..., chunk_reader = "LaF").

Sharding

One of the most important aspects of disk.frame is sharding. One can shard a disk.frame at read time by using the shardby

csv_to_disk.frame(path, shardby = "id")

In the above case, all rows with the same id values will end up in the same chunk.

Just-in-time transformation

Sometimes, one may wish to perform some transformation on the CSV before writing out to disk. One can use the inmapfn argument to do that. The inmapfn name comes from INput MAPping FuNction. The general usage pattern is as follows:

csv_to_disk.frame(file.path(tempdir(), "df.csv"), inmapfn = function(chunk) {
  some_transformation(chunk)
})

As a contrived example, suppose you wish to convert a string into date at read time:

df = data.frame(date_str = c("2019-01-02", "2019-01-02"))

# write the data.frame 
write.csv(df, file.path(tempdir(), "df.csv"))


# this would show that date_str is a string
str(collect(csv_to_disk.frame(file.path(tempdir(), "df.csv")))$date_str)
## chr [1:2] "2019-01-02" "2019-01-02"

# this would show that date_str is a string
df = csv_to_disk.frame(file.path(tempdir(), "df.csv"), inmapfn = function(chunk) {
  # convert to date_str to date format and store as "date"
  chunk[, date := as.Date(date_str, "%Y-%m-%d")]
  chunk[, date_str:=NULL]
})

str(collect(df)$date)
## Date[1:2], format: "2019-01-02" "2019-01-02"

Reading CSVs from zip files

Often, CSV comes zipped in a zip files. You can use the zip_to_disk.frame to convert all CSVs within a zip file

zip_to_disk.frame(path_to_zip_file)

The arguments for zip_to_disk.frame are the same as csv_to_disk.frame’s.

Using add_chunk

What if the method of converting to a disk.frame isn’t implemented in disk.frame yet? One can use some lower level constructs provided by disk.frame to create disk.frames. For example, the add_chunk function can be used to add more chunks to a disk.frame, e.g.

a.df = disk.frame() # create an empty disk.frame
add_chunk(a.df, cars) # adds cars as chunk 1
add_chunk(a.df, cars) # adds cars as chunk 2

Another example of using add_chunk is via readr’s chunked read functions to create a delimited file reader

delimited_to_disk.frame <- function(file, outdir, ...) {
  res.df = disk.frame(outdir, ...)
  readr::read_delim_chunked(file, callback = function(chunk) {
    add_chunk(res.df, chunk)
  }, ...)
  
  res.df
}

delimited_to_disk.frame(path, outdir = "some.df")

The above code uses readr’s read_delim_chunked function to read file and call add_chunk. The problem with this approach is that is it sequential in nature and hence is not able to take advantage of parallelism.

Exploiting the structure of a disk.frame

Of course, a disk.frame is just a folder with many fst files named as 1.fst, 2.fst etc. So one can simply create these fst files and ensure they have the same variable names and put them in a folder.