# Ingesting data
One of the most important tasks to perform before using the {disk.frame} package is to make some `disk.frame`s! There are a few functions to help you do that. Before we do that, we set up {disk.frame} as usual.
## Setting up
```r
library(disk.frame)

# set up disk.frame to use multiple workers
if(interactive()) {
  setup_disk.frame()
  # highly recommended; however, it is put inside interactive() for CRAN because
  # changing user options is not allowed on CRAN
  options(future.globals.maxSize = Inf)
} else {
  setup_disk.frame(2)
}
```
## `data.frame` to `disk.frame`
Firstly, there is `as.disk.frame()` which allows you to make a `disk.frame` from a `data.frame`, e.g.
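```r
# a minimal sketch; the name flights.df is illustrative, not prescribed
flights.df = as.disk.frame(nycflights13::flights)
```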
will convert the `nycflights13::flights` data.frame to a disk.frame somewhere in `tempdir()`. To find out the location of the disk.frame, use:
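```r
# assuming the result was assigned to flights.df as above; the
# on-disk location is stored as the "path" attribute
attr(flights.df, "path")
```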
You can also specify a location to output the disk.frame to using `outdir`:
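```r
# a sketch; the output path here is illustrative
flights.df = as.disk.frame(nycflights13::flights, outdir = "tmp_flights.df")
```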
It is recommended that you use `.df` as the extension for a disk.frame; however, this is not an enforced requirement.
However, one of the reasons for {disk.frame} to exist is to handle larger-than-RAM files, hence `as.disk.frame()` is not all that useful because it can only convert data that can fit into RAM. {disk.frame} comes with a couple more ways to create `disk.frame`s.
## `disk.frame` from CSVs

The function `csv_to_disk.frame()` can convert CSV files to `disk.frame`s. The most basic usage is
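```r
# a minimal sketch; the path and the name some.df are illustrative
some.df = csv_to_disk.frame("some/path.csv")
```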
this will convert the CSV file `"some/path.csv"` to a disk.frame.
However, sometimes you have multiple CSV files that you want to read in and row-bind into one large disk.frame. You can do so by supplying a vector of file paths, e.g. from the result of `list.files()`:
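```r
# a sketch: row-bind every CSV in a folder into one disk.frame;
# the folder path and file pattern are illustrative
some.df = csv_to_disk.frame(
  list.files("some/folder", pattern = "\\.csv$", full.names = TRUE)
)
```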
The `csv_to_disk.frame(path, ...)` function reads the file located at `path` into RAM in full, but sometimes the CSV file is too large to read in one go. In that case, you can read the file chunk-by-chunk using the `in_chunk_size` argument, which controls how many rows are read per chunk:
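```r
# read the CSV about 1e6 rows at a time; the chunk size is illustrative
some.df = csv_to_disk.frame("some/path.csv", in_chunk_size = 1e6)
```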
When `in_chunk_size` is specified, the input file is split into many smaller files using `bigreadr`'s file-splitting functions. This is generally the fastest way to ingest large CSVs, as the split files can be processed in parallel using all CPU cores. But the disk space requirement is doubled, because the split files are collectively as large as the original file. If you run out of disk space, then you must clean R's temporary folder at `tempdir()` and choose another `chunk_reader`, e.g. `csv_to_disk.frame(..., chunk_reader = "LaF")`.
Sometimes, one may wish to perform some transformation on the CSV before writing out to disk. One can use the `inmapfn` argument to do that. The name `inmapfn` comes from INput MAPping FuNction. The general usage pattern is as follows:
```r
csv_to_disk.frame(file.path(tempdir(), "df.csv"), inmapfn = function(chunk) {
  some_transformation(chunk)
})
```
As a contrived example, suppose you wish to convert a string to a date at read time:
```r
df = data.frame(date_str = c("2019-01-02", "2019-01-02"))

# write the data.frame to a CSV
write.csv(df, file.path(tempdir(), "df.csv"))

# this would show that date_str is read in as a string
str(collect(csv_to_disk.frame(file.path(tempdir(), "df.csv")))$date_str)
## chr [1:2] "2019-01-02" "2019-01-02"

# this would show that date is now stored in the Date format
df = csv_to_disk.frame(file.path(tempdir(), "df.csv"), inmapfn = function(chunk) {
  # convert date_str to the Date format and store it as "date"
  chunk[, date := as.Date(date_str, "%Y-%m-%d")]
  chunk[, date_str := NULL]
})

str(collect(df)$date)
## Date[1:2], format: "2019-01-02" "2019-01-02"
```
Often, CSVs come compressed in zip files. You can use `zip_to_disk.frame()` to convert all CSVs within a zip file:
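```r
# a sketch: converts every CSV inside the zip into a disk.frame;
# both paths are illustrative
zip_to_disk.frame("some/path.zip", outdir = "some.df")
```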
The arguments for `zip_to_disk.frame` are the same as `csv_to_disk.frame`'s.
## `add_chunk`
What if the method of converting to a disk.frame isn't implemented in {disk.frame} yet? One can use some lower-level constructs provided by {disk.frame} to create `disk.frame`s. For example, the `add_chunk()` function can be used to add more chunks to a disk.frame, e.g.
```r
a.df = disk.frame() # create an empty disk.frame
add_chunk(a.df, cars) # adds cars as chunk 1
add_chunk(a.df, cars) # adds cars as chunk 2
```
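To sanity-check the result, one can `collect()` the disk.frame back into RAM; since the built-in `cars` data has 50 rows, the two chunks should row-bind to 100 rows:

```r
# collect() row-binds all chunks into a single in-RAM data.frame,
# so two copies of cars should give 50 * 2 = 100 rows
nrow(collect(a.df))
```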
Another example of using `add_chunk` is via `readr`'s chunked read functions to create a delimited file reader:
```r
delimited_to_disk.frame <- function(file, outdir, ...) {
  res.df = disk.frame(outdir)
  # readr calls the callback with each chunk and its starting row position;
  # read options such as delim can be passed through ...
  readr::read_delim_chunked(file, callback = function(chunk, pos) {
    add_chunk(res.df, chunk)
  }, ...)
  res.df
}

delimited_to_disk.frame(path, outdir = "some.df")
```
The above code uses `readr`'s `read_delim_chunked` function to read `file` and calls `add_chunk` on each chunk. The problem with this approach is that it is sequential in nature and hence not able to take advantage of parallelism.