Please take a moment to star the disk.frame GitHub repo if you like disk.frame. It keeps me going.

Introduction

How can I manipulate structured tabular data that doesn’t fit into Random Access Memory (RAM)? Use {disk.frame}!

In a nutshell, {disk.frame} makes use of two simple ideas:

  1. split up a larger-than-RAM dataset into chunks and store each chunk in a separate file inside a folder
  2. provide a convenient API to manipulate these chunks
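
For example, a minimal sketch of these two ideas (the outdir path below is just an illustration; the chunks are stored as fst files, as described further down):

library(disk.frame)

# idea 1: split the data into chunks, stored as fst files inside a folder
iris.df <- as.disk.frame(iris, outdir = file.path(tempdir(), "iris.df"), overwrite = TRUE)
list.files(file.path(tempdir(), "iris.df"))  # the chunk files

# idea 2: manipulate the chunks through a convenient API
# (here, just counting rows across all chunks)
nrow(iris.df)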

{disk.frame} performs a similar role to distributed systems such as Apache Spark, Python's Dask, and Julia's JuliaDB.jl, but for medium data: datasets that are too large to fit in RAM, yet not so large that they require processing to be distributed over many computers.

Installation

You can install the released version of {disk.frame} from CRAN with:

install.packages("disk.frame")

And the development version from GitHub with:
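
devtools::install_github("xiaodaigh/disk.frame")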

On some platforms, such as SageMaker, you may need to explicitly specify a repo like this

install.packages("disk.frame", repo="https://cran.rstudio.com")

Vignettes and articles

Please see these vignettes and articles about {disk.frame}

Common questions

a) What is {disk.frame} and why create it?

{disk.frame} is an R package that provides a framework for manipulating larger-than-RAM structured tabular data on disk efficiently. The reason one would want to manipulate data on disk is that it allows arbitrarily large datasets to be processed by R. In other words, we go from “R can only deal with data that fits in RAM” to “R can deal with any data that fits on disk”. See the next section.

b) How is it different to data.frame and data.table?

A data.frame in R is an in-memory data structure, which means that R must load the data in its entirety into RAM. A corollary of this is that only data that can fit into RAM can be processed using data.frames. This places significant restrictions on what R can process with minimal hassle.

In contrast, {disk.frame} provides a framework to store and manipulate data on the hard drive. It does this by loading only a small part of the data, called a chunk, into RAM, processing the chunk, writing out the result, and repeating with the next chunk. This chunking strategy is widely used by other packages to enable processing of large amounts of data in R; for example, see chunked, arkdb, and iotools.

Furthermore, there is a row limit of 2^31 for data.frames in R, so an alternative approach is needed to apply R to these large datasets. The chunking mechanism in {disk.frame} provides such an avenue, enabling data manipulation beyond the 2^31 row limit.

c) How is {disk.frame} different to previous “big” data solutions for R?

R has many packages that can deal with larger-than-RAM datasets, including ff and bigmemory. However, ff and bigmemory restrict the user to primitive data types such as double, which means they do not support character (string) and factor types. In contrast, {disk.frame} makes use of data.table::data.table and data.frame directly, so all data types are supported. Also, {disk.frame} strives to provide an API that is as similar as possible to data.frame's, and it supports many dplyr verbs for manipulating disk.frames.

Additionally, {disk.frame} supports parallel data operations using infrastructures provided by the excellent future package to take advantage of multi-core CPUs. Further, {disk.frame} uses state-of-the-art data storage techniques such as fast data compression, and random access to rows and columns provided by the fst package to provide superior data manipulation speeds.

d) How does {disk.frame} work?

{disk.frame} works by breaking large datasets into smaller individual chunks and storing the chunks in fst files inside a folder. Each chunk is a fst file containing a data.frame/data.table. The original large dataset could be reconstructed by loading all the chunks into RAM and row-binding them into one large data.frame; of course, in practice this isn't always possible, which is why we store them as smaller individual chunks.

{disk.frame} makes it easy to manipulate the underlying chunks by implementing dplyr functions/verbs and other convenient functions (e.g. the map.disk.frame(a.disk.frame, fn, lazy = FALSE) function, which applies the function fn to each chunk of a.disk.frame in parallel), so that a disk.frame can be manipulated in much the same way as an in-memory data.frame.
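
For example, a minimal sketch (assuming that, with lazy = FALSE and no output directory, the per-chunk results are returned as a list):

# apply a user-defined function to each chunk in parallel;
# here the function simply counts the rows of each chunk
a.disk.frame <- as.disk.frame(mtcars)
map.disk.frame(a.disk.frame, function(chunk) nrow(chunk), lazy = FALSE)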

e) How is {disk.frame} different from Spark, Dask, and JuliaDB.jl?

Spark is primarily a distributed system that also works on a single machine. Dask is a Python package that is most similar to {disk.frame}, and JuliaDB.jl is a Julia package. All three can distribute work over a cluster of computers. However, {disk.frame} currently cannot distribute data processing over many computers, and is therefore single-machine focused.

In R, one can access Spark via sparklyr, but that requires a Spark cluster to be set up. On the other hand {disk.frame} requires zero-setup apart from running install.packages("disk.frame") or devtools::install_github("xiaodaigh/disk.frame").

Finally, Spark can only apply functions that are implemented for Spark, whereas {disk.frame} can use any function in R including user-defined functions.

Example usage

Set-up {disk.frame}

{disk.frame} works best if it can process multiple data chunks in parallel. The best way to set up {disk.frame} so that each CPU core runs a background worker is by using:
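
library(disk.frame)

# set up one background worker per CPU core
setup_disk.frame()

# (optional) lift the future package's default limit on the size of objects shipped to the workers
options(future.globals.maxSize = Inf)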

setup_disk.frame() sets up background workers equal to the number of CPU cores; please note that, by default, hyper-threaded cores are counted as one, not two.

Alternatively, one may specify the number of workers using setup_disk.frame(workers = n).

Quick-start

To find out where the disk.frame is stored on disk:
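
A minimal sketch, assuming a disk.frame is first created from the nycflights13::flights data (also used in the examples below) and that its folder location is kept in the "path" attribute:

# create a disk.frame; the chunks are written to a folder on disk
flights.df <- as.disk.frame(nycflights13::flights)

# the folder where the chunks are stored
attr(flights.df, "path")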

A number of data.frame functions are implemented for disk.frame
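
For example (continuing with the flights.df created above):

# first few rows
head(flights.df, 2)

# dimensions across all chunks
nrow(flights.df)
ncol(flights.df)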

Example: dplyr verbs

Group by

Starting from {disk.frame} v0.2.2, group_by is supported for a limited set of functions. For example:

result_from_disk.frame = iris %>% 
  as.disk.frame %>% 
  group_by(Species) %>% 
  summarize(
    mean(Petal.Length), 
    sumx = sum(Petal.Length/Sepal.Width), 
    sd(Sepal.Width/ Petal.Length), 
    var(Sepal.Width/ Sepal.Width), 
    l = length(Sepal.Width/ Sepal.Width + 2),
    max(Sepal.Width), 
    min(Sepal.Width), 
    median(Sepal.Width)
    ) %>% 
  collect

The results should be exactly the same as applying the same group-by operations to a data.frame; if not, please report a bug. The functions currently supported inside a group_by-summarize are min, max, mean, sum, median, length, sd, and var. If a function you need is missing, please make a feature request here.

Two-Stage Group by

Because the list of group-by functions is limited, {disk.frame} also supports a two-stage style of grouping, which enables maximum flexibility. The key is to understand that chunk_group_by performs the group-by within each chunk.

flights.df = as.disk.frame(nycflights13::flights)

flights.df %>%
  srckeep(c("year","distance")) %>%  # keep only carrier and distance columns
  chunk_group_by(year) %>% 
  chunk_summarise(sum_dist = sum(distance)) %>% # this sums distance within each chunk
  collect
#> # A tibble: 6 x 2
#>    year sum_dist
#>   <int>    <dbl>
#> 1  2013 57446059
#> 2  2013 59302212
#> 3  2013 56585094
#> 4  2013 58476357
#> 5  2013 59407019
#> 6  2013 59000866

The output above shows the first stage of the two-stage group-by: each of the six chunks produces its own summary row (all for the year 2013), and a second aggregation over these per-chunk results is needed to obtain the final answer.
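
A minimal sketch of completing both stages (assuming dplyr is loaded, so that the collected per-chunk results can be aggregated again with an ordinary group_by/summarise):

flights.df %>%
  srckeep(c("year","distance")) %>%
  chunk_group_by(year) %>%
  chunk_summarise(sum_dist = sum(distance)) %>% # stage 1: sum of distance within each chunk
  collect %>%
  group_by(year) %>%
  summarise(sum_dist = sum(sum_dist))           # stage 2: combine the per-chunk sums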

You can also mix in other dplyr verbs; here is an example using filter.

# filter
pt = proc.time()
df_filtered <-
  flights.df %>% 
  filter(month == 1)
cat("filtering a < 0.1 took: ", data.table::timetaken(pt), "\n")
#> filtering a < 0.1 took:  0.020s elapsed (0.020s cpu)
nrow(df_filtered)
#> [1] 336776

Hard group by

Another way to perform a one-stage group_by is to perform a hard_group_by on a disk.frame. This rechunks the disk.frame by the by columns. It is not recommended for performance reasons, as it can be quite slow to rechunk the chunks on disk.

pt = proc.time()
res1 <- flights.df %>% 
  srckeep(c("month", "dep_delay")) %>% 
  filter(month <= 6) %>% 
  mutate(qtr = ifelse(month <= 3, "Q1", "Q2")) %>% 
  hard_group_by(qtr) %>% # hard_group_by is MUCH SLOWER but avoids a 2nd-stage aggregation
  chunk_summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>% 
  collect
#> Hashing...
#> Hashing...
#> Hashing...
#> Hashing...
#> Hashing...
#> Hashing...
#> Appending disk.frames:
cat("group by took: ", data.table::timetaken(pt), "\n")
#> group by took:  1.020s elapsed (0.260s cpu)

collect(res1)
#> # A tibble: 2 x 2
#>   qtr   avg_delay
#>   <chr>     <dbl>
#> 1 Q1         11.4
#> 2 Q2         15.9

Contributors

This project exists thanks to all the people who contribute.

Current Priorities

The work priorities at this stage are

  1. Bugs
  2. Urgent feature implementations that can improve an awful user-experience
  3. More vignettes covering every aspect of disk.frame
  4. Comprehensive Tests
  5. Comprehensive Documentation
  6. More features

Blogs and other resources

| Title | Author | Date | Description |
| --- | --- | --- | --- |
| https://www.researchgate.net/post/What_is_the_Maximum_size_of_data_that_is_supported_by_R-datamining | Knut Jägersberg | 2019-11-11 | Great answer on using disk.frame |
| {disk.frame} is epic | Bruno Rodriguez | 2019-09-03 | It's about loading a 30G file into {disk.frame} |
| My top 10 R packages for data analytics | Jacky Poon | 2019-09-03 | {disk.frame} was number 3 |
| useR! 2019 presentation video | Dai ZJ | 2019-08-03 | |
| useR! 2019 presentation slides | Dai ZJ | 2019-08-03 | |
| Split-apply-combine for Maximum Likelihood Estimation of a linear model | Bruno Rodriguez | 2019-10-06 | {disk.frame} used in helping to create a maximum likelihood estimation program for linear models |
| Emma goes to useR! 2019 | Emma Vestesson | 2019-07-16 | The first mention of {disk.frame} in a blog post |

Interested in learning {disk.frame} in a structured course?

Please register your interest at:

https://leanpub.com/c/taminglarger-than-ramwithdiskframe

Open Collective

If you like {disk.frame} and want to speed up its development, or if you have a feature request, please consider sponsoring {disk.frame} on Open Collective.

Backers

Thank you to all our backers! [Become a backer]

Sponsors

Support {disk.frame} development by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

Contact me for consulting

Do you need help with machine learning and data science in R, Python, or Julia? I am available for Machine Learning/Data Science/R/Python/Julia consulting! Email me
