Key `{disk.frame}` concepts

There are a number of concepts and terminologies that are useful to understand in order to use disk.frame effectively.

What is a `disk.frame` and what are chunks?

A disk.frame is a folder containing fst files named “1.fst”, “2.fst”, “3.fst” etc. Each of the “.fst” file is called a chunk.

Workers and parallelism

Parallelism in disk.frame is achieved using the future package. When performing many tasks, disk.frame uses multiple workers, where each worker is an R session, to perform the tasks in parallel.

It is recommended that you should run the following immediately after library(disk.frame) to set-up multiple workers. For example:

library(disk.frame)
setup_disk.frame()

# this will allow unlimited amount of data to be passed from worker to worker
options(future.globals.maxSize = Inf)

For example, suppose we wish to compute the number of rows for each chunk, we can clearly perform this simultaneously in parallel. The code to do that is

# use only one column is fastest
df[,.N, keep = "first_col"]

or equivalent using the srckeep function

# use only one column is fastest
srckeep(df, "first_col")[,.N, keep = "first_col"]

Say there are n chunks in df, and there are m workers. Then the first m chunks will run chunk[,.N] simultaneously.

To see how many workers are at work, use

# see how many workers are available for work
future::nbrOfWorkers()

How `{disk.frame}` works

When df %>% some_fn %>% collect is callled. The some_fn is applied to each chunk of df. The collect will row-bind the results from some_fn(chunk)together if the returned value of some_fn is a data.frame, or it will return a list containing the results of some_fn.

The session that receives these results is called the main session. In general, we should try to minimise the amount of data passed from the worker sessions back to the main session, because passing data around can be slow.

Also, please note that there is no communication between the workers, except for workers passing data back to the main session.

Key `{disk.frame}` concepts

ZJ

Key `{disk.frame}` concepts

What is a `disk.frame` and what are chunks?

Workers and parallelism

How `{disk.frame}` works

Contents

Key {disk.frame} concepts

ZJ

Key {disk.frame} concepts

What is a disk.frame and what are chunks?

Workers and parallelism

How {disk.frame} works

Contents

Key `{disk.frame}` concepts

Key `{disk.frame}` concepts

What is a `disk.frame` and what are chunks?

How `{disk.frame}` works