Shards a data.frame/data.table or disk.frame into chunks and saves them as a disk.frame

`distribute` is an alias for `shard`

shard(
  df,
  shardby,
  outdir = tempfile(fileext = ".df"),
  ...,
  nchunks = recommend_nchunks(df),
  overwrite = FALSE,
  shardby_function = "hash",
  sort_splits = NULL,
  desc_vars = NULL
)

distribute(...)

Arguments

df

A data.frame/data.table or disk.frame. If df is a disk.frame, then rechunk(df, ...) is run instead.

shardby

The column(s) to shard the data by.

outdir

The output directory of the disk.frame

...

not used

nchunks

The number of chunks

overwrite

If TRUE then the chunks are overwritten

shardby_function

How rows are split into chunks: "hash" to assign rows by a hash function, or "sort" for semi-sorted chunks.

sort_splits

If shardby_function is "sort", the split values for sharding

desc_vars

For the "sort" shardby function, the variables to sort in descending order.
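
The difference between the two shardby_function options can be illustrated with a minimal base R sketch. This is not disk.frame's implementation: toy_hash is a hypothetical stand-in for a real hash function, and the use of findInterval for the "sort" case is an assumption about how split values such as sort_splits partition a column.

```r
# Toy data: `species` is the key for hash-style sharding,
# `length` is the key for sort-style sharding.
df <- data.frame(
  species = c("setosa", "virginica", "setosa", "versicolor", "virginica"),
  length  = c(5.1, 6.3, 4.9, 5.9, 7.1),
  stringsAsFactors = FALSE
)
nchunks <- 2L

# Hypothetical stand-in for a hash function: sum of the character codes.
toy_hash <- function(x) {
  vapply(x, function(s) sum(utf8ToInt(s)), integer(1), USE.NAMES = FALSE)
}

# "hash"-style assignment: the chunk id depends only on the key value,
# so rows with equal keys always land in the same chunk.
hash_chunk <- toy_hash(df$species) %% nchunks + 1L

# "sort"-style assignment (assumed semantics): the chunk id is the
# interval that the value falls into, given the split points.
sort_splits <- c(5.5)
sort_chunk <- findInterval(df$length, sort_splits) + 1L

# Split the rows into lists of chunks.
chunks_by_hash <- split(df, hash_chunk)
chunks_by_sort <- split(df, sort_chunk)
```

With hash-style assignment, which chunk a key lands in is arbitrary, but all rows sharing that key stay together. With sort-style assignment, each chunk covers a contiguous range of values, so the chunks are ordered relative to one another even if rows within a chunk are not.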

Examples

# shard the iris data.frame by Species so that rows with the same Species are in the same chunk
iris.df = shard(iris, "Species")
#> Hashing...
# clean up iris.df
delete(iris.df)