A hard_group_by is a group by that also reorganizes the chunks so that every unique combination of the `by` variables sits in a single chunk. In other words, all rows that share the same `by` values end up in the same chunk.
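The idea of placing every row with the same `by` value in one chunk can be sketched in base R. This is only an illustration under stated assumptions: `shard_by_hash` is a hypothetical helper, and its toy hash (sum of character codes modulo the chunk count) is not the hash disk.frame uses.

```r
# Hypothetical helper, not part of disk.frame: assign each row to a chunk
# based on a deterministic hash of its grouping value.
shard_by_hash <- function(df, by, nchunks) {
  nchunks <- as.integer(nchunks)
  key <- as.character(df[[by]])
  uniq <- unique(key)
  # Toy hash (an assumption for illustration): sum of code points modulo nchunks
  uniq_chunk <- vapply(
    uniq,
    function(k) sum(utf8ToInt(k)) %% nchunks,
    integer(1)
  ) + 1L
  # Look up the chunk of each row through its key
  chunk_id <- uniq_chunk[match(key, uniq)]
  # Keep all chunk levels so empty chunks still appear in the result
  split(df, factor(chunk_id, levels = seq_len(nchunks)))
}

chunks <- shard_by_hash(iris, "Species", 2)
# Each Species value appears in exactly one element of `chunks`
lapply(chunks, function(ch) unique(as.character(ch$Species)))
```

Because the chunk is a pure function of the key, equal keys can never be split across chunks, although some chunks may end up empty or unbalanced.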

hard_group_by(df, ..., add = FALSE, .drop = FALSE)

# S3 method for data.frame
hard_group_by(df, ..., add = FALSE, .drop = FALSE)

# S3 method for disk.frame
hard_group_by(df, ...,
  outdir = tempfile("tmp_disk_frame_hard_group_by"),
  nchunks = disk.frame::nchunks(df), overwrite = TRUE,
  shardby_function = "hash", sort_splits = NULL, desc_vars = NULL,
  sort_split_sample_size = 100)

Arguments

df

a disk.frame

...

grouping variables

add

same as dplyr::group_by

.drop

same as dplyr::group_by

outdir

the output directory

nchunks

the number of chunks in the output. Defaults to disk.frame::nchunks(df)

overwrite

whether to overwrite the output directory

shardby_function

how rows are split into chunks: "hash" to assign rows by a hash function, or "sort" for semi-sorted chunks

sort_splits

for the "sort" shardby function, a data frame with the split values.

desc_vars

for the "sort" shardby function, the variables to sort descending.

sort_split_sample_size

for the "sort" shardby function, if sort_splits is NULL, the number of rows to sample per chunk for random splits.
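The "sort" option can be pictured as range partitioning: pick split values, then send each row to the chunk whose range contains its value. The base R sketch below is a hedged illustration only. `shard_by_sort` is a hypothetical helper, and it derives split values from quantiles of the full column, whereas the argument descriptions above say disk.frame samples rows per chunk when sort_splits is NULL.

```r
# Hypothetical helper, not part of disk.frame: range-partition rows so that
# chunk i only holds values no larger than those in chunk i + 1.
shard_by_sort <- function(df, by, nchunks) {
  values <- df[[by]]
  # Split values (assumption: quantiles of the whole column, not a sample)
  probs <- seq_len(nchunks - 1) / nchunks
  splits <- unique(quantile(values, probs = probs, names = FALSE, type = 1))
  # Rows with value <= first split go to chunk 1, and so on
  chunk_id <- findInterval(values, splits, left.open = TRUE) + 1L
  split(df, factor(chunk_id, levels = seq_len(nchunks)))
}

chunks <- shard_by_sort(iris, "Sepal.Length", 3)
# Value range covered by each chunk
lapply(chunks, function(ch) range(ch$Sepal.Length))
```

Rows inside a chunk are not sorted here, which is one reading of "semi-sorted": the chunks are ordered relative to each other, and equal values still land in the same chunk.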

Examples

iris.df = as.disk.frame(iris, nchunks = 2)
# group iris.df by Species and ensure rows with the same Species are in the same chunk
iris_hard.df = hard_group_by(iris.df, Species)
#> Hashing...
#> Hashing...
#> Appending disk.frames:
get_chunk(iris_hard.df, 1)
#> # A tibble: 150 x 5
#> # Groups:   Species [3]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>
#>  1          5.1         3.5          1.4         0.2 setosa
#>  2          4.9         3            1.4         0.2 setosa
#>  3          4.7         3.2          1.3         0.2 setosa
#>  4          4.6         3.1          1.5         0.2 setosa
#>  5          5           3.6          1.4         0.2 setosa
#>  6          5.4         3.9          1.7         0.4 setosa
#>  7          4.6         3.4          1.4         0.3 setosa
#>  8          5           3.4          1.5         0.2 setosa
#>  9          4.4         2.9          1.4         0.2 setosa
#> 10          4.9         3.1          1.5         0.1 setosa
#> # ... with 140 more rows
get_chunk(iris_hard.df, 2)
#> Warning: The chunk NA does not exist; returning an empty data.table
#> Null data.table (0 rows and 0 cols)
# clean up
delete(iris.df)
delete(iris_hard.df)