A hard_group_by is a group_by that also reorganizes the chunks so that every unique combination of the `by` values is contained within one chunk. In other words, all rows that share the same `by` values end up in the same chunk.

hard_group_by(df, ..., .add = FALSE, .drop = FALSE)

# S3 method for data.frame
hard_group_by(df, ..., .add = FALSE, .drop = FALSE)

# S3 method for disk.frame
hard_group_by(
  df,
  ...,
  outdir = tempfile("tmp_disk_frame_hard_group_by"),
  nchunks = disk.frame::nchunks(df),
  overwrite = TRUE,
  shardby_function = "hash",
  sort_splits = NULL,
  desc_vars = NULL,
  sort_split_sample_size = 100
)

Arguments

df

a disk.frame

...

grouping variables

.add

same as dplyr::group_by

.drop

same as dplyr::group_by

outdir

the output directory

nchunks

The number of chunks in the output. Defaults to `disk.frame::nchunks(df)`

overwrite

whether to overwrite the output directory

shardby_function

how rows are split into chunks: "hash" uses a hash function; "sort" produces semi-sorted chunks

sort_splits

for the "sort" shardby function, a data.frame with the split values.

desc_vars

for the "sort" shardby function, the variables to sort descending.

sort_split_sample_size

for the "sort" shardby function, if sort_splits is NULL, the number of rows to sample per chunk to determine random split points.
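To illustrate the "sort"-related arguments above, here is a minimal sketch of sharding with `shardby_function = "sort"` instead of the default "hash". It assumes the disk.frame package is loaded and set up; the argument names are those documented on this page, and leaving `sort_splits` as NULL means split points are sampled from the data using `sort_split_sample_size`.

```r
library(disk.frame)
setup_disk.frame(workers = 1)

iris.df = as.disk.frame(iris, nchunks = 2)

# Shard into semi-sorted chunks by Species. Because sort_splits is NULL,
# split points are estimated by sampling sort_split_sample_size rows per chunk.
iris_sorted.df = hard_group_by(
  iris.df, Species,
  shardby_function = "sort",
  sort_split_sample_size = 100
)

# Every row with the same Species now lives in a single chunk.
nrow(iris_sorted.df)

delete(iris.df)
delete(iris_sorted.df)
```

Compared with "hash", "sort" keeps nearby values of the grouping variables in the same or adjacent chunks, which can help downstream range-based operations.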

Examples

iris.df = as.disk.frame(iris, nchunks = 2)

# group_by iris.df by Species and ensure rows with the same Species are in the same chunk
iris_hard.df = hard_group_by(iris.df, Species)
#> Hashing...
#> Hashing...
#> Appending disk.frames: 

get_chunk(iris_hard.df, 1)
#> # A tibble: 150 x 5
#> # Groups:   Species [3]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # ... with 140 more rows
get_chunk(iris_hard.df, 2)
#> Warning: The chunk NA does not exist; returning an empty data.table
#> Null data.table (0 rows and 0 cols)

# clean up iris.df and iris_hard.df
delete(iris.df)
delete(iris_hard.df)