A hard_group_by is a group_by that also reorganizes the chunks so that every unique combination of the `by` values is contained within one chunk. In other words, all rows that share the same `by` values end up in the same chunk.

hard_group_by(df, ..., .add = FALSE, .drop = FALSE)

# S3 method for data.frame
hard_group_by(df, ..., .add = FALSE, .drop = FALSE)

# S3 method for disk.frame
hard_group_by(
  df,
  ...,
  outdir = tempfile("tmp_disk_frame_hard_group_by"),
  nchunks = disk.frame::nchunks(df),
  overwrite = TRUE,
  shardby_function = "hash",
  sort_splits = NULL,
  desc_vars = NULL,
  sort_split_sample_size = 100
)

Arguments

df

a disk.frame

...

grouping variables

.add

same as dplyr::group_by

.drop

same as dplyr::group_by

outdir

the output directory

nchunks

The number of chunks in the output. Defaults to `disk.frame::nchunks(df)`

overwrite

whether to overwrite the output directory

shardby_function

how rows are split into chunks: "hash" uses a hash function; "sort" produces semi-sorted chunks

sort_splits

for the "sort" shardby function, a data.frame with the split values.

desc_vars

for the "sort" shardby function, the variables to sort descending.

sort_split_sample_size

for the "sort" shardby function, if sort_splits is NULL, the number of rows to sample per chunk to determine random split points.
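To illustrate the "sort"-related arguments above, here is a minimal sketch of sharding with `shardby_function = "sort"` instead of the default "hash". It assumes the disk.frame package is loaded and set up; the argument names are those documented on this page, and leaving `sort_splits` as NULL means split points are sampled from the data using `sort_split_sample_size`.

```r
library(disk.frame)
setup_disk.frame(workers = 1)

iris.df = as.disk.frame(iris, nchunks = 2)

# Shard into semi-sorted chunks by Species. Because sort_splits is NULL,
# split points are estimated by sampling sort_split_sample_size rows per chunk.
iris_sorted.df = hard_group_by(
  iris.df, Species,
  shardby_function = "sort",
  sort_split_sample_size = 100
)

# Every row with the same Species now lives in a single chunk.
nrow(iris_sorted.df)

delete(iris.df)
delete(iris_sorted.df)
```

Compared with "hash", "sort" keeps nearby values of the grouping variables in the same or adjacent chunks, which can help downstream range-based operations.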

Examples

iris.df = as.disk.frame(iris, nchunks = 2)

# group_by iris.df by Species and ensure rows with the same Species are in the same chunk
iris_hard.df = hard_group_by(iris.df, Species)
#> Hashing...
#> Hashing...
#> Appending disk.frames: 

get_chunk(iris_hard.df, 1)
#> # A tibble: 150 x 5
#> # Groups:   Species [3]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # ... with 140 more rows
get_chunk(iris_hard.df, 2)
#> Warning: The chunk NA does not exist; returning an empty data.table
#> Null data.table (0 rows and 0 cols)

# clean up iris.df and iris_hard.df
delete(iris.df)
delete(iris_hard.df)