A hard_group_by is a group by that also reorganizes the chunks to ensure that every unique combination of the `by` values is in the same chunk. In other words, all rows that share the same `by` values end up in the same chunk.
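This makes a subsequent chunk-wise summary safe, because no group can straddle two chunks. A minimal sketch of that workflow, assuming disk.frame's dplyr interface applies summarise chunk by chunk:

library(disk.frame)
library(dplyr)

iris.df = as.disk.frame(iris, nchunks = 2)

# after hard_group_by, every row of a given Species lives in exactly one
# chunk, so a per-chunk summarise cannot split a group across chunks
iris_hard.df = hard_group_by(iris.df, Species)

iris_hard.df %>%
  summarise(mean_petal = mean(Petal.Length)) %>%
  collect()

delete(iris.df)
delete(iris_hard.df)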
hard_group_by(df, ..., .add = FALSE, .drop = FALSE)
# S3 method for data.frame
hard_group_by(df, ..., .add = FALSE, .drop = FALSE)
# S3 method for disk.frame
hard_group_by(
  df,
  ...,
  outdir = tempfile("tmp_disk_frame_hard_group_by"),
  nchunks = disk.frame::nchunks(df),
  overwrite = TRUE,
  shardby_function = "hash",
  sort_splits = NULL,
  desc_vars = NULL,
  sort_split_sample_size = 100
)
df: a disk.frame
...: grouping variables
.add: same as dplyr::group_by
.drop: same as dplyr::group_by
outdir: the output directory
nchunks: the number of chunks in the output; defaults to nchunks.disk.frame(df)
overwrite: whether to overwrite the output directory
shardby_function: how to split the chunks: "hash" for hash-based sharding or "sort" for semi-sorted chunks
sort_splits: for the "sort" shardby function, a data.frame with the split values
desc_vars: for the "sort" shardby function, the variables to sort in descending order
sort_split_sample_size: for the "sort" shardby function, if sort_splits is NULL, the number of rows to sample per chunk to estimate the random splits
iris.df = as.disk.frame(iris, nchunks = 2)
# group_by iris.df by Species and ensure rows with the same Species are in the same chunk
iris_hard.df = hard_group_by(iris.df, Species)
#> Hashing...
#> Hashing...
#> Appending disk.frames:
get_chunk(iris_hard.df, 1)
#> # A tibble: 150 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 140 more rows
get_chunk(iris_hard.df, 2)
#> Warning: The chunk NA does not exist; returning an empty data.table
#> Null data.table (0 rows and 0 cols)
# clean up iris.df and iris_hard.df
delete(iris.df)
delete(iris_hard.df)