A hard_arrange is a sort by that also reorganizes the chunks to ensure that every unique grouping of `by`` is in the same chunk. Or in other words, every row that share the same `by` value will end up in the same chunk.
a disk.frame
grouping variables
same as dplyr::arrange
same as dplyr::arrange
the output directory
The number of chunks in the output. Defaults = nchunks.disk.frame(df)
overwrite the out put directory
iris.df = as.disk.frame(iris, nchunks = 2)
# arrange iris.df by specifies and ensure rows with the same specifies are in the same chunk
iris_hard.df = hard_arrange(iris.df, Species)
#> Appending disk.frames:
get_chunk(iris_hard.df, 1)
#> # A tibble: 50 x 5
#> # Groups: Species [1]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 6.3 3.3 6 2.5 virginica
#> 2 5.8 2.7 5.1 1.9 virginica
#> 3 7.1 3 5.9 2.1 virginica
#> 4 6.3 2.9 5.6 1.8 virginica
#> 5 6.5 3 5.8 2.2 virginica
#> 6 7.6 3 6.6 2.1 virginica
#> 7 4.9 2.5 4.5 1.7 virginica
#> 8 7.3 2.9 6.3 1.8 virginica
#> 9 6.7 2.5 5.8 1.8 virginica
#> 10 7.2 3.6 6.1 2.5 virginica
#> # ... with 40 more rows
get_chunk(iris_hard.df, 2)
#> # A tibble: 50 x 5
#> # Groups: Species [1]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 6.3 3.3 6 2.5 virginica
#> 2 5.8 2.7 5.1 1.9 virginica
#> 3 7.1 3 5.9 2.1 virginica
#> 4 6.3 2.9 5.6 1.8 virginica
#> 5 6.5 3 5.8 2.2 virginica
#> 6 7.6 3 6.6 2.1 virginica
#> 7 4.9 2.5 4.5 1.7 virginica
#> 8 7.3 2.9 6.3 1.8 virginica
#> 9 6.7 2.5 5.8 1.8 virginica
#> 10 7.2 3.6 6.1 2.5 virginica
#> # ... with 40 more rows
# clean up cars.df
delete(iris.df)
delete(iris_hard.df)