hard_arrange sorts a disk.frame by the `by` variables and also reorganizes the chunks so that every unique combination of `by` values is in the same chunk. In other words, all rows that share the same `by` value end up in the same chunk.
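The guarantee described above can be illustrated with a small in-memory sketch in base R. This is not how disk.frame implements hard_arrange; the helper `hard_arrange_sketch` and its arguments are hypothetical names introduced only for illustration, and a plain list of data frames stands in for the on-disk chunks.

```r
# Hypothetical helper, NOT part of disk.frame: an in-memory illustration only.
# `chunks` is a list of data frames and `by` is the name of one key column.
hard_arrange_sketch <- function(chunks, by, nchunks = length(chunks)) {
  # Combine all chunks and sort the rows by the key column.
  all_rows <- do.call(rbind, chunks)
  all_rows <- all_rows[order(all_rows[[by]]), , drop = FALSE]

  # Assign each distinct key value to exactly one output chunk, in sort order,
  # so rows that share a key can never be split across chunks.
  # If there are fewer distinct keys than nchunks, fewer chunks are returned.
  keys <- unique(all_rows[[by]])
  key_chunk <- ceiling(seq_along(keys) * nchunks / length(keys))
  chunk_id <- key_chunk[match(all_rows[[by]], keys)]

  unname(split(all_rows, chunk_id))
}

# Two input chunks in which every Species is spread across both chunks.
input_chunks <- list(iris[seq(1, 150, by = 2), ], iris[seq(2, 150, by = 2), ])
sorted_chunks <- hard_arrange_sketch(input_chunks, by = "Species", nchunks = 2)

# List the Species values found in each output chunk.
lapply(sorted_chunks, function(chunk) unique(as.character(chunk$Species)))
```

In this sketch no Species value appears in more than one output chunk, which is the property that distinguishes a hard arrange from sorting each chunk on its own.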

hard_arrange(df, ..., add = FALSE, .drop = FALSE)

# S3 method for data.frame
hard_arrange(df, ...)

# S3 method for disk.frame
hard_arrange(
  df,
  ...,
  outdir = tempfile("tmp_disk_frame_hard_arrange"),
  nchunks = disk.frame::nchunks(df),
  overwrite = TRUE
)

Arguments

df

a disk.frame

...

the variables to sort by (the `by` variables)

add

same as dplyr::arrange

.drop

same as dplyr::arrange

outdir

the output directory

nchunks

The number of chunks in the output. Defaults to disk.frame::nchunks(df)

overwrite

whether to overwrite the output directory

Examples

iris.df = as.disk.frame(iris, nchunks = 2)

# arrange iris.df by Species and ensure rows with the same Species are in the same chunk
iris_hard.df = hard_arrange(iris.df, Species)
#> Appending disk.frames:
get_chunk(iris_hard.df, 1)
#> # A tibble: 50 x 5
#> # Groups:   Species [1]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>
#>  1          6.3         3.3          6           2.5 virginica
#>  2          5.8         2.7          5.1         1.9 virginica
#>  3          7.1         3            5.9         2.1 virginica
#>  4          6.3         2.9          5.6         1.8 virginica
#>  5          6.5         3            5.8         2.2 virginica
#>  6          7.6         3            6.6         2.1 virginica
#>  7          4.9         2.5          4.5         1.7 virginica
#>  8          7.3         2.9          6.3         1.8 virginica
#>  9          6.7         2.5          5.8         1.8 virginica
#> 10          7.2         3.6          6.1         2.5 virginica
#> # ... with 40 more rows
get_chunk(iris_hard.df, 2)
#> # A tibble: 50 x 5
#> # Groups:   Species [1]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>
#>  1          6.3         3.3          6           2.5 virginica
#>  2          5.8         2.7          5.1         1.9 virginica
#>  3          7.1         3            5.9         2.1 virginica
#>  4          6.3         2.9          5.6         1.8 virginica
#>  5          6.5         3            5.8         2.2 virginica
#>  6          7.6         3            6.6         2.1 virginica
#>  7          4.9         2.5          4.5         1.7 virginica
#>  8          7.3         2.9          6.3         1.8 virginica
#>  9          6.7         2.5          5.8         1.8 virginica
#> 10          7.2         3.6          6.1         2.5 virginica
#> # ... with 40 more rows
# clean up iris.df and iris_hard.df
delete(iris.df)
delete(iris_hard.df)