Someone wrote a blog about {disk.frame} being “epic”

Mr Bruno Rodrigues had kindly written a nice blog post titled ‘{disk.frame} is epic’. In the post, he compared {disk.frame}’s performance vs Spark’s on doing a simple group-by operation. The task is done on a single-machine and hence is not the ideal use-case for Spark. However, in an earlier post, Bruno had shown how Spark can be used for this use case. This is symptomatic of what I call the “premature adoption of Spark syndrome”. It’s not hard to find examples where Spark is used when a more light-weight tool like Dask or {disk.frame} could be more efficient. There is no need to use a sledgehammer to crack a nut. In data science, what a “nut” can represent is getting larger and larger. Ten years ago, a 1GB (in CSV) dataset can be a struggle for R and Python, as many machines back then were still 32bit and hence did not have more than 4GB of RAM in general. Nowadays, 16GB laptop are common place, so any file less than 4GB in size can be considered a “nut”. Furthermore, if you use a tool like {disk.frame} which keeps the data on-disk until you need to process it, then the “nut” you can crack is now substantially larger than 4GB. I would argue, the sledgehammer that is Spark should only be considered for much larger datasets. I don’t want to put a number to it, but the largest dataset I have dealt with is the Fannie Mae dataset which is about 200GB in size in CSV format.

How to make {disk.frame} more ‘epic’?

If you follow this post by Bruno, you will end up with a 30GB CSV file. But this 30GB file was made by row-appending 303 files (size ranging from 70MB to 143MB) together. I will you a few ways to handle such data. As usual, let’s start by setting up {disk.frame}

Load the individuals files

The best way to use {disk.frame} is to load the individual files, as csv_to_disk.frame allows a vector of files to be used as input:

However, there is always a danger with importing from CSVs, especially when importing from multiple CSVs. The chief danger is this: CSV do not encode the type of the columns. Hence the column types need to be inferred. However, when inferring the column type, we only use the first few thousand rows. This is fine most of the time. However, there are times when the column type inferred using the first few thousand rows isn’t compatible with data that comes after. That’s why you will see an error

when running the below code

list_files = list.files(path_to_files, patter="air", full.names = TRUE)
system.time(flights.df <- csv_to_disk.frame(list_files, co))

I need to come up with a framework to help with these types of task, but for now, you should use the colClasses argument whenever possible to set the column types. For example,

list_files = list.files(path_to_files, patter="air", full.names = TRUE)

system.time(a <- csv_to_disk.frame(
   list_files,
   colClasses = list(character = c("WHEELS_OFF", "WHEELS_ON"))
))

which yielded

As you can see the the 303 files are converted to a disk.frame in about ~3mins, this much faster than Spark.

Load one large-file (splitting)

However, in real life you may be given one large CSV and told to work from there. In Linux or MacOS, we can use the split command to split up the file first, and apply the previous strategy. But on Windows, this isn’t trivial unless you have installed some nix tools (e.g. git bash).

Starting from {disk.frame} v0.1.1, we have something to help with this. When you call csv_to_disk.frame with the in_chunk_size argument, then the big file will be split into smaller files using {bigreadr} before reading. This happen automatically.

Splitting up a large file might take a few minutes, but once the file is split, we can use multiple cores to process the split-up files in parallel! The performance gain there might offset the overhead of splitting a file.

As mentioned, the key is setting argument in_chunk_size to a value, which will activate the file splitting

which yielded

As you can see, this takes longer than loading the individual files approach. I am working on a guess the in_chunk_size feature so that in_chunk_size="guess" will work in the future, so the user doesn’t have to input a manual in_chunk_size.

Load one large-file (no splitting)

Split up the file will effectively double the amount of disk space needed. We have the original file and we need additional space equal to the size of the original file to store the split files. This is not always feasible and that’s why you can specify the LaF backend which was shown in Bruno’s post but repeated here for completeness. The problem with a LaF backend and reading one big file is that it can NOT be parallelized and hence it is much slower than the other approaches.

there were multiple warnings

and the timings were

As can be seen LaF works and can detect the column types automatically. However it is slower than the default backend which is data.table.

Summary

Here is a chart summarizing the performance of the various approaches for reading large CSVs on my computer

Can {disk.frame} be even more “epic”?

Well yes! We can actually speed up the group-by operation that Bruno did by using srckeep. The use of srckeep can’t be emphasized enough! It works by reading from disk only the columns needed for the analysis, and hence disk IO time is (drastically) reduced!

Compare the above the to timing without srckeep

On my computer, with srckeep the timing is only ~3 seconds and without it is 16 seconds!

Conclusion

So there you go! {disk.frame} can be even more “epic”! Here are the two main take-aways

  1. Load CSV files as many individual files if possible to take advantage of multi-core parallelism
  2. srckeep is your friend! Disk IO is often the bottleneck in data manipulation, and you can reduce disk IO by specifying only columns that you will use with srckeep(c(columns1, columns2, ...)).