
Brandon Bertelsen asked a very reasonable question in the comments: why use so much data when a sub-sample (the initial set is, after all, a sample :-)) might be adequate for the purposes of creating a cross-tabulation? Not to drift too far into statistics, but the data arises from cases where the TRUE observations are very rare, for both variables. If you don't know whether any of your entries are NA, you could test with any(is.na(yourVector)) - this would be wise before you adopt some of the ideas arising in this Q&A. :) I've been debugging NA and other non-finite values - they've been a headache for me recently. Lest anyone point out that logical vectors in R support NA, and that that would be a wrench in the system for these cross tabulations (which is true in general), I should point out that my vectors come from is.na() or is.finite(). Notice that bigtabulate, even with an extra type conversion (1 * cbind(x, y)), still beats the pants off of table. Not only that, but it's not even compiled or parallelized.
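For anyone following along, a minimal sketch of both points, assuming the two logical vectors x and y from the question's example below; the bigtabulate call follows the package's documented interface but is my reconstruction, not the original timing code:

    library(bigtabulate)  # assumed installed; provides bigtabulate()

    # Check for NA before trusting pure logical arithmetic on the vectors.
    any(is.na(x))  # TRUE if at least one entry of x is NA
    any(is.na(y))

    # bigtabulate() expects a matrix; 1 * cbind(x, y) coerces the logical
    # columns to numeric - the extra type conversion mentioned above - and
    # the call still outruns table(x, y).
    bigtabulate(1 * cbind(x, y), ccols = c(1, 2))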

Vector multiplication is usually a bad idea, but when the vector is sparse, one may get an advantage from storing it as such and then using vector multiplication. Feel free to vary N and p, if that demonstrates anything interesting about the behavior of the tabulation functions. Finally, doing the vanilla logical operations seems like a kludge: it looks at each vector too many times (3X? 7X?), not to mention that it fills a lot of additional memory during processing, which is a massive time waster. bigtabulate seems like overkill for a pair of logical vectors. I've taken a stab at the first three options (see my answer), but I feel that there must be something better and faster.
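To make the comparison concrete, here is a sketch of the vanilla-logical-operations approach, plus an index-intersection stand-in for the sparse-multiplication idea; the function and variable names are illustrative, not from the original post:

    # Vanilla logical operations: each sum() is another full pass over
    # the vectors, which is where the 3X-7X reads and the temporary
    # logical vectors (and their memory churn) come from.
    crosstab_logical <- function(x, y) {
      tt <- sum(x & y)                 # pass 1: TRUE/TRUE cell
      tf <- sum(x & !y)                # pass 2: TRUE/FALSE cell
      ft <- sum(!x & y)                # pass 3: FALSE/TRUE cell
      ff <- length(x) - tt - tf - ft   # remainder saves a fourth pass
      matrix(c(ff, tf, ft, tt), nrow = 2,
             dimnames = list(x = c("FALSE", "TRUE"), y = c("FALSE", "TRUE")))
    }

    # Sparse take on the multiplication idea: at density 0.02 the index
    # sets are tiny relative to N, so once extracted, counting the
    # TRUE/TRUE overlap is cheap. Note which() itself is still a full pass.
    tt_sparse <- length(intersect(which(x), which(y)))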

For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulation? I targeted a density of 0.02, just to make it more interesting (one could use a sparse vector, if that helps, but type conversion can take time). I suspect the answer is to write it in C/C++, but I wonder if there is something in R that is already quite smart about this problem, as it's not uncommon. Example code, for 300M entries (feel free to let N = 1E8 if 3E8 is too big): I chose a total size just under 2.5GB (2.4GB).
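The example code itself didn't survive the page scrape; the following is a minimal reconstruction consistent with the stated N, density, and 2.4GB footprint - the seed and the sample()-based generation are my assumptions:

    # Two logical vectors of length 3E8 with TRUE density ~0.02. R stores
    # logicals at 4 bytes per element, so 2 * 3e8 * 4 bytes = 2.4GB,
    # matching the total size quoted above.
    N <- 3e8      # drop to 1e8 if 3e8 is too big for your machine
    p <- 0.02     # target density of TRUE values
    set.seed(1)   # illustrative seed, not from the original
    x <- sample(c(TRUE, FALSE), N, replace = TRUE, prob = c(p, 1 - p))
    y <- sample(c(TRUE, FALSE), N, replace = TRUE, prob = c(p, 1 - p))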
