
Brandon Bertelsen asked a very reasonable question in the comments: why use so much data when a sub-sample (the initial set is, after all, a sample :-)) might be adequate for the purposes of creating a cross-tabulation? Not to drift too far into statistics, but the data arises from cases where the TRUE observations are very rare, for both variables. If you don't know whether any of your entries are NA, you could test with any(is.na(yourVector)) - this would be wise before you adopt some of the ideas arising in this Q&A. :) I've been debugging NA and other non-finite values - they've been a headache for me recently. Lest anyone point out that logical vectors in R support NA, and that that would be a wrench in the system for these cross tabulations (which is true in general), I should point out that my vectors come from is.na() or is.finite(). Notice that bigtabulate, even with an extra type conversion (1 * cbind(x, y)), still beats the pants off of table. Not only that, but it's not even compiled or parallelized.
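For anyone following along, a minimal sketch of both points, assuming the two logical vectors x and y from the question's example below; the bigtabulate call follows the package's documented interface but is my reconstruction, not the original timing code:

    library(bigtabulate)  # assumed installed; provides bigtabulate()

    # Check for NA before trusting pure logical arithmetic on the vectors.
    any(is.na(x))  # TRUE if at least one entry of x is NA
    any(is.na(y))

    # bigtabulate() expects a matrix; 1 * cbind(x, y) coerces the logical
    # columns to numeric - the extra type conversion mentioned above - and
    # the call still outruns table(x, y).
    bigtabulate(1 * cbind(x, y), ccols = c(1, 2))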

Vector multiplication is usually a bad idea, but when the vector is sparse, one may get an advantage from storing it as such and then using vector multiplication. Feel free to vary N and p, if that demonstrates anything interesting about the behavior of the tabulation functions. Finally, doing the vanilla logical operations seems like a kludge: it looks at each vector too many times (3X? 7X?), not to mention that it fills a lot of additional memory during processing, which is a massive time waster. bigtabulate seems like overkill for a pair of logical vectors. I've taken a stab at the first three options (see my answer), but I feel that there must be something better and faster.
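To make the comparison concrete, here is a sketch of the vanilla-logical-operations approach, plus an index-intersection stand-in for the sparse-multiplication idea; the function and variable names are illustrative, not from the original post:

    # Vanilla logical operations: each sum() is another full pass over
    # the vectors, which is where the 3X-7X reads and the temporary
    # logical vectors (and their memory churn) come from.
    crosstab_logical <- function(x, y) {
      tt <- sum(x & y)                 # pass 1: TRUE/TRUE cell
      tf <- sum(x & !y)                # pass 2: TRUE/FALSE cell
      ft <- sum(!x & y)                # pass 3: FALSE/TRUE cell
      ff <- length(x) - tt - tf - ft   # remainder saves a fourth pass
      matrix(c(ff, tf, ft, tt), nrow = 2,
             dimnames = list(x = c("FALSE", "TRUE"), y = c("FALSE", "TRUE")))
    }

    # Sparse take on the multiplication idea: at density 0.02 the index
    # sets are tiny relative to N, so once extracted, counting the
    # TRUE/TRUE overlap is cheap. Note which() itself is still a full pass.
    tt_sparse <- length(intersect(which(x), which(y)))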

For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulation? I targeted a density of 0.02, just to make it more interesting (one could use a sparse vector, if that helps, but type conversion can take time). I suspect the answer is to write it in C/C++, but I wonder if there is something in R that is already quite smart about this problem, as it's not uncommon. Example code, for 300M entries (feel free to let N = 1E8 if 3E8 is too big): I chose a total size just under 2.5GB (2.4GB).
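The example code itself didn't survive the page scrape; the following is a minimal reconstruction consistent with the stated N, density, and 2.4GB footprint - the seed and the sample()-based generation are my assumptions:

    # Two logical vectors of length 3E8 with TRUE density ~0.02. R stores
    # logicals at 4 bytes per element, so 2 * 3e8 * 4 bytes = 2.4GB,
    # matching the total size quoted above.
    N <- 3e8      # drop to 1e8 if 3e8 is too big for your machine
    p <- 0.02     # target density of TRUE values
    set.seed(1)   # illustrative seed, not from the original
    x <- sample(c(TRUE, FALSE), N, replace = TRUE, prob = c(p, 1 - p))
    y <- sample(c(TRUE, FALSE), N, replace = TRUE, prob = c(p, 1 - p))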
