R CMD check passes cleanly in future R-devel.
R CMD check passes cleanly on R and R-devel.
Revert to the C version of loop_apply(), as the Rcpp version appears to be having PROTECTion problems. (Also fixes #256)
Update for changes in R namespace best-practices.
New parameter .id to adply() that specifies the name(s) of the index column(s). (Thanks to Kirill Müller, #191)
Fix bug in split_indices() when n isn't supplied.
Fix bug in .id parameter to ldply() and rdply() allowing for .id = NULL to work as described in the help. (Thanks to Doug Mitarotonda, #207, and Marek, #224 and #225)
Deprecate exotic functions liply() and isplit2(), remove unused and unexported functions dots() and parallel_fe(). (Thanks to Kirill Müller, #242, #248)
Warn on duplicate names that cause certain array functions to fail. (Thanks to Kirill Müller, #211)
Parameter .inform is now honored for ?_ply() calls. (Thanks to Kirill Müller, #209)
New parameter .id to ldply() and rdply() that specifies the name of the index column. (Thanks to Kirill Müller, #107, #140, #142)
The .id column in ldply() is generated as a factor to preserve the sort order, but only if the new .id parameter is set. (Thanks to Kirill Müller, #137)
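A minimal sketch of the .id behaviour described above, assuming plyr is installed. The input list and the avg helper are hypothetical and only serve the illustration:

```r
library(plyr)

# Hypothetical input: a named list of numeric vectors.
scores <- list(a = 1:3, b = 4:6)
avg <- function(x) data.frame(mean = mean(x))

# Name the index column explicitly; when .id is set, the column is a factor.
named <- ldply(scores, avg, .id = "group")

# .id = NULL suppresses the index column entirely.
bare <- ldply(scores, avg, .id = NULL)
```

Leaving .id unset keeps the older behaviour of a character column called .id.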
rbind.fill now silently drops NULL inputs (#138)
rbind.fill avoids array copying which had produced quadratic time complexity. *dply of large numbers of groups should be faster. (Contributed by Peter Meilstrup)
rbind.fill handles non-numeric matrix columns (i.e. factor arrays, character arrays, list arrays); also arrays with more than 2 dimensions can be used. Dimnames of array columns are now preserved. (Contributed by Peter Meilstrup)
rbind.fill(x, y) converts factor columns of y to character when columns of x are character. join(x, y) and match_df(x, y) now work when the key column in x is character and y is factor. (Contributed by Peter Meilstrup)
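The rbind.fill changes above can be sketched as follows, assuming plyr is installed. The data frames x and y are hypothetical:

```r
library(plyr)

# x has a character key; y has a factor key plus an extra column.
x <- data.frame(key = c("a", "b"), stringsAsFactors = FALSE)
y <- data.frame(key = factor("c"), value = 1)

# The NULL input is dropped silently, the factor key from y is converted
# to character, and the missing value column in x is filled with NA.
combined <- rbind.fill(x, NULL, y)
```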
Fix faulty array allocation which caused problems when using split_indices with large (> 2^24) vectors. (Fixes #131)
list_to_array() incorrectly determined dimensions if column of labels contained any missing values (#169).
r*ply expression is evaluated exactly .n times, evaluation results are consistent with side effects. (#158, thanks to Kirill Müller)
**ply gain a .inform argument (previously only available in llply) - this gives more useful debugging information at the cost of some speed. (Thanks to Brian Diggs, #57)
if .dims = TRUE alply's output gains dimensions and dimnames, similar to apply. Sequential indexing of a list produced by alply should be unaffected. (Peter Meilstrup)
colwise, numcolwise and catcolwise now all accept additional arguments in .... (Thanks to Stavros Macrakis, #62)
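As a small illustration of passing additional arguments through numcolwise, assuming plyr is installed (the data frame is hypothetical):

```r
library(plyr)

# Hypothetical data with missing values and one non-numeric column.
df <- data.frame(
  x = c(1, NA, 3),
  y = c(10, 20, NA),
  label = c("a", "b", "c")
)

# na.rm = TRUE is forwarded to mean() for every numeric column.
col_means <- numcolwise(mean, na.rm = TRUE)
res <- col_means(df)
```

The result is a one-row data frame containing only the numeric columns.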
here makes it possible to use **ply + a function that uses non-standard evaluation (e.g. summarise, mutate, subset, arrange) inside a function. (Thanks to Peter Meilstrup, #3)
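A sketch of the situation here is meant for, assuming plyr is installed. The wrapper name add_label and the data frame are hypothetical:

```r
library(plyr)

df <- data.frame(a = rep(c("a", "b"), each = 10), b = 1:20)

# mutate evaluates its arguments non-standardly, so on its own it would not
# see the local variable `label`. Wrapping it in here() makes the calling
# function's environment visible.
add_label <- function(label) {
  ddply(df, "a", here(mutate), label = paste(label, b))
}

res <- add_label("name:")
```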
join_all recursively joins a list of data frames. (Fixes #29)
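For example, three hypothetical tables sharing an id column could be combined like this (assuming plyr is installed):

```r
library(plyr)

people <- data.frame(id = 1:3, name = c("Ann", "Bob", "Cy"))
ages   <- data.frame(id = 1:3, age = c(31, 45, 27))
cities <- data.frame(id = c(1L, 3L), city = c("Oslo", "Lima"))

# Joins are applied left to right; the default is a left join, so id 2
# is kept and gets NA for city.
merged <- join_all(list(people, ages, cities), by = "id")
```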
name_rows provides a convenient way of saving and then restoring row names so that you can preserve them if you need to. (#61)
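A short sketch of the save and restore round trip, using the built-in mtcars data and assuming plyr is installed:

```r
library(plyr)

# First call: row names are moved into a .rownames column.
tagged <- name_rows(mtcars)

# arrange() discards row names, but the .rownames column travels along.
sorted <- arrange(tagged, mpg)

# Second call: row names are restored and the .rownames column is removed.
restored <- name_rows(sorted)
```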
progress_time (used with .progress = "time") estimates the amount of time remaining before the job is completed. (Thanks to Mike Lawrence, #78)
summarise now works iteratively so that later columns can refer to earlier. (Thanks to Jim Hester, #44)
take makes it easy to subset along an arbitrary dimension.
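To illustrate take, here is a sketch on a hypothetical 3 x 4 x 5 array, assuming plyr is installed:

```r
library(plyr)

x <- array(seq_len(3 * 4 * 5), c(3, 4, 5))

# Keep index 1 along the third dimension; dimensions are kept by default.
slab <- take(x, 3, 1)

# With drop = TRUE the result collapses to an ordinary 3 x 4 matrix.
mat <- take(x, 3, 1, drop = TRUE)
```

This avoids hand-writing x[, , 1] when the dimension to subset is only known at run time.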
Improved documentation thanks to patches from Tim Bates.
**ply gains a .paropts argument, a list of options that is passed onto foreach for controlling parallel computation.
*_ply now accepts .parallel argument to enable parallel processing. (Fixes #60)
Progress bars are disabled when using parallel plyr (Fixes #32)
a*ply: 25x speedup when indexing array objects, 3x speedup when indexing data frames. This should substantially reduce the overhead of using a*ply.
d*ply subsetting has been considerably optimised: this will have a small impact unless you have a very large number of groups, in which case it will be considerably faster.
idata.frame: Subsetting immutable data frames with [.idf is now faster (Peter Meilstrup)
quickdf is around 20% faster
split_indices, which powers much internal splitting code (like vaggregate, join and d*ply) is about 2x faster. It was already incredibly fast (~0.2s for 1,000,000 obs), so this won't have much impact on overall performance
*aply functions now bind list mode results into a list-array (Peter Meilstrup)
*aply now accepts 0-dimension arrays as inputs. (#88)
count now works correctly for factor and Date inputs. (Fixes #130)
*dply now deals better with matrix results, converting them to data frames, rather than vectors. (Fixes #12)
d*ply will now preserve factor levels input if drop = FALSE (#81)
join works correctly when there are no common rows (Fixes #74), or when one input has no rows (Fixes #48). It also consistently orders the columns: common columns, then x cols, then y cols (Fixes #40).
quickdf correctly handles NA variable names. (Fixes #66. Thanks to Scott Kostyshak)
rbind.fill and rbind.fill.matrix work consistently with matrices and data frames with zero rows. Fixes #79. (Peter Meilstrup)
rbind.fill now stops if inputs are not data frames. (Fixes #51)
rbind.fill now works consistently with 0 column data frames
round_any now works with POSIXct objects, thanks to Jean-Olivier Irisson (#76)
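A brief sketch of round_any on numbers and on a POSIXct value, assuming plyr is installed. The timestamp is a hypothetical example:

```r
library(plyr)

# Numeric input: round to the nearest multiple of 10, or force a direction.
nearest <- round_any(137, 10)
down    <- round_any(137, 10, floor)
up      <- round_any(137, 10, ceiling)

# POSIXct input: accuracy is given in seconds, here 10 minutes.
stamp   <- as.POSIXct("2000-01-01 11:11:11", tz = "UTC")
rounded <- round_any(stamp, 60 * 10)
```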
rbind.fill
: if a column contains both factors and characters (in different
inputs), the resulting column will be coerced to character
When there are more than 2^31 distinct combinations id
, switches to a
slower fallback strategy using strings (inspired by merge
) that guarantees
correct results. This fixes problems with join
when joining across many
columns. (Fixes #63)
split_indices
checks input more aggressively to prevent segfaults.
Fixes #43.
fix small bug in loop_apply which led to segfaults in certain circumstances. (Thanks to Pål Westermark for patch)
itertools and iterators moved to suggests from imports so that plyr now only depends on base R.
documentation improved using new features of roxygen2
fixed namespacing issue which led to lost labels when subsetting the results of *lply
colwise automatically strips off split variables.
rlply now correctly deals with rlply(4, NULL) (thanks to bug report from Eric Goldlust)
rbind.fill tries harder to keep attributes, retaining the attributes from the first occurrence of each column it finds. It also now works with variables of class POSIXlt and preserves the ordered status of factors.
arrange now works with one column data frames
d*ply returns correct number of rows when function returns vector
fix NAMESPACE bug which was causing problems with ggplot2
rbind.fill now treats 1d arrays in the same way as rbind (i.e. it turns them into ordinary vectors)
fix bug in rename when renaming multiple columns
new strip_splits function removes splitting variables from the data frames returned by ddply.
rename moved in from reshape, and rewritten.
new match_df function makes it easy to subset a data frame to only contain values matching another data frame. Inspired by http://stackoverflow.com/questions/4693849.
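A minimal sketch of match_df, assuming plyr is installed. The orders and vip tables are hypothetical:

```r
library(plyr)

orders <- data.frame(
  customer = c("ann", "bob", "cy", "bob"),
  amount   = c(10, 20, 30, 40)
)
vip <- data.frame(customer = "bob")

# Keep only the rows of orders whose customer also appears in vip.
kept <- match_df(orders, vip, on = "customer")
```

Unlike a join, no columns from the second data frame are added to the result.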
**ply now works when passed a list of functions
*dply now correctly names output even when some output combinations are missing (NULL) (Thanks to bug report from Karl Ove Hufthammer)
*dply preserves the class of many more object types.
a*ply now correctly works with zero length margins, operating on the entire object (Thanks to bug report from Stavros Macrakis)
join now implements joins in a more SQL like way, returning all possible matches, not just the first one. It is still (a little) faster than merge. The previous behaviour is accessible with match = "first".
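The difference between the two matching modes can be sketched with a hypothetical pair of tables in which one key appears twice on the right-hand side (assuming plyr is installed):

```r
library(plyr)

x <- data.frame(k = c(1L, 2L), a = c("x1", "x2"))
y <- data.frame(k = c(1L, 1L, 2L), b = c("p", "q", "r"))

# Default: every matching row of y is returned, so key 1 appears twice.
all_matches <- join(x, y, by = "k")

# match = "first" keeps only the first matching row of y per key.
first_only <- join(x, y, by = "k", match = "first")
```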
join is now more symmetric so that join(x, y, "left") is closer to join(y, x, "right"), modulo column ordering
named.quoted failed when quoted expressions were longer than 50 characters. (Thanks to bug report from Eric Goldlust)
rbind.fill now correctly maintains POSIXct tzone attributes and preserves missing factor levels
split_labels correctly preserves empty factor levels, which means that drop = FALSE should work in more places. Use base::droplevels to remove levels that don't occur in the data, and drop = T to remove combinations of levels that don't occur.
vaggregate now passes ... to the aggregation function when working out the output type (thanks to bug report by Pavan Racherla)
count now takes an additional parameter wt_var which allows you to compute weighted sums. This is as fast, or faster than, tapply or xtabs.
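A small sketch of weighted counting with wt_var, assuming plyr is installed. The data frame and its column names are hypothetical:

```r
library(plyr)

df <- data.frame(g = c("a", "a", "b"), w = c(2, 3, 5))

# Plain counts: number of rows per group.
n_rows <- count(df, "g")

# Weighted counts: sum of w per group, reported in the freq column.
weighted <- count(df, "g", wt_var = "w")
```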
Really fix bug in names.quoted.
The . function now captures the environment in which it was evaluated. This should fix an esoteric class of bugs which no-one probably ever encountered, but will form the basis for an improved version of ggplot2::aes.
Fix bug in names.quoted that interfered with ggplot2.
New function mutate that works like transform to add new columns or overwrite existing columns, but computes new columns iteratively so later transformations can use columns created by earlier transformations. (It's also about 10x faster) (Fixes #21)
split column names are no longer coerced to valid R names.
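The iterative behaviour of mutate can be illustrated with a hypothetical one-column data frame (assuming plyr is installed):

```r
library(plyr)

df <- data.frame(a = 1:3)

# c refers to b, which is created earlier in the same call.
# base::transform() would fail here because b does not exist yet.
res <- mutate(df, b = a * 2, c = b + 1)
```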
quickdf now adds names if missing
summarise preserves variable names if explicit names not provided (Fixes #17)
arrays with names should be sorted correctly once again (also fixed a bug in the test case that prevented me from catching this automatically)
m_ply no longer possesses .parallel argument (mistakenly added)
ldply (and hence adply and ddply) now correctly passes on .parallel argument (Fixes #16)
id uses a better strategy for converting to integers, making it possible to use for cases with larger potential numbers of combinations
l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that when TRUE, applies functions in parallel using a parallel backend registered with the foreach package:
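library(plyr)
x <- seq_len(20)
# Each call only sleeps for 0.1 s, so 20 serial calls take about 2 s
wait <- function(i) Sys.sleep(0.1)
system.time(llply(x, wait))
# user system elapsed
# 0.007 0.005 2.005
# Register a two-worker foreach backend, then run the same calls in parallel
doParallel::registerDoParallel(2)
system.time(llply(x, wait, .parallel = TRUE))
# user system elapsed
# 0.020 0.011 1.038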
This work has been generously supported by BD (Becton Dickinson).
a*ply and m*ply gain an .expand argument that controls whether data frames produce a single output dimension (one element for each row), or an output dimension for each variable.
new vaggregate (vector aggregate) function, which is equivalent to tapply, but much faster (~ 10x), since it avoids copying the data.
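A minimal sketch comparing vaggregate with tapply on hypothetical data, assuming plyr is installed:

```r
library(plyr)

values <- c(1, 2, 3, 4)
groups <- factor(c("a", "b", "a", "b"))

# One aggregated value per factor level, in level order.
sums <- vaggregate(values, groups, sum)

# The same computation with base R, for comparison.
base_sums <- tapply(values, groups, sum)
```

The speed difference only becomes visible on much larger inputs than this one.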
llply: for simple lists and vectors, with no progress bar, no extra info, and no parallelisation, llply calls lapply directly to avoid all the overhead associated with those unused extra features.
llply: in serial case, for loop replaced with custom C function that takes about 40% less time (or about 20% less time than lapply). Note that as a whole, llply still has much more overhead than lapply.
round_any now lives in plyr instead of reshape
list_to_array works correctly even when there are missing values in the array. This is particularly important for daply.
*dply deals more gracefully with the case when all results are NULL (fixes #10)
*aply correctly orders output regardless of dimension names (fixes #11)
join gains type = "full" which preserves all x and y rows
experimental immutable data frame (idata.frame) that vastly speeds up subsetting - for large datasets with large numbers of groups, this can yield 10-fold speed ups. See examples in ?idata.frame to see how to use it.
rbind.fill rewritten again to increase speed and work with more data types
d*ply now much faster with nested groups
d*ply when .drop = FALSE
a*ply now works correctly with array-lists
r*ply now works with ...