Package 'classifly' reference manual

Title:	Explore Classification Models in High Dimensions
Description:	Given $p$-dimensional training data containing $d$ groups (the design space), a classification algorithm (classifier) predicts which group new data belongs to. Generally the input to these algorithms is high dimensional, and the boundaries between groups will be high dimensional and perhaps curvilinear or multi-faceted. This package implements methods for understanding the division of space between the groups.
Authors:	Hadley Wickham <[email protected]>
Maintainer:	Hadley Wickham <[email protected]>
License:	MIT + file LICENSE
Version:	0.4.1.9000
Built:	2025-03-04 02:45:53 UTC
Source:	https://github.com/hadley/classifly

Calculate the advantage the most likely class has over the next most likely.

Description

This is used to identify the boundaries between classification regions. Points with low (close to 0) advantage are likely to be near boundaries.

Usage

advantage(post)
advantage(post)

Arguments

post

matrix of posterior probabilities

Classifly provides a convenient method to fit a classification function and then explore the results in the original high dimensional space.

Description

This is a convenient function to fit a classification function and then explore the results using GGobi. You can also do this in two separate steps using the classification function and then explore.

Usage

classifly(
  data,
  model,
  classifier,
  ...,
  n = 10000,
  method = "nonaligned",
  type = "range"
)
classifly(
  data,
  model,
  classifier,
  ...,
  n = 10000,
  method = "nonaligned",
  type = "range"
)

Arguments

`data`	Data set use for classification
`model`	Classification formula, usually of the form `response ~ predictors`
`classifier`	Function to use for the classification, eg. `lda`
`...`	Other arguments passed to classification function. For example. if you use `svm` you need to use `probabiltiy = TRUE` so that posterior probabilities can be retrieved.
`n`	Number of points to simulate. To maintain the illusion of a filled solid this needs to increase with dimension. 10,000 points seems adequate for up to four of five dimensions, but if you have more predictors than that, you will need to increase this number.
`method`	method to simulate points: grid, random or nonaligned (default). See `simvar` for more details on the methods used.
`type`	type of scaling to apply to data. Defaults to commmon range. See `rescaler` for more details.

Details

By default in GGobi, points that are not on the boundary (ie. that have an advantage greater than the 5 to brush mode and choose include shadowed points from the brush menu on the plot window. You can then brush them yourself to explore how the certainty of classification varies throughout the space

Special notes:

You should make sure the response variable is a factor
For SVM, make sure to include probability = TRUE in the arguments to classifly

Examples

data(kyphosis, package = "rpart")
library(MASS)
classifly(kyphosis, Kyphosis ~ . , lda)
classifly(kyphosis, Kyphosis ~ . , qda)
classifly(kyphosis, Kyphosis ~ . , glm, family="binomial")
classifly(kyphosis, Kyphosis ~ . , knnf, k=3)

library(rpart)
classifly(kyphosis, Kyphosis ~ . , rpart)


if (require("e1071")) {
classifly(kyphosis, Kyphosis ~ . , svm, probability=TRUE)
classifly(kyphosis, Kyphosis ~ . , svm, probability=TRUE, kernel="linear")
classifly(kyphosis, Kyphosis ~ . , best.svm, probability=TRUE,
   kernel="linear")

# Also can use explore directorly
bsvm <- best.svm(Species~., data = iris, gamma = 2^(-1:1),
  cost = 2^(2:+ 4), probability=TRUE)
explore(bsvm, iris)
}

data(kyphosis, package = "rpart")
library(MASS)
classifly(kyphosis, Kyphosis ~ . , lda)
classifly(kyphosis, Kyphosis ~ . , qda)
classifly(kyphosis, Kyphosis ~ . , glm, family="binomial")
classifly(kyphosis, Kyphosis ~ . , knnf, k=3)

library(rpart)
classifly(kyphosis, Kyphosis ~ . , rpart)


if (require("e1071")) {
classifly(kyphosis, Kyphosis ~ . , svm, probability=TRUE)
classifly(kyphosis, Kyphosis ~ . , svm, probability=TRUE, kernel="linear")
classifly(kyphosis, Kyphosis ~ . , best.svm, probability=TRUE,
   kernel="linear")

# Also can use explore directorly
bsvm <- best.svm(Species~., data = iris, gamma = 2^(-1:1),
  cost = 2^(2:+ 4), probability=TRUE)
explore(bsvm, iris)
}

Default method for exploring objects

Description

The default method currently works for classification functions.

Usage

explore(model, data, n = 10000, method = "nonaligned", advantage = TRUE, ...)
explore(model, data, n = 10000, method = "nonaligned", advantage = TRUE, ...)

Arguments

`model`	classification object
`data`	data set used with classifier
`n`	number of points to generate when searching for boundaries
`method`	method to generate points, see `generate_data`
`advantage`	only display boundaries
`...`	other arguments not currently used

Details

It generates a data set filling the design space, finds class boundaries (if desired) and then displays in a new ggobi instance.

Value

A invisible data frame of class classifly that contains all the simulated and true data. This can be saved and then printed later to open with rggobi.

Examples

if (require("e1071")) {
bsvm <- best.svm(Species~., data = iris, gamma = 2^(-1:1),
  cost = 2^(2:+ 4), probability=TRUE)
explore(bsvm, iris)
}
if (require("e1071")) {
bsvm <- best.svm(Species~., data = iris, gamma = 2^(-1:1),
  cost = 2^(2:+ 4), probability=TRUE)
explore(bsvm, iris)
}

Generate classification data.

Description

Given a model, this function generates points within the range of the data, classifies them, and attempts to locate boundaries by looking at advantage.

Usage

generate_classification_data(model, data, n, method, advantage)
generate_classification_data(model, data, n, method, advantage)

Arguments

`model`	classification model
`data`	data set used in model
`n`	number of points to generate
`method`	method to use, currently either grid (an evenly spaced grid), random (uniform random distribution across cube), or nonaligned (grid + some random peturbationb)
`advantage`	if `TRUE`, compute advantage, otherwise don't

Details

If posterior probabilities of classification are available, then the advantage will be calculated directly. If not, knn is used calculate the advantage based on the number of neighbouring points that share the same classification. Because knn is $O(n^2)$ this method is rather slow for large (>20,000 say) data sets.

By default, the boundary points are identified as those below the 5th-percentile for advantage.

Value

data.frame of classified data

Generate new data from a data frame.

Description

This method generates new data that fills the range of the supplied datasets.

Usage

generate_data(data, n = 10000, method = "grid")
generate_data(data, n = 10000, method = "grid")

Arguments

`data`	data frame
`n`	desired number of new observations
`method`	method to use, see `simvar`

A wrapper function for `knn` to allow use with classifly.

Description

A wrapper function for knn to allow use with classifly.

Usage

knnf(formula, data, k = 2)
knnf(formula, data, k = 2)

Arguments

`formula`	classification formula
`data`	training data set
`k`	number of neighbours to use

Olives

Description

The olive oil data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive oils. There are 9 collection areas, 4 from southern Italy (North and South Apulia, Calabria, Sicily), two from Sardinia (Inland and Coastal) and 3 from northern Italy (Umbria, East and West Liguria).

Format

A data frame with 244 rows and 7 variables

References

Forina, M. and Armanino, C. and Lanteri, S. and Tiscornia, E., Classification of olive oils from their fatty acid composition, 1983, in Food Research and Data Analysis, edited by Martens, H. and Russwurm Jr, H, pages 189-214.

Extract posterior group probabilities

Description

Every classification method seems to provide a slighly different way of retrieving the posterior probability of group membership. This function provides a common interface to all of them

Usage

posterior(model, data)
posterior(model, data)

Arguments

`model`	model object
`data`	data set used in model

Simulate observations from a vector

Description

Given a vector of data this function will simulate data that could have come from that vector.

Usage

simvar(x, n = 10, method = "grid")
simvar(x, n = 10, method = "grid")

Arguments

`x`	data vector
`n`	desired number of points (will not always be achieved)
`method`	grid simulation method. See details.

Details

There are three methods to choose from:

nonaligned (default): grid + some random peturbation
grid: grid of evenly spaced observations. If a factor, all levels in a factor will be used, regardless of n
random: a random uniform sample from the range of the variable

Extract predictor and response variables for a model object.

Description

Due to the way that most model objects are stored, you also need to supply the data set you used with the original data set. It currently doesn't support models fitted without using a data argument.

Usage

variables(model)
variables(model)

Arguments

model

model object

Value

list containing response and predictor variables

Package 'classifly'

Help Index

Calculate the advantage the most likely class has over the next most likely.

Description

Usage

Arguments

Classifly provides a convenient method to fit a classification function and then explore the results in the original high dimensional space.

Description

Usage

Arguments

Details

See Also

Examples

Default method for exploring objects

Description

Usage

Arguments

Details

Value

See Also

Examples

Generate classification data.

Description

Usage

Arguments

Details

Value

Generate new data from a data frame.

Description

Usage

Arguments

A wrapper function for knn to allow use with classifly.

Description

Usage

Arguments

Olives

Description

Format

References

Extract posterior group probabilities

Description

Usage

Arguments

Simulate observations from a vector

Description

Usage

Arguments

Details

Extract predictor and response variables for a model object.

Description

Usage

Arguments

Value

A wrapper function for `knn` to allow use with classifly.