Title: | Comparison of Bioregionalisation Methods |
---|---|
Description: | The main purpose of this package is to propose a transparent methodological framework to compare bioregionalisation methods based on hierarchical and non-hierarchical clustering algorithms (Kreft & Jetz (2010) <doi:10.1111/j.1365-2699.2010.02375.x>) and network algorithms (Lenormand et al. (2019) <doi:10.1002/ece3.4718> and Leroy et al. (2019) <doi:10.1111/jbi.13674>). |
Authors: | Maxime Lenormand [aut, cre] |
Maintainer: | Maxime Lenormand <[email protected]> |
License: | GPL-3 |
Version: | 1.2.0 |
Built: | 2025-01-31 17:22:24 UTC |
Source: | https://github.com/biorgeo/bioregion |
This function converts dissimilarity results produced by the betapart package (and packages using betapart, such as phyloregion) into a dissimilarity object compatible with the bioregion package. It only converts object types to make them compatible with bioregion; it does not modify the beta-diversity values. This makes it possible to include phylogenetic beta diversity when computing bioregions with bioregion.
betapart_to_bioregion(betapart_result)
betapart_result |
An object produced by the betapart package (e.g.,
using the |
A dissimilarity object of class bioregion.pairwise.metric, compatible with the bioregion package.
Boris Leroy ([email protected])
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
comat <- matrix(sample(0:1000, size = 50, replace = TRUE, prob = 1 / 1:1001), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)
## Not run:
beta_div <- betapart::beta.pair.abund(comat)
betapart_to_bioregion(beta_div)
## End(Not run)
This function calculates the number of sites per bioregion, as well as the number of species these sites have, the number of endemic species, and the proportion of endemism.
bioregion_metrics(bioregionalization, comat, map = NULL, col_bioregion = NULL)
bioregionalization |
A |
comat |
A co-occurrence |
map |
A spatial |
col_bioregion |
An |
Endemic species are species found only in the sites belonging to one bioregion.
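The definition above can be illustrated with a minimal base-R sketch. It assumes a toy presence/absence matrix and a hypothetical vector `bioregion` assigning each site to a bioregion; the output column names are illustrative only, and this is not the package's internal implementation.

```r
# Toy presence/absence matrix: 4 sites x 5 species (illustrative data)
comat <- matrix(c(1, 1, 0, 0, 1,
                  1, 0, 0, 0, 1,
                  0, 0, 1, 1, 1,
                  0, 0, 1, 0, 0),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("Site", 1:4), paste0("Species", 1:5)))

# Hypothetical assignment of sites to bioregions
bioregion <- c(Site1 = "A", Site2 = "A", Site3 = "B", Site4 = "B")

# Species-by-bioregion presence: TRUE if the species occurs in at least
# one site of the bioregion
pres <- sapply(split(seq_len(nrow(comat)), bioregion),
               function(idx) colSums(comat[idx, , drop = FALSE]) > 0)

# A species is endemic if it is present in exactly one bioregion
endemic <- rowSums(pres) == 1

# Illustrative summary table (column names are assumptions)
metrics <- data.frame(
  bioregion = colnames(pres),
  sites     = as.vector(table(bioregion)[colnames(pres)]),
  richness  = colSums(pres),
  endemics  = colSums(pres & endemic)
)
metrics$prop_endemics <- metrics$endemics / metrics$richness
metrics
```

In this toy case, Species5 occurs in both bioregions and is therefore not endemic, while the four other species are each endemic to one bioregion.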
A data.frame with 5 columns, or 6 if spatial coherence is computed.
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
Maxime Lenormand ([email protected])
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a5_3_summary_metrics.html.
Associated functions: site_species_metrics bioregionalization_metrics
comat <- matrix(sample(1000, 50), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)
net <- similarity(comat, metric = "Simpson")
clust <- netclu_louvain(net)
bioregion_metrics(bioregionalization = clust, comat = comat)
This function calculates metrics for one or several bioregionalizations, typically based on outputs from netclu_, hclu_, or nhclu_ functions. Some metrics may require users to provide either a similarity or dissimilarity matrix, or the initial species-site table.
bioregionalization_metrics( bioregionalization, dissimilarity = NULL, dissimilarity_index = NULL, net = NULL, site_col = 1, species_col = 2, eval_metric = "all" )
bioregionalization |
A |
dissimilarity |
A |
dissimilarity_index |
A |
net |
The site-species network (i.e., bipartite network). Should be
provided as a |
site_col |
The name or index of the column representing site nodes
(i.e., primary nodes). Should be provided if |
species_col |
The name or index of the column representing species nodes
(i.e., feature nodes). Should be provided if |
eval_metric |
A |
Evaluation metrics:
pc_distance: This metric, as used by Holt et al. (2013), is the ratio of the between-cluster sum of dissimilarities (beta-diversity) to the total sum of dissimilarities for the full dissimilarity matrix. It is calculated in two steps:
Compute the total sum of dissimilarities by summing all elements of the dissimilarity matrix.
Compute the between-cluster sum of dissimilarities by setting within-cluster dissimilarities to zero and summing the matrix.
The pc_distance ratio is obtained by dividing the between-cluster sum of dissimilarities by the total sum of dissimilarities.
anosim: This metric is the statistic used in the Analysis of Similarities, as described in Castro-Insua et al. (2018). It compares between-cluster and within-cluster dissimilarities. The statistic is computed as:
R = (r_B - r_W) / (N (N-1) / 4),
where r_B and r_W are the average ranks of between-cluster and within-cluster dissimilarities, respectively, and N is the total number of sites.
Note: This function does not estimate significance; for significance testing, use vegan::anosim().
avg_endemism: This metric is the average percentage of endemism in clusters, as recommended by Kreft & Jetz (2010). It is calculated as:
End_mean = sum_i (E_i / S_i) / K,
where E_i is the number of endemic species in cluster i, S_i is the number of species in cluster i, and K is the total number of clusters.
tot_endemism: This metric is the total endemism across all clusters, as recommended by Kreft & Jetz (2010). It is calculated as:
End_tot = E / C,
where E is the total number of endemic species (i.e., species found in only one cluster) and C is the number of non-endemic species.
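The pc_distance and anosim definitions above can be sketched in base R as follows. The dissimilarity matrix `d` and the membership vector `clusters` are illustrative assumptions, and the code is a plain transcription of the formulas rather than the package's own implementation.

```r
# Toy symmetric dissimilarity matrix for 4 sites (illustrative values)
d <- matrix(c(0.0, 0.1, 0.8, 0.9,
              0.1, 0.0, 0.7, 0.6,
              0.8, 0.7, 0.0, 0.2,
              0.9, 0.6, 0.2, 0.0),
            nrow = 4, byrow = TRUE)
clusters <- c(1, 1, 2, 2)  # hypothetical cluster membership per site

# TRUE where both sites of a pair belong to the same cluster
same_cluster <- outer(clusters, clusters, "==")

# pc_distance: between-cluster sum over total sum of dissimilarities
d_between <- d
d_between[same_cluster] <- 0  # zero out within-cluster dissimilarities
pc_distance <- sum(d_between) / sum(d)

# ANOSIM R: rank the pairwise dissimilarities (each pair counted once)
lt <- lower.tri(d)
ranks <- rank(d[lt])
within <- same_cluster[lt]
N <- nrow(d)
r_B <- mean(ranks[!within])  # average rank of between-cluster pairs
r_W <- mean(ranks[within])   # average rank of within-cluster pairs
R <- (r_B - r_W) / (N * (N - 1) / 4)

pc_distance
R
```

In this toy case every between-cluster dissimilarity exceeds every within-cluster one, so R equals 1, its maximum.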
A list of class bioregion.bioregionalization.metrics with two to three elements:
args: Input arguments.
evaluation_df: A data.frame containing the eval_metric values for all explored numbers of clusters.
endemism_results: If endemism calculations are requested, a list with the endemism results for each bioregionalization.
Boris Leroy ([email protected])
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
Castro-Insua A, Gómez-Rodríguez C & Baselga A (2018) Dissimilarity measures affected by richness differences yield biased delimitations of biogeographic realms. Nature Communications 9, 9-11.
Holt BG, Lessard J, Borregaard MK, Fritz SA, Araújo MB, Dimitrov D, Fabre P, Graham CH, Graves GR, Jønsson Ka, Nogués-Bravo D, Wang Z, Whittaker RJ, Fjeldså J & Rahbek C (2013) An update of Wallace's zoogeographic regions of the world. Science 339, 74-78.
Kreft H & Jetz W (2010) A framework for delineating biogeographical regions based on species distributions. Journal of Biogeography 37, 2029-2053.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_1_hierarchical_clustering.html#optimaln.
Associated functions: compare_bioregionalizations find_optimal_n
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001), 20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)
comnet <- mat_to_net(comat)
dissim <- dissimilarity(comat, metric = "all")
# User-defined number of clusters
tree1 <- hclu_hierarclust(dissim, n_clust = 10:15, index = "Simpson")
tree1
a <- bioregionalization_metrics(tree1, dissimilarity = dissim, net = comnet,
                                site_col = "Node1", species_col = "Node2",
                                eval_metric = c("tot_endemism", "avg_endemism",
                                                "pc_distance", "anosim"))
a
This function computes pairwise comparisons for several bioregionalizations, usually outputs from netclu_, hclu_, or nhclu_ functions. It also provides the confusion matrix from pairwise comparisons, enabling the user to compute additional comparison metrics.
compare_bioregionalizations( bioregionalizations, indices = c("rand", "jaccard"), cor_frequency = FALSE, store_pairwise_membership = TRUE, store_confusion_matrix = TRUE )
bioregionalizations |
A |
indices |
|
cor_frequency |
A |
store_pairwise_membership |
A |
store_confusion_matrix |
A |
This function operates in two main steps:
Within each bioregionalization, the function compares all pairs of items and documents whether they are clustered together (TRUE) or separately (FALSE). For example, if site 1 and site 2 are clustered in the same cluster in bioregionalization 1, their pairwise membership site1_site2 will be TRUE. This output is stored in the pairwise_membership slot if store_pairwise_membership = TRUE.
Across all bioregionalizations, the function compares their pairwise memberships to determine similarity. For each pair of bioregionalizations, it computes a confusion matrix with the following elements:
a: Number of item pairs grouped in both bioregionalizations.
b: Number of item pairs grouped in the first but not in the second bioregionalization.
c: Number of item pairs grouped in the second but not in the first bioregionalization.
d: Number of item pairs not grouped in either bioregionalization.
The confusion matrix is stored in confusion_matrix if store_confusion_matrix = TRUE.
Based on these confusion matrices, various indices can be computed to measure agreement among bioregionalizations. The currently implemented indices are:
Rand index: (a + d) / (a + b + c + d)
Measures agreement by considering both grouped and ungrouped item pairs.
Jaccard index: a / (a + b + c)
Measures agreement based only on grouped item pairs.
These indices are complementary: the Jaccard index evaluates clustering similarity, while the Rand index considers both clustering and separation. For example, if two bioregionalizations never group the same pairs, their Jaccard index will be 0, but their Rand index may be > 0 due to ungrouped pairs.
Users can compute additional indices manually using the list of confusion matrices.
To identify which bioregionalization is most representative of the others, the function can compute the correlation between the pairwise membership of each bioregionalization and the total frequency of pairwise membership across all bioregionalizations. This is enabled by setting cor_frequency = TRUE.
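The two-step procedure and the indices described above can be sketched in base R. The two membership vectors `part1` and `part2` are hypothetical, and the code follows the formulas given in this section rather than the package's internal implementation.

```r
# Two hypothetical bioregionalizations of the same 5 sites
part1 <- c(1, 1, 2, 2, 3)
part2 <- c(1, 1, 1, 2, 2)

# Step 1: pairwise membership, TRUE when a pair of sites shares a cluster
site_pairs <- combn(length(part1), 2)
pw1 <- part1[site_pairs[1, ]] == part1[site_pairs[2, ]]
pw2 <- part2[site_pairs[1, ]] == part2[site_pairs[2, ]]

# Step 2: confusion matrix elements for this pair of bioregionalizations
conf <- c(a = sum(pw1 & pw2),    # grouped in both
          b = sum(pw1 & !pw2),   # grouped in the first only
          c = sum(!pw1 & pw2),   # grouped in the second only
          d = sum(!pw1 & !pw2))  # grouped in neither

rand    <- unname((conf["a"] + conf["d"]) / sum(conf))
jaccard <- unname(conf["a"] / (conf["a"] + conf["b"] + conf["c"]))

conf
rand     # 0.6
jaccard  # 0.2
```

Here only one pair of sites is grouped in both bioregionalizations, so the Jaccard index is low, while the five pairs separated in both raise the Rand index.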
A list containing 4 to 7 elements:
args: A list of user-provided arguments.
inputs: A list containing information on the input bioregionalizations, such as the number of items clustered.
pairwise_membership (optional): If store_pairwise_membership = TRUE, a boolean matrix where TRUE indicates two items are in the same cluster, and FALSE indicates they are not.
freq_item_pw_membership: A numeric vector containing the number of times each item pair is clustered together, corresponding to the sum of rows in pairwise_membership.
bioregionalization_freq_cor (optional): If cor_frequency = TRUE, a numeric vector of correlations between individual bioregionalizations and the total frequency of pairwise membership.
confusion_matrix (optional): If store_confusion_matrix = TRUE, a list of confusion matrices for each pair of bioregionalizations.
bioregionalization_comparison: A data.frame containing comparison results, where the first column indicates the bioregionalizations compared, and the remaining columns contain the requested indices.
Boris Leroy ([email protected])
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a5_2_compare_bioregionalizations.html.
Associated functions: bioregionalization_metrics
# We here compare three different bioregionalizations
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001), 20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)
dissim <- dissimilarity(comat, metric = "Simpson")
bioregion1 <- nhclu_kmeans(dissim, n_clust = 3, index = "Simpson")
net <- similarity(comat, metric = "Simpson")
bioregion2 <- netclu_greedy(net)
bioregion3 <- netclu_walktrap(net)
# Make one single data.frame with the bioregionalizations to compare
compare_df <- merge(bioregion1$clusters, bioregion2$clusters, by = "ID")
compare_df <- merge(compare_df, bioregion3$clusters, by = "ID")
# The first bioregionalization comes from nhclu_kmeans, hence the "Kmeans" label
colnames(compare_df) <- c("Site", "Kmeans", "Greedy", "Walktrap")
rownames(compare_df) <- compare_df$Site
compare_df <- compare_df[, c("Kmeans", "Greedy", "Walktrap")]
# Running the function
compare_bioregionalizations(compare_df)
# Find out which bioregionalizations are most representative
compare_bioregionalizations(compare_df, cor_frequency = TRUE)
This function is designed to work on a hierarchical tree and cut it at user-selected heights. It works with outputs from either hclu_hierarclust or hclust objects. The function allows for cutting the tree based on the chosen number(s) of clusters or specified height(s). Additionally, it includes a procedure to automatically determine the cutting height for the requested number(s) of clusters.
cut_tree( tree, n_clust = NULL, cut_height = NULL, find_h = TRUE, h_max = 1, h_min = 0, dynamic_tree_cut = FALSE, dynamic_method = "tree", dynamic_minClusterSize = 5, dissimilarity = NULL, ... )
tree |
A |
n_clust |
An |
cut_height |
A |
find_h |
A |
h_max |
A |
h_min |
A |
dynamic_tree_cut |
A |
dynamic_method |
A |
dynamic_minClusterSize |
An |
dissimilarity |
Relevant only if |
... |
Additional arguments passed to dynamicTreeCut::cutreeDynamic() to customize the dynamic tree cut method. |
The function supports two main methods for cutting the tree. First, the tree can be cut at a uniform height (specified by cut_height or determined automatically for the requested n_clust). Second, the dynamic tree cut method (Langfelder et al., 2008) can be applied, which adapts to the shape of branches in the tree, cutting at varying heights based on cluster positions.
The dynamic tree cut method has two variants:
The tree-based variant (dynamic_method = "tree") uses a top-down approach, relying solely on the tree and the order of clustered objects.
The hybrid variant (dynamic_method = "hybrid") employs a bottom-up approach, leveraging both the tree and the dissimilarity matrix to identify clusters based on dissimilarity among sites. This approach is useful for detecting outliers within clusters.
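The uniform-height cut can be illustrated with base R's hclust() and cutree(). The sketch below also includes a simple dichotomic search for a height that yields a requested number of clusters; the helper `find_height` is hypothetical and only assumes that such a search is one reasonable way to determine the height automatically, not that it reproduces what cut_tree() does internally.

```r
set.seed(1)
# Toy dissimilarities among 10 sites (illustrative random data)
d <- dist(matrix(runif(40), nrow = 10))
tree <- hclust(d, method = "average")

# Cut for a requested number of clusters
cl_k <- cutree(tree, k = 3)

# Cut at a given height
cl_h <- cutree(tree, h = 0.5)

# Hypothetical helper: dichotomic search for a height giving k clusters
find_height <- function(tree, k, h_min = 0, h_max = max(tree$height),
                        max_iter = 100) {
  for (i in seq_len(max_iter)) {
    h <- (h_min + h_max) / 2
    n <- length(unique(cutree(tree, h = h)))
    if (n == k) return(h)
    # Too many clusters: cut higher; too few: cut lower
    if (n > k) h_min <- h else h_max <- h
  }
  NA_real_  # no height found within max_iter iterations
}

h3 <- find_height(tree, k = 3)
table(cutree(tree, h = h3))
```

Cutting at the height returned by the search gives the same partition as requesting three clusters directly, which is the link between the n_clust and cut_height arguments.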
If tree is an output from hclu_hierarclust(), the same object is returned with updated content (i.e., args and clusters). If tree is an hclust object, a data.frame containing the clusters is returned.
The find_h argument is ignored if dynamic_tree_cut = TRUE, as cutting heights cannot be determined in this case.
Pierre Denelle ([email protected])
Maxime Lenormand ([email protected])
Boris Leroy ([email protected])
Langfelder P, Zhang B & Horvath S (2008) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24, 719-720.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_1_hierarchical_clustering.html.
Associated functions: hclu_hierarclust
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001), 20, 25)
rownames(comat) <- paste0("Site", 1:20)
colnames(comat) <- paste0("Species", 1:25)
simil <- similarity(comat, metric = "all")
dissimilarity <- similarity_to_dissimilarity(simil)
# User-defined number of clusters
tree1 <- hclu_hierarclust(dissimilarity, n_clust = 5)
tree2 <- cut_tree(tree1, cut_height = .05)
tree3 <- cut_tree(tree1, n_clust = c(3, 5, 10))
tree4 <- cut_tree(tree1, cut_height = c(.05, .1, .15, .2, .25))
tree5 <- cut_tree(tree1, n_clust = c(3, 5, 10), find_h = FALSE)
hclust_tree <- tree2$algorithm$final.tree
clusters_2 <- cut_tree(hclust_tree, n_clust = 10)
cluster_dynamic <- cut_tree(tree1, dynamic_tree_cut = TRUE, dissimilarity = dissimilarity)
This function generates a data.frame where each row provides one or several dissimilarity metrics between pairs of sites, based on a co-occurrence matrix with sites as rows and species as columns.
dissimilarity(comat, metric = "Simpson", formula = NULL, method = "prodmat")
comat |
A co-occurrence |
metric |
A |
formula |
A |
method |
A |
With a the number of species shared by a pair of sites, b the number of species only present in the first site, and c the number of species only present in the second site:
Jaccard = (b + c) / (a + b + c)
Jaccardturn = 2min(b, c) / (a + 2min(b, c)) (Baselga, 2012)
Sorensen = (b + c) / (2a + b + c)
Simpson = min(b, c) / (a + min(b, c))
If abundance data are available, Bray-Curtis and its turnover component can also be computed with the following equations:
Bray = (B + C) / (2A + B + C)
Brayturn = min(B, C)/(A + min(B, C)) (Baselga, 2013)
with A the sum of the lesser values for common species shared by a pair of sites. B and C are the total numbers of specimens counted at the first and second site, respectively, minus A.
formula can be used to compute customized metrics with the terms a, b, c, A, B, and C. For example, formula = c("pmin(b,c) / (a + pmin(b,c))", "(B + C) / (2*A + B + C)") will compute the Simpson and Bray-Curtis dissimilarity metrics, respectively. Note that pmin is used in the Simpson formula because a, b, c, A, B and C are numeric vectors.
Euclidean computes the Euclidean distance between each pair of sites.
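As a worked illustration of the formulas above, the following base-R sketch computes the terms and metrics for a single pair of sites. The abundance values and the variable names (`n_a`, `n_b`, `n_c`, `ab_A`, `ab_B`, `ab_C`) are illustrative assumptions; the code is a direct transcription of the equations, not the package's matrix-based implementation.

```r
# Toy abundance matrix: 2 sites x 5 species (illustrative values)
comat <- matrix(c(10, 0, 3, 5, 0,
                   4, 2, 0, 5, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Site1", "Site2"), paste0("Species", 1:5)))
s1 <- comat["Site1", ]
s2 <- comat["Site2", ]

# Occurrence-based terms
n_a <- sum(s1 > 0 & s2 > 0)    # species shared by both sites
n_b <- sum(s1 > 0 & s2 == 0)   # species only in the first site
n_c <- sum(s1 == 0 & s2 > 0)   # species only in the second site

jaccard     <- (n_b + n_c) / (n_a + n_b + n_c)
jaccardturn <- 2 * min(n_b, n_c) / (n_a + 2 * min(n_b, n_c))
sorensen    <- (n_b + n_c) / (2 * n_a + n_b + n_c)
simpson     <- min(n_b, n_c) / (n_a + min(n_b, n_c))

# Abundance-based terms
ab_A <- sum(pmin(s1, s2))      # sum of the lesser abundances
ab_B <- sum(s1) - ab_A         # specimens at the first site minus A
ab_C <- sum(s2) - ab_A         # specimens at the second site minus A

bray     <- (ab_B + ab_C) / (2 * ab_A + ab_B + ab_C)
brayturn <- min(ab_B, ab_C) / (ab_A + min(ab_B, ab_C))

c(Jaccard = jaccard, Jaccardturn = jaccardturn, Sorensen = sorensen,
  Simpson = simpson, Bray = bray, Brayturn = brayturn)
```

For this pair, a = 2, b = 1 and c = 2, which gives Jaccard = 0.6 and Simpson = 1/3; the abundance terms are A = 9, B = 9 and C = 3, which gives Bray = 0.4 and Brayturn = 0.25.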
A data.frame with the additional class bioregion.pairwise.metric, containing one or several dissimilarity metrics between pairs of sites. The first two columns represent the pairs of sites. There is one column per dissimilarity metric provided in metric and formula, except for the abc and ABC metrics, which are stored in three separate columns (one for each letter).
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
Baselga, A. (2012) The Relationship between Species Replacement, Dissimilarity Derived from Nestedness, and Nestedness. Global Ecology and Biogeography, 21(12), 1223–1232.
Baselga, A. (2013) Separating the two components of abundance-based dissimilarity: balanced changes in abundance vs. abundance gradients. Methods in Ecology and Evolution, 4(6), 552–557.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a3_pairwise_metrics.html.
Associated functions: similarity dissimilarity_to_similarity
comat <- matrix(sample(0:1000, size = 50, replace = TRUE, prob = 1 / 1:1001), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)
dissim <- dissimilarity(comat, metric = c("abc", "ABC", "Simpson", "Brayturn"))
dissim <- dissimilarity(comat, metric = "all", formula = "1 - (b + c) / (a + b + c)")
This function converts a data.frame of dissimilarity metrics (beta diversity) between sites into similarity metrics.
dissimilarity_to_similarity(dissimilarity, include_formula = TRUE)
dissimilarity |
the output object from |
include_formula |
a |
A data.frame with the additional class bioregion.pairwise.metric, providing similarity metrics for each pair of sites based on a dissimilarity object.
The behavior of this function changes depending on column names. Columns Site1 and Site2 are copied identically. If there are columns called a, b, c, A, B, C, they will also be copied identically. If there are columns based on your own formula (argument formula in dissimilarity()) or not in the original list of dissimilarity metrics (argument metric in dissimilarity()), and if the argument include_formula is set to FALSE, they will also be copied identically. Otherwise, they are converted like the other columns (default behavior).
If a column is called Euclidean, the similarity will be calculated based on the following formula:
Euclidean similarity = 1 / (1 + Euclidean distance)
Otherwise, all other columns will be transformed into similarity with the following formula:
similarity = 1 - dissimilarity
Maxime Lenormand ([email protected])
Boris Leroy ([email protected])
Pierre Denelle ([email protected])
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a3_pairwise_metrics.html.
Associated functions: similarity dissimilarity_to_similarity
comat <- matrix(sample(0:1000, size = 50, replace = TRUE, prob = 1 / 1:1001), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)
dissimil <- dissimilarity(comat, metric = "all")
dissimil
similarity <- dissimilarity_to_similarity(dissimil)
similarity
This function aims to optimize one or several criteria on a set of ordered bioregionalizations. It is typically used to find one or more optimal cluster counts on hierarchical trees to cut or ranges of bioregionalizations from k-means or PAM. Users should exercise caution in other cases (e.g., unordered bioregionalizations or unrelated bioregionalizations).
find_optimal_n( bioregionalizations, metrics_to_use = "all", criterion = "elbow", step_quantile = 0.99, step_levels = NULL, step_round_above = TRUE, metric_cutoffs = c(0.5, 0.75, 0.9, 0.95, 0.99, 0.999), n_breakpoints = 1, plot = TRUE )
bioregionalizations |
A |
metrics_to_use |
A |
criterion |
A |
step_quantile |
For |
step_levels |
For |
step_round_above |
A |
metric_cutoffs |
For |
n_breakpoints |
Specifies the number of breakpoints to find in the curve. Defaults to 1. |
plot |
A |
This function explores evaluation metric ~ cluster relationships, applying criteria to find optimal cluster counts.
Note on criteria: Several criteria can return multiple optimal cluster counts, emphasizing hierarchical or nested bioregionalizations. This approach aligns with modern recommendations for biological datasets, as seen in Ficetola et al. (2017)'s reanalysis of Holt et al. (2013).
Criteria for optimal clusters:
elbow: Identifies the "elbow" point in the evaluation metric curve, where incremental improvements diminish. Based on a method to find the maximum distance from a straight line linking curve endpoints.
increasing_step or decreasing_step: Highlights significant increases or decreases in metrics by analyzing pairwise differences between bioregionalizations. Users specify step_quantile or step_levels.
cutoffs: Derives clusters from specified metric cutoffs, e.g., as in Holt et al. (2013). Adjust cutoffs based on spatial scale.
breakpoints: Uses segmented regression to find breakpoints. Requires specifying n_breakpoints.
min & max: Selects clusters at minimum or maximum metric values.
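The elbow criterion described above (maximum distance from the straight line linking the curve endpoints) can be sketched in base R. The evaluation curve below is hypothetical, and the code is only one plausible reading of the criterion, not necessarily the exact computation performed by find_optimal_n().

```r
# Hypothetical evaluation curve: one metric value per number of clusters
n_clusters <- 2:10
metric <- c(0.40, 0.62, 0.75, 0.85, 0.88, 0.90, 0.91, 0.92, 0.925)

# Endpoints of the curve
x1 <- n_clusters[1]
y1 <- metric[1]
x2 <- n_clusters[length(n_clusters)]
y2 <- metric[length(metric)]

# Perpendicular distance from each point to the line through the endpoints
dist_to_line <- abs((y2 - y1) * n_clusters - (x2 - x1) * metric +
                      x2 * y1 - y2 * x1) /
  sqrt((y2 - y1)^2 + (x2 - x1)^2)

# The elbow is the point farthest from that line
elbow <- n_clusters[which.max(dist_to_line)]
elbow
```

Rescaling either axis multiplies all distances by the same constant, so the selected point does not depend on the units of the metric. With this curve the elbow falls at 5 clusters, after which the gains become small.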
A list of class bioregion.optimal.n with these elements:
args: Input arguments.
evaluation_df: The input evaluation data.frame, appended with boolean columns for optimal cluster counts.
optimal_nb_clusters: A list with optimal cluster counts for each metric in "metrics_to_use", based on the chosen criterion.
plot: The plot (if requested).
Please note that finding the optimal number of clusters normally requires decisions from the user and can therefore hardly be fully automated. Users are strongly advised to read the references indicated below for guidance on how to choose their optimal number(s) of clusters. Consider the "optimal" numbers of clusters returned by this function as a first approximation of the best numbers for your bioregionalization.
Boris Leroy ([email protected])
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
Holt BG, Lessard J, Borregaard MK, Fritz SA, Araújo MB, Dimitrov D, Fabre P, Graham CH, Graves GR, Jønsson Ka, Nogués-Bravo D, Wang Z, Whittaker RJ, Fjeldså J & Rahbek C (2013) An update of Wallace's zoogeographic regions of the world. Science 339, 74-78.
Ficetola GF, Mazel F & Thuiller W (2017) Global determinants of zoogeographical boundaries. Nature Ecology & Evolution 1, 0089.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_1_hierarchical_clustering.html#optimaln.
Associated functions: hclu_hierarclust
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001), 20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)
dissim <- dissimilarity(comat, metric = "all")
# User-defined number of clusters
tree <- hclu_hierarclust(dissim, optimal_tree_method = "best", n_clust = 5:10)
tree
a <- bioregionalization_metrics(tree, dissimilarity = dissim, species_col = "Node2",
                                site_col = "Node1", eval_metric = "anosim")
find_optimal_n(a, criterion = 'increasing_step', plot = FALSE)
A dataset containing the abundance of 195 species in 338 sites.
fishdf
A data.frame with 2,703 rows and 3 columns:
Unique site identifier (corresponding to the field ID of fishsf)
Unique species identifier
Species abundance
A dataset containing the abundance of each of the 195 species in each of the 338 sites.
fishmat
A co-occurrence matrix with sites as rows and species as columns. Each element of the matrix represents the abundance of the species in the site.
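The relationship between a long-format table (one row per site-species record) and a co-occurrence matrix of this kind can be sketched in base R. The toy table and its column names (`Site`, `Species`, `Abundance`) are assumptions made for illustration; the package examples elsewhere use mat_to_net() for the matrix-to-table direction.

```r
# Toy long-format table (column names are illustrative assumptions)
long <- data.frame(Site      = c("S1", "S1", "S2", "S3"),
                   Species   = c("sp1", "sp2", "sp2", "sp3"),
                   Abundance = c(5, 2, 7, 1))

# Wide co-occurrence matrix: sites as rows, species as columns,
# with 0 where a species was not recorded at a site
wide <- tapply(long$Abundance, list(long$Site, long$Species), sum,
               default = 0)
wide
```

Each cell of `wide` holds the abundance of one species at one site, so the 4 records of the toy table become a 3 x 3 matrix with five zero cells.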
A dataset containing the geometry of the 338 sites.
fishsf
A
Unique site identifier
Geometry of the site
This function computes a divisive hierarchical clustering from a dissimilarity (beta-diversity) data.frame, calculates the cophenetic correlation coefficient, and can generate clusters from the tree if requested by the user. The function implements randomization of the dissimilarity matrix to generate the tree, with a selection method based on the optimal cophenetic correlation coefficient. Typically, the dissimilarity data.frame is a bioregion.pairwise.metric object obtained by running dissimilarity, or by running similarity followed by similarity_to_dissimilarity.
hclu_diana( dissimilarity, index = names(dissimilarity)[3], n_clust = NULL, cut_height = NULL, find_h = TRUE, h_max = 1, h_min = 0 )
dissimilarity |
The output object from |
index |
The name or number of the dissimilarity column to use. By default,
the third column name of |
n_clust |
An |
cut_height |
A |
find_h |
A |
h_max |
A |
h_min |
A |
The function is based on diana. Chapter 6 of Kaufman & Rousseeuw (1990) fully details the functioning of the diana algorithm.
To find an optimal number of clusters, see bioregionalization_metrics()
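For readers who want to see the underlying algorithm in isolation, here is a minimal sketch that calls cluster::diana() directly on toy data, converts the result to an hclust tree, cuts it, and computes a cophenetic correlation. The data are random and illustrative, and the sketch does not reproduce the input handling or output structure of hclu_diana().

```r
set.seed(42)
# Toy dissimilarities among 8 sites (illustrative random data)
d <- dist(matrix(runif(32), nrow = 8))

# Divisive hierarchical clustering (DIANA) from the cluster package
dv <- cluster::diana(d)

# Convert to an hclust tree and cut it into 3 clusters
tree <- as.hclust(dv)
clusters <- cutree(tree, k = 3)

# Cophenetic correlation between the tree and the input dissimilarities
coph <- cor(as.vector(d), as.vector(cophenetic(tree)))

table(clusters)
coph
```

A cophenetic correlation close to 1 indicates that distances along the tree preserve the original pairwise dissimilarities well.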
A list of class bioregion.clusters with five slots:
name: A character string containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list describing the characteristics of the clustering process.
algorithm: A list containing all objects associated with the clustering procedure, such as the original cluster objects.
clusters: A data.frame containing the clustering results.
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
Maxime Lenormand ([email protected])
Kaufman L & Rousseeuw PJ (2009) Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_1_hierarchical_clustering.html.
Associated functions: cut_tree
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001), 20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)
dissim <- dissimilarity(comat, metric = "all")
data("fishmat")
fishdissim <- dissimilarity(fishmat)
fish_diana <- hclu_diana(fishdissim, index = "Simpson")
This function generates a hierarchical tree from a dissimilarity (beta-diversity) data.frame, calculates the cophenetic correlation coefficient, and optionally retrieves clusters from the tree upon user request. The function includes a randomization process for the dissimilarity matrix to generate the tree, with three methods available for constructing the final tree. Typically, the dissimilarity data.frame is a bioregion.pairwise.metric object obtained by running dissimilarity, or by running similarity followed by similarity_to_dissimilarity.
hclu_hierarclust( dissimilarity, index = names(dissimilarity)[3], method = "average", randomize = TRUE, n_runs = 100, keep_trials = FALSE, optimal_tree_method = "iterative_consensus_tree", n_clust = NULL, cut_height = NULL, find_h = TRUE, h_max = 1, h_min = 0, consensus_p = 0.5, verbose = TRUE )
dissimilarity |
The output object from |
index |
The name or number of the dissimilarity column to use. By
default, the third column name of |
method |
The name of the hierarchical classification method, as in
hclust. Should be one of |
randomize |
A |
n_runs |
The number of trials for randomizing the dissimilarity matrix. |
keep_trials |
A |
optimal_tree_method |
A |
n_clust |
An |
cut_height |
A |
find_h |
A |
h_max |
A |
h_min |
A |
consensus_p |
A |
verbose |
A |
The function is based on hclust.
The default method for the hierarchical tree is average, i.e., UPGMA, as it has been recommended as the best method to generate a tree from beta-diversity dissimilarity (Kreft & Jetz, 2010).
Clusters can be obtained by two methods:
Specifying a desired number of clusters in n_clust
Specifying one or several heights of cut in cut_height
To find an optimal number of clusters, see bioregionalization_metrics().
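The two ways of obtaining clusters can be sketched with base R alone. The example below builds an average-linkage tree with hclust on hypothetical dissimilarities and cuts it either by number of clusters or by height with cutree. It illustrates the idea behind n_clust and cut_height and is not the code used inside the function.

```r
# Hypothetical dissimilarities among five sites
d <- matrix(c(0.00, 0.10, 0.20, 0.80, 0.90,
              0.10, 0.00, 0.15, 0.85, 0.80,
              0.20, 0.15, 0.00, 0.90, 0.85,
              0.80, 0.85, 0.90, 0.00, 0.10,
              0.90, 0.80, 0.85, 0.10, 0.00),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("Site", 1:5), paste0("Site", 1:5)))

tree <- hclust(as.dist(d), method = "average")  # UPGMA

# Option 1: ask for a number of clusters
by_number <- cutree(tree, k = 2)

# Option 2: cut at one or several heights (one column per height)
by_height <- cutree(tree, h = c(0.5, 0.12))

by_number
by_height
```

Cutting at a lower height yields more, smaller clusters: here the cut at 0.5 gives two clusters and the cut at 0.12 gives three.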
Note that the order of rows in the input distance matrix influences the tree topology, as explained in Dapporto et al. (2013). To address this, the function generates multiple trees by randomizing the distance matrix.
Three methods are available to obtain the final tree:
optimal_tree_method = "iterative_consensus_tree": The Iterative Hierarchical Consensus Tree (IHCT) method reconstructs a consensus tree by iteratively splitting the dataset into two subclusters based on the pairwise dissimilarity of sites across n_runs trees, built from n_runs randomizations of the distance matrix. At each iteration, it identifies the majority membership of sites into two stable groups across all trees, calculates the height based on the selected linkage method (method), and enforces monotonic constraints on node heights to produce a coherent tree structure. This approach provides a robust, hierarchical representation of site relationships, balancing cluster stability and hierarchical constraints.
optimal_tree_method = "best": This method selects, among all trials, the tree with the highest cophenetic correlation coefficient, representing the best fit between the hierarchical structure and the original distance matrix.
optimal_tree_method = "consensus": This method constructs a consensus tree using phylogenetic methods with the function consensus. When using this option, you must set the consensus_p parameter, which indicates the proportion of trees that must contain a region/cluster for it to be included in the final consensus tree. Consensus trees lack an inherent height because they represent a majority structure rather than an actual hierarchical clustering. To assign heights, we use a non-negative least squares method (nnls.tree) based on the initial distance matrix, ensuring that the consensus tree preserves approximate distances among clusters.
We recommend using the "iterative_consensus_tree" method, as all the branches of this tree will always reflect the majority decision among many randomized versions of the distance matrix. This method is inspired by Dapporto et al. (2015), which also used the majority decision among many randomized versions of the distance matrix, but it expands it to reconstruct the entire topology of the tree iteratively.
We do not recommend using the basic consensus method because in many contexts it provides inconsistent results, with a meaningless tree topology and a very low cophenetic correlation coefficient.
For a fast exploration of the tree, we recommend using the best method, which only selects the tree with the highest cophenetic correlation coefficient among all randomized versions of the distance matrix.
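The idea behind the "best" option can be sketched in base R: shuffle the site order several times, build one tree per shuffle, and keep the tree whose cophenetic distances correlate best with the input dissimilarities. The data, the number of runs, and the object names below are hypothetical, and the package's own implementation may differ in its details.

```r
set.seed(42)  # reproducible shuffles

# Toy dissimilarity matrix among eight hypothetical sites.
# Rounding creates ties, a typical situation in which row order can matter.
n_sites <- 8
x <- matrix(runif(n_sites * 4), nrow = n_sites,
            dimnames = list(paste0("Site", seq_len(n_sites)), NULL))
d <- round(as.matrix(dist(x)), 1)

n_runs <- 20
trials <- lapply(seq_len(n_runs), function(i) {
  ord <- sample(n_sites)                    # randomize site order
  d_shuffled <- as.dist(d[ord, ord])
  tree <- hclust(d_shuffled, method = "average")
  # Spearman correlation between tree (cophenetic) and input distances
  coph_cor <- cor(as.vector(cophenetic(tree)), as.vector(d_shuffled),
                  method = "spearman")
  list(tree = tree, coph_cor = coph_cor)
})

# Keep the tree with the highest cophenetic correlation coefficient
coph <- vapply(trials, function(tr) tr$coph_cor, numeric(1))
best_tree <- trials[[which.max(coph)]]$tree
```

This selection is fast because it only compares already built trees, which is why it suits a first exploration.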
A list of class bioregion.clusters with five slots:
name: A character string containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list describing the characteristics of the clustering process.
algorithm: A list containing all objects associated with the clustering procedure, such as the original cluster objects.
clusters: A data.frame containing the clustering results.
In the algorithm slot, users can find the following elements:
trials: A list containing all randomization trials. Each trial includes the dissimilarity matrix with randomized site order, the associated tree, and the cophenetic correlation coefficient (Spearman) for that tree.
final.tree: An hclust object representing the final hierarchical tree to be used.
final.tree.coph.cor: The cophenetic correlation coefficient between the initial dissimilarity matrix and the final.tree.
Boris Leroy ([email protected])
Pierre Denelle ([email protected])
Maxime Lenormand ([email protected])
Kreft H & Jetz W (2010) A framework for delineating biogeographical regions based on species distributions. Journal of Biogeography 37, 2029–2053.
Dapporto L, Ramazzotti M, Fattorini S, Talavera G, Vila R & Dennis RLH (2013) Recluster: an unbiased clustering procedure for beta-diversity turnover. Ecography 36, 1070–1075.
Dapporto L, Ciolli G, Dennis RLH, Fox R & Shreeve TG (2015) A new procedure for extrapolating turnover regionalization at mid-small spatial scales, tested on British butterflies. Methods in Ecology and Evolution 6, 1287–1297.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_1_hierarchical_clustering.html.
Associated functions: cut_tree
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
                20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)

dissim <- dissimilarity(comat, metric = "Simpson")

# User-defined number of clusters
tree1 <- hclu_hierarclust(dissim, n_clust = 5)
tree1
plot(tree1)
str(tree1)
tree1$clusters

# User-defined height cut
# Only one height
tree2 <- hclu_hierarclust(dissim, cut_height = .05)
tree2
tree2$clusters

# Multiple heights
tree3 <- hclu_hierarclust(dissim, cut_height = c(.05, .15, .25))
tree3$clusters # Mind the order of height cuts: from deep to shallow cuts

# Info on each partition can be found in table cluster_info
tree3$cluster_info
plot(tree3)
This function performs semi-hierarchical clustering based on dissimilarity using the OPTICS algorithm (Ordering Points To Identify the Clustering Structure).
hclu_optics( dissimilarity, index = names(dissimilarity)[3], minPts = NULL, eps = NULL, xi = 0.05, minimum = FALSE, show_hierarchy = FALSE, algorithm_in_output = TRUE, ... )
dissimilarity |
The output object from |
index |
The name or number of the dissimilarity column to use. By
default, the third column name of |
minPts |
A |
eps |
A |
xi |
A |
minimum |
A |
show_hierarchy |
A |
algorithm_in_output |
A |
... |
Additional arguments to be passed to |
OPTICS (Ordering Points To Identify the Clustering Structure) is a semi-hierarchical clustering algorithm that orders the points in the dataset such that the closest points become neighbors, and calculates a reachability distance for each point. Clusters can then be extracted in a hierarchical manner from this reachability distance, by identifying clusters depending on changes in the relative cluster density. The reachability plot should be explored to understand the clusters and their hierarchical nature, by running plot on the output of the function if algorithm_in_output = TRUE: plot(object$algorithm).
We recommend reading Hahsler et al. (2019) to grasp the algorithm, how it works, and what the clusters mean.
To extract the clusters, we use the extractXi function, which is based on the steepness of the reachability plot (see optics).
A list of class bioregion.clusters with five slots:
name: A character string containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list describing the characteristics of the clustering process.
algorithm: A list containing all objects associated with the clustering procedure, such as the original cluster objects.
clusters: A data.frame containing the clustering results.
In the algorithm slot, if algorithm_in_output = TRUE, users can find the output of optics.
Boris Leroy ([email protected])
Pierre Denelle ([email protected])
Maxime Lenormand ([email protected])
Hahsler M, Piekenbrock M & Doran D (2019) Dbscan: Fast density-based clustering with R. Journal of Statistical Software 91, 1–30.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_1_hierarchical_clustering.html.
Associated functions: nhclu_dbscan
dissim <- dissimilarity(fishmat, metric = "all")

clust1 <- hclu_optics(dissim, index = "Simpson")
clust1

# Visualize the optics plot (the hierarchy of clusters is illustrated at the
# bottom)
plot(clust1$algorithm)

# Extract the hierarchy of clusters
clust1 <- hclu_optics(dissim, index = "Simpson", show_hierarchy = TRUE)
clust1
This function downloads and unzips the 'bin' folder required to run certain functions of the bioregion package. It also verifies if the files have the necessary permissions to be executed as programs. Finally, it tests whether the binary files are running correctly.
install_binaries( binpath = "tempdir", download_only = FALSE, infomap_version = c("2.1.0", "2.6.0", "2.7.1", "2.8.0") )
binpath |
A |
download_only |
A |
infomap_version |
A |
By default, the binary files are installed in R's temporary directory (binpath = "tempdir"). In this case, the bin folder will be automatically removed at the end of the R session. Alternatively, the binary files can be installed in the bioregion package folder (binpath = "pkgfolder").
A custom folder path can also be specified. In this case, and only in this case, download_only can be set to TRUE, but you must ensure that the files have the required permissions to be executed as programs.
In all cases, PLEASE MAKE SURE to update the binpath and check_install parameters accordingly in netclu_infomap, netclu_louvain, and netclu_oslom.
No return value.
Currently, only Infomap versions 2.1.0, 2.6.0, 2.7.1, and 2.8.0 are available.
Maxime Lenormand ([email protected])
Boris Leroy ([email protected])
Pierre Denelle ([email protected])
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a1_install_binary_files.html.
This plot function can be used to visualize bioregions based on a bioregion.clusters object combined with a geometry (sf objects).
map_bioregions(clusters, geometry, write_clusters = FALSE, plot = TRUE, ...)
clusters |
An object of class |
geometry |
A spatial object that can be handled by the |
write_clusters |
A |
plot |
A |
... |
Further arguments to be passed to |
The clusters and geometry site IDs should correspond. They should have the same type (i.e., character if clusters is a bioregion.clusters object) and the sites of clusters should be included in the sites of geometry.
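Before mapping, the correspondence between site IDs can be checked by hand. The base R sketch below uses two hypothetical ID vectors, standing in for the site IDs of a clustering result and of a spatial object, and verifies both conditions stated above. The object names are illustrative only.

```r
# Hypothetical site IDs from a clustering result (character)
cluster_sites <- c("1", "2", "5")
# Hypothetical site IDs from a spatial object (numeric: type mismatch)
geometry_sites <- c(1, 2, 3, 4, 5)

# Condition 1: same type. Convert the geometry IDs if needed.
if (!identical(class(cluster_sites), class(geometry_sites))) {
  geometry_sites <- as.character(geometry_sites)
}

# Condition 2: every clustered site must exist in the geometry
missing_sites <- setdiff(cluster_sites, geometry_sites)
stopifnot(length(missing_sites) == 0)
```

If missing_sites is not empty, it lists the clustered sites that have no matching geometry.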
One or several maps of bioregions if plot = TRUE and the geometry with additional clusters' attributes if write_clusters = TRUE.
Maxime Lenormand ([email protected])
Boris Leroy ([email protected])
Pierre Denelle ([email protected])
data(fishmat)
data(fishsf)

net <- similarity(fishmat, metric = "Simpson")
clu <- netclu_greedy(net)
map <- map_bioregions(clu, fishsf, write_clusters = TRUE, plot = FALSE)
This function generates a two- or three-column data.frame, where each row represents the interaction between two nodes (e.g., site and species) and an optional third column indicates the weight of the interaction (if weight = TRUE). The input is a contingency table, with rows representing one set of entities (e.g., site) and columns representing another set (e.g., species).
mat_to_net( mat, weight = FALSE, remove_zeroes = TRUE, include_diag = TRUE, include_lower = TRUE )
mat |
A contingency table (i.e., a |
weight |
A |
remove_zeroes |
A |
include_diag |
A |
include_lower |
A |
A data.frame where each row represents the interaction between two nodes. If weight = TRUE, the data.frame includes a third column representing the weight of each interaction.
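To make the conversion concrete, here is a base R sketch of the same idea on a tiny contingency table. It is not the package's implementation, and the column names Node1, Node2, and Weight are assumptions chosen for illustration.

```r
# Tiny site-by-species contingency table (hypothetical abundances)
mat <- matrix(c(3, 0, 1,
                0, 2, 5),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Site1", "Site2"),
                              c("Species1", "Species2", "Species3")))

# One row per (site, species) cell, with the cell value as weight
net <- data.frame(
  Node1 = rownames(mat)[row(mat)],
  Node2 = colnames(mat)[col(mat)],
  Weight = as.vector(mat),
  stringsAsFactors = FALSE
)

# Drop absent interactions (the idea behind remove_zeroes = TRUE)
net <- net[net$Weight != 0, ]
net
```

The two zero cells are dropped, so the six-cell table yields four interactions.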
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a2_matrix_and_network_formats.html.
Associated functions: net_to_mat
mat <- matrix(sample(1000, 50), 5, 10)
rownames(mat) <- paste0("Site", 1:5)
colnames(mat) <- paste0("Species", 1:10)

net <- mat_to_net(mat, weight = TRUE)
This function generates a contingency table from a two- or three-column data.frame, where each row represents the interaction between two nodes (e.g., site and species) and an optional third column indicates the weight of the interaction (if weight = TRUE).
net_to_mat( net, weight = FALSE, squared = FALSE, symmetrical = FALSE, missing_value = 0 )
net |
A two- or three-column |
weight |
A |
squared |
A |
symmetrical |
A |
missing_value |
The value to assign to pairs of nodes not present
in |
A matrix with the first nodes (from the first column of net) as rows and the second nodes (from the second column of net) as columns. If squared = TRUE, the rows and columns will have the same number of elements, corresponding to the unique union of objects in the first and second columns of net. If squared = TRUE and symmetrical = TRUE, the matrix will be forced to be symmetrical based on the upper triangular part of the matrix.
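The reverse conversion can be sketched in base R as well. The example below fills a matrix from a three-column network, using a placeholder value for pairs that do not appear in it. It only covers the default, non-squared case and is not the package's implementation.

```r
net <- data.frame(
  Site = c("A", "A", "B", "B", "B", "C", "C"),
  Species = c("a", "b", "a", "c", "d", "b", "d"),
  Weight = c(10, 100, 1, 20, 50, 10, 20),
  stringsAsFactors = FALSE
)

sites <- unique(net$Site)
species <- unique(net$Species)

# Start from a matrix filled with the value used for absent pairs
missing_value <- 0
mat <- matrix(missing_value,
              nrow = length(sites), ncol = length(species),
              dimnames = list(sites, species))

# Fill observed pairs using (row name, column name) indexing
mat[cbind(net$Site, net$Species)] <- net$Weight
mat
```

Pairs absent from net, such as site C and species a, keep the placeholder value.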
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a2_matrix_and_network_formats.html.
Associated functions: mat_to_net
net <- data.frame(
  Site = c(rep("A", 2), rep("B", 3), rep("C", 2)),
  Species = c("a", "b", "a", "c", "d", "b", "d"),
  Weight = c(10, 100, 1, 20, 50, 10, 20)
)

mat <- net_to_mat(net, weight = TRUE)
This function takes a bipartite weighted graph and computes modules by applying Newman’s modularity measure in a bipartite weighted version.
netclu_beckett( net, weight = TRUE, cut_weight = 0, index = names(net)[3], seed = NULL, forceLPA = FALSE, site_col = 1, species_col = 2, return_node_type = "both", algorithm_in_output = TRUE )
net |
A |
weight |
A |
cut_weight |
A minimal weight value. If |
index |
The name or number of the column to use as weight. By default,
the third column name of |
seed |
The seed for the random number generator ( |
forceLPA |
A |
site_col |
The name or number of the column for site nodes (i.e., primary nodes). |
species_col |
The name or number of the column for species nodes (i.e., feature nodes). |
return_node_type |
A |
algorithm_in_output |
A |
This function is based on the modularity optimization algorithm provided by Stephen Beckett (Beckett, 2016) as implemented in the bipartite package (computeModules).
A list of class bioregion.clusters with five slots:
name: A character containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
If algorithm_in_output = TRUE, users can find the output of computeModules in the algorithm slot.
Beckett's algorithm is designed to handle weighted bipartite networks. If weight = FALSE, a weight of 1 will be assigned to each pair of nodes. Ensure that the site_col and species_col arguments correctly identify the respective columns for site nodes (primary nodes) and species nodes (feature nodes). The type of nodes returned in the output can be selected using the return_node_type argument: "both" to include both node types, "site" to return only site nodes, or "species" to return only species nodes.
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
Beckett SJ (2016) Improved community detection in weighted bipartite networks. Royal Society Open Science 3, 140536.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_3_network_clustering.html.
Associated functions: netclu_infomap netclu_louvain netclu_oslom
net <- data.frame(
  Site = c(rep("A", 2), rep("B", 3), rep("C", 2)),
  Species = c("a", "b", "a", "c", "d", "b", "d"),
  Weight = c(10, 100, 1, 20, 50, 10, 20))

com <- netclu_beckett(net)
This function finds communities in a (un)weighted undirected network via greedy optimization of modularity.
netclu_greedy( net, weight = TRUE, cut_weight = 0, index = names(net)[3], bipartite = FALSE, site_col = 1, species_col = 2, return_node_type = "both", algorithm_in_output = TRUE )
net |
The output object from |
weight |
A |
cut_weight |
A minimal weight value. If |
index |
The name or number of the column to use as weight. By default,
the third column name of |
bipartite |
A |
site_col |
The name or number for the column of site nodes (i.e. primary nodes). |
species_col |
The name or number for the column of species nodes (i.e. feature nodes). |
return_node_type |
A |
algorithm_in_output |
A |
This function is based on the fast greedy modularity optimization algorithm (Clauset et al., 2004) as implemented in the igraph package (cluster_fast_greedy).
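For readers unfamiliar with modularity, the quantity being optimized can be computed by hand for a given partition. The base R sketch below evaluates Newman's modularity Q on a toy undirected network. The helper modularity_q is hypothetical and only shows the objective function, not the greedy merging procedure implemented in igraph.

```r
# Toy undirected network: two triangles joined by a single edge
A <- matrix(0, 6, 6)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3),
               c(4, 5), c(4, 6), c(5, 6),
               c(3, 4))
A[edges] <- 1
A[edges[, 2:1]] <- 1   # make the adjacency matrix symmetric

# Hypothetical helper: modularity of a partition
# Q = (1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * [c_i == c_j]
modularity_q <- function(A, membership) {
  k <- rowSums(A)      # (weighted) degree of each node
  two_m <- sum(A)      # twice the total edge weight
  same <- outer(membership, membership, "==")
  sum((A - outer(k, k) / two_m) * same) / two_m
}

q_two_groups <- modularity_q(A, c(1, 1, 1, 2, 2, 2))
q_one_group <- modularity_q(A, rep(1, 6))
```

Grouping each triangle separately gives Q = 5/14, whereas placing all nodes in one community gives Q = 0. A greedy optimizer repeatedly merges the communities that increase Q the most.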
A list of class bioregion.clusters with five slots:
name: A character containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
In the algorithm slot, if algorithm_in_output = TRUE, users can find the output of cluster_fast_greedy.
Although this algorithm was not primarily designed to deal with bipartite networks, it is possible to consider the bipartite network as a unipartite network (bipartite = TRUE).
Do not forget to indicate which of the first two columns is dedicated to the site nodes (i.e., primary nodes) and species nodes (i.e., feature nodes) using the arguments site_col and species_col. The type of nodes returned in the output can be chosen with the argument return_node_type equal to "both" to keep both types of nodes, "site" to preserve only the site nodes, and "species" to preserve only the species nodes.
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
Clauset A, Newman MEJ & Moore C (2004) Finding community structure in very large networks. Phys. Rev. E 70, 066111.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_3_network_clustering.html.
Associated functions: netclu_infomap netclu_louvain netclu_oslom
comat <- matrix(sample(1000, 50), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)

net <- similarity(comat, metric = "Simpson")
com <- netclu_greedy(net)

net_bip <- mat_to_net(comat, weight = TRUE)
clust2 <- netclu_greedy(net_bip, bipartite = TRUE)
This function finds communities in a (un)weighted (un)directed network based on the Infomap algorithm (https://github.com/mapequation/infomap).
netclu_infomap( net, weight = TRUE, cut_weight = 0, index = names(net)[3], seed = NULL, nbmod = 0, markovtime = 1, numtrials = 1, twolevel = FALSE, show_hierarchy = FALSE, directed = FALSE, bipartite_version = FALSE, bipartite = FALSE, site_col = 1, species_col = 2, return_node_type = "both", version = "2.8.0", binpath = "tempdir", check_install = TRUE, path_temp = "infomap_temp", delete_temp = TRUE )
net |
The output object from |
weight |
A |
cut_weight |
A minimal weight value. If |
index |
The name or number of the column to use as weight. By default,
the third column name of |
seed |
The seed for the random number generator ( |
nbmod |
Penalize solutions the more they differ from this number ( |
markovtime |
Scales link flow to change the cost of moving between
modules, higher values result in fewer modules ( |
numtrials |
The number of trials to run before picking the best solution. |
twolevel |
A |
show_hierarchy |
A |
directed |
A |
bipartite_version |
A |
bipartite |
A |
site_col |
The name or number for the column of site nodes (i.e. primary nodes). |
species_col |
The name or number for the column of species nodes (i.e. feature nodes). |
return_node_type |
A |
version |
A |
binpath |
A |
check_install |
A |
path_temp |
A |
delete_temp |
A |
Infomap is a network clustering algorithm based on the Map equation proposed in Rosvall & Bergstrom (2008) that finds communities in (un)weighted and (un)directed networks.
This function is based on the C++ version of Infomap (https://github.com/mapequation/infomap/releases). This function needs binary files to run. They can be installed with install_binaries.
If you changed the default path to the bin folder while running install_binaries, PLEASE MAKE SURE to set binpath accordingly.
If you did not use install_binaries to change the permissions and test the binary files, PLEASE MAKE SURE to set check_install accordingly.
The C++ version of Infomap generates temporary folders and/or files that are stored in the path_temp folder ("infomap_temp" with a unique timestamp, located in the bin folder in binpath by default). This temporary folder is removed by default (delete_temp = TRUE).
Several versions of Infomap are available in the package. See install_binaries for more details.
A list of class bioregion.clusters with five slots:
name: A character containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects.
clusters: A data.frame containing the clustering results.
In the algorithm slot, users can find the following elements:
cmd: The command line used to run Infomap.
version: The Infomap version.
web: Infomap's GitHub repository.
Infomap has been designed to deal with bipartite networks. To use this functionality, set the bipartite_version argument to TRUE in order to approximate a two-step random walker (see https://www.mapequation.org/infomap/ for more information). Note that a bipartite network can also be considered as a unipartite network (bipartite = TRUE).
In both cases, do not forget to indicate which of the first two columns is dedicated to the site nodes (i.e., primary nodes) and species nodes (i.e., feature nodes) using the arguments site_col and species_col. The type of nodes returned in the output can be chosen with the argument return_node_type equal to "both" to keep both types of nodes, "site" to preserve only the site nodes, and "species" to preserve only the species nodes.
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
Rosvall M & Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118-1123.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_3_network_clustering.html.
Associated functions: netclu_greedy netclu_louvain netclu_oslom
comat <- matrix(sample(1000, 50), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)

net <- similarity(comat, metric = "Simpson")
com <- netclu_infomap(net)
This function finds communities in a (un)weighted undirected network based on propagating labels.
netclu_labelprop( net, weight = TRUE, cut_weight = 0, index = names(net)[3], seed = NULL, bipartite = FALSE, site_col = 1, species_col = 2, return_node_type = "both", algorithm_in_output = TRUE )
net |
The output object from |
weight |
A |
cut_weight |
A minimal weight value. If |
index |
The name or number of the column to use as weight. By default,
the third column name of |
seed |
The seed for the random number generator ( |
bipartite |
A |
site_col |
The name or number for the column of site nodes (i.e. primary nodes). |
species_col |
The name or number for the column of species nodes (i.e. feature nodes). |
return_node_type |
A |
algorithm_in_output |
A |
This function is based on propagating labels (Raghavan et al., 2007) as implemented in the igraph package (cluster_label_prop).
A list of class bioregion.clusters with five slots:
name: A character containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
In the algorithm slot, if algorithm_in_output = TRUE, users can find a "communities" object, output of cluster_label_prop.
Although this algorithm was not primarily designed to deal with bipartite networks, it is possible to consider the bipartite network as a unipartite network (bipartite = TRUE).
Do not forget to indicate which of the first two columns is dedicated to the site nodes (i.e., primary nodes) and species nodes (i.e., feature nodes) using the arguments site_col and species_col. The type of nodes returned in the output can be chosen with the argument return_node_type equal to "both" to keep both types of nodes, "site" to preserve only the site nodes, and "species" to preserve only the species nodes.
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
Raghavan UN, Albert R & Kumara S (2007) Near linear time algorithm to detect community structures in large-scale networks. Physical Review E 76, 036106.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_3_network_clustering.html.
Associated functions: netclu_infomap netclu_louvain netclu_oslom
comat <- matrix(sample(1000, 50), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)

net <- similarity(comat, metric = "Simpson")
com <- netclu_labelprop(net)

net_bip <- mat_to_net(comat, weight = TRUE)
clust2 <- netclu_labelprop(net_bip, bipartite = TRUE)
This function finds communities in a (un)weighted undirected network based on the leading eigenvector of the community matrix.
netclu_leadingeigen( net, weight = TRUE, cut_weight = 0, index = names(net)[3], bipartite = FALSE, site_col = 1, species_col = 2, return_node_type = "both", algorithm_in_output = TRUE )
net |
The output object from |
weight |
A |
cut_weight |
A minimal weight value. If |
index |
The name or number of the column to use as weight. By default,
the third column name of |
bipartite |
A |
site_col |
The name or number for the column of site nodes (i.e., primary nodes). |
species_col |
The name or number for the column of species nodes (i.e., feature nodes). |
return_node_type |
A |
algorithm_in_output |
A |
This function is based on the leading eigenvector of the modularity matrix (Newman, 2006) as implemented in the igraph package (cluster_leading_eigen).
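As a rough illustration of the idea behind this method, the sketch below performs a single bisection in base R on a hypothetical toy network (two 4-node cliques joined by one edge). It builds the matrix B = A - k k^T / (2m) described by Newman (2006), usually called the modularity matrix, and splits the nodes by the sign of its leading eigenvector. This is not the igraph implementation, which repeats such splits; all object names are illustrative only.

```r
# Hypothetical toy network: two 4-node cliques joined by a single edge.
n <- 8
A <- matrix(0, n, n)
A[1:4, 1:4] <- 1
A[5:8, 5:8] <- 1
diag(A) <- 0
A[4, 5] <- A[5, 4] <- 1

k <- rowSums(A)                   # node degrees
m <- sum(A) / 2                   # number of edges
B <- A - outer(k, k) / (2 * m)    # matrix from Newman (2006)

# For a symmetric matrix, eigen() sorts eigenvalues in decreasing order,
# so the first column of $vectors is the leading eigenvector.
eig <- eigen(B, symmetric = TRUE)
v <- eig$vectors[, 1]

# One bisection: nodes are grouped by the sign of their entry.
# The sign of an eigenvector is arbitrary, so labels 1 and 2 may swap.
membership <- ifelse(v >= 0, 1L, 2L)
split(seq_len(n), membership)
```

On this toy network the two groups correspond to the two cliques; which group receives label 1 depends on the arbitrary sign of the eigenvector.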
A list of class bioregion.clusters with five slots:
name: A character containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
In the algorithm slot, if algorithm_in_output = TRUE, users can find the output of cluster_leading_eigen.
Although this algorithm was not primarily designed to deal with bipartite networks, it is possible to consider the bipartite network as a unipartite network (bipartite = TRUE).
Do not forget to indicate which of the first two columns is dedicated to the site nodes (i.e., primary nodes) and species nodes (i.e., feature nodes) using the arguments site_col and species_col.
The type of nodes returned in the output can be chosen with the argument return_node_type equal to "both" to keep both types of nodes, "site" to preserve only the site nodes, and "species" to preserve only the species nodes.
Maxime Lenormand
Pierre Denelle
Boris Leroy
Newman MEJ (2006) Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74, 036104.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_3_network_clustering.html.
Associated functions: netclu_infomap netclu_louvain netclu_oslom
comat <- matrix(sample(1000, 50), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)

net <- similarity(comat, metric = "Simpson")
com <- netclu_leadingeigen(net)

net_bip <- mat_to_net(comat, weight = TRUE)
clust2 <- netclu_leadingeigen(net_bip, bipartite = TRUE)
This function finds communities in a (un)weighted undirected network based on the Leiden algorithm of Traag, van Eck & Waltman.
netclu_leiden( net, weight = TRUE, cut_weight = 0, index = names(net)[3], seed = NULL, objective_function = "CPM", resolution_parameter = 1, beta = 0.01, n_iterations = 2, vertex_weights = NULL, bipartite = FALSE, site_col = 1, species_col = 2, return_node_type = "both", algorithm_in_output = TRUE )
net |
The output object from |
weight |
A |
cut_weight |
A minimal weight value. If |
index |
The name or number of the column to use as weight. By default,
the third column name of |
seed |
The random number generator seed (NULL for random by default). |
objective_function |
A string indicating the objective function to use, either the Constant Potts Model ("CPM") or "modularity" ("CPM" by default). |
resolution_parameter |
The resolution parameter to use. Higher resolutions lead to smaller communities, while lower resolutions lead to larger communities. |
beta |
A parameter affecting the randomness in the Leiden algorithm. This affects only the refinement step of the algorithm. |
n_iterations |
The number of iterations for the Leiden algorithm. Each iteration may further improve the partition. |
vertex_weights |
The vertex weights used in the Leiden algorithm. If not provided, they will be automatically determined based on the objective_function. Please see the details of this function to understand how to interpret the vertex weights. |
bipartite |
A |
site_col |
The name or number for the column of site nodes (i.e., primary nodes). |
species_col |
The name or number for the column of species nodes (i.e., feature nodes). |
return_node_type |
A |
algorithm_in_output |
A |
This function is based on the Leiden algorithm (Traag et al., 2019) as implemented in the igraph package (cluster_leiden).
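To make the role of resolution_parameter more concrete, the following base R sketch evaluates a Constant Potts Model quality for two candidate partitions of a hypothetical toy network. The helper cpm_quality() is not part of bioregion or igraph. It assumes one common convention (unit vertex weights, and for each community the internal edge weight minus the resolution times the number of node pairs); the exact normalisation used by cluster_leiden may differ.

```r
# Hypothetical toy network: two 4-node cliques joined by a single edge.
A <- matrix(0, 8, 8)
A[1:4, 1:4] <- 1
A[5:8, 5:8] <- 1
diag(A) <- 0
A[4, 5] <- A[5, 4] <- 1

# Illustrative CPM quality (assumed convention, unit vertex weights).
cpm_quality <- function(A, membership, gamma) {
  total <- 0
  for (cl in unique(membership)) {
    idx <- which(membership == cl)
    e_c <- sum(A[idx, idx]) / 2      # edge weight inside the community
    n_c <- length(idx)               # community size
    total <- total + e_c - gamma * n_c * (n_c - 1) / 2
  }
  total
}

one_block  <- rep(1L, 8)             # all nodes in one community
two_blocks <- rep(1:2, each = 4)     # one community per clique

# Low resolution: the merged partition scores higher.
q_low <- c(merged = cpm_quality(A, one_block, 0.01),
           split  = cpm_quality(A, two_blocks, 0.01))

# Higher resolution: the split partition scores higher.
q_high <- c(merged = cpm_quality(A, one_block, 0.5),
            split  = cpm_quality(A, two_blocks, 0.5))
```

Under this convention, the preferred partition of the toy network switches from one large community to two smaller ones as the resolution increases, which matches the behaviour described for resolution_parameter.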
A list of class bioregion.clusters with five slots:
name: A character containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
In the algorithm slot, if algorithm_in_output = TRUE, users can find the output of cluster_leiden.
Although this algorithm was not primarily designed to deal with bipartite networks, it is possible to consider the bipartite network as a unipartite network (bipartite = TRUE).
Do not forget to indicate which of the first two columns is dedicated to the site nodes (i.e., primary nodes) and species nodes (i.e., feature nodes) using the arguments site_col and species_col.
The type of nodes returned in the output can be chosen with the argument return_node_type equal to "both" to keep both types of nodes, "site" to preserve only the site nodes, and "species" to preserve only the species nodes.
Maxime Lenormand
Pierre Denelle
Boris Leroy
Traag VA, Waltman L & Van Eck NJ (2019) From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports 9, 5233.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_3_network_clustering.html.
Associated functions: netclu_infomap netclu_louvain netclu_oslom
comat <- matrix(sample(1000, 50), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)

net <- similarity(comat, metric = "Simpson")
com <- netclu_leiden(net)

net_bip <- mat_to_net(comat, weight = TRUE)
clust2 <- netclu_leiden(net_bip, bipartite = TRUE)
This function finds communities in a (un)weighted undirected network based on the Louvain algorithm.
netclu_louvain( net, weight = TRUE, cut_weight = 0, index = names(net)[3], lang = "igraph", resolution = 1, seed = NULL, q = 0, c = 0.5, k = 1, bipartite = FALSE, site_col = 1, species_col = 2, return_node_type = "both", binpath = "tempdir", check_install = TRUE, path_temp = "louvain_temp", delete_temp = TRUE, algorithm_in_output = TRUE )
net |
The output object from |
weight |
A |
cut_weight |
A minimal weight value. If |
index |
The name or number of the column to use as weight. By default,
the third column name of |
lang |
A string indicating which version of Louvain should be used
( |
resolution |
A resolution parameter to adjust the modularity (1 is chosen by default, see Details). |
seed |
The random number generator seed (only when |
q |
The quality function used to compute the partition of the graph (modularity is chosen by default, see Details). |
c |
The parameter for the Owsinski-Zadrozny quality function (between 0 and 1, 0.5 is chosen by default). |
k |
The kappa_min value for the Shi-Malik quality function (it must be > 0, 1 is chosen by default). |
bipartite |
A |
site_col |
The name or number for the column of site nodes (i.e., primary nodes). |
species_col |
The name or number for the column of species nodes (i.e., feature nodes). |
return_node_type |
A |
binpath |
A |
check_install |
A |
path_temp |
A |
delete_temp |
A |
algorithm_in_output |
A |
Louvain is a network community detection algorithm proposed by Blondel et al. (2008). This function offers two implementations of the Louvain algorithm (controlled by the lang parameter): the igraph implementation (cluster_louvain) and the C++ implementation (https://sourceforge.net/projects/louvain/, version 0.3).
The igraph implementation allows adjustment of the resolution parameter of the modularity function (resolution argument) used internally by the algorithm. Lower values typically yield fewer, larger clusters. The original definition of modularity is recovered when the resolution parameter is set to 1 (the default).
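The effect of the resolution argument can be checked by hand on a small case. The base R sketch below computes a modularity score with a resolution multiplier on the expected-edges term, Q = (1 / 2m) * sum_ij (A_ij - resolution * k_i k_j / 2m) * [c_i = c_j], for two candidate partitions of a hypothetical toy network. The helper modularity_res() is illustrative only and is not part of bioregion or igraph.

```r
# Hypothetical toy network: two 4-node cliques joined by a single edge.
A <- matrix(0, 8, 8)
A[1:4, 1:4] <- 1
A[5:8, 5:8] <- 1
diag(A) <- 0
A[4, 5] <- A[5, 4] <- 1

# Illustrative modularity with a resolution multiplier.
modularity_res <- function(A, membership, resolution = 1) {
  k <- rowSums(A)                                  # node degrees
  two_m <- sum(A)                                  # twice the edge count
  same <- outer(membership, membership, "==")      # same-community indicator
  sum((A - resolution * outer(k, k) / two_m) * same) / two_m
}

one_block  <- rep(1L, 8)             # all nodes in one community
two_blocks <- rep(1:2, each = 4)     # one community per clique

# resolution = 1: classical modularity, the split partition is preferred.
q_default <- c(merged = modularity_res(A, one_block),
               split  = modularity_res(A, two_blocks))

# Low resolution: the merged partition is preferred.
q_low <- c(merged = modularity_res(A, one_block, resolution = 0.05),
           split  = modularity_res(A, two_blocks, resolution = 0.05))
```

With resolution = 1 the two-clique partition has the higher score, whereas at a low resolution the single large community does, in line with the statement that lower values typically yield fewer, larger clusters.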
The C++ implementation provides several quality functions:
q = 0 for the classical Newman-Girvan criterion (modularity),
q = 1 for the Zahn-Condorcet criterion,
q = 2 for the Owsinski-Zadrozny criterion (parameterized by c),
q = 3 for the Goldberg Density criterion,
q = 4 for the A-weighted Condorcet criterion,
q = 5 for the Deviation to Indetermination criterion,
q = 6 for the Deviation to Uniformity criterion,
q = 7 for the Profile Difference criterion,
q = 8 for the Shi-Malik criterion (parameterized by k), and
q = 9 for the Balanced Modularity criterion.
The C++ version is based on version 0.3 (https://sourceforge.net/projects/louvain/). Binary files are required to run it, and can be installed with install_binaries.
If you changed the default path to the bin folder while running install_binaries, PLEASE MAKE SURE to set binpath accordingly.
If you did not use install_binaries to change the permissions or test the binary files, PLEASE MAKE SURE to set check_install accordingly.
The C++ version generates temporary folders and/or files in the path_temp folder ("louvain_temp" with a unique timestamp located in the bin folder in binpath by default). This temporary folder is removed by default (delete_temp = TRUE).
A list of class bioregion.clusters with five slots:
name: A character containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
In the algorithm slot, if algorithm_in_output = TRUE, users can find the output of cluster_louvain if lang = "igraph" and the following elements if lang = "cpp":
cmd: The command line used to run Louvain.
version: The Louvain version.
web: The Louvain website.
Although this algorithm was not primarily designed to deal with bipartite networks, it is possible to consider the bipartite network as a unipartite network (bipartite = TRUE).
Do not forget to indicate which of the first two columns is dedicated to the site nodes (i.e., primary nodes) and species nodes (i.e., feature nodes) using the arguments site_col and species_col. The type of nodes returned in the output can be chosen with the argument return_node_type equal to "both" to keep both types of nodes, "site" to preserve only the site nodes, and "species" to preserve only the species nodes.
Maxime Lenormand
Pierre Denelle
Boris Leroy
Blondel VD, Guillaume JL, Lambiotte R & Lefebvre E (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10), P10008.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_3_network_clustering.html.
Associated functions: netclu_infomap netclu_greedy netclu_oslom
comat <- matrix(sample(1000, 50), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)

net <- similarity(comat, metric = "Simpson")
com <- netclu_louvain(net, lang = "igraph")
This function finds communities in a (un)weighted (un)directed network based on the OSLOM algorithm (http://oslom.org/, version 2.4).
netclu_oslom( net, weight = TRUE, cut_weight = 0, index = names(net)[3], seed = NULL, reassign = "no", r = 10, hr = 50, t = 0.1, cp = 0.5, directed = FALSE, bipartite = FALSE, site_col = 1, species_col = 2, return_node_type = "both", binpath = "tempdir", check_install = TRUE, path_temp = "oslom_temp", delete_temp = TRUE )
net |
The output object from |
weight |
A |
cut_weight |
A minimal weight value. If |
index |
Name or number of the column to use as weight. By default,
the third column name of |
seed |
The seed for the random number generator (NULL for random by default). |
reassign |
A |
r |
The number of runs for the first hierarchical level (10 by default). |
hr |
The number of runs for the higher hierarchical level (50 by default, 0 if you are not interested in hierarchies). |
t |
The p-value threshold (0.10 by default). Increase this value if you want more modules. |
cp |
A resolution-like parameter used to decide between taking some modules or their union (0.5 by default; a larger value leads to larger clusters). |
directed |
A |
bipartite |
A |
site_col |
Name or number for the column of site nodes (i.e. primary nodes). |
species_col |
Name or number for the column of species nodes (i.e. feature nodes). |
return_node_type |
A |
binpath |
A |
check_install |
A |
path_temp |
A |
delete_temp |
A |
OSLOM is a network community detection algorithm proposed in Lancichinetti et al. (2011) that finds statistically significant (overlapping) communities in (un)weighted and (un)directed networks.
This function is based on version 2.4 of the C++ implementation of OSLOM (http://www.oslom.org/software.htm). Binary files are required to run it, and can be installed with install_binaries.
If you changed the default path to the bin folder while running install_binaries, PLEASE MAKE SURE to set binpath accordingly.
If you did not use install_binaries to change the permissions and test the binary files, PLEASE MAKE SURE to set check_install accordingly.
The C++ version of OSLOM generates temporary folders and/or files that are stored in the path_temp folder (folder "oslom_temp" with a unique timestamp located in the bin folder in binpath by default). This temporary folder is removed by default (delete_temp = TRUE).
A list of class bioregion.clusters with five slots:
name: A character containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
In the algorithm slot, users can find the following elements:
cmd: The command line used to run OSLOM.
version: The OSLOM version.
web: The OSLOM website.
Although this algorithm was not primarily designed to deal with bipartite networks, it is possible to consider the bipartite network as a unipartite network (bipartite = TRUE). Do not forget to indicate which of the first two columns is dedicated to the site nodes (i.e., primary nodes) and species nodes (i.e., feature nodes) using the arguments site_col and species_col. The type of nodes returned in the output can be chosen with the argument return_node_type equal to "both" to keep both types of nodes, "site" to preserve only the site nodes, and "species" to preserve only the species nodes.
Since OSLOM potentially returns overlapping communities, we propose two methods to reassign the 'overlapping' nodes: randomly (reassign = "random") or based on the closest candidate community (reassign = "simil"). The latter is available only for weighted networks; in this case, the closest candidate community is determined with the average similarity. By default, reassign = "no" and all the information will be provided. The number of partitions will depend on the number of overlapping modules (up to three). The suffixes _semel, _bis, and _ter are added to the column names. The first partition (_semel) assigns a module to each node. A value of NA in the second (_bis) and third (_ter) columns indicates that no overlapping module was found for this node (i.e., non-overlapping nodes).
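The layout described above, and the idea behind a similarity-based reassignment, can be sketched in base R. Everything in this example is hypothetical: the column prefix module, the toy similarity matrix, and the reassignment loop are illustrative assumptions and are not taken from the package's internal code.

```r
# Hypothetical clustering table with the three suffixes described above.
clusters <- data.frame(
  ID = paste0("Site", 1:5),
  module_semel = c(1, 1, 2, 2, 1),
  module_bis   = c(NA, NA, NA, NA, 2),
  module_ter   = rep(NA_real_, 5),
  stringsAsFactors = FALSE
)

# A non-NA value in the second column flags an overlapping node.
overlapping <- clusters$ID[!is.na(clusters$module_bis)]

# Hypothetical symmetric site-by-site similarity matrix.
sim <- matrix(0.1, 5, 5, dimnames = list(clusters$ID, clusters$ID))
sim["Site5", c("Site1", "Site2")] <- sim[c("Site1", "Site2"), "Site5"] <- 0.3
sim["Site5", c("Site3", "Site4")] <- sim[c("Site3", "Site4"), "Site5"] <- 0.8
diag(sim) <- 1

# Nodes assigned to exactly one module serve as reference members.
core <- clusters[is.na(clusters$module_bis), ]

# Assign each overlapping node to the candidate module whose reference
# members are, on average, most similar to it.
reassigned <- setNames(clusters$module_semel, clusters$ID)
for (node in overlapping) {
  row <- clusters[clusters$ID == node,
                  c("module_semel", "module_bis", "module_ter")]
  candidates <- unlist(row)
  candidates <- candidates[!is.na(candidates)]
  avg_sim <- sapply(candidates, function(cl) {
    members <- core$ID[core$module_semel == cl]
    mean(sim[node, members])
  })
  reassigned[node] <- candidates[which.max(avg_sim)]
}
```

In this toy table only Site5 overlaps two modules, and it ends up in module 2 because its average similarity to Site3 and Site4 is higher than to Site1 and Site2.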
Maxime Lenormand
Pierre Denelle
Boris Leroy
Lancichinetti A, Radicchi F, Ramasco JJ & Fortunato S (2011) Finding statistically significant communities in networks. PLOS ONE 6, e18961.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_3_network_clustering.html.
Associated functions: netclu_greedy netclu_infomap netclu_louvain
comat <- matrix(sample(1000, 50), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)

net <- similarity(comat, metric = "Simpson")
com <- netclu_oslom(net)
This function finds communities in a (un)weighted undirected network via short random walks.
netclu_walktrap( net, weight = TRUE, cut_weight = 0, index = names(net)[3], steps = 4, bipartite = FALSE, site_col = 1, species_col = 2, return_node_type = "both", algorithm_in_output = TRUE )
net |
The output object from |
weight |
A |
cut_weight |
A minimal weight value. If |
index |
Name or number of the column to use as weight. By default,
the third column name of |
steps |
The length of the random walks to perform. |
bipartite |
A |
site_col |
Name or number for the column of site nodes (i.e. primary nodes). |
species_col |
Name or number for the column of species nodes (i.e. feature nodes). |
return_node_type |
A |
algorithm_in_output |
A |
This function is based on random walks (Pons & Latapy, 2005) as implemented in the igraph package (cluster_walktrap).
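The intuition behind the steps argument can be illustrated with a few lines of base R. The sketch below computes, for a hypothetical toy network, the random-walk distance defined by Pons & Latapy between pairs of nodes after steps transitions: nodes in the same dense group reach similar destinations and are therefore close. It only illustrates the distance, not the agglomerative merging performed by cluster_walktrap, and the helper walk_dist() is an illustrative name.

```r
# Hypothetical toy network: two 4-node cliques joined by a single edge.
n <- 8
A <- matrix(0, n, n)
A[1:4, 1:4] <- 1
A[5:8, 5:8] <- 1
diag(A) <- 0
A[4, 5] <- A[5, 4] <- 1

k <- rowSums(A)          # node degrees
P <- A / k               # transition matrix: row i is divided by k[i]

steps <- 4               # same value as the default of the steps argument
Pt <- diag(n)
for (s in seq_len(steps)) Pt <- Pt %*% P   # Pt = P^steps

# Distance between nodes i and j: degree-weighted difference between the
# probability distributions reached after 'steps' transitions.
walk_dist <- function(i, j) sqrt(sum((Pt[i, ] - Pt[j, ])^2 / k))

d_within  <- walk_dist(1, 2)   # two nodes of the same clique
d_between <- walk_dist(1, 8)   # two nodes of different cliques
```

Here d_within is much smaller than d_between, which is the signal the algorithm relies on when deciding which groups of nodes to merge first.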
A list of class bioregion.clusters with five slots:
name: A character containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
In the algorithm slot, if algorithm_in_output = TRUE, users can find the output of cluster_walktrap.
Although this algorithm was not primarily designed to deal with bipartite networks, it is possible to consider the bipartite network as a unipartite network (bipartite = TRUE).
Do not forget to indicate which of the first two columns is dedicated to the site nodes (i.e., primary nodes) and species nodes (i.e., feature nodes) using the arguments site_col and species_col.
The type of nodes returned in the output can be chosen with the argument return_node_type equal to "both" to keep both types of nodes, "site" to preserve only the site nodes, and "species" to preserve only the species nodes.
Maxime Lenormand
Pierre Denelle
Boris Leroy
Pons P & Latapy M (2005) Computing Communities in Large Networks Using Random Walks. In Yolum I, Güngör T, Gürgen F, Özturan C (eds.), Computer and Information Sciences - ISCIS 2005, Lecture Notes in Computer Science, 284-293.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_3_network_clustering.html.
Associated functions: netclu_infomap netclu_louvain netclu_oslom
comat <- matrix(sample(1000, 50), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)

net <- similarity(comat, metric = "Simpson")
com <- netclu_walktrap(net)

net_bip <- mat_to_net(comat, weight = TRUE)
clust2 <- netclu_walktrap(net_bip, bipartite = TRUE)
This function performs non-hierarchical clustering using the Affinity Propagation algorithm.
nhclu_affprop( similarity, index = names(similarity)[3], seed = NULL, p = NA, q = NA, maxits = 1000, convits = 100, lam = 0.9, details = FALSE, nonoise = FALSE, K = NULL, prc = NULL, bimaxit = NULL, exact = NULL, algorithm_in_output = TRUE )
similarity |
The output object from |
index |
The name or number of the similarity column to use. By default,
the third column name of |
seed |
The seed for the random number generator used when
|
p |
Input preference, which can be a vector specifying individual
preferences for each data point. If scalar, the same value is used for all
data points. If |
q |
If |
maxits |
The maximum number of iterations to execute. |
convits |
The algorithm terminates if the exemplars do not change for
|
lam |
The damping factor, a value in the range [0.5, 1). Higher values correspond to heavier damping, which may help prevent oscillations. |
details |
If |
nonoise |
If |
K |
The desired number of clusters. If not |
prc |
A parameter needed when |
bimaxit |
A parameter needed when |
exact |
A flag indicating whether to compute the initial preference range exactly. |
algorithm_in_output |
A |
This function is based on the apcluster package (apcluster).
A list of class bioregion.clusters with five slots:
name: A character string containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list describing the characteristics of the clustering process.
algorithm: A list of objects associated with the clustering procedure, such as original cluster objects (if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
If algorithm_in_output = TRUE, the algorithm slot includes the output of apcluster.
Pierre Denelle
Boris Leroy
Maxime Lenormand
Frey B & Dueck D (2007) Clustering by Passing Messages Between Data Points. Science 315, 972-976.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_2_non_hierarchical_clustering.html.
Associated functions: nhclu_clara nhclu_clarans nhclu_dbscan nhclu_kmeans nhclu_pam
comat_1 <- matrix(sample(0:1000, size = 10*12, replace = TRUE, prob = 1/1:1001), 10, 12)
rownames(comat_1) <- paste0("Site", 1:10)
colnames(comat_1) <- paste0("Species", 1:12)
comat_1 <- cbind(comat_1, matrix(0, 10, 8, dimnames = list(paste0("Site", 1:10), paste0("Species", 13:20))))

comat_2 <- matrix(sample(0:1000, size = 10*12, replace = TRUE, prob = 1/1:1001), 10, 12)
rownames(comat_2) <- paste0("Site", 11:20)
colnames(comat_2) <- paste0("Species", 9:20)
comat_2 <- cbind(matrix(0, 10, 8, dimnames = list(paste0("Site", 11:20), paste0("Species", 1:8))), comat_2)

comat <- rbind(comat_1, comat_2)

dissim <- dissimilarity(comat, metric = "Simpson")
sim <- dissimilarity_to_similarity(dissim)

clust1 <- nhclu_affprop(sim)

clust2 <- nhclu_affprop(sim, q = 1)

# Fixed number of clusters
clust3 <- nhclu_affprop(sim, K = 2, prc = 10, bimaxit = 20, exact = FALSE)
This function performs non-hierarchical clustering based on dissimilarity using partitioning around medoids, implemented via the Clustering Large Applications (CLARA) algorithm.
nhclu_clara( dissimilarity, index = names(dissimilarity)[3], seed = NULL, n_clust = c(1, 2, 3), maxiter = 0, initializer = "LAB", fasttol = 1, numsamples = 5, sampling = 0.25, independent = FALSE, algorithm_in_output = TRUE )
dissimilarity |
The output object from |
index |
The name or number of the dissimilarity column to use. By
default, the third column name of |
seed |
A value for the random number generator (set to |
n_clust |
An |
maxiter |
An |
initializer |
A |
fasttol |
A positive |
numsamples |
A positive |
sampling |
A positive |
independent |
A |
algorithm_in_output |
A |
This function is based on the fastkmedoids package (fastclara).
A list of class bioregion.clusters with five components:
name: A character string containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
If algorithm_in_output = TRUE, the algorithm slot includes the output of fastclara.
Pierre Denelle
Boris Leroy
Maxime Lenormand
Schubert E & Rousseeuw PJ (2019) Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms. Similarity Search and Applications 11807, 171-187.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_2_non_hierarchical_clustering.html.
Associated functions: nhclu_clarans nhclu_dbscan nhclu_kmeans nhclu_pam nhclu_affprop
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001), 20, 25)
rownames(comat) <- paste0("Site", 1:20)
colnames(comat) <- paste0("Species", 1:25)

dissim <- dissimilarity(comat, metric = "all")

#clust <- nhclu_clara(dissim, index = "Simpson", n_clust = 5)
This function performs non-hierarchical clustering based on dissimilarity using partitioning around medoids, implemented via the Clustering Large Applications based on RANdomized Search (CLARANS) algorithm.
nhclu_clarans( dissimilarity, index = names(dissimilarity)[3], seed = NULL, n_clust = c(1, 2, 3), numlocal = 2, maxneighbor = 0.025, algorithm_in_output = TRUE )
dissimilarity |
The output object from |
index |
The name or number of the dissimilarity column to use. By
default, the third column name of |
seed |
A value for the random number generator ( |
n_clust |
An |
numlocal |
An |
maxneighbor |
A positive |
algorithm_in_output |
A |
This function is based on the fastkmedoids package (fastclarans).
A list of class bioregion.clusters with five components:
name: A character string containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
If algorithm_in_output = TRUE, the algorithm slot includes the output of fastclarans.
Pierre Denelle
Boris Leroy
Maxime Lenormand
Schubert E & Rousseeuw PJ (2019) Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms. Similarity Search and Applications 11807, 171-187.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_2_non_hierarchical_clustering.html.
Associated functions: nhclu_clara nhclu_dbscan nhclu_kmeans nhclu_pam nhclu_affprop
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001), 20, 25)
rownames(comat) <- paste0("Site", 1:20)
colnames(comat) <- paste0("Species", 1:25)

dissim <- dissimilarity(comat, metric = "all")

#clust <- nhclu_clarans(dissim, index = "Simpson", n_clust = 5)
This function performs non-hierarchical clustering based on dissimilarity using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.
nhclu_dbscan( dissimilarity, index = names(dissimilarity)[3], minPts = NULL, eps = NULL, plot = TRUE, algorithm_in_output = TRUE, ... )
dissimilarity |
The output object from |
index |
The name or number of the dissimilarity column to use. By
default, the third column name of |
minPts |
A |
eps |
A |
plot |
A |
algorithm_in_output |
A |
... |
Additional arguments to be passed to |
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
algorithm clusters points based on the density of neighbors around each
data point. It requires two main arguments: minPts
, the minimum number of
points to identify a core, and eps
, the radius used to find neighbors.
Choosing minPts: This determines how many points are necessary to form a cluster. For example, what is the minimum number of sites expected in a bioregion? Choose a value sufficiently large for your dataset and expectations.
Choosing eps: This determines how similar sites should be to form a cluster. If eps is too small, most points will be considered too distinct and marked as noise. If eps is too large, clusters may merge. The value of eps depends on minPts. It is recommended to choose eps by identifying a knee in the k-nearest neighbor distance plot.
By default, the function attempts to find a knee in this curve automatically, but the result is uncertain. Users should inspect the graph and modify eps accordingly. To explore eps values, run the function initially without defining eps, review the recommendations, and adjust as needed based on clustering results.
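To make the knee heuristic more concrete, the base-R sketch below computes the sorted k-nearest-neighbor distances that such a plot displays. The toy data and the helper name knn_dist_curve are illustrative assumptions; they are not part of bioregion or dbscan.

```r
# Hypothetical helper (not part of bioregion or dbscan): for each site,
# take the distance to its k-th nearest neighbour, then sort increasingly.
knn_dist_curve <- function(d, k) {
  m <- as.matrix(d)
  diag(m) <- Inf                       # a site is not its own neighbour
  kth <- apply(m, 1, function(row) sort(row)[k])
  sort(unname(kth))
}

# Toy data: two tight groups of three sites plus one isolated site
x <- c(0, 0.1, 0.2, 5, 5.1, 5.2, 20)
d <- dist(x)

curve_k2 <- knn_dist_curve(d, k = 2)
curve_k2
# plot(curve_k2, type = "b")  # the sharp rise at the end is the knee
```

In this toy case the curve stays flat for the six grouped sites and jumps for the isolated one, so a candidate eps would sit just above the flat part.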
A list of class bioregion.clusters with five components:
name: A character string containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
If algorithm_in_output = TRUE, the algorithm slot includes the output of dbscan::dbscan.
Boris Leroy ([email protected])
Pierre Denelle ([email protected])
Maxime Lenormand ([email protected])
Hahsler M, Piekenbrock M & Doran D (2019) Dbscan: Fast density-based clustering with R. Journal of Statistical Software, 91(1), 1–30.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_2_non_hierarchical_clustering.html.
Associated functions: nhclu_clara nhclu_clarans nhclu_kmeans nhclu_pam nhclu_affprop
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001), 20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)

dissim <- dissimilarity(comat, metric = "all")

clust1 <- nhclu_dbscan(dissim, index = "Simpson")
clust2 <- nhclu_dbscan(dissim, index = "Simpson", eps = 0.2)
clust3 <- nhclu_dbscan(dissim, index = "Simpson", minPts = c(5, 10, 15, 20),
                       eps = c(.1, .15, .2, .25, .3))
This function performs non-hierarchical clustering based on dissimilarity using a k-means analysis.
nhclu_kmeans( dissimilarity, index = names(dissimilarity)[3], seed = NULL, n_clust = c(1, 2, 3), iter_max = 10, nstart = 10, algorithm = "Hartigan-Wong", algorithm_in_output = TRUE )
dissimilarity |
The output object from |
index |
The name or number of the dissimilarity column to use. By
default, the third column name of |
seed |
A value for the random number generator ( |
n_clust |
An |
iter_max |
An |
nstart |
An |
algorithm |
A |
algorithm_in_output |
A |
This method partitions data into k groups such that the sum of squares of Euclidean distances from points to the assigned cluster centers is minimized. K-means cannot be applied directly to dissimilarity or beta-diversity metrics because these distances are not Euclidean. Therefore, it first requires transforming the dissimilarity matrix using Principal Coordinate Analysis (PCoA) with pcoa, and then applying k-means to the coordinates of points in the PCoA.
Because this additional transformation alters the initial dissimilarity matrix, the partitioning around medoids method (nhclu_pam) is preferred.
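The two-step procedure described above (ordination, then k-means on the coordinates) can be sketched in base R. This is only an illustration under stated assumptions: stats::cmdscale (classical scaling) is used as a stand-in for the PCoA step, and the toy sites are invented; it does not reproduce the internals of nhclu_kmeans.

```r
set.seed(1)

# Toy data: six sites forming two well-separated groups
x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
rownames(x) <- paste0("Site", 1:6)
d <- dist(x)                       # stands in for a dissimilarity matrix

# Step 1: principal coordinates (classical scaling used as a stand-in)
coords <- cmdscale(d, k = 2)

# Step 2: k-means on the coordinates rather than on the dissimilarities
km <- kmeans(coords, centers = 2, nstart = 10)
km$cluster
```

The cluster labels themselves are arbitrary; only the grouping of sites matters.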
A list of class bioregion.clusters with five components:
name: A character string containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
If algorithm_in_output = TRUE, the algorithm slot includes the output of kmeans.
Boris Leroy ([email protected])
Pierre Denelle ([email protected])
Maxime Lenormand ([email protected])
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_2_non_hierarchical_clustering.html.
Associated functions: nhclu_clara nhclu_clarans nhclu_dbscan nhclu_pam nhclu_affprop
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001), 20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)

comnet <- mat_to_net(comat)

dissim <- dissimilarity(comat, metric = "all")

clust <- nhclu_kmeans(dissim, n_clust = 2:10, index = "Simpson")
This function performs non-hierarchical clustering based on dissimilarity using partitioning around medoids (PAM).
nhclu_pam( dissimilarity, index = names(dissimilarity)[3], seed = NULL, n_clust = c(1, 2, 3), variant = "faster", nstart = 1, cluster_only = FALSE, algorithm_in_output = TRUE, ... )
dissimilarity |
The output object from |
index |
The name or number of the dissimilarity column to use. By
default, the third column name of |
seed |
A value for the random number generator ( |
n_clust |
An |
variant |
A |
nstart |
An |
cluster_only |
A |
algorithm_in_output |
A |
... |
Additional arguments to pass to |
This method partitions the data into the chosen number of clusters based on the input dissimilarity matrix. It is more robust than k-means because it minimizes the sum of dissimilarities between cluster centers (medoids) and points assigned to the cluster. In contrast, k-means minimizes the sum of squared Euclidean distances, which makes it unsuitable for dissimilarity matrices that are not based on Euclidean distances.
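The criterion mentioned above, the sum of dissimilarities between each point and its nearest medoid, can be illustrated with a small base-R sketch. The helper pam_cost and the exhaustive search are assumptions made for illustration; the search is feasible only for tiny data and is not the swap heuristic that PAM implementations use.

```r
# Toy dissimilarity between six sites forming two groups
x <- c(1, 2, 3, 10, 11, 12)
m <- as.matrix(dist(x))

# Hypothetical helper: total dissimilarity of every site to its nearest medoid
pam_cost <- function(m, medoids) {
  sum(apply(m[, medoids, drop = FALSE], 1, min))
}

# Exhaustive search over all pairs of candidate medoids
pairs <- combn(nrow(m), 2)
costs <- apply(pairs, 2, function(p) pam_cost(m, p))
best <- pairs[, which.min(costs)]

best                                     # indices of the selected medoids
apply(m[, best], 1, which.min)           # cluster assignment of each site
```

Because medoids are actual sites, the criterion only needs the dissimilarity matrix and never a Euclidean mean.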
A list of class bioregion.clusters with five components:
name: A character string containing the name of the algorithm.
args: A list of input arguments as provided by the user.
inputs: A list of characteristics of the clustering process.
algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
clusters: A data.frame containing the clustering results.
If algorithm_in_output = TRUE, the algorithm slot includes the output of pam.
Boris Leroy ([email protected])
Pierre Denelle ([email protected])
Maxime Lenormand ([email protected])
Kaufman L & Rousseeuw PJ (2009) Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_2_non_hierarchical_clustering.html.
Associated functions: nhclu_clara nhclu_clarans nhclu_dbscan nhclu_kmeans nhclu_affprop
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001), 20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)

comnet <- mat_to_net(comat)

dissim <- dissimilarity(comat, metric = "all")

clust <- nhclu_pam(dissim, n_clust = 2:15, index = "Simpson")
This function generates a data.frame where each row provides one or several similarity metrics between pairs of sites, based on a co-occurrence matrix with sites as rows and species as columns.
similarity(comat, metric = "Simpson", formula = NULL, method = "prodmat")
comat |
A co-occurrence |
metric |
A |
formula |
A |
method |
A |
With a the number of species shared by a pair of sites, b the number of species only present in the first site, and c the number of species only present in the second site:
Jaccard = 1 - (b + c) / (a + b + c)
Jaccardturn = 1 - 2min(b, c) / (a + 2min(b, c)) (Baselga, 2012)
Sorensen = 1 - (b + c) / (2a + b + c)
Simpson = 1 - min(b, c) / (a + min(b, c))
If abundance data are available, Bray-Curtis and its turnover component can also be computed with the following equations:
Bray = 1 - (B + C) / (2A + B + C)
Brayturn = 1 - min(B, C) / (A + min(B, C)) (Baselga, 2013)
with A the sum of the lesser values for common species shared by a pair of sites. B and C are the total number of specimens counted at both sites minus A.
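As a worked illustration of the definitions above, the following base-R sketch computes a, b, c and A, B, C for one pair of sites and plugs them into the formulas. The two abundance vectors are invented, and the variable cc is used in place of c only to avoid masking R's c() function.

```r
# Abundance of five species in two hypothetical sites
site1 <- c(sp1 = 10, sp2 = 0, sp3 = 3, sp4 = 0, sp5 = 1)
site2 <- c(sp1 = 2,  sp2 = 5, sp3 = 0, sp4 = 0, sp5 = 4)

# Presence/absence components
a  <- sum(site1 > 0 & site2 > 0)    # species shared by both sites
b  <- sum(site1 > 0 & site2 == 0)   # species only in the first site
cc <- sum(site1 == 0 & site2 > 0)   # species only in the second site

jaccard  <- 1 - (b + cc) / (a + b + cc)
sorensen <- 1 - (b + cc) / (2 * a + b + cc)
simpson  <- 1 - min(b, cc) / (a + min(b, cc))

# Abundance components
A <- sum(pmin(site1, site2))        # sum of the lesser abundances
B <- sum(site1) - A
C <- sum(site2) - A

bray     <- 1 - (B + C) / (2 * A + B + C)
brayturn <- 1 - min(B, C) / (A + min(B, C))

round(c(Jaccard = jaccard, Sorensen = sorensen, Simpson = simpson,
        Bray = bray, Brayturn = brayturn), 3)
```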
formula can be used to compute customized metrics with the terms a, b, c, A, B, and C. For example, formula = c("1 - pmin(b,c) / (a + pmin(b,c))", "1 - (B + C) / (2*A + B + C)") will compute the Simpson and Bray-Curtis similarity metrics, respectively. Note that pmin is used in the Simpson formula because a, b, c, A, B, and C are numeric vectors.
Euclidean computes the Euclidean similarity between each pair of sites following this equation:
Euclidean = 1 / (1 + d_ij)
Where d_ij is the Euclidean distance between site i and site j in terms of species composition.
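A minimal base-R sketch of this transformation, using an invented three-site matrix and stats::dist for the Euclidean distances (the package's own implementation is not shown here):

```r
# Hypothetical site-by-species matrix
comat <- rbind(Site1 = c(1, 0, 2),
               Site2 = c(1, 0, 2),
               Site3 = c(4, 4, 2))

d <- as.matrix(dist(comat))   # Euclidean distances d_ij
euclid_sim <- 1 / (1 + d)     # identical sites get 1; distant sites approach 0

euclid_sim
```

Site1 and Site2 have identical composition, so their similarity is 1, while Site1 and Site3 are at distance 5 and get a similarity of 1/6.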
A data.frame with the additional class bioregion.pairwise.metric, containing one or several similarity metrics between pairs of sites. The first two columns represent the pairs of sites. There is one column per similarity metric provided in metric and formula, except for the abc and ABC metrics, which are stored in three separate columns (one for each letter).
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
Baselga A (2012) The Relationship between Species Replacement, Dissimilarity Derived from Nestedness, and Nestedness. Global Ecology and Biogeography 21, 1223–1232.
Baselga A (2013) Separating the two components of abundance-based dissimilarity: balanced changes in abundance vs. abundance gradients. Methods in Ecology and Evolution 4, 552–557.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a3_pairwise_metrics.html.
Associated functions: dissimilarity similarity_to_dissimilarity
comat <- matrix(sample(0:1000, size = 50, replace = TRUE, prob = 1 / 1:1001), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)

sim <- similarity(comat, metric = c("abc", "ABC", "Simpson", "Brayturn"))

sim <- similarity(comat, metric = "all", formula = "1 - (b + c) / (a + b + c)")
This function converts a data.frame of similarity metrics between sites into dissimilarity metrics (beta diversity).
similarity_to_dissimilarity(similarity, include_formula = TRUE)
similarity |
The output object from |
include_formula |
A |
A data.frame with additional class bioregion.pairwise.metric, providing dissimilarity metric(s) between each pair of sites based on a similarity object.
The behavior of this function changes depending on column names. Columns Site1 and Site2 are copied identically. If there are columns called a, b, c, A, B, C, they will also be copied identically. If there are columns based on your own formula (argument formula in similarity()) or not in the original list of similarity metrics (argument metric in similarity()), and if the argument include_formula is set to FALSE, they will also be copied identically. Otherwise, they will be converted like the other columns (default behavior).
If a column is called Euclidean, its distance will be calculated based on the following formula:
Euclidean distance = (1 - Euclidean similarity) / Euclidean similarity
Otherwise, all other columns will be transformed into dissimilarity with the following formula:
dissimilarity = 1 - similarity
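The two conversion rules can be checked numerically with a short base-R sketch. The values below are invented; the point is that the Euclidean rule inverts the similarity 1 / (1 + d) and recovers the original distance, whereas bounded metrics simply use the complement.

```r
# Bounded metric (e.g. Simpson): dissimilarity is the complement
sim_simpson <- c(0.25, 0.80, 1.00)
dissim_simpson <- 1 - sim_simpson

# Euclidean: invert similarity = 1 / (1 + d) to recover the distance d
d <- c(0, 1.5, 4)
euclid_sim  <- 1 / (1 + d)
euclid_dist <- (1 - euclid_sim) / euclid_sim

all.equal(euclid_dist, d)   # the round trip returns the original distances
```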
Maxime Lenormand ([email protected])
Boris Leroy ([email protected])
Pierre Denelle ([email protected])
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a3_pairwise_metrics.html.
Associated functions: dissimilarity similarity_to_dissimilarity
comat <- matrix(sample(0:1000, size = 50, replace = TRUE, prob = 1 / 1:1001), 5, 10)
rownames(comat) <- paste0("Site", 1:5)
colnames(comat) <- paste0("Species", 1:10)

simil <- similarity(comat, metric = "all")
simil

dissimilarity <- similarity_to_dissimilarity(simil)
dissimilarity
This function calculates metrics to assess the contribution of a given species or site to its bioregion.
site_species_metrics( bioregionalization, comat, indices = c("rho"), net = NULL, site_col = 1, species_col = 2 )
bioregionalization |
A |
comat |
A co-occurrence |
indices |
A |
net |
|
site_col |
A number indicating the position of the column containing
the sites in |
species_col |
A number indicating the position of the column
containing the species in |
The rho metric is derived from Lenormand et al. (2019) with the following formula:
rho_ij = (n_ij - n_i * n_j / n) / sqrt(((n - n_j) / (n - 1)) * (1 - n_j / n) * (n_i * n_j / n))
where n is the number of sites, n_i is the number of sites in which species i is present, n_j is the number of sites in bioregion j, and n_ij is the number of occurrences of species i in sites of bioregion j.
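The following base-R sketch evaluates this metric for a single species and bioregion. It assumes that rho takes the form rho_ij = (n_ij - n_i * n_j / n) / sqrt(((n - n_j) / (n - 1)) * (1 - n_j / n) * (n_i * n_j / n)), as given in Lenormand et al. (2019); the helper name rho_ij and the counts are hypothetical, and the function is not the package's implementation.

```r
# Hypothetical helper, assuming the formula stated in the lead-in
rho_ij <- function(n, n_i, n_j, n_ij) {
  expected <- n_i * n_j / n   # occurrences expected if sites were drawn at random
  (n_ij - expected) /
    sqrt(((n - n_j) / (n - 1)) * (1 - n_j / n) * expected)
}

# 20 sites in total; the species occupies 5 sites; the bioregion holds 10 sites
rho_all  <- rho_ij(n = 20, n_i = 5, n_j = 10, n_ij = 5)  # all occurrences inside
rho_none <- rho_ij(n = 20, n_i = 5, n_j = 10, n_ij = 0)  # no occurrence inside

c(rho_all = rho_all, rho_none = rho_none)
```

Under this form, a positive value indicates that the species occurs in the bioregion more often than expected by chance, and a negative value indicates the opposite.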
Affinity (A), fidelity (F), and individual contributions (IndVal) describe how species are linked to their bioregions. These metrics are described in Bernardo-Madrid et al. (2019):
Affinity of species to their region: A_i = R_i / Z, where R_i is the occurrence/range size of species i in its associated bioregion, and Z is the total size (number of sites) of the bioregion. High affinity indicates that the species occupies most sites in its bioregion.
Fidelity of species to their region: F_i = R_i / D_i, where R_i is the occurrence/range size of species i in its bioregion, and D_i is its total range size. High fidelity indicates that the species is not present in other regions.
Indicator Value of species: IndVal = F_i * A_i.
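A small base-R sketch of these three quantities, assuming affinity is the share of the bioregion's sites occupied by the species (R_i / Z), fidelity is the share of the species' range falling in its bioregion (R_i / D_i), and the indicator value is their product. The matrix, the site assignment, and the symbol names are illustrative assumptions.

```r
# Hypothetical presence/absence matrix: 4 sites x 2 species
comat <- matrix(
  c(1, 1, 0, 1,    # spA: present in S1, S2 and S4
    1, 0, 1, 0),   # spB: present in S1 and S3
  nrow = 4,
  dimnames = list(paste0("S", 1:4), c("spA", "spB"))
)

# Hypothetical assignment of sites to bioregions; spA is assumed to belong to R1
bioregion <- c(S1 = "R1", S2 = "R1", S3 = "R1", S4 = "R2")
in_region <- bioregion[rownames(comat)] == "R1"

R_i <- sum(comat[in_region, "spA"] > 0)   # occurrences of spA inside R1
Z   <- sum(in_region)                     # number of sites in R1
D_i <- sum(comat[, "spA"] > 0)            # total range size of spA

affinity <- R_i / Z
fidelity <- R_i / D_i
indval   <- affinity * fidelity

c(affinity = affinity, fidelity = fidelity, indval = indval)
```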
Cz metrics are derived from Guimerà & Amaral (2005):
Participation coefficient: C_i = 1 - sum_s (k_is / k_i)^2, where the sum runs over all bioregions s, k_is is the number of links of node i to nodes in bioregion s, and k_i is the total degree of node i. A high value means links are uniformly distributed; a low value means links are within the node's bioregion.
Within-bioregion degree z-score: z_i = (k_i - mean(k_si)) / sd(k_si), where k_i is the number of links of node i to nodes in its bioregion s_i, mean(k_si) is the average degree of nodes in s_i, and sd(k_si) is the standard deviation of degrees in s_i.
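The sketch below computes both coefficients on a tiny invented network in base R, assuming C_i = 1 - sum_s (k_is / k_i)^2 and z_i = (within-bioregion degree of i minus its bioregion mean) divided by the bioregion standard deviation. The edge list and memberships are hypothetical, and sd() returns the sample standard deviation, which may differ from the estimator used by the package.

```r
# Hypothetical undirected network and bioregion membership of each node
edges <- data.frame(
  from = c("a", "a", "a", "c", "d", "e"),
  to   = c("b", "c", "d", "e", "e", "f"),
  stringsAsFactors = FALSE
)
module <- c(a = "M1", b = "M1", c = "M1", d = "M2", e = "M2", f = "M2")
nodes <- names(module)

# Symmetric adjacency matrix
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[cbind(edges$from, edges$to)] <- 1
adj[cbind(edges$to, edges$from)] <- 1

# k_is: links from each node (rows) to each bioregion (columns)
k_is <- t(rowsum(adj, group = module[rownames(adj)]))
k_i  <- rowSums(adj)

participation <- 1 - rowSums((k_is / k_i)^2)

# Within-bioregion degree, standardised inside each bioregion
k_within <- k_is[cbind(nodes, module[nodes])]
names(k_within) <- nodes
grp <- module[nodes]
z <- (k_within - ave(k_within, grp)) / ave(k_within, grp, FUN = sd)

round(cbind(participation, z), 3)
```

Node b only links to its own bioregion and therefore has a participation coefficient of 0, while node a has the highest within-bioregion degree of M1 and gets a positive z-score.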
A data.frame with columns Bioregion, Species, and the desired summary statistics, or a list of data.frames if Cz and other indices are selected.
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
Maxime Lenormand ([email protected])
Bernardo-Madrid R, Calatayud J, González‐Suárez M, Rosvall M, Lucas P, Antonelli A & Revilla E (2019) Human activity is altering the world’s zoogeographical regions. Ecology Letters 22, 1297–1305.
Guimerà R & Amaral LAN (2005) Functional cartography of complex metabolic networks. Nature 433, 895–900.
Lenormand M, Papuga G, Argagnon O, Soubeyrand M, Alleaume S & Luque S (2019) Biogeographical network analysis of plant species distribution in the Mediterranean region. Ecology and Evolution 9, 237–250.
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a5_3_summary_metrics.html.
Associated functions: bioregion_metrics bioregionalization_metrics
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001), 20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)

dissim <- dissimilarity(comat, metric = "Simpson")
clust1 <- nhclu_kmeans(dissim, n_clust = 3, index = "Simpson")

net <- similarity(comat, metric = "Simpson")
com <- netclu_greedy(net)

site_species_metrics(bioregionalization = clust1, comat = comat, indices = "rho")

# Contribution metrics
site_species_metrics(bioregionalization = com, comat = comat,
                     indices = c("rho", "affinity", "fidelity", "indicator_value"))

# Cz indices
net_bip <- mat_to_net(comat, weight = TRUE)
clust_bip <- netclu_greedy(net_bip, bipartite = TRUE)
site_species_metrics(bioregionalization = clust_bip, comat = comat,
                     net = net_bip, indices = "Cz")
This function extracts a subset of nodes based on their type ("site" or "species") from a bioregion.clusters object, which contains both types of nodes (sites and species).
site_species_subset(clusters, node_type = "site")
clusters |
An object of class |
node_type |
A |
An object of class bioregion.clusters containing only the specified node type (sites or species).
Network clustering functions (prefixed with netclu_) may return both types of nodes (sites and species) when applied to bipartite networks (using the bipartite argument). In such cases, the type of nodes included in the output can be specified with the return_node_type argument. This function allows you to extract a particular type of nodes (sites or species) from the output and adjust the return_node_type attribute accordingly.
Maxime Lenormand ([email protected])
Pierre Denelle ([email protected])
Boris Leroy ([email protected])
net <- data.frame(
  Site = c(rep("A", 2), rep("B", 3), rep("C", 2)),
  Species = c("a", "b", "a", "c", "d", "b", "d"),
  Weight = c(10, 100, 1, 20, 50, 10, 20)
)

clusters <- netclu_louvain(net, lang = "igraph", bipartite = TRUE)

clusters_sites <- site_species_subset(clusters, node_type = "site")
A dataset containing the abundance of 3,697 species in 715 sites.
vegedf
A data.frame with 460,878 rows and 3 columns:
Unique site identifier (corresponding to the field ID of vegesp)
Unique species identifier
Species abundance
A dataset containing the abundance of each of the 3,697 species in each of the 715 sites.
vegemat
A co-occurrence matrix with sites as rows and species as columns. Each element of the matrix represents the abundance of the species in the site.
A dataset containing the geometry of the 715 sites.
vegesf
A
Unique site identifier
Geometry of the site