cytoflow.utility.consensus.resample(x: ndarray, frac: float) → tuple[ndarray, ndarray]

Resample the rows of a matrix x, keeping a fraction frac of them.

Parameters:

x (np.ndarray) – Matrix to resample
frac (float) – Fraction of rows to resample

Returns:

Indices of resampled rows and resampled matrix

Return type:

tuple[np.ndarray, np.ndarray]
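A minimal sketch of row resampling, assuming rows are drawn without replacement and that int(n_rows * frac) rows are kept; neither detail is stated above. The name resample_sketch and the rng argument are hypothetical and not part of cytoflow.

```python
import numpy as np


def resample_sketch(x, frac, rng=None):
    """Hypothetical stand-in: keep a random fraction of the rows of x."""
    rng = np.random.default_rng() if rng is None else rng
    n_rows = x.shape[0]
    n_keep = int(n_rows * frac)  # assumption: truncate to an integer count
    # Assumption: sample without replacement, so no row appears twice.
    indices = rng.choice(n_rows, size=n_keep, replace=False)
    return indices, x[indices]


x = np.arange(20).reshape(10, 2)
indices, x_sub = resample_sketch(x, 0.5, rng=np.random.default_rng(0))
print(indices.shape, x_sub.shape)
```

The returned pair follows the documented order: indices first, then the resampled matrix.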

cytoflow.utility.consensus.compute_connectivity_matrix(labels: ndarray) → ndarray

Compute a connectivity matrix from a vector of labels. The connectivity matrix is a symmetric matrix whose (i, j)th entry is 1 if the ith and jth elements of the labels vector are the same and 0 otherwise.

Out-of-bag samples are handled by setting the (i, j)th entry to 0 if either the ith or jth element of the labels vector is -1, where -1 indicates that the item was not included in the sample. Warning: if your clustering algorithm itself assigns the label -1 (for example, to mark noise points), you must change those labels to a different value before calling this function.

Parameters: labels (np.ndarray) – Vector of labels
Returns: Connectivity matrix
Return type: np.ndarray
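The rule described above can be sketched with NumPy broadcasting. This is an illustration of the definition, not the library's code; connectivity_sketch is a hypothetical name. The second part shows one possible way to relabel a clusterer's own -1 labels before building the matrix.

```python
import numpy as np


def connectivity_sketch(labels):
    """Hypothetical stand-in: 1 where two sampled items share a label."""
    labels = np.asarray(labels)
    same_label = labels[:, None] == labels[None, :]
    sampled = labels != -1  # -1 marks items left out of the resample
    both_sampled = sampled[:, None] & sampled[None, :]
    return (same_label & both_sampled).astype(int)


labels = np.array([0, 0, 1, -1])
m = connectivity_sketch(labels)
print(m)

# If the clusterer itself emits -1 (e.g. for noise), move those items to a
# fresh label first so they are not mistaken for out-of-bag items.
raw = np.array([0, -1, 1, 1])
relabeled = np.where(raw == -1, raw.max() + 1, raw)
```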

cytoflow.utility.consensus.compute_identity_matrix(x: ndarray, resampled_indices: ndarray) → ndarray

Compute an identity matrix from a matrix x and a vector of resampled indices. The identity matrix is a symmetric matrix whose (i, j)th entry is 1 if rows i and j of x are both in the resampled indices and 0 otherwise. Note that this is an indicator of co-sampling, not the identity matrix of linear algebra.

Parameters:

x (np.ndarray) – Matrix that was resampled
resampled_indices (np.ndarray) – Indices of resampled rows

Returns:

Identity matrix

Return type:

np.ndarray
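As a sketch of the definition above (not the library's implementation), the co-sampling indicator can be built from a boolean membership vector. The name identity_sketch is hypothetical, and it takes the row count directly rather than the matrix x.

```python
import numpy as np


def identity_sketch(n_rows, resampled_indices):
    """Hypothetical stand-in: 1 where both items were in the resample."""
    in_sample = np.zeros(n_rows, dtype=bool)
    in_sample[resampled_indices] = True
    # Entry (i, j) is True only when both i and j were sampled.
    return (in_sample[:, None] & in_sample[None, :]).astype(int)


ident = identity_sketch(4, np.array([0, 2]))
print(ident)
```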

cytoflow.utility.consensus.compute_consensus_matrix(connectivity_matrices: list[ndarray], identity_matrices: list[ndarray]) → ndarray

Compute a consensus matrix from a list of connectivity matrices and a list of identity matrices. The consensus matrix is a symmetric matrix whose (i, j)th entry is the proportion of connectivity matrices in which the ith and jth elements are connected, normalized by the proportion of identity matrices in which the ith and jth elements are both sampled.

Parameters:

connectivity_matrices (list[np.ndarray]) – List of connectivity matrices
identity_matrices (list[np.ndarray]) – List of identity matrices

Returns:

Consensus matrix

Return type:

np.ndarray
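The normalization above amounts to dividing the number of resamples in which two items were clustered together by the number in which both were sampled. The sketch below assumes that pairs never sampled together get a consensus of 0; how the library treats that case is not stated here. consensus_sketch is a hypothetical name.

```python
import numpy as np


def consensus_sketch(connectivity_matrices, identity_matrices):
    """Hypothetical stand-in: co-clustering counts over co-sampling counts."""
    together = np.sum(connectivity_matrices, axis=0).astype(float)
    sampled = np.sum(identity_matrices, axis=0).astype(float)
    consensus = np.zeros_like(together)
    # Assumption: leave 0 where a pair was never sampled together.
    np.divide(together, sampled, out=consensus, where=sampled > 0)
    return consensus


# Resample 1: items 0 and 1 sampled and clustered together; item 2 left out.
conn1 = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]])
ident1 = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]])
# Resample 2: all items sampled, each in its own cluster.
conn2 = np.eye(3, dtype=int)
ident2 = np.ones((3, 3), dtype=int)

consensus = consensus_sketch([conn1, conn2], [ident1, ident2])
print(consensus)
```

Items 0 and 1 were sampled together twice and clustered together once, so their consensus is 0.5.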

cytoflow.utility.consensus.valid_clustering_obj(clustering_obj, k_param: str) → bool

Check whether a clustering object has the methods needed for consensus clustering (i.e., scikit-learn-style signatures).

Parameters:

clustering_obj (object) – Clustering object to check
k_param (str) – Name of the parameter that sets the number of clusters

Returns:

True if clustering_obj has fit_predict and set_params methods, False otherwise.

Return type:

bool
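A duck-typing check of this kind might look like the sketch below. It tests only for the two methods named in the return description; how the library uses k_param in its check is not documented here, so the sketch omits it. All names in the block are hypothetical.

```python
def has_sklearn_style_methods(clustering_obj):
    """Hypothetical check: both methods exist and are callable."""
    return all(
        callable(getattr(clustering_obj, name, None))
        for name in ("fit_predict", "set_params")
    )


class GoodClusterer:
    """Toy object exposing the two required methods."""

    def set_params(self, **params):
        return self

    def fit_predict(self, x):
        return [0] * len(x)


class BadClusterer:
    """Toy object missing fit_predict and set_params."""

    def fit(self, x):
        return self


print(has_sklearn_style_methods(GoodClusterer()))
print(has_sklearn_style_methods(BadClusterer()))
```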

cytoflow.utility.consensus.cluster(x: ndarray, resample_frac: float, k: int, clustering_obj, k_param: str = 'n_clusters') → tuple[Type, ndarray, ndarray]

Resample the rows of a matrix x and cluster the resampled matrix. Returns the clustering object, the indices of the resampled rows, and a label for every item in x, with a value of -1 for items not included in the resampled matrix.

Parameters:

x (np.ndarray) – Matrix to resample and cluster
resample_frac (float) – Fraction of rows to resample
k (int) – Number of clusters
clustering_obj – Clustering object to use; must have fit_predict and set_params methods
k_param (str) – Name of the parameter that sets the number of clusters

Returns:

Clustering object, indices of resampled rows, and labels for the items in x

Return type:

tuple[Type, np.ndarray, np.ndarray]
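The steps above can be sketched as follows: resample, set the cluster count through k_param, call fit_predict on the subsample, and scatter the labels back into a full-length vector initialized to -1. ThresholdClusterer and cluster_sketch are hypothetical stand-ins written for this example, and sampling without replacement is an assumption.

```python
import numpy as np


class ThresholdClusterer:
    """Toy clusterer with scikit-learn-style methods (illustration only)."""

    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit_predict(self, x):
        # Split the first column into n_clusters equal-width bins.
        col = x[:, 0].astype(float)
        edges = np.linspace(col.min(), col.max(), self.n_clusters + 1)[1:-1]
        return np.digitize(col, edges)


def cluster_sketch(x, resample_frac, k, clustering_obj,
                   k_param="n_clusters", rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n_rows = x.shape[0]
    indices = rng.choice(n_rows, size=int(n_rows * resample_frac),
                         replace=False)
    clustering_obj.set_params(**{k_param: k})
    labels = np.full(n_rows, -1)  # -1 means "not in this resample"
    labels[indices] = clustering_obj.fit_predict(x[indices])
    return clustering_obj, indices, labels


x = np.arange(10, dtype=float).reshape(10, 1)
obj, indices, labels = cluster_sketch(
    x, 0.5, 2, ThresholdClusterer(), rng=np.random.default_rng(0)
)
print(labels)
```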

class cytoflow.utility.consensus.ConsensusClustering(clustering_obj, min_clusters: int, max_clusters: int, n_resamples: int, resample_frac: float = 0.5, k_param: str = 'n_clusters')

Bases: object

Consensus clustering for measuring the stability of clusters and selecting the optimal number of clusters. Consensus clustering was originally described in https://link.springer.com/article/10.1023/A:1023949509487 and is implemented in R in the ConsensusClusterPlus package (https://academic.oup.com/bioinformatics/article/26/12/1572/281699). This class is a Python implementation of the same algorithm.

Clustering stability is measured by resampling rows of a matrix, clustering the resampled matrix, and computing a consensus matrix from the resampled clusters. The consensus distribution of each pair of items is used to measure cluster stability, and the optimal number of clusters is chosen by maximizing the change in the area under the cumulative distribution function (CDF) of the consensus matrix.

property cluster_range_: list[int]

consensus_k(k: int) → ndarray

Get the consensus matrix for a given number of clusters.

Parameters: k (int) – Number of clusters
Returns: Consensus matrix for k
Return type: np.ndarray

fit(x: ndarray, progress_bar: bool = False, n_jobs: int = 0) → None

Compute consensus matrices for all values of k.

Parameters:

x (np.ndarray) – Matrix to cluster
progress_bar (bool (default: False)) – Whether to display a progress bar
n_jobs (int (default: 0)) – Number of jobs to run in parallel. If 0, run in serial.

Returns:

Nothing; the consensus matrices are stored in self.consensus_matrices_

Return type:

None

hist(k: int) → tuple[ndarray, ndarray]

Compute the histogram of the consensus matrix for a given number of clusters.

Parameters: k (int) – Number of clusters
Returns: Histogram and bins
Return type: tuple[np.ndarray, np.ndarray]

cdf(k: int) → tuple[ndarray, ndarray, ndarray]

Compute the cumulative distribution function (CDF) of the consensus matrix for a given number of clusters.

Parameters: k (int) – Number of clusters
Returns: CDF, histogram, and bins
Return type: tuple[np.ndarray, np.ndarray, np.ndarray]
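One way to obtain an empirical CDF from a consensus matrix is to histogram its off-diagonal entries and take a normalized cumulative sum. The sketch below assumes that only the upper triangle is used and that bins span [0, 1]; the bin count and both of those choices are assumptions, and consensus_cdf_sketch is a hypothetical name.

```python
import numpy as np


def consensus_cdf_sketch(consensus, bins=20):
    """Hypothetical stand-in returning (cdf, hist, bin_edges)."""
    # Assumption: each unordered pair counts once, diagonal excluded.
    rows, cols = np.triu_indices_from(consensus, k=1)
    values = consensus[rows, cols]
    hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    cdf = np.cumsum(hist) / hist.sum()
    return cdf, hist, edges


consensus = np.array([[1.0, 0.5, 0.0],
                      [0.5, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
cdf, hist, edges = consensus_cdf_sketch(consensus, bins=4)
print(hist, cdf)
```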

area_under_cdf(k: int) → float

Compute the area under the cumulative distribution function (CDF) of the consensus matrix for a given number of clusters.

Parameters: k (int) – Number of clusters
Returns: Area under the CDF
Return type: float
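For a binned CDF, the area can be approximated by summing each bin's CDF value times its width. This is a sketch of that calculation under the assumption of a histogram-based CDF; the library's exact integration rule is not documented here, and area_under_cdf_sketch is a hypothetical name.

```python
import numpy as np


def area_under_cdf_sketch(cdf, edges):
    """Hypothetical stand-in: rectangle-rule area under a binned CDF."""
    widths = np.diff(edges)  # one width per bin
    return float(np.sum(cdf * widths))


cdf = np.array([0.5, 0.5, 1.0, 1.0])
edges = np.linspace(0.0, 1.0, 5)  # four bins of width 0.25
area = area_under_cdf_sketch(cdf, edges)
print(area)
```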

change_in_area_under_cdf() → ndarray

Compute the proportional change in the area under the cumulative distribution function (CDF) of the consensus matrix for each number of clusters.

Returns: Proportional change in area under the CDF as a function of the number of clusters
Return type: np.ndarray
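A proportional change between consecutive values of k can be sketched as (A(k) - A(k-1)) / A(k-1). The areas below are made-up illustrative numbers, not results, and the indexing convention and the handling of the smallest k are assumptions; the library may differ on both.

```python
import numpy as np

cluster_range = [2, 3, 4, 5]
areas = np.array([0.50, 0.70, 0.77, 0.78])  # illustrative values only

# Relative increase in area when going from k - 1 to k clusters.
change = np.diff(areas) / areas[:-1]

for k, delta in zip(cluster_range[1:], change):
    print(f"k={k}: proportional change {delta:.3f}")
```

In this toy series the gain shrinks quickly after k = 3, which is the kind of flattening that a knee-based selection looks for.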

best_k(method: str = 'knee') → int

Compute the optimal number of clusters by maximizing the change in the area under the cumulative distribution function (CDF) of the consensus matrix.

Returns: Optimal number of clusters
Return type: int

plot_auc_cdf(include_knee: bool = True, ax: Axes | None = None)

Plot the area under the cumulative distribution function (CDF) of the consensus matrix as a function of the number of clusters.

Parameters:

include_knee (bool (default: True)) – Whether to include a vertical line at the knee
ax (plt.Axes (default: None)) – Axis on which to plot

Return type:

None

plot_clustermap(k: int, **kwargs)

Plot a clustermap of the consensus matrix for a given number of clusters.

Parameters:

k (int) – Number of clusters
kwargs – Keyword arguments to pass to seaborn.clustermap

Return type:

seaborn.ClusterGrid

plot_hist(k: int, ax: Axes | None = None) → Axes

Plot a histogram of the consensus matrix for a given number of clusters.

Parameters:

k (int) – Number of clusters
ax (plt.Axes | None) – Axis to plot on, or None to create a new figure

Return type:

matplotlib.pyplot.Axes

plot_cdf(ax: Axes | None = None) → Axes

Plot the cumulative distribution function (CDF) of the consensus matrix for each number of clusters.

Parameters: ax (plt.Axes | None) – Axis to plot on, or None to create a new figure
Return type: matplotlib.pyplot.Axes

plot_change_area_under_cdf(ax: Axes | None = None) → Axes

Plot the change in the area under the cumulative distribution function (CDF) of the consensus matrix for each number of clusters.

Parameters: ax (plt.Axes | None) – Axis to plot on, or None to create a new figure
Return type: matplotlib.pyplot.Axes