- cytoflow.utility.consensus.resample(x: ndarray, frac: float) tuple[ndarray, ndarray][source]#
Resample a matrix x by a fraction frac
- Parameters:
x (np.ndarray) – Matrix to resample
frac (float) – Fraction of rows to resample
- Returns:
Indices of resampled rows and resampled matrix
- Return type:
tuple[np.ndarray, np.ndarray]
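The row resampling described above can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the name `resample_sketch`, the `rng` argument, and the choice to sample without replacement are all assumptions.

```python
import numpy as np

def resample_sketch(x, frac, rng=None):
    """Hypothetical helper: draw a fraction of the rows of x.

    Assumes sampling WITHOUT replacement; the real function may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_rows = x.shape[0]
    n_keep = int(n_rows * frac)
    # Pick n_keep distinct row indices, then slice those rows out of x.
    indices = rng.choice(n_rows, size=n_keep, replace=False)
    return indices, x[indices, :]

x = np.arange(20).reshape(10, 2)
indices, sampled = resample_sketch(x, 0.5, rng=np.random.default_rng(0))
```

The returned pair mirrors the documented return value: the row indices first, then the matching rows.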
- cytoflow.utility.consensus.compute_connectivity_matrix(labels: ndarray) ndarray[source]#
Compute a connectivity matrix from a vector of labels. The connectivity matrix is a symmetric matrix whose (i, j)th entry is 1 if the ith and jth elements of the labels vector are the same and 0 otherwise.
Out-of-bag samples are marked with a label of -1, meaning the item was not included in the sample; the (i, j)th entry is set to 0 if either the ith or jth label is -1. Warning: if your clustering algorithm itself assigns -1 as a label, you must change those labels to a different value before calling this function.
- Parameters:
labels (np.ndarray) – Vector of labels
- Returns:
Connectivity matrix
- Return type:
np.ndarray
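As an illustration of the rule described above, the following sketch builds a connectivity matrix with NumPy broadcasting. The name `connectivity_sketch` is hypothetical, and the code is an assumption about how such a matrix could be computed rather than the library's own code.

```python
import numpy as np

def connectivity_sketch(labels):
    """Hypothetical helper: 1 where two in-bag items share a label, else 0."""
    labels = np.asarray(labels)
    # Pairwise equality of labels.
    same = labels[:, None] == labels[None, :]
    # A label of -1 marks an item that was not in the sample.
    in_bag = labels != -1
    both_in_bag = in_bag[:, None] & in_bag[None, :]
    return (same & both_in_bag).astype(int)

labels = np.array([0, 0, 1, -1])
m = connectivity_sketch(labels)
```

Here the last item is out-of-bag, so its entire row and column are 0, including the diagonal entry.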
- cytoflow.utility.consensus.compute_identity_matrix(x: ndarray, resampled_indices: ndarray) ndarray[source]#
Compute an identity matrix from a matrix x and a vector of resampled indices. The identity matrix is a symmetric matrix whose (i, j)th entry is 1 if the ith and jth elements are both in the resampled indices and 0 otherwise.
- Parameters:
x (np.ndarray) – Matrix that was resampled
resampled_indices (np.ndarray) – Indices of resampled rows
- Returns:
Identity matrix
- Return type:
np.ndarray
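A minimal sketch of the identity (co-sampling) matrix described above. The helper name `identity_sketch` is hypothetical, and it takes the number of rows rather than the matrix itself, purely to keep the example short.

```python
import numpy as np

def identity_sketch(n_rows, resampled_indices):
    """Hypothetical helper: 1 where both items i and j were sampled, else 0."""
    sampled = np.zeros(n_rows, dtype=bool)
    sampled[resampled_indices] = True
    # Entry (i, j) is set only when both i and j are in the sample.
    return (sampled[:, None] & sampled[None, :]).astype(int)

ident = identity_sketch(4, np.array([0, 2]))
```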
- cytoflow.utility.consensus.compute_consensus_matrix(connectivity_matrices: list[ndarray], identity_matrices: list[ndarray]) ndarray[source]#
Compute a consensus matrix from a list of connectivity matrices and a list of identity matrices. The consensus matrix is a symmetric matrix whose (i, j)th entry is the proportion of connectivity matrices in which the ith and jth elements are connected, normalized by the proportion of identity matrices in which both elements are sampled.
- Parameters:
connectivity_matrices (list[np.ndarray]) – List of connectivity matrices
identity_matrices (list[np.ndarray]) – List of identity matrices
- Returns:
Consensus matrix
- Return type:
np.ndarray
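The normalization described above reduces to an element-wise ratio of two sums. The sketch below assumes that pairs which were never sampled together get a consensus of 0; that choice, and the name `consensus_sketch`, are assumptions rather than documented behavior.

```python
import numpy as np

def consensus_sketch(connectivity_matrices, identity_matrices):
    """Hypothetical helper: times connected divided by times co-sampled."""
    connected = np.sum(connectivity_matrices, axis=0).astype(float)
    sampled = np.sum(identity_matrices, axis=0).astype(float)
    consensus = np.zeros_like(connected)
    # Divide only where the pair was sampled at least once (assumption:
    # never-sampled pairs are left at 0).
    np.divide(connected, sampled, out=consensus, where=sampled > 0)
    return consensus

# Resample 1: item 2 is out-of-bag; items 0 and 1 share a cluster.
c1 = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]])
i1 = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]])
# Resample 2: every item is sampled and each sits in its own cluster.
c2 = np.eye(3, dtype=int)
i2 = np.ones((3, 3), dtype=int)

consensus = consensus_sketch([c1, c2], [i1, i2])
```

Items 0 and 1 were sampled together twice and clustered together once, so their consensus is 0.5.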
- cytoflow.utility.consensus.valid_clustering_obj(clustering_obj, k_param: str) bool[source]#
Check if a clustering object has the necessary methods to be used in consensus clustering (i.e. Scikit-learn signatures)
- Parameters:
clustering_obj (object) – Clustering object to check
k_param (str) – Name of the parameter that sets the number of clusters
- Returns:
True if clustering_obj has fit_predict and set_params methods, False otherwise.
- Return type:
bool
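The check described above amounts to duck typing. The sketch below tests only for the two documented methods; `HypotheticalClusterer` and `has_required_methods` are illustrative names, and any handling of `k_param` by the real function is not shown.

```python
class HypotheticalClusterer:
    """Stand-in with a Scikit-learn-like interface (not a real estimator)."""

    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit_predict(self, x):
        # Trivial round-robin labels, just to have something callable.
        return [i % self.n_clusters for i in range(len(x))]


def has_required_methods(obj):
    """Hypothetical check: both methods exist and are callable."""
    required = ("fit_predict", "set_params")
    return all(callable(getattr(obj, name, None)) for name in required)


ok = has_required_methods(HypotheticalClusterer())
not_ok = has_required_methods(object())
```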
- cytoflow.utility.consensus.cluster(x: ndarray, resample_frac: float, k: int, clustering_obj, k_param: str = 'n_clusters') tuple[Type, ndarray, ndarray][source]#
Sample a matrix x and cluster the resampled matrix, returning the clustering object, the indices of the resampled rows, and labels for the items in x with a value of -1 for items not included in the resampled matrix.
- Parameters:
x (np.ndarray) – Matrix to resample and cluster
resample_frac (float) – Fraction of rows to resample
k (int) – Number of clusters
clustering_obj – Clustering object to use; must have fit_predict and set_params methods
k_param (str) – Name of the parameter that sets the number of clusters
- Returns:
Clustering object, indices of resampled rows, and labels for the items in x
- Return type:
tuple[Type, np.ndarray, np.ndarray]
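The resample-then-cluster step described above can be sketched as follows. `RoundRobinClusterer` is a hypothetical stand-in for a Scikit-learn-style estimator, and `cluster_sketch` and its `rng` argument are assumptions used for illustration only.

```python
import numpy as np

class RoundRobinClusterer:
    """Hypothetical stand-in exposing set_params and fit_predict."""

    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit_predict(self, x):
        # Not a real clustering: assigns labels 0..n_clusters-1 in turn.
        return np.arange(len(x)) % self.n_clusters


def cluster_sketch(x, resample_frac, k, clustering_obj,
                   k_param="n_clusters", rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n_rows = x.shape[0]
    indices = rng.choice(n_rows, size=int(n_rows * resample_frac),
                         replace=False)
    # Set the number of clusters through the named parameter.
    clustering_obj.set_params(**{k_param: k})
    # Every item starts as out-of-bag (-1); sampled items get real labels.
    labels = np.full(n_rows, -1, dtype=int)
    labels[indices] = clustering_obj.fit_predict(x[indices, :])
    return clustering_obj, indices, labels


x = np.arange(20, dtype=float).reshape(10, 2)
clusterer, indices, labels = cluster_sketch(
    x, 0.5, 3, RoundRobinClusterer(), rng=np.random.default_rng(0)
)
```

The `labels` vector has one entry per row of `x`, which is what lets it feed directly into a connectivity matrix that treats -1 as "not sampled".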
- class cytoflow.utility.consensus.ConsensusClustering(clustering_obj, min_clusters: int, max_clusters: int, n_resamples: int, resample_frac: float = 0.5, k_param: str = 'n_clusters')[source]#
Bases: object
Consensus clustering for measuring the stability of clusters and selecting the optimal number of clusters. Consensus clustering was originally described in https://link.springer.com/article/10.1023/A:1023949509487 and implemented in R in the ConsensusClusterPlus package (https://academic.oup.com/bioinformatics/article/26/12/1572/281699). This class is a Python implementation of the same algorithm.
Clustering stability is measured by resampling rows of a matrix, clustering the resampled matrix, and computing a consensus matrix from the resampled clusters. The consensus distribution of each pair of items is used to measure cluster stability, and the optimal number of clusters is chosen by maximizing the change in the area under the cumulative distribution function (CDF) of the consensus matrix.
- consensus_k(k: int) ndarray[source]#
Get the consensus matrix for a given number of clusters.
- Parameters:
k (int) – Number of clusters
- Returns:
Consensus matrix for k
- Return type:
np.ndarray
- fit(x: ndarray, progress_bar: bool = False, n_jobs: int = 0) None[source]#
Compute consensus matrices for all values of k.
- Parameters:
x (np.ndarray) – Matrix to cluster
progress_bar (bool (default: False)) – Whether to display a progress bar
n_jobs (int (default: 0)) – Number of jobs to run in parallel. If 0, run in serial.
- Returns:
Consensus matrices are stored in self.consensus_matrices_
- Return type:
None
- hist(k: int) tuple[ndarray, ndarray][source]#
Compute the histogram of the consensus matrix for a given number of clusters.
- Parameters:
k (int) – Number of clusters
- Returns:
Histogram and bins
- Return type:
tuple[np.ndarray, np.ndarray]
- cdf(k: int) tuple[ndarray, ndarray, ndarray][source]#
Compute the cumulative distribution function (CDF) of the consensus matrix for a given number of clusters.
- Parameters:
k (int) – Number of clusters
- Returns:
CDF, histogram, and bins
- Return type:
tuple[np.ndarray, np.ndarray, np.ndarray]
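One plausible way to obtain the histogram and CDF described above is shown below. The number of bins, the use of the upper triangle only, and the name `cdf_sketch` are assumptions; the method itself may bin the consensus values differently.

```python
import numpy as np

def cdf_sketch(consensus, bins=20):
    """Hypothetical helper: empirical CDF of pairwise consensus values."""
    # Use each unordered pair once by taking the strict upper triangle.
    values = consensus[np.triu_indices_from(consensus, k=1)]
    hist, bin_edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    # Cumulative fraction of pairs at or below each bin.
    cdf = np.cumsum(hist) / hist.sum()
    return cdf, hist, bin_edges

consensus = np.array([[1.0, 0.5, 0.0],
                      [0.5, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
cdf, hist, bin_edges = cdf_sketch(consensus)
```

A stable clustering pushes consensus values toward 0 or 1, so its CDF is flat in the middle and rises sharply at the ends.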
- area_under_cdf(k: int) float[source]#
Compute the area under the cumulative distribution function (CDF) of the consensus matrix for a given number of clusters.
- Parameters:
k (int) – Number of clusters
- Returns:
Area under the CDF
- Return type:
float
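Given a binned CDF, the area under it can be approximated by summing the CDF value in each bin times the bin width. This is a sketch of that idea only; `area_under_cdf_sketch` is a hypothetical name and the library may integrate differently.

```python
import numpy as np

def area_under_cdf_sketch(cdf, bin_edges):
    """Hypothetical helper: rectangle-rule area under a binned CDF."""
    widths = np.diff(bin_edges)
    return float(np.sum(cdf * widths))

cdf = np.array([0.5, 0.5, 1.0, 1.0])
bin_edges = np.linspace(0.0, 1.0, 5)  # four bins of width 0.25
area = area_under_cdf_sketch(cdf, bin_edges)
```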
- change_in_area_under_cdf() ndarray[source]#
Compute the proportional change in the area under the cumulative distribution function (CDF) of the consensus matrix for each number of clusters.
- Returns:
Proportional change in area under the CDF as a function of the number of clusters
- Return type:
np.ndarray
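The proportional change described above can be sketched as follows. The convention used here, where the first entry is the area itself and later entries are relative increases over the previous k, is an assumption; the method may index or normalize differently.

```python
import numpy as np

def change_in_area_sketch(areas):
    """Hypothetical helper: proportional change in area between successive k."""
    areas = np.asarray(areas, dtype=float)
    change = np.empty_like(areas)
    # Assumption: the smallest k has no predecessor, so report its area.
    change[0] = areas[0]
    change[1:] = (areas[1:] - areas[:-1]) / areas[:-1]
    return change

# Areas for three consecutive values of k (made-up numbers).
change = change_in_area_sketch([0.5, 0.75, 0.9])
```

With these made-up areas the relative gain drops from 0.5 to 0.2, the kind of flattening that suggests additional clusters add little stability.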
- best_k(method: str = 'knee') int[source]#
Compute the optimal number of clusters by maximizing the change in the area under the cumulative distribution function (CDF) of the consensus matrix.
- Returns:
Optimal number of clusters
- Return type:
int
- plot_auc_cdf(include_knee: bool = True, ax: Axes | None = None)[source]#
Plot the area under the cumulative distribution function (CDF) of the consensus matrix as a function of the number of clusters.
- Parameters:
include_knee (bool (default: True)) – Whether to include a vertical line at the knee
ax (plt.Axes (default: None)) – Axis on which to plot
- Return type:
None
- plot_clustermap(k: int, **kwargs)[source]#
Plot a clustermap of the consensus matrix for a given number of clusters.
- Parameters:
k (int) – Number of clusters
kwargs – Keyword arguments to pass to seaborn.clustermap
- Return type:
seaborn.ClusterGrid
- plot_hist(k: int, ax: Axes | None = None) Axes[source]#
Plot a histogram of the consensus matrix for a given number of clusters.
- Parameters:
k (int) – Number of clusters
ax (plt.Axes | None) – Axis to plot on, or None to create a new figure
- Return type:
matplotlib.pyplot.Axes
- plot_cdf(ax: Axes | None = None) Axes[source]#
Plot the cumulative distribution function (CDF) of the consensus matrix for each number of clusters.
- Parameters:
ax (plt.Axes | None) – Axis to plot on, or None to create a new figure
- Return type:
matplotlib.pyplot.Axes
- plot_change_area_under_cdf(ax: Axes | None = None) Axes[source]#
Plot the change in the area under the cumulative distribution function (CDF) of the consensus matrix for each number of clusters.
- Parameters:
ax (plt.Axes | None) – Axis to plot on, or None to create a new figure
- Return type:
matplotlib.pyplot.Axes