cytoflow.utility.util_functions

Useful (mostly numeric) utility functions.

iqr – calculate the interquartile range for an array of numbers

num_hist_bins – calculate the number of histogram bins using Freedman-Diaconis

geom_mean – compute the geometric mean

geom_sd – compute the geometric standard deviation

geom_sd_range – compute [geom_mean / geom_sd, geom_mean * geom_sd]

geom_sem – compute the geometric standard error of the mean

geom_sem_range – compute [geom_mean / geom_sem, geom_mean * geom_sem]

cartesian – generate the cartesian product of input arrays

sanitize_identifier – makes a string a valid Python identifier by replacing all non-safe characters with ‘_’

random_string – Makes a random string of ascii digits and lowercase letters

is_numeric – determine if a pandas.Series or numpy.ndarray is numeric from its dtype

cov2corr – compute the correlation matrix from the covariance matrix

cytoflow.utility.util_functions.iqr(a)

Calculate the inter-quartile range for an array of numbers.

Parameters

a (array_like) – The array of numbers to compute the IQR for.

Returns

The IQR of the data.

Return type

float
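As a rough illustration of the quantity being computed, here is a minimal standard-library sketch. The name `iqr_sketch` is hypothetical, and the choice of linear interpolation between data points (`method="inclusive"`) is an assumption; this is not cytoflow's implementation.

```python
import statistics


def iqr_sketch(values):
    """Return the interquartile range (Q3 - Q1) of a sequence of numbers."""
    # Assumption: quartiles are found by linear interpolation between
    # data points, treating the data as the full population.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


print(iqr_sketch([1, 2, 3, 4, 5, 6, 7, 8]))  # 3.5
```

Other quantile conventions give slightly different values on small samples, so results may not match another library exactly.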

cytoflow.utility.util_functions.num_hist_bins(a)

Calculate number of histogram bins using Freedman-Diaconis rule.

From http://stats.stackexchange.com/questions/798/

Parameters

a (array_like) – The data to make a histogram of.

Returns

The number of bins in the histogram

Return type

int
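The Freedman-Diaconis rule sets the bin width to 2 * IQR / n^(1/3) and divides the data range by that width. The sketch below shows the arithmetic using only the standard library; `num_hist_bins_sketch` is a hypothetical name, and both the quartile convention and the single-bin fallback for a zero IQR are assumptions rather than documented cytoflow behavior.

```python
import math
import statistics


def num_hist_bins_sketch(values):
    """Estimate a histogram bin count with the Freedman-Diaconis rule."""
    data = sorted(values)
    n = len(data)
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    # Freedman-Diaconis bin width: h = 2 * IQR / n^(1/3)
    width = 2 * (q3 - q1) / n ** (1 / 3)
    if width <= 0:
        # Assumed fallback for degenerate data (IQR of zero): one bin.
        return 1
    # Number of bins needed to cover the data range at that width.
    return math.ceil((data[-1] - data[0]) / width)


print(num_hist_bins_sketch(range(1, 101)))  # 5
```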

cytoflow.utility.util_functions.geom_mean(a)

Compute the geometric mean for an “arbitrary” data set, i.e., one that contains zeros and negative numbers.

Parameters

a (array-like) – A numpy.ndarray, or something that can be converted to an ndarray

Returns

The geometric mean of the input array.

Notes

The traditional geometric mean cannot be computed on a mixture of positive and negative numbers. The approach here, validated rigorously in the cited paper [1], is to compute the geometric mean of the absolute values of the negative numbers separately, and then take a weighted arithmetic mean of that and the geometric mean of the positive numbers. Zero values are discarded, on the assumption that in this context few or no observations have a value of exactly 0.

References

[1] Elsayed A. E. Habib, “Geometric mean for negative and zero values,” International Journal of Research and Reviews in Applied Sciences 11:419 (2012). http://www.arpapress.com/Volumes/Vol11Issue3/IJRRAS_11_3_08.pdf
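The approach in the Notes can be sketched with the standard library as follows. The names `_gmean` and `signed_geom_mean_sketch` are hypothetical. Two details are assumptions not stated in the text: the weights are each group's share of the nonzero observations, and the negative group's contribution is subtracted.

```python
import math


def _gmean(values):
    """Geometric mean of strictly positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def signed_geom_mean_sketch(values):
    """Weighted combination of the positive and negative geometric means."""
    pos = [v for v in values if v > 0]
    neg = [-v for v in values if v < 0]  # absolute values of the negatives
    n = len(pos) + len(neg)              # zeros are discarded
    if n == 0:
        raise ValueError("need at least one nonzero value")
    # Each group's geometric mean is weighted by its share of the data.
    pos_part = _gmean(pos) * len(pos) / n if pos else 0.0
    neg_part = _gmean(neg) * len(neg) / n if neg else 0.0
    # Assumption: the negative group enters with a negative sign.
    return pos_part - neg_part


# Positives [1, 4] have geometric mean 2; negatives [-2, -8] have 4.
# With equal weights of 0.5 the result is 0.5 * 2 - 0.5 * 4 = -1.
print(signed_geom_mean_sketch([1, 4, 0, -2, -8]))
```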

cytoflow.utility.util_functions.geom_sd(a)

Compute the geometric standard deviation for an “arbitrary” data set, i.e., one that contains zeros and negative numbers. Since we’re in log space, this gives a dimensionless scaling factor, not a quantity in the units of the data. If you want traditional “error bars”, don’t plot [geom_mean - geom_sd, geom_mean + geom_sd]; rather, plot [geom_mean / geom_sd, geom_mean * geom_sd].

Parameters

a (array-like) – A numpy.ndarray, or something that can be converted to an ndarray

Returns

The geometric standard deviation of the distribution.

Notes

As with geom_mean, non-positive numbers pose a problem. The approach here, though less rigorously validated than the one used for geom_mean, is to replace negative numbers with their absolute value plus 2 * geometric mean, then proceed as described on the Wikipedia page for geometric standard deviation [1].

References

[1] https://en.wikipedia.org/wiki/Geometric_standard_deviation
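To make the “multiply and divide” error bars concrete, here is a minimal sketch restricted to strictly positive data. It does not attempt the negative-value replacement described in the Notes. The name `geom_stats_sketch` is hypothetical, and using the population standard deviation of the logs (dividing by n) is an assumption.

```python
import math


def geom_stats_sketch(values):
    """Return (geometric mean, geometric SD) for strictly positive values."""
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    # Assumption: population variance of the logs (divide by n).
    var_log = sum((x - mean_log) ** 2 for x in logs) / len(logs)
    return math.exp(mean_log), math.exp(math.sqrt(var_log))


gm, gsd = geom_stats_sketch([1.0, 10.0, 100.0])
# Multiplicative error bars: divide and multiply by the scaling factor.
low, high = gm / gsd, gm * gsd
print(gm, gsd, (low, high))
```

Because `low * high` equals `gm ** 2`, the two bounds sit at equal distances from the geometric mean on a logarithmic axis, which an additive interval would not do.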

cytoflow.utility.util_functions.geom_sd_range(a)

A convenience function to compute [geom_mean / geom_sd, geom_mean * geom_sd].

Parameters

a (array-like) – A numpy.ndarray, or something that can be converted to an ndarray

Returns

A tuple, with (geom_mean / geom_sd, geom_mean * geom_sd)

cytoflow.utility.util_functions.geom_sem(a)

Compute the geometric standard error of the mean for an “arbitrary” data set, i.e., one that contains zeros and negative numbers.

Parameters

a (array-like) – A numpy.ndarray, or something that can be converted to an ndarray

Returns

The geometric standard error of the mean of the distribution.

Notes

As with geom_mean, non-positive numbers pose a problem. The approach here, though less rigorously validated than the one used for geom_mean, is to replace negative numbers with their absolute value plus 2 * geometric mean. The geometric SEM is computed as in [1].

References

[1] Nilan Norris, “The Standard Errors of the Geometric and Harmonic Means and Their Application to Index Numbers,” The Annals of Mathematical Statistics, Vol. 11, No. 4 (Dec., 1940), pp. 445-448. http://www.jstor.org/stable/2235723?seq=1#page_scan_tab_contents
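One way to obtain a multiplicative standard error, suitable for dividing and multiplying the geometric mean, is to exponentiate the standard error of the log-transformed data. The sketch below does that for strictly positive values only. It is an illustration under that assumption, not a confirmed transcription of cytoflow's code or of the formula in [1]; `geom_sem_sketch` is a hypothetical name.

```python
import math
import statistics


def geom_sem_sketch(values):
    """Multiplicative standard error of the geometric mean (positive data)."""
    logs = [math.log(v) for v in values]
    # Standard error of the mean of the logs, using the sample standard
    # deviation (n - 1 in the denominator); this choice is an assumption.
    sem_log = statistics.stdev(logs) / math.sqrt(len(logs))
    # Back-transform so the result is a dimensionless scaling factor.
    return math.exp(sem_log)


print(geom_sem_sketch([1.0, 10.0, 100.0]))
```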

cytoflow.utility.util_functions.geom_sem_range(a)

A convenience function to compute [geom_mean / geom_sem, geom_mean * geom_sem].

Parameters

a (array-like) – A numpy.ndarray, or something that can be converted to an ndarray

Returns

A tuple, with (geom_mean / geom_sem, geom_mean * geom_sem)

cytoflow.utility.util_functions.cartesian(arrays, out=None)

Generate a cartesian product of input arrays.

Parameters
  • arrays (list of array-like) – 1-D arrays to form the cartesian product of.

  • out (ndarray) – Array to place the cartesian product in.

Returns

out – 2-D array of shape (M, len(arrays)) containing the cartesian products formed from the input arrays, where M is the product of the lengths of the input arrays.

Return type

ndarray

Examples

>>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
array([[1, 4, 6],
       [1, 4, 7],
       [1, 5, 6],
       [1, 5, 7],
       [2, 4, 6],
       [2, 4, 7],
       [2, 5, 6],
       [2, 5, 7],
       [3, 4, 6],
       [3, 4, 7],
       [3, 5, 6],
       [3, 5, 7]])

References

Originally from http://stackoverflow.com/a/1235363/4755587
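For comparison, the same rows in the same order can be produced with the standard library's `itertools.product`. This sketch returns a list of lists rather than an ndarray, and `cartesian_rows` is a hypothetical helper name.

```python
import itertools


def cartesian_rows(arrays):
    """Return the Cartesian product of the input sequences as a list of rows."""
    # itertools.product varies the last input fastest, matching the
    # row order shown in the example above.
    return [list(combo) for combo in itertools.product(*arrays)]


rows = cartesian_rows([[1, 2, 3], [4, 5], [6, 7]])
print(len(rows))          # 12
print(rows[0], rows[-1])  # [1, 4, 6] [3, 5, 7]
```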

cytoflow.utility.util_functions.sanitize_identifier(name)

Makes name a valid Python identifier by replacing all non-safe characters with ‘_’.
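A common way to do this kind of replacement is a single regular-expression substitution. The sketch below is one possibility. `sanitize_identifier_sketch` is a hypothetical name, and treating everything outside `[0-9a-zA-Z_]` as non-safe, as well as prefixing a leading digit with an underscore, are assumptions.

```python
import re


def sanitize_identifier_sketch(name):
    """Replace characters that are not valid in an identifier with '_'."""
    # First alternative: any character outside [0-9a-zA-Z_].
    # Second alternative: an empty match before a leading digit, so that
    # an underscore is inserted in front of it (assumed behavior).
    return re.sub(r"[^0-9a-zA-Z_]|^(?=\d)", "_", name)


print(sanitize_identifier_sketch("FITC-A"))  # FITC_A
print(sanitize_identifier_sketch("488 nm"))  # _488_nm
```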

cytoflow.utility.util_functions.random_string(n)

Makes a random string of ASCII digits and lowercase letters of length n.

From http://stackoverflow.com/questions/2257441/random-string-generation-with-upper-case-letters-and-digits-in-python
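A minimal sketch of this idea using only the standard library follows. `random_string_sketch` is a hypothetical name, and the use of `random.choice` is an assumption about how such a string might be built.

```python
import random
import string


def random_string_sketch(n):
    """Return a random string of n lowercase ASCII letters and digits."""
    alphabet = string.ascii_lowercase + string.digits
    # Draw n characters independently and uniformly from the alphabet.
    return "".join(random.choice(alphabet) for _ in range(n))


print(random_string_sketch(8))
```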

cytoflow.utility.util_functions.is_numeric(s)

Determine if a pandas.Series or numpy.ndarray is numeric from its dtype.

cytoflow.utility.util_functions.cov2corr(covariance)

Compute the correlation matrix from the covariance matrix.

From https://github.com/AndreaCensi/procgraph/blob/master/src/procgraph_statistics/cov2corr.py
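The conversion divides each covariance by the product of the two corresponding standard deviations: corr[i][j] = cov[i][j] / sqrt(cov[i][i] * cov[j][j]). The sketch below shows this on nested lists so that it runs without NumPy. `cov2corr_sketch` is a hypothetical name, and the code assumes a square matrix with strictly positive diagonal entries.

```python
import math


def cov2corr_sketch(cov):
    """Convert a square covariance matrix (nested lists) to correlations."""
    size = len(cov)
    # Standard deviations are the square roots of the diagonal entries.
    sd = [math.sqrt(cov[i][i]) for i in range(size)]
    return [
        [cov[i][j] / (sd[i] * sd[j]) for j in range(size)]
        for i in range(size)
    ]


corr = cov2corr_sketch([[4.0, 2.0], [2.0, 9.0]])
# Standard deviations are 2 and 3, so the off-diagonal entry is 2 / 6.
print(corr)
```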