# Statistics in `cytoflow`

¶

One of the most powerful concepts in `cytoflow`

is that it makes it
easy to summarize your data, then track how those subsets change as your
experimental variables change. This notebook demonstrates several
different modules that create and plot statistics.

Set up the Jupyter `matplotlib`

integration, and import the
`cytoflow`

module.

```
# in some buggy versions of the Jupyter notebook, this needs to be in its OWN CELL.
%matplotlib inline
```

```
import cytoflow as flow
# if your figures are too big or too small, you can scale them by changing matplotlib's DPI
import matplotlib
matplotlib.rc('figure', dpi = 160)
```

We use the same data set as the **Yeast Dose Response** example
notebook, with one variant: we load each tube three times, grabbing only
100 events from each.

```
inputs = {
"Yeast_B1_B01.fcs" : 5.0,
"Yeast_B2_B02.fcs" : 3.75,
"Yeast_B3_B03.fcs" : 2.8125,
"Yeast_B4_B04.fcs" : 2.109,
"Yeast_B5_B05.fcs" : 1.5820,
"Yeast_B6_B06.fcs" : 1.1865,
"Yeast_B7_B07.fcs" : 0.8899,
"Yeast_B8_B08.fcs" : 0.6674,
"Yeast_B9_B09.fcs" : 0.5,
"Yeast_B10_B10.fcs" : 0.3754,
"Yeast_B11_B11.fcs" : 0.2816,
"Yeast_B12_B12.fcs" : 0.2112,
"Yeast_C1_C01.fcs" : 0.1584,
"Yeast_C2_C02.fcs" : 0.1188,
"Yeast_C3_C03.fcs" : 0.0892,
"Yeast_C4_C04.fcs" : 0.0668,
"Yeast_C5_C05.fcs" : 0.05,
"Yeast_C6_C06.fcs" : 0.0376,
"Yeast_C7_C07.fcs" : 0.0282,
"Yeast_C8_C08.fcs" : 0.0211,
"Yeast_C9_C09.fcs" : 0.0159
}
tubes = []
for filename, ip in inputs.items():
tubes.append(flow.Tube(file = "data/" + filename, conditions = {'IP' : ip, 'Replicate' : 1}))
tubes.append(flow.Tube(file = "data/" + filename, conditions = {'IP' : ip, 'Replicate' : 2}))
tubes.append(flow.Tube(file = "data/" + filename, conditions = {'IP' : ip, 'Replicate' : 3}))
ex = flow.ImportOp(conditions = {'IP' : "float", "Replicate" : "int"},
tubes = tubes,
events = 100).apply()
```

In `cytoflow`

, a *statistic* is a value that summarizes something
about some data. `cytoflow`

makes it easy to compute statistics for
various subsets. For example, if we expect the geometric mean of
`FITC-A`

channel to change as the `IP`

variable changes, we can
compute the geometric mean of the `FITC-A`

channel of each different
`IP`

condition with the `ChannelStatisticOp`

operation:

```
op = flow.ChannelStatisticOp(name = "ByIP",
by = ["IP"],
channel = "FITC-A",
function = flow.geom_mean)
ex2 = op.apply(ex)
```

This operation splits the data set by different values of `IP`

, then
applies the function `flow.geom_mean`

to the `FITC-A`

channel in
each subset. The result is stored in the `statistics`

attribute of an
`Experiment`

. The `statistics`

attribute is a dictionary. The keys
are tuples, where the first element in the tuple is the name of the
operation that created the statistic, and the second element is specific
to the operation:

```
ex2.statistics.keys()
```

```
dict_keys([('ByIP', 'geom_mean')])
```

The `Statistics1DOp`

operation sets the second element of the tuple to
the function name. (You can override this by setting
`Statistics1DOp.statistic_name`

; this is useful if `function`

is a
lambda function.)

The value of each entry in `Experiment.statistics`

is a
`pandas.Series`

. The series index is all the subsets for which the
statistic was computed, and the contents are the values of the statstic
itself.

```
ex2.statistics[('ByIP', 'geom_mean')]
```

```
IP
0.0159 106.760464
0.0211 145.242912
0.0282 152.152489
0.0376 194.243504
0.0500 240.788179
0.0668 413.218589
0.0892 670.798705
0.1188 971.121453
0.1584 1345.581190
0.2112 1203.016726
0.2816 1463.993731
0.3754 1733.702533
0.5000 1871.968527
0.6674 2218.190937
0.8899 2324.630935
1.1865 2486.103908
1.5820 2456.024025
2.1090 2326.584475
2.8125 2350.038991
3.7500 2676.275419
5.0000 2269.741514
Name: ByIP : geom_mean, dtype: float64
```

We can also specify multiple variables to break data set into. In the
example above, `Statistics1DOp`

lumps all events with the same value
of `IP`

together, but each amount of `IP`

actually has three values
of `Replicate`

as well. Let’s apply `geom_mean`

to each unique
combination of `IP`

and `Replicate`

:

```
op = flow.ChannelStatisticOp(name = "ByIP",
by = ["IP", "Replicate"],
channel = "FITC-A",
function = flow.geom_mean)
ex2 = op.apply(ex)
ex2.statistics[("ByIP", "geom_mean")][0:12]
```

```
IP Replicate
0.0159 1 109.294581
2 106.956580
3 104.099598
0.0211 1 144.564288
2 147.712897
3 143.484638
0.0282 1 155.558697
2 156.508248
3 144.679053
0.0376 1 182.294878
2 206.147145
3 195.023849
Name: ByIP : geom_mean, dtype: float64
```

Note that the `pandas.Series`

now has a `MultiIndex`

: there are
values for each unique combination of `IP`

and `Replicate`

.

Now that we have computed a statistic, we can plot it with one of the statistics views. We can use a bar chart:

```
flow.BarChartView(statistic = ("ByIP", "geom_mean"),
variable = "IP").plot(ex2)
```

```
---------------------------------------------------------------------------
CytoflowViewError Traceback (most recent call last)
<ipython-input-8-4102c59fbf66> in <module>
1 flow.BarChartView(statistic = ("ByIP", "geom_mean"),
----> 2 variable = "IP").plot(ex2)
~/src/cytoflow/cytoflow/views/bar_chart.py in plot(self, experiment, plot_name, **kwargs)
129 """
130
--> 131 super().plot(experiment, plot_name, **kwargs)
132
133 def _grid_plot(self, experiment, grid, **kwargs):
~/src/cytoflow/cytoflow/views/base_views.py in plot(self, experiment, plot_name, **kwargs)
859 plot_name = plot_name,
860 scale = scale,
--> 861 **kwargs)
862
863 def _make_data(self, experiment):
~/src/cytoflow/cytoflow/views/base_views.py in plot(self, experiment, data, plot_name, **kwargs)
712 "plot variables or the plot name. "
713 "Possible plot names: {}"
--> 714 .format(unused_names, list(groupby.groups.keys())))
715
716 if plot_name not in set(groupby.groups.keys()):
CytoflowViewError: ('plot_name', "You must use facets ['Replicate'] in either the plot variables or the plot name. Possible plot names: [1, 2, 3]")
```

Oops! We forgot to specify how to plot the different `Replicate`

s.
Each index of a statistic must be specified as either a variable or a
facet of the plot, like so:

```
flow.BarChartView(statistic = ("ByIP", "geom_mean"),
variable = "IP",
huefacet = "Replicate").plot(ex2)
```

A quick aside - sometimes you get ugly axes because of overlapping
labels. In this case, we want a wider plot. While we can directly
specify the height of a plot, we can’t directly specify its width, only
its aspect ratio (the ratio of the width to the height.) `cytoflow`

defaults to 1.5; in this case, let’s widen it out to 4. If this results
in a plot that’s wider than your browser, the Jupyter notebook will
scale it down for you.

```
flow.BarChartView(statistic = ("ByIP", "geom_mean"),
variable = "IP",
huefacet = "Replicate").plot(ex2, aspect = 4)
```

Bar charts are really best for categorical variables (with a modest number of categories.) Let’s do a line chart instead:

```
flow.Stats1DView(statistic = ("ByIP", "geom_mean"),
variable = "IP",
variable_scale = 'log',
huefacet = "Replicate").plot(ex2)
```

Statistics views can also plot error bars; the error bars must *also* be
a statistic, and they must have the same indices as the “main”
statistic. For example, let’s plot the geometric mean and geometric
standard deviation of each IP subset:

```
ex2 = flow.ChannelStatisticOp(name = "ByIP",
by = ["IP"],
channel = "FITC-A",
function = flow.geom_mean).apply(ex)
# While an arithmetic SD is usually plotted plus-or-minus the arithmetic mean,
# a *geometric* SD is usually plotted (on a log scale!) multiplied-or-divided by the
# geometric mean. the function geom_sd_range is a convenience function that does this.
ex3 = flow.ChannelStatisticOp(name = "ByIP",
by = ['IP'],
channel = "FITC-A",
function = flow.geom_sd_range).apply(ex2)
flow.Stats1DView(statistic = ("ByIP", "geom_mean"),
variable = "IP",
variable_scale = "log",
scale = "log",
error_statistic = ("ByIP", "geom_sd_range")).plot(ex3)
```

The plot above shows how one statistic varies (on the Y axis) as a variable changes on the X axis. We can also plot two statistics against eachother. For example, we can ask if the geometric standard deviation varies as the geometric mean changes:

```
ex2 = flow.ChannelStatisticOp(name = "ByIP",
by = ["IP"],
channel = "FITC-A",
function = flow.geom_mean).apply(ex)
ex3 = flow.ChannelStatisticOp(name = "ByIP",
by = ['IP'],
channel = "FITC-A",
function = flow.geom_sd).apply(ex2)
flow.Stats2DView(variable = "IP",
xstatistic = ("ByIP", "geom_mean"),
ystatistic = ("ByIP", "geom_sd"),
xscale = "log").plot(ex3)
```

Nope, guess not. See the **TASBE Calibrated Flow Cytometry** notebook
for more examples of 1D and 2D statistics views.

## Transforming statistics¶

In addition to making statistics by applying summary functions to data,
you can also apply functions to other statistics. For example, a common
question is “What percentage of my events are in a particular gate?” We
could, for instance, ask what percentage of events are above 1000 in the
`FITC-A`

channel, and how that varies by amount of `IP`

. We start by
defining a *threshold* gate with `ThresholdOp`

:

```
thresh = flow.ThresholdOp(name = "Above1000",
channel = "FITC-A",
threshold = 1000)
ex2 = thresh.apply(ex)
```

Now, the `Experiment`

has a condition named `Above1000`

that is
`True`

or `False`

depending on whether that event’s `FITC-A`

channel is greater than 1000. Next, we compute the total number of
events in each subset with a unique combination of `Above1000`

and
`IP`

:

```
ex3 = flow.ChannelStatisticOp(name = "Above1000",
by = ["Above1000", "IP"],
channel = "FITC-A",
function = len).apply(ex2)
ex3.statistics[("Above1000", "len")]
```

```
Above1000 IP
False 0.0159 299
0.0211 299
0.0282 299
0.0376 296
0.0500 299
0.0668 262
0.0892 200
0.1188 143
0.1584 76
0.2112 101
0.2816 76
0.3754 57
0.5000 53
0.6674 32
0.8899 19
1.1865 23
1.5820 21
2.1090 34
2.8125 25
3.7500 10
5.0000 29
True 0.0159 1
0.0211 1
0.0282 1
0.0376 4
0.0500 1
0.0668 38
0.0892 100
0.1188 157
0.1584 224
0.2112 199
0.2816 224
0.3754 243
0.5000 247
0.6674 268
0.8899 281
1.1865 277
1.5820 279
2.1090 266
2.8125 275
3.7500 290
5.0000 271
Name: Above1000 : len, dtype: int64
```

And now we compute the proportion of `Above1000 == True`

for each
value of `IP`

. `TransformStatisticOp`

applies a function to subsets
of a statistic; in this case, we’re applying a `lambda`

function to
convert each `IP`

subset from length into proportion.

```
import pandas as pd
ex4 = flow.TransformStatisticOp(name = "Above1000",
statistic = ("Above1000", "len"),
by = ["IP"],
function = lambda a: pd.Series(a / a.sum()),
statistic_name = "proportion").apply(ex3)
ex4.statistics[("Above1000", "proportion")][0:8]
```

```
Above1000 IP
False 0.0159 0.996667
0.0211 0.996667
0.0282 0.996667
0.0376 0.986667
0.0500 0.996667
0.0668 0.873333
0.0892 0.666667
0.1188 0.476667
Name: Above1000 : len, dtype: float64
```

Now we can plot the new statistic.

```
flow.Stats1DView(statistic = ("Above1000", "proportion"),
variable = "IP",
variable_scale = "log",
huefacet = "Above1000").plot(ex4)
```

Note that be default we get all the values in the statistic; in this
case, it’s both the proportion that’s above 1000 (`True`

) and the
proportion that is below it (`False`

). We can pass a `subset`

attribute to the `Stats1DView`

to look at just one or the other.

```
flow.Stats1DView(statistic = ("Above1000", "proportion"),
variable = "IP",
variable_scale = "log",
subset = "Above1000 == True").plot(ex4)
```

```
/home/brian/src/cytoflow/cytoflow/views/base_views.py:751: CytoflowViewWarning: Only one value for level Above1000; dropping it.
```

Note the warning. Because the index level `Above1000`

is always
`True`

in this subset, it gets dropped from the index. Note also that
we don’t have to specify it as a variable or facet anywhere.

## Statistics from data-driven modules¶

One of the most exciting aspects of statistics in `cytoflow`

is that
other data-driven modules can add them to an `Experiment`

, too. For
example, the `GaussianMixtureOp`

adds several statistics for each
component of the mixture model it fits, containing the mean, standard
deviation and proportion of observations in each component:

```
inputs = {
"Yeast_B1_B01.fcs" : 5.0,
"Yeast_B2_B02.fcs" : 3.75,
"Yeast_B3_B03.fcs" : 2.8125,
"Yeast_B4_B04.fcs" : 2.109,
"Yeast_B5_B05.fcs" : 1.5820,
"Yeast_B6_B06.fcs" : 1.1865,
"Yeast_B7_B07.fcs" : 0.8899,
"Yeast_B8_B08.fcs" : 0.6674,
"Yeast_B9_B09.fcs" : 0.5,
"Yeast_B10_B10.fcs" : 0.3754,
"Yeast_B11_B11.fcs" : 0.2816,
"Yeast_B12_B12.fcs" : 0.2112,
"Yeast_C1_C01.fcs" : 0.1584,
"Yeast_C2_C02.fcs" : 0.1188,
"Yeast_C3_C03.fcs" : 0.0892,
"Yeast_C4_C04.fcs" : 0.0668,
"Yeast_C5_C05.fcs" : 0.05,
"Yeast_C6_C06.fcs" : 0.0376,
"Yeast_C7_C07.fcs" : 0.0282,
"Yeast_C8_C08.fcs" : 0.0211,
"Yeast_C9_C09.fcs" : 0.0159
}
tubes = []
for filename, ip in inputs.items():
tubes.append(flow.Tube(file = "data/" + filename, conditions = {'IP' : ip}))
ex = flow.ImportOp(conditions = {'IP' : "float"},
tubes = tubes).apply()
```

```
op = flow.GaussianMixtureOp(name = "Gauss",
channels = ["FITC-A"],
scale = {"FITC-A" : 'logicle'},
by = ["IP"],
num_components = 1)
op.estimate(ex)
ex2 = op.apply(ex)
op.default_view(xfacet = "IP").plot(ex2, col_wrap = 3)
```

```
ex2.statistics.keys()
```

```
dict_keys([('Gauss', 'mean'), ('Gauss', 'sigma'), ('Gauss', 'interval')])
```

```
ex2.statistics[("Gauss", 'mean')]
```

```
IP Component Channel
0.0159 1 FITC-A 114.867593
0.0211 1 FITC-A 146.760865
0.0282 1 FITC-A 161.831942
0.0376 1 FITC-A 188.241811
0.0500 1 FITC-A 252.533895
0.0668 1 FITC-A 415.359435
0.0892 1 FITC-A 675.335896
0.1188 1 FITC-A 976.574632
0.1584 1 FITC-A 1212.127491
0.2112 1 FITC-A 1169.357113
0.2816 1 FITC-A 1491.938821
0.3754 1 FITC-A 1683.611714
0.5000 1 FITC-A 1812.128528
0.6674 1 FITC-A 2233.860489
0.8899 1 FITC-A 2341.076532
1.1865 1 FITC-A 2474.300525
1.5820 1 FITC-A 2401.710245
2.1090 1 FITC-A 2383.285932
2.8125 1 FITC-A 2411.122019
3.7500 1 FITC-A 2584.007182
5.0000 1 FITC-A 2265.732416
Name: Gauss : mean, dtype: float64
```

```
ex2.statistics[("Gauss", 'interval')]
```

```
IP Component Channel
0.0159 1 FITC-A (56.94459866110697, 224.1783518669721)
0.0211 1 FITC-A (80.36534657785384, 265.7842407147643)
0.0282 1 FITC-A (88.27508462690886, 295.4218184072569)
0.0376 1 FITC-A (101.18497543667104, 350.4182139054138)
0.0500 1 FITC-A (127.95204222704295, 502.06564927968213)
0.0668 1 FITC-A (195.2351392293184, 894.5712089072493)
0.0892 1 FITC-A (310.61832832262314, 1484.6617729462246)
0.1188 1 FITC-A (437.6150673574889, 2200.017570462604)
0.1584 1 FITC-A (542.1024597233911, 2732.4410991530126)
0.2112 1 FITC-A (459.2685949581729, 3011.3302773605374)
0.2816 1 FITC-A (601.1140138003304, 3736.1069195737864)
0.3754 1 FITC-A (784.0312198090705, 3635.8937650247603)
0.5000 1 FITC-A (802.3248292365361, 4117.843046843494)
0.6674 1 FITC-A (1134.2954460750368, 4414.715040960627)
0.8899 1 FITC-A (1246.7333023687233, 4408.716470275269)
1.1865 1 FITC-A (1256.858649306397, 4886.57964348989)
1.5820 1 FITC-A (1151.7297619649949, 5027.751372108821)
2.1090 1 FITC-A (1161.983752683322, 4906.469724482335)
2.8125 1 FITC-A (1276.8361886540924, 4566.1348774856815)
3.7500 1 FITC-A (1277.3600413588333, 5244.721547284582)
5.0000 1 FITC-A (1046.8993940886585, 4925.741101423763)
Name: Gauss : interval, dtype: object
```

```
flow.Stats1DView(statistic = ("Gauss", "mean"),
variable = "IP",
variable_scale = "log",
scale = "log",
error_statistic = ("Gauss", "interval")).plot(ex2, shade_error = True)
```

```
/home/brian/src/cytoflow/cytoflow/views/base_views.py:751: CytoflowViewWarning: Only one value for level Component; dropping it.
/home/brian/src/cytoflow/cytoflow/views/base_views.py:751: CytoflowViewWarning: Only one value for level Channel; dropping it.
```