Yeast induction timecourse#

This notebook demonstrates a real-world multidimensional analysis example. The yeast strain responds to the small molecule isopentyladenine (IP) by expressing green fluorescent protein (GFP), which we measure using a flow cytometer in the FITC-A channel.

This experiment was designed to determine the dynamics of the IP –> GFP response. So, we induced several yeast cultures with different amounts of IP, then took readings on the cytometer every 30 minutes for 8 hours. The outline of the experimental setup is below.

Induction experiment

Induction experiment#


Import cytoflow.

import cytoflow as flow

# if your figures are too big or too small, you can scale them by changing matplotlib's DPI
import matplotlib
matplotlib.rc('figure', dpi = 160)

In this instance, the amount of IP and the time is actually encoded in the FCS filename. So, use glob to iterate through all the FCS files in the directory and re extract the IP concentration and timepoint from the filename.

NB. Many FCS files already have a channel named “Time” which encodes how long since the acquisition start an event was collected. So it is inadvisable to use “Time” as a condition name.

# In this instance, I have encoded the experimental conditions in the filenames.
# So, use glob to get the files and parse the conditions back out.

import glob, re
tubes = []

for f in glob.glob("*.fcs"):
    r = re.search("IP_(.*?)_Minutes_(.*?)\.fcs", f)
    ip = r.group(1)
    minutes = r.group(2)

    tube = flow.Tube(file = f, conditions = {"IP" : float(ip), "Minutes" : int(minutes)})
    tubes.append(tube)

ex = flow.ImportOp(tubes = tubes,
                   conditions = {"IP" : "float",
                                 "Minutes" : "int"},
                   events = 1000).apply()

Take a quick look at the morphological parameters.

flow.DensityView(xchannel = "FSC-A",
                 xscale = "log",
                 ychannel = "SSC-A",
                 yscale = "log").plot(ex)
../../_images/induction_timecourse_5_0.png

There does seem to be a little bit of structure here, but in general the distribution is quite tight. So, we’ll use a 2D gaussian mixture model to get single cells, keeping the events that are within two standard deviations of the distribution mean.

gm = flow.GaussianMixtureOp(name = "GM",
                            channels = ['FSC-A', 'SSC-A'],
                            scale = {'FSC-A' : 'log',
                                     'SSC-A' : 'log'},
                            num_components = 1,
                            sigma = 2)
gm.estimate(ex)
ex_gm = gm.apply(ex)

A diagnostic plot of the GMM.

gm.default_view().plot(ex_gm, alpha = 0.02)
../../_images/induction_timecourse_9_0.png

Yep, that looks fine. Now compute the geometric mean in the FITC-A channel to see how GFP expression varies with IP concentration and time since induction.

ex_stat = flow.ChannelStatisticOp(name = 'GFP',
                                  channel = 'FITC-A',
                                  function = flow.geom_mean,
                                  by = ['IP', 'Minutes'],
                                  subset = 'GM == True').apply(ex_gm)

And plot it. Could you use pandas and seaborn to do this instead? Absolutely.

flow.Stats1DView(statistic = 'GFP',
                 feature = 'FITC-A',
                 variable = 'Minutes',
                 scale = 'log',
                 huefacet = 'IP',
                 huescale = 'log').plot(ex_stat,
                                        ylabel = 'Geometric mean of GFP (au)')
../../_images/induction_timecourse_13_0.png

A geometric mean is only an appropriate summary statistic if the unimodal in log space. Is this actually true? Let’s look at the histogram of each [IP]/time combination to find out.

flow.HistogramView(channel = 'FITC-A',
                   xfacet = 'Minutes',
                   yfacet = 'IP',
                   scale = 'logicle',
                   subset = 'GM == True').plot(ex_gm)
../../_images/induction_timecourse_15_0.png

Wow! That’s a lot of plots, and reading the axes is impossible. I maybe could use matplotlib.pyplot to change the plot parameters and get something useful, but instead let’s do the following: - Use a Kde1DView to visualize with a kernel density estimate instead of a histogram. - Put [IP] on the hue facet instead of the Y facet. - Wrap the X facet into three columns and actually see each plot.

flow.Kde1DView(channel = 'FITC-A',
               xfacet = 'Minutes',
               huefacet = 'IP',
               scale = 'logicle',
               huescale = 'log',
               subset = 'GM == True').plot(ex_gm, col_wrap = 3, shade = False)
../../_images/induction_timecourse_17_0.png

This is very, very interesting! There appears to be significant structure to this data. It’s almost as if there are two populations, one that is “off” and one that is “on” – and that higher [IP] influences the rate at which cells switch from the off population to the on population.

We can model this mixture of gaussians using a class we’ve already seen, GaussianMixtureOp. We’ll estimate two components, and we won’t specify sigma. We’ll also say by = ['IP', 'Minutes'] to fit a different model to each unique combination of [IP] and time.

gm_fitc = flow.GaussianMixtureOp(name = "GM_FITC",
                                 channels = ['FITC-A'],
                                 scale = {'FITC-A' : 'log'},
                                 by = ['IP', 'Minutes'],
                                 num_components = 2)
gm_fitc.estimate(ex_gm, subset = 'GM == True')
ex_stat_2 = gm_fitc.apply(ex_gm)

Most data-driven operations add summary statistics to the experiment as well – always named the same as the name of the operation. Let’s have a look at the one added by GaussianMixtureOp:

ex_stat_2.statistics['GM_FITC']
FITC-A Mean FITC-A SD FITC-A Interval Low FITC-A Interval High Proportion
IP Minutes Component
0.0 0 1 76.054955 2.907543 26.157807 221.133076 0.750313
2 902.523443 2.015168 447.865025 1818.736716 0.249687
30 1 64.042366 2.483437 25.787797 159.045174 0.630411
2 526.769061 2.508374 210.004171 1321.333965 0.369589
60 1 63.807865 2.430263 26.255541 155.069883 0.641988
... ... ... ... ... ... ... ...
10.0 300 2 1280.406739 1.813615 705.996856 2322.16532 0.868510
360 1 70.205581 3.793706 18.505805 266.339329 0.094780
2 1278.103592 1.828959 698.814867 2337.598794 0.905220
480 1 102.23451 4.591013 22.268401 469.359918 0.077692
2 1614.734745 1.785711 904.253099 2883.449668 0.922308

264 rows × 5 columns

Great, we’ve got the proportions of component 1 and 2, at each IP concentration and timepoint. Plot the proportion in component 2 with “Minutes” on the X axis and “IP” on the hue facet.

flow.Stats1DView(statistic = "GM_FITC",
                 feature = "Proportion",
                 variable = "Minutes",
                 huefacet = "IP",
                 huescale = "log",
                 subset = "Component == 2").plot(ex_stat_2)
/home/brian/src/cytoflow/cytoflow/views/base_views.py:848: CytoflowViewWarning: Only one value for level Component; dropping it.
../../_images/induction_timecourse_24_1.png

Ignore the “jagged” nature of the plot. The original data set is hundreds of megabytes big, and in that data set the curves are much smoother (-;

The important question is: is this any different than the geometric mean? Let’s re-plot the geometric mean plot so we can look at both.

flow.Stats1DView(statistic = 'GFP',
                 feature = 'FITC-A',
                 variable = 'Minutes',
                 scale = 'log',
                 huefacet = 'IP',
                 huescale = 'log').plot(ex_stat,
                                        ylabel = 'Geometric mean of GFP (au)')
../../_images/induction_timecourse_26_0.png

I think those dynamics look significantly different. For one thing, the mixture model “saturates” much more quickly – both in time and in [IP]. The geometric mean model indicates saturation at about 5 uM, while the mixture model seems to saturate one or two steps earlier. Things also stop changing quite as dramatically by about 240 minutes, whereas the geometric mean hasn’t reached anything like a steady state by 480 minutes (the end of the experiment.)


I hope this has demonstrated a non-trivial insight into the dynamics of this biological system that are gained by looking at it through a quantitative lens, with some machine learning thrown in there as well.