Yeast induction timecourse

This notebook demonstrates a real-world multidimensional analysis example. The yeast strain responds to the small molecule isopentyladenine (IP) by expressing green fluorescent protein (GFP), which we measure using a flow cytometer in the FITC-A channel.

This experiment was designed to determine the dynamics of the IP → GFP response. So, we induced several yeast cultures with different amounts of IP, then took readings on the cytometer every 30 minutes for 8 hours. The outline of the experimental setup is below.

Induction experiment


Set up the notebook’s matplotlib integration, and import cytoflow.

# in some buggy versions of the Jupyter notebook, this needs to be in its OWN CELL.
%matplotlib inline
import cytoflow as flow

# if your figures are too big or too small, you can scale them by changing matplotlib's DPI
import matplotlib
matplotlib.rc('figure', dpi = 160)

In this instance, the amount of IP and the time are actually encoded in the FCS filenames. So, use glob to iterate through all the FCS files in the directory and re to extract the IP concentration and timepoint from each filename.

NB. Many FCS files already have a channel named “Time”, which encodes how long after the start of acquisition each event was collected. So it is inadvisable to use “Time” as a condition name.

# In this instance, I have encoded the experimental conditions in the filenames.
# So, use glob to get the files and parse the conditions back out.

import glob, re
tubes = []

for f in glob.glob("*.fcs"):
    r = re.search(r"IP_(.*?)_Minutes_(.*?)\.fcs", f)
    ip = r.group(1)
    minutes = r.group(2)

    tube = flow.Tube(file = f, conditions = {"IP" : float(ip), "Minutes" : int(minutes)})
    tubes.append(tube)

ex = flow.ImportOp(tubes = tubes,
                   conditions = {"IP" : "float",
                                 "Minutes" : "int"},
                   events = 1000).apply()
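It’s worth a quick sanity check that the conditions were parsed out of the filenames correctly. A minimal sketch, assuming the experiment exposes its events as a pandas DataFrame via ex.data, with one column per condition:

# each (IP, Minutes) pair parsed from a filename should appear exactly once
ex.data[['IP', 'Minutes']].drop_duplicates().sort_values(['IP', 'Minutes'])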

Take a quick look at the morphological parameters.

flow.DensityView(xchannel = "FSC-A",
                 xscale = "log",
                 ychannel = "SSC-A",
                 yscale = "log").plot(ex)
../../_images/induction_timecourse_6_0.png

There does seem to be a little bit of structure here, but in general the distribution is quite tight. So, we’ll use a 2D Gaussian mixture model to get single cells, keeping the events that are within two standard deviations of the distribution mean.

gm = flow.GaussianMixtureOp(name = "GM",
                            channels = ['FSC-A', 'SSC-A'],
                            scale = {'FSC-A' : 'log',
                                     'SSC-A' : 'log'},
                            num_components = 1,
                            sigma = 2)
gm.estimate(ex)
ex_gm = gm.apply(ex)
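Before relying on the gate, it’s worth checking what fraction of events ended up inside it. A quick sketch, assuming the operation added GM_1 as a boolean column to the experiment’s underlying DataFrame (it’s the same GM_1 condition we subset on below):

# fraction of events within two standard deviations of the mixture mean
ex_gm.data['GM_1'].value_counts(normalize = True)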

A diagnostic plot of the GMM.

gm.default_view().plot(ex_gm, alpha = 0.02)
/home/brian/src/cytoflow/cytoflow/operations/base_op_views.py:341: CytoflowViewWarning: Setting 'huefacet' to 'GM_1'
../../_images/induction_timecourse_10_1.png

Yep, that looks fine. Now compute the geometric mean in the FITC-A channel to see how GFP expression varies with IP concentration and time since induction.

ex_stat = flow.ChannelStatisticOp(name = 'GFP',
                                  channel = 'FITC-A',
                                  function = flow.geom_mean,
                                  by = ['IP', 'Minutes'],
                                  subset = 'GM_1 == True').apply(ex_gm)
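Before plotting, you can peek at the statistic itself. It’s stored in the experiment’s statistics dictionary (more on that below), keyed by the operation name and the function name; here I’m assuming it comes back as a pandas Series indexed by IP and Minutes:

ex_stat.statistics[('GFP', 'geom_mean')].head()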

And plot it. Could you use pandas and seaborn to do this instead? Absolutely.

flow.Stats1DView(statistic = ('GFP', 'geom_mean'),
                 variable = 'Minutes',
                 scale = 'log',
                 huefacet = 'IP',
                 huescale = 'log').plot(ex_stat,
                                        ylabel = 'Geometric mean of GFP (au)')
../../_images/induction_timecourse_14_0.png
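As promised, the same plot can be made with pandas and seaborn directly. A rough sketch, assuming the statistic is a pandas Series with IP and Minutes in its index (the 'GFP' column name and the axis labels are just for illustration):

import seaborn as sns

# flatten the statistic into a DataFrame that seaborn understands
gfp = ex_stat.statistics[('GFP', 'geom_mean')].reset_index(name = 'GFP')

ax = sns.lineplot(data = gfp, x = 'Minutes', y = 'GFP', hue = 'IP')
ax.set(yscale = 'log', ylabel = 'Geometric mean of GFP (au)')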

A geometric mean is only an appropriate summary statistic if the distribution is unimodal in log space. Is this actually true? Let’s look at the histogram of each [IP]/time combination to find out.

flow.HistogramView(channel = 'FITC-A',
                   xfacet = 'Minutes',
                   yfacet = 'IP',
                   scale = 'logicle',
                   subset = 'GM_1 == True').plot(ex_gm)
../../_images/induction_timecourse_16_0.png

Wow! That’s a lot of plots, and reading the axes is impossible. I could maybe use matplotlib.pyplot to change the plot parameters and get something useful, but instead let’s put [IP] on the hue facet instead of the Y facet. Then, we can wrap the X facet into three columns and actually see each plot.

flow.Kde1DView(channel = 'FITC-A',
               xfacet = 'Minutes',
               huefacet = 'IP',
               scale = 'logicle',
               huescale = 'log',
               subset = 'GM_1 == True').plot(ex_gm, col_wrap = 3, shade = False)
../../_images/induction_timecourse_18_0.png

This is very, very interesting! There appears to be significant structure to this data. It’s almost as if there are two populations, one that is “off” and one that is “on” – and that higher [IP] influences the rate at which cells switch from the off population to the on population.

We can model this mixture of Gaussians using a class we’ve already seen, GaussianMixtureOp. We’ll estimate two components, and we won’t specify sigma. We’ll also say by = ['IP', 'Minutes'] to fit a different model to each unique combination of [IP] and time.

gm_fitc = flow.GaussianMixtureOp(name = "GM_FITC",
                                 channels = ['FITC-A'],
                                 scale = {'FITC-A' : 'log'},
                                 by = ['IP', 'Minutes'],
                                 num_components = 2)
gm_fitc.estimate(ex_gm, subset = 'GM_1 == True')
ex_stat_2 = gm_fitc.apply(ex_gm)

Most data-driven operations add summary statistics to the experiment as well. Let’s have a look at which statistics are defined for this experiment.

ex_stat_2.statistics.keys()
dict_keys([('GM', 'mean'), ('GM', 'sigma'), ('GM', 'interval'), ('GM', 'correlation'), ('GM_FITC', 'mean'), ('GM_FITC', 'sigma'), ('GM_FITC', 'interval'), ('GM_FITC', 'proportion')])

The statistic ('GM_FITC', 'proportion') looks promising.

ex_stat_2.statistics[('GM_FITC', 'proportion')]
IP       Minutes  Component
0.0000   0        1            0.689475
                  2            0.310525
         30       1            0.689704
                  2            0.310296
         60       1            0.667829
                  2            0.332171
         90       1            0.677261
                  2            0.322739
         120      1            0.651504
                  2            0.348496
         150      1            0.573523
                  2            0.426477
         180      1            0.585170
                  2            0.414830
         240      1            0.576592
                  2            0.423408
         300      1            0.589542
                  2            0.410458
         360      1            0.531817
                  2            0.468183
         480      1            0.527385
                  2            0.472615
0.0312   0        1            0.763097
                  2            0.236903
         30       1            0.648558
                  2            0.351442
         60       1            0.625895
                  2            0.374105
         90       1            0.585238
                  2            0.414762
                                 ...
2.0000   240      1            0.189016
                  2            0.810984
         300      1            0.155909
                  2            0.844091
         360      1            0.152130
                  2            0.847870
         480      1            0.127653
                  2            0.872347
10.0000  0        1            0.699984
                  2            0.300016
         30       1            0.471701
                  2            0.528299
         60       1            0.385929
                  2            0.614071
         90       1            0.294833
                  2            0.705167
         120      1            0.305929
                  2            0.694071
         150      1            0.248707
                  2            0.751293
         180      1            0.259084
                  2            0.740916
         240      1            0.140860
                  2            0.859140
         300      1            0.166703
                  2            0.833297
         360      1            0.112013
                  2            0.887987
         480      1            0.080634
                  2            0.919366
Name: GM_FITC : proportion, Length: 264, dtype: float64

Great, we’ve got the proportions of component 1 and 2, at each IP concentration and timepoint. Plot the proportion in component 2 with “Minutes” on the X axis and “IP” on the hue facet.

flow.Stats1DView(statistic = ("GM_FITC", "proportion"),
                 variable = "Minutes",
                 huefacet = "IP",
                 huescale = "log",
                 subset = "Component == 2").plot(ex_stat_2)
/home/brian/src/cytoflow/cytoflow/views/base_views.py:751: CytoflowViewWarning: Only one value for level Component; dropping it.
../../_images/induction_timecourse_27_1.png

Ignore the “jagged” nature of the plot. The original data set is hundreds of megabytes big, and in that data set the curves are much smoother (-;

The important question is: is this any different than the geometric mean? Let’s re-plot the geometric mean plot so we can look at both.

flow.Stats1DView(statistic = ('GFP', 'geom_mean'),
                 variable = 'Minutes',
                 scale = 'log',
                 huefacet = 'IP',
                 huescale = 'log').plot(ex_stat,
                                        ylabel = 'Geometric mean of GFP (au)')
../../_images/induction_timecourse_29_0.png

I think those dynamics look significantly different. For one thing, the mixture model “saturates” much more quickly – both in time and in [IP]. The geometric mean model indicates saturation at about 5 uM, while the mixture model seems to saturate one or two steps earlier. Things also stop changing quite as dramatically by about 240 minutes, whereas the geometric mean hasn’t reached anything like a steady state by 480 minutes (the end of the experiment).
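One rough way to put a number on that difference is to look at the fraction of cells in the “on” component at the final timepoint, across [IP]. A sketch, assuming the proportion statistic is a pandas Series indexed by IP, Minutes and Component (with integer component labels), as printed above:

prop = ex_stat_2.statistics[('GM_FITC', 'proportion')]

# fraction of cells in the "on" component at 480 minutes, for each [IP]
prop.xs(2, level = 'Component').xs(480, level = 'Minutes')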


I hope this has demonstrated the non-trivial insights into the dynamics of this biological system that can be gained by looking at it through a quantitative lens, with some machine learning thrown in as well.