Yeast induction timecourse

This notebook demonstrates a real-world multidimensional analysis example. The yeast strain responds to the small molecule isopentyladenine (IP) by expressing green fluorescent protein (GFP), which we measure using a flow cytometer in the FITC-A channel.

This experiment was designed to determine the dynamics of the IP → GFP response. So, we induced several yeast cultures with different amounts of IP, then took readings on the cytometer every 30 minutes for 8 hours. The outline of the experimental setup is below.

Induction experiment


Set up the notebook’s matplotlib integration, and import cytoflow.

# in some buggy versions of the Jupyter notebook, this needs to be in its OWN CELL.
%matplotlib inline
import cytoflow as flow

# if your figures are too big or too small, you can scale them by changing matplotlib's DPI
import matplotlib
matplotlib.rc('figure', dpi = 160)

In this instance, the amount of IP and the time are actually encoded in the FCS filenames. So, use glob to iterate through all the FCS files in the directory and re to extract the IP concentration and timepoint from each filename.

NB. Many FCS files already have a channel named “Time”, which encodes how long after the start of acquisition each event was collected. So it is inadvisable to use “Time” as a condition name.

# In this instance, I have encoded the experimental conditions in the filenames.
# So, use glob to get the files and parse the conditions back out.

import glob, re
tubes = []

for f in glob.glob("*.fcs"):
    r = re.search(r"IP_(.*?)_Minutes_(.*?)\.fcs", f)
    ip = r.group(1)
    minutes = r.group(2)

    tube = flow.Tube(file = f, conditions = {"IP" : float(ip), "Minutes" : int(minutes)})
    tubes.append(tube)

ex = flow.ImportOp(tubes = tubes,
                   conditions = {"IP" : "float",
                                 "Minutes" : "int"},
                   events = 1000).apply()
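It’s worth a quick sanity check that the conditions were parsed out of the filenames correctly. A minimal sketch, assuming the experiment exposes its events as a pandas DataFrame via ex.data, with one column per condition:

# each (IP, Minutes) pair parsed from a filename should appear exactly once
ex.data[['IP', 'Minutes']].drop_duplicates().sort_values(['IP', 'Minutes'])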

Take a quick look at the morphological parameters.

flow.DensityView(xchannel = "FSC-A",
                 xscale = "log",
                 ychannel = "SSC-A",
                 yscale = "log").plot(ex)
../../_images/induction_timecourse_6_0.png

There does seem to be a little bit of structure here, but in general the distribution is quite tight. So, we’ll use a 2D Gaussian mixture model to get single cells, keeping the events that are within two standard deviations of the distribution mean.

gm = flow.GaussianMixtureOp(name = "GM",
                            channels = ['FSC-A', 'SSC-A'],
                            scale = {'FSC-A' : 'log',
                                     'SSC-A' : 'log'},
                            num_components = 1,
                            sigma = 2)
gm.estimate(ex)
ex_gm = gm.apply(ex)
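Before relying on the gate, it’s worth checking what fraction of events ended up inside it. A quick sketch, assuming the operation added GM_1 as a boolean column to the experiment’s underlying DataFrame (it’s the same GM_1 condition we subset on below):

# fraction of events within two standard deviations of the mixture mean
ex_gm.data['GM_1'].value_counts(normalize = True)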

A diagnostic plot of the GMM.

gm.default_view().plot(ex_gm, alpha = 0.02)
/home/brian/src/cytoflow/cytoflow/operations/base_op_views.py:341: CytoflowViewWarning: Setting 'huefacet' to 'GM_1'
../../_images/induction_timecourse_10_1.png

Yep, that looks fine. Now compute the geometric mean in the FITC-A channel to see how GFP expression varies with IP concentration and time since induction.

ex_stat = flow.ChannelStatisticOp(name = 'GFP',
                                  channel = 'FITC-A',
                                  function = flow.geom_mean,
                                  by = ['IP', 'Minutes'],
                                  subset = 'GM_1 == True').apply(ex_gm)
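Before plotting, you can peek at the statistic itself. It’s stored in the experiment’s statistics dictionary (more on that below), keyed by the operation name and the function name; here I’m assuming it comes back as a pandas Series indexed by IP and Minutes:

ex_stat.statistics[('GFP', 'geom_mean')].head()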

And plot it. Could you use pandas and seaborn to do this instead? Absolutely.

flow.Stats1DView(statistic = ('GFP', 'geom_mean'),
                 variable = 'Minutes',
                 scale = 'log',
                 huefacet = 'IP',
                 huescale = 'log').plot(ex_stat,
                                        ylabel = 'Geometric mean of GFP (au)')
../../_images/induction_timecourse_14_0.png
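As promised, the same plot can be made with pandas and seaborn directly. A rough sketch, assuming the statistic is a pandas Series with IP and Minutes in its index (the 'GFP' column name and the axis labels are just for illustration):

import seaborn as sns

# flatten the statistic into a DataFrame that seaborn understands
gfp = ex_stat.statistics[('GFP', 'geom_mean')].reset_index(name = 'GFP')

ax = sns.lineplot(data = gfp, x = 'Minutes', y = 'GFP', hue = 'IP')
ax.set(yscale = 'log', ylabel = 'Geometric mean of GFP (au)')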

A geometric mean is only an appropriate summary statistic if the distribution is unimodal in log space. Is this actually true? Let’s look at the histogram of each [IP]/time combination to find out.

flow.HistogramView(channel = 'FITC-A',
                   xfacet = 'Minutes',
                   yfacet = 'IP',
                   scale = 'logicle',
                   subset = 'GM_1 == True').plot(ex_gm)
../../_images/induction_timecourse_16_0.png

Wow! That’s a lot of plots, and reading the axes is impossible. I could maybe use matplotlib.pyplot to change the plot parameters and get something useful, but instead let’s put [IP] on the hue facet instead of the Y facet. Then, we can wrap the X facet into three columns and actually see each plot.

flow.Kde1DView(channel = 'FITC-A',
               xfacet = 'Minutes',
               huefacet = 'IP',
               scale = 'logicle',
               huescale = 'log',
               subset = 'GM_1 == True').plot(ex_gm, col_wrap = 3, shade = False)
../../_images/induction_timecourse_18_0.png

This is very, very interesting! There appears to be significant structure to this data. It’s almost as if there are two populations, one that is “off” and one that is “on” – and that higher [IP] influences the rate at which cells switch from the off population to the on population.

We can model this mixture of Gaussians using a class we’ve already seen, GaussianMixtureOp. We’ll estimate two components, and we won’t specify sigma. We’ll also say by = ['IP', 'Minutes'] to fit a different model to each unique combination of [IP] and time.

gm_fitc = flow.GaussianMixtureOp(name = "GM_FITC",
                                 channels = ['FITC-A'],
                                 scale = {'FITC-A' : 'log'},
                                 by = ['IP', 'Minutes'],
                                 num_components = 2)
gm_fitc.estimate(ex_gm, subset = 'GM_1 == True')
ex_stat_2 = gm_fitc.apply(ex_gm)

Most data-driven operations add summary statistics to the experiment as well. Let’s have a look at which statistics are defined for this experiment.

ex_stat_2.statistics.keys()
dict_keys([('GM', 'mean'), ('GM', 'sigma'), ('GM', 'interval'), ('GM', 'correlation'), ('GM_FITC', 'mean'), ('GM_FITC', 'sigma'), ('GM_FITC', 'interval'), ('GM_FITC', 'proportion')])

The statistic ('GM_FITC', 'proportion') looks promising.

ex_stat_2.statistics[('GM_FITC', 'proportion')]
IP       Minutes  Component
0.0000   0        1            0.689475
                  2            0.310525
         30       1            0.689704
                  2            0.310296
         60       1            0.667829
                  2            0.332171
         90       1            0.677261
                  2            0.322739
         120      1            0.651504
                  2            0.348496
         150      1            0.573523
                  2            0.426477
         180      1            0.585170
                  2            0.414830
         240      1            0.576592
                  2            0.423408
         300      1            0.589542
                  2            0.410458
         360      1            0.531817
                  2            0.468183
         480      1            0.527385
                  2            0.472615
0.0312   0        1            0.763097
                  2            0.236903
         30       1            0.648558
                  2            0.351442
         60       1            0.625895
                  2            0.374105
         90       1            0.585238
                  2            0.414762
                                 ...
2.0000   240      1            0.189016
                  2            0.810984
         300      1            0.155909
                  2            0.844091
         360      1            0.152130
                  2            0.847870
         480      1            0.127653
                  2            0.872347
10.0000  0        1            0.699984
                  2            0.300016
         30       1            0.471701
                  2            0.528299
         60       1            0.385929
                  2            0.614071
         90       1            0.294833
                  2            0.705167
         120      1            0.305929
                  2            0.694071
         150      1            0.248707
                  2            0.751293
         180      1            0.259084
                  2            0.740916
         240      1            0.140860
                  2            0.859140
         300      1            0.166703
                  2            0.833297
         360      1            0.112013
                  2            0.887987
         480      1            0.080634
                  2            0.919366
Name: GM_FITC : proportion, Length: 264, dtype: float64

Great, we’ve got the proportions of component 1 and 2, at each IP concentration and timepoint. Plot the proportion in component 2 with “Minutes” on the X axis and “IP” on the hue facet.

flow.Stats1DView(statistic = ("GM_FITC", "proportion"),
                 variable = "Minutes",
                 huefacet = "IP",
                 huescale = "log",
                 subset = "Component == 2").plot(ex_stat_2)
/home/brian/src/cytoflow/cytoflow/views/base_views.py:751: CytoflowViewWarning: Only one value for level Component; dropping it.
../../_images/induction_timecourse_27_1.png

Ignore the “jagged” nature of the plot. The original data set is hundreds of megabytes big, and in that data set the curves are much smoother (-;

The important question is: is this any different than the geometric mean? Let’s re-plot the geometric mean plot so we can look at both.

flow.Stats1DView(statistic = ('GFP', 'geom_mean'),
                 variable = 'Minutes',
                 scale = 'log',
                 huefacet = 'IP',
                 huescale = 'log').plot(ex_stat,
                                        ylabel = 'Geometric mean of GFP (au)')
../../_images/induction_timecourse_29_0.png

I think those dynamics look significantly different. For one thing, the mixture model “saturates” much more quickly – both in time and in [IP]. The geometric mean model indicates saturation at about 5 uM, while the mixture model seems to saturate one or two steps earlier. Things also stop changing quite as dramatically by about 240 minutes, whereas the geometric mean hasn’t reached anything like a steady state by 480 minutes (the end of the experiment).
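One rough way to put a number on that difference is to look at the fraction of cells in the “on” component at the final timepoint, across [IP]. A sketch, assuming the proportion statistic is a pandas Series indexed by IP, Minutes and Component (with integer component labels), as printed above:

prop = ex_stat_2.statistics[('GM_FITC', 'proportion')]

# fraction of cells in the "on" component at 480 minutes, for each [IP]
prop.xs(2, level = 'Component').xs(480, level = 'Minutes')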


I hope this has demonstrated the non-trivial insights into the dynamics of this biological system that can be gained by looking at it through a quantitative lens, with some machine learning thrown in as well.