Computational Cytometry#

Cytoflow includes modules and views for analyzing and visualizing high-dimensional data from flow cytometry experiments. Often called computational cytometry, these semi-supervised and unsupervised analysis pipelines are generally broken into three major pieces:

  1. Clean and pre-process the data. Check for tubes that have artifacts / discontinuities in the flow rate, for example, and then compensate for spill-over between channels. Possibly warp the channels between tubes to bring peaks into registration. Because these data are pre-processed, those capabilities are not demonstrated here, but see the BleedthroughLinearOp, FlowCleanOp and RegistrationOp classes for details on how to perform these cleaning steps.

Often, pre-processing also involves scaling data with a log or biexponential scaling function. However, Cytoflow maintains the underlying data in its unscaled form, and scales it as needed for processing or visualization.

  1. Cluster or reduce the dimensionality of the data. Cytoflow includes several clustering algorithms – KMeans, FlowPeaks, and self-organizing maps – and two dimensionality reduction methods, principle components analysis and t-distributed stochastic neighbor embedding. Self-organizing maps and tSNE are demonstrated below.

  2. Visualize the data to explore the biology. For dimensionality-reduction methods like tSNE and PCA, standard scatterplots are used. However, for high-dimensional clustering, a minimum-spanning tree has become common. Cytoflow allows a user to create both visualizations, and each is demonstrated below as well.

This notebook demonstrates self-organized maps (SOM), minimum-spanning trees, and t-distributed stochastic neighbor embedding (t-SNE) using data from Saeys Y, Van Gassen S, Lambrecht BN. Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nature Reviews Immunology 16:449-462 (2016). The methods below reproduce many of the figures from that paper. (It’s a great paper – go read it!)

The example data files are taken from the Hierarchical Gating example notebook, which applied manual gating to identify NK, NK T, T and B cells; neutrophils, DCs, basophils, and macrophages. (After running the operations in that notebook, I used the ExportFCS operation to export each different cell type in a different .FCS file.) These manual gates serve as the “ground truth” to evaluate the performance of the clustering, dimensionality reduction, and visualization algorithms.

Set up the notebook and import the data set#

import cytoflow as flow
import pandas as pd

# if your figures are too big or too small, you can scale them by changing matplotlib's DPI
import matplotlib
matplotlib.rc('figure', dpi = 160)

We have a single “tube” metadata, which is the cell type from the (manual) hierarchical gating.

import_op = flow.ImportOp(conditions = {"Cell_Type" : "category"},
                          tubes = [flow.Tube(file='B Cell.fcs', conditions = {"Cell_Type" : "B Cell"}),
                                   flow.Tube(file='Basophil.fcs', conditions = {"Cell_Type" : "Basophil"}),
                                   flow.Tube(file='DC.fcs', conditions = {"Cell_Type" : "DC"}),
                                   flow.Tube(file='Macrophage.fcs', conditions = {"Cell_Type" : "Macrophage"}),
                                   flow.Tube(file='Neutrophil.fcs', conditions = {"Cell_Type" : "Neutrophil"}),
                                   flow.Tube(file='NK T.fcs', conditions = {"Cell_Type" : "NK T"}),
                                   flow.Tube(file='NK.fcs', conditions = {"Cell_Type" : "NK"}),
                                   flow.Tube(file = "T Cell.fcs", conditions = {"Cell_Type" : "T Cell"})],
                          channels = {"FSC-A" : "FSC_A",
                                      "FSC-H" : "FSC_H",
                                      "FSC-W" : "FSC_W",
                                      "APC-Cy7-A" : "Live_Dead",
                                      "AmCyan-A" : "AmCyan",
                                      "BV711-A" : "CD64",
                                      "PE-A" : "CD3",
                                      "PE-Cy5-A" : "CD19",
                                      "APC-A" : "CD161",
                                      "PE-Cy7-A" : "CD11c",
                                      "PerCP-Cy5-5-A" : "MHCII",
                                      "Alexa Fluor 700-A" : "Ly_6G",
                                      "BV605-A" : "CD11b",
                                      "BV786-A" : "FcERI",
                                      "Pacific Blue-A" : "Pacific_Blue",
                                      "Time" : "Time"})

ex_import = import_op.apply()

The exported FCS files were not pre-gated to remove debris and clumps. Instead of using a manual gate, let’s use a DensityGateOp gate on FSC_H and FSC_W to select 80% of the events in the densest clusters. It’s more reproducible and less biased than manual gating!

density_op = flow.DensityGateOp(name = "Single_Cell",
                                xchannel = "FSC_H",
                                ychannel = "FSC_W",
                                yscale = "log",
                                keep = 0.8)
density_op.estimate(ex_import)
density_op.default_view(huescale = "log").plot(ex_import)
../../_images/computational_cytometry_6_0.png
ex_single_cell = density_op.apply(ex_import)

That said, a single RangeOp gate makes it pretty easy to sort out the live cells (Live_Dead-) from the dead cells (Live_Dead+)

flow.HistogramView(channel = "Live_Dead",
                   scale = "logicle",
                   subset = "Single_Cell == True").plot(ex_single_cell)
../../_images/computational_cytometry_9_0.png
ex_live = flow.RangeOp(name = "Live",
                       channel = "Live_Dead",
                       low = -300,
                       high = 1000).apply(ex_single_cell)

Clustering with Self-Organizing Maps#

SOMs use a grid of interconnected “neurons” to that are trained to categorize high-dimensional inputs. For a reasonable panel like the 9-marker panel we’re using, the default settings seem to be fine, but there are a lot of other parameters that can be tweaked. See the SOMOp documentation for details. I also highly suggest reading https://rubikscode.net/2018/08/20/introduction-to-self-organizing-maps/ and https://www.datacamp.com/tutorial/self-organizing-maps – the Tuning the SOM Model section in that second link is particularly helpful!

We use SOMOp just like any other data-driven module – instantiate the module, then call estimate(). This one can take a minute or so on a decent computer, so be patient. This algorithm also works substantially better on scaled data, so we’ll scale each channel with the logicle biexponential scale before training the map.

som_op = flow.SOMOp(name = "SOM_Cluster",
                    channels = ["CD64",
                                "CD3",
                                "CD19",
                                "CD161",
                                "CD11c",
                                "MHCII",
                                "Ly_6G",
                                "CD11b",
                                "FcERI"],
                    scale = {"CD64" : "logicle",
                             "CD3" : "logicle",
                             "CD19" : "logicle",
                             "CD161" : "logicle",
                             "CD11c" : "logicle",
                             "MHCII" : "logicle",
                             "Ly_6G" : "logicle",
                             "CD11b" : "logicle",
                             "FcERI" : "logicle"})
som_op.estimate(ex_live, subset = "Single_Cell == True & Live == True")
/home/brian/src/cytoflow/cytoflow/utility/minisom.py:645: RuntimeWarning: invalid value encountered in sqrt

In this example, we know the ground truth, but in general we won’t – so we need to use internal measures to evaluate the performance of our classifier. In this case, default_view() creates a diagnostic view so we can get a sense of how well the training went. The top plot is a distance map, where each cell represents one neuron and the color represents how close the neuron is to its adjacent neurons. Think of it as a topographic map – the input data will cluster in the “valleys”. In this case, we can see that there are two major “valleys” – we’ll see later if those correspond to any major cell types.

The other plot that default_view() gives you is a plot of the quantization error over the training epochs. Lower quantization error means the model fits the data better. This should decrease, but it pretty much always looks asymptotic. If it doesn’t seem to have decreased much, increase the number of iterations, but beware – later iterations give you less of a decrease each time than earlier ones!

som_op.default_view().plot(ex_live)
../../_images/computational_cytometry_14_0.png

Let’s apply the classifier.

ex_som = som_op.apply(ex_live)

To use SOMOp effectively, it’s important to understand what exactly it did. First, it added a statistic with the same name as the operation. Each row is a cluster and each column is one of the channels the model was trained on.

ex_som.statistics['SOM_Cluster']
CD64 CD3 CD19 CD161 CD11c MHCII Ly_6G CD11b FcERI
SOM_Cluster
0 0.289525 0.326666 0.348314 0.248770 0.592562 0.578617 0.248489 0.596384 0.338891
1 0.171053 0.194912 0.698371 0.209242 0.197450 0.558759 0.181081 0.239758 0.186713
2 0.196070 0.620371 0.248329 0.236463 0.208413 0.229489 0.190163 0.224928 0.170684

Here, we can see that SOMOp created three clusters. This statistic shows the center of that cluster in each of the input channels – this will be useful later. The other thing that SOMOp does is it creates a new condition in the experiment, also named the same as the module. This classifies each event as a member of one of the clusters.

ex_som.data.head()
CD161 Live_Dead Ly_6G AmCyan CD11b CD64 FcERI Cell_Type FSC_A FSC_H FSC_W CD3 CD19 CD11c Pacific_Blue MHCII Time Single_Cell Live SOM_Cluster
0 867.849915 386.842163 79.696068 482.380005 -79.137207 169.278854 188.839493 B Cell 87030.718750 80358.0 70977.937500 207.338821 14680.279297 -281.543640 127.491379 4268.340332 6968.700195 True True 1
1 327.388550 -175.940399 -89.582481 172.660004 51.409374 386.909424 334.144287 B Cell 96338.968750 87559.0 72107.617188 118.806480 12738.702148 447.330078 -40.327629 1693.349609 5436.500000 True False 1
2 493.974030 -180.421280 151.270081 281.239990 26.474604 -319.584473 375.361481 B Cell 95161.500000 88926.0 70131.398438 -24.239075 11745.514648 -80.861443 -76.268112 3642.938721 2477.199951 True False 1
3 -324.122528 -32.580925 -214.568588 140.619995 -46.702297 110.816521 -97.811043 B Cell 85820.492188 79845.0 70440.625000 35.616337 13543.582031 98.451004 90.043648 2853.914062 1077.800049 True True 1
4 58.998810 4059.355469 -3.835757 125.489998 424.426666 -149.123917 266.677643 B Cell 44717.398438 40482.0 72392.656250 -50.284782 16535.718750 -282.773895 15.052464 3099.647705 2372.300049 True False 1

So let’s see how well we did in separating out the cell types. (Even before we do, though, it is notable that SOMOp gave us three clusters – but we started with 8 distinct cell types!) Let’s just count the number of each cell type that ended up in each cluster and see where we end up.

ex_som_count = flow.ChannelStatisticOp(name = "SOM_Count",
                                       channel = "FSC_A",
                                       function = len,
                                       by = ["SOM_Cluster", "Cell_Type"],
                                       subset = "Single_Cell == True & Live == True").apply(ex_som)

flow.MatrixView(statistic = "SOM_Count",
                style = "pie",
                variable = "Cell_Type",
                feature = "FSC_A",
                yfacet = "SOM_Cluster").plot(ex_som_count)
../../_images/computational_cytometry_22_0.png

Well. We got three clusters – one is mostly B cells, one is mostly T cells (with some NK and NK T cells), and one is “everything else” – DCs, macrophages, basophils, neutrohpils. The reason we ended up with only three clusters here is because most of the cells in the data set are B and T cells!

Can we do better? By default, the SOMOp operation uses consensus clustering to find the “natural” number of clusters – but sometimes we want more resolution. Remember that each neuron in the self-organizing map actually defines a cluster, so the “natural” clusters are actually clusters of clusters!

You can disable the consensus clustering by setting the consensus_cluster attribute of the SOMOp to False. If you’ve already trained the SOM, you don’t have to re-train it! Just call update_consensus_clusters() after changing consensus_cluster or any of the consensus clustering parameters, then call apply() again.

som_op.consensus_cluster = False
som_op.update_consensus_clusters()
ex_som = som_op.apply(ex_live)
ex_som.statistics["SOM_Cluster"]
CD64 CD3 CD19 CD161 CD11c MHCII Ly_6G CD11b FcERI
SOM_Cluster
0 0.226738 0.121663 0.711139 0.096507 0.176297 0.470379 0.228657 0.249910 0.142132
1 0.175665 0.259865 0.717535 0.093570 0.098668 0.535609 0.215967 0.163130 0.225028
2 0.206152 0.276354 0.702472 0.126244 0.224442 0.467791 0.252755 0.290158 0.174743
3 0.204389 0.142635 0.689291 0.169682 0.482463 0.546307 0.163354 0.303589 0.163733
4 0.352854 0.221938 0.699500 0.184988 0.325416 0.515196 0.189641 0.354996 0.394311
5 0.518091 0.279517 0.358920 0.257161 0.439005 0.520271 0.185274 0.531840 0.543711
6 0.204237 0.282069 0.247663 0.241629 0.717699 0.619190 0.172715 0.496648 0.229111
7 0.154476 0.121151 0.697525 0.155817 0.160260 0.576676 0.277737 0.271750 0.227268
8 0.219407 0.182614 0.694765 0.122287 0.213434 0.543111 0.224136 0.152105 0.107418
9 0.168990 0.109799 0.721523 0.304668 0.135357 0.508849 0.126979 0.265715 0.163497
10 0.281169 0.150034 0.710293 0.195567 0.220006 0.568767 0.159527 0.239138 0.207659
11 0.235297 0.178217 0.707471 0.205426 0.270274 0.483761 0.112239 0.261615 0.081876
12 0.273329 0.273755 0.389782 0.257516 0.292130 0.279665 0.502135 0.662274 0.257203
13 0.163488 0.229435 0.166235 0.601458 0.491761 0.297294 0.189608 0.449072 0.211414
14 0.206379 0.106992 0.709627 0.200588 0.092779 0.549445 0.127033 0.188463 0.164182
15 0.240396 0.305606 0.694820 0.130993 0.160840 0.586773 0.180355 0.277189 0.174716
16 0.115209 0.177491 0.691077 0.160586 0.138238 0.625112 0.200368 0.175071 0.129453
17 0.125127 0.145870 0.693729 0.199691 0.230885 0.608198 0.165056 0.125243 0.195383
18 0.194069 0.169715 0.626780 0.241258 0.263665 0.373588 0.184766 0.249572 0.178180
19 0.254950 0.192477 0.664192 0.143500 0.142086 0.159576 0.204632 0.201929 0.165504
20 0.227084 0.595421 0.310775 0.239730 0.325393 0.262197 0.203936 0.294080 0.240895
21 0.136347 0.201825 0.703740 0.119259 0.299785 0.594602 0.212150 0.234938 0.200145
22 0.156384 0.241592 0.698324 0.270967 0.216938 0.561354 0.135102 0.137199 0.082454
23 0.125088 0.162495 0.681556 0.221286 0.237698 0.556740 0.207873 0.250207 0.097868
24 0.137548 0.214600 0.717823 0.173933 0.183517 0.543959 0.204044 0.125678 0.241090
25 0.196544 0.289173 0.628371 0.157156 0.203454 0.423672 0.201176 0.186139 0.195378
26 0.228521 0.557851 0.303835 0.196637 0.179855 0.141759 0.191598 0.224525 0.149455
27 0.197655 0.641699 0.216108 0.244526 0.204998 0.208955 0.184996 0.132743 0.129376
28 0.085399 0.248650 0.717835 0.203723 0.166089 0.639234 0.192483 0.190403 0.228592
29 0.109112 0.148866 0.745164 0.091180 0.129494 0.632626 0.191428 0.268908 0.173837
30 0.098192 0.176842 0.709552 0.210121 0.231526 0.616483 0.156820 0.338396 0.246745
31 0.180190 0.225396 0.701726 0.328079 0.165035 0.563848 0.116712 0.219985 0.242152
32 0.172369 0.568534 0.435958 0.205659 0.231257 0.340581 0.183036 0.271614 0.160560
33 0.203379 0.630036 0.256173 0.233902 0.208375 0.185172 0.177039 0.240995 0.200623
34 0.218481 0.651411 0.322282 0.198405 0.193204 0.142562 0.189204 0.284626 0.140142
35 0.161228 0.227828 0.696832 0.227831 0.206264 0.506996 0.148675 0.314344 0.187523
36 0.142372 0.127178 0.667954 0.325671 0.268454 0.574391 0.121194 0.263416 0.132923
37 0.177250 0.180344 0.686478 0.336477 0.226248 0.484226 0.092654 0.133812 0.187389
38 0.224196 0.196167 0.677810 0.244612 0.125569 0.416730 0.175917 0.167268 0.178349
39 0.236056 0.611020 0.376443 0.193382 0.214096 0.124856 0.193684 0.150977 0.154844
40 0.170684 0.636370 0.187967 0.220855 0.191435 0.308774 0.182804 0.146853 0.206397
41 0.185302 0.582613 0.160439 0.233339 0.188856 0.275988 0.193168 0.300069 0.133802
42 0.164954 0.173946 0.703948 0.158942 0.236451 0.596010 0.191247 0.346326 0.085774
43 0.077604 0.172721 0.672837 0.298129 0.159213 0.610449 0.103910 0.260129 0.183249
44 0.226385 0.246277 0.706517 0.319626 0.136734 0.561341 0.129759 0.111565 0.110477
45 0.131606 0.245083 0.740210 0.332164 0.251086 0.600782 0.152830 0.315986 0.187806
46 0.209071 0.476744 0.263280 0.499514 0.205978 0.197310 0.156061 0.259597 0.153490
47 0.164869 0.656669 0.090110 0.289076 0.228309 0.305352 0.138552 0.287980 0.189140
48 0.157298 0.564193 0.084960 0.275941 0.253021 0.335147 0.166746 0.153386 0.172314

By default, SOMOp uses a 7x7 grid of neurons, so now we’ve got 49 clusters instead of 3. Let’s recompute the count statistic from above:

ex_som_count = flow.ChannelStatisticOp(name = "SOM_Count",
                                       channel = "FSC_A",
                                       function = len,
                                       by = ["SOM_Cluster", "Cell_Type"],
                                       subset = "Single_Cell == True & Live == True").apply(ex_som)

This is quite a lot of clusters – how can we make sense of them?

Visualizing a self-organizing map with a minimum-spanning tree#

Remember how the SOMOp module created a statistic with the clusters and their centers? (Of course you do – it’s two cells above this one.) Those cluster centers can be used to plot a minimum spanning tree (MST) that shows the clusters’ relationships to eachother in high-dimensional space. We can use the MSTView view to create a MST. Note that the SOM_Cluster statistic is only used for the locations of each cluster in the tree. What gets plotted at those locations is another statistic – in this case, the SOM_Count statistic that contains the number of events from each cell type in each cluster. We’ll plot a pie chart at each location, which is a good way to visualize the relative abundance of each cell type in each cluster.

flow.MSTView(style = "pie",
             statistic = "SOM_Count",
             variable = "Cell_Type",
             feature = "FSC_A",
             locations = "SOM_Cluster",
             metric = 'euclidean').plot(ex_som_count)
../../_images/computational_cytometry_28_0.png

That’s a little more like it! Lots of T and B cells – as expected – but now, because we have a higher resolution “map” of the high-dimensional space, we can see that the other cell types are also clustering together! There are individual clusters of (mostly) NK, NK T, DC, Neutrophil, and macrophages. The only one missing its own cluster are the basophils – and there are so few of them in this data set that that’s not hugely surprising. Perhaps a SOM with more neurons would resolve them – feel free to play around and find out!

Remember, though, that here we have the ground truth in this data set, and usually you won’t. Let’s use the same tree to plot different data – in this case, the geometric mean of each of the 9 marker channels.

First, we need to create a new statistic. We’ll used FrameStatisticOp to break the data set apart by different values of SOM_Cluster, then compute flow.geom_mean() on each channel in each subset.

op_marker_stat = flow.FrameStatisticOp(name = "SOM_Marker_Mean",
                                       by = ["SOM_Cluster"],
                                       function = lambda x: pd.Series({"CD64" : flow.geom_mean(x["CD64"]),
                                                                       "CD3" : flow.geom_mean(x["CD3"]),
                                                                       "CD19" : flow.geom_mean(x["CD19"]),
                                                                       "CD161" : flow.geom_mean(x["CD161"]),
                                                                       "CD11c" : flow.geom_mean(x["CD11c"]),
                                                                       "MHCII" : flow.geom_mean(x["MHCII"]),
                                                                       "Ly_6G" : flow.geom_mean(x["Ly_6G"]),
                                                                       "CD11b" : flow.geom_mean(x["CD11b"]),
                                                                       "FcERI" : flow.geom_mean(x["FcERI"])}),
                                      subset = "Single_Cell == True & Live == True")
ex_marker = op_marker_stat.apply(ex_som_count)
ex_marker.statistics["SOM_Marker_Mean"].head()
CD64 CD3 CD19 CD161 CD11c MHCII Ly_6G CD11b FcERI
SOM_Cluster
0 48.881560 -113.307490 13527.244066 -541.400374 -78.887474 1404.630782 68.391732 147.235850 -12.864664
1 -37.986999 123.798359 13706.225894 -556.902879 -294.634000 2751.383944 60.797191 -30.524994 80.620049
2 3.975231 168.311116 12865.033204 -371.834672 50.123072 1514.621438 148.298198 254.854666 20.959845
3 -23.783888 -39.767408 14540.616210 -102.483144 1896.639668 3792.009839 -25.314420 337.753416 34.470699
4 461.930846 41.467228 15011.048700 -88.127247 267.814390 2965.898558 23.121447 469.708267 702.696616

Now, we can use MSTView in a slightly different way. First, we’ll make a “petal plot” instead of a pie plot at each location. And second, if we specify a statistic but not the variable from that statistic, the columns are used as the categories. (Was this functionality implemented for precisely this use case? Yes, yes it was.)

flow.MSTView(style = "petal",
             statistic = "SOM_Marker_Mean",
             locations = "SOM_Cluster").plot(ex_marker, lw = 0.3, radius = 0.035)
../../_images/computational_cytometry_33_0.png

Now we can see that the fairly obvious classes of cells and their marker levels. High CD19 (and mostly low owther things) are B cells; high CD3 (and mostly low other things) are T cells. But there are a few clusters that are different, and those correspond to the other cell types.

t-distributed Stochastic Neighbor Embedding#

Self-organizing maps (and other clustering algorithms like K-means and FlowPeaks) are classifiers – they take points in a high-dimensional space and sort them into bins based on a how close they are to eachother. These algorithms consider all of the dimensions – in this case, all 9 of the channels – but they are subject to the curse of dimensionality where increased numbers of dimensions make distance-based algorithms begin to fail.

Another approach is to reduce the number of dimensions, embedding the original high-dimensional data set into a lower-dimensional (usually 2) space. The trick is to do so in a way that retains the structure, keeping “close” observations in the higher-dimensional space still “close” in the lower-dimensional embedding.

t-distributed Stochastic Neighbor Embedding is an algorithm that promises to do just that. It is one of many non-linear dimensionality reduction methods – its benefit over linear dimensionality reductions such as principal components analysis (PCA) is that is more faithfully maintains local structure.

This comes with a cost, of course, and that cost is computational complexity! On this fairly modest data set, the following cell takes over four minutes to run. So be patient! The results are worth it, I promise. The package we use, openTSNE, prints status updates so you don’t get bored or think it’s crashed. Also, the tSNE algorithm can also peform better or worse using different ways of measuring “distance” in the original high-dimensional space. For two or three channels, euclidean is fine, but for higher numbers of channels cosine seems to work better. Finally, this performs much better on scaled data, so we’re using logicle for all of the channels. Just as with SOMOp, there are a number of parameters that can change the performance of the algorithm. Read the tSNEop class documentation for details.

tsne_op = flow.tSNEOp(name = "tSNE",
                      channels = ["CD64",
                                "CD3",
                                "CD19",
                                "CD161",
                                "CD11c",
                                "MHCII",
                                "Ly_6G",
                                "CD11b",
                                "FcERI"],
                      scale = {"CD64" : "logicle",
                               "CD3" : "logicle",
                               "CD19" : "logicle",
                               "CD161" : "logicle",
                               "CD11c" : "logicle",
                               "MHCII" : "logicle",
                               "Ly_6G" : "logicle",
                               "CD11b" : "logicle",
                               "FcERI" : "logicle"},
                      metric = "cosine")
tsne_op.estimate(ex_som_count)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, metric='cosine', n_jobs=8, perplexity=10,
     verbose=True)
--------------------------------------------------------------------------------
===> Finding 30 nearest neighbors using Annoy approximate search using cosine distance...
   --> Time elapsed: 0.20 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.02 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.01 seconds
===> Running optimization with exaggeration=12.00, lr=198.25 for 250 iterations...
Iteration   50, KL divergence 4.6000, 50 iterations in 45.6969 sec
Iteration  100, KL divergence 4.5668, 50 iterations in 43.1384 sec
Iteration  150, KL divergence 4.5568, 50 iterations in 41.4098 sec
Iteration  200, KL divergence 4.5513, 50 iterations in 41.3300 sec
Iteration  250, KL divergence 4.5480, 50 iterations in 43.0481 sec
   --> Time elapsed: 214.62 seconds
===> Running optimization with exaggeration=1.00, lr=2379.00 for 500 iterations...
Iteration   50, KL divergence 2.2273, 50 iterations in 25.8856 sec
Iteration  100, KL divergence 2.0122, 50 iterations in 25.0921 sec
Iteration  150, KL divergence 1.9150, 50 iterations in 25.4421 sec
Iteration  200, KL divergence 1.8603, 50 iterations in 26.1375 sec
Iteration  250, KL divergence 1.8237, 50 iterations in 25.3474 sec
Iteration  300, KL divergence 1.7989, 50 iterations in 25.7548 sec
Iteration  350, KL divergence 1.7796, 50 iterations in 25.6185 sec
Iteration  400, KL divergence 1.7653, 50 iterations in 25.9954 sec
Iteration  450, KL divergence 1.7547, 50 iterations in 25.9668 sec
Iteration  500, KL divergence 1.7458, 50 iterations in 25.8475 sec
   --> Time elapsed: 257.09 seconds

(If you saw an error about “No child process”, don’t worry about it – it doesn’t affect the outcome.)

Apply the operation, and note that it adds two more “channels” to the experiment – named the same as the operation name, with _1 and _2 appended.

ex_tsne = tsne_op.apply(ex_som_count)
ex_tsne.data.head()
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
   --> Time elapsed: 10.52 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.25 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration   50, KL divergence 3948148.9583, 50 iterations in 2.9153 sec
Iteration  100, KL divergence 3915482.6764, 50 iterations in 2.7796 sec
Iteration  150, KL divergence 3895033.2699, 50 iterations in 2.8729 sec
Iteration  200, KL divergence 3880748.6666, 50 iterations in 2.8130 sec
Iteration  250, KL divergence 3869923.7153, 50 iterations in 2.9048 sec
   --> Time elapsed: 14.29 seconds
CD161 Live_Dead Ly_6G AmCyan CD11b CD64 FcERI Cell_Type FSC_A FSC_H ... CD19 CD11c Pacific_Blue MHCII Time Single_Cell Live SOM_Cluster tSNE_1 tSNE_2
0 867.849915 386.842163 79.696068 482.380005 -79.137207 169.278854 188.839493 B Cell 87030.718750 80358.0 ... 14680.279297 -281.543640 127.491379 4268.340332 6968.700195 True True 31 20.072726 -66.822968
1 327.388550 -175.940399 -89.582481 172.660004 51.409374 386.909424 334.144287 B Cell 96338.968750 87559.0 ... 12738.702148 447.330078 -40.327629 1693.349609 5436.500000 True False 4 -3.875199 -19.182361
2 493.974030 -180.421280 151.270081 281.239990 26.474604 -319.584473 375.361481 B Cell 95161.500000 88926.0 ... 11745.514648 -80.861443 -76.268112 3642.938721 2477.199951 True False 31 19.317669 -47.476411
3 -324.122528 -32.580925 -214.568588 140.619995 -46.702297 110.816521 -97.811043 B Cell 85820.492188 79845.0 ... 13543.582031 98.451004 90.043648 2853.914062 1077.800049 True True 11 37.330137 -14.398697
4 58.998810 4059.355469 -3.835757 125.489998 424.426666 -149.123917 266.677643 B Cell 44717.398438 40482.0 ... 16535.718750 -282.773895 15.052464 3099.647705 2372.300049 True False 30 34.690254 -33.893300

5 rows × 22 columns

Let’s plot the results, colored by “ground truth” cell type.

flow.ScatterplotView(xchannel = "tSNE_1",
                     ychannel = "tSNE_2",
                     huefacet = "Cell_Type").plot(ex_tsne, s = 0.1, alpha = 0.1)
../../_images/computational_cytometry_39_0.png

Excellent! We’re seeing “natural” clusters where (most) clusters seem to represent a single cell type. NK Ts are a little close to the T cells, and the basophils are hiding again, but the others seem to be relatively clearly clustered.

…where are those sneaky basophils?

flow.ScatterplotView(xchannel = "tSNE_1",
                     ychannel = "tSNE_2",
                     huefacet = "Cell_Type", subset = "Cell_Type == 'Basophil'").plot(ex_tsne, s = 0.1, alpha = 1.0)
../../_images/computational_cytometry_41_0.png

They’re hiding with their friends, the macrophages!

Again, we usually won’t have the ground truth – so it’s again good to evaluate the clusters by plotting the relative amounts of each marker in each cluster. The following graphs do so with the huechannel and huescale attributes of ScatterplotView, which relate the color of each event to the (scaled) value of a channel. And just as the source paper uses different color palettes for each different marker, so too here are different palettes with different vibes. The last, viridis_r, is my favorite and thus is the default if no palette is specified. For a comprehensive list of the options, see https://seaborn.pydata.org/tutorial/color_palettes.html and https://matplotlib.org/stable/users/explain/colors/colormaps.html

flow.ScatterplotView(xchannel = "tSNE_1",
                     ychannel = "tSNE_2",
                     huechannel = "CD11b",
                     huescale = "logicle",
                     subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "coolwarm")
../../_images/computational_cytometry_43_0.png
flow.ScatterplotView(xchannel = "tSNE_1",
                     ychannel = "tSNE_2",
                     huechannel = "CD11c",
                     huescale = "logicle",
                     subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Blues")
../../_images/computational_cytometry_44_0.png
flow.ScatterplotView(xchannel = "tSNE_1",
                     ychannel = "tSNE_2",
                     huechannel = "CD161",
                     huescale = "logicle",
                     subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Greens")
../../_images/computational_cytometry_45_0.png
flow.ScatterplotView(xchannel = "tSNE_1",
                     ychannel = "tSNE_2",
                     huechannel = "CD19",
                     huescale = "logicle",
                     subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Oranges")
../../_images/computational_cytometry_46_0.png
flow.ScatterplotView(xchannel = "tSNE_1",
                     ychannel = "tSNE_2",
                     huechannel = "CD3",
                     huescale = "logicle",
                     subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Reds")
../../_images/computational_cytometry_47_0.png
flow.ScatterplotView(xchannel = "tSNE_1",
                     ychannel = "tSNE_2",
                     huechannel = "CD64",
                     huescale = "logicle",
                     subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Purples")
../../_images/computational_cytometry_48_0.png
flow.ScatterplotView(xchannel = "tSNE_1",
                     ychannel = "tSNE_2",
                     huechannel = "FcERI",
                     huescale = "logicle",
                     subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Spectral_r")
../../_images/computational_cytometry_49_0.png
flow.ScatterplotView(xchannel = "tSNE_1",
                     ychannel = "tSNE_2",
                     huechannel = "Ly_6G",
                     huescale = "logicle",
                     subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "plasma_r")
../../_images/computational_cytometry_50_0.png
flow.ScatterplotView(xchannel = "tSNE_1",
                     ychannel = "tSNE_2",
                     huechannel = "MHCII",
                     huescale = "logicle",
                     subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "viridis_r")
../../_images/computational_cytometry_51_0.png

Two final notes. First, because t-SNE is stochastic, and because it is initialized randomly, you will get a somewhat different plot each time you run it. This is expected behavior.

And second, there is a natural cluster in the tSNE plots that was assigned to the T Cell cell type using manual gating, but is definitely distinct from the main T cell blob on the tSNE plot. It is BOTH CD19+ and CD3+. When you go back to look at the MST, they show up in a couple of clusters there too. I am not an immunologist, but Google thinks that this population should not exist. I don’t know what it’s doing there!