Computational Cytometry#
Cytoflow includes modules and views for analyzing and visualizing
high-dimensional data from flow cytometry experiments. Often called
computational cytometry, these semi-supervised and unsupervised
analysis pipelines are generally broken into three major pieces:
Clean and pre-process the data. Check for tubes that have artifacts / discontinuities in the flow rate, for example, and then compensate for spill-over between channels. Possibly warp the channels between tubes to bring peaks into registration. Because these data are pre-processed, those capabilities are not demonstrated here, but see the
BleedthroughLinearOp,FlowCleanOpandRegistrationOpclasses for details on how to perform these cleaning steps.
Often, pre-processing also involves scaling data with a log or
biexponential scaling function. However, Cytoflow maintains the
underlying data in its unscaled form, and scales it as needed for
processing or visualization.
Cluster or reduce the dimensionality of the data. Cytoflow includes several clustering algorithms – KMeans, FlowPeaks, and self-organizing maps – and two dimensionality reduction methods, principle components analysis and t-distributed stochastic neighbor embedding. Self-organizing maps and tSNE are demonstrated below.
Visualize the data to explore the biology. For dimensionality-reduction methods like tSNE and PCA, standard scatterplots are used. However, for high-dimensional clustering, a minimum-spanning tree has become common.
Cytoflowallows a user to create both visualizations, and each is demonstrated below as well.
This notebook demonstrates self-organized maps (SOM), minimum-spanning
trees, and t-distributed stochastic neighbor embedding (t-SNE) using
data from
Saeys Y, Van Gassen S, Lambrecht BN. Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nature Reviews Immunology 16:449-462 (2016).
The methods below reproduce many of the figures from that paper. (It’s a
great paper – go read it!)
The example data files are taken from the Hierarchical Gating
example notebook, which applied manual gating to identify NK, NK T, T
and B cells; neutrophils, DCs, basophils, and macrophages. (After
running the operations in that notebook, I used the ExportFCS
operation to export each different cell type in a different .FCS file.)
These manual gates serve as the “ground truth” to evaluate the
performance of the clustering, dimensionality reduction, and
visualization algorithms.
Set up the notebook and import the data set#
import cytoflow as flow
import pandas as pd
# if your figures are too big or too small, you can scale them by changing matplotlib's DPI
import matplotlib
matplotlib.rc('figure', dpi = 160)
We have a single “tube” metadata, which is the cell type from the (manual) hierarchical gating.
import_op = flow.ImportOp(conditions = {"Cell_Type" : "category"},
tubes = [flow.Tube(file='B Cell.fcs', conditions = {"Cell_Type" : "B Cell"}),
flow.Tube(file='Basophil.fcs', conditions = {"Cell_Type" : "Basophil"}),
flow.Tube(file='DC.fcs', conditions = {"Cell_Type" : "DC"}),
flow.Tube(file='Macrophage.fcs', conditions = {"Cell_Type" : "Macrophage"}),
flow.Tube(file='Neutrophil.fcs', conditions = {"Cell_Type" : "Neutrophil"}),
flow.Tube(file='NK T.fcs', conditions = {"Cell_Type" : "NK T"}),
flow.Tube(file='NK.fcs', conditions = {"Cell_Type" : "NK"}),
flow.Tube(file = "T Cell.fcs", conditions = {"Cell_Type" : "T Cell"})],
channels = {"FSC-A" : "FSC_A",
"FSC-H" : "FSC_H",
"FSC-W" : "FSC_W",
"APC-Cy7-A" : "Live_Dead",
"AmCyan-A" : "AmCyan",
"BV711-A" : "CD64",
"PE-A" : "CD3",
"PE-Cy5-A" : "CD19",
"APC-A" : "CD161",
"PE-Cy7-A" : "CD11c",
"PerCP-Cy5-5-A" : "MHCII",
"Alexa Fluor 700-A" : "Ly_6G",
"BV605-A" : "CD11b",
"BV786-A" : "FcERI",
"Pacific Blue-A" : "Pacific_Blue",
"Time" : "Time"})
ex_import = import_op.apply()
The exported FCS files were not pre-gated to remove debris and clumps.
Instead of using a manual gate, let’s use a DensityGateOp gate on
FSC_H and FSC_W to select 80% of the events in the densest
clusters. It’s more reproducible and less biased than manual gating!
density_op = flow.DensityGateOp(name = "Single_Cell",
xchannel = "FSC_H",
ychannel = "FSC_W",
yscale = "log",
keep = 0.8)
density_op.estimate(ex_import)
density_op.default_view(huescale = "log").plot(ex_import)
ex_single_cell = density_op.apply(ex_import)
That said, a single RangeOp gate makes it pretty easy to sort out
the live cells (Live_Dead-) from the dead cells (Live_Dead+)
flow.HistogramView(channel = "Live_Dead",
scale = "logicle",
subset = "Single_Cell == True").plot(ex_single_cell)
ex_live = flow.RangeOp(name = "Live",
channel = "Live_Dead",
low = -300,
high = 1000).apply(ex_single_cell)
Clustering with Self-Organizing Maps#
SOMs use a grid of interconnected “neurons” to that are trained to
categorize high-dimensional inputs. For a reasonable panel like the
9-marker panel we’re using, the default settings seem to be fine, but
there are a lot of other parameters that can be tweaked. See the
SOMOp documentation for details. I also highly suggest reading
https://rubikscode.net/2018/08/20/introduction-to-self-organizing-maps/
and https://www.datacamp.com/tutorial/self-organizing-maps – the
Tuning the SOM Model section in that second link is particularly
helpful!
We use SOMOp just like any other data-driven module – instantiate
the module, then call estimate(). This one can take a minute or so
on a decent computer, so be patient. This algorithm also works
substantially better on scaled data, so we’ll scale each channel with
the logicle biexponential scale before training the map.
som_op = flow.SOMOp(name = "SOM_Cluster",
channels = ["CD64",
"CD3",
"CD19",
"CD161",
"CD11c",
"MHCII",
"Ly_6G",
"CD11b",
"FcERI"],
scale = {"CD64" : "logicle",
"CD3" : "logicle",
"CD19" : "logicle",
"CD161" : "logicle",
"CD11c" : "logicle",
"MHCII" : "logicle",
"Ly_6G" : "logicle",
"CD11b" : "logicle",
"FcERI" : "logicle"})
som_op.estimate(ex_live, subset = "Single_Cell == True & Live == True")
/home/brian/src/cytoflow/cytoflow/utility/minisom.py:645: RuntimeWarning: invalid value encountered in sqrt
In this example, we know the ground truth, but in general we won’t – so
we need to use internal measures to evaluate the performance of our
classifier. In this case, default_view() creates a diagnostic view
so we can get a sense of how well the training went. The top plot is a
distance map, where each cell represents one neuron and the color
represents how close the neuron is to its adjacent neurons. Think of it
as a topographic map – the input data will cluster in the “valleys”. In
this case, we can see that there are two major “valleys” – we’ll see
later if those correspond to any major cell types.
The other plot that default_view() gives you is a plot of the
quantization error over the training epochs. Lower quantization error
means the model fits the data better. This should decrease, but it
pretty much always looks asymptotic. If it doesn’t seem to have
decreased much, increase the number of iterations, but beware – later
iterations give you less of a decrease each time than earlier ones!
som_op.default_view().plot(ex_live)
Let’s apply the classifier.
ex_som = som_op.apply(ex_live)
To use SOMOp effectively, it’s important to understand what exactly
it did. First, it added a statistic with the same name as the operation.
Each row is a cluster and each column is one of the channels the model
was trained on.
ex_som.statistics['SOM_Cluster']
| CD64 | CD3 | CD19 | CD161 | CD11c | MHCII | Ly_6G | CD11b | FcERI | |
|---|---|---|---|---|---|---|---|---|---|
| SOM_Cluster | |||||||||
| 0 | 0.289525 | 0.326666 | 0.348314 | 0.248770 | 0.592562 | 0.578617 | 0.248489 | 0.596384 | 0.338891 |
| 1 | 0.171053 | 0.194912 | 0.698371 | 0.209242 | 0.197450 | 0.558759 | 0.181081 | 0.239758 | 0.186713 |
| 2 | 0.196070 | 0.620371 | 0.248329 | 0.236463 | 0.208413 | 0.229489 | 0.190163 | 0.224928 | 0.170684 |
Here, we can see that SOMOp created three clusters. This statistic
shows the center of that cluster in each of the input channels – this
will be useful later. The other thing that SOMOp does is it creates
a new condition in the experiment, also named the same as the module.
This classifies each event as a member of one of the clusters.
ex_som.data.head()
| CD161 | Live_Dead | Ly_6G | AmCyan | CD11b | CD64 | FcERI | Cell_Type | FSC_A | FSC_H | FSC_W | CD3 | CD19 | CD11c | Pacific_Blue | MHCII | Time | Single_Cell | Live | SOM_Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 867.849915 | 386.842163 | 79.696068 | 482.380005 | -79.137207 | 169.278854 | 188.839493 | B Cell | 87030.718750 | 80358.0 | 70977.937500 | 207.338821 | 14680.279297 | -281.543640 | 127.491379 | 4268.340332 | 6968.700195 | True | True | 1 |
| 1 | 327.388550 | -175.940399 | -89.582481 | 172.660004 | 51.409374 | 386.909424 | 334.144287 | B Cell | 96338.968750 | 87559.0 | 72107.617188 | 118.806480 | 12738.702148 | 447.330078 | -40.327629 | 1693.349609 | 5436.500000 | True | False | 1 |
| 2 | 493.974030 | -180.421280 | 151.270081 | 281.239990 | 26.474604 | -319.584473 | 375.361481 | B Cell | 95161.500000 | 88926.0 | 70131.398438 | -24.239075 | 11745.514648 | -80.861443 | -76.268112 | 3642.938721 | 2477.199951 | True | False | 1 |
| 3 | -324.122528 | -32.580925 | -214.568588 | 140.619995 | -46.702297 | 110.816521 | -97.811043 | B Cell | 85820.492188 | 79845.0 | 70440.625000 | 35.616337 | 13543.582031 | 98.451004 | 90.043648 | 2853.914062 | 1077.800049 | True | True | 1 |
| 4 | 58.998810 | 4059.355469 | -3.835757 | 125.489998 | 424.426666 | -149.123917 | 266.677643 | B Cell | 44717.398438 | 40482.0 | 72392.656250 | -50.284782 | 16535.718750 | -282.773895 | 15.052464 | 3099.647705 | 2372.300049 | True | False | 1 |
So let’s see how well we did in separating out the cell types. (Even
before we do, though, it is notable that SOMOp gave us three
clusters – but we started with 8 distinct cell types!) Let’s just
count the number of each cell type that ended up in each cluster and
see where we end up.
ex_som_count = flow.ChannelStatisticOp(name = "SOM_Count",
channel = "FSC_A",
function = len,
by = ["SOM_Cluster", "Cell_Type"],
subset = "Single_Cell == True & Live == True").apply(ex_som)
flow.MatrixView(statistic = "SOM_Count",
style = "pie",
variable = "Cell_Type",
feature = "FSC_A",
yfacet = "SOM_Cluster").plot(ex_som_count)
Well. We got three clusters – one is mostly B cells, one is mostly T cells (with some NK and NK T cells), and one is “everything else” – DCs, macrophages, basophils, neutrohpils. The reason we ended up with only three clusters here is because most of the cells in the data set are B and T cells!
Can we do better? By default, the SOMOp operation uses consensus
clustering to find the “natural” number of clusters – but sometimes we
want more resolution. Remember that each neuron in the self-organizing
map actually defines a cluster, so the “natural” clusters are actually
clusters of clusters!
You can disable the consensus clustering by setting the
consensus_cluster attribute of the SOMOp to False. If you’ve
already trained the SOM, you don’t have to re-train it! Just call
update_consensus_clusters() after changing consensus_cluster or
any of the consensus clustering parameters, then call apply() again.
som_op.consensus_cluster = False
som_op.update_consensus_clusters()
ex_som = som_op.apply(ex_live)
ex_som.statistics["SOM_Cluster"]
| CD64 | CD3 | CD19 | CD161 | CD11c | MHCII | Ly_6G | CD11b | FcERI | |
|---|---|---|---|---|---|---|---|---|---|
| SOM_Cluster | |||||||||
| 0 | 0.226738 | 0.121663 | 0.711139 | 0.096507 | 0.176297 | 0.470379 | 0.228657 | 0.249910 | 0.142132 |
| 1 | 0.175665 | 0.259865 | 0.717535 | 0.093570 | 0.098668 | 0.535609 | 0.215967 | 0.163130 | 0.225028 |
| 2 | 0.206152 | 0.276354 | 0.702472 | 0.126244 | 0.224442 | 0.467791 | 0.252755 | 0.290158 | 0.174743 |
| 3 | 0.204389 | 0.142635 | 0.689291 | 0.169682 | 0.482463 | 0.546307 | 0.163354 | 0.303589 | 0.163733 |
| 4 | 0.352854 | 0.221938 | 0.699500 | 0.184988 | 0.325416 | 0.515196 | 0.189641 | 0.354996 | 0.394311 |
| 5 | 0.518091 | 0.279517 | 0.358920 | 0.257161 | 0.439005 | 0.520271 | 0.185274 | 0.531840 | 0.543711 |
| 6 | 0.204237 | 0.282069 | 0.247663 | 0.241629 | 0.717699 | 0.619190 | 0.172715 | 0.496648 | 0.229111 |
| 7 | 0.154476 | 0.121151 | 0.697525 | 0.155817 | 0.160260 | 0.576676 | 0.277737 | 0.271750 | 0.227268 |
| 8 | 0.219407 | 0.182614 | 0.694765 | 0.122287 | 0.213434 | 0.543111 | 0.224136 | 0.152105 | 0.107418 |
| 9 | 0.168990 | 0.109799 | 0.721523 | 0.304668 | 0.135357 | 0.508849 | 0.126979 | 0.265715 | 0.163497 |
| 10 | 0.281169 | 0.150034 | 0.710293 | 0.195567 | 0.220006 | 0.568767 | 0.159527 | 0.239138 | 0.207659 |
| 11 | 0.235297 | 0.178217 | 0.707471 | 0.205426 | 0.270274 | 0.483761 | 0.112239 | 0.261615 | 0.081876 |
| 12 | 0.273329 | 0.273755 | 0.389782 | 0.257516 | 0.292130 | 0.279665 | 0.502135 | 0.662274 | 0.257203 |
| 13 | 0.163488 | 0.229435 | 0.166235 | 0.601458 | 0.491761 | 0.297294 | 0.189608 | 0.449072 | 0.211414 |
| 14 | 0.206379 | 0.106992 | 0.709627 | 0.200588 | 0.092779 | 0.549445 | 0.127033 | 0.188463 | 0.164182 |
| 15 | 0.240396 | 0.305606 | 0.694820 | 0.130993 | 0.160840 | 0.586773 | 0.180355 | 0.277189 | 0.174716 |
| 16 | 0.115209 | 0.177491 | 0.691077 | 0.160586 | 0.138238 | 0.625112 | 0.200368 | 0.175071 | 0.129453 |
| 17 | 0.125127 | 0.145870 | 0.693729 | 0.199691 | 0.230885 | 0.608198 | 0.165056 | 0.125243 | 0.195383 |
| 18 | 0.194069 | 0.169715 | 0.626780 | 0.241258 | 0.263665 | 0.373588 | 0.184766 | 0.249572 | 0.178180 |
| 19 | 0.254950 | 0.192477 | 0.664192 | 0.143500 | 0.142086 | 0.159576 | 0.204632 | 0.201929 | 0.165504 |
| 20 | 0.227084 | 0.595421 | 0.310775 | 0.239730 | 0.325393 | 0.262197 | 0.203936 | 0.294080 | 0.240895 |
| 21 | 0.136347 | 0.201825 | 0.703740 | 0.119259 | 0.299785 | 0.594602 | 0.212150 | 0.234938 | 0.200145 |
| 22 | 0.156384 | 0.241592 | 0.698324 | 0.270967 | 0.216938 | 0.561354 | 0.135102 | 0.137199 | 0.082454 |
| 23 | 0.125088 | 0.162495 | 0.681556 | 0.221286 | 0.237698 | 0.556740 | 0.207873 | 0.250207 | 0.097868 |
| 24 | 0.137548 | 0.214600 | 0.717823 | 0.173933 | 0.183517 | 0.543959 | 0.204044 | 0.125678 | 0.241090 |
| 25 | 0.196544 | 0.289173 | 0.628371 | 0.157156 | 0.203454 | 0.423672 | 0.201176 | 0.186139 | 0.195378 |
| 26 | 0.228521 | 0.557851 | 0.303835 | 0.196637 | 0.179855 | 0.141759 | 0.191598 | 0.224525 | 0.149455 |
| 27 | 0.197655 | 0.641699 | 0.216108 | 0.244526 | 0.204998 | 0.208955 | 0.184996 | 0.132743 | 0.129376 |
| 28 | 0.085399 | 0.248650 | 0.717835 | 0.203723 | 0.166089 | 0.639234 | 0.192483 | 0.190403 | 0.228592 |
| 29 | 0.109112 | 0.148866 | 0.745164 | 0.091180 | 0.129494 | 0.632626 | 0.191428 | 0.268908 | 0.173837 |
| 30 | 0.098192 | 0.176842 | 0.709552 | 0.210121 | 0.231526 | 0.616483 | 0.156820 | 0.338396 | 0.246745 |
| 31 | 0.180190 | 0.225396 | 0.701726 | 0.328079 | 0.165035 | 0.563848 | 0.116712 | 0.219985 | 0.242152 |
| 32 | 0.172369 | 0.568534 | 0.435958 | 0.205659 | 0.231257 | 0.340581 | 0.183036 | 0.271614 | 0.160560 |
| 33 | 0.203379 | 0.630036 | 0.256173 | 0.233902 | 0.208375 | 0.185172 | 0.177039 | 0.240995 | 0.200623 |
| 34 | 0.218481 | 0.651411 | 0.322282 | 0.198405 | 0.193204 | 0.142562 | 0.189204 | 0.284626 | 0.140142 |
| 35 | 0.161228 | 0.227828 | 0.696832 | 0.227831 | 0.206264 | 0.506996 | 0.148675 | 0.314344 | 0.187523 |
| 36 | 0.142372 | 0.127178 | 0.667954 | 0.325671 | 0.268454 | 0.574391 | 0.121194 | 0.263416 | 0.132923 |
| 37 | 0.177250 | 0.180344 | 0.686478 | 0.336477 | 0.226248 | 0.484226 | 0.092654 | 0.133812 | 0.187389 |
| 38 | 0.224196 | 0.196167 | 0.677810 | 0.244612 | 0.125569 | 0.416730 | 0.175917 | 0.167268 | 0.178349 |
| 39 | 0.236056 | 0.611020 | 0.376443 | 0.193382 | 0.214096 | 0.124856 | 0.193684 | 0.150977 | 0.154844 |
| 40 | 0.170684 | 0.636370 | 0.187967 | 0.220855 | 0.191435 | 0.308774 | 0.182804 | 0.146853 | 0.206397 |
| 41 | 0.185302 | 0.582613 | 0.160439 | 0.233339 | 0.188856 | 0.275988 | 0.193168 | 0.300069 | 0.133802 |
| 42 | 0.164954 | 0.173946 | 0.703948 | 0.158942 | 0.236451 | 0.596010 | 0.191247 | 0.346326 | 0.085774 |
| 43 | 0.077604 | 0.172721 | 0.672837 | 0.298129 | 0.159213 | 0.610449 | 0.103910 | 0.260129 | 0.183249 |
| 44 | 0.226385 | 0.246277 | 0.706517 | 0.319626 | 0.136734 | 0.561341 | 0.129759 | 0.111565 | 0.110477 |
| 45 | 0.131606 | 0.245083 | 0.740210 | 0.332164 | 0.251086 | 0.600782 | 0.152830 | 0.315986 | 0.187806 |
| 46 | 0.209071 | 0.476744 | 0.263280 | 0.499514 | 0.205978 | 0.197310 | 0.156061 | 0.259597 | 0.153490 |
| 47 | 0.164869 | 0.656669 | 0.090110 | 0.289076 | 0.228309 | 0.305352 | 0.138552 | 0.287980 | 0.189140 |
| 48 | 0.157298 | 0.564193 | 0.084960 | 0.275941 | 0.253021 | 0.335147 | 0.166746 | 0.153386 | 0.172314 |
By default, SOMOp uses a 7x7 grid of neurons, so now we’ve got 49
clusters instead of 3. Let’s recompute the count statistic from above:
ex_som_count = flow.ChannelStatisticOp(name = "SOM_Count",
channel = "FSC_A",
function = len,
by = ["SOM_Cluster", "Cell_Type"],
subset = "Single_Cell == True & Live == True").apply(ex_som)
This is quite a lot of clusters – how can we make sense of them?
Visualizing a self-organizing map with a minimum-spanning tree#
Remember how the SOMOp module created a statistic with the clusters
and their centers? (Of course you do – it’s two cells above this one.)
Those cluster centers can be used to plot a minimum spanning tree
(MST) that shows the clusters’ relationships to eachother in
high-dimensional space. We can use the MSTView view to create a MST.
Note that the SOM_Cluster statistic is only used for the locations
of each cluster in the tree. What gets plotted at those locations is
another statistic – in this case, the SOM_Count statistic that
contains the number of events from each cell type in each cluster. We’ll
plot a pie chart at each location, which is a good way to visualize the
relative abundance of each cell type in each cluster.
flow.MSTView(style = "pie",
statistic = "SOM_Count",
variable = "Cell_Type",
feature = "FSC_A",
locations = "SOM_Cluster",
metric = 'euclidean').plot(ex_som_count)
That’s a little more like it! Lots of T and B cells – as expected – but now, because we have a higher resolution “map” of the high-dimensional space, we can see that the other cell types are also clustering together! There are individual clusters of (mostly) NK, NK T, DC, Neutrophil, and macrophages. The only one missing its own cluster are the basophils – and there are so few of them in this data set that that’s not hugely surprising. Perhaps a SOM with more neurons would resolve them – feel free to play around and find out!
Remember, though, that here we have the ground truth in this data set, and usually you won’t. Let’s use the same tree to plot different data – in this case, the geometric mean of each of the 9 marker channels.
First, we need to create a new statistic. We’ll used
FrameStatisticOp to break the data set apart by different values of
SOM_Cluster, then compute flow.geom_mean() on each channel in
each subset.
op_marker_stat = flow.FrameStatisticOp(name = "SOM_Marker_Mean",
by = ["SOM_Cluster"],
function = lambda x: pd.Series({"CD64" : flow.geom_mean(x["CD64"]),
"CD3" : flow.geom_mean(x["CD3"]),
"CD19" : flow.geom_mean(x["CD19"]),
"CD161" : flow.geom_mean(x["CD161"]),
"CD11c" : flow.geom_mean(x["CD11c"]),
"MHCII" : flow.geom_mean(x["MHCII"]),
"Ly_6G" : flow.geom_mean(x["Ly_6G"]),
"CD11b" : flow.geom_mean(x["CD11b"]),
"FcERI" : flow.geom_mean(x["FcERI"])}),
subset = "Single_Cell == True & Live == True")
ex_marker = op_marker_stat.apply(ex_som_count)
ex_marker.statistics["SOM_Marker_Mean"].head()
| CD64 | CD3 | CD19 | CD161 | CD11c | MHCII | Ly_6G | CD11b | FcERI | |
|---|---|---|---|---|---|---|---|---|---|
| SOM_Cluster | |||||||||
| 0 | 48.881560 | -113.307490 | 13527.244066 | -541.400374 | -78.887474 | 1404.630782 | 68.391732 | 147.235850 | -12.864664 |
| 1 | -37.986999 | 123.798359 | 13706.225894 | -556.902879 | -294.634000 | 2751.383944 | 60.797191 | -30.524994 | 80.620049 |
| 2 | 3.975231 | 168.311116 | 12865.033204 | -371.834672 | 50.123072 | 1514.621438 | 148.298198 | 254.854666 | 20.959845 |
| 3 | -23.783888 | -39.767408 | 14540.616210 | -102.483144 | 1896.639668 | 3792.009839 | -25.314420 | 337.753416 | 34.470699 |
| 4 | 461.930846 | 41.467228 | 15011.048700 | -88.127247 | 267.814390 | 2965.898558 | 23.121447 | 469.708267 | 702.696616 |
Now, we can use MSTView in a slightly different way. First, we’ll
make a “petal plot” instead of a pie plot at each location. And second,
if we specify a statistic but not the variable from that statistic, the
columns are used as the categories. (Was this functionality
implemented for precisely this use case? Yes, yes it was.)
flow.MSTView(style = "petal",
statistic = "SOM_Marker_Mean",
locations = "SOM_Cluster").plot(ex_marker, lw = 0.3, radius = 0.035)
Now we can see that the fairly obvious classes of cells and their marker levels. High CD19 (and mostly low owther things) are B cells; high CD3 (and mostly low other things) are T cells. But there are a few clusters that are different, and those correspond to the other cell types.
t-distributed Stochastic Neighbor Embedding#
Self-organizing maps (and other clustering algorithms like K-means and FlowPeaks) are classifiers – they take points in a high-dimensional space and sort them into bins based on a how close they are to eachother. These algorithms consider all of the dimensions – in this case, all 9 of the channels – but they are subject to the curse of dimensionality where increased numbers of dimensions make distance-based algorithms begin to fail.
Another approach is to reduce the number of dimensions, embedding the original high-dimensional data set into a lower-dimensional (usually 2) space. The trick is to do so in a way that retains the structure, keeping “close” observations in the higher-dimensional space still “close” in the lower-dimensional embedding.
t-distributed Stochastic Neighbor Embedding is an algorithm that promises to do just that. It is one of many non-linear dimensionality reduction methods – its benefit over linear dimensionality reductions such as principal components analysis (PCA) is that is more faithfully maintains local structure.
This comes with a cost, of course, and that cost is computational
complexity! On this fairly modest data set, the following cell takes
over four minutes to run. So be patient! The results are worth it, I
promise. The package we use, openTSNE, prints status updates so you
don’t get bored or think it’s crashed. Also, the tSNE algorithm can also
peform better or worse using different ways of measuring “distance” in
the original high-dimensional space. For two or three channels,
euclidean is fine, but for higher numbers of channels cosine
seems to work better. Finally, this performs much better on scaled
data, so we’re using logicle for all of the channels. Just as with
SOMOp, there are a number of parameters that can change the
performance of the algorithm. Read the tSNEop class documentation
for details.
tsne_op = flow.tSNEOp(name = "tSNE",
channels = ["CD64",
"CD3",
"CD19",
"CD161",
"CD11c",
"MHCII",
"Ly_6G",
"CD11b",
"FcERI"],
scale = {"CD64" : "logicle",
"CD3" : "logicle",
"CD19" : "logicle",
"CD161" : "logicle",
"CD11c" : "logicle",
"MHCII" : "logicle",
"Ly_6G" : "logicle",
"CD11b" : "logicle",
"FcERI" : "logicle"},
metric = "cosine")
tsne_op.estimate(ex_som_count)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, metric='cosine', n_jobs=8, perplexity=10,
verbose=True)
--------------------------------------------------------------------------------
===> Finding 30 nearest neighbors using Annoy approximate search using cosine distance...
--> Time elapsed: 0.20 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.02 seconds
===> Calculating PCA-based initialization...
--> Time elapsed: 0.01 seconds
===> Running optimization with exaggeration=12.00, lr=198.25 for 250 iterations...
Iteration 50, KL divergence 4.6000, 50 iterations in 45.6969 sec
Iteration 100, KL divergence 4.5668, 50 iterations in 43.1384 sec
Iteration 150, KL divergence 4.5568, 50 iterations in 41.4098 sec
Iteration 200, KL divergence 4.5513, 50 iterations in 41.3300 sec
Iteration 250, KL divergence 4.5480, 50 iterations in 43.0481 sec
--> Time elapsed: 214.62 seconds
===> Running optimization with exaggeration=1.00, lr=2379.00 for 500 iterations...
Iteration 50, KL divergence 2.2273, 50 iterations in 25.8856 sec
Iteration 100, KL divergence 2.0122, 50 iterations in 25.0921 sec
Iteration 150, KL divergence 1.9150, 50 iterations in 25.4421 sec
Iteration 200, KL divergence 1.8603, 50 iterations in 26.1375 sec
Iteration 250, KL divergence 1.8237, 50 iterations in 25.3474 sec
Iteration 300, KL divergence 1.7989, 50 iterations in 25.7548 sec
Iteration 350, KL divergence 1.7796, 50 iterations in 25.6185 sec
Iteration 400, KL divergence 1.7653, 50 iterations in 25.9954 sec
Iteration 450, KL divergence 1.7547, 50 iterations in 25.9668 sec
Iteration 500, KL divergence 1.7458, 50 iterations in 25.8475 sec
--> Time elapsed: 257.09 seconds
(If you saw an error about “No child process”, don’t worry about it – it doesn’t affect the outcome.)
Apply the operation, and note that it adds two more “channels” to the
experiment – named the same as the operation name, with _1 and
_2 appended.
ex_tsne = tsne_op.apply(ex_som_count)
ex_tsne.data.head()
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
--> Time elapsed: 10.52 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.25 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
--> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration 50, KL divergence 3948148.9583, 50 iterations in 2.9153 sec
Iteration 100, KL divergence 3915482.6764, 50 iterations in 2.7796 sec
Iteration 150, KL divergence 3895033.2699, 50 iterations in 2.8729 sec
Iteration 200, KL divergence 3880748.6666, 50 iterations in 2.8130 sec
Iteration 250, KL divergence 3869923.7153, 50 iterations in 2.9048 sec
--> Time elapsed: 14.29 seconds
| CD161 | Live_Dead | Ly_6G | AmCyan | CD11b | CD64 | FcERI | Cell_Type | FSC_A | FSC_H | ... | CD19 | CD11c | Pacific_Blue | MHCII | Time | Single_Cell | Live | SOM_Cluster | tSNE_1 | tSNE_2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 867.849915 | 386.842163 | 79.696068 | 482.380005 | -79.137207 | 169.278854 | 188.839493 | B Cell | 87030.718750 | 80358.0 | ... | 14680.279297 | -281.543640 | 127.491379 | 4268.340332 | 6968.700195 | True | True | 31 | 20.072726 | -66.822968 |
| 1 | 327.388550 | -175.940399 | -89.582481 | 172.660004 | 51.409374 | 386.909424 | 334.144287 | B Cell | 96338.968750 | 87559.0 | ... | 12738.702148 | 447.330078 | -40.327629 | 1693.349609 | 5436.500000 | True | False | 4 | -3.875199 | -19.182361 |
| 2 | 493.974030 | -180.421280 | 151.270081 | 281.239990 | 26.474604 | -319.584473 | 375.361481 | B Cell | 95161.500000 | 88926.0 | ... | 11745.514648 | -80.861443 | -76.268112 | 3642.938721 | 2477.199951 | True | False | 31 | 19.317669 | -47.476411 |
| 3 | -324.122528 | -32.580925 | -214.568588 | 140.619995 | -46.702297 | 110.816521 | -97.811043 | B Cell | 85820.492188 | 79845.0 | ... | 13543.582031 | 98.451004 | 90.043648 | 2853.914062 | 1077.800049 | True | True | 11 | 37.330137 | -14.398697 |
| 4 | 58.998810 | 4059.355469 | -3.835757 | 125.489998 | 424.426666 | -149.123917 | 266.677643 | B Cell | 44717.398438 | 40482.0 | ... | 16535.718750 | -282.773895 | 15.052464 | 3099.647705 | 2372.300049 | True | False | 30 | 34.690254 | -33.893300 |
5 rows × 22 columns
Let’s plot the results, colored by “ground truth” cell type.
flow.ScatterplotView(xchannel = "tSNE_1",
ychannel = "tSNE_2",
huefacet = "Cell_Type").plot(ex_tsne, s = 0.1, alpha = 0.1)
Excellent! We’re seeing “natural” clusters where (most) clusters seem to represent a single cell type. NK Ts are a little close to the T cells, and the basophils are hiding again, but the others seem to be relatively clearly clustered.
…where are those sneaky basophils?
flow.ScatterplotView(xchannel = "tSNE_1",
ychannel = "tSNE_2",
huefacet = "Cell_Type", subset = "Cell_Type == 'Basophil'").plot(ex_tsne, s = 0.1, alpha = 1.0)
They’re hiding with their friends, the macrophages!
Again, we usually won’t have the ground truth – so it’s again good to
evaluate the clusters by plotting the relative amounts of each marker in
each cluster. The following graphs do so with the huechannel and
huescale attributes of ScatterplotView, which relate the color
of each event to the (scaled) value of a channel. And just as the source
paper uses different color palettes for each different marker, so too
here are different palettes with different vibes. The last,
viridis_r, is my favorite and thus is the default if no palette is
specified. For a comprehensive list of the options, see
https://seaborn.pydata.org/tutorial/color_palettes.html and
https://matplotlib.org/stable/users/explain/colors/colormaps.html
flow.ScatterplotView(xchannel = "tSNE_1",
ychannel = "tSNE_2",
huechannel = "CD11b",
huescale = "logicle",
subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "coolwarm")
flow.ScatterplotView(xchannel = "tSNE_1",
ychannel = "tSNE_2",
huechannel = "CD11c",
huescale = "logicle",
subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Blues")
flow.ScatterplotView(xchannel = "tSNE_1",
ychannel = "tSNE_2",
huechannel = "CD161",
huescale = "logicle",
subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Greens")
flow.ScatterplotView(xchannel = "tSNE_1",
ychannel = "tSNE_2",
huechannel = "CD19",
huescale = "logicle",
subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Oranges")
flow.ScatterplotView(xchannel = "tSNE_1",
ychannel = "tSNE_2",
huechannel = "CD3",
huescale = "logicle",
subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Reds")
flow.ScatterplotView(xchannel = "tSNE_1",
ychannel = "tSNE_2",
huechannel = "CD64",
huescale = "logicle",
subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Purples")
flow.ScatterplotView(xchannel = "tSNE_1",
ychannel = "tSNE_2",
huechannel = "FcERI",
huescale = "logicle",
subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "Spectral_r")
flow.ScatterplotView(xchannel = "tSNE_1",
ychannel = "tSNE_2",
huechannel = "Ly_6G",
huescale = "logicle",
subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "plasma_r")
flow.ScatterplotView(xchannel = "tSNE_1",
ychannel = "tSNE_2",
huechannel = "MHCII",
huescale = "logicle",
subset = "Single_Cell == True & Live == True").plot(ex_tsne, s = 0.1, palette = "viridis_r")
Two final notes. First, because t-SNE is stochastic, and because it is initialized randomly, you will get a somewhat different plot each time you run it. This is expected behavior.
And second, there is a natural cluster in the tSNE plots that was
assigned to the T Cell cell type using manual gating, but is
definitely distinct from the main T cell blob on the tSNE plot. It is
BOTH CD19+ and CD3+. When you go back to look at the MST, they
show up in a couple of clusters there too. I am not an immunologist, but
Google thinks that this population should not exist. I don’t know what
it’s doing there!