Cygnus Example Analysis - A Tutorial • Cygnus

This tutorial covers reading a processed input file from AIVIA to generate an example report and 3D plots required for interactive data visualization. The instructions for package installation can be found here.

For this tutorial, we will use example data from cell line derived extracellular vesicle imaging data. The data can be downloaded here.

library(Cygnus)
library(viridis)
library(plotly)

Creating a Cygnus Object

To create a Cygnus object, specify metadata and markers columns from the input .csv file. You can use an interactive Shiny app or directly provide the column names.

data.path <- "../inst/extdata/all_cells.csv" 
# cyg <- CreateCygnus(data.path) # Starts the interactive app

cyg <- CreateCygnus(data.path,
                    markers_col = c("PanEV", "EpCAM", "MET",
                                    "SDC1", "EGFR", "ADAM10",
                                    "CTSH", "PDL1", "HER2"),
                    meta_col = "celltype")

cyg

## Summary of CygnusObject:
## Slot: [1] "Raw_Score"
## Numer of EVs: [1] 3264

Object Structure and Metadata

When creating a Cygnus object using the CreateCygnus() function, the meta_col parameter refers to a column in the input .csv file that contains additional information about each EV. This may include sample annotations (e.g., indicating different cell lines), morphological characteristics (e.g., circularity), or clustering details.

Cygnus objects also store metadata for each marker. This can include information such as marker relevance (i.e., whether a marker should be included for clustering), which will be demonstrated later in this tutorial. To view the overall structure of the object,

# view(cyg)

You can also add metadata from another file using the following format:

cyg@ev_meta[['type']] <- "cell_line" 

# To view all metadata, 
str(cyg@ev_meta)

## List of 2
##  $ celltype: chr [1:3264] "A549" "A549" "A549" "A549" ...
##  $ type    : chr "cell_line"

Pre-processing and Normalization

For exploratory analysis, examining the distribution of expression scores can be useful. This helps identify potential outliers and informs decisions around scaling or normalization to address batch effects or enable meaningful comparisons between markers.

Visualizing Marker Intensity Distribution

To visualize the score distributions for all markers:

plotDistribution(cyg)

Sometimes, marker distributions can vary between samples. For example, in this case, the expression distribution of EpCAM (one of the markers used) differs depending on the cell line. To group distributions based on metadata, you can use the group_by parameter:

plotDistribution(
    obj = cyg,
    plot_markers = "EpCAM",
    matrix = "Raw_Score",
    group_by = "celltype",
    group_colors = c("A549" = "#FF7966", "H2228" = "#CB8FF8", "KP" = "#7CC0F6")
)

You can see that most EVs from the A549 and H2228 cell lines have EpCAM expression values near 0, while those from the KP1.9 cell line are more centered around 1000. This could reflect one of two possibilities: (1) EVs from the KP1.9 cell line genuinely express higher levels of EpCAM, accurately capturing the underlying biology, or (2) there may be a technical bias—such as differences in staining efficiency, imaging intensity, or batch effects—that artificially inflates the measured scores in KP1.9 samples.

plotDistribution(
    obj = cyg,
    plot_markers = "PanEV",
    matrix = "Raw_Score",
    group_by = "celltype",
    group_colors = c("A549" = "#FF7966", "H2228" = "#CB8FF8", "KP" = "#7CC0F6")
)

Indeed, when examining the distribution of PanEV, we see that its expression is consistently higher across all KP1.9 EVs. In this context, PanEV refers to a set of markers used to detect the presence of EVs. This overall shift in marker expression suggests a potential global bias, which can be addressed through normalization, as demonstrated in the following section.

Normalizing by PanEV

PanEV expression often correlates with EV size. The following function normalizes marker intensities by PanEV expression to account for this effect.

cyg <- normalizeByPanEV(cyg, "PanEV")
plotDistribution(cyg, matrix = "normalized_exp_mat") +
  theme(text = element_text(size = 8))

## NULL

plotDistribution(
    obj = cyg,
    plot_markers = "EpCAM",
    matrix = "normalized_exp_mat",
    group_by = "celltype",
    group_colors = c("A549" = "#FF7966", "H2228" = "#CB8FF8", "KP" = "#7CC0F6")
    
)

Scaling Expression Matrices

Some markers may show a wider range of expression due to technical or experimental variation. To account for this and enable meaningful comparisons across markers, the expression matrix can be scaled.

cyg <- scaleExpressionMatrix(cyg)
plotDistribution(cyg, matrix = "scaled_exp_matrix")

# All expression values are between 0 and 1 now!

It is also possible to normalize by PanEV first, then scale the normalized scores to enable comparisons across markers. Cygnus provides a flexible analysis workflow designed to accommodate diverse data types and varying analytical needs.

Selection of Relevant Markers

To visualize differences in average marker expression across metadata groups, you can create a heatmap using the following function:

plotAvgHeatmap(cyg, "celltype", scale = "row")

Here, you can see that PanEV is highly expressed across all cell lines. As mentioned earlier, this is expected because PanEV represents a group of markers used to detect the presence of EVs. Therefore, if PanEV doesn’t provide meaningful variation, it might be best to exclude it from downstream analysis.

Relevant markers can be selected using the following method:

cyg <- markRelevantMarkers(cyg, c("EpCAM", "MET",
                               "SDC1", "EGFR", "ADAM10",
                               "CTSH", "PDL1", "HER2"))

Marker relevancy is stored as logical values in the markers_meta slot of the object. You can update these values at any point during the analysis to easily include or exclude markers as needed.

cyg@markers_meta

## $marker
## [1] "PanEV"  "EpCAM"  "MET"    "SDC1"   "EGFR"   "ADAM10" "CTSH"   "PDL1"  
## [9] "HER2"  
## 
## $relevant
##  PanEV  EpCAM    MET   SDC1   EGFR ADAM10   CTSH   PDL1   HER2 
##  FALSE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE

Now, we can only plot relevant markers using the AverageHeatmap functionality:

plotAvgHeatmap(cyg, "celltype", only_relevant_markers = TRUE, scale = "row")

Analyzing Marker Co-localization

Cygnus supports data analysis from two complementary perspectives: marker-centered and EV-centered. In other words, it can focus either on the behavior and interactions of markers or on the characteristics of individual EVs. For example, when focusing on markers, you can explore the likelihood of co-localization among a set of markers. Alternatively, the EV-centered workflow helps identify rare EV subtypes. These two perspectives are closely related and can often be visualized together. We will begin by exploring methods to analyze marker co-localization.

Creating a Binary Matrix

To determine whether a set of markers are present simultaneously on EVs, expression levels for each marker can be converted into binary sscores: 1 if expressed, 0 if not. This binarization is particularly useful because it allows co-localization to be analyzed using set-based approaches—treating presence as membership in a group (1) and absence as non-membership (0).

In this example, a hard threshold of 1000 is applied: if the Raw_Score is greater than 1000, the binary score is 1; if less, it’s 0. For greater flexibility, thresholds can be specified as a vector with unique cutoffs for each marker. Cygnus also offers alternative methods to determine thresholds, such as using IgG distributions to correct for non-specific binding.

cyg <- createBinaryMatrix(cyg, thresholds = 5000)
head(cyg@matrices$binary_exp_matrix)

##   PanEV EpCAM MET SDC1 EGFR ADAM10 CTSH PDL1 HER2
## 1     1     0   0    0    0      0    0    0    0
## 2     1     0   0    0    0      0    0    0    0
## 3     1     0   0    0    0      0    0    0    0
## 4     1     0   0    0    0      0    0    0    0
## 5     1     0   0    0    0      0    0    0    0
## 6     1     0   0    0    0      0    0    0    0

# Distribution can be plotted as well 
# You can see that everything is either 0 or 1!

#plotDistribution(cyg, matrix = "binary_exp_matrix") +
#  theme(text = element_text(size = 8))

Computing Deviations and p-values

Cygnus calculates the deviation—the difference between observed and expected counts—for each marker intersection. It then estimates p-values by running simulations under the null hypothesis that all markers are independent. More details can be found here.

colocal_marker <- getColocalizedMarkers(cyg)

## [1] "Calculating expected and observed probabilities..."
## [1] "Running simulation to calculate the null distribution of deviations..."

head(colocal_marker)

##   EpCAM MET SDC1 EGFR ADAM10 CTSH PDL1 HER2 Intersection Observed
## 1     0   0    0    0      0    0    0    0     00000000     2430
## 2     0   0    0    0      0    0    0    0     00000000     2430
## 3     0   0    0    0      0    0    0    0     00000000     2430
## 4     0   0    0    0      0    0    0    0     00000000     2430
## 5     0   0    0    0      0    0    0    0     00000000     2430
## 6     0   0    0    0      0    0    0    0     00000000     2430
##   Expected_Probability Expected_Count Actual_Deviation MeanDeviation
## 1            0.5569637       1817.929         612.0705     -1.899493
## 2            0.5569637       1817.929         612.0705     -1.899493
## 3            0.5569637       1817.929         612.0705     -1.899493
## 4            0.5569637       1817.929         612.0705     -1.899493
## 5            0.5569637       1817.929         612.0705     -1.899493
## 6            0.5569637       1817.929         612.0705     -1.899493
##   SDDeviation p_value p_adj p_adj_capped log10_p_adj significant
## 1    12.50572       0     0       1e-300         300           *
## 2    12.50572       0     0       1e-300         300           *
## 3    12.50572       0     0       1e-300         300           *
## 4    12.50572       0     0       1e-300         300           *
## 5    12.50572       0     0       1e-300         300           *
## 6    12.50572       0     0       1e-300         300           *

After computing deviations and their associated p-values, marker co-localizations can be visualized using an UpSet plot from the ComplexUpset package.

# plotUpSet(cyg, markers = c("EpCAM", "MET",
#                            "SDC1", "EGFR", "ADAM10",
#                            "CTSH", "PDL1", "HER2"), 
#           nsets = 10, keep.order = TRUE)


plotUpset(cyg, colocal_marker, threshold_count = 10)

Alternatively, a volcano plot can be used to more easily visualize significant intersections. Here, each intersection is labelled as a binary code, in the order of relevant markers.

plotVolcano(colocal_marker)

Dimensionality Reduction and Clustering

Dimensionality Reduction

Because multiplex imaging data is high-dimensional, dimensionality reduction is essential for effective visualization and interpretation. Cygnus supports PCA, t-SNE, and UMAP, each of which can be run and visualized using separate functions, as demonstrated here:

PCA

cyg <- runPCA(cyg, matrix_name = "normalized_exp_mat")
plotPCA(cyg, color_by = "celltype",plot_3d = T)

Note that we can run PCA with the raw, and un-normalized matrix as well:

cyg <- runPCA(cyg, matrix_name = "Raw_Score")
plotPCA(cyg, color_by = "celltype",plot_3d = T)

Here, you can observe that the KP1.9 cell line appears quite distinct from the other two cell lines, but this difference is effectively reduced after normalization.

t-SNE

set.seed(100)

cyg <- runTSNE(cyg, matrix_name = "normalized_exp_mat")
plotTSNE(cyg, color_by = "celltype", marker_size = 2, plot_3d = T)

It is also possible to color by marker expression in different matrices:

plotTSNE(cyg, color_by = "EpCAM", marker_size = 5, matrix_name = 'Raw_Score')

plotTSNE(cyg, color_by = "EpCAM", marker_size = 5, matrix_name = 'normalized_exp_mat')

plotTSNE(cyg, color_by = "EpCAM", marker_size = 5, matrix_name = 'binary_exp_matrix')

UMAP

cyg <- runUMAP(cyg, matrix_name = "normalized_exp_mat")

## 02:27:02 UMAP embedding parameters a = 0.9922 b = 1.112

## 02:27:02 Converting dataframe to numerical matrix

## 02:27:02 Read 3264 rows and found 8 numeric columns

## 02:27:02 Using FNN for neighbor search, n_neighbors = 30

## 02:27:03 Commencing smooth kNN distance calibration using 2 threads with target n_neighbors = 30
## 02:27:03 Initializing from normalized Laplacian + noise (using irlba)
## 02:27:03 Commencing optimization for 500 epochs, with 132488 positive edges
## 02:27:03 Using rng type: pcg
## 02:27:06 Optimization finished

plotUMAP(cyg, color_by = "celltype", marker_size = 5, plot_3d = T)

Clustering

Cygnus offers three different methods of unsupervised clustering: K-means, HDBSCAN, and leiden.

# HDBscan - Noise points are given a value of 0, so increment by 1.
cyg <- ClusterCygnus(CygnusObject = cyg, 
                     use.dims = "tSNE", 
                     relevant_markers = TRUE,
                     matrix_name = "scaled_exp_matrix", 
                     
                     clustering_method = "hdbscan",
                     hdb_min_cluster_size = 10)

plotTSNE(cyg, color_by = 'hdbscan_clusters', marker_size = 5)

# leiden
cyg <- ClusterCygnus(CygnusObject = cyg, 
                     use.dims = "tSNE", 
                     relevant_markers = TRUE,
                     matrix_name = "scaled_exp_matrix", 
                     
                     clustering_method = "leiden",
                     graph_distance=500,
                     leiden_resolution =0.1)

plotTSNE(cyg, color_by = 'leiden_clusters', marker_size = 5)

# No dim 
cyg <- ClusterCygnus(CygnusObject = cyg, 
                     use.dims = "None", 
                     relevant_markers = TRUE,
                     matrix_name = "scaled_exp_matrix", 
                     
                     clustering_method = "kmeans",
                     n_clusters = 3)

plotUMAP(cyg, color_by = 'k_means_clusters', marker_size = 5)

Clusters and their proportions can be visualized using the following approach:

plotBar(cyg, color_by = "celltype", split_by = "k_means_clusters")

Outputs

The processed Cygnus object can be stored in an .RDS format to facilitate data sharing and reproducible analysis.

#saveRDS(cyg, "specify_file_directory/processed_cyg.RDS")