---
title: "scapGNN: Graph Neural Network-Based Framework for Single Cell Active Pathways and Gene Modules Analysis"
author: "Xudong Han and Xuejiang Guo"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
vignette: >
%\VignetteIndexEntry{Graph Neural Network-Based Framework for Single Cell Active Pathways and Gene Modules Analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(scapGNN)
```
# Introduction
With the widespread availability of single-cell sequencing technology, numerous single-cell RNA-seq and ATAC-seq data are collected into public databases. However, the single-cell data have characteristics such as drop-out events and low library sizes. How to leverage these data to parse cellular heterogeneity at the pathway level and identify activated gene modules with multiple cell phenotypes is still a challenge.
Here, we propose `scapGNN`, an accurate, effective, and robust framework to comprehensively parse single-cell data at the pathway level from single-omics data or multi-omics integration data.
The `scapGNN` model takes single-cell gene expression profiles or the gene activity matrix of scATAC-seq as input data. As shown in Figure 1A, the computing architecture consists of three modules:
1. A depp neural network-based autoencoder, which is used to extract cellular features, gene features, and mine latent associations between genes and cells.
2. A graph convolutional network-based graph autoencoder, which is used to construct cell-cell association network and gene-gene association network.
3. Random walk with restart (RWR), which is used to calculate the activity score of a pathway or gene set in each cell.
For the SNARE-seq dataset, `scapGNN` can integrate gene-cell association networks from scATAC-seq and scRNA-seq into one combined network by the Brown's extension of the Fisher's combined probability test. According to the combined gene-cell association network, pathway activity scores or activated gene modules are supported by single-cell multi-omics (Figure 2). `scapGNN` also designs abundant visualization programs to clearly display the analysis results. Taken together, its capabilities enable `scapGNN` to calculate pathway activity scores in a single cell, assess pathway heterogeneity between cells, identify key active genes in pathway, mine gene modules under multiple cell phenotypes, and provide gene and cell network data (Figure 1B).
```{r echo = F, out.width = "100%"}
knitr::include_graphics("../inst/extdata/flow_diagram1.png")
```
**Figure 1** Overview of the scapGNN framework.
```{r echo = F, out.width = "100%"}
knitr::include_graphics("../inst/extdata/flow_diagram2.png")
```
**Figure 2** The workflow of integrating scRNA-seq data and scATAC-seq data by scapGNN.
# Data preprocessing
## single-cell RNA-seq data
Single-cell gene expression profiles data needs to pass `Seurat`'s pre-processing workflow. These represent the selection and filtration of cells based on QC metrics, data normalization, and the detection of highly variable features.
```
library(Seurat)
# Load the PBMC dataset (Case data for seurat)
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Our recommended data filtering is that only genes expressed as non-zero in more than
# 1% of cells, and cells expressed as non-zero in more than 1% of genes are kept.
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "case",
min.cells = 3, min.features = 200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
# Normalizing the data.
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize")
# Identification of highly variable features.
pbmc <- FindVariableFeatures(pbmc, selection.method = 'vst', nfeatures = 2000)
# Run Preprocessing() of scapGNN.
Prep_data <- Preprocessing(pbmc,verbose=FALSE)
```
Detailed procedures and descriptions can be found in vignettes of `Seurat` ().
```{r}
# Users can also directly input data in a data frame or matrix format which contains hypervariable genes and is log-transformed.
data("Hv_exp")
Prep_data <- Preprocessing(Hv_exp,verbose=FALSE)
summary(Prep_data)
```
## single-cell ATAC-seq data
First, the `Signac` package is used to quantify the activity of each gene in the genome by assessing the chromatin accessibility associated with each gene, and create a new gene activity matrix derived from the scATAC-seq data. For details, please refer to vignettes of `Signac` ().
As with single-cell gene expression profiles, the ensuing gene activity matrix is normalized and detected highly variable features by the `Seurat` package. Finally, the `Preprocessing()` of `scapGNN` further processes the data into the required format.
# Construct gene-cell association network
Using the `Preprocessing()` function results of scRNA-seq or scATAC-seq as input, the `ConNetGNN()` function evaluates whether gene-gene, cell-cell, and gene-cell are associated and levels of strength. Finally, an undirected and weighted gene-cell network is constructed and use for subsequent analysis.
The `ConNetGNN()` function is implemented based on `pytorch`, so an appropriate python environment is required:
- python >=3.9.7
- pytorch >=1.10.0 (CPU)
- sklearn >=0.0
- scipy >=1.7.3
- numpy >=1.19.5
We also provide environment files for conda: `/inst/extdata/scapGNN_env.yaml`. Users can install it with the command: `conda env create -f scapGNN_env.yaml`.
```
library(coop)
library(reticulate)
library(parallel)
# Data preprocessing
data("Hv_exp")
Prep_data <- Preprocessing(Hv_exp)
# Run ConNetGNN()
ConNetGNN_data <- ConNetGNN(Prep_data,python.path="../miniconda3/envs/scapGNN_env/python.exe")
```
```{r}
# View the content of the ConNetGNN() results.
data(ConNetGNN_data)
summary(ConNetGNN_data)
```
For the SNARE-seq dataset, the `ConNetGNN()` results of scRNA-seq and scATAC-seq can be integrated into a combined gene-cell association network via `InteNet()`.
```
library(ActivePathways)
library(parallel)
data(RNA_net)
data(ATAC_net)
RNA_ATAC_IntNet<-InteNet(RNA_net,ATAC_net,parallel.cores=1)
```
# Infer pathway activity score matrix at the single-cell level
For the gene-cell association network constructed by `ConNetGNN()`, the `scPathway()` function uses RWR algorithm to calculate the pathway activity score of each cell.
```
library(parallel)
library(utils)
# Load the result of the ConNetGNN function.
data(ConNetGNN_data)
# We recommend the use of a compiler.
# The compiler package can be used to speed up the operation.
# library(compiler)
# scPathway<- cmpfun(scPathway)
scPathway_data<-scPathway(ConNetGNN_data,gmt.path=system.file("extdata", "KEGG_human.gmt", package = "scapGNN"),parallel.cores=1)
```
The result `scPathway_data` of `scPathway()` is a matrix of pathway activity scores, with rows as pathways and columns as cells.
```{r,fig.width = 15,fig.height = 7}
data(scPathway_data)
scPathway_data[1:3,1:3]
```
```
library(pheatmap)
pheatmap(scPathway_data)
```
```{r echo = F, out.width = "100%"}
knitr::include_graphics("../inst/extdata/heatmap.png")
```
Use the `plotGANetwork()` function to plot the gene network of pathways in the specified cell phenotype.
```{r,fig.width = 7,fig.height = 7}
library(igraph)
# Load data.
data(ConNetGNN_data)
data("Hv_exp")
# Construct cell set.
index<-grep("0h",colnames(Hv_exp))
cellset<-colnames(Hv_exp)[index]
# Construct gene set.
pathways<-load_path_data(system.file("extdata", "KEGG_human.gmt", package = "scapGNN"))
geneset<-pathways[[which(names(pathways)=="Tight junction [PATH:hsa04530]")]]
plotGANetwork(ConNetGNN_data,cellset,geneset,vertex.label.dist=1.5,main = "Tight junction [PATH:hsa04530]")
```
The `plotGANetwork()` function can construct and display a cell community network.
```{r,fig.width = 7,fig.height = 7}
require(igraph)
require(graphics)
data(ConNetGNN_data)
# Construct the cell phenotype vector.
cell_id<-colnames(ConNetGNN_data[["cell_network"]])
temp<-unlist(strsplit(cell_id,"_"))
cell_phen<-temp[seq(2,length(temp)-1,by=3)]
names(cell_id)<-cell_phen
head(cell_id)
plotCCNetwork(ConNetGNN_data,cell_id,edge.width=10)
```
# Identification of activated gene modules
The `cpGModule()` can identify cell phenotype activated gene module.
```{r}
require(parallel)
require(stats)
# Load the result of the ConNetGNN function.
data(ConNetGNN_data)
data(Hv_exp)
# Construct the cell set corresponding to 0h.
index<-grep("0h",colnames(Hv_exp))
cellset<-colnames(Hv_exp)[index]
H9_0h_cpGM_data<-cpGModule(ConNetGNN_data,cellset,parallel.cores=1)
summary(H9_0h_cpGM_data)
```
Use the `plotGANetwork()` function to plot the gene network of module.
```{r,fig.width = 7,fig.height = 7}
library(igraph)
# Load data.
data(ConNetGNN_data)
data("Hv_exp")
data("H9_0h_cpGM_data")
# Construct cell set.
index<-grep("0h",colnames(Hv_exp))
cellset<-colnames(Hv_exp)[index]
# Construct gene set.
geneset<-H9_0h_cpGM_data$Genes
plotGANetwork(ConNetGNN_data,cellset,geneset,vertex.label.dist=1.5,main = "Gene network of 0h cells activated gene module")
```
For multiple cellular phenotypes (e.g. classification, time stage, etc.), `cpGModule()` identifies the activated gene modules individually. The `plotMulPhenGM()` function then quantifies the activation strength of genes in different cell phenotypes and displays gene networks under multiple cell phenotypes.
```{r,fig.width = 7,fig.height = 7}
require(igraph)
require(grDevices)
# Load the result of the ConNetGNN function.
data(ConNetGNN_data)
# Obtain cpGModule results for each cell phenotype.
data(H9_0h_cpGM_data)
data(H9_24h_cpGM_data)
data(H9_36h_cpGM_data)
data.list<-list(H9_0h=H9_0h_cpGM_data,H9_24h=H9_24h_cpGM_data,H9_36h=H9_36h_cpGM_data)
plotMulPhenGM(data.list,ConNetGNN_data,margin=-0.05)
```
```
# Set plotgraph=F to get the network data.
data_graph<- plotMulPhenGM(data.list,ConNetGNN_data,margin=-0.05,plotgraph=F)
#Obtain the adjacency matrix of the network by the method of igraph package.
Gene_adj_matrix <- as.matrix(as_adjacency_matrix( data[["graph"]], attr="weight"))
# The value of Gene_adj_matrix is the strength of the gene-gene association.
```
# Data analysis and visualization combined with Seurat package
The single-cell pathway activity matrix can be docked with the `Seurat` package to perform non-linear dimensional reduction, identify differentially active pathways, draw heat map, violin plot, and more.
```
library(dplyr)
library(Seurat)
library(patchwork)
data(scPathway_data)
data <- CreateSeuratObject(counts = scPathway_data, project = "Pathway")
all.genes <- rownames(data)
data <- ScaleData(data, features = all.genes,verbose = FALSE)
data <- RunPCA(data, features =all.genes,verbose = FALSE)
data <- FindNeighbors(data, verbose = FALSE)
data <- FindClusters(data, resolution = 0.5,verbose = FALSE)
data <- RunUMAP(data,dims = 1:15, verbose = FALSE)
cell_phen<-unlist(strsplit(colnames(scPathway_data),"_"))
cell_phen<-cell_phen[seq(2,length(cell_phen)-1,by=3)]
data@active.ident<-as.factor(cell_phen)
# Visualize non-linear dimensional reduction
DimPlot(data, reduction = "umap")
# Identify differentially active pathways
diff.pathways <- FindAllMarkers(data,verbose = FALSE)
# violin plot
VlnPlot(data, features = "MAPK signaling pathway [PATH:hsa04010]")
# heat map
DoHeatmap(data, features = row.names(scPathway_data))
```