This package utilizes a granularity-based dimension-agnostic tool, single-cell big-small patch (scBSP), implementing sparse matrix operation and Approximate Nearest Neighbor search method for distance calculation, for the identification of spatially variable genes on large-scale data. A corresponding Python library is available at https://pypi.org/project/scbsp.
This package can be installed on R CRAN.
# Install sparseMatrixStats if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (!requireNamespace("sparseMatrixStats", quietly = TRUE)) {
BiocManager::install("sparseMatrixStats")
}
# Install scBSP from CRAN
if (!requireNamespace("scBSP", quietly = TRUE)) {
install.packages("scBSP")
}
In this example, we will identify spatially variable genes (SVGs) using the HDST dataset of the mouse hippocampus. HDST (High-Definition Spatial Transcriptomics) provides subcellular resolution data. This dataset contains 181,367 spots and 19,950 genes, with a sparse density of approximately 0.0003 (indicating only 0.03% of the matrix elements are non-zero). Such high-resolution data offers unique insights into spatial gene expression patterns but also poses computational challenges due to its size and sparsity.
First, ensure that the required libraries (ggplot2
and
scBSP
) are installed and loaded. ggplot2
is
used for data visualization, and scBSP
is the primary tool
for identifying spatially variable genes.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
library(ggplot2)
library(scBSP)
load("./processed data/CN24_D1_unmodgtf_filtered_red_ut_HDST_final_clean.rds")
After loading the data, you can examine the structure of the expression count matrix (sp_count). This sparse matrix provides a compact representation of gene expression data, where most values are zero.
head(sp_count)
## 6 x 181367 sparse Matrix of class "dgCMatrix"
## [[ suppressing 34 column names '1000x100', '1000x103', '1000x113' ... ]]
##
## Rcn2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## Mycbp2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## mt-Rnr2 . . . . . . . . . . 1 . 1 2 . . 1 . 1 1 . . 2 2 2 6 4 . 6 2 . 1 1 4
## Mprip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## Mroh1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## Zfp560 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##
## Rcn2 ......
## Mycbp2 ......
## mt-Rnr2 ......
## Mprip ......
## Mroh1 ......
## Zfp560 ......
##
## .....suppressing 181333 columns in show(); maybe adjust options(max.print=, width=)
## ..............................
Here, the rows correspond to genes, and the columns correspond to spatial spots. Each cell in the matrix represents the expression count of a gene in a specific spot.
To identify SVGs, we use the scBSP function, which calculates p-values for each gene to assess its spatial variability. This step may require significant computational resources depending on the size of the data.
system.time(P_values <- scBSP(location, sp_count))
## Normalizing the expression matrix
## Normalizing the coords matrix
## Calculating p-values
## user system elapsed
## 5.66 1.04 7.02
To visualize the spatial pattern of a gene, we merge the spatial coordinates with the expression data of the most significant gene (i.e., the gene with the smallest p-value). We then create a scatter plot with ggplot2.
In the first example, we identified SVGs from HDST data of the mouse hippocampus. In this example, we will further analyze these genes by performing Gene Ontology (GO) enrichment analysis, which helps interpret the biological significance of the identified genes by grouping them into known functional categories.
Before performing GO enrichment analysis, we need to ensure that the required R packages are installed and loaded. The following packages are essential for functional enrichment analysis:
If any of these packages are not installed, the script will install them automatically using Bioconductor.
# Check and install necessary packages
if (!requireNamespace("vidger", quietly = TRUE)) {
BiocManager::install("vidger")
}
## Warning: multiple methods tables found for 'sort'
## Warning: multiple methods tables found for 'sort'
## Warning: multiple methods tables found for 'sort'
if (!requireNamespace("AnnotationDbi", quietly = TRUE)) {
BiocManager::install("AnnotationDbi")
}
## Warning: multiple methods tables found for 'sort'
if (!requireNamespace("org.Mm.eg.db", quietly = TRUE)) {
BiocManager::install("org.Mm.eg.db")
}
if (!requireNamespace("clusterProfiler", quietly = TRUE)) {
BiocManager::install("clusterProfiler")
}
# Load the required libraries
library(vidger)
library(AnnotationDbi)
library(org.Mm.eg.db)
library(clusterProfiler)
Once we have identified the SVGs, we will determine which ontology they are significantly involved in. We will conduct GO enrichment analysis using the clusterProfiler package. This step will also help refine the identified candidate biomarkers.
The key steps in this analysis include:
enrichGO()
function, we map the significant genes to Biological Processes (BP)
using the org.Mm.eg.db
annotation database.Visualizing the results: A dot plot is generated to show the top enriched biological processes.
scBSP_Sig_Genes <- P_values$GeneNames[which(P_values$P_values<0.05)]
scBSP_Sig_GO <- enrichGO(scBSP_Sig_Genes,
OrgDb = "org.Mm.eg.db",
ont="BP",
keyType="SYMBOL",
pvalueCutoff = 0.0001,
readable=TRUE)
# generate figure
dotplot(scBSP_Sig_GO, showCategory = 10)
This downstream analysis helps uncover the functional roles of spatially variable genes in the mouse hippocampus, providing insights into region-specific gene activities.
This example demonstrates the analysis of high-density SlideSeq V2
data from Seurat object using the scBSP
package. We will
install and load the necessary data and libraries, preprocess the data
and detect the spatially variable genes using scBSP.
To begin, we ensure that the required packages are installed. The
remotes
package will allow us to install
SeuratData
, which contains the dataset we need.
# Install required package if not already installed
if (!requireNamespace("remotes", quietly = TRUE)) {
install.packages("remotes")
}
if (!requireNamespace("Seurat", quietly = TRUE)) {
install.packages("Seurat")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
# Install SeuratData from GitHub
remotes::install_github("satijalab/seurat-data")
Next, we load the libraries that will be used throughout this analysis: Seurat, SeuratData, and SeuratObject.
# Load required libraries
library(scBSP)
library(Seurat)
library(SeuratData)
library(SeuratObject)
library(ggplot2)
The Slide-seq V2 dataset of the mouse hippocampus can now be installed. This dataset contains spatially resolved transcriptomics data with relatively high density.
# Install the Slide-seq v2 dataset
InstallData("ssHippo")
Once the dataset is installed, we load it into the environment and preprocess it for further analysis. We extract the filtered expression matrix using the scBSP package.
# Load the Slide-seq v2 data
slide.seq <- LoadData("ssHippo")
# Extract spatial data and expression matrix
data_extracted <- scBSP::LoadSpatial(slide.seq)
# Apply a filtering step on the expression matrix
ExpMatrix_Filtered <- scBSP::SpFilter(data_extracted$ExpMatrix, Threshold = 1)
To understand the structure of the data, we check the dimensions of the SlideSeq coordinates and the filtered gene expression matrix.
# Dimensions of the SlideSeq coordinates
dim(data_extracted$Coords)
## [1] 53173 2
# Dimensions of the filtered gene expression matrix
dim(ExpMatrix_Filtered)
## [1] 23243 53173
# Record computational time during scBSP analysis
system.time(P_values <- scBSP::scBSP(data_extracted$Coords, ExpMatrix_Filtered))
## Normalizing the expression matrix
## Normalizing the coords matrix
## Calculating p-values
## user system elapsed
## 24.53 2.64 27.59
To visualize the spatial pattern of a gene, we merge the spatial coordinates with the expression data of the most significant gene (i.e., the gene with the smallest p-value). We then create a scatter plot with ggplot2.
Wang, J., Li, J., Kramer, S.T. et al. Dimension-agnostic and granularity-based spatially variable gene identification using BSP. Nat Commun 14, 7367 (2023). https://doi.org/10.1038/s41467-023-43256-5
Li, J., Wang, Y., Raina, M.A. et al. scBSP: A fast and accurate tool for identifying spatially variable genes from spatial transcriptomic data. bioRxiv 2024.05.06.592851; doi: https://doi.org/10.1101/2024.05.06.592851
HDST data source
https://xzhoulab.github.io/SPARK/02_SPARK_Example/
SlideSeq V2 data from Seurat
Stickels, R.R., Murray, E., Kumar, P., Li, J., Marshall, J.L., Di Bella, D., Arlotta, P., Macosko, E.Z. and Chen, F., 2020. Sensitive spatial genome wide expression profiling at cellular resolution. BioRxiv, pp.2020-03. doi: https://doi.org/10.1101/2020.03.12.989806