Single-Cell Dataset Preparation¶

Step-by-step guide for preparing single-cell RNA-seq data for visualization in BrainDataPortal.

Learn how to prepare and process single-cell/nuclei RNA sequencing data for visualization in BrainDataPortal. This section covers seurat object processing, gene expression data splitting, metadata table preparation and data formatting.

We will use a brain dataset as an example and cover all essential preprocessing steps.

1. Prerequisites¶

Python 3.8+ with pandas, numpy, json libraries installed.
R 4.0+ with Seurat, tidyverse, presto packages installed.
Basic understanding of single-cell RNA-seq concepts.

2. Download demo data¶

We will use a single-cell dataset from human brain. This dataset contains 10 subjects, approximately 50,000 cells from brain middle temporal gyrus region.

Demo dataset and scripts:
1. Seurat object: snRNAseq_MTG_10samples.rds
2. Sample metadata sheet: Sample_snRNAseq_MTG_10samples.csv
3. Dataset configuration file: dataset_info.toml
4. Processing script: sc_script.zip

3. Data loading and checking¶

Once you have the data, Load it and perform initial inspection to understand the dataset structure.

Full code in Notebook: 11.extract_SC_v4.R.

You need to pay attention to the input arguments: seurat_obj_file, output_dir, cluster_col

## Rscript 11.extract_SC_v4.R
... ...
# Get the arguments
seurat_obj_file <- "snRNAseq_MTG_10samples.rds"
output_dir <- "snRNAseq_MTG_10samples"
cluster_col <- "MajorCellTypes"

# Load the Seurat object
seurat_obj <- readRDS(seurat_obj_file)
capture.output(str(seurat_obj), file = paste0(output_dir, "/seurat_obj_structure.txt"))
... ...

Important Note

The above code will generate a file named seurat_obj_structure.txt in the output directory. Check this file to see the structure of the Seurat object, Make sure the necessary data is present.

Required data in Seurat object

sc RNAseq Seurat RDS/
|-- @assays                     ## List of assays
|   |-- RNA                     ## RNA data
|   |   |-- @counts             ## Raw counts
|   |   |-- @data               ## Normalized data
|   |   |-- @features           ## Feature/Gene names, v5 format
|   |   `-- @cells              ## Cell names/ID, v5 format, 
|   `-- ...
|-- @meta.data                  ## Cell level Metadata table
|   |-- cell_id                 ## Cell ID
|   |-- cell_type               ## Cell type annotation
|   |-- sample_id               ## Sample ID
|   |-- nCount_RNA              ## Number of UMI counts in the cell
|   |-- nFeature_RNA            ## Number of genes detected in the cell
|   `-- ...
|-- @reductions                 ## Dimensionality reduction results
|   |--umap                
|   |   |-- @cell.embeddings    ## UMAP coordinates for each cell
|   |   `-- ...
|   `-- ...
`-- ...

4. Data extraction¶

After check the structure of the Seurat object, we can extract the data and metadata from the object.

Full code in Notebook: 11.extract_SC_v4.R.

Important Note

Seurat v5 has a different structure compared to v4, you may need to adjust the following codes accordingly.

v4 Seurat Object

## 1. Extract the normalized counts
normalized_counts <- seurat_obj@assays$RNA@data  # This is a sparse matrix

## 2. Convert sparse matrix to triplet format (long format)
long_data <- summary(normalized_counts)

## 3. Get row (gene) and column (cell) names
long_data$Gene <- rownames(normalized_counts)[long_data$i]
long_data$Cell <- colnames(normalized_counts)[long_data$j]
long_data$Expression <- long_data$x

## 4. Keep only necessary columns
long_data <- long_data[, c("Gene", "Cell", "Expression")]

v5 Seurat Object

## 1. Extract normalized data
norm_data <- seurat_obj[["RNA"]]@layers[["data"]]   # This is a sparse matrix

## 2. Get gene and cell names from LogMaps
gene_names <- dimnames(seurat_obj[["RNA"]]@features)[[1]]
cell_names <- dimnames(seurat_obj[["RNA"]]@cells)[[1]]

## 3. Convert to triplet format (sparse matrix summary)
triplet <- summary(norm_data)

## 4. Map i and j indices to gene and cell names
triplet$Gene <- gene_names[triplet$i]
triplet$Cell <- cell_names[triplet$j]

## 5. Reorder and rename
long_data <- triplet %>% select(Cell, Gene, Expression = x)

5. Metadata processing¶

This step processes single cell metadata for visualization, including:

Metadata filtering and renaming
UMAP embedding sampling (Subset UMAP points for faster loading)
Expression data splitting and saving (Save gene expression data in json files)
Pseudo-bulk level expression calculation

Full code in Notebook: 21.rename_meta.py.

Set the following parameters according to your dataset:

## This is the output directory from the previous step 
dataset_path = "snRNAseq_MTG_10samples"  

## a list of metadata columns to keep, pick features that you want to visualize
kept_features =[ "nCount_RNA", "nFeature_RNA", "sex", "MajorCellTypes", 
                "updrs", "Complex_Assignment", "mmse", "sample_id", "case",]
sample_col = "sample_id"
cluster_col = "MajorCellTypes"
condition_col = "case"

6. Computing cluster markers¶

This step computes cluster markers for each cluster, including:

Finding cell cluster specific markers
Calculating differential expression between conditions within each cell cluster
Performing pseudo-bulk analysis

Full code in Notebook: 31.clustermarkers.R.

Set the following parameters according to your dataset:

## Dataset specific parameters
seurat_obj_file <- "snRNAseq_MTG_10samples.rds"
output_dir <- "snRNAseq_MTG_10samples"
cluster_col <- "MajorCellTypes"
condition_col <- "case"
sample_col <- "sample_id"
seurat_type <- "snrnaseq"

Important Note

You may need to adjust the following codes for your specific dataset.

if (!"data" %in% slotNames(seurat_obj@assays$RNA)) {
    stop("The Seurat object does not contain the 'data' slot in the 'RNA' assay.")
}
... ...
# Check if the Seurat object has the necessary assay
if (!"ATAC" %in% names(seurat_obj@assays)) {
    stop("The Seurat object does not contain the 'ATAC' assay.")
}
# Check if the Seurat object has the necessary assay data
if (!"counts" %in% slotNames(seurat_obj@assays$ATAC)) {
    stop("The Seurat object does not contain the 'counts' slot in the 'ATAC' assay.")
}
... ...
# Check if the Seurat object has the necessary assay
if (!"Spatial" %in% names(seurat_obj@assays)) {
    stop("The Seurat object does not contain the 'Spatial' assay.")
}
# Check if the Seurat object has the necessary assay data
if (!"data" %in% slotNames(seurat_obj@assays$Spatial)) {
    stop("The Seurat object does not contain the 'data' slot in the 'Spatial' assay.")
}

7. Post-marker processing¶

This step identifies and analyzes top marker genes for each cell type (or cluster) from single-cell data. It also calculates detection frequency and average expression for selected marker genes across conditions and sexes.

Full code in Notebook: 41_clustermarkers_postprocess.py.

Modify the following codes for your specific dataset.

# Define dataset and metadata column names
dataset_folder = "example_data/snRNA_MTG_10Samples"
cluster_col = "MajorCellTypes"
sex_col = "sex"
output_folder = dataset_folder + "/clustermarkers"
... ...

# Filter for significant genes
filtered_df = marker_genes[marker_genes['p_val_adj'] < 0.05]

# Rank by absolute log2FC and select top 10 per cluster
top_genes = (
    filtered_df
    .assign(abs_log2FC = filtered_df['avg_log2FC'])  ## chenge to abs(filtered_df['avg_log2FC']) will include negative log2FC genes
    .sort_values(['cluster', 'abs_log2FC'], ascending=[True, False])
    .groupby('cluster')
    .head(10)
    .drop(columns='abs_log2FC')  # Optional: remove helper column
)
... ...

8. Prepare sample sheet¶

This step generates a sample sheet file for the dataset. It includes information about the samples, such as condition, sex, and other relevant metadata.

Download the demo sample sheet file: Sample_snRNAseq_MTG_10samples.csv

Important Note

PLEASE KEEP ALL THE COLUMN NAMES AND ORDER AS IS, JUST FILL IN YOUR DATA.

9. Dataset configuration file¶

This step prepares the dataset information.

The dataset information file is a toml file.
It is an essential file for the dataset.

Download the demo dataset information file: dataset_info.toml

Important Note

Dataset configuration file name must be dataset_info.toml.

Here is an example of the dataset configuration file content:

[datasetfile] file = "" datatype = ""

[dataset] dataset_name = "" description = "" PI_full_name = "" PI_email = "" first_contributor = "" first_contributor_email = "" other_contributors = "" support_grants = "" other_funding_source = "" publication_DOI = "" publication_PMID = "" brain_super_region = "" brain_region = "" sample_info = "" sample_sheet = "" n_samples = 96 organism = "Homo Sapiens" tissue = "Brain" disease = "PD"

[study] study_name = "Parkinson5D" description = "" team_name = "Team Scherzer" lab_name = "NeuroGenomics" submitter_name = "" submitter_email = ""

[protocol] protocol_id = "P002" protocol_name = "P001_VisiumST" version = "" github_url = "" sample_collection_summary = "" cell_extraction_summary = "" lib_prep_summary = "" data_processing_summary = "" protocols_io_DOI = "" other_reference = ""

[meta_features] selected_features = ["nCou sample_id_column = "sample_name" major_cluster_column = "CellType" condition_column = "diagnosis"

[visium_defaults] samples = [ features = [ genes = [



## Path to the Seurat object file ## Type of the data. Options: scRNAseq, scATACseq, VisiumST, xQTL ## Required: Dataset name, MUST BE UNIQUE, used to identify the dataset in the database ## Dataset description ## Principal Investigator (PI) full name ## PI email ## First contributor name ## First contributor email ## Other contributors ## Support grants ## Other funding source ## DOI of the publication ## PMID of the publication ## Brain super region ## Brain region ## Sample information ## Sample sheet file name (Not the path, just the file name) ## Number of samples ## Organism ## Tissue ## Disease ## Study name, the dataset belongs to ## Study description ## Team name ## Lab name ## Submitter name ## Submitter email ## Protocol ID ## Protocol name ## Protocol version ## GitHub URL ## Sample collection summary ## Cell extraction summary ## Library preparation summary ## Data processing summary ## protocols.io DOI ## Other reference nt_Spatial",...] ## List of selected features will be shown in the page ## Sample ID column in Seurat object metadata ## Major cluster column in Seurat object metadata ## Condition column in Seurat object metadata         class="s2">"BN2023", "BN1076",]         ## List of sample names class="s2">"smoothed_label_s5",...]    ## List of default feature names class="s2">"SNCA",...]                    ## List of default gene names
10. Upload dataset¶
Upload the dataset to server folder.
After running the pipeline, there will be a dataset folder that contains all necessary files:

The files with names starting with raw_ are NOT necessary for upload.
The file named 'pb_expr_matrix.csv' in the folder '<dataset_name>/clustermarkers' is NOT necessary for upload.
Upload the dataset folder named '<dataset_name>' to the server at 'backend/datasets'.
Put the dataset configuration file named 'dataset_info.toml' to your dataset folder at 'backend/datasets/<dataset_name>'.
Upload samplesheet file to the server at 'backend/SampleSheet'.
Refresh the database: Go to 'Datasets Management Page '(DATASETS -> +ADD DATASET) and click 'REFRESH DB'.