--- title: "BPCells backend for DelayedArray objects" output: rmarkdown::html_vignette: toc: true github_document: toc: true vignette: > %\VignetteIndexEntry{BPCellsArray} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Installation To install from Bioconductor, use the following code: ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("BPCellsArray") ``` You can install the development version of `BPCellsArray` from [GitHub](https://github.com/Yunuuuu/BPCellsArray) with: ```{r, eval=FALSE} if (!requireNamespace("pak")) { install.packages("pak", repos = sprintf( "https://r-lib.github.io/p/pak/devel/%s/%s/%s", .Platform$pkgType, R.Version()$os, R.Version()$arch ) ) } pak::pkg_install("Yunuuuu/BPCellsArray@main") ``` ## Introduction BPCells is a package for high performance single cell analysis on RNA-seq and ATAC-seq datasets. This package just bring BPCells into Bioconductor single-cell workflow. Almost all operations in `BPCells` are lazy, which means that no real work is performed on `BPCellsMatrix` objects until the result needs to be returned as an R object or written to disk. And most operations have been optimized by `c++` or `c`. Although `DelayedArray` package provides block processing for most usual operations, `BPCellsArray` re-dispatch these methods to use the optimized methods in BPCells. Here is a summarized delayed operations in `BPCellsArray`: | Operations | BPCells | BPCellsArray | | ---------------------------------------- | ---------------------------- | -------------------------------------- | | Combine by row | rbind2 | rbind2,rbind,arbind,bindROWS | | Combine by column | cbind2 | cbind2,cbind,acbind,bindCOLS | | transpose matrix | t | t | | subset | `[` | `[` | | Rename | `dimnames<-` | `dimnames<-`,`rownames<-`,`colnames<-` | | Multiplication | `%*%` | `%*%` | | Crossproduct | | crossprod | | Matrix product transpose | | tcrossprod | | Arithmetic | `+`,`-`,`*`,`/` | `+`,`-`,`*`,`/` | | Relational Operators | Binary (`<`,`>`,`<=`, `>=`) | Binary (`<`,`>`,`<=`, `>=`) | | Storage mode | convert_matrix_type | convert_mode | | Rank-transform | rank_transform | `rank_transform`,`rowRanks`,`colRanks` | | Mask matrix entries to zero | mask_matrix | mask_matrix | | Take minumum with a global constant | min_scalar | pmin_scalar | | Take the minimum with a per-col constant | min_by_col | pmin_by_col | | Take the minimum with a per-row constant | min_by_row | pmin_by_row | | Round number | round | round | | `exp(x) - 1` | `expm1_slow`,`expm1` | `expm1_slow`,`expm1` | | `log(1+x)` | `log1p`,`log1p_slow` | `log1p_single`,`log1p` | | Power | `pow_slow`,`^` | `pow_slow`,`^` | Other non-lazied operations: | Operations | BPCells | BPCellsArray | Note | | ------------------------ | -------------------------------- | --------------------------------------- | ---------------- | | row/col summarize | matrix_stats | matrix_stats | | | row summarize | rowSums,rowMeans,rowVars,rowMaxs | rowSums,rowMeans,rowVars,rowSds,rowMaxs | | | col summarize | colSums,colMeans,colVars,colMaxs | colSums,colMeans,colVars,colSds,colMaxs | | | Multiplication | %*% | %*% | For some methods | | Crossproduct | | crossprod | For some methods | | Matrix product transpose | | tcrossprod | For some methods | | svd | svds | `runSVD`+`SpectraParam` | | | apply | `apply_by_row`, `apply_by_col` | apply | | ## Matrix Storage Format BPCells provide following formats: 1. Directory of files - read: `readBPCellsDirMatrix` - write: `writeBPCellsDirMatrix` 2. HDF5 file - read: `readBPCellsHDF5Matrix` - write: `writeBPCellsHDF5Matrix` 3. 10x HDF5 file - read: `readBPCells10xHDF5Matrix` - write: `writeBPCells10xHDF5Matrix` 4. Anndata HDF5 file - read: `readBPCellsAnnHDF5Matrix` - write: `writeBPCellsAnnHDF5Matrix` 5. in memory - write: `writeBPCellsMemMatrix` Matrices can be stored in a directory on disk, in memory, or in an HDF5 file. Saving in a directory on disk is a good default for local analysis, as it provides the best I/O performance and lowest memory usage. The HDF5 format allows saving within existing hdf5 files to group data together, and the in memory format provides the fastest performance in the event memory usage is unimportant. Details see: ## Single cell analysis ```{r setup} library(BPCellsArray) library(SingleCellExperiment) ``` Let's prepare some data for analysis ```{r} set.seed(1L) path <- tempfile("BPCells") sce <- scuttle::mockSCE(2000L, 3000L) format(object.size(assay(sce, "counts")), "MB") ``` What we need to do is transform the counts matrix into a `BPCellsMatrix` object. ```{r} counts_mat <- assay(sce, "counts") bitpacking_mat <- writeBPCellsDirMatrix(counts_mat, path = path) format(object.size(bitpacking_mat), "MB") ``` The path store the data can be obtained by `path` function. ```{r} identical(path(bitpacking_mat), path) ``` We can inspect the assay info by print it. Attention the class name. ```{r} bitpacking_mat ``` You can coerce it into a dense matrix or `dgCMatrix` to get the actual value. ```{r} bitpacking_mat[1:10, 1:10] as.matrix(bitpacking_mat[1:10, 1:10]) as(bitpacking_mat[1:10, 1:10], "dgCMatrix") ``` All `DelayedArray` methods can be used, especially, the block-processing statistical methods. You can check [DelayedMatrixStats](https://github.com/PeteHaitch/DelayedMatrixStats) pacakge for more supported matrix statisticals. ```{r} identical(rowMins(bitpacking_mat), rowMins(counts_mat, useNames = TRUE)) identical(rowMaxs(bitpacking_mat), rowMaxs(counts_mat, useNames = TRUE)) ``` Again, no real work is performed on the matrix until the result needs to be returned as an R object or written to disk. Attention the `Queued Operations` information. ```{r} assay(sce, "counts") <- bitpacking_mat sce <- scuttle::logNormCounts(sce) assay(sce, "logcounts") ``` Both `count` and `logcounts` share the same disk path. ```{r} identical(path(assay(sce, "counts")), path(assay(sce, "logcounts"))) ``` ```{r} dec_sce <- scran::modelGeneVar(sce) set.seed(1L) scater::runPCA(sce, subset_row = scran::getTopHVGs(dec_sce, n = 2000L), BSPARAM = BiocSingular::IrlbaParam() ) ``` BPCells has implement a C++ Spectra solver for SVD calculation, `BPCellsArray` has wrap it into `SpectraParam`, the same format with other `BiocSingular` function. ```{r} set.seed(1L) sce <- scater::runPCA(sce, subset_row = scran::getTopHVGs(dec_sce, n = 2000L), BSPARAM = SpectraParam() ) colLabels(sce) <- scran::clusterCells( sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam( k = 20L, type = "jaccard", cluster.fun = "leiden", cluster.args = list( objective_function = "modularity", resolution_parameter = 1, n_iterations = -1L # undocumented characteristics ), BNPARAM = BiocNeighbors::AnnoyParam() ) ) sce <- scater::runUMAP( sce, dimred = "PCA", n_neighbors = 10L, min_dist = 0.3, metric = "cosine", external_neighbors = TRUE, BNPARAM = BiocNeighbors::AnnoyParam() ) scater::plotReducedDim( sce, "UMAP", colour_by = "label", point_shape = 16, point.padding = 0, force = 0 ) ``` ## sessionInfo ```{r} sessionInfo() ```