---
title: "BPCells backend for DelayedArray objects"
output: 
  rmarkdown::html_vignette:
    toc: true
  github_document: 
    toc: true
vignette: >
  %\VignetteIndexEntry{BPCellsArray}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

## Installation
To install from Bioconductor, use the following code:

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("BPCellsArray")
```

You can install the development version of `BPCellsArray` from [GitHub](https://github.com/Yunuuuu/BPCellsArray) with:

```{r, eval=FALSE}
if (!requireNamespace("pak")) {
    install.packages("pak",
        repos = sprintf(
            "https://r-lib.github.io/p/pak/devel/%s/%s/%s",
            .Platform$pkgType, R.Version()$os, R.Version()$arch
        )
    )
}
pak::pkg_install("Yunuuuu/BPCellsArray@main")
```

## Introduction

BPCells is a package for high performance single cell analysis on RNA-seq and
ATAC-seq datasets. This package just bring BPCells into Bioconductor single-cell
workflow. 

Almost all operations in `BPCells` are lazy, which means that no real work is
performed on `BPCellsMatrix` objects until the result needs to be returned as an
R object or written to disk. And most operations have been optimized by `c++` or
`c`. Although `DelayedArray` package provides block processing for most usual
operations, `BPCellsArray` re-dispatch these methods to use the optimized
methods in BPCells. 

Here is a summarized delayed operations in `BPCellsArray`:

| Operations                               | BPCells                      | BPCellsArray                           |
| ---------------------------------------- | ---------------------------- | -------------------------------------- |
| Combine by row                           | rbind2                       | rbind2,rbind,arbind,bindROWS           |
| Combine by column                        | cbind2                       | cbind2,cbind,acbind,bindCOLS           |
| transpose matrix                         | t                            | t                                      |
| subset                                   | `[`                          | `[`                                    |
| Rename                                   | `dimnames<-`                 | `dimnames<-`,`rownames<-`,`colnames<-` |
| Multiplication                           | `%*%`                        | `%*%`                                  |
| Crossproduct                             |                              | crossprod                              |
| Matrix product transpose                 |                              | tcrossprod                             |
| Arithmetic                               | `+`,`-`,`*`,`/`              | `+`,`-`,`*`,`/`                        |
| Relational Operators                     | Binary  (`<`,`>`,`<=`, `>=`) | Binary  (`<`,`>`,`<=`, `>=`)           |
| Storage mode                             | convert_matrix_type          | convert_mode                           |
| Rank-transform                           | rank_transform               | `rank_transform`,`rowRanks`,`colRanks` |
| Mask matrix entries to zero              | mask_matrix                  | mask_matrix                            |
| Take minumum with a global constant      | min_scalar                   | pmin_scalar                            |
| Take the minimum with a per-col constant | min_by_col                   | pmin_by_col                            |
| Take the minimum with a per-row constant | min_by_row                   | pmin_by_row                            |
| Round number                             | round                        | round                                  |
| `exp(x) - 1`                             | `expm1_slow`,`expm1`         | `expm1_slow`,`expm1`                   |
| `log(1+x)`                               | `log1p`,`log1p_slow`         | `log1p_single`,`log1p`                 |
| Power                                    | `pow_slow`,`^`               | `pow_slow`,`^`                         |


Other non-lazied operations:

| Operations               | BPCells                          | BPCellsArray                            | Note             |
| ------------------------ | -------------------------------- | --------------------------------------- | ---------------- |
| row/col summarize        | matrix_stats                     | matrix_stats                            |                  |
| row summarize            | rowSums,rowMeans,rowVars,rowMaxs | rowSums,rowMeans,rowVars,rowSds,rowMaxs |                  |
| col summarize            | colSums,colMeans,colVars,colMaxs | colSums,colMeans,colVars,colSds,colMaxs |                  |
| Multiplication           | %*%                              | %*%                                     | For some methods |
| Crossproduct             |                                  | crossprod                               | For some methods |
| Matrix product transpose |                                  | tcrossprod                              | For some methods |
| svd                      | svds                             | `runSVD`+`SpectraParam`                 |                  |
| apply                    | `apply_by_row`, `apply_by_col`   | apply                                   |                  |

## Matrix Storage Format
BPCells provide following formats:

1. Directory of files
    - read: `readBPCellsDirMatrix`
    - write: `writeBPCellsDirMatrix`
2. HDF5 file 
    - read: `readBPCellsHDF5Matrix`
    - write: `writeBPCellsHDF5Matrix`
3. 10x HDF5 file 
    - read: `readBPCells10xHDF5Matrix`
    - write: `writeBPCells10xHDF5Matrix`
4. Anndata HDF5 file 
    - read: `readBPCellsAnnHDF5Matrix`
    - write: `writeBPCellsAnnHDF5Matrix`
5. in memory 
    - write: `writeBPCellsMemMatrix`

Matrices can be stored in a directory on disk, in memory, or in an HDF5 file.
Saving in a directory on disk is a good default for local analysis, as it
provides the best I/O performance and lowest memory usage. The HDF5 format
allows saving within existing hdf5 files to group data together, and the in
memory format provides the fastest performance in the event memory usage is
unimportant.

Details see: <https://bnprks.github.io/BPCells/articles/web-only/bitpacking-format.html>

## Single cell analysis
```{r setup}
library(BPCellsArray)
library(SingleCellExperiment)
```

Let's prepare some data for analysis
```{r}
set.seed(1L)
path <- tempfile("BPCells")
sce <- scuttle::mockSCE(2000L, 3000L)
format(object.size(assay(sce, "counts")), "MB")
```


What we need to do is transform the counts matrix into a `BPCellsMatrix`
object. 
```{r}
counts_mat <- assay(sce, "counts")
bitpacking_mat <- writeBPCellsDirMatrix(counts_mat, path = path)
format(object.size(bitpacking_mat), "MB")
```

The path store the data can be obtained by `path` function.
```{r}
identical(path(bitpacking_mat), path)
```

We can inspect the assay info by print it. Attention the class name. 
```{r}
bitpacking_mat
```

You can coerce it into a dense matrix or `dgCMatrix` to get the actual value.
```{r}
bitpacking_mat[1:10, 1:10]
as.matrix(bitpacking_mat[1:10, 1:10])
as(bitpacking_mat[1:10, 1:10], "dgCMatrix")
```

All `DelayedArray` methods can be used, especially, the block-processing
statistical methods. You can check
[DelayedMatrixStats](https://github.com/PeteHaitch/DelayedMatrixStats) pacakge
for more supported matrix statisticals.
```{r}
identical(rowMins(bitpacking_mat), rowMins(counts_mat, useNames = TRUE))
identical(rowMaxs(bitpacking_mat), rowMaxs(counts_mat, useNames = TRUE))
```

Again, no real work is performed on the matrix until the result needs to be
returned as an R object or written to disk. Attention the `Queued Operations`
information. 
```{r}
assay(sce, "counts") <- bitpacking_mat
sce <- scuttle::logNormCounts(sce)
assay(sce, "logcounts")
```

Both `count` and `logcounts` share the same disk path.
```{r}
identical(path(assay(sce, "counts")), path(assay(sce, "logcounts")))
```


```{r}
dec_sce <- scran::modelGeneVar(sce)
set.seed(1L)
scater::runPCA(sce,
    subset_row = scran::getTopHVGs(dec_sce, n = 2000L),
    BSPARAM = BiocSingular::IrlbaParam()
)
```

BPCells has implement a C++ Spectra solver for SVD calculation, `BPCellsArray`
has wrap it into `SpectraParam`, the same format with other `BiocSingular`
function. 
```{r}
set.seed(1L)
sce <- scater::runPCA(sce,
    subset_row = scran::getTopHVGs(dec_sce, n = 2000L),
    BSPARAM = SpectraParam()
)
colLabels(sce) <- scran::clusterCells(
    sce,
    use.dimred = "PCA",
    BLUSPARAM = bluster::SNNGraphParam(
        k = 20L, type = "jaccard",
        cluster.fun = "leiden",
        cluster.args = list(
            objective_function = "modularity",
            resolution_parameter = 1,
            n_iterations = -1L # undocumented characteristics
        ),
        BNPARAM = BiocNeighbors::AnnoyParam()
    )
)
sce <- scater::runUMAP(
    sce,
    dimred = "PCA",
    n_neighbors = 10L,
    min_dist = 0.3,
    metric = "cosine",
    external_neighbors = TRUE,
    BNPARAM = BiocNeighbors::AnnoyParam()
)
scater::plotReducedDim(
    sce, "UMAP",
    colour_by = "label",
    point_shape = 16,
    point.padding = 0,
    force = 0
)
```

## sessionInfo
```{r}
sessionInfo()
```