What if my adata does not fit in memory?
Although not initially designed to run out-of-core rank-sum tests, illico supports some disk-backed expression matrices natively. The slowdown incurred by backing the dataset on disk is hard to estimate, as it depends directly on your system's IO. Notably:
h5-dense (np.ndarray) disk-backed datasets are natively supported
h5-CSC (sparse along the columns) disk-backed datasets are natively supported
h5-CSR (sparse along the rows) disk-backed datasets are natively supported only for OVO (perturbed vs. controls) tests. If you want to perform OVR (each group vs. the rest) tests, you are better off loading the dataset entirely in memory: the OVR test requires each column to be entirely in RAM at once, and the CSR format does not allow loading columns from disk without silently loading the entire `.indices` array in RAM.
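To make the CSR limitation concrete, here is a minimal pure-Python sketch of the two sparse layouts on a toy matrix. It does not use illico, anndata, or h5 files; the helper names `column_from_csc` and `column_from_csr` are hypothetical and only illustrate how many stored entries must be visited to rebuild one column.

```python
# Toy 3 x 4 matrix (rows = cells, columns = genes):
# [[1, 0, 2, 0],
#  [0, 0, 3, 0],
#  [4, 5, 0, 0]]

# CSR: stored values are laid out row by row.
csr_data = [1, 2, 3, 4, 5]
csr_indices = [0, 2, 2, 0, 1]  # column index of each stored value
csr_indptr = [0, 2, 3, 5]      # row r spans indptr[r]:indptr[r + 1]

# CSC: stored values are laid out column by column.
csc_data = [1, 4, 5, 2, 3]
csc_indices = [0, 2, 2, 0, 1]  # row index of each stored value
csc_indptr = [0, 2, 3, 5, 5]   # column c spans indptr[c]:indptr[c + 1]


def column_from_csc(data, indices, indptr, n_rows, col):
    """Rebuild one dense column; only that column's slice is read."""
    start, stop = indptr[col], indptr[col + 1]
    dense = [0] * n_rows
    for k in range(start, stop):
        dense[indices[k]] = data[k]
    return dense, stop - start  # (column, number of entries visited)


def column_from_csr(data, indices, indptr, n_rows, col):
    """Rebuild one dense column; every stored index has to be inspected."""
    dense = [0] * n_rows
    visited = 0
    for row in range(n_rows):
        for k in range(indptr[row], indptr[row + 1]):
            visited += 1
            if indices[k] == col:
                dense[row] = data[k]
    return dense, visited


col_csc, visited_csc = column_from_csc(csc_data, csc_indices, csc_indptr, 3, 2)
col_csr, visited_csr = column_from_csr(csr_data, csr_indices, csr_indptr, 3, 2)
print(col_csc, visited_csc)
print(col_csr, visited_csr)
```

Both calls return the same column, but the CSC version visits only the entries of that column while the CSR version walks the whole index array. On a disk-backed matrix, that full walk is what forces the indices into RAM.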
If your data is backed through Dask or another backend, please open an issue as it should require little rework for it to be supported.
Summary:
| Test | Format | Storage | Supported? | Remark |
|---|---|---|---|---|
| OVO or OVR | Dense, CSC or CSR | In RAM | ✅ | - |
| OVO (`reference="non-targeting"`) | Dense | Lazy (H5) | ✅ | - |
| OVO (`reference="non-targeting"`) | CSR | Lazy (H5) | ✅ | Specific parallelization scheme |
| OVO (`reference="non-targeting"`) | CSC | Lazy (H5) | ✅ | - |
| OVR (`reference=None`) | Dense | Lazy (H5) | ✅ | - |
| OVR (`reference=None`) | CSR | Lazy (H5) | ❌ | Voluntarily not supported, better off loading in RAM |
| OVR (`reference=None`) | CSC | Lazy (H5) | ✅ | - |
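The summary table can also be written down as a small lookup, which may be handy for deciding up front whether a dataset should be loaded in RAM. This is only a sketch: `is_supported` and the string labels below are hypothetical and are not part of illico's API.

```python
# Hypothetical encoding of the support table above (not illico's API).
TESTS = {"OVO", "OVR"}
FORMATS = {"Dense", "CSC", "CSR"}
STORAGES = {"RAM", "lazy"}

# The only combination the table marks as unsupported.
UNSUPPORTED = {("OVR", "CSR", "lazy")}


def is_supported(test, fmt, storage):
    """Return True if the (test, format, storage) combination is supported."""
    if test not in TESTS or fmt not in FORMATS or storage not in STORAGES:
        raise ValueError(f"unknown combination: {(test, fmt, storage)}")
    return (test, fmt, storage) not in UNSUPPORTED


for combo in [("OVO", "CSR", "lazy"), ("OVR", "CSR", "lazy"), ("OVR", "CSR", "RAM")]:
    print(combo, is_supported(*combo))
```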
Notes:
Supporting the CSR use case is highly non-trivial: running `adata[:, idxs]` on a backed CSR matrix (temporarily) loads the entirety of the indices in RAM, resulting in a memory footprint almost equivalent to loading everything at once, on top of being extremely slow. That is why the OVR test on lazy CSR is not supported.
Users struggling with out-of-core single-cell RNA-seq analyses should visit rapids-singlecell, which explicitly targets this use case.
The "Specific parallelization scheme" mentioned for the OVO lazy CSR use case simply relies on the fact that, due to the nature of the OVO test, we can run it group by group and thus load only one group at a time in RAM. This is not the case for OVR, where we need to load all groups at once.
Note also that illico is expected to scale less well on lazy datasets, since the data-loading step (for example, reading h5 datasets) is GIL-blocking most of the time.
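The group-by-group idea behind the OVO lazy CSR case can be sketched in pure Python on a toy CSR matrix. This is not illico's implementation. The group labels and the `load_rows` helper are hypothetical, and the sketch assumes the reference group is kept in memory while each perturbed group is loaded in turn.

```python
# Toy 5 x 4 CSR matrix (rows = cells, columns = genes):
# [[1, 0, 2, 0],   row 0: reference
#  [0, 0, 3, 0],   row 1: geneA
#  [4, 5, 0, 0],   row 2: geneA
#  [0, 6, 0, 7],   row 3: geneB
#  [8, 0, 0, 0]]   row 4: geneB
data = [1, 2, 3, 4, 5, 6, 7, 8]
indices = [0, 2, 2, 0, 1, 1, 3, 0]
indptr = [0, 2, 3, 5, 7, 8]

# Hypothetical group labels mapping to row numbers.
groups = {"non-targeting": [0], "geneA": [1, 2], "geneB": [3, 4]}


def load_rows(rows):
    """Read only the CSR slices belonging to the requested rows."""
    loaded = {}
    n_entries = 0
    for r in rows:
        start, stop = indptr[r], indptr[r + 1]
        loaded[r] = list(zip(indices[start:stop], data[start:stop]))
        n_entries += stop - start
    return loaded, n_entries


# Assumption: the reference rows stay in memory for all comparisons.
reference, n_ref = load_rows(groups["non-targeting"])

peak_entries = 0
for name, rows in groups.items():
    if name == "non-targeting":
        continue
    group, n_group = load_rows(rows)  # one perturbed group at a time
    peak_entries = max(peak_entries, n_ref + n_group)
    # ... the rank-sum test of `group` against `reference` would run here ...

print(peak_entries, "entries held at peak, out of", len(data))
```

Because CSR stores each row contiguously, a group of cells maps to a handful of slices and no full scan of the indices is needed. An OVR test has no such shortcut, since every comparison involves all the other groups.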