API Reference¶
illico package¶
- illico.asymptotic_wilcoxon.asymptotic_wilcoxon(adata, is_log1p, group_keys, reference=None, n_threads=1, batch_size='auto', alternative='two-sided', use_continuity=True, tie_correct=True, layer=None, precompile=True)[source]¶
Perform asymptotic Mann-Whitney tests for differential gene expression.
Mann-Whitney test is the same as Wilcoxon rank-sum test. This function takes as input an AnnData object of shape (n_cells, n_genes) with a group (e.g., perturbation) variable stored in .obs. It performs either one-versus-rest (OVR) or one-versus-one (OVO) Wilcoxon-Mann-Whitney tests for each gene, depending on whether a reference group is provided.
It supports in-RAM dense, sparse CSC and sparse CSR matrices, as well as backed dense and sparse CSC matrices.
- Parameters:
adata (ad.AnnData) – Annotated data matrix of shape (n_cells, n_genes).
is_log1p (bool) – Whether the data is log1p transformed.
group_keys (str) – Key in adata.obs specifying the group variable.
reference (str or None, default=None) – Name of the reference group for OVO tests. If None, OVR tests are performed.
n_threads (int, default=1) – Number of threads to use for parallel computation.
batch_size (int or "auto", default="auto") – Number of genes to process in each batch. If “auto”, automatically determines optimal batch size aiming for approximately 256 genes per chunk.
alternative (str, default="two-sided") – Type of alternative hypothesis. One of ‘two-sided’, ‘less’, or ‘greater’.
use_continuity (bool, default=True) – Whether to apply continuity correction.
tie_correct (bool, default=True) – Whether to apply tie correction in the test statistic.
layer (str or None, default=None) – Layer in adata.layers to use for the data. If None, uses adata.X.
precompile (bool, default=True) – Whether to precompile necessary functions for performance. It is recommended to set this to True.
- Returns:
A DataFrame with MultiIndex (pert, feature) containing columns: - ‘p_value’: P-value from the Mann-Whitney test - ‘statistic’: Test statistic (U-statistic) - ‘fold_change’: Fold change between groups
- Return type:
pd.DataFrame
- Raises:
ValueError – If input data matrix indices are not sorted (for sparse CSR matrices). If batch_size is not ‘auto’ or an integer.
Examples
>>> import anndata as ad >>> import numpy as np >>> import pandas as pd >>> >>> # Create example AnnData object >>> n_cells, n_genes = 1000, 500 >>> X = np.random.negative_binomial(5, 0.3, size=(n_cells, n_genes)) >>> obs = pd.DataFrame({'cell_type': np.random.choice(['A', 'B', 'C'], n_cells)}) >>> var = pd.DataFrame(index=[f'gene_{i}' for i in range(n_genes)]) >>> adata = ad.AnnData(X=X, obs=obs, var=var) >>> >>> # Perform one-versus-rest tests >>> results = asymptotic_wilcoxon( ... adata, ... is_log1p=False, ... group_keys='cell_type', ... n_threads=4 ... ) >>> print(results.head()) >>> >>> # Perform one-versus-one tests against reference >>> results_ovo = asymptotic_wilcoxon( ... adata, ... is_log1p=False, ... group_keys='cell_type', ... reference='A', ... n_threads=4, ... alternative='greater' ... ) >>> >>> # Filter significant results >>> significant = results[results['p_value'] < 0.05] >>> print(f"Found {len(significant)} significant tests")
Notes
The function automatically handles both dense and sparse matrices. For sparse CSR matrices, indices must be sorted per row to ensure correct results.
Author: Rémy Dubois
ovo subpackage¶
- illico.ovo.dense_ovo.dense_ovo_mwu_kernel(sorted_ref_data, sorted_tgt_data, use_continuity=True, tie_correct=True, alternative='two-sided')¶
Sequentially perform OVO tests on columns between sorted ref and sorted perturbed data.
- Parameters:
sorted_ref_data (np.ndarray) – Vertically sorted reference data.
sorted_tgt_data (_type_) – Vertically sorted perturbed data.
use_continuity (bool, optional) – Apply continuity factor or not . Defaults to True.
tie_correct (bool, optional) – Whether to apply tie correction when computing p-values. Defaults to True.
alternative (Literal["two-sided", "less", "greater"]) – Type of alternative hypothesis
- Returns:
two-sided p-values, U-statistics. Each of shape (n_genes,).
- Return type:
tuple[np.ndarray]
Author: Rémy Dubois
- illico.ovo.dense_ovo.dense_ovo_mwu_kernel_over_contiguous_col_chunk(X, chunk_lb, chunk_ub, grpc, is_log1p, use_continuity=True, tie_correct=True, alternative='two-sided')¶
Perform OVO tests group-wise and gene(col)-wise.
Update: There is no need to fortranize the whole chunk at once, it can be done group by group within the loop. The memo
Memory footprint investigations: 1. ad.read_h5ad allocates memory in such a weird way that zeros are not properly assigned. Most likely this differs on unix systems. All investigations done on my MBP were so weird because of that. Tests were so weird because during the wilcoxon test, values were accessed and re-ordered and all those things. As a result, some memory allocation was happening within my functions although I was not allocating anything. 2. np.asfortranarray(X[:, chunk_lb:chunk_ub][grpc.indices]) seems to allocate 2x the memory needed for the chunk. So if chunk is 2GB, this line allocates 4GB temporarily. 3.
- Parameters:
X (np.ndarray) – Input dense expression matrix of shape (n_cells, n_genes)
chunk_lb (int) – Lower bound of the vertical slicing
chunk_ub (int) – Upper bound of the vertical slicing
grpc (GroupContainer) – GroupContainer, contains information about which group each row belongs to.
use_continuity (bool, optional) – Apply continuity factor or not. Defaults to True.
tie_correct (bool, optional) – Whether to apply tie correction when computing p-values. Defaults to True.
alternative (Literal["two-sided", "less", "greater"]) – Type of alternative hypothesis
is_log1p (bool, optional) – User-indicated flag telling if data underwent log1p transform or not. Defaults to False.
- Raises:
ValueError – If bounds are not intelligible.
- Returns:
two-sided p-values, U-statistics, fold change. Each of shape (n_groups, chunk_ub - chunk_lb).
- Return type:
tuple[np.ndarray]
Author: Rémy Dubois
- illico.ovo.sparse_ovo.single_group_sparse_ovo_mwu_kernel(sorted_ref_data, sorted_tgt_data, use_continuity=True, tie_correct=True, alternative='two-sided')¶
Perform OVO tests gene wise using the two sorted CSC matrix given as input.
The test performed is the equivalent of: scipy.stats.mannwhitneyu(sorted_ref_data.toarray(), sorted_tgt_data.toarray(), use_continuity=True)
- Parameters:
sorted_ref_data (CSCMatrix) – Reference data stored in CSC, sorted column-wise
sorted_tgt_data (CSCMatrix) – Perturbed data stored in CSC, sorted column-wise
use_continuity (bool, optional) – Apply continuity factor or not. Defaults to True.
tie_correct (bool, optional) – Whether to apply tie correction when computing p-values. Defaults to True.
alternative (Literal["two-sided", "less", "greater"]) – Type of alternative hypothesis
- Raises:
ValueError – If shape mismatche
- Returns:
two-sided p-values, U-statistics. Each of shape (n_genes,).
- Return type:
tuple[np.ndarray]
Author: Rémy Dubois
- illico.ovo.sparse_ovo.multi_group_sparse_ovo_mwu_kernel(X, grpc, ref_group_id, use_continuity=True, tie_correct=True, alternative='two-sided')¶
Sequentially perform group-wise OVO tests along the columns on a CSR matrix.
- Parameters:
X (CSRMatrix) – Input CSR matrix of shape (n_cells, n_genes)
grpc (GroupContainer) – GroupContainer
ref_group_id (int) – Encoded reference group ID.
use_continuity (bool, optional) – Whether to use continuity correction when computing p-values. Defaults to True.
tie_correct (bool, optional) – Whether to apply tie correction when computing p-values. Defaults to True.
alternative (Literal["two-sided", "less", "greater"]) – Type of alternative hypothesis
- Returns:
two-sided p-values, U-statistics. Each of shape (n_groups, n_genes).
- Return type:
tuple[np.ndarray]
Author: Rémy Dubois
- illico.ovo.sparse_ovo.csc_ovo_mwu_kernel_over_contiguous_col_chunk(X, chunk_lb, chunk_ub, grpc, is_log1p, use_continuity=True, tie_correct=True, alternative='two-sided')¶
Perform OVO test over contiguous chunk of column of a CSC matrix.
- Parameters:
X (CSCMatrix) – Input matrix of shape (n_cells, n_genes)
chunk_lb (int) – Lower bound of the vertical slicing
chunk_ub (int) – Upper bound of the vertical slicing
grpc (GroupContainer) – GroupContainer
is_log1p (bool) – User-indicated flag telling if data underwent log1p transform.
use_continuity (bool) – Whether to use continuity correction when computing p-values.
tie_correct (bool) – Whether to apply tie correction when computing p-values.
alternative (Literal["two-sided", "less", "greater"]) – Type of alternative hypothesis
- Raises:
ValueError – If chunk bounds are inintelligible.
- Returns:
two-sided p-values, U-statistics, fold change. Each of shape (n_groups, chunk_ub - chunk_lb).
- Return type:
tuple[np.ndarray]
Author: Rémy Dubois
- illico.ovo.sparse_ovo.csr_ovo_mwu_kernel_over_contiguous_col_chunk(X, chunk_lb, chunk_ub, grpc, is_log1p, use_continuity=True, tie_correct=True, alternative='two-sided')¶
Perform OVO test over contiguous chunk of column of a CSR matrix.
- Parameters:
X (CSRMatrix) – Input matrix of shape (n_cells, n_genes)
chunk_lb (int) – Lower bound of the vertical slicing
chunk_ub (int) – Upper bound of the vertical slicing
grpc (GroupContainer) – GroupContainer
is_log1p (bool) – User-indicated flag telling if data underwent log1p transform.
use_continuity (bool) – Whether to use continuity correction when computing p-values.
tie_correct (bool) – Whether to apply tie correction when computing p-values.
alternative (Literal["two-sided", "less", "greater"]) – Type of alternative hypothesis
- Raises:
ValueError – If chunk bounds are inintelligible.
- Returns:
two-sided p-values, U-statistics, fold change. Each of shape (n_groups, chunk_ub - chunk_lb).
- Return type:
tuple[np.ndarray]
Author: Rémy Dubois
ovr subpackage¶
Runs in 3.4 seconds for 100 genes, so roughly a 3.4 x 180 = 9 minutes for the whole H1.
- illico.ovr.dense_ovr.dense_ovr_mwu_kernel_over_contiguous_col_chunk(X, chunk_lb, chunk_ub, grpc, is_log1p, use_continuity=True, tie_correct=True, alternative='two-sided')¶
Compute OVR ranksum test on a dense matrix of expression counts.
- Parameters:
X (np.ndarray) – Input dense raw counts matrix
grpc (GroupContainer) – GroupContainer
use_continuity (bool, optional) – Apply continuity factor or not. Defaults to True.
is_log1p (bool, optional) – User-indicated flag telling if data underwent log1p
False. (transformation or not. Defaults to)
use_continuity – Whether to use continuity correction when computing p-values. Defaults to True.
tie_correct (bool, optional) – Whether to apply tie correction when computing p-values. Defaults to True.
alternative (Literal["two-sided", "less", "greater"]) – Type of alternative hypothesis. Defaults to “two-sided”.
chunk_lb (int)
chunk_ub (int)
- Returns:
Two-sided p-values, U-statistic and fold change. Each np.ndarray of shape (n_groups, n_genes)
- Return type:
tuple[np.ndarray]
Author: Rémy Dubois
- illico.ovr.sparse_ovr.sparse_ovr_mwu_kernel(X, groups, group_counts, use_continuity=True, tie_correct=True, alternative='two-sided')¶
Perform OVR ranksum test group wise and column wise on a CSC matrix.
- Parameters:
X (CSCMatrix) – CSC matrix holding expression counts.
groups (np.ndarray) – np.ndarray of shape (n_cells, ) holding encoded group labels.
group_counts (np.ndarray) – Count of cells per group.
use_continuity (bool, optional) – Apply continuity factor or not. Defaults to True.
tie_correct (bool, optional) – Whether to apply tie correction when computing p-values. Defaults to True.
alternative (Literal["two-sided", "less", "greater"]) – Type of alternative hypothesis. Defaults to “two-sided”.
- Returns:
Two-sided p-values and U-statistics, per group and per gene (column).
- Return type:
tuple[np.ndarray]
Author: Rémy Dubois
- illico.ovr.sparse_ovr.csc_ovr_mwu_kernel_over_contiguous_col_chunk(X, chunk_lb, chunk_ub, grpc, is_log1p, use_continuity=True, tie_correct=True, alternative='two-sided')¶
Perform OVR ranksum test over the contiguous column chunk defined by the bounds.
This function only applies to data stored in a CSC matrix.
- Parameters:
X (CSCMatrix) – Input CSC matrix holding expression counts
chunk_lb (int) – Lower bound of the vertial slice
chunk_ub (int) – Upper bound of the vertical slice
grpc (GroupContainer) – GroupContainer
is_log1p (bool) – User-indicated flag telling if data was log1p transformed or not.
use_continuity (bool) – Whether to use continuity correction when computing p-values. Defaults to True.
tie_correct (bool) – Whether to apply tie correction when computing p-values. Defaults to True.
alternative (Literal["two-sided", "less", "greater"]) – Type of alternative hypothesis.
- Raises:
ValueError – If bounds are not intelligible
- Returns:
two-sided p-values, u-statistics and fold changes, each of shape (n_groups, chunk_lb - chunk_ub).
- Return type:
tuple[np.ndarray]
Author: Rémy Dubois
- illico.ovr.sparse_ovr.csr_ovr_mwu_kernel_over_contiguous_col_chunk(X, chunk_lb, chunk_ub, grpc, is_log1p, use_continuity=True, tie_correct=True, alternative='two-sided')¶
Perform OVR ranksum test over the contiguous column chunk defined by the bounds.
This function only applies to data stored in a CSR matrix.
- Parameters:
X (CSRMatrix) – Input CSR matrix holding expression counts
chunk_lb (int) – Lower bound of the vertial slice
chunk_ub (int) – Upper bound of the vertical slice
grpc (GroupContainer) – GroupContainer
is_log1p (bool) – User-indicated flag telling if data was log1p transformed or not.
use_continuity (bool) – Whether to use continuity correction when computing p-values.
tie_correct (bool) – Whether to apply tie correction when computing p-values.
alternative (Literal["two-sided", "less", "greater"]) – Type of alternative hypothesis
- Raises:
ValueError – If bounds are not intelligible
- Returns:
two-sided p-values, u-statistics and fold changes, each of shape (n_groups, chunk_lb - chunk_ub).
- Return type:
tuple[np.ndarray]
Author: Rémy Dubois
utils¶
- class illico.utils.groups.GroupContainer(encoded_groups, counts, indices, indptr, encoded_ref_group)¶
- counts¶
Alias for field number 1
- encoded_groups¶
Alias for field number 0
- encoded_ref_group¶
Alias for field number 4
- indices¶
Alias for field number 2
- indptr¶
Alias for field number 3
- illico.utils.groups.encode_and_count_groups(groups, ref_group)[source]¶
Build the GroupContainer holding all group-related information.
GroupContainer holds: - original group information - reference group (control) - encoded groups - unique raw groups - counts (of cell, per group) - indices, indptr in a RLE format - encoded reference group (control)
- Parameters:
groups (np.ndarray) – 1-d array holding group labels, one per cell
ref_group (Any) – Flag
- Returns:
Array of unique group labels GroupContainer: GroupContainer holding all group-related information.
- Return type:
unique_groups (np.ndarray)
Author: Rémy Dubois
- illico.utils.math.diff(x)¶
Equivalent of np.diff.
For some reasons, np.ndiff failed to compile properly.
- Parameters:
x (np.ndarray) – Input 1-d array
- Returns:
Results diff array, of size x.size - 1.
- Return type:
np.ndarray
Author: Rémy Dubois
- illico.utils.math.compute_pval(n_ref, n_tgt, n, tie_sum, U, mu, contin_corr=0.0, alternative='two-sided')¶
Compute p-value.
This small piece of code was isolated here because it was duplicated in the all six routines.
- Parameters:
n_ref (int) – Number of reference (control) values (cells)
n_tgt (int) – Number of perturbed (targeted) values (cells)
n (int) – Total number of values (cells)
tie_sum (float) – Tie sum
U (float) – U-statistic
mu (float) – Mean
contin_corr (float, optional) – Continuity correction factor. Defaults to 0.0.
alternative (Literal["two-sided", "less", "greater"]) – Type of alternative hypothesis.
- Returns:
P-value
- Return type:
float
Author: Rémy Dubois
- illico.utils.math.sampled_max(data, sample_size=200000)¶
- Parameters:
data (numpy.ndarray)
sample_size (int)
- Return type:
float
- illico.utils.math.fold_change_from_summed_expr(group_agg_counts, grpc)¶
Compute fold change from summed expression values, per group.
- Parameters:
group_agg_counts (np.ndarray) – Sum of expression values of shape (n_groups, n_genes)
grpc (GroupContainer) – GroupContainer holding group information
- Returns:
Fold change values of shape (n_groups, n_genes)
- Return type:
np.ndarray
Author: Rémy Dubois
- illico.utils.math.dense_fold_change(X, grpc, is_log1p)¶
Compute fold change from a dense array of expression counts.
- Parameters:
X (np.ndarray) – Expression counts
grpc (GroupContainer) – GroupContainer holding group information
is_log1p (bool) – User-indicated flag if data is log1p or not.
- Returns:
Fold change values of shape (n_groups, n_genes)
- Return type:
np.ndarray
Author: Rémy Dubois
- illico.utils.math.compute_sparsity(X)[source]¶
Compute sparsity of the data matrix.
- Parameters:
X (np.ndarray | sc_sparse.spmatrix) – Data matrix
- Returns:
Sparsity (fraction of zero elements)
- Return type:
float
Author: Rémy Dubois
- illico.utils.math.chunk_and_fortranize(X, chunk_lb, chunk_ub, indices=None)¶
Vertically chunk the input array and converts it to Fortran-contiguous.
The reason to be of the conversion is that later operations access the columns of this array so F order is advantageous. Also, this function performs one memory allocation instead of 2, which happens if calling np.asfortranarray on top of fancy-indexing.
NB: If indices is None, then all rows are taken as is.
- Parameters:
X (np.ndarray) – Input dense array
chunk_lb (int) – Lower bound of the vertical slicing
chunk_ub (int) – Upper bound of the vertical slicing
indices (np.ndarray) – Indices to reorder rows. There can be less indices than rows in X.
- Returns:
Chunked Fortran-contiguous array with reordered rows.
- Return type:
np.ndarray
Author: Rémy Dubois
- illico.utils.memory.log_memory_usage(data_handler, grpc, batch_size, n_threads)[source]¶
Log estimated memory usage of the whole routine.
This function simply computes estimated memory footprint of the DE genes routine.
- Parameters:
data_handler (DataHandler)
grpc (GroupContainer) – GroupContainer
batch_size (int) – Batch size used for processing genes
n_threads (int) – Number of threads used for processing genes
Author (Rémy Dubois)
- Return type:
None
- illico.utils.ranking.rank_sum_and_ties_from_sorted(A, B)¶
Compute rank sums and tie sums from two 1-d sorted arrays.
This routine is similar to the leetcode “merge two sorted arrays”, except it never returns to sorted array, instead it accumulate rank sums of the second array and tie sums for the combined arrays.
This routine sits at the core of the one-versus-one (or one-versus-control) asymptotic wilcoxon rank sum test as it allows to sort controls only once. :param A: The first sorted array (controls) :type A: np.ndarray :param B: The second sorted array (perturbed) :type B: np.ndarray
- Returns:
Ranks sum from the second array, and tie sums for the combined arrays.
- Return type:
tuple[np.ndarray]
- Parameters:
A (numpy.ndarray)
B (numpy.ndarray)
Author: Rémy Dubois
- illico.utils.ranking.sort_along_axis(X, axis=0)¶
Sort a dense array along a given axis.
- Parameters:
X (np.ndarray) – Input dense array.
axis (int, optional) – Axis along which to sort. Defaults to 0.
- Returns:
Sorted array.
- Return type:
np.ndarray
Author: Rémy Dubois
- illico.utils.ranking.check_if_sorted(arr)¶
Check if an array is sorted. O(n)
- Parameters:
arr (np.ndarray) – 1-d array to check
- Returns:
bool – If sorted or not.
Author (Rémy Dubois)
- Return type:
bool
- illico.utils.ranking.check_indices_sorted_per_parcel(indices, indptr)¶
Check if indices of a sparse array are sorted.
This is esssential if input data is CSR. Indeed, chunking makes use of binary search on indices, which requires sorted indices.
- Parameters:
indices (np.ndarray) – Indices
indptr (np.ndarray) – Indptr
- Returns:
True if all indices subarrays are sorted. False otherwise.
- Return type:
bool
- class illico.utils.registry.KernelDataFormat(value)[source]¶
- DENSE = 'dense'¶
- CSC = 'csc'¶
- CSR = 'csr'¶
- class illico.utils.registry.DispatcherRegistry[source]¶
- register(test, data_format)[source]¶
- Parameters:
test (Test)
data_format (KernelDataFormat)
- get(test, data_format)[source]¶
Return the value for key if key is in the dictionary, else default.
- Parameters:
test (Test)
data_format (KernelDataFormat)
- class illico.utils.registry.DataHandler(data)[source]¶
- abstractmethod input_signature(*args, **kwargs)[source]¶
Return the numba input signature for this handler.
- Return type:
tuple
- abstractmethod to_nb(*args, **kwargs)[source]¶
Convert data to numba-compatible format.
- Return type:
Any
- class illico.utils.registry.DenseDataHandler(data)[source]¶
- class illico.utils.registry.CSRDataHandler(data)[source]¶
-
- classmethod to_nb(X)[source]¶
Convert data to numba-compatible format.
- Parameters:
X (scipy.sparse.csr_matrix)
- Return type:
CSRMatrix
- class illico.utils.registry.CSCDataHandler(data)[source]¶
-
- classmethod to_nb(X)[source]¶
Convert data to numba-compatible format.
- Parameters:
X (scipy.sparse.csc_matrix)
- Return type:
CSCMatrix
- class illico.utils.registry.H5pyDatasetDataHandler(data)[source]¶