Changelog#
[0.0.6] - 2026-05-31#
Added#
- Batch-wise fitting API (`causarray/gcate.py`, `causarray/DR_learner.py`, `causarray/utils.py`)
  - `fit_gcate_batch(Y, X, A, r, batch_size=10, max_cells=2000, n_ctrl=2000, ...)`: Fits GCATE independently on batches of perturbations. A shared control subsample of `n_ctrl` cells is reused across batches; dispersion is pre-estimated once on the control pool. Supports `skip_batches` to resume interrupted runs; reports per-batch wall time and ETA when `verbose=True`.
  - `gcate_lfc_batch(Y, X, A, r, batch_size=10, max_cells=2000, n_ctrl=2000, cache_path=None, ...)`: End-to-end batch pipeline that runs GCATE and LFC per batch, freeing large intermediate arrays after each batch. `cache_path` enables HDF5 disk caching via `pandas.HDFStore` so interrupted runs resume from the last completed batch. Returns a concatenated DataFrame with a `'batch'` column.
  - `LFC_batch(...)`: deprecated alias for `gcate_lfc_batch`; emits `DeprecationWarning` and will be removed in a future release.
  - `n_batches` parameter for `fit_gcate_batch` and `gcate_lfc_batch`: specifies the total number of batches instead of the per-batch count; overrides `batch_size` when set.
  - `estimate_r(max_cells=N, random_state=0)`: new parameter that automatically subsamples to at most `N` cells before running JIC selection, prioritising control cells.
- Fast GLM backend via crispyx (`nb_glm_fast.py`, `gcate_glm.py`)
  - `fit_glm_fast()`: Batch NB-GLM fitting using crispyx's `NBGLMBatchFitter`, replacing per-gene statsmodels IRLS with vectorized batch IRLS.
  - `estimate_disp_fast()`: Vectorized method-of-moments dispersion estimation.
  - `fit_glm_ondisk()`: On-disk streaming GLM fitting for large h5ad files.
  - Per-perturbation fitting (`_fit_glm_fast_per_perturbation`): for multi-treatment data, fits binary (ctrl vs. treatment_k) models independently, then assembles the full coefficient matrix.
  - `fit_glm_auto()`: Routes to `fit_glm_fast()` when crispyx is available and the effective design dimension is small; falls back to statsmodels otherwise.
  - `estimate_disp_auto()`: Routes to `estimate_disp_fast()` for large gene counts; falls back to statsmodels otherwise.
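The batching scheme described for `fit_gcate_batch` can be sketched in plain Python. This is an illustration under stated assumptions, not causarray's implementation: `plan_batches` and `sample_controls` are hypothetical helpers, and the even-split rule used when `n_batches` is set is a guess at one reasonable behaviour.

```python
import random


def plan_batches(n_perts, batch_size=10, n_batches=None):
    """Split perturbation indices 0..n_perts-1 into contiguous batches.

    Hypothetical helper. When n_batches is given it overrides batch_size
    and the perturbations are split as evenly as possible.
    """
    if n_batches is not None:
        q, rem = divmod(n_perts, n_batches)
        sizes = [q + 1] * rem + [q] * (n_batches - rem)
    else:
        full, rem = divmod(n_perts, batch_size)
        sizes = [batch_size] * full + ([rem] if rem else [])
    batches, start = [], 0
    for size in sizes:
        batches.append(list(range(start, start + size)))
        start += size
    return batches


def sample_controls(ctrl_idx, n_ctrl=2000, seed=0):
    """Draw one control subsample that every batch reuses (hypothetical helper)."""
    ctrl_idx = list(ctrl_idx)
    if len(ctrl_idx) <= n_ctrl:
        return sorted(ctrl_idx)
    return sorted(random.Random(seed).sample(ctrl_idx, n_ctrl))


batches = plan_batches(200, n_batches=14)          # n_batches overrides batch_size
controls = sample_controls(range(5000), n_ctrl=2000)

for b, perts in enumerate(batches):
    # Each batch would be fitted on `controls` plus the cells of `perts`.
    print(f"batch {b}: {len(perts)} perturbations, {len(controls)} shared controls")
```

Drawing the control subsample once, outside the loop, is what lets quantities estimated on the control pool (such as dispersion) be reused across batches.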
Fixed#
- Numba TBB fork warning: Set `NUMBA_THREADING_LAYER_PRIORITY` to prefer OpenMP over TBB in `__init__.py`, eliminating fork warnings when Joblib forks after Numba parallel execution. Added `llvm-openmp` to conda dependencies.
- Fast-path threshold (`gcate_glm.py`): Raised the effective design-dimension ceiling so larger `r` values and wide batch designs correctly use the crispyx path.
- Backend toggle (`gcate_glm.py`): Re-added `_USE_FAST_BACKEND` module flag and `_backend_override()` context manager for reliable statsmodels fallback.
- Weighted dispersion (`nb_glm_fast.py`): Dispersion averaging is now cell-count-weighted; low-coverage perturbations contribute proportionally less.
- Control-cell residuals (`nb_glm_fast.py`): Fixed last-perturbation overwrite bug; control-cell deviance residuals and fitted values are now initialised from the global covariate model.
- Module-qualified imports (`gcate_opt.py`, `gcate.py`, `DR_estimation.py`): Backend toggles now propagate correctly at call time.
- `estimate_r` bare name (`gcate.py`): Fixed `NameError` caused by a bare `fit_glm_auto` reference.
- crispyx availability check (`gcate_glm.py`): Users without crispyx now get a transparent fallback to statsmodels instead of a traceback.
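For readers who want to apply the same threading-layer preference in their own scripts, a minimal sketch follows. `NUMBA_THREADING_LAYER_PRIORITY` is a documented Numba environment variable taking a space-separated list of layers; the exact order string below and the helper name `prefer_openmp` are assumptions, not a copy of what causarray's `__init__.py` does.

```python
import os


def prefer_openmp(environ=os.environ):
    """Put OpenMP first in Numba's threading-layer priority.

    Hypothetical helper. Uses setdefault so a value the user has already
    exported is left untouched.
    """
    environ.setdefault("NUMBA_THREADING_LAYER_PRIORITY", "omp tbb workqueue")
    return environ["NUMBA_THREADING_LAYER_PRIORITY"]


# Call this before the first `import numba` so the setting is in place
# when Numba reads its configuration.
prefer_openmp()
```

The OpenMP layer needs an OpenMP runtime to be installed, which is consistent with `llvm-openmp` being added to the conda dependencies.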
Changed#
- ⚠️ `alter_min()` early-stopping defaults (`gcate_opt.py`):
  - Default `kwargs_es['max_iters']` reduced from 500 → 50.
  - Default `tolerance` reduced from `1e-3` → `0.0`; new scale-invariant `rel_tol=2e-4` introduced. To reproduce pre-v0.0.6 behavior, pass `kwargs_es_1=dict(max_iters=500)` and `kwargs_es_2=dict(max_iters=500)`.
- ⚠️ BREAKING: `LFC()` variance and default `usevar` (`DR_learner.py`):
  - Default `usevar` changed from `'pooled'` to `'unequal'` (Welch). Revert with `LFC(..., usevar='pooled')` if reproducing pre-v0.0.6 results.
  - `'unequal'` formula corrected: variance is now `s₀²/n₀ + s₁²/n₁` (standard Welch); the prior version used `(s₀²/n₀ + s₁²/n₁)/2` ("half-Welch"), under-estimating the standard error by a factor of √2.
  - p-values now use the t-distribution with Welch-Satterthwaite degrees of freedom per gene; the prior version used a Normal approximation.
- `alter_min()` initialisation, `_check_input()`, `estimate_r()`, and `cross_fitting()` now use the auto-dispatch GLM/dispersion paths.
- `LFC()` now accepts `backend: str = "auto"` (`"fast"` forces crispyx, `"original"` forces statsmodels).
- `comp_size_factor()` vectorized with `np.nanmean`/`np.nanmedian`.
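To make the `'unequal'` variance correction concrete, here is a small standard-library sketch of the Welch standard error, the Welch-Satterthwaite degrees of freedom, and the earlier "half-Welch" variance for comparison. The function name `welch_se_df` and the toy samples are illustrative assumptions; how `LFC()` forms the per-arm variances internally is not shown here.

```python
from math import sqrt
from statistics import variance


def welch_se_df(x0, x1):
    """Welch standard error and Welch-Satterthwaite degrees of freedom.

    Hypothetical helper for two samples (control x0, treated x1).
    """
    n0, n1 = len(x0), len(x1)
    v0 = variance(x0) / n0          # s0^2 / n0
    v1 = variance(x1) / n1          # s1^2 / n1
    se = sqrt(v0 + v1)              # standard Welch: no division by 2
    df = (v0 + v1) ** 2 / (v0 ** 2 / (n0 - 1) + v1 ** 2 / (n1 - 1))
    return se, df


x0 = [1.0, 2.0, 3.0, 4.0, 5.0]            # toy control values
x1 = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]     # toy treated values

se, df = welch_se_df(x0, x1)

# The pre-0.0.6 "half-Welch" variance halves the sum, which shrinks the SE by sqrt(2).
half_se = sqrt((variance(x0) / len(x0) + variance(x1) / len(x1)) / 2)

print(f"Welch SE = {se:.4f}, df = {df:.2f}, SE ratio = {se / half_se:.4f}")
```

Because the old standard error was too small by √2, test statistics computed with it were too large by the same factor, so results obtained with `usevar='unequal'` before v0.0.6 are expected to look more significant than corrected ones.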
Performance#
Benchmarked on Perturb-seq data (n = 2,926 cells, p = 3,221 genes, 29 perturbations):
| Component | Original | Fast | Speedup |
|---|---|---|---|
| GCATE | 331.6 s | 298.5 s | 1.1× |
| LFC | 87.8 s | 65.7 s | 1.3× |
| Total | 419.3 s | 364.2 s | 1.2× |
On synthetic data (n = 500, p = 200): 61.5× GLM fit speedup, 7.1× imputation speedup. Latent factor recovery: mean canonical correlation 0.998. LFC correlation: 0.856.
Additional LFC throughput improvements on the Replogle tutorial dataset (79,865 cells × 8,563 genes, 200 perturbations, 14 batches):
| Change | Speedup contribution | Accuracy impact |
|---|---|---|
| Stage 1 | −10 min | identical (r=1.000) |
| Stage 1 ≤3,000-cell mixed subsample | −55 min | tau r=0.992, Jaccard=0.80 |
| Stage 2 joint fit | −5 min | tau r=0.9994, Jaccard=0.975 |
| Combined | −70.6 min / 1.48× | tau r=0.9994, Jaccard=0.975 |
Full-run: 217.5 min → 146.9 min (1.48× faster); sig pairs −0.2%, perts with ≥1 hit −0.6%.
[0.0.5] - 2025-01-30#
- GCATE model for gene-level causal effect estimation from CRISPR screens.
- Doubly-robust learner (LFC, VIM) with AIPW pseudo-outcomes.
- Alternating minimization with Numba+OpenMP acceleration.
- Statsmodels-based per-gene NB-GLM fitting.
- Multiple testing correction (BH, step-down, FDX).