# BokehFlow v3: Ultra-Fast Convolutional Recurrence for Real-Time Video Bokeh

DSLR-quality bokeh rendering on 2-4 GB of VRAM: no transformers, no attention, no sequential loops.
| Metric | v1 (broken) | v3 (current) |
|---|---|---|
| Training step (256×256, B=4) | 220 s | ~50 ms |
| Speedup | 1× | ~4,400× |
| VRAM (1080p) | OOM | ~1.8 GB |
## What Changed in v3?
v1 used a sequential Python for-loop to process 4,096 tokens one-by-one through a GatedDeltaNet recurrence. This required 131,072 Python iterations per batch (4,096 tokens × 4 scan directions × 8 blocks), each doing small matrix multiplications. The GPU sat idle ~99% of the time waiting for Python.
v3 replaces the sequential recurrence with Gated Convolutional Recurrence: depthwise conv cascades that compute the same spatial mixing patterns in parallel via cuDNN. Two 7×7 depthwise convs give an effective receptive field of 13 pixels per direction (equivalent to a 13-step recurrence), but computed in a single GPU kernel call.
### Key Insight
For 2D images, a depthwise conv kernel *is* a fixed-window recurrence: the kernel weights are the recurrence coefficients, applied in parallel. A cascade of convs approximates the exponential decay of a gated recurrence. Same math, 100% GPU utilization.
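To make this concrete, here is a tiny pure-Python sketch (toy 1D, single channel, not the model's code) showing that a K-step truncated decaying recurrence and a causal convolution whose kernel holds the decay weights compute identical outputs:

```python
# A truncated recurrence h_t = x_t + a*x_{t-1} + ... + a^(K-1)*x_{t-K+1}
# equals a causal convolution with fixed kernel [1, a, a^2, ..., a^(K-1)].

def truncated_recurrence(x, a, K):
    """Sum the last K inputs with exponential decay a (the 'recurrent' view)."""
    return [sum((a ** k) * x[t - k] for k in range(K) if t - k >= 0)
            for t in range(len(x))]

def conv_equivalent(x, a, K):
    """Causal convolution with the decay coefficients as a fixed kernel."""
    kernel = [a ** k for k in range(K)]  # recurrence coefficients
    return [sum(w * x[t - k] for k, w in enumerate(kernel) if t - k >= 0)
            for t in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0]
assert truncated_recurrence(x, 0.5, 3) == conv_equivalent(x, 0.5, 3)
# both give [1.0, 2.5, 4.25, 6.0]
```

The conv version has no sequential dependency between positions, which is exactly what lets cuDNN compute every output pixel in one parallel kernel launch.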
## Architecture

```
INPUT: RGB (H×W×3) + Camera params (f-number, focal_length, focus_distance)
        ↓
ConvStem: 3→48→96 channels, stride-4 (GroupNorm, no BatchNorm)
        ↓
┌───────────────────────────────────────────────────┐
│  Dual-Stream Encoder (6 blocks each)              │
│                                                   │
│  Depth Stream           Bokeh Stream              │
│  ┌──────────────┐       ┌──────────────────────┐  │
│  │ GatedConvRec │       │ GatedConvRec + ACFM  │  │
│  │ DWConv×2→PW  │       │ (f-stop conditioned) │  │
│  │ + SiLU gate  │       │                      │  │
│  │ + FFN        │       │                      │  │
│  └──────┬───────┘       └──────────┬───────────┘  │
│         └────── CrossFusion ───────┘              │
│              (every 2 blocks)                     │
└───────────────────────────────────────────────────┘
        ↓                       ↓
   DepthHead             BokehHead + PG-CoC
  (→ depth map)    (physics blur + learned residual)
        ↓
OUTPUT: Bokeh frame (H×W×3) + Depth map (H×W×1)
```
## Core Block: GatedConvRecurrence

```
x → GroupNorm → DWConv7×7 → SiLU → DWConv7×7 → PW Conv → × sigmoid(gate) → + residual
  → GroupNorm → FFN → + residual
```
- Depthwise conv cascade: 2× DWConv(7×7) gives a 13px effective RF per block; stacking 6 blocks compounds to a 73px RF (6×12 + 1), covering the full 64×64 feature map.
- SiLU gating: learned per-channel gate controls spatial mixing strength (analogous to the decay α in a gated recurrence).
- Zero-init residual: PW conv and FFN output layers initialized to zero for a stable training start.
- GroupNorm(8) everywhere: works at any batch size, including 1.
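Put together, the block above might look like this in PyTorch. This is a minimal sketch: layer names, dimensions, and the gate's input are assumptions; `bokehflow_v3.py` has the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvRecurrence(nn.Module):
    """Sketch: GN → DWConv7×7 → SiLU → DWConv7×7 → PW → ×σ(gate) → +res,
    then GN → FFN → +res."""

    def __init__(self, dim=96, ffn_mult=2):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, dim)
        self.dw1 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # depthwise
        self.dw2 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # depthwise
        self.pw = nn.Conv2d(dim, dim, 1)    # pointwise channel mix
        self.gate = nn.Conv2d(dim, dim, 1)  # per-channel gate
        self.norm2 = nn.GroupNorm(8, dim)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * ffn_mult, 1), nn.SiLU(),
            nn.Conv2d(dim * ffn_mult, dim, 1))
        # Zero-init both residual branches: the block is an identity at init.
        nn.init.zeros_(self.pw.weight); nn.init.zeros_(self.pw.bias)
        nn.init.zeros_(self.ffn[-1].weight); nn.init.zeros_(self.ffn[-1].bias)

    def forward(self, x):
        h = self.norm1(x)
        h = self.dw2(F.silu(self.dw1(h)))  # conv cascade: 13px effective RF
        x = x + self.pw(h) * torch.sigmoid(self.gate(h))
        return x + self.ffn(self.norm2(x))
```

Because both residual branches are zero-initialized, a freshly constructed block maps any input to itself, so stacking many of them cannot destabilize early training.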
## Physics-Guided CoC (PG-CoC)

Real thin-lens formula:

```
CoC(x,y) = |f² / (N·(S₁ − f))| · |D(x,y) − S₁| / D(x,y)
```

where f is the focal length, N the f-number, S₁ the focus distance, and D(x,y) the per-pixel depth.
5-level Gaussian blur pyramid interpolated by per-pixel CoC value. Differentiable, physically correct, and fast.
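Plugging numbers into the thin-lens formula above (a pure-Python check; the millimeter unit handling is an assumption):

```python
def coc_mm(f_mm, n, focus_m, depth_m):
    """Circle-of-confusion diameter in mm for a pixel at depth_m meters."""
    s1 = focus_m * 1000.0   # focus distance S1, in mm
    d = depth_m * 1000.0    # per-pixel depth D, in mm
    return abs(f_mm ** 2 / (n * (s1 - f_mm))) * abs(d - s1) / d

# A 50 mm lens at f/2.0 focused at 2 m:
print(coc_mm(50.0, 2.0, 2.0, 2.0))    # 0.0   (in-focus plane, no blur)
print(coc_mm(50.0, 2.0, 2.0, 10.0))   # ~0.51 (background at 10 m, strong blur)
```

The in-focus plane gets exactly zero CoC and blur grows continuously with distance from it, which is what drives the per-pixel interpolation across the blur pyramid.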
## ACFM (Aperture-Conditioned FiLM)

Camera params → MLP → per-channel scale & shift. One model handles any f-stop/focal-length/focus-distance. Zero-initialized so the model starts as identity on camera params.
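A FiLM conditioner of this shape might be sketched as follows (the hidden width and layer names are assumptions, not the repo's code):

```python
import torch
import torch.nn as nn

class ACFM(nn.Module):
    """Aperture-conditioned FiLM: camera params → MLP → per-channel scale & shift."""

    def __init__(self, dim=96, n_cam_params=3, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_cam_params, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * dim))
        # Zero-init: scale = shift = 0, so modulation starts as the identity.
        nn.init.zeros_(self.mlp[-1].weight)
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, feat, cam):
        # feat: (B, C, H, W); cam: (B, 3) = (f_number, focal_length, focus_dist)
        scale, shift = self.mlp(cam).chunk(2, dim=-1)
        return feat * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```

Feeding different camera vectors through the same weights is what lets one checkpoint render any f-stop instead of training one model per aperture.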
## Model Variants
| Variant | Params | VRAM (est., 1080p) | Training speed (256×256) |
|---|---|---|---|
| Nano | 254K | ~0.8 GB | ~30ms/step |
| Small | 1.16M | ~1.8 GB | ~50ms/step |
| Base | ~4.6M | ~3.2 GB | ~100ms/step |
## Files

| File | Description |
|---|---|
| `bokehflow_v3.py` | Architecture code (standalone, no dependencies beyond PyTorch) |
| `train_v3.py` | Self-contained training script (model + dataset + training loop) |
| `bokehflow.py` | Original v1 architecture (⚠️ too slow to train; kept for reference) |
| `ARCHITECTURE.md` | Detailed design document with math |
| `AUDIT.md` | Known issues in v1 |
## Quick Start

```python
import torch
from bokehflow_v3 import BokehFlow, BokehFlowConfig

config = BokehFlowConfig(variant="small")
model = BokehFlow(config).cuda()

image = torch.rand(1, 3, 720, 1280, device='cuda')
output = model(
    image,
    f_number=torch.tensor([2.0], device='cuda'),
    focal_length_mm=torch.tensor([50.0], device='cuda'),
    focus_distance_m=torch.tensor([2.0], device='cuda'),
)
bokeh = output['bokeh']  # (1, 3, 720, 1280) rendered bokeh frame
depth = output['depth']  # (1, 1, 720, 1280) predicted depth map
```
## Training

```bash
# Quick test (200 scenes, 3 epochs, ~5 min on a T4)
VARIANT=small MAX_SCENES=200 EPOCHS=3 BATCH_SIZE=4 python train_v3.py

# Full training (all 3,960 scenes, 10 epochs)
VARIANT=small EPOCHS=10 BATCH_SIZE=8 LR=2e-4 python train_v3.py
```

Requirements: `pip install torch torchvision Pillow huggingface_hub trackio`

Dataset: `timseizinger/RealBokeh_3MP` (auto-downloaded).
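The commands above configure the script entirely through environment variables. A minimal sketch of how `train_v3.py` might read them (the helper name and defaults shown here are assumptions):

```python
import os

def env(name, default, cast=str):
    """Read one config value from an environment variable, falling back to a default."""
    return cast(os.environ.get(name, default))

# Matches the variables used in the training commands above.
variant = env("VARIANT", "small")
max_scenes = env("MAX_SCENES", "3960", int)
epochs = env("EPOCHS", "10", int)
batch_size = env("BATCH_SIZE", "8", int)
lr = env("LR", "2e-4", float)
```

Env-var configuration keeps the script dependency-free (no argparse wiring) and makes runs easy to reproduce from a single shell line.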
## Why Phone Bokeh Looks Fake (and How We Fix It)
| Failure | Phone Approach | BokehFlow Fix |
|---|---|---|
| Sharp matted edges | Binary segmentation | Continuous per-pixel CoC from dense depth |
| Color bleeding | No occlusion awareness | Physics-guided layered compositing |
| Missing specular highlights | Gaussian blur | Disk-shaped PSF kernels |
| Flat blur gradient | 2-3 depth planes | Per-pixel continuous CoC |
| Temporal flicker | Per-frame independent | Recurrent state propagation (future v3+) |
## Research Foundation

Built on insights from:

- GatedDeltaNet (arXiv:2412.06464): gated delta rule recurrence
- HGRN-2 (arXiv:2404.07904): hierarchical gate lower bounds
- MambaIRv2 (arXiv:2411.15269): multi-direction scan redundancy analysis
- Bokehlicious (arXiv:2503.16067): aperture-conditioned bokeh
- Dr.Bokeh (arXiv:2308.08843): physics-guided layered rendering
- ConvNeXt (arXiv:2201.03545): large-kernel depthwise conv effectiveness
## License
Apache 2.0