# 🎬 BokehFlow v3: Ultra-Fast Convolutional Recurrence for Real-Time Video Bokeh

DSLR-quality bokeh rendering in 2-4 GB of VRAM: no transformers, no attention, no sequential loops.

| Metric | v1 (broken) | v3 (current) |
|---|---|---|
| Training step (256×256, B=4) | 220 s | ~50 ms |
| Speedup | 1× | ~4,400× |
| VRAM (1080p) | OOM | ~1.8 GB |

## What Changed in v3?

v1 used a sequential Python for-loop to process 4,096 tokens one-by-one through a GatedDeltaNet recurrence. This required 131,072 Python iterations per batch (4,096 tokens × 4 scan directions × 8 blocks), each doing small matrix multiplications. The GPU sat idle ~99% of the time waiting for Python.

v3 replaces the sequential recurrence with Gated Convolutional Recurrence: depthwise conv cascades that compute the same spatial mixing patterns in parallel via cuDNN. Two 7×7 depthwise convs give an effective receptive field of 13 pixels per direction (equivalent to a 13-step recurrence), but computed in a single GPU kernel call.

### Key Insight

For 2D images, a depthwise conv kernel IS a fixed-window recurrence: the kernel weights are the recurrence coefficients applied in parallel. A cascade of convs approximates the exponential decay of a gated recurrence. Same math, 100% GPU utilization.
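The equivalence is easy to check numerically. Below is a minimal 1D sketch (not from the repo; the decay weights `w` are hypothetical stand-ins for learned gate values): an unrolled fixed-window recurrence and a single causal convolution with the same coefficients produce identical outputs.

```python
# Sketch: a fixed-window recurrence and a conv whose kernel holds the
# recurrence coefficients compute the same result; the conv just runs
# as one parallel kernel call instead of a Python loop.
import torch
import torch.nn.functional as F

T, K = 32, 7
x = torch.randn(T)
w = 0.5 ** torch.arange(K, dtype=torch.float32)  # exponential-decay coefficients

# Sequential recurrence: y[t] = sum_k w[k] * x[t - k], zero-padded past
y_loop = torch.zeros(T)
for t in range(T):
    for k in range(min(K, t + 1)):
        y_loop[t] += w[k] * x[t - k]

# Identical result from a single causal conv (flip kernel, left-pad input)
x_pad = F.pad(x.view(1, 1, T), (K - 1, 0))
y_conv = F.conv1d(x_pad, w.flip(0).view(1, 1, K)).view(T)

print(torch.allclose(y_loop, y_conv, atol=1e-5))  # True
```

The same reasoning extends to 2D: a 7×7 depthwise kernel plays the role of the per-direction recurrence window.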


## Architecture

```
INPUT: RGB (H×W×3) + Camera params (f-number, focal_length, focus_distance)
    ↓
ConvStem: 3→48→96 channels, stride-4 (GroupNorm, no BatchNorm)
    ↓
┌─────────────────────────────────────────────────┐
│ Dual-Stream Encoder (6 blocks each)             │
│                                                 │
│ Depth Stream         Bokeh Stream               │
│ ┌──────────────┐     ┌──────────────────────┐   │
│ │ GatedConvRec │     │ GatedConvRec + ACFM  │   │
│ │ DWConv×2→PW  │     │ (f-stop conditioned) │   │
│ │ + SiLU gate  │     │                      │   │
│ │ + FFN        │     │                      │   │
│ └──────┬───────┘     └──────────┬───────────┘   │
│        └──── CrossFusion ───────┘               │
│              (every 2 blocks)                   │
└─────────────────────────────────────────────────┘
    ↓                    ↓
DepthHead            BokehHead + PG-CoC
(→ depth map)        (physics blur + learned residual)
    ↓
OUTPUT: Bokeh frame (H×W×3) + Depth map (H×W×1)
```

### Core Block: GatedConvRecurrence

```
x → GroupNorm → DWConv 7×7 → SiLU → DWConv 7×7 → PW Conv → × sigmoid(gate) → + residual
                                                                                  ↓
                                                              → GroupNorm → FFN → + residual
```
  • Depthwise conv cascade: 2Γ— DWConv(7Γ—7) = 13px effective RF per block. 6 blocks = 78px = covers full 64Γ—64 feature map.
  • SiLU gating: Learned per-channel gate controls spatial mixing strength (analogous to Ξ± in recurrence).
  • Zero-init residual: PW conv and FFN output layers initialized to zero for stable training start.
  • GroupNorm(8) everywhere β€” works at any batch size including 1.

### Physics-Guided CoC (PG-CoC)

Real thin-lens formula: `CoC(x, y) = |f² / (N·(S₁ − f))| · |D(x, y) − S₁| / D(x, y)`, where f is the focal length, N the f-number, S₁ the focus distance, and D(x, y) the predicted depth.

A 5-level Gaussian-blur pyramid is interpolated per pixel by the CoC value: differentiable, physically grounded, and fast.
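A hedged sketch of the idea, assuming the formula above: the function and variable names are hypothetical, and cheap average-pool blurs stand in for the Gaussian pyramid levels used by the real module.

```python
# Sketch of physics-guided CoC: per-pixel thin-lens CoC, then a 5-level
# blur pyramid linearly interpolated by the CoC value. Illustrative only.
import torch
import torch.nn.functional as F

def coc_map(depth_m, f_mm=50.0, N=2.0, focus_m=2.0):
    f = f_mm / 1000.0                                   # focal length in meters
    scale = abs(f * f / (N * (focus_m - f)))            # |f^2 / (N * (S1 - f))|
    return scale * (depth_m - focus_m).abs() / depth_m  # CoC, same units as f

depth = torch.rand(1, 1, 64, 64) * 9.0 + 1.0            # depths in 1-10 m
img = torch.rand(1, 3, 64, 64)
coc = coc_map(depth)

# 5-level pyramid of increasingly blurred copies (avg-pool as a cheap blur)
levels = [img]
for _ in range(4):
    levels.append(F.avg_pool2d(levels[-1], 3, stride=1, padding=1))

# Map CoC to a fractional level index; each pixel linearly blends the two
# adjacent blur levels, so the whole pipeline stays differentiable
idx = (coc / coc.max().clamp(min=1e-8) * 4).clamp(0, 4)
out = sum(lvl * (1 - (idx - l).abs()).clamp(min=0) for l, lvl in enumerate(levels))
```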

### ACFM (Aperture-Conditioned FiLM)

Camera params → MLP → per-channel scale & shift. One model handles any f-stop/focal-length/focus-distance. Zero-initialized so the model starts as identity on camera params.
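A minimal FiLM sketch of that conditioning path (class and parameter names are illustrative, not the repo's): a small MLP maps the three camera parameters to per-channel scale and shift, and zero-initializing its output layer makes the module an exact identity at the start of training.

```python
# Sketch of aperture-conditioned FiLM (ACFM). Illustrative names only.
import torch
import torch.nn as nn

class ACFM(nn.Module):
    def __init__(self, dim: int, n_params: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_params, 64), nn.SiLU(), nn.Linear(64, 2 * dim)
        )
        # Zero-init the output layer: scale and shift start at 0, so the
        # module is an identity on features until conditioning is learned
        nn.init.zeros_(self.mlp[-1].weight)
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, x, cam):                # x: (B, C, H, W), cam: (B, 3)
        scale, shift = self.mlp(cam).chunk(2, dim=-1)
        return x * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

film = ACFM(96)
x = torch.randn(2, 96, 32, 32)
cam = torch.tensor([[2.0, 50.0, 2.0],         # f-number, focal mm, focus m
                    [8.0, 35.0, 5.0]])
y = film(x, cam)                              # at init, y == x
```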


## Model Variants

| Variant | Params | VRAM (est., 1080p) | Training speed (256×256) |
|---|---|---|---|
| Nano | 254K | ~0.8 GB | ~30 ms/step |
| Small | 1.16M | ~1.8 GB | ~50 ms/step |
| Base | ~4.6M | ~3.2 GB | ~100 ms/step |

## Files

| File | Description |
|---|---|
| `bokehflow_v3.py` | Architecture code (standalone, no dependencies beyond PyTorch) |
| `train_v3.py` | Self-contained training script (model + dataset + training loop) |
| `bokehflow.py` | Original v1 architecture (⚠️ too slow to train; kept for reference) |
| `ARCHITECTURE.md` | Detailed design document with math |
| `AUDIT.md` | Known issues in v1 |

## Quick Start

```python
import torch
from bokehflow_v3 import BokehFlow, BokehFlowConfig

config = BokehFlowConfig(variant="small")
model = BokehFlow(config).cuda()

image = torch.rand(1, 3, 720, 1280, device='cuda')
output = model(
    image,
    f_number=torch.tensor([2.0], device='cuda'),
    focal_length_mm=torch.tensor([50.0], device='cuda'),
    focus_distance_m=torch.tensor([2.0], device='cuda'),
)

bokeh = output['bokeh']   # (1, 3, 720, 1280) - rendered bokeh
depth = output['depth']   # (1, 1, 720, 1280) - predicted depth
```

## Training

```bash
# Quick test (200 scenes, 3 epochs, ~5 min on a T4)
VARIANT=small MAX_SCENES=200 EPOCHS=3 BATCH_SIZE=4 python train_v3.py

# Full training (all 3,960 scenes, 10 epochs)
VARIANT=small EPOCHS=10 BATCH_SIZE=8 LR=2e-4 python train_v3.py
```

Requirements: `pip install torch torchvision Pillow huggingface_hub trackio`

Dataset: `timseizinger/RealBokeh_3MP` (auto-downloaded).


## Why Phone Bokeh Looks Fake (and How We Fix It)

| Failure | Phone Approach | BokehFlow Fix |
|---|---|---|
| Sharp matted edges | Binary segmentation | Continuous per-pixel CoC from dense depth |
| Color bleeding | No occlusion awareness | Physics-guided layered compositing |
| Missing specular highlights | Gaussian blur | Disk-shaped PSF kernels |
| Flat blur gradient | 2-3 depth planes | Per-pixel continuous CoC |
| Temporal flicker | Per-frame independent | Recurrent state propagation (future v3+) |

## Research Foundation

Built on insights from:

  • GatedDeltaNet (arXiv:2412.06464) β€” gated delta rule recurrence
  • HGRN-2 (arXiv:2404.07904) β€” hierarchical gate lower bounds
  • MambaIRv2 (arXiv:2411.15269) β€” multi-direction scan redundancy analysis
  • Bokehlicious (arXiv:2503.16067) β€” aperture-conditioned bokeh
  • Dr.Bokeh (arXiv:2308.08843) β€” physics-guided layered rendering
  • ConvNeXt (arXiv:2201.03545) β€” large-kernel depthwise conv effectiveness

## License

Apache 2.0
