PGNet uses dynamic image sizes per batch (max_text_size=512), which spikes
peak memory above 32GB at batch 32. Settle on 16: ~2x throughput vs the
default 14 while staying well clear of OOM.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit only updated Train.loader.batch_size_per_card; PGProcessTrain
still expected batch_size=14, which would mismatch the dataloader and silently drop samples.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
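
A minimal sketch of a check that could catch the batch-size mismatch described above before training starts. It assumes the config has already been loaded into a plain dict and that PGProcessTrain sits under Train.dataset.transforms as a one-key mapping; the function name and the example values are hypothetical, not part of this repo.

```python
def find_batch_size_mismatches(config):
    """Return (loader_batch_size, [mismatching PGProcessTrain batch sizes])."""
    train = config["Train"]
    loader_bs = train["loader"]["batch_size_per_card"]
    mismatches = []
    for transform in train["dataset"]["transforms"]:
        # Assumption: each transform is a one-key mapping {name: params}.
        params = transform.get("PGProcessTrain")
        if params is not None and params.get("batch_size") != loader_bs:
            mismatches.append(params.get("batch_size"))
    return loader_bs, mismatches


# Hypothetical config fragment mirroring the state before this fix:
# the loader was updated, the transform was not.
example = {
    "Train": {
        "dataset": {
            "transforms": [
                {"DecodeImage": None},
                {"PGProcessTrain": {"batch_size": 14}},
            ]
        },
        "loader": {"batch_size_per_card": 16},
    }
}

loader_bs, mismatches = find_batch_size_mismatches(example)
if mismatches:
    print(f"loader uses {loader_bs}, PGProcessTrain uses {mismatches}")
```

Running something like this ahead of tools/train.py would turn the silent sample drop into a visible warning.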
- config: batch_size_per_card 14 -> 32 (5090 32GB headroom)
- setup_server.sh: pin nvidia-cudnn-cu13>=9.17 to match the sm_120 wheel
  (without it conv2d hits "Cannot load symbol cublasLtCreate" abort)
- new scripts/recreate_container.sh: one-shot rebuild with --shm-size 8g,
  preserves /root/.netrc so wandb auth survives, runs setup_server.sh
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Global.use_wandb: True + top-level wandb.project=kr_lp_pgnet
- Add wandb to setup_server.sh pip install list
User must run `docker exec -it kr_lp_pgnet wandb login` once before
training so the API key lands in /root/.netrc inside the container.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
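
A minimal sketch of a pre-flight check for the login step above, to be run inside the container. It assumes `wandb login` stores the key in /root/.netrc under the host api.wandb.ai; the helper name is hypothetical and nothing in setup_server.sh calls it.

```python
import netrc
import os


def has_wandb_credentials(netrc_path="/root/.netrc", host="api.wandb.ai"):
    """Return True if netrc_path holds a login entry for host."""
    if not os.path.isfile(netrc_path):
        return False
    try:
        # authenticators() returns None when no entry matches the host.
        entry = netrc.netrc(netrc_path).authenticators(host)
    except netrc.NetrcParseError:
        return False
    return entry is not None


if not has_wandb_credentials():
    print("No wandb credentials found; run `wandb login` in the container first.")
```

The check only confirms that an entry exists. It does not verify that the stored API key is still valid.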
- run_step1.sh: symlinks /workspace/train_data into PaddleOCR, runs
  tools/train.py with the step1 pretrain checkpoint, supports DRY_RUN=1
  for a quick smoke test and an EPOCHS=N override
- epoch_num: 200 -> 50 (matches the 50k synthetic budget)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>