Dr. Post-Training
This is the official implementation of Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training.
Getting Started
The SFT and RLHF trainers are implemented in plain PyTorch, without advanced distributed training frameworks (e.g., DeepSpeed, FairScale, or Hugging Face Accelerate), to maximize clarity and ease of understanding. For large-scale training, the RLVR experiment provides an implementation built on VERL (Ray-based distributed RL) with vLLM for fast generation.
# Clone with submodules
git clone --recursive https://github.com/TRAIS-Lab/Dr.Post-Training.git
# Or if already cloned, initialize submodules
git submodule update --init --recursive
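To confirm that the submodules were actually populated, one option is to filter the output of git submodule status, which prefixes uninitialized submodules with a minus sign. The sketch below is illustrative only; both function names are hypothetical and not part of this repository.

```bash
# Hypothetical helper: keep only the paths of uninitialized submodules.
# `git submodule status` starts a line with "-" when the submodule
# has not been initialized; the path is the second field.
filter_uninitialized() {
    awk '/^-/ { print $2 }'
}

# Hypothetical wrapper: list uninitialized submodules of the current repo.
list_uninitialized_submodules() {
    git submodule status --recursive | filter_uninitialized
}

# Usage from the repo root (no output means every submodule is initialized):
#   list_uninitialized_submodules
```

If the command prints RLVR/verl, rerunning git submodule update --init --recursive should populate it.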
Environment Setup
[!IMPORTANT] SFT/RLHF and RLVR require different conda environments due to incompatible dependencies (e.g., transformers version conflicts). Choose the appropriate setup below.
For SFT and RLHF
conda create -n drpt python=3.10
conda activate drpt
conda install -c "nvidia/label/cuda-12.4.0" cudatoolkit
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install packaging ninja psutil
pip3 install sjlt --no-build-isolation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
pip3 install -r requirements.txt
[!Note] Installing cudatoolkit with the appropriate torch is only required in order to build sjlt. Without sjlt, you can still run the experiments with other gradient compression methods, such as the default. For instance, the following should also work as long as you don't use GraSS compression (which requires sjlt):
conda create -n drpt python=3.10
conda activate drpt
pip3 install -r requirements.txt
pip3 install flash-attn --no-build-isolation --no-cache-dir
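After either install path, a quick import check can show which optional pieces are present. This is a minimal sketch, not part of the repository: the check_module helper and the PYTHON override variable are hypothetical, and sjlt as an import name is an assumption based on the package name.

```bash
# Hypothetical helper: report whether a Python module imports in the
# active environment. $1 is the import name, which can differ from the
# pip package name (the flash-attn package is imported as flash_attn).
check_module() {
    if "${PYTHON:-python3}" -c "import $1" >/dev/null 2>&1; then
        echo "ok       $1"
    else
        echo "missing  $1"
        return 1
    fi
}

# Usage, after `conda activate drpt`:
#   check_module torch
#   check_module flash_attn
#   check_module sjlt   # assumed import name; only needed for GraSS compression
```

A missing sjlt is expected with the lighter install and only matters if GraSS compression is selected.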
For RLVR
To set up the environment for RLVR experiments, use the following commands:
conda create -n drpt_rlvr python=3.12
conda activate drpt_rlvr
# Install VERL (submodule)
cd RLVR/verl
pip install -e ".[vllm,math]"
pip install flash-attn --no-build-isolation --no-cache-dir
Note that due to the complicated dependencies of VERL (included as a git submodule) and vLLM, we recommend using a separate conda environment for RLVR experiments and letting the VERL installation handle all the dependencies.
Cluster Setup
Both cluster_env.sh and submit.sh are gitignored, so each user creates their own at the repo root. Every script sources cluster_env.sh for paths and conda activation, while submit.sh wraps sbatch with cluster-specific SLURM defaults.
1. Create cluster_env.sh
cat > cluster_env.sh << 'EOF'
# Cluster-specific configuration
# This file is gitignored — safe to edit without merge conflicts.
# Directory paths
export SCRATCH_DIR="/scratch/$USER/Project" # where data/checkpoints live
export CODE_DIR="$HOME/Project" # where this repo is cloned
# Conda environment activation
activate_env() { conda activate drpt; }
# Or if conda activate doesn't work in non-interactive shells:
# activate_env() { export PATH="$HOME/.conda/envs/drpt/bin:$PATH"; }
# SLURM defaults
export SLURM_ACCOUNT="my-account"
export SLURM_PARTITION="gpuA40x4"
export SLURM_MAIL_USER="user@example.edu"
EOF
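Since every script depends on this file, it can help to verify that it defines everything before submitting a job. The following is a rough sketch under that assumption; check_cluster_env is a hypothetical helper, not something the repository provides, and it only checks the names shown in the template above.

```bash
# Hypothetical sanity check for the settings defined in cluster_env.sh.
check_cluster_env() {
    local rc name value
    rc=0
    for name in SCRATCH_DIR CODE_DIR SLURM_ACCOUNT SLURM_PARTITION SLURM_MAIL_USER; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            rc=1
        fi
    done
    # activate_env must resolve to a shell function (or a command on PATH).
    if ! command -v activate_env >/dev/null 2>&1; then
        echo "missing: activate_env" >&2
        rc=1
    fi
    return "$rc"
}

# Usage from the repo root:
#   source ./cluster_env.sh && check_cluster_env && echo "cluster_env.sh looks complete"
```

The check reports every missing name on stderr and returns a non-zero status if any are absent.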
2. Create submit.sh
cat > submit.sh << 'SCRIPT'
#!/bin/bash
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$REPO_ROOT/cluster_env.sh" || { echo "ERROR: cluster_env.sh not found."; exit 1; }
# Abort with a usage message when no job script is given.
SCRIPT="${1:?usage: ./submit.sh <script> [args...]}"; shift
JOB_NAME="${JOB_NAME:-$(basename "$SCRIPT" .sh)}"
GPUS="${GPUS:-4}"
MEM="${MEM:-128g}"
CPUS="${CPUS:-16}"
TIME="${TIME:-12:00:00}"
CONSTRAINT="${CONSTRAINT:-scratch}"
LOG_DIR="${CODE_DIR}/Dr.Post-Training/log"
mkdir -p "$LOG_DIR"
sbatch --job-name="$JOB_NAME" --account="$SLURM_ACCOUNT" --partition="$SLURM_PARTITION" \
--mail-user="$SLURM_MAIL_USER" --mail-type="END" --mem="$MEM" --nodes=1 \
--ntasks-per-node=1 --cpus-per-task="$CPUS" --gpus-per-node="$GPUS" \
--gpu-bind=none --time="$TIME" --constraint="$CONSTRAINT" \
--output="$LOG_DIR/%x-%j.log" "$SCRIPT" "$@"
SCRIPT
chmod +x submit.sh
3. Submit jobs
# SFT training
./submit.sh SFT/train/train.sh --methods baseline
# RLHF training
./submit.sh RLHF/train/train.sh --methods all
# RLVR training
./submit.sh RLVR/scripts/run_qwen1.7b_math_grpo.sh
# Override SLURM defaults per-invocation
GPUS=1 TIME=1:00:00 MEM=64g ./submit.sh SFT/eval/eval.sh --task samsum
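The per-invocation overrides work because submit.sh reads each setting with the ${VAR:-default} parameter expansion, which uses the environment value when it is set and non-empty and otherwise falls back to the default. The sketch below isolates that behavior; resolve_slurm_opts is a hypothetical function written for illustration, with defaults copied from submit.sh above.

```bash
# Hypothetical helper mirroring how submit.sh resolves its SLURM settings.
# ${VAR:-default} expands to $VAR when it is set and non-empty,
# and to the default otherwise.
resolve_slurm_opts() {
    echo "gpus=${GPUS:-4} mem=${MEM:-128g} cpus=${CPUS:-16} time=${TIME:-12:00:00}"
}

# No overrides: every option falls back to its default.
( unset GPUS MEM CPUS TIME; resolve_slurm_opts )
# prints: gpus=4 mem=128g cpus=16 time=12:00:00

# Overrides set in a subshell, so they do not leak into later commands.
( unset CPUS; GPUS=1 TIME=1:00:00 MEM=64g; resolve_slurm_opts )
# prints: gpus=1 mem=64g cpus=16 time=1:00:00
```

Variables that are not overridden, such as CPUS here, keep their defaults, so only the settings that differ need to be passed on the command line.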
Experiments
| Experiment | Environment | Description | Documentation |
|---|---|---|---|
| SFT | drpt | Supervised Fine-Tuning with layer-wise-subset data curation | SFT/README.md |
| RLHF | drpt | Reinforcement Learning from Human Feedback with layer-wise-subset data curation | RLHF/README.md |
| RLVR | drpt_rlvr | Reinforcement Learning with Verifiable Rewards (VERL + vLLM) | RLVR/README.md |
TODOs
- Gradient Accumulation
- Adaptive Exact Scoring
