make AccelerateRunner a subclass of Accelerator
add TorchRunner
add DeepSpeedRunner
ZhiyuanChen committed Oct 2, 2024
1 parent 46025c1 commit 8bc280b
Showing 22 changed files with 1,195 additions and 805 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/push.yaml
@@ -14,7 +14,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
@@ -24,7 +24,7 @@ jobs:
- name: Install dependencies
run: pip install -r requirements.txt && pip install -e .
- name: Install dependencies for testing
run: pip install pytest pytest-cov torch torcheval torchmetrics torchvision accelerate
run: pip install pytest pytest-cov
- name: pytest
run: pytest --cov=materialx --cov-report=xml --cov-report=html .
- name: Upload coverage report for documentation
9 changes: 6 additions & 3 deletions danling/__init__.py
@@ -17,7 +17,7 @@

from lazy_imports import try_import

from danling import metrics, modules, optim, registry, runner, tensors, typing, utils
from danling import defaults, metrics, modules, optim, registry, runner, tensors, typing, utils

from .metrics import (
AverageMeter,
@@ -29,7 +29,7 @@
)
from .optim import LRScheduler
from .registry import GlobalRegistry, Registry
from .runner import AccelerateRunner, BaseRunner, TorchRunner
from .runner import AccelerateRunner, BaseRunner, Config, DeepSpeedRunner, TorchRunner
from .tensors import NestedTensor, PNTensor, tensor
from .utils import (
catch,
@@ -47,6 +47,7 @@
from .metrics import Metrics, MultiTaskMetrics

__all__ = [
"defaults",
"metrics",
"modules",
"optim",
@@ -55,9 +56,11 @@
"tensors",
"utils",
"typing",
"Config",
"BaseRunner",
"AccelerateRunner",
"TorchRunner",
"AccelerateRunner",
"DeepSpeedRunner",
"LRScheduler",
"Registry",
"GlobalRegistry",
15 changes: 8 additions & 7 deletions danling/runner/defaults.py → danling/defaults.py
@@ -15,14 +15,15 @@
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# See the LICENSE file for more details.

DEFAULT_RUN_NAME = "Run"
DEFAULT_EXPERIMENT_NAME = "DanLing"
DEFAULT_EXPERIMENT_ID = "xxxxxxxxxxxxxxxx"
DEFAULT_IGNORED_KEYS_IN_HASH = {
RUN_NAME = "Run"
EXPERIMENT_NAME = "DanLing"
EXPERIMENT_ID = "xxxxxxxxxxxxxxxx"
SEED = 1016
IGNORED_CONFIG_IN_HASH = {
"timestamp",
"iters",
"steps",
"epochs",
"iter",
"step",
"epoch",
"results",
"score_split",
"score",
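The renamed `IGNORED_CONFIG_IN_HASH` set suggests that some config keys are skipped when a run's hash is computed, presumably so that the hash stays the same as training progresses. The commit does not show the hashing code, so the following is only a minimal sketch under that assumption: `config_hash` is a hypothetical helper, the choice of SHA-256 over sorted JSON is illustrative, and the set contains only the keys visible in the hunk above.

```python
import hashlib
import json

# Only the keys visible in the hunk above; the real set may contain more.
IGNORED_CONFIG_IN_HASH = {
    "timestamp",
    "iter",
    "step",
    "epoch",
    "results",
    "score_split",
    "score",
}


def config_hash(config: dict, ignored: frozenset = frozenset(IGNORED_CONFIG_IN_HASH)) -> str:
    """Hypothetical helper: hash a config dict, skipping keys that change during a run."""
    # Drop progress-like keys so the hash identifies the experiment setup only.
    stable = {key: value for key, value in config.items() if key not in ignored}
    # Sort keys so that dict ordering does not affect the digest.
    payload = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


start = config_hash({"lr": 0.1, "seed": 1016, "epoch": 0})
later = config_hash({"lr": 0.1, "seed": 1016, "epoch": 5})
other = config_hash({"lr": 0.2, "seed": 1016, "epoch": 5})
```

Here `start` and `later` are equal because only the ignored `epoch` key differs, while `other` differs because `lr` is part of the hashed content.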
14 changes: 7 additions & 7 deletions danling/runner/README.md
@@ -4,26 +4,26 @@ The Runner of DanLing sets up the basic environment for running neural networks.

## Components

For cross-platform compatibilities, DanLing features a two-level Runner + RunnerState system.
For cross-platform compatibilities, DanLing features a two-level Runner + Config system.

### PlatformRunner

PlatformRunner implements platform-specific features like `step` and `prepare`.

The Runner contains all runtime information that is irrelevant to the checkpoint (e.g. `world_size`, `rank`, etc.). All other information should be saved in `RunnerState`.
The Runner contains all runtime information that is irrelevant to the checkpoint (e.g. `world_size`, `rank`, etc.). All other information should be saved in `Config`.

Currently, only [`AccelerateRunner`][danling.runner.AccelerateRunner] is supported.

### [`BaseRunner`][danling.runner.BaseRunner]

[`BaseRunner`](danling.runner.BaseRunner) defines shared attributes and implements platform-agnostic features, including `init_logging`, `results` and `scores`.
[`BaseRunner`][danling.runner.BaseRunner] defines shared attributes and implements platform-agnostic features, including `init_logging`, `results` and `scores`.

### [`RunnerState`][danling.runner.RunnerState]
### [`Config`][danling.runner.Config]

[`RunnerState`][danling.runner.RunnerState] stores the state of a run (e.g. `epochs`, `run_id`, `network`, etc.).
[`Config`][danling.runner.Config] stores the state of a run (e.g. `epoch`, `run_id`, `network`, etc.).

With `RunnerState` and corresponding weights, you can resume a run from any point.
Therefore, all members in `RunnerState` will be saved in the checkpoint, and thus should be json serialisable.
With `Config` and corresponding weights, you can resume a run from any point.
Therefore, all members in `Config` will be saved in the checkpoint, and thus should be json serialisable.

## Experiments Management

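The README hunk above says that every member of `Config` is saved in the checkpoint and should therefore be JSON serialisable, so that a run can be resumed from it. A minimal sketch of that round trip follows. It uses a hypothetical `RunConfig` dataclass as a stand-in and does not use DanLing's actual `Config` API; the field names `run_id`, `epoch` and `network` are borrowed from the README examples, and `save_state` and `load_state` are illustrative helpers.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunConfig:
    """Hypothetical stand-in for a run's state; not danling.runner.Config."""

    run_id: str = "xxxxxxxxxxxxxxxx"
    epoch: int = 0
    network: dict = field(default_factory=dict)


def save_state(config: RunConfig) -> str:
    """Serialise the state to a JSON string, as a checkpoint might store it."""
    return json.dumps(asdict(config), sort_keys=True)


def load_state(payload: str) -> RunConfig:
    """Rebuild the state from the JSON string written by save_state."""
    return RunConfig(**json.loads(payload))


config = RunConfig(epoch=3, network={"name": "resnet18"})
restored = load_state(save_state(config))
```

Because every field is a plain JSON type, `restored` compares equal to `config`. A member that is not JSON serialisable (an open file handle, for example) would make `json.dumps` raise `TypeError`, which is why runtime-only information such as `world_size` and `rank` is described as living on the Runner instead.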
9 changes: 5 additions & 4 deletions danling/runner/__init__.py
@@ -15,19 +15,20 @@
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# See the LICENSE file for more details.

from . import defaults
from .accelerate_runner import AccelerateRunner
from .base_runner import BaseRunner
from .state import RunnerState
from .config import Config
from .deepspeed_runner import DeepSpeedRunner
from .torch_runner import TorchRunner
from .utils import on_local_main_process, on_main_process

__all__ = [
"RunnerState",
"Config",
"BaseRunner",
"TorchRunner",
"AccelerateRunner",
"DeepSpeedRunner",
"TorchRunner",
"on_main_process",
"on_local_main_process",
"defaults",
]