    Pipeline Model Parallel (#1202) · 63d5dd63
    Masaki Kozuki authored
    
    
    * Init apex.ppu (pipeline model parallel utility)
    
    Reference commit:
    
    ```
    commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
    Merge: 14f2c684 7b293d9b
    Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
    Date:   Wed Sep 22 22:57:54 2021 -0700
    
        Merge branch 'add_BOS' into 'main'
    
        Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives
    
        See merge request ADLR/megatron-lm!328
    ```
    
    * remove get_args and replace imports - phase 1

    * remove get_args and replace imports - phase 2
    
    * move ppu to apex.transformer.pipeline_parallel
    
    * update two __init__.py
    
    * update READMEs
    
    * mpu -> parallel_state & tensor_parallel
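
    The old `mpu` namespace is split into `parallel_state` (process group bookkeeping) and `tensor_parallel` (tensor parallel layers and utilities). A minimal sketch of the renamed import surface, assuming the process is started by a distributed launcher; exact helper names may vary between releases:

    ```python
    import torch
    from apex.transformer import parallel_state, tensor_parallel  # replaces the old mpu namespace

    # Requires the usual distributed env vars (RANK, WORLD_SIZE, ...), e.g. via torch.distributed.run.
    torch.distributed.init_process_group(backend="nccl")
    # Positional sizes: tensor model parallel, pipeline model parallel.
    parallel_state.initialize_model_parallel(1, 1)
    print("pipeline rank:", parallel_state.get_pipeline_model_parallel_rank())
    parallel_state.destroy_model_parallel()
    ```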
    
    * fix
    
    * remove non-pipeline files
    
    * separate schedules.py - phase 1
    
    * dissect schedules.py
    
    * data_iterators -> batch
    
    * remove optimizer from forward_backward_step funcs
    
    * init test
    
    * Apply 2 suggestion(s) to 2 file(s)
    
    * fix cyclic import
    
    * fix syntax of Callable
    
    * fix - 1
    
    * move directory, as testing is used for pipeline parallel tests as well
    
    * add some functions for num microbatches calculator
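
    For the constant (no rampup) case, the number of microbatches reduces to integer arithmetic over the global batch size, the micro batch size, and the data parallel size. The function below is a hypothetical stand-in for illustration, not the calculator's actual API:

    ```python
    def get_num_microbatches(global_batch_size: int, micro_batch_size: int, data_parallel_size: int) -> int:
        """Constant-case microbatch count; the real calculator also supports batch size rampup."""
        denominator = micro_batch_size * data_parallel_size
        if global_batch_size % denominator != 0:
            raise RuntimeError(
                f"global batch size ({global_batch_size}) must be divisible by "
                f"micro batch size ({micro_batch_size}) * data parallel size ({data_parallel_size})"
            )
        return global_batch_size // denominator

    assert get_num_microbatches(global_batch_size=64, micro_batch_size=4, data_parallel_size=2) == 8
    ```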
    
    * model is a list in pipeline parallel
    
    * skip build num microbatch calculator
    
    * fix test
    
    * assert -> raise
    
    * skip args printing
    
    * specify tensor shape everywhere even if None - phase 1
    
    * private timers
    
    * passing tensor shape & dtype around
    
    * update dtype handling by introducing helper func
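
    The helper centralizes the "which dtype do we communicate activations in" decision instead of repeating the branch at every call site. A hedged sketch with hypothetical names (the `fp32_residual_connection` knob mirrors the Megatron-style option), not the exact helper added here:

    ```python
    import torch

    def get_p2p_dtype(params_dtype: torch.dtype, fp32_residual_connection: bool = False) -> torch.dtype:
        # Communicate in fp32 when residual connections are kept in fp32, otherwise use the param dtype.
        return torch.float32 if fp32_residual_connection else params_dtype

    assert get_p2p_dtype(torch.float16) is torch.float16
    assert get_p2p_dtype(torch.float16, fp32_residual_connection=True) is torch.float32
    ```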
    
    * write helper func to reduce cyclomatic complexity
    
    * remove duplicate
    
    * update
    
    * move split_tensor_into_1d_equal_chunks to avoid cyclic import
    
    * tmp
    
    * cosmetic
    
    * move gather_split_1d_tensor to avoid cyclic imports
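
    For context, `split_tensor_into_1d_equal_chunks` hands each tensor model parallel rank an equal slice of the flattened tensor and `gather_split_1d_tensor` reassembles it. The snippet below is a local, self-contained illustration of that behavior; the real helpers do the gather with torch.distributed over the tensor model parallel group:

    ```python
    from typing import List

    import torch

    def split_into_1d_equal_chunks(tensor: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
        flat = tensor.view(-1)
        chunk_size = flat.numel() // world_size
        return flat[rank * chunk_size:(rank + 1) * chunk_size]

    def gather_split_1d_tensor(chunks: List[torch.Tensor]) -> torch.Tensor:
        # Stand-in for the all-gather over the tensor model parallel group.
        return torch.cat(chunks)

    x = torch.arange(8.0)
    chunks = [split_into_1d_equal_chunks(x, rank, world_size=4) for rank in range(4)]
    assert torch.equal(gather_split_1d_tensor(chunks), x)
    ```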
    
    * remove debug print
    
    * add outer loop
    
    * early return if possible
    
    * cosmetic
    
    * passing around tensor shape
    
    * refactor test
    
    * add script to learn batch sampler behavior
    
    * update
    
    * minibatch splitter
    
    * add minibatch splitter
    
    * split minibatch into microbatches
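
    A hedged sketch of the minibatch-to-microbatches split (hypothetical helper name; the splitter added here may differ in signature and in how it handles non-tensor batches):

    ```python
    from typing import List

    import torch

    def split_into_microbatches(minibatch: torch.Tensor, micro_batch_size: int) -> List[torch.Tensor]:
        if minibatch.size(0) % micro_batch_size != 0:
            raise RuntimeError("minibatch size must be divisible by the micro batch size")
        return list(torch.split(minibatch, micro_batch_size, dim=0))

    minibatch = torch.randn(16, 32)  # (per data parallel rank batch, hidden)
    microbatches = split_into_microbatches(minibatch, micro_batch_size=4)
    assert len(microbatches) == 4 and microbatches[0].shape == (4, 32)
    ```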
    
    * minor changes
    
    * uncomment split batch for the test's sake
    
    * set as attribute
    
    * study the behavior of no pipelining
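
    "No pipelining" boils down to plain gradient accumulation over microbatches on a single stage, followed by one optimizer step. The snippet below is a conceptual illustration of that behavior, not apex's forward_backward_no_pipelining API:

    ```python
    import torch

    model = torch.nn.Linear(8, 8)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    microbatches = [torch.randn(4, 8) for _ in range(4)]

    optimizer.zero_grad()
    for microbatch in microbatches:
        loss = model(microbatch).pow(2).mean()
        # Scale so the accumulated gradient is an average over microbatches.
        (loss / len(microbatches)).backward()
    optimizer.step()
    ```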
    
    * debug 1
    
    * reflect test util namespace change
    
    * update readme
    
    * cosmetic in test
    
    * add model build helper func for interleaving sched
    
    * adding model builder from megatron
    
    * can be a cyclic import
    
    * fix
    
    * enable interleaving test, but it fails even when forward only
    
    * fix batch preparation
    
    * add explanation
    
    * print data parallel size
    
    * fix typo
    
    * Add Megatron style GPT model by Rishi
    
    Co-authored-by: Rishi Puri <riship@nvidia.com>
    
    * update
    
    * type hint for jit
    
    * fix forward_backward_no_pipelining test
    
    * pipeline forward backward seems to hang if not forward only
    
    * fix typo
    
    * debug
    
    * add p2p test
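
    The p2p test exercises the send-to-next / receive-from-previous pattern between adjacent pipeline stages. The sketch below shows that pattern with plain torch.distributed primitives and assumes an initialized process group where rank order equals stage order; apex's p2p helpers additionally negotiate tensor shapes and dtypes:

    ```python
    import torch
    import torch.distributed as dist

    def send_forward_recv_forward(tensor: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
        ops = []
        if rank != world_size - 1:  # not the last stage: send activations to the next rank
            ops.append(dist.P2POp(dist.isend, tensor, rank + 1))
        recv_buf = torch.empty_like(tensor)
        if rank != 0:               # not the first stage: receive activations from the previous rank
            ops.append(dist.P2POp(dist.irecv, recv_buf, rank - 1))
        if ops:
            for req in dist.batch_isend_irecv(ops):
                req.wait()
        return recv_buf
    ```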
    
    * simplify
    
    * fix
    
    * tentative
    
    * set both tensor and pipeline model parallel sizes to 1
    
    * init
    
    * fix typo
    
    * fix
    
    * fix path of divide
    
    * set seed for tensor model parallel
    
    * update upon Eddie comment
    
    * fix typo
    
    * adding failing data loader test
    
    * fix
    
    * megatron still failing
    
    * check in
    
    * with the new order of the nested loops, interleaving seems fine
    
    * cosmetic change
    
    * make `forward_backward_pipelining_with_interleaving` private
    
    * warn users that the interleaving sched is unstable
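
    Because the interleaved schedule is known to be shaky at this point, reaching for it emits a warning, and a test later checks that the warning actually fires. A minimal sketch of that pattern (the stub and the message text are hypothetical; apex's actual wording differs):

    ```python
    import warnings

    def forward_backward_pipelining_with_interleaving_stub(*args, **kwargs):
        warnings.warn("The interleaved pipeline schedule is experimental and may be unstable.", UserWarning)

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        forward_backward_pipelining_with_interleaving_stub()
    assert any("experimental" in str(w.message) for w in caught)
    ```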
    
    * move noop handler to no pipelining
    
    * comment out rank_print
    
    * make `build_model` more flexible
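
    `build_model` takes a `model_provider_func` and returns a list of modules, one entry per virtual pipeline stage owned by the rank (a single entry without interleaving). A hedged sketch of the call pattern; the `pre_process`/`post_process` keywords and the `wrap_with_ddp` flag follow the Megatron convention and are assumptions about the exact signature, and model parallel state must already be initialized:

    ```python
    import torch
    from apex.transformer.pipeline_parallel import build_model  # re-exported later in this PR

    def model_provider_func(pre_process: bool = True, post_process: bool = True) -> torch.nn.Module:
        # pre_process/post_process tell a stage whether it owns the embedding / the output head.
        return torch.nn.Linear(16, 16)

    # Call after parallel_state.initialize_model_parallel(...).
    model = build_model(model_provider_func, wrap_with_ddp=False)
    assert isinstance(model, list)  # pipeline parallel code treats the model as a list of stages
    ```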
    
    * skip megatron test tentatively
    
    * correctly comment out rank_print
    
    * correctly comment out rank_print
    
    * correctly comment out rank_print
    
    * skip appropriately
    
    * remove wip p2p comm test
    
    * update type hint of model_provider_func
    
    * disable tf32 in each test script
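
    TF32 changes fp32 matmul numerics on Ampere GPUs, so the test scripts pin it off to keep comparisons deterministic; these are the standard PyTorch switches:

    ```python
    import torch

    # Keep fp32 matmuls/convolutions in full precision so numerical checks in the tests are stable.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    ```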
    
    * skip interleaving w/ backward
    
    * rename as mpu is the old name
    
    * remove broken case
    
    * expose build_model func
    
    * delete `dist.ring_exchange` func call and `use_ring_exchange` argument
    
    * nit fixes
    
    * check in
    
    * remove unused file
    
    * update the list
    
    * update tensor shape
    
    * remove mixed dtype case
    
    * use torch.distributed.run
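
    Test scripts are launched with `torch.distributed.run` (the successor of `torch.distributed.launch`); a typical invocation, with a placeholder script name, looks like:

    ```
    python -m torch.distributed.run --nproc_per_node=2 run_pipeline_parallel_test.py
    ```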
    
    * 2020 -> 2021
    
    * another 2020 -> 2021
    
    * docstring & type hint
    
    * fix teardown
    
    * update
    
    * change to experimental
    
    * check if warned
    
    Co-authored-by: Rishi Puri <riship@nvidia.com>
    Co-authored-by: Eddie Yan <eddiey@nvidia.com>