- 08 Sep, 2021 1 commit
  - Masaki Kozuki authored: pass include directories to `CUDAExtension`'s `include_dirs` argument; remove `-I/path/to/dir` arguments from `extra_compile_args`
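The change above moves include paths out of raw `-I` compiler flags and into the dedicated `include_dirs` keyword that `CUDAExtension` forwards to setuptools. A minimal sketch of the before/after shape of those arguments, using hypothetical paths (not taken from the commit):

```python
# Sketch of the refactor described above; paths are hypothetical.
include_paths = ["csrc/includes", "/usr/local/cuda/include"]

# Before: include dirs smuggled into the compiler flags.
before = {
    "extra_compile_args": {
        "cxx": ["-O3"] + [f"-I{p}" for p in include_paths],
        "nvcc": ["-O3"] + [f"-I{p}" for p in include_paths],
    },
}

# After: include dirs passed via the dedicated `include_dirs` argument,
# leaving extra_compile_args for genuine compiler options only.
after = {
    "include_dirs": include_paths,
    "extra_compile_args": {"cxx": ["-O3"], "nvcc": ["-O3"]},
}

# No -I flags remain in the compile args after the change.
assert not any(
    flag.startswith("-I")
    for flags in after["extra_compile_args"].values()
    for flag in flags
)
```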
- 04 Sep, 2021 1 commit
  - Burc Eryilmaz authored (co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>):
    - support for fused dense layer with cublasLt; fusion in both fprop and bprop
    - fix typo causing syntax error
    - add fused GEMM+gelu+GEMM module
    - fix typo for workspace size
    - update cublas check for 11600
    - add tests for fused dense layer
    - fix CUDA 10.x path
    - safer guard around CUBLAS constants; remove unreferenced variable
    - more guard changes
    - guard against cublas version instead of cuda
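For reference, this is the computation the fused GEMM+gelu+GEMM module performs: dense, GeLU, dense, with the bias and activation epilogues folded into the cublasLt GEMM calls. A NumPy sketch of the unfused math (shapes and names are illustrative, not the module's API):

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    # Exact (erf-based) GeLU, applied elementwise.
    return 0.5 * x * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

def dense_gelu_dense(x, w1, b1, w2, b2):
    # Unfused reference: GEMM -> bias -> GeLU -> GEMM -> bias.
    # The fused module computes the same result in fewer kernel launches.
    h = gelu(x @ w1 + b1)
    return h @ w2 + b2

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # (batch, in_features)
w1 = rng.standard_normal((8, 16))  # in -> hidden
b1 = np.zeros(16)
w2 = rng.standard_normal((16, 8))  # hidden -> out
b2 = np.zeros(8)
y = dense_gelu_dense(x, w1, b1, w2, b2)
assert y.shape == (4, 8)
```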
- 03 Sep, 2021 2 commits
  - Thor Johnsen authored: Optional NCCL communicator argument to init method
  - Thor Johnsen authored
- 02 Sep, 2021 13 commits
  - Thor Johnsen authored: Bug fix in wgrad
  - Thor Johnsen authored
  - Thor Johnsen authored: Bug fixes
  - Thor Johnsen authored
  - Thor Johnsen authored
  - Thor Johnsen authored: Various bug fixes in fused spatial parallel bottleneck block
  - Thor Johnsen authored
  - Thor Johnsen authored
  - Burc Eryilmaz authored (co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>, ptrblck <ptrblck@users.noreply.github.com>):
    - option to set param views to flat buffer
    - remove redundant variables in init_stage1
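The "param views to flat buffer" option concerns pointing each parameter at a slice of one contiguous buffer, so collectives can run on the flat storage while the model keeps using per-parameter views. A NumPy sketch of the idea, with hypothetical shapes (the real code operates on CUDA tensors):

```python
import numpy as np

# Hypothetical parameter shapes; in the optimizer these come from the model.
shapes = [(3, 4), (4,), (2, 3)]
sizes = [int(np.prod(s)) for s in shapes]

# One contiguous flat buffer holding all parameters back to back.
flat = np.zeros(sum(sizes))

# Each "param view" is a reshaped slice of the flat buffer: writing through
# a view mutates the flat buffer, so a collective over `flat` and the
# model's per-parameter reads stay consistent without copies.
views, offset = [], 0
for shape, size in zip(shapes, sizes):
    views.append(flat[offset:offset + size].reshape(shape))
    offset += size

views[0][...] = 1.0               # update a parameter through its view...
assert flat[:12].sum() == 12.0    # ...and the flat buffer sees it
```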
  - Burc Eryilmaz authored
  - Kexin Yu authored (co-authored-by: ptrblck <ptrblck@users.noreply.github.com>):
    - add full all-reduce code path
    - debug
    - debug
  - Thor Johnsen authored: Add functions to compute grad_out1, grad_out1_halo
  - Thor Johnsen authored
- 01 Sep, 2021 3 commits
  - Burc Eryilmaz authored (co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>):
    - fuse norm into scale
    - add fused norm into dlamb
  - Burc Eryilmaz authored (co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>):
    - support for fused dense layer with cublasLt; fusion in both fprop and bprop
    - fix typo causing syntax error
    - add fused GEMM+gelu+GEMM module
    - fix typo for workspace size
    - update cublas check for 11600
    - add tests for fused dense layer
    - fix CUDA 10.x path
  - Kexin Yu authored: wrapper function for flat view creation in _lazy_init_stage2
- 31 Aug, 2021 3 commits
  - Thor Johnsen authored: Spatially Distributed Fast Bottleneck block
  - Thor Johnsen authored
  - Thor Johnsen authored
- 30 Aug, 2021 1 commit
  - Thorsten Kurth authored: wrote a small wrapper function for flat view creation in _lazy_init_stage2 to support channels-last data formats
- 21 Aug, 2021 1 commit
  - X Wang authored
- 17 Jul, 2021 3 commits
  - Nan Zheng authored:
    - Added support for fused ReLU and dropout into transducer joint
    - Reorganized code selection path in transducer joint fwd
    - Added support for fused ReLU+dropout into transducer joint
    - Vectorize transducer loss backward with fused softmax (#3)
    - Nanz/transducer loss (#4):
      - Vectorize transducer loss backward with fused softmax
      - Added a predicate to avoid a potential IMA
    - Nanz/transducer loss (#5):
      - Vectorize transducer loss backward with fused softmax
      - Added a predicate to avoid a potential IMA
      - Added more predicates to avoid IMAs
    - Updated documentation for newly added features
    - Fixed an error in transducer.py
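For context on what gets fused here: the transducer joint broadcast-adds encoder features of shape (B, T, H) and predictor features of shape (B, U, H) into a (B, T, U, H) tensor, and this work folds the following ReLU (and optionally dropout) into that same kernel. A NumPy sketch of the unfused reference computation (names and shapes are illustrative):

```python
import numpy as np

def transducer_joint_ref(f, g, relu=True):
    # f: (B, T, H) encoder output; g: (B, U, H) predictor output.
    # Broadcast-add to (B, T, U, H); the fused kernel applies ReLU
    # (and optionally dropout) in the same pass instead of separately.
    h = f[:, :, None, :] + g[:, None, :, :]
    return np.maximum(h, 0.0) if relu else h

rng = np.random.default_rng(0)
f = rng.standard_normal((2, 5, 8))
g = rng.standard_normal((2, 3, 8))
out = transducer_joint_ref(f, g)
assert out.shape == (2, 5, 3, 8)
assert (out >= 0).all()  # ReLU applied
```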
  - yjk21 authored
  - X Wang authored: local_rank and install cuda version fix
- 15 Jun, 2021 2 commits
- 26 May, 2021 1 commit
  - Kexin Yu authored:
    - clip before reduce scatter
    - provide clip before/after RS option
    - change to clip after ar (avoid confusion)
    - fix comments
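These options control where global-norm gradient clipping runs relative to the reduce-scatter / all-reduce; the clip itself is the usual scale-by-`max_norm / global_norm` step. A NumPy sketch of a generic global-norm clip (not the optimizer's exact code; the commit's options decide which gradients the norm is computed over):

```python
import numpy as np

def clip_grads_by_global_norm(grads, max_norm, eps=1e-6):
    # Compute the global L2 norm over all gradient tensors, then scale
    # every gradient by max_norm / global_norm when the norm exceeds
    # max_norm, leaving gradients untouched otherwise.
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    coef = max_norm / (total_norm + eps)
    if coef < 1.0:
        grads = [g * coef for g in grads]
    return grads, total_norm

grads = [np.full(4, 3.0), np.full(9, 4.0)]  # global norm = sqrt(36 + 144)
clipped, norm = clip_grads_by_global_norm(grads, max_norm=1.0)
```
Clipping before vs. after the reduction changes which norm is seen (per-shard vs. fully reduced gradients), which is exactly the confusion the "clip after ar" rename addresses.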
- 17 May, 2021 1 commit
  - Burc Eryilmaz authored (co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>)
- 20 Apr, 2021 1 commit
  - Burc Eryilmaz authored:
    - don't create cublasLt handle; fix zero block size case
    - cleanup
- 17 Apr, 2021 3 commits
  - Burc Eryilmaz authored (co-authored-by: pbialecki <pbialecki@nvidia.com>):
    - initial cublasLt support
    - 64-bit input
    - add license headers
    - cleanup
    - remove license
  - ptrblck authored
  - Deyu Fu authored (co-authored-by: pbialecki <pbialecki@nvidia.com>):
    - initial commit for adding fast bottleneck
    - sync cudnn-frontend module
- 16 Apr, 2021 1 commit
  - yjk21 authored
- 15 Apr, 2021 3 commits
  - Jay Rodge authored: Fixed a typo
  - Kexin Yu authored (co-authored-by: Kexin Yu <kexiny@nvidia.com>):
    - enable no_copy
    - barrier for SHARP
    - set verbose=False by default
  - Sudhakar Singh authored:
    - Add unit tests for fused Novograd
    - Fix: tensors should reside on the same device
    - Fix: the CUDA stream should be created on the same device as the tensors (found while debugging the fused Novograd multi-device unit test)
    - Fixed issues mentioned in the comments