- 02 Sep 2021, 4 commits
  - By Thor Johnsen
  - By Thor Johnsen
  - By Thor Johnsen: Add functions to compute grad_out1, grad_out1_halo
  - By Thor Johnsen
- 01 Sep 2021, 3 commits
  - By Burc Eryilmaz:
    * Fuse norm into scale
    * Add fused norm into dlamb

    Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
  - By Burc Eryilmaz:
    * Support for fused dense layer with cublasLt, fusion in both fprop and bprop
    * Fix typo causing syntax error
    * Add fused GEMM+gelu+GEMM module
    * Fix typo for workspace size
    * Update cublas check for 11600
    * Add tests for fused dense layer
    * Fix CUDA 10.x path

    Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
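For context on the fused GEMM+gelu+GEMM module: such a kernel computes `gelu(x @ W1 + b1) @ W2 + b2` in a single pass instead of three separate ops. A minimal pure-Python sketch of the unfused reference math (the function names and column-per-list weight layout here are illustrative, not apex's actual API):

```python
import math

def gelu(x):
    # Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def dense_gelu_dense(x, w1, b1, w2, b2):
    # Unfused reference for the pattern a fused GEMM+GeLU+GEMM kernel
    # evaluates in one pass: gelu(x @ W1 + b1) @ W2 + b2.
    # Weights are stored as lists of columns.
    h = [gelu(sum(xi * wi for xi, wi in zip(x, col)) + b)
         for col, b in zip(w1, b1)]
    return [sum(hj * wj for hj, wj in zip(h, col)) + b
            for col, b in zip(w2, b2)]
```

The fused version avoids writing the intermediate activation `h` to global memory, which is where the bandwidth savings come from.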
  - By Kexin Yu: Wrapper function for flat view creation in _lazy_init_stage2
- 31 Aug 2021, 3 commits
  - By Thor Johnsen: Spatially distributed fast bottleneck block
  - By Thor Johnsen
  - By Thor Johnsen
- 30 Aug 2021, 1 commit
  - By Thorsten Kurth: Wrote a small wrapper function for flat view creation in _lazy_init_stage2 to support channels-last data formats
- 21 Aug 2021, 1 commit
  - By X Wang
- 17 Jul 2021, 3 commits
  - By Nan Zheng:
    * Added support for fused ReLU and dropout into the transducer joint
    * Reorganized the code selection path in transducer joint fwd
    * Vectorized transducer loss backward with fused softmax (#3, #4, #5)
    * Added predicates to avoid potential IMAs
    * Updated documentation for the newly added features
    * Fixed an error in transducer.py
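The "vectorize transducer loss backward with fused softmax" items rely on the standard softmax backward identity, dx_i = y_i * (dy_i - Σ_j dy_j * y_j), which needs only one reduction per row instead of a full Jacobian product. A small pure-Python sketch of that identity (illustrative only, not the CUDA kernel from this commit):

```python
import math

def softmax(x):
    # Numerically stable softmax over a single row.
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def softmax_backward(y, grad_y):
    # Fused form: dx_i = y_i * (dy_i - sum_j dy_j * y_j).
    # One dot product per row, then an elementwise pass.
    dot = sum(gy * yi for gy, yi in zip(grad_y, y))
    return [yi * (gy - dot) for yi, gy in zip(y, grad_y)]
```

Because only the row sum is shared, the elementwise pass vectorizes cleanly, which is what makes the fused formulation attractive on a GPU.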
  - By yjk21
  - By X Wang: local_rank and install CUDA version fix
- 15 Jun 2021, 2 commits
- 26 May 2021, 1 commit
  - By Kexin Yu:
    * Clip before reduce-scatter
    * Provide a clip before/after RS option
    * Change to clip after AR (avoid confusion)
    * Fix comments
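On the clip-before vs. clip-after choice above: both variants apply the same global-norm clip; the option only changes whether it runs on local gradient shards (before the reduce-scatter) or on the reduced gradients (after the all-reduce). A minimal pure-Python sketch of the clipping step itself (illustrative, with gradients flattened into one list):

```python
import math

def clip_by_global_norm(grads, max_norm):
    # Scale all gradients uniformly so that their global L2 norm
    # does not exceed max_norm; leave them untouched otherwise.
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads
```

Clipping after the reduction sees the true global gradient norm, which is why the commit notes the change "to clip after ar (avoid confusion)".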
- 17 May 2021, 1 commit
  - By Burc Eryilmaz

    Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
- 20 Apr 2021, 1 commit
  - By Burc Eryilmaz:
    * Don't create a cublasLt handle; fix the zero block size case
    * Cleanup
- 17 Apr 2021, 3 commits
  - By Burc Eryilmaz:
    * Initial cublasLt support
    * 64-bit input
    * Add license headers
    * Cleanup
    * Remove license

    Co-authored-by: pbialecki <pbialecki@nvidia.com>
  - By ptrblck
  - By Deyu Fu:
    * Initial commit adding the fast bottleneck
    * Sync the cudnn-frontend module

    Co-authored-by: pbialecki <pbialecki@nvidia.com>
- 16 Apr 2021, 1 commit
  - By yjk21
- 15 Apr 2021, 3 commits
  - By Jay Rodge: Fixed a typo
  - By Kexin Yu:
    * Enable no_copy
    * Barrier for SHARP
    * Set verbose=False by default

    Co-authored-by: Kexin Yu <kexiny@nvidia.com>
  - By Sudhakar Singh:
    * Add unit tests for fused novograd
    * Fix: tensors should reside on the same device
    * Fix: the CUDA stream should be used on the same device on which the tensors reside (found while debugging the fused novograd multi-device unit test)
    * Fixed issues mentioned in the review comments
- 24 Mar 2021, 2 commits
  - By Kexin Yu:
    * Sync-free distributed LAMB
    * Init lr with the provided value
    * Wait on the L2-norm stream
    * Reorder params
    * Fix indentation

    Co-authored-by: Kexin Yu <kexiny@nvidia.com>
  - By Nan Zheng:
    * Initial check-in of the transducer extension
    * Added more comments to help explain the code
    * Corrected minor typos
    * Renamed variables in tests to match the extension; disabled the ninja build option
- 23 Feb 2021, 1 commit
  - By yjk21
- 10 Feb 2021, 1 commit
  - By Shoufa Chen:
    * Copy-paste friendly
    * Fix the container_abcs import issue: nightly PyTorch has removed `container_abcs` from `torch._six` (https://github.com/pytorch/pytorch/commit/58eb23378f2a376565a66ac32c93a316c45b6131#diff-b3c160475f0fbe8ad50310f92d3534172ba98203387a962b7dc8f4a23b15cf4dL35)
    * Keep the existing import for PyTorch 1.7 and earlier
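The fix described above is the usual try/except compatibility shim; a sketch of the pattern (the `is_mapping` helper is illustrative):

```python
# Compatibility shim for the removal of `container_abcs` from torch._six:
# newer PyTorch exposes the container ABCs only via the standard library.
try:
    from torch._six import container_abcs  # PyTorch 1.7 and earlier
except ImportError:
    import collections.abc as container_abcs

def is_mapping(obj):
    # Example use: isinstance checks keep working on both old and new PyTorch.
    return isinstance(obj, container_abcs.Mapping)
```

The except branch also covers environments without torch at all, since both cases raise ImportError.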
- 20 Jan 2021, 1 commit
  - By Burc Eryilmaz

    Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
- 18 Dec 2020, 2 commits
  - By Thor Johnsen: Update the ASP README to highlight the default recipe
  - By jpool-nv: The recipe was presented after some non-standard API calls, so this moves the suggested usage up, gives it its own section, and reinforces the suggested usage in the non-standard section.
- 04 Dec 2020, 3 commits
  - By Stas Bekman
  - By Kexin Yu: Add flag for DistributedAdam: step_support_amp_scaling

    Co-authored-by: Kexin Yu <kexiny@nvidia.com>
    Co-authored-by: Kexin Yu <kexinznzn@gmail.com>
  - By Burc Eryilmaz: Fuse dropout into softmax in fprop for the additive mask case
- 02 Dec 2020, 1 commit
  - By Janusz Lisiecki: resume() is a nested function; when it loads best_prec1 it creates a local variable that shadows the one in the parent function (which refers to the global one). This PR adds `global` so the global variable is modified as intended.

    Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
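A distilled sketch of the bug and fix (the checkpoint dict and the value here are illustrative):

```python
best_prec1 = 0.0  # module-level state that resume() is meant to update

def main():
    def resume(checkpoint):
        # Without this declaration, `best_prec1 = ...` would bind a new
        # local variable inside resume() and leave the global untouched,
        # which is exactly the shadowing bug the commit fixes.
        global best_prec1
        best_prec1 = checkpoint["best_prec1"]

    resume({"best_prec1": 76.5})

main()
```

In Python, assignment inside a function always creates a local binding unless `global` (or `nonlocal`) says otherwise, so merely reading-then-writing a module-level name silently forks it.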
- 01 Dec 2020, 1 commit
  - By Kexin Yu: DistributedFusedAdam model parallelism support (Megatron)

    Co-authored-by: Kexin Yu <kexiny@nvidia.com>
    Co-authored-by: Kexin Yu <kexinznzn@gmail.com>
- 20 Oct 2020, 1 commit
  - By lly-zero-one: This PR optimizes the performance of SyncBatchNorm and fixes a potential issue in the welford_parallel kernel implementation. For the performance improvement, the mean/var/count all_gather communication is batched and sent once in the forward path, the all_reduce calls in the backward path are batched as well, and a contiguous() call is added on the input of the welford_parallel kernel. If there is any standard perf benchmark, I would be happy to run it.
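Batching the mean/var/count exchange into one all_gather works because per-rank statistics can be merged after the fact using Chan's parallel variance formula, the same idea behind a Welford-style reduction. A pure-Python sketch of that merge for a single channel (illustrative, not the kernel itself):

```python
def merge_stats(mean_a, var_a, n_a, mean_b, var_b, n_b):
    # Merge two (mean, biased variance, count) triples, as a sync
    # batch-norm backend does after gathering per-rank statistics.
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # Combine M2 (sum of squared deviations, = var * n) via Chan's formula.
    m2 = var_a * n_a + var_b * n_b + delta * delta * n_a * n_b / n
    return mean, m2 / n, n
```

Because the merge needs only these three scalars per channel, all ranks can pack them into one tensor and gather it in a single communication call instead of three.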