Multi-Node Slurm Skill
Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures.
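To make the single-node to multi-node conversion concrete, here is a minimal sketch that renders a multi-node sbatch script from a single-node training command. It is an illustration under stated assumptions, not the Skill's own procedure: the helper name `render_multinode_sbatch`, its parameters, the `torchrun` launcher, and the rendezvous port are all hypothetical choices made for this example.

```python
def render_multinode_sbatch(train_cmd, nodes=2, gpus_per_node=8,
                            job_name="multinode-job",
                            time_limit="01:00:00", rdzv_port=29500):
    """Render a multi-node sbatch script as a string.

    Hypothetical helper for illustration only. Assumes the single-node
    job was launched with torchrun and that one launcher task per node
    is wanted, with torchrun spawning one worker per GPU.
    """
    if nodes < 2:
        raise ValueError("a multi-node job needs at least 2 nodes")

    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        "set -euo pipefail",
        "",
        "# The first hostname in the allocation acts as the rendezvous host.",
        'head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)',
        "",
        "# srun starts one torchrun launcher on every allocated node.",
        "srun torchrun \\",
        f"  --nnodes={nodes} \\",
        f"  --nproc_per_node={gpus_per_node} \\",
        '  --rdzv_id="$SLURM_JOB_ID" \\',
        "  --rdzv_backend=c10d \\",
        f'  --rdzv_endpoint="$head_node:{rdzv_port}" \\',
        f"  {train_cmd}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example: write a 4-node script that could be submitted with sbatch.
    print(render_multinode_sbatch("train.py --config cfg.yaml", nodes=4))
```

A common class of multi-node failure is a mismatch between the node count requested from Slurm and the one passed to the launcher. Deriving both from the same `nodes` argument, as this sketch does, is one way to rule that out.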
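This page collects the complete SKILL.md files found in the repo, the upstream awesome-agent-skills index, and the skill repo's skill store and local copies. Titles, purposes, descriptions, and Skill content on the page are localized into Traditional Chinese as used in Taiwan; source links are preserved.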
External NeMo-RL end-to-end validation workflow for Megatron-Bridge model/provider changes, including downstream compatibility checks, external RL lifecycle behavior, Megatron poli..
Structured framework for verifying numerical parity of HF<->MCore weight conversions.
Validate and use selective and full activation recompute in Megatron Bridge to reduce GPU memory usage at the cost of extra compute.
Validate and use CPU offloading in Megatron Bridge, including layer-level activation offloading and fractional optimizer state offloading with HybridDeviceOptimizer.
Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.
Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as..
Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
Operational guide for enabling Megatron FSDP in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
Techniques for reducing peak GPU memory in Megatron Bridge: expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.
Assists with PERF MOE COMM Overlap tasks, handling setup and execution according to the original Skill's instructions.
Choose the right MoE token dispatcher (`alltoall`, DeepEP, or HybridEP) for the hardware, EP degree, and optimization stage.
Representative MoE training playbooks by hardware platform and model family.
Long-context MoE training guidance for Megatron Bridge.
Systematic workflow for MoE training optimization in Megatron Bridge, based on the Megatron-Core MoE paper.
Practical guidance for training MoE VLMs in Megatron Bridge.
Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration.
Validate and use packed sequences and long-context training in Megatron-Bridge, distinguishing offline packed SFT for LLMs from in-batch packing for VLMs, and applying the right CP..
Operational guide for enabling TP, DP, and PP communication overlap in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
Recommend and customize Megatron Bridge recipes for a user's model, GPU count, and training goal.
Resiliency features in Megatron Bridge, including fault tolerance, straggler detection, in-process restart, preemption, and the re-run state machine.
Testing reference for Megatron Bridge: unit and functional test layout, tier semantics (L0/L1/L2/flaky), script conventions, running tests locally, adding/moving/disabling tests,..
External verl end-to-end validation workflow for Megatron-Bridge model/provider changes.
Container-based dev environment setup and dependency management for Megatron-LM.
Assists with BUMP BASE Image design tasks, handling setup and execution according to the original Skill's instructions.
CI/CD reference for Megatron-LM.
Investigate a failing GitHub Actions run or job and create a GitHub issue for the failure.
Linting and formatting for Megatron-LM.
Domain knowledge for the nightly main-to-dev sync workflow.
Onboard 1-node GitHub MR functional tests for GB200 from existing mr-scoped 2-node tests.
Research and draft a response to a GitHub issue or question from an external contributor.
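Assists with RUN ON Slurm tasks, handling setup and execution according to the original Skill's instructions.
Assists with Split PR design tasks, handling setup and execution according to the original Skill's instructions.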
Test system for Megatron-LM.
Refresh golden values from a GitHub Actions workflow run (failing-only or all jobs), score the change with average normalized relative differences, and produce a PR-ready summary.
Query and browse evaluation results stored in MLflow.
Assists with Debug cloud deployment tasks, handling setup and execution according to the original Skill's instructions.
Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM.
Evaluate the accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL).
Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator-launcher.
Monitor submitted jobs (PTQ, evaluation, deployment) on SLURM clusters.
Assists with PTQ tasks, handling setup and execution according to the original Skill's instructions.
Cherry-pick merged PRs labeled for a release branch into that branch, then open a PR and apply the cherry-pick-done label.
Create custom LLM evaluation benchmarks using the BYOB decorator framework.
Query and browse evaluation results stored in MLflow.
Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator-launcher.
Interactive config wizard for NeMo Evaluator Launcher (NEL).
Guide for adding a new benchmark or training environment to NeMo-Gym.
Use when debugging a Nemo Gym run or reward profiling job.
Maintain the NeMo Gym Fern docs site: add, update, move, or remove pages under fern/.
Use when creating, validating, or documenting Nemo Gym pivot datasets from rollout, trajectory, chat-completion, Responses API, or tool-call artifacts.
Use to help users get started with Nemo Gym reward profiling.
Autonomous NeMo-RL research agent workflow for directed hypothesis testing and open-ended discovery.
Brev instance operating guidance for NeMo-RL agents working in /home/ubuntu/RL with limited workspace disk, a larger /ephemeral volume, and optional /home/ubuntu/RL/.env secrets.
Build and dependency management for NeMo-RL.
CI/CD reference for NeMo-RL.
Configuration conventions for NeMo-RL.
Contribution conventions for NeMo-RL.
NVIDIA copyright header requirements for NeMo-RL.
File conventions for NeMo-RL.