Skip to content
Better HN
Top
Best
Ask
Show
New
Jobs
Search
⌘K
Instrumentation checklist for running large GPU clusters
(opens in new tab)
(together.ai)
2 points
amaldavid
1y ago
2 comments
Save
Share
2 comments
2 comments · 1 top-level
top
newest
oldest
amaldavid
OP
1y ago
· 1 in thread
Just stumbled upon this blog which details out testing and validating large GPU clusters before running training workloads. Any other similar blogs which adds more nuance in terms of debugging the issues once we identify them as well?
roanakb
1y ago
this is a good one for debugging rdma:
https://docs.redhat.com/en/documentation/red_hat_enterprise_...
j
/
k
navigate · click thread line to collapse