CostlyOstrich36 My particular Python error is due to a mismatch between my torch and lightning versions. But the real issue is that I don't have exact control over which version gets installed.
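For illustration, one way to pin the versions explicitly would be ClearML's Task.add_requirements before Task.init — a rough sketch only; the version numbers and project/task names below are placeholders:

```python
# Rough sketch: pin the exact versions the agent should install, instead of
# relying on whatever gets resolved automatically.
# Version numbers and project/task names are placeholders.
from clearml import Task

# Must be called before Task.init() so the pins end up in the task's
# recorded requirements.
Task.add_requirements("torch", "2.1.0")
Task.add_requirements("pytorch-lightning", "2.1.2")

task = Task.init(project_name="examples", task_name="pinned-env")
```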
AgitatedDove14 Because I want to schedule each sweep job as a task for remote execution, so that each task can run in parallel on a worker.
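Roughly the pattern I have in mind, simplified and launched as a sweep with something like `python train.py -m lr=1e-4,1e-3` (the queue name, project name and the lr parameter are placeholders):

```python
# Simplified sketch of the intent: each Hydra (multirun) job becomes a ClearML
# task that is enqueued for remote execution, so the jobs can run in parallel
# on workers. Queue name, project name and cfg.lr are placeholders.
import hydra
from omegaconf import DictConfig
from clearml import Task


def train(cfg: DictConfig) -> None:
    ...  # actual training code, runs on the worker


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    task = Task.init(project_name="sweeps", task_name=f"train lr={cfg.lr}")
    # Enqueue this run on a worker. By default this also terminates the local
    # process, which is what breaks the multirun described below.
    task.execute_remotely(queue_name="default")
    train(cfg)


if __name__ == "__main__":
    main()
```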
I believe that is the right terminology, yes.
CostlyOstrich36 Yes, I manually updated the port mapping in the docker-compose yaml. An alternative way would be to keep the 8080 port in the config, but then on the server forward all requests from 8080 to 80.
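For illustration, the change amounts to something like this in the webserver service of the docker-compose file (assuming the container itself listens on port 80 internally; adjust to the actual service layout in your compose file):

```yaml
# Excerpt only; the rest of the webserver service stays as shipped.
services:
  webserver:
    ports:
      - "80:80"      # host:container, serve the web UI directly on port 80
      # was: "8080:80"
```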
SuccessfulKoala55 At peak we've been running ~50 experiments simultaneously that have been somewhat generous in the metrics they report, although nothing extreme. Our CML server is hosted on an Azure D2S_v3 VM (2 vCPU, 8 GB RAM, 3200 IOPS). Looks like we should probably upgrade, especially the disk specs. (Taking another look at our VM metrics, we hit 100% consumed OS disk IOPS a couple of times.)
That would (likely) work, yes .. if it worked 🙂 However, remote_execute kills the thread, so the multirun stops at the first sub-task.
I'll do that. As a temporary workaround, I'll create/schedule the tasks from an external script (rough sketch below) and avoid using Hydra multi-runs. (Which is a pity, so I'll be looking forward to a fix 😉)
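The external script would look roughly like this — the template task, queue name, parameter grid, and the parameter section/key are all placeholders:

```python
# Temporary workaround: enumerate the sweep in a plain script and schedule one
# ClearML task per configuration, instead of using a Hydra multirun.
# Template task, queue name and the parameter grid are placeholders.
from clearml import Task

# An existing run to use as a template.
template = Task.get_task(project_name="sweeps", task_name="train baseline")

for lr in (1e-4, 3e-4, 1e-3):
    cloned = Task.clone(source_task=template, name=f"train lr={lr}")
    # The section/key depends on how the parameters show up in the template task
    # (e.g. "General/lr" or "Hydra/lr"); adjust accordingly.
    cloned.set_parameters({"General/lr": lr})
    Task.enqueue(cloned, queue_name="default")
```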
AgitatedDove14 Yes, but that is not allowed (together with not clone), as per the current implementation 😄
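For reference, a minimal sketch of the call in question, assuming execute_remotely's clone/exit_process parameters (queue and project/task names are placeholders):

```python
# The combination in question (placeholder names). With the current
# implementation, exit_process=False is rejected unless clone=True.
from clearml import Task

task = Task.init(project_name="sweeps", task_name="train")

# Not allowed today: keep the local process alive without cloning.
task.execute_remotely(queue_name="default", clone=False, exit_process=False)

# Allowed: enqueue a clone and keep the local process running.
# task.execute_remotely(queue_name="default", clone=True, exit_process=False)
```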