ExcitedFish86 I have several machines with different CUDA driver/runtime versions, that is why you might be confused, as I am referring to one or another 🙂
I also did run sudo apt install nvidia-cuda-toolkit
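For reference, a quick way to see which versions a given machine actually has (assuming PyTorch and the NVIDIA driver are installed; this snippet is illustrative, not from the thread):

import subprocess

import torch

# The CUDA runtime PyTorch was built against and the CUDA version the driver
# reports can legitimately differ from machine to machine.
print("torch CUDA runtime:", torch.version.cuda)        # e.g. "10.0"
print("cuDNN:", torch.backends.cudnn.version())
smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
print(next((line for line in smi.splitlines() if "CUDA Version" in line), "CUDA Version line not found"))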
AgitatedDove14 After investigation, another program on the machine had consumed all the available memory, most likely causing the OS to kill the agent/task
Downloading the artifacts is done only when actually calling get()/get_local_copy()
Yes, I rather meant: reproduce this behavior even for getting metadata on the artifacts 🙂
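For reference, a minimal sketch of the lazy-download behavior described above (the task ID and artifact name are placeholders):

from clearml import Task

task = Task.get_task(task_id="<remote-task-id>")   # placeholder ID
artifact = task.artifacts["my_artifact"]           # metadata only, nothing downloaded yet
print(artifact.url, artifact.size)                 # still only metadata
local_path = artifact.get_local_copy()             # the actual download happens here
obj = artifact.get()                               # or deserialize straight into memory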
I specified torch @ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl and it didn't detect the link; it tried to install the latest version instead: 1.6.0
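One possible workaround to try (hedged: whether Task.add_requirements accepts a full direct-URL reference like this is an assumption, not something confirmed in this thread):

from clearml import Task

# Force the exact wheel into the recorded requirements before Task.init(),
# so the agent installs this link instead of resolving the latest torch.
Task.add_requirements(
    "torch",
    "@ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl",
)
task = Task.init(project_name="examples", task_name="pinned-torch")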
You mean "docker" was not installed and it did not throw an error?
Yes, docker was not installed on the machine
Yes, you must make sure the docker container can mount a persistent folder for you to work on.
Ok, it would be nice to have a --user-folder-mounted option that does the linking automatically
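In the meantime, a hedged sketch of one way to get a persistent folder mounted when the task runs in docker mode (image name and paths are placeholders; older clearml SDK versions take a single docker command string here):

from clearml import Task

task = Task.init(project_name="examples", task_name="docker-mount")
# Append "-v host:container" mounts to the base docker command the agent will use.
task.set_base_docker("nvidia/cuda:10.1-runtime-ubuntu18.04 -v /home/user/persistent:/persistent")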
I actually need to be able to overwrite files, so in my case it makes sense to grant the DeleteObject permission in S3. But for other cases, why not simply catch this error, display a warning to the user, and store internally that delete is not possible?
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached --gpus 1 > ~/trains-agent.startup.log 2>&1
UnevenDolphin73, task = clearml.Task.get_task(clearml.config.get_remote_task_id()) worked, thanks
and this works. However, without the trick from UnevenDolphin73, the following won't work (returns None):

from clearml import Task

Task.init()

if __name__ == "__main__":
    task = Task.current_task()
    task.connect(config)
    run()
Actually I think I am approaching the problem from the wrong angle
You mean to run it on the CI machine?
yes
That should not happen, no? Maybe there is a bug that needs fixing in clearml-agent?
It's just to test that the logic executed inside if not Task.running_locally() is correct
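For context, the pattern under test looks roughly like this (sketch only; project/task names and the config are placeholders):

import clearml
from clearml import Task

task = Task.init(project_name="examples", task_name="ci-check")
config = {"lr": 0.001}

if not Task.running_locally():
    # Executing under an agent (e.g. in CI): reattach to the remote task
    task = Task.get_task(clearml.config.get_remote_task_id())

task.connect(config)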
In my GitHub Action, I should just have a dummy ClearML server and run the task there, connecting to that dummy server
Yes, I guess that's fine then - Thanks!
I’ll definitely check that out! 🤩
Oh wow! Is it possible to not specify a remote task? (If I am working with Task.set_offline(True))
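For reference, a minimal sketch of offline mode (the session zip path at the end is a placeholder; the exact location depends on your setup):

from clearml import Task

Task.set_offline(offline_mode=True)      # nothing is sent to a server
task = Task.init(project_name="examples", task_name="offline-run")
task.connect({"batch_size": 32})
task.close()
# Later, from a machine that can reach the server:
# Task.import_offline_session("<path-to-offline-session>.zip")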
The instances take so much time to start, like 5 minutes
btw, in the pytorch_distributed_example I see that you call average_gradients,
but the PyTorch docs ( https://pytorch.org/tutorials/beginner/dist_overview.html ) say: DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training.
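For comparison, a rough sketch of the two approaches (manual averaging as in the example vs. letting DDP handle it):

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def average_gradients(model):
    # Manual all-reduce of every gradient, which is what the example does by hand.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size

# With DDP the manual step above is unnecessary: after
#   model = DDP(model)   # once dist.init_process_group(...) has been called
# gradient communication is overlapped with backward() automatically.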
Hi AgitatedDove14, how should we proceed to fix this bug? Should I open an issue on GitHub? Should I try to make a minimal reproducible example? It's blocking me atm