So two possible cases for trains-agent-1: either:
- it picks a new experiment -> the "workers" tab randomly shows one of the two experiments, or
- there is no new experiment in the default queue to start -> it randomly shows no experiment or the one it is running
I also tried task.set_initial_iteration(-task.data.last_iteration), hoping it would counteract the bug, but it didn't work
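For reference, this is roughly the workaround I tried (a minimal sketch; assumes the task is already initialized):

```python
from clearml import Task

# Grab the task that is currently running
task = Task.current_task()

# Attempted workaround: offset the initial iteration by the last reported
# iteration, hoping the continued run would start counting from 0 again.
# (It did not counteract the bug in my case.)
task.set_initial_iteration(-task.data.last_iteration)
```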
Actually I think I am approaching the problem from the wrong angle
Hi CostlyOstrich36, one more observation: it looks like when I don't open the experiment in the webUI before it is finished, I get all the logs correctly. It is when I open the experiment in the webUI while it is running that I don't see all the logs.
So it looks like there is a caching effect: the logs are retrieved only once, when I open the experiment for the first time, and rarely (if ever) afterwards. Is that possible?
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be set up properly
And I do that each time I want to create a subtask. This way I am sure to retrieve the task if it already exists
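What I do is roughly this (a sketch with hypothetical names; matching by name via Task.get_tasks is my assumption of the cleanest way to check for an existing task):

```python
from clearml import Task

def get_or_create_subtask(project_name: str, task_name: str) -> Task:
    # Reuse the subtask if one with this name already exists in the project
    existing = Task.get_tasks(project_name=project_name, task_name=task_name)
    if existing:
        return existing[0]
    # Otherwise create a fresh one
    return Task.create(project_name=project_name, task_name=task_name)
```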
So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117. It just happens that, surprisingly, this wheel doesn't work on EC2 g5 instances. Either I'll hardcode the correct wheel or I'll upgrade torch to 1.13.0
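For the hardcoding option, this is the kind of thing I have in mind (a sketch; the wheel URL is illustrative, not the exact one, and it assumes Task.add_requirements accepts a full requirement line when called before Task.init):

```python
from clearml import Task

# Pin the exact wheel so the agent cannot resolve an incompatible one
# (illustrative URL -- would be the real cu117 wheel for the target platform)
Task.add_requirements(
    "torch @ https://download.pytorch.org/whl/cu117/torch-1.13.0%2Bcu117-cp38-cp38-linux_x86_64.whl"
)

# Or simply bump the version and let pip resolve it:
# Task.add_requirements("torch", "==1.13.0")

task = Task.init(project_name="my-project", task_name="my-task")
```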
Alright, the experiment finished properly (all models uploaded). I will restart it to check again, but it seems like the bug was introduced after that
I am sorry the information I can give is not very precise, but it's the best I can do - is this bug happening only to me?
Oh nice, thanks for pointing this out!
I am running on bare metal, and cuda seems to be installed at /usr/lib/x86_64-linux-gnu/libcuda.so.460.39
ok, now I actually remember why I used _update_requirements instead of add_requirements: the former overwrites all the others, the latter only adds to the already detected packages. Since my deps are listed in the dependencies of my setup.py, I don't want clearml to list the dependencies of the current environment
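To illustrate the difference as I understand it (a sketch; _update_requirements is a private API, so this may change between versions):

```python
from clearml import Task

# add_requirements: appends to the auto-detected packages, so the current
# environment's dependencies still end up in the installed-packages list.
Task.add_requirements("my-package", ">=1.0")

task = Task.init(project_name="my-project", task_name="my-task")

# _update_requirements: replaces the detected packages entirely -- the agent
# then installs only this, and setup.py pulls in the real dependencies.
task._update_requirements(["my-package>=1.0"])
```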
Thanks TimelyPenguin76 and AgitatedDove14 ! I would like to delete artifacts/models related to the old archived experiments, but they are stored on s3. Would that be possible?
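If it helps, my understanding is that something along these lines should work, assuming the s3 credentials are configured so the SDK can actually delete the remote objects (a sketch, not tested; the archived filter is my assumption):

```python
from clearml import Task

# Iterate over archived tasks in the project
for task in Task.get_tasks(project_name="my-project",
                           task_filter={"system_tags": ["archived"]}):
    # Delete the task together with its artifacts/models stored on s3
    task.delete(delete_artifacts_and_models=True,
                skip_models_used_by_other_tasks=True)
```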
Woohoo! Thanks 👌
Also maybe we are not on the same page - by clean up, I mean killing a detached subprocess on the machine executing the agent
I think it comes from the web UI of clearml-server version 1.2.0, because I didn't change anything else
AgitatedDove14 In theory yes, there is no downside; in practice, running an app inside docker inside a VM might introduce slowdowns. I guess it's on me to check whether this slowdown is negligible or not
Installing collected packages: my-engine
  Attempting uninstall: my-engine
    Found existing installation: my-engine 1.0.0
    Uninstalling my-engine-1.0.0:
      Successfully uninstalled my-engine-1.0.0
Successfully installed my-engine-1.0.0
Why is it required in the case where boto3 can figure them out itself within the EC2 instance?
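To be concrete, boto3 resolves credentials on its own through its lookup chain (env vars -> shared config -> instance metadata), so inside EC2 this already works without explicit keys:

```python
import boto3

# On an EC2 instance with an attached IAM role, no explicit
# aws_access_key_id / aws_secret_access_key is needed:
s3 = boto3.client("s3")
print(s3.list_buckets()["Buckets"])
```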
Hi AgitatedDove14 , Here is the full log.
Both python versions (local and remote) are Python 3.6.
Locally (macOS), I get: pytorch3d== (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0, 0.4.0, 0.5.0)
Remotely (Ubuntu), I get: (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0)
So I guess it's not related to clearml-agent really, rather pip that cannot find the proper wheel for Ubuntu for the latest versions of pytorch3d, right? If yes, is there a way to build the wheel on the remote machine...
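For example, I could point the requirement at the git repo so pip builds pytorch3d from source on the remote machine (a sketch; assumes Task.add_requirements accepts a full requirement line, and the v0.5.0 tag is just an example):

```python
from clearml import Task

# Ask the agent to build pytorch3d from source instead of looking for a wheel
Task.add_requirements(
    "pytorch3d @ git+https://github.com/facebookresearch/pytorch3d.git@v0.5.0"
)

task = Task.init(project_name="my-project", task_name="my-task")
```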
but if you do that and the package is already installed, it will not install from the git repo - this is an issue with pip
Exactly, that’s my problem: I want to remove it to make sure it is reinstalled (because the version can change)
I think that since the agent installs everything from scratch it should work for you. Wdyt?
With env caching enabled, it won’t reinstall this private dependency, right?
Thanks! Corrected both, now it's building
"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"
I got some progress TimelyPenguin76, now the task runs and I get this error from docker:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].