I am sorry the info I can give is not very precise, but it's the best I can do - Is this bug happening only to me?
Oh nice, thanks for pointing this out!
I am running on bare metal, and CUDA seems to be installed at /usr/lib/x86_64-linux-gnu/libcuda.so.460.39
OK, now I actually remember why I used _update_requirements instead of add_requirements: the former overwrites all the others, while the latter only adds to the already detected packages. Since my deps are listed in the dependencies of my setup.py, I don't want clearml to list the dependencies of the current environment.
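For reference, a minimal sketch of the difference (the project/task names are illustrative, and _update_requirements is a private method, so its exact signature may differ between clearml versions):
```python
from clearml import Task

# Option 1: add_requirements() appends to whatever clearml auto-detects.
# It must be called before Task.init().
Task.add_requirements("my-engine", "1.0.0")

task = Task.init(project_name="examples", task_name="requirements demo")

# Option 2: _update_requirements() is private API (may change between
# versions); it replaces the detected packages entirely, so only the
# listed dependencies end up in the "Installed packages" section.
task._update_requirements(["my-engine==1.0.0"])
```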
Thanks TimelyPenguin76 and AgitatedDove14! I would like to delete artifacts/models related to the old archived experiments, but they are stored on S3. Would that be possible?
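If it comes to cleaning the bucket directly, a minimal boto3 sketch for removing everything under an experiment's prefix (bucket name and prefix are placeholders):
```python
import boto3

# Placeholders: replace with the bucket/prefix the artifacts were uploaded to.
bucket = "my-clearml-artifacts"
prefix = "examples/old-experiment-id/"

s3 = boto3.client("s3")

# List and delete every object under the experiment's prefix.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        s3.delete_object(Bucket=bucket, Key=obj["Key"])
```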
Woohoo! Thanks 👌
I think it comes from the web UI of clearml-server version 1.2.0, because I didn't change anything else.
AgitatedDove14 In theory, yes, there is no downside; in practice, running an app inside Docker inside a VM might introduce slowdowns. I guess it's on me to check whether this slowdown is negligible or not.
```
Installing collected packages: my-engine
  Attempting uninstall: my-engine
    Found existing installation: my-engine 1.0.0
    Uninstalling my-engine-1.0.0:
      Successfully uninstalled my-engine-1.0.0
Successfully installed my-engine-1.0.0
```
Why is it required in the case where boto3 can figure them out itself within the EC2 instance?
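For reference, this is boto3's default credential chain: inside an EC2 instance with an attached instance profile, no explicit keys are needed (a minimal sketch):
```python
import boto3

# No access key / secret passed: boto3 falls back to its default chain
# (environment variables, shared credentials file, and finally the
# EC2 instance metadata service / instance profile).
s3 = boto3.client("s3")
print(s3.list_buckets()["Buckets"])
```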
Hi AgitatedDove14 , Here is the full log.
Both Python versions (local and remote) are Python 3.6.
Locally (macOS), I get: pytorch3d== (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0, 0.4.0, 0.5.0)
Remotely (Ubuntu), I get: (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0)
So I guess it's not really related to clearml-agent, but rather to pip, which cannot find a proper Ubuntu wheel for the latest versions of pytorch3d, right? If yes, is there a way to build the wheel on the remote machine...
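One possible way to build on the remote machine is to point the requirements at the git source instead of a wheel, e.g. (a sketch, assuming the agent passes the line straight through to pip and the remote machine has a compatible build toolchain):
```python
from clearml import Task

# Ask the agent to install pytorch3d from source instead of a pre-built wheel.
# Must be called before Task.init(); building from source requires a working
# compiler/CUDA setup on the remote machine.
Task.add_requirements("git+https://github.com/facebookresearch/pytorch3d.git")

task = Task.init(project_name="examples", task_name="pytorch3d from source")
```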
But if you do that and the package is already installed, it will not be installed from the git repo; this is an issue with pip.
Exactly, that’s my problem: I want to remove it to make sure it is reinstalled (because the version can change)
I think that since the agent installs everything from scratch it should work for you. Wdyt?
With env caching enabled, it won’t reinstall this private dependency, right?
Thanks! Corrected both, now it's building.
"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"
I made some progress TimelyPenguin76. Now the task runs and I get this error from docker:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
I guess I’ll get used to it 😄
Never mind, the nvidia-smi command fails on that instance, so the problem lies somewhere else.
Super! I’ll give it a try and keep you updated here, thanks a lot for your efforts 🙏
Oh, the object is actually available in previous_task.artifacts
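For anyone finding this later, a minimal sketch of pulling the object back (the task ID and artifact name are placeholders):
```python
from clearml import Task

# Placeholders: use the real task ID and the name the artifact was registered under.
previous_task = Task.get_task(task_id="aabbccddee1122334455")
artifact = previous_task.artifacts["my_artifact"]

# .get() downloads and deserializes the stored object;
# .get_local_copy() returns a local file path instead.
obj = artifact.get()
```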
Is it because I did not specify --gpu 0 that the agent, by default, pulls one experiment per available GPU?
Thanks! I will investigate further, I am thinking that the AWS instance might have been stuck for an unknown reason (becoming unhealthy)
Thanks a lot AgitatedDove14 !
Answering myself: Yes, Task.set_base_docker RTFM!!!
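For reference, a minimal sketch of using it (the image name is only an example; older clearml versions take a single string, newer ones also accept keyword arguments):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="docker demo")

# Tell the agent which docker image to use when this task is executed remotely.
# Example image name; pick one matching your CUDA/driver setup.
task.set_base_docker("nvidia/cuda:11.2.2-runtime-ubuntu20.04")
```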
I think we should switch back, and have a configuration option to control which mechanism the agent uses, wdyt?
That sounds great!
Nevertheless, there might still be some value in that, because it would reduce the startup time by removing the initial setup of the agent and the downloading of the data to the instance. But not as much as I initially described, if stopped instances are bound to the same capacity limitations as newly launched instances.