AgitatedDove14 In theory yes there is no downside, in practice running an app inside docker inside a VM might introduce slowdowns. I guess it’s on me to check whether this slowdown is negligible or not
Installing collected packages: my-engine
  Attempting uninstall: my-engine
    Found existing installation: my-engine 1.0.0
    Uninstalling my-engine-1.0.0:
      Successfully uninstalled my-engine-1.0.0
Successfully installed my-engine-1.0.0
Why is it required in the case where boto3 can figure them out itself within the ec2 instance?
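For context, a minimal boto3 sketch (the bucket name is a placeholder): when no keys are passed explicitly, boto3 walks its default credential chain, which on an EC2 instance ends at the instance profile (IAM role) exposed through the metadata service.

```python
import boto3

# No aws_access_key_id / aws_secret_access_key passed: boto3 resolves
# credentials from env vars, ~/.aws, and finally the EC2 instance profile.
s3 = boto3.client("s3")
s3.list_objects_v2(Bucket="my-example-bucket", MaxKeys=10)  # bucket name is hypothetical
```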
Hi AgitatedDove14 , Here is the full log.
Both python versions (local and remote) are python 3.6
Locally (macOS), I get pytorch3d== (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0, 0.4.0, 0.5.0)
Remotely (Ubuntu), I get (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0)
So I guess it’s not related to clearml-agent really, rather pip that cannot find the proper wheel for Ubuntu for the latest versions of pytorch3d, right? If yes, is there a way to build the wheel on the remote machine...
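One possible direction, as a sketch only: assuming Task.add_requirements accepts a direct git reference (the tag below is purely illustrative), the requirement could point at the pytorch3d repo so the remote agent builds the wheel from source instead of looking for a pre-built wheel on PyPI.

```python
from clearml import Task

# Sketch: register the git source as a requirement before Task.init so the
# agent installs pytorch3d from source on the remote Ubuntu machine.
Task.add_requirements("git+https://github.com/facebookresearch/pytorch3d.git@v0.5.0")

task = Task.init(project_name="examples", task_name="pytorch3d-from-source")
```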
but if you do that and the package is already installed, it will not install from the git repo; this is an issue with pip
Exactly, that’s my problem: I want to remove it to make sure it is reinstalled (because the version can change)
I think that since the agent installs everything from scratch it should work for you. Wdyt?
With env caching enabled, it won’t reinstall this private dependency, right?
Thanks! Corrected both, now it’s building
"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"
I got some progress TimelyPenguin76, now the task runs and I get the error from docker:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
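A quick sanity check, as a sketch (the CUDA image tag is just an example): this error usually means Docker cannot find the NVIDIA container runtime, so running nvidia-smi inside a CUDA base image either prints the GPU table or reproduces the same failure.

```python
import subprocess

# If the NVIDIA container toolkit is installed and registered with Docker,
# this prints the GPU table; otherwise it fails with the same
# "could not select device driver" error.
subprocess.run(
    ["docker", "run", "--rm", "--gpus", "all",
     "nvidia/cuda:11.0.3-base-ubuntu18.04", "nvidia-smi"],
    check=True,
)
```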
I guess I’ll get used to it 😄
Nevermind, nvidia-smi command fails in that instance, the problem lies somewhere else
Hi SuccessfulKoala55, How can I know if I am logged in in this free access mode? I assume I am, since on the login page I only see a login field, not a password field
Super! I’ll give it a try and keep you updated here, thanks a lot for your efforts 🙏
Oh, the object is actually available in previous_task.artifacts
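A minimal sketch of that lookup (the task id and artifact name are placeholders):

```python
from clearml import Task

previous_task = Task.get_task(task_id="aabbccddeeff00112233445566778899")
obj = previous_task.artifacts["my_artifact"].get()  # downloads and deserializes the object
```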
Is it because I did not specify --gpu 0 that the agent, by default, pulls one experiment per available GPU?
Thanks! I will investigate further, I am thinking that the AWS instance might have been stuck for an unknown reason (becoming unhealthy)
Thanks a lot AgitatedDove14 !
Answering myself: Yes, Task.set_base_docker RTFM!!!
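For the record, a small sketch of that call (the image name and queue are illustrative):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="controller")
# Tell the agent which docker image to use when this task runs remotely
task.set_base_docker("nvidia/cuda:11.1.1-runtime-ubuntu18.04")
task.execute_remotely(queue_name="default")
```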
I think we should switch back, and have a configuration to control which mechanism the agent uses, wdyt?
That sounds great!
The task requires this service, so the task starts it on the machine - Then I want to make sure the service is closed by the task upon completion/failure/abortion
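One way this could look, as a sketch only (the service command is hypothetical): register the cleanup both with atexit and for SIGTERM, since aborting a task typically arrives as SIGTERM.

```python
import atexit
import signal
import subprocess
import sys

# Start the side service from within the task (command is hypothetical)
service = subprocess.Popen(["my-side-service", "--port", "8080"])

def _stop_service():
    if service.poll() is None:       # still running?
        service.terminate()
        service.wait(timeout=30)

def _handle_sigterm(signum, frame):
    _stop_service()
    sys.exit(0)

atexit.register(_stop_service)                  # completion / unhandled exception
signal.signal(signal.SIGTERM, _handle_sigterm)  # abort usually sends SIGTERM
```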
Nevertheless there might still be some value in that, because it would make it possible to reduce the startup time by skipping the initial setup of the agent and the download of the data to the instance - but not by as much as I initially described, if stopped instances are bound to the same capacity limitations as newly launched instances
Isn't it overkill to run a whole Ubuntu 18.04 just to run a dead simple controller task?
Hi AgitatedDove14, I don’t see any in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping but I guess I could override it and add one?
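A possible sketch of that override, assuming the base handler keeps its counter / best_score attributes:

```python
from ignite.handlers import EarlyStopping

class ReportingEarlyStopping(EarlyStopping):
    """Same stopping logic, but report when early stopping actually fires."""

    def __call__(self, engine):
        super().__call__(engine)
        if self.counter >= self.patience:
            # report wherever convenient, e.g. the ClearML task logger
            print(f"Early stopping triggered, best score so far: {self.best_score}")
```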
AgitatedDove14 Should I create an issue for this to keep track of it?
Thanks SuccessfulKoala55! So CLEARML_NO_DEFAULT_SERVER=1 is set by default, right?
Thanks @<1523701087100473344:profile|SuccessfulKoala55> ! Are alive workers sending pings to notify the server that they are alive, or does the server infer that they are alive based on the last communication?
