What is weird is:
Executing the task from an agent, task.get_parameters() returns an empty dict. Calling task.get_parameters() from a local standalone script returns the correct properties, as shown in the web UI, even after I updated them in the UI. So I guess the problem comes from trains-agent?
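For reference, this is roughly the check I run in both places (a minimal sketch; project/task names are placeholders):
```python
from trains import Task

# Minimal reproduction sketch: print what the running task sees.
task = Task.init(project_name="debug", task_name="param-check")
print(task.get_parameters())  # {} when executed by trains-agent, populated when run locally
```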
More context:
trains, trains-agent and trains-server are all 0.16; Session.api_version -> 2.9 (both when executed in trains-agent and in a local script)
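(In case it's useful, this is how I read that value; I'm assuming the backend_api import path here:)
```python
from trains.backend_api import Session

print(Session.api_version)  # prints 2.9 both under trains-agent and in the local script
```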
So in my minimal reproducible example it does work 🤣 Very frustrating; I will keep searching for that nasty bug
I just checked: I do have trains version 0.16, and the experiment was created with that version
ExcitedFish86 I have several machines with different CUDA driver/runtime versions, that is why you might be confused, as I am referring to one or another 🙂
As to why: this is part of the piping that I described in a previous message: task B requires an artifact from task A, so I pass the name of the artifact as a parameter of task B, so that B knows which artifact from A it should retrieve
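Roughly what that piping looks like (a sketch; the task ID, parameter names and artifact name are placeholders):
```python
from trains import Task

# Task B: connect the parameters so the agent/UI can override them,
# then use them to locate the artifact produced by task A.
task_b = Task.init(project_name="pipeline", task_name="task-b")
params = {"source_task_id": "<task A id>", "artifact_name": "dataset"}
task_b.connect(params)

task_a = Task.get_task(task_id=params["source_task_id"])
artifact_path = task_a.artifacts[params["artifact_name"]].get_local_copy()
print(artifact_path)
```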
(btw, yes, I adapted the code to use Task.init(..., output_uri=...))
Also tried task.get_logger().report_text(str(task.data.hyperparams))
-> AttributeError: 'Task' object has no attribute 'hyperparams'
did you try with another availability zone?
Because it lives behind a VPN and GitHub workers don’t have access to it
There is no need to add creds on the machine, since the EC2 instance has an attached IAM profile that grants access to S3. Boto3 is able to retrieve the files from the S3 bucket
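i.e. something like this works on the instance without any credentials configured (bucket/key below are placeholders):
```python
import boto3

# With an IAM instance profile attached to the EC2 instance, boto3 resolves
# credentials automatically from the instance metadata, no keys needed locally.
s3 = boto3.client("s3")
s3.download_file("my-bucket", "path/to/artifact.pkl", "/tmp/artifact.pkl")
```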
in clearml.conf:
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
Hey SuccessfulKoala55 , unfortunately this doesn’t work, because the dict contains other dicts, and only the first-level dict becomes a plain dict; the inner dicts are still ProxyDictPostWrite and will make OmegaConf.create fail
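As a workaround I was thinking of something like this, recursively copying into plain dicts before handing it to OmegaConf (just a sketch; `connected_config` stands for the dict returned by task.connect(...)):
```python
from omegaconf import OmegaConf

def to_plain_dict(value):
    # The proxy objects behave like dicts/lists, so a plain recursive copy
    # should strip the ProxyDictPostWrite wrappers at every level.
    if isinstance(value, dict):
        return {k: to_plain_dict(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain_dict(v) for v in value]
    return value

# Placeholder for the dict returned by task.connect(...)
connected_config = {"model": {"lr": 0.001, "layers": [64, 64]}}
cfg = OmegaConf.create(to_plain_dict(connected_config))
```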
With what I shared above, I now get:
docker: Error response from daemon: network 'host' not found.
Probably something's wrong with the instance. Which AMI did you use? The default one?
The default one no longer exists/is not accessible anymore, so I replaced it with the one shown on the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title , that is: ami-04c0416d6bd8e4b1f
Never mind, the nvidia-smi command fails on that instance; the problem lies somewhere else
Oh, I wasn't aware of that new implementation, was it introduced silently? I don't remember reading about it in the release notes! To answer your question: no, for GCP I used the old version, but for Azure I will use this one, and maybe send a PR if the code is clean 👍
So I guess the problem is that the following snippet:
from clearml import Task
Task.init()
should be added before the if __name__ == "__main__": ?
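Something like this is what I have in mind (a sketch; project/task names are placeholders):
```python
from clearml import Task

# Task.init() at module level, before the __main__ guard, so it is called
# regardless of how the script ends up being executed by the agent.
task = Task.init(project_name="my-project", task_name="my-task")

def main():
    print(task.get_parameters())

if __name__ == "__main__":
    main()
```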
Interestingly, I do see the 100 GB volume in the AWS console:
I'll try to pass these values using the env vars
The rest of the configuration is set with env variables
Alright, how can I then mount a volume from that disk?
Yes AgitatedDove14 🙂
Interesting - I can reproduce easily
AgitatedDove14 So what you are saying is that, since I have trains-server 0.16.1, I should use trains>=0.16.1? And what about trains-agent? Only version 0.16 is released atm; that's the one I use
mmmh good point actually, I didn’t think about it