AgitatedDove14 I cannot confirm 100%, the context is different (see previous messages), but it could be the same bug behind the scenes...
What is weird is:
- Executing the task from an agent: task.get_parameters() returns an empty dict.
- Calling task.get_parameters() from a local standalone script returns the correct properties, as shown in the web UI, even after I updated them in the UI.
So I guess the problem comes from trains-agent?
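For reference, the local standalone check that returns the correct properties is basically this (the task ID is a placeholder, copied from the experiment page in the web UI):
```python
from trains import Task

# Placeholder experiment ID taken from the web UI
task = Task.get_task(task_id="<experiment-id>")
print(task.get_parameters())  # returns the parameters shown in the UI
```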
More context:
trains, trains-agent and trains-server are all 0.16; Session.api_version -> 2.9 (both when executed in trains-agent and in a local script)
So in my minimal reproducible example it does work 🤣 very frustrating, I will keep searching for that nasty bug
I just checked: I do have trains version 0.16 and the experiment was created with that version
ExcitedFish86 I have several machines with different CUDA driver/runtime versions, that is why you might be confused, as I am referring to one or another 🙂
As to why: this is part of the piping that I described in a previous message: task B requires an artifact from task A, so I pass the name of the artifact as a parameter of task B, so that B knows which artifact from A it should retrieve
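For context, the retrieval side in task B looks roughly like this (project, task, parameter and artifact names below are only illustrative, not the real ones):
```python
from clearml import Task

task_b = Task.init(project_name="my_project", task_name="task_b")
# Parameters passed in when task B was scheduled (names are placeholders)
params = task_b.connect({"parent_task_id": "", "artifact_name": "dataset"})

# Fetch the artifact that task A produced, using the names received as parameters
task_a = Task.get_task(task_id=params["parent_task_id"])
local_path = task_a.artifacts[params["artifact_name"]].get_local_copy()
```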
(btw, yes I adapted to use Task.init(...output_uri=...))
Also tried task.get_logger().report_text(str(task.data.hyperparams))
-> AttributeError: 'Task' object has no attribute 'hyperparams'
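For what it's worth, a way to dump the hyperparameters without touching the internal task.data structure would be something like the following sketch (assuming your trains/clearml version exposes get_parameters_as_dict):
```python
# Dump the hyperparameters grouped by section, bypassing task.data entirely
task.get_logger().report_text(str(task.get_parameters_as_dict()))
```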
did you try with another availability zone?
Because it lives behind a VPN and GitHub workers don't have access to it
There is no need to add creds on the machine, since the EC2 instance has an attached IAM profile that grants access to S3. Boto3 is able to retrieve the files from the S3 bucket
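As a quick sanity check on that machine (bucket and key names below are placeholders), boto3 picks up the instance-profile credentials automatically, with no keys configured:
```python
import boto3

# No explicit credentials: boto3 falls back to the EC2 instance profile
s3 = boto3.client("s3")
s3.download_file("my-bucket", "path/to/artifact.bin", "/tmp/artifact.bin")
```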
in clearml.conf:
```
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
```
here is the function used to create the task:
```python
def schedule_task(parent_task: Task,
                  task_type: str = None,
                  entry_point: str = None,
                  force_requirements: List[str] = None,
                  queue_name="default",
                  working_dir: str = ".",
                  extra_params=None,
                  wait_for_status: bool = False,
                  raise_on_status: Iterable[Task.TaskStatusEnum] = (Task.TaskStatusEnum.failed, Task.Ta...
```
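The body is cut off above; as a rough sketch only (not the actual implementation), a helper like this typically clones the parent task, overrides parameters, enqueues the clone and optionally waits for it:
```python
from typing import Iterable
from clearml import Task

def schedule_task_sketch(parent_task: Task,
                         queue_name: str = "default",
                         extra_params: dict = None,
                         wait_for_status: bool = False,
                         raise_on_status: Iterable[Task.TaskStatusEnum] = (Task.TaskStatusEnum.failed,)):
    # Clone the parent task so the new task inherits repo, commit and configuration
    child = Task.clone(source_task=parent_task, name=f"{parent_task.name} (scheduled)")
    # Override/extend the hyperparameters of the clone
    if extra_params:
        child.set_parameters(extra_params)
    # Enqueue the clone for an agent listening on the given queue
    Task.enqueue(child, queue_name=queue_name)
    if wait_for_status:
        # Block until the task reaches a final state; raise on the listed statuses
        child.wait_for_status(raise_on_status=tuple(raise_on_status))
    return child
```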
Hey SuccessfulKoala55, unfortunately this doesn't work, because the dict contains other dicts: only the first-level dict becomes a plain dict, the inner dicts are still ProxyDictPostWrite and will make OmegaConf.create fail
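A workaround I can think of (just a sketch, not something from the SDK) is to convert the proxies recursively before handing the result to OmegaConf; `params` below stands for the proxy-wrapped dict returned by task.connect(...):
```python
from omegaconf import OmegaConf

def to_plain_dict(d):
    # Recursively copy dict-like objects (including the ProxyDictPostWrite wrappers,
    # which behave like dicts) into plain nested dicts
    if isinstance(d, dict):
        return {k: to_plain_dict(v) for k, v in d.items()}
    return d

cfg = OmegaConf.create(to_plain_dict(params))
```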
with what I shared above, I now get:
docker: Error response from daemon: network 'host' not found.
for some reason, when cloning task A, trains sets an old commit in task B. I tried to recreate task A to force a new task id and a new commit id, but I still get the same issue
Probably something's wrong with the instance; which AMI did you use? The default one?
The default one does not exist / is not accessible anymore, so I replaced it with the one shown on the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title , that is: ami-04c0416d6bd8e4b1f
Never mind, the nvidia-smi command fails on that instance, so the problem lies somewhere else
Oh, I wasn't aware of that new implementation, was it introduced silently? I don't remember reading about it in the release notes! To answer your question: no, for GCP I used the old version, but for Azure I will use this one, and maybe send a PR if the code is clean 🙂
So I guess the problem is that the following snippet:
```python
from clearml import Task
Task.init()
```
should be added before the if __name__ == "__main__": ?
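i.e. something along these lines (project/task names and the main() body are made up, just to show the placement):
```python
from clearml import Task

# Task.init() placed at module level, before the __main__ guard
task = Task.init(project_name="my_project", task_name="my_experiment")

def main():
    ...

if __name__ == "__main__":
    main()
```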
Interestingly, I do see the 100 GB volume in the AWS console:
I'll try to pass these values using the env vars
The rest of the configuration is set with env variables
Alright, how can I then mount a volume of the disk?
Yes AgitatedDove14 🙂