Also tried task.get_logger().report_text(str(task.data.hyperparams))
-> AttributeError: 'Task' object has no attribute 'hyperparams'
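Not from the original exchange, but for illustration: one way to dump the hyperparameters without reaching into task.data is the SDK accessor; a minimal sketch, assuming the script runs inside an already-initialized task:
```python
from clearml import Task

task = Task.current_task()
# get_parameters_as_dict() returns the hyperparameters registered on the
# task, grouped by section, without touching the raw backend data model.
params = task.get_parameters_as_dict()
task.get_logger().report_text(str(params))
```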
did you try with another availability zone?
Because it lives behind a VPN and GitHub workers don't have access to it
There is no need to add creds on the machine, since the EC2 instance has an attached IAM profile that grants access to S3. Boto3 is able to retrieve the files from the S3 bucket
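As a sketch of what that looks like in practice (bucket and key names are made up), boto3 picks up the instance-profile credentials automatically when none are configured:
```python
import boto3

# No explicit credentials: boto3 falls back to the IAM role attached to
# the EC2 instance via its instance profile.
s3 = boto3.client("s3")
s3.download_file("my-bucket", "path/to/model.weights", "/tmp/model.weights")
```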
in clearml.conf:
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
here is the function used to create the task:
```python
from typing import Iterable, List

from clearml import Task


def schedule_task(parent_task: Task,
                  task_type: str = None,
                  entry_point: str = None,
                  force_requirements: List[str] = None,
                  queue_name="default",
                  working_dir: str = ".",
                  extra_params=None,
                  wait_for_status: bool = False,
                  raise_on_status: Iterable[Task.TaskStatusEnum] = (Task.TaskStatusEnum.failed, Task.Ta...
```
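Purely for illustration (not in the original message): a call to the function above might look like this, assuming it returns the newly scheduled Task and that the truncated defaults behave as their names suggest:
```python
child_task = schedule_task(
    parent_task=Task.current_task(),   # hypothetical: schedule from the running task
    task_type="training",
    entry_point="train.py",            # made-up entry point
    queue_name="default",
    working_dir=".",
    wait_for_status=False,
)
```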
Hey SuccessfulKoala55, unfortunately this doesn't work, because the dict contains other dicts: only the first-level dict becomes a plain dict, while the inner dicts are still ProxyDictPostWrite and will make OmegaConf.create fail
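For illustration only (not from the original thread): a recursive conversion along these lines would be needed before handing the dict to OmegaConf. to_plain_dict is a hypothetical helper, and it assumes ProxyDictPostWrite is a dict subclass so the isinstance check catches it:
```python
from omegaconf import OmegaConf

def to_plain_dict(obj):
    # Recursively copy dict-like wrappers (e.g. ProxyDictPostWrite) into
    # plain dicts so that nested levels are converted as well.
    if isinstance(obj, dict):
        return {k: to_plain_dict(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain_dict(v) for v in obj]
    return obj

# connected_params stands in for the dict returned by task.connect(...)
cfg = OmegaConf.create(to_plain_dict(connected_params))
```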
with what I shared above, I now get:
docker: Error response from daemon: network 'host' not found.
for some reason when cloning task A, trains sets an old commit in task B. I tried to recreate task A to enforce a new task id and new commit id, but still the same issue
AgitatedDove14 I think it's on me to take the pytorch distributed example in the clearml repo and try to reproduce the bug, then pass it over to you
Probably something's wrong with the instance. Which AMI did you use? The default one?
The default one is no longer available/accessible, so I replaced it with the one shown on the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title , that is: ami-04c0416d6bd8e4b1f
Never mind, the nvidia-smi command fails on that instance, so the problem lies somewhere else
Oh, I wasn't aware of that new implementation, was it introduced silently? I don't remember reading about it in the release notes! To answer your question: no, for GCP I used the old version, but for Azure I will use this one, and maybe send a PR if the code is clean
So I guess the problem is that the following snippet:
from clearml import Task
Task.init()
should be added before the if __name__ == "__main__": ?
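As a sketch of that placement (not from the original message; the project/task names are placeholders):
```python
from clearml import Task

# Task.init() at module level, so it runs before the __main__ guard.
task = Task.init(project_name="examples", task_name="pytorch-distributed")


def main():
    ...


if __name__ == "__main__":
    main()
```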
Interestingly, I do see the 100 GB volume in the AWS console:
I'll try to pass these values using the env vars
The rest of the configuration is set with env variables
Alright, how can I then mount that disk as a volume?
Yes AgitatedDove14
Interesting - I can reproduce easily
AgitatedDove14 So what you are saying is that since I have trains-server 0.16.1, I should use trains>=0.16.1? And what about trains-agent? Only version 0.16 is released atm, which is the one I use
Mmmh, good point actually, I didn't think about it
because at some point it introduces too much overhead I guess
Nice, the preview param will do. Btw, I love the new docs layout!
AgitatedDove14 This seems to be consistent even if I specify the absolute path to /home/user/trains.conf
btw, I tried with alpine instead of ubuntu:18.04, and got:
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
df20fa9351a1: Pulling fs layer
df20fa9351a1: Verifying Checksum
df20fa9351a1: Download complete
df20fa9351a1: Pull complete
Digest: sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
Status: Downloaded newer image for alpine:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting containe...
Some more context: the second experiment finished and now, in the UI, in the workers & queues tab, I randomly see either:
trains-agent-1 | - | - | - | ...
or, after refreshing the page:
trains-agent-1 | long-experiment | 12h | 72000 |