So it is there already, but commented out, any reason why?
AgitatedDove14 SuccessfulKoala55 I just saw that clearml-server 1.4.0 was released, congrats! Was this bug fixed with this new version?
on /data or /opt/clearml? these are two different disks
SuccessfulKoala55 I am looking for ways to free some space and I have the following questions:
Is there a way to break down all the documents to identify the biggest ones? Is there a way to delete several :monitor:gpu and :monitor:machine time series? Is there a way to downsample some time series (e.g. loss)?
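(For the first question, a possible starting point would be listing the Elasticsearch indices by size - just a sketch, assuming the clearml-server Elasticsearch is reachable on its default port; the host below is a placeholder:)
```python
import requests

# Placeholder host/port: point this at the clearml-server Elasticsearch instance
resp = requests.get("http://localhost:9200/_cat/indices?v&s=store.size:desc&bytes=mb")
print(resp.text)  # one line per index, sorted by on-disk size (MB), biggest first
```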
mmmh it fails, but if I connect to the instance and execute `ulimit -n`, I do see `65535`, while the tasks I send to this agent fail with:
`OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'`
and from the task itself, I run:
```
import subprocess
print(subprocess.check_output("ulimit -n", shell=True))
```
which gives me in the logs of the task: `b'1024'`
So nofile is still 1024, the default value, but not when I ssh, damn. Maybe rebooting would work
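(Side note for later readers: the 1024 reported from inside the task suggests the limit is the docker container's, not the host's. A possible workaround - just a sketch, assuming the agent runs in docker mode and that extra_docker_arguments in the agent's clearml.conf is the right hook - would be:)
```
agent {
    # extra flags appended to the docker run command line by clearml-agent
    extra_docker_arguments: ["--ulimit", "nofile=65535:65535"]
}
```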
Downloading the artifacts is done only when actually calling get()/get_local_copy()
Yes, I rather meant: reproduce this behavior even for getting metadata on the artifacts
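(For context, a minimal sketch of the lazy-download behavior described above - the task id and artifact name are placeholders:)
```python
from clearml import Task

task = Task.get_task(task_id="<task-id>")   # placeholder id

artifact = task.artifacts["images"]         # placeholder name; returns the artifact object
print(artifact.url)                         # metadata access, nothing is downloaded yet

local_path = artifact.get_local_copy()      # the file is only downloaded here
obj = artifact.get()                        # ...or downloaded and deserialized here
```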
Well actually I do see many errors like that in the browser console:
But we can easily extend, right?
AgitatedDove14 Didn't work
But I see in the agent logs: `Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', ...`
For some reason the configuration object gets updated at runtime to:
```
resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = ""
```
mmmh good point actually, I didn't think about it
TimelyPenguin76, no, I've only set the `sdk.aws.s3.region = eu-central-1` param
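(Concretely, the only S3-related setting in the clearml.conf was the region, i.e.:)
```
sdk {
    aws {
        s3 {
            region: "eu-central-1"
        }
    }
}
```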
Yeah, I really need that feature, I need to move away from key/secret pairs to IAM roles
CostlyOstrich36 I updated both agents to 1.1.2 and still got the same problem unfortunately. Since I can download the full log file from the Web UI, I guess the agents are reporting correctly?
Could it be that Elasticsearch does not return all the requested logs when it is queried from the WebUI to display them in the console?
Now that I think about it, I remember that in the changelog of clearml-server 1.2.0 the following is listed:
`Fix UI Workers & Queues and Experiment Table pages ...`
Hi AgitatedDove14 , coming by after a few experiments this morning:
Indeed torch 1.3.1 does not support CUDA; I tried with 1.7.0 and it worked, BUT trains was not able to pick the right wheel when I updated the torch req from 1.3.1 to 1.7.0: it downloaded the wheel for CUDA version 101. But in the experiment log, the agent correctly reported the CUDA version (111). I then replaced the torch==1.7.0 with the direct https link to the torch wheel for CUDA 110, and that worked (I also tried specifyin...
Thanks for your answer! I am in the process of adding subnet_id/security_groups_id/key_name to the config to be able to ssh into the machine, will keep you informed
If I remove `security_group_ids` and just leave `subnet_id` in the configuration, it is not taken into account (the instances are created in the default subnet)
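(For reference, a rough sketch of the resource_configurations entry being edited - the entry name, instance type and all ids are placeholders, and the exact set of supported keys depends on the autoscaler version:)
```
resource_configurations {
    default {
        instance_type: "g4dn.xlarge"          # placeholder
        ami_id: "ami-xxxxxxxx"                # placeholder
        key_name: "my-ssh-key"                # placeholder, enables ssh access
        security_group_ids: ["sg-xxxxxxxx"]   # placeholder
        subnet_id: "subnet-xxxxxxxx"          # placeholder
    }
}
```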
BTW, is there any specific reason for not upgrading to clearml?
I just didn't have time so far
I will go for lunch actually, back in ~1h
Seems like it just went unresponsive at some point
Ok, I am asking because I often see the autoscaler starting more instances than the number of experiments in the queues, so I guess I just need to increase `max_spin_up_time_min`
Here is what happens with `polling_interval_time_min=1` when I add one task to the queue. The instance takes ~5 mins to start and connect. During this timeframe, the autoscaler starts new instances, then spins them down. So it acts as if `max_spin_up_time_min=10` is not taken into account.
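(The two settings being compared here, with the values mentioned above - the surrounding section name is an assumption based on the autoscaler example config:)
```
hyper_params {
    polling_interval_time_min: 1   # how often the queues are polled for pending tasks
    max_spin_up_time_min: 10       # how long a new instance is given to come up and connect
}
```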
(BTW: it will work with elevated credentials, but probably not recommended)
What does that mean? Not sure I understand
Yes, I stayed with an older version for a compatibility reason I cannot remember now - just tested with 1.1.2 and it's the same
I tried specifying the bucket directly in my clearml.conf, same problem. I guess clearml just reads from the env vars first
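(For reference, the per-bucket section referred to here looks roughly like this - bucket name and keys are placeholders:)
```
sdk {
    aws {
        s3 {
            region: "eu-central-1"
            credentials: [
                {
                    bucket: "my-bucket"   # placeholder
                    key: "AWS_KEY_ID"     # placeholder
                    secret: "AWS_SECRET"  # placeholder
                }
            ]
        }
    }
}
```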
Sorry both of you, my problem was actually lying somewhere else (both buckets are in the same region) - thanks for your time!