Thanks for sharing the issue UnevenDolphin73, I'll comment on it!
So probably only the main process (rank=0) should attach the ClearMLLogger?
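For instance, something along these lines is what I had in mind - just a sketch, assuming torch.distributed is used and the ignite ClearMLLogger (the import path may differ between ignite versions; project/task names are placeholders):
```python
# Sketch: create/attach the ClearMLLogger only on the main process (rank 0).
import torch.distributed as dist
from ignite.contrib.handlers.clearml_logger import ClearMLLogger

rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0

clearml_logger = None
if rank == 0:
    # only the main process creates the logger / ClearML task
    clearml_logger = ClearMLLogger(project_name="my_project", task_name="ddp_training")
    # ... attach output handlers to the trainer/evaluators here ...
```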
I also tried setting ebs_device_name = "/dev/sdf" - didn't work
Unfortunately this is difficult to reproduce... Nevertheless, it is important for me to be robust against it, because if this error happens in a task in the middle of my pipeline, the whole process fails.
This ties into another, wider topic I think: how to "skip" tasks if they already ran (a mechanism similar to what [ https://luigi.readthedocs.io/en/stable/ ] offers). That would allow restarting the pipeline and skipping tasks up to the point where a task failed.
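To illustrate, a rough sketch of such a skip mechanism, assuming Task.get_tasks can be filtered by status (the project and task names are placeholders, not my actual pipeline):
```python
# Sketch of a Luigi-like "skip if already done" check.
from clearml import Task

def already_completed(project_name: str, task_name: str) -> bool:
    done = Task.get_tasks(
        project_name=project_name,
        task_name=task_name,
        task_filter={"status": ["completed"]},
    )
    return len(done) > 0

if not already_completed("my_pipeline", "preprocess_data"):
    pass  # run or (re-)enqueue the step here
```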
If I remove security_group_ids and just leave subnet_id in the configuration, it is not taken into account (the instances are created in the default subnet)
(Just to know if I should wait a bit or go with the first solution)
Usually one or two tags. Indeed, task IDs are not so convenient, but only because they are not displayed on the page, so I have to go back to another page to check the ID of each experiment. Maybe just showing the ID of each experiment on the SCALARS page would already be great, wdyt?
I was asking to exclude this possibility from my debugging journey 😁
ok, and if that's not the case, it will fall back to 3.8, right? Would it be possible to support such a use case (have the clearml-agent set up a different python version when a task needs it)?
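If I understand correctly, today the interpreter can only be set globally in clearml.conf, something like the sketch below (the setting name and path are just an example, assuming agent.python_binary is the relevant key):
```
agent {
    # example path - the interpreter the agent uses to build task environments
    python_binary: "/usr/local/bin/python3.9"
}
```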
I don’t have a registry to push my image to. I think I can get around it actually - will it work if I just build the image locally once, then start the agent? Docker would recognise that image locally and just use it, right? I won’t need to update that image often anyway
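Concretely, something like this is what I'd do - just a sketch, assuming the agent host already has the locally built tag so no registry pull is needed (set_base_docker's exact signature seems to vary across SDK versions):
```python
# Sketch: point the task at a locally built image tag.
from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")
task.set_base_docker("my_local_image:latest")  # tag built locally on the agent host
```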
but not as much as the ELB reports
Hi NonchalantHedgehong19, thanks for the hint! What should be the content of the requirements file then? Can I specify my local package inside? How?
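For example, would something like this be a valid requirements file? (package name, URL and path are placeholders):
```
# local package referenced by a VCS URL
git+https://github.com/my-org/my_local_package.git@main#egg=my_local_package
# or, if the package sources live inside the repository the agent clones:
# -e ./packages/my_local_package
clearml
```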
Yea, again I am trying to understand what I can do with what I have 😄 I would like to export, as an environment variable, the python runtime the agent installs into, so that an app I use inside the Task can rely on the packages installed by the agent and I can keep controlling the packages through clearml easily
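Something like this sketch, where AGENT_PYTHON is just a name I made up - inside the task, sys.executable should already be the interpreter the agent set up:
```python
# Sketch: expose the interpreter the agent set up (it is the one running this script).
import os
import sys

os.environ["AGENT_PYTHON"] = sys.executable
print(f"python used by the agent: {sys.executable}")
```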
TimelyPenguin76 , no, I’ve only set the sdk.aws.s3.region = eu-central-1 param
Mmmh, unfortunately not easily… I will try to dig deeper today. Is there a way to resume a task from code, to debug locally?
Something like replacing Task.init with Task.get_task, so that Task.current_task is the same task as the output of Task.get_task
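i.e. roughly this, as a sketch (the task ID is a placeholder):
```python
# Sketch: fetch an existing task by ID instead of creating a new one with Task.init,
# so it can be inspected/debugged locally.
from clearml import Task

task = Task.get_task(task_id="<existing-task-id>")
print(task.get_parameters())      # hyperparameters the task ran with
print(task.get_last_iteration())  # last reported iteration
```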
Is there any logic on the server side that could change the iteration number?
The first problem I had, which didn’t give useful info, was that docker was not installed on the agent machine x)
I am still confused though - on the Get Started page of the PyTorch website, when choosing "conda" the generated installation command includes cudatoolkit, while when choosing "pip" it only uses a wheel file.
Does that mean the wheel file contains cudatoolkit (cuda runtime)?
AgitatedDove14 According to the dependency order you shared, the original message of this thread isn't solved: the agent log I mentioned used the output from nvcc (2) before checking the nvidia driver version (1)
because I cannot locate libcudart or because cudnn_version = 0?
That said, v1.3.1 is already out, with what seems like a fix:
So you mean 1.3.1 should fix this bug?
Interesting idea! (I assume for reporting only, not configuration)
Yes, for reporting only - also to understand which version the agent uses to decide which torch wheel to download
Regarding the cuda check with nvcc, I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from nvidia-smi interface, worth checking though ...
Ok, but when nvcc is not ava...
Thanks for clarifying! Maybe this could be made explicit in the agent logs of the experiments, with something like the following?
agent.cuda_driver_version = ...
agent.cuda_runtime_version = ...
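For illustration only (not how the agent actually builds its log), such values could be gathered roughly like this, assuming nvidia-smi and torch are available:
```python
# Illustration: one possible way to collect the two values for reporting.
import subprocess
import torch

# driver version as reported by nvidia-smi
driver_version = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
).strip()

# CUDA runtime version the installed torch wheel was built against
runtime_version = torch.version.cuda

print(f"agent.cuda_driver_version = {driver_version}")
print(f"agent.cuda_runtime_version = {runtime_version}")
```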
Ok, this I cannot locate
I am running on bare metal, and cuda seems to be installed at /usr/lib/x86_64-linux-gnu/libcuda.so.460.39
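In case it helps my debugging, a small check I could run - just a sketch using the standard library (find_library returns a soname, or None, not a full path), and note that libcuda.so (driver) and libcudart.so (runtime) are different libraries:
```python
# Sanity check: can the CUDA runtime library be resolved at all on this machine?
import ctypes.util

runtime = ctypes.util.find_library("cudart")  # e.g. 'libcudart.so.11.0', or None if missing
driver = ctypes.util.find_library("cuda")     # the driver library mentioned above

print(f"libcudart resolved as: {runtime}")
print(f"libcuda resolved as: {driver}")
```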
Actually I think I am approaching the problem from the wrong angle
Could you please share the stacktrace?
Here I have to do it for each task; is there a way to do it for all tasks at once?