Because I cannot locate libcudart, or because cudnn_version = 0?
Hi AgitatedDove14, sorry, somehow this message got lost.
clearml version is the latest at the time, 1.7.1
Yes, I always see the "model uploaded completed" message for such stuck tasks. I am using Python 3.8.10.
btw CostlyOstrich36, I can see in Profile > Version: 1.1.1-135 • 1.1.1 • 2.14. What do these numbers correspond to?
That's why I said "not really".
Now I'm curious, what did you end up doing?
In my repo I maintain a bash script that sets up a separate Python env. Then in my task I spawn a subprocess and don't pass the env variables, so that the subprocess properly picks up the separate Python env. A minimal sketch of that pattern is below.
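Roughly something like this (the env path and script name are placeholders, just to illustrate the idea):

```python
import subprocess

# Hypothetical path to the interpreter of the separate env created by the bash setup script
venv_python = "/opt/envs/separate_env/bin/python"

# Pass an empty environment (env={}) so none of the parent process variables
# (PATH, PYTHONPATH, VIRTUAL_ENV, ...) leak into the child process; the child
# then resolves packages only from the separate env it was launched from.
subprocess.run([venv_python, "worker.py"], env={}, check=True)
```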
So in my minimal reproducible example it does work, which is very frustrating. I will continue searching for that nasty bug.
AgitatedDove14, yes, but I don't see in the docs how to attach it to the logger of the EarlyStopping handler.
(I am not part of the awesome ClearML team, just a happy user.)
Interesting idea! (I assume for reporting only, not configuration)
Yes, for reporting only. Also to understand which version is used by the agent to decide which torch wheel to download.
Regarding the CUDA check with nvcc, I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from the nvidia-smi interface, worth checking though ...
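For reference, a rough sketch of how one could probe the CUDA version from either tool; the regexes are assumptions about the output format of nvcc --version and the nvidia-smi header:

```python
import re
import subprocess

def cuda_version_from_nvcc():
    """Return the toolkit version reported by 'nvcc --version', or None if nvcc is missing."""
    try:
        out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    match = re.search(r"release (\d+\.\d+)", out)
    return match.group(1) if match else None

def cuda_version_from_nvidia_smi():
    """Fallback: parse the 'CUDA Version: X.Y' field printed in the nvidia-smi header."""
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", out)
    return match.group(1) if match else None

print(cuda_version_from_nvcc() or cuda_version_from_nvidia_smi() or "no CUDA detected")
```

Note that nvidia-smi reports the maximum CUDA version supported by the installed driver, not necessarily the toolkit that is actually installed, which is probably why the nvcc check is the one used.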
Ok, but when nvcc is not ava...
In the controller, I want to upload an artifact and then start a task that will query that artifact, and I want to make sure the artifact already exists when the task tries to retrieve it.
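A rough sketch of how I picture that with the ClearML SDK, assuming upload_artifact's wait_on_upload flag to block until the upload finishes; the artifact name, file path, and task id are placeholders:

```python
from clearml import Task

# --- in the controller ---
controller = Task.current_task()
# Block until the artifact is fully uploaded before enqueueing the child task
controller.upload_artifact(
    name="training_data",               # hypothetical artifact name
    artifact_object="data/train.csv",   # hypothetical local file
    wait_on_upload=True,
)

# --- in the spawned task ---
parent = Task.get_task(task_id="<controller_task_id>")  # placeholder id
local_path = parent.artifacts["training_data"].get_local_copy()
```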
Hey FriendlySquid61 ,
I ended up asking for full control of EC2 so as not to be blocked, so unfortunately I cannot give you a more precise list.
You already fixed the problem with pyjwt in the newest version of clearml/clearml-agent, so all good.
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
Ok, but will it install the engine and its dependencies as expected?
I also did run sudo apt install nvidia-cuda-toolkit
From the answers I saw on the internet, it is most likely related to a mismatch of CUDA/cuDNN versions.
The main issue is task_logger.report_scalar() not reporting the scalars.
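For context, a minimal sketch of how report_scalar is typically called against the ClearML Logger; the project/task names and values here are only illustrative, not my actual setup:

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="scalar-demo")  # illustrative names
task_logger = task.get_logger()

for iteration in range(10):
    # title/series group the curve in the Scalars tab; iteration is the x-axis value
    task_logger.report_scalar(
        title="loss",
        series="train",
        value=1.0 / (iteration + 1),
        iteration=iteration,
    )
```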
Thanks TimelyPenguin76 and AgitatedDove14! I would like to delete artifacts/models related to the old archived experiments, but they are stored on S3. Would that be possible?
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the CUDA/cuDNN libraries required by torch.
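One quick way to sanity-check this, assuming a CUDA-enabled torch wheel is installed, is to read the bundled versions directly from torch:

```python
import torch

# These report the CUDA/cuDNN versions bundled with the installed torch wheel,
# independent of any system-wide cudatoolkit installation.
print("torch:", torch.__version__)
print("CUDA (bundled):", torch.version.cuda)
print("cuDNN (bundled):", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
```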
Sorry, it's actually task.update_requirements(["."])
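In context that call goes right after Task.init; a minimal sketch, using only the call exactly as quoted above (project/task names are placeholders, and the comment reflects my understanding of what it does):

```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")  # placeholder names

# As I understand it, this replaces the auto-detected package list with ".",
# i.e. the agent installs the repo itself (and thus its declared dependencies)
# when it runs the task.
task.update_requirements(["."])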
/data/shared/miniconda3/bin/python /data/shared/miniconda3/bin/clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
It failed as well
And now that I restarted the server and went back into the project where I initially deleted the archived experiments, some of them are still there - I will leave them alone, too scared to do anything now.
It could be, yes, but the difference between now and last_report_time doesn't match the difference I observe.
I can also access these files directly if I enter the URL in the browser.