
I get that error whether I select "Remove all related artifacts and debug samples from ClearML file server" or not.
Thanks AgitatedDove14. Does this go in the local clearml.conf file w/ each user's credentials, or in the conf file for the server?
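For context, a minimal sketch of the api section of a user's local clearml.conf (the server URLs and keys below are placeholders); each user typically keeps their own credentials here rather than in the server-side conf:
api {
    web_server: http://<clearml-server>:8080
    api_server: http://<clearml-server>:8008
    files_server: http://<clearml-server>:8081
    credentials {
        "access_key" = "<user-access-key>"
        "secret_key" = "<user-secret-key>"
    }
}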
It seems to be a memory issue w/ the VM that hosts ClearML filling up. I am trying to delete some experiments but now I get:
TimelyPenguin76 not sure what you mean by "as a service or via the apps", but we are self-hosting it. Does that answer the question?
Also, not sure what you mean by which "clearml version". How do we check this? The clearml python package is 1.1.4. Is that what you wanted?
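For reference, the SDK version can be read directly from Python; the agent version comes from clearml-agent --version on the worker, and the WebApp/API versions appear in the web UI settings page, if I remember right.
import clearml

# Prints the installed clearml python package (SDK) version, e.g. 1.1.4
print(clearml.__version__)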
@AppetizingFly3
Thanks! I installed CUDA/CuDNN on the image and now the GPU is being utilized.
yes, on my windows machine I am running:
cloned_task = Task.clone(source_task=base_task, name="Auto generated cloned task")
Task.enqueue(cloned_task.id, queue_name='test_queue')
I see the task successfully start in the ClearML server. In the installed packages section it includes pywin32 == 303, even though that is not in my requirements.txt.
In the results --> console section, I see the agent is running and trying to install all packages, but then stops at pywin32. Some lines from t...
y'all thought of everything. this fixed it! Having another issue now, but will post separately.
yes. I think the problem is that it's trying to recreate the environment the task was spun up on (which was a Windows machine) on a Linux EC2 instance.
no, 64 bit. But do you mean the PC where I am spinning the task up, or the machine where I am running the task?
is there a way to explicitly make it install certain packages, or at least stick to the requirements.txt file rather than the actual environment?
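In case it is useful, a minimal sketch of the SDK calls that looked relevant here, assuming the training script can be edited before Task.init() runs; the package and project/task names are placeholders:
from clearml import Task

# Assumption: both calls must run *before* Task.init() in the script being cloned.
# Pin the task to the repo's requirements.txt instead of freezing the local
# (Windows) environment, so Windows-only packages like pywin32 are not recorded.
Task.force_requirements_env_freeze(force=True, requirements_file="requirements.txt")

# Optionally force a specific package/version into the recorded requirements.
Task.add_requirements("tensorflow", "2.8.0")

task = Task.init(project_name="my_project", task_name="linux_friendly_task")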
is there a way to get the instance's external IP address from ClearML? I would've thought it would be in the info tab, but it's not.
huh. I really like how easy it is w/ the automated TB. Is there a way to still use the auto_connect but limit the amount of debug images?
perfect. exactly what i was looking for!
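For anyone else looking for the same thing, this is roughly the knob involved; as far as I can tell the per-series debug sample history is governed by sdk.metrics.file_history_size in clearml.conf rather than an init argument. Project/task names below are placeholders:
from clearml import Task

# Keep automatic TensorBoard capture; other framework integrations can be
# toggled the same way via the auto_connect_frameworks dict.
task = Task.init(
    project_name="my_project",        # placeholder
    task_name="tb_logging_example",   # placeholder
    auto_connect_frameworks={"tensorboard": True, "matplotlib": False},
)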
got it. the instance logs showed:
/var/log/syslog.1:May 5 03:25:27 ip-172-31-37-234 kernel: [53387.840425] python invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
/var/log/syslog.1:May 5 03:25:27 ip-172-31-37-234 kernel: [53387.840442] oom_kill_process+0xe6/0x120
I assume this is something I have to fix on my end (or increase instance memory). Does ClearML also happen to have solutions for this?
relatedly, I just noticed that the GPU is not starting. This was in the logs:
2022-04-07 20:59:54.464854: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Do we need to spin up a specific instance w/ CUDA preinstalled, or does ClearML take care of it?
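In case it helps anyone else (the eventual fix, per another message in this thread, was installing CUDA/cuDNN on the image): a quick sanity check from inside the task shows whether TensorFlow can see a GPU at all:
import tensorflow as tf

# If this prints an empty list, the instance image is missing the NVIDIA driver
# or CUDA runtime (hence the CUDA_ERROR_NO_DEVICE above); not a ClearML issue.
print(tf.config.list_physical_devices("GPU"))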
This is what the instance state looks like, as logged by clearml:
We were able to find an error from the autoscaler agent:
Stuck spun instance dynamic_worker:clearml-agent-autoscale:p2.xlarge:i-015001a93e0910a09 of type clearml-agent-autoscale
2022-04-19 19:16:58,339 - clearml.auto_scaler - INFO - Spinning down stuck worker: 'dynamic_worker:clearml-agent-autoscale:p2.xlarge:i-015001a93e0910a09
I'm thinking it may have something to do with:
Using cached repository in "/home/ubuntu/.clearml/vcs-cache/ai_dev.git.42a0e941ddbf5c69216f37ceac2eca6b/ai_dev.git"
We tried to reset the machines but the cache is still there. any idea how to clear it?
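For the record, a sketch of what looked relevant here, with the caveat that the exact setting name is an assumption on my part: the cache sits under ~/.clearml/vcs-cache on the worker and can simply be deleted, and the agent's clearml.conf appears to expose a switch to stop reusing it:
agent {
    vcs_cache {
        # Assumption: turning this off makes the agent clone fresh instead of
        # reusing the cached repository under ~/.clearml/vcs-cache
        enabled: false
    }
}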
the 2nd option looks good. would everyone's credentials be displayed on the server though?
ok, I suppose that will have to do for now. thank you!
We are using self-hosted ClearML w/ the following versions:
Worker clearml-agent version: 1.1.2
Autoscaler instance clearml-agent version: 1.2.3
ClearML WebApp: 1.2.0-153, Server: 1.2.0-153, API: 2.16
clearml python pip package: 1.3.2
@SuccessfulCow78 can you please help provide