Hi, this is the setup.
client:

```python
from clearml import Task, Logger

task = Task.init(project_name='DETECTRON2', task_name='Train', task_type='training')
task.set_base_docker("quay.io/fb/detectron2:v3 --env GIT_SSL_NO_VERIFY=true --env TRAINS_AGENT_GIT_USER=testuser --env TRAINS_AGENT_GIT_PASS=testuser")
task.execute_remotely(queue_name="single_gpu", exit_process=True)
```
k8s_glue_example.py spawned a pod, which started running.
ClearML UI -> Experiment -> Results -> Console.
At the top it will pri...
Hi, any advice on this? Thanks.
clearml=1.0.3
python=3.8.10

```
clearml-data upload --id 12314jhg42342j4j --storage
```
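For reference, my understanding of the SDK equivalent of that upload (a sketch; the project, dataset, and bucket names are placeholders):

```python
from clearml import Dataset

# Create a dataset version and upload its files to the on-prem S3 endpoint;
# the host:port form targets a non-AWS, S3-compatible server such as ECS.
ds = Dataset.create(dataset_project="my_project", dataset_name="my_dataset")
ds.add_files("/data/local_folder")
ds.upload(output_url="s3://ecs.ai:443/clearml-data")  # placeholder bucket
ds.finalize()
```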
http://ecs.ai is an on-prem DELL EMC ECS that serves as our S3 storage, configured with a self-signed cert.
I see, so it's a path. Another question: as far as I can tell, clearml-data will download the entire dataset before starting training. This isn't ideal when we are dealing with billions of data samples (e.g. we might want to download a subset at a time, send it to the GPU for training, and concurrently use the CPU to pull the next subset). Any comments on this?
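One pattern that might work, assuming the part/num_parts arguments of Dataset.get_local_copy are supported in our clearml version (a sketch, not verified): pull the dataset in chunks and prefetch the next chunk on the CPU while the GPU trains on the current one.

```python
from concurrent.futures import ThreadPoolExecutor
from clearml import Dataset

NUM_PARTS = 100  # hypothetical chunk count

def fetch(part):
    # Download only 1/NUM_PARTS of the dataset files (if the SDK version supports it)
    ds = Dataset.get(dataset_id="12314jhg42342j4j")
    return ds.get_local_copy(part=part, num_parts=NUM_PARTS)

def train_on(folder):
    ...  # send this chunk to the GPU

with ThreadPoolExecutor(max_workers=1) as pool:
    next_chunk = pool.submit(fetch, 0)
    for part in range(NUM_PARTS):
        folder = next_chunk.result()
        if part + 1 < NUM_PARTS:
            next_chunk = pool.submit(fetch, part + 1)  # prefetch on CPU
        train_on(folder)  # train on GPU while the next chunk downloads
```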
Hi AgitatedDove14, that's what I am trying to figure out as well. The task has nothing to do with torch, and the requirements.txt doesn't have any torch packages either.
Just a ping for those on this side of the timezone to take a look. Thanks.
ah... thanks!
the hackathon is 3 days.
Thanks SuccessfulKoala55, how might I do this cleanup? Does this increase with more use of ClearML? And to add, we save all artifacts onto a remote S3 server.
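Would it be something along these lines? A rough sketch based on my understanding of the cleanup service example, assuming Task.get_tasks and Task.delete behave as in recent SDK versions (the retention window is a placeholder):

```python
import time
from clearml import Task

DAYS = 30  # hypothetical retention window

# Find archived tasks older than the cutoff and delete them,
# artifacts and models included, to reclaim server/S3 space.
cutoff = time.time() - DAYS * 24 * 3600
for task in Task.get_tasks(task_filter={'system_tags': ['archived']}):
    if task.data.last_update and task.data.last_update.timestamp() < cutoff:
        task.delete(delete_artifacts_and_models=True, raise_on_error=False)
```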
And yes, there is stuff in there. In fact it's been running for a few weeks with no issues. This appears to have happened after I added new workers, though I can't be sure this is the cause. Is there a limit to the number of workers that I can add for the community edition?
It's 0.17-63.
It doesn't appear on the profile page.
Sorry, in case I misunderstood you: are you referring to extra_docker_shell_script?
Yes, it's on purpose; each user would have their own AWS credentials for default_output_uri.
Hi, it is missing --docker on the agent. Thanks! The Dynamic GPU option is only available with the Enterprise version, right?
I would like to run the ClearML agent on Kubernetes. So basically I need to run the image in a pod, but there isn't any information on how the agent would communicate with the code, nor how it would spawn more pods to run tasks.
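From reading around, it seems the k8s glue is meant to handle this: it pulls tasks from a ClearML queue and spawns a pod per task. Something roughly like this, based on clearml-agent's k8s_glue_example.py (constructor arguments seem to differ between versions, so treat it as a sketch):

```python
from clearml_agent.glue.k8s import K8sIntegration

# The glue daemon polls the given ClearML queue and creates one pod per
# dequeued task; pod specs can be customized via a template yaml
# (argument names vary by clearml-agent version).
k8s = K8sIntegration(ports_mode=False)
k8s.k8s_daemon("k8s_scheduler")  # placeholder queue name
```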
No, I didn't mention this particular problem on the git issue. Only the apply template.yml part is on the issue.
Hi, I was expecting to see the container rather than the actual physical machine. For example, in the file panel on the left of the Jupyter panel, I see the file contents of the physical machine. I was expecting this to be the container.
Sorry, take that back. Just realised that this argument only works when running the agent; when you enqueue a task into this agent, the argument is not passed on to the container that the agent spawns.
The same issue applies to the docker image: it reverts back to nvidia/cuda:10.1-runtime-ubuntu18.04 despite me setting something else.
Hi CostlyOstrich36, nothing in particular. I was doing some research and noticed that ML pipelines were not mentioned even once in the literature, so I wonder whether that gap should be addressed. I'm looking at other aspects as well, but I'll gradually ask about those.
Hi, I'm gonna hijack this thread a bit. My community uses ClearML and is looking at various model deployment strategies. We are looking at a seamless integration with Triton, but noted that Triton does not support deployment strategies. ClearML-Serving seems to, but the strategies are rather limited. Is there a roadmap to expand ClearML-Serving?
The first is probably done using pipeline controllers, the second using Datasets or HyperDatasets. It's not very clear how the last one is achieved, especially the searchable data catalogs.
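For the first point, I imagine something like this pipeline controller sketch (argument names have changed across clearml releases, and the project/task names here are placeholders):

```python
from clearml.automation.controller import PipelineController

# Chain two existing tasks: preprocessing feeds training.
pipe = PipelineController(name="my_pipeline", project="DETECTRON2", version="1.0")
pipe.add_step(name="preprocess",
              base_task_project="DETECTRON2", base_task_name="Preprocess")
pipe.add_step(name="train", parents=["preprocess"],
              base_task_project="DETECTRON2", base_task_name="Train")
pipe.start(queue="single_gpu")
```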
Hi HelpfulDeer76 , I'm facing similar issues. Would you mind describing in detail how you deploy clearml-agent? Is it running as a pod on k8s?
Hi, this is the log. I didn't see any attempt from the agent to install virtualenv on the base image.
```
1618369068169 clearml-gpu-id-b926b4b809f544c49e99625380a1534b:gpuGPU-4ad68290-0daf-4634-6768-16fad73d47a3 DEBUG Current configuration (clearml_agent v0.17.2, location: /tmp/.clearml_agent.wgsmv2t9.cfg):
agent.worker_id = clearml-gpu-id-b926b4b809f544c49e99625380a1534b:gpuGPU-4ad68290-0daf-4634-6768-16fad73d47a3
agent.worker_name = clearml-gpu-id-b926b4b809f544c49e99625...
```
Yeah, that'll cover the first two points, but I don't see how it'll end up as a dataset catalogue as advertised.
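The closest I can get is tagging datasets and querying by tag, which still feels short of a real catalogue. A sketch, assuming Dataset.list_datasets is available in this SDK version (names and tags are made up):

```python
from clearml import Dataset

# Tag datasets at creation time so they can be discovered later.
ds = Dataset.create(dataset_project="catalogue_demo", dataset_name="faces_v1",
                    dataset_tags=["faces", "train"])  # hypothetical names/tags
ds.add_files("/data/faces")
ds.upload()
ds.finalize()

# "Search the catalogue": list dataset versions matching a tag.
for entry in Dataset.list_datasets(tags=["faces"]):
    print(entry["id"], entry["name"], entry.get("tags"))
```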
Ah ok, so if I see Jax's workspace on https://app.community.clear.ml/dashboard , then I'm on the right track? How regularly does this reset, then?
It's 1.0.0, as printed at the top of the logs in the ClearML Server UI.
I also think it makes sense that when you perform certain definitive CI actions, like publish, it would support running some custom scripts.
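Something like this polling sketch is what I have in mind: it fires a custom script whenever a task in a project gets published (the project name and hook script are placeholders; newer clearml versions may also offer trigger utilities in clearml.automation that cover this):

```python
import subprocess
import time
from clearml import Task

seen = set()
while True:
    # Look for newly published tasks in the project.
    for task in Task.get_tasks(project_name="DETECTRON2",
                               task_filter={"status": ["published"]}):
        if task.id not in seen:
            seen.add(task.id)
            # Run a custom CI hook, e.g. packaging or deployment (hypothetical script).
            subprocess.run(["./on_publish.sh", task.id], check=False)
    time.sleep(60)
```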
Hi, I was reading this thread and wondered which versions of clearml-server and clearml-agent this took effect in?
Hi, self-hosted using docker-compose.