Reputation
Badges 1
282 × Eureka!Ok thanks.
The first stage is a rank0 pytorch script. The downstream stages are rankN scripts, they are waiting for the IP address of the first stage. But the first stage doesn’t return, it simply waits for the rankN scripts to connect to it. But in this case, the rankN scripts doesn’t start. So its probably necessary to have just a single stage.
If i were to start a single rank0, and subsequent rankN tasks, it would be rather messy on ClearML Dashboard. Best to have either a single clearml application...
Ok. That brings me back to the spawned pod. At this point, clearml-agent and its config would be a controbuting factor. Is the absence of /tmp/.clearml_agent.xxxxxx.cfg
an issue?
Hi, the problem is the same.
I noticed that its not checking out the latest version in gitlab. This latest version would contain the requirements.txt.Using cached repository in "/root/.clearml/vcs-cache/pytorchmnist.f220373e7227ec760b28c7f4cd99b534/pytorchmnist" warning: redirecting to
Note: checking out 'cfb833bcc70f3e10d3b6a96cfad3225ed682382b'.
But i'm guessing this block below applied the diff..does it include the requirements.txt though?
` HEAD is now at cfb833b Upload New Fil...
Hi, i was reading this thread and wondered which version of clearml-server and clearml-agent has this taken effect with?
I'm using this feature, in this case i would create 2 agents, one with cpu only queue and the other with gpu queue. And then at the code level decide with queue to send to.
i passed it through the yaml as follows.apiVersion: v1 kind: Pod spec: containers: - image: clearml-agent:latest" env: - name: PIP_INDEX_URL value: "
" - name: PIP_TRUSTED_HOST value: "192.168.56.253" - name: PIP_FIND_LINKS value: "
` "
- name: GIT_SSL_NO_VERIFY
value: true
resources:
requests:
cpu: "2"
...
Then you pass the tolerations definition through a different pod template?
Yup.
Hi SuccessfulKoala55 , just to add, my clearml.conf (client) and clearml.agent.conf (agent) can have differing values. I'm not sure which one takes precedence and if this could be the cause.
ok. Any idea what can go on between the setting up of clearml-agent and initialising the clearml-agent itself? Does the clearml-agent try to communicate with any internet address. From another perspective, it looks like a long time out issue. I happen to be deploying on a disconnected on-premise setup.
This would be solved if --env GIT_SSL_NO_VERIFY=true is passed to the k8s pod that's spawned to run the job. Currently its not.
clearml=1.0.3
python=3.8.10clearml-data upload --id 12314jhg42342j4j --storage
http://ecs.ai is an on-prem DELL EMC ECS that serves as our S3 storage configured with s self signed cert.
Got that thanks. Just to better understand. When clearml-data upload my recursive folder of image data, it convert it into a compressed form with a different folder structure than the original datasets.
When my software pull the data, i'm returned a str. How would we manipulate the data from there?
ah ok, so if i see Jax's workspace on https://app.community.clear.ml/dashboard , then i'm on the right track? How regular does this reset then?
the hackathon is 3 days.
Hi, the latest k8sglue-example.py was last commited about 4 months ago. Are you refering to that version?
We are deploying ClearML Server via the docker-compose.
For ClearML-Agent. We have the choice of Docker or K8S preferred (Using the Glue).
For K8S, we can't get the glue to work ( https://clearml.slack.com/archives/CTK20V944/p1614525898114200?thread_ts=1613923591.002100&cid=CTK20V944 ) so we can't make an assessment of whether it actually works for us.
Just to put a ping for those on this side of the timezone to look at. Thanks.
Okay this part I missed, why would you need to add additional "catalog" when you have the UI?
Yeah this is the part i am trying to reconcile. I don't see any UI for datasets, Or is this a feature of hyperdatasets and i just mixed them up.
Hi, this is the setup.
clientfrom clearml import Task, Logger task = Task.init(project_name='DETECTRON2',task_name='Train',task_type='training') task.set_base_docker("quay.io/fb/detectron2:v3 --env GIT_SSL_NO_VERIFY=true --env TRAINS_AGENT_GIT_USER=testuser --env TRAINS_AGENT_GIT_PASS=testuser" ) task.execute_remotely(queue_name="single_gpu", exit_process=True)
k8s_glue_example.py spawned a pod and starts running.
ClearML UI -> Experiment -> Results -> Console.
` At the top it will pri...
I see, so its a path. Another question, as far as i can tell, clearml-data will download entire datasets before starting training. This isn't very ideal when we are dealing with billions of datasets (E.g. WE might want to download a subset at a time, send to GPU for training and then use the CPU to concurrently pull another subset.). Any comments on this?
Hi, for both of them, args.lastiter
is the exact same value. But when plotted out, they are 2 actually iterations apart.
Hi, the idea is to load the gituser and password into the --env by loading it via a env var so the client could access the resources without divulging the credentials in source code and it would be removed after completion since the container would be removed. Its actually doing well with ClearML except the part that the agent seems to print the content of docker_cmd on running the task.
I would like to note that this behaviour doesn't exist with the clearml-agent daemon though. It only exis...
To note, the latest codes have been pushed to the Gitlab repo.
Yes of cos, its a long one.
unfortunately, our security posture is so strict that we cannot have an agent git user that have unfettered read access to all repos.
Hi, clearml-agent==0.17.2rc3 did work. I'm on a 1.19 k8s cluster, and has this error when a task is pulled. Is the glue not compatible with 1.19?
` Pulling task 3a90802d1dfa4ec09fbccba0beffbaa8 launching on kubernetes cluster
Pushing task 3a90802d1dfa4ec09fbccba0beffbaa8 into temporary pending queue
Kubernetes scheduling task id=3a90802d1dfa4ec09fbccba0beffbaa8
kubectl output:
Flag --replicas has been deprecated, has no effect and will be removed in the future.
Flag --generator has been depre...