No issues. I know its hard to track open threads with Slack. I wish there's a plugin for this too. 🙂
I've been reading the documentation for a while and I'm not getting the following very well.
Given an open source codes say, huggingface. I wanted to do some training and i wanted to track my experiments using ClearML. The obvious choice would be to use Explicit Reporting in ClearML. But the part on sending my training job. and let ClearML orchestrate is vague. Would appreciate if i can be guided to the right documentation on this.
Okay this part I missed, why would you need to add additional "catalog" when you have the UI?
Yeah this is the part i am trying to reconcile. I don't see any UI for datasets, Or is this a feature of hyperdatasets and i just mixed them up.
Hi, please correct me if i am wrong, to use the glue, i need the following.
A k8s cluster A kubectl that is connected to the k8s cluster A pip install of clearml-agent 0.17.1
So i did all the above, I'm not what it meant by running the entire thing on own machine.
running git diff
on my terminal in this repo gave nothing. nothing at all.
Hi SuccessfulKoala55 , thanks, tested the patch and its working as expected now.
Its running as a long running POD on K8S. I'm using log -f
to track its stdout.
python k8s_glue_example.py --queue gpu --namespace defaultÂ
Traceback (most recent call last):
 File "k8s_glue_example.py", line 86, in <module>
  main()
 File "k8s_glue_example.py", line 80, in main
  namespace=args.namespace,
 File "/home/administrator/clearml-agent-k8s/venv/lib/python3.6/site-packages/clearml_agent/helper/base.py", line 239, in _ call _
  cls. instances[cls] = super(Singleton, cls). call_(*args, **kwargs)
TypeError: _ init _() got an unexpected keyword argument 'base_pod...
The doc also mentioned preconfigured services with selectors in the form of
"ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022.
Would you have any examples of how to do this?
The first stage is a rank0 pytorch script. The downstream stages are rankN scripts, they are waiting for the IP address of the first stage. But the first stage doesn’t return, it simply waits for the rankN scripts to connect to it. But in this case, the rankN scripts doesn’t start. So its probably necessary to have just a single stage.
If i were to start a single rank0, and subsequent rankN tasks, it would be rather messy on ClearML Dashboard. Best to have either a single clearml application...
Yes it is! But ClearML didn't support multi node training out of the box in a way that it streamline the process. So we are trying to figure out a way to do it.
If we run all the rank 0 and rank n tasks individually, it's defeats the purpose of using ClearML.
I thought of another potential way but not sure if the SDK supports it.
We will perform manual save and upload of model using vanilla boto3 and credentials passed in as env var. Use ClearML SDK to update the Model Repo on the location of the model, without ClearML uploading it explicitly.
Would the above work?
Although I think you can also pull specific chunks of dataset
How do you do that with clearml-data?
Hi,
It did, nvidia/cuda:10.1-runtime-ubuntu18.04.
So if i need to set this every time, what is the following config for? And how do i pass in new env parameters?
` default_docker: {
# default docker image to use when running in docker mode
image: "dockerrepo/mydocker:custom"
# optional arguments to pass to docker image
# arguments: ["--ipc=host", ]
arguments: ["--env GIT_SSL_NO_VERIFY=true",]
} `
Ok thanks, that worked.
Thank. Gonna try that out. But i hit another snag. Strangely, the Agent is not creating the right venv. This is what the Agent created.
` pip:
- asn1crypto==0.24.0
- attrs==20.3.0
- certifi==2020.12.5
- chardet==4.0.0
- cryptography==2.1.4
- Cython==0.29.22
- furl==2.1.0
- future==0.18.2
- humanfriendly==9.1
- idna==2.6
- importlib-metadata==3.7.0
- jsonschema==3.2.0
- keyring==10.6.0
- keyrings.alt==3.0
- orderedmultidict==1.0.1
- pathlib2==2.3.5
- psutil==5.8.0
- pycrypto==2.6.1
- pygobject...
Next step to figure out if i can do all that in the python code instead of UI.
Sorry i don't quite understand this. The task itself was submitted as I run the code on the client. I suppose the dependancies requirements would be copied over as the experiment is cloned?
To note, the latest codes have been pushed to the Gitlab repo.
Yes, as listed in the snippet. The torch library is torchvision.
Hi, the problem is the same.
I noticed that its not checking out the latest version in gitlab. This latest version would contain the requirements.txt.Using cached repository in "/root/.clearml/vcs-cache/pytorchmnist.f220373e7227ec760b28c7f4cd99b534/pytorchmnist" warning: redirecting to
Note: checking out 'cfb833bcc70f3e10d3b6a96cfad3225ed682382b'.
But i'm guessing this block below applied the diff..does it include the requirements.txt though?
` HEAD is now at cfb833b Upload New Fil...
Ok that worked. So every time i have changes in codes, i will have to rerun the experiment on my own machine that doesn't have any GPUs?
Kinda defeat the purpose of using ClearML Agent.
Thanks. That's easy to miss as its not quite apparent in the main docs. How should i pass in env variables with Task?
That didn't work as well...
Ok that works. thanks.
Yes of cos, its a long one.
Hi, i have the same question. Why would this be ignored if called remotely?
https://clear.ml/docs/latest/docs/references/sdk/task/#set_base_docker
Hi, the idea is to load the gituser and password into the --env by loading it via a env var so the client could access the resources without divulging the credentials in source code and it would be removed after completion since the container would be removed. Its actually doing well with ClearML except the part that the agent seems to print the content of docker_cmd on running the task.
I would like to note that this behaviour doesn't exist with the clearml-agent daemon though. It only exis...
Hi, this is the setup.
clientfrom clearml import Task, Logger task = Task.init(project_name='DETECTRON2',task_name='Train',task_type='training') task.set_base_docker("quay.io/fb/detectron2:v3 --env GIT_SSL_NO_VERIFY=true --env TRAINS_AGENT_GIT_USER=testuser --env TRAINS_AGENT_GIT_PASS=testuser" ) task.execute_remotely(queue_name="single_gpu", exit_process=True)
k8s_glue_example.py spawned a pod and starts running.
ClearML UI -> Experiment -> Results -> Console.
` At the top it will pri...