Reputation
Badges 1
25 × Eureka!I tested and I have no more warning messages
if self._active_gpus and i not in self._active_gpus: continueThis solved it?
If so, PR pretty please 🙂
Agreed, MotionlessCoral18 could you open a feature request on the clearml-agent repo please? (I really do not want this feature to get lost, and I'm with you on the importance, lets' make sure we have it configured from the outside)
CrookedWalrus33 can you post the clearml.conf you have on the agent machine?
JitteryCoyote63 did you use the new implementation:
https://github.com/allegroai/trains/blob/master/trains/automation/aws_auto_scaler.py
Hi @<1523701079223570432:profile|ReassuredOwl55>
I want to kick off the pipeline and then check completion
outside
of the pipeline task. (edited)
Basically the pipeline is a Task (of a certain type).
You do the "standard" thing, you clone the pipeline Task, you enqueue it, and you wait for it's status
task = Task.clone(source_task="<pipeline ID here>")
Task.enqueue(task, queue_name=services)
task.wait_for_status(...)
wdyt?
Would the training script be able to register to the server when there is no public IP.
I guess it's more related to networking inside GCP, but just wanted to know if anyone tried it.
Hmm, as long as you can access it from within the LAN, it should work ...
Please update when you try it out 🙂
WickedGoat98
I will try to collect the installation steps in a document and share it to the community once ready
Thank you! this will be awesome !
We're here if you need anything 🙂
We host in a private (self-hosed) git-Lab, which can only be cloned through the SSH, and we would like to import packages by compiling it from a public git-Hub (or by installing it using wheels with find-links).
Try removing force_git_ssh_protocol: true
If you do not provide user/pass , and assuming the repo link on the task is internal, hence SSH, it should leave links as they are (ssh for ssh http for http), this should solve the issue (I hope)
It just seems frozen at the place where it should be spinning up the tasks within the pipeline
And is there an agent for those ? usually there is one agent for running logic tasks (like pipelines) running with --services-mode which means multiple Tasks can be executed by the same agent. And other agents for compute Tasks that are a signle Task per agent (but you can run multiple agents on the same machine)
Does it say it runs something ?
(on the workers tab on the agents table it should say which Task it is running)
Hi JitteryCoyote63
If you want to refresh the task object, call task.reload() It will also refresh the artifacts.
The reason for not always do so when accessing the .artifacts objects is for speed optimization (It might be slow compared to dict access, and we assume users will expect it to behave the dict)
SolidSealion72 this makes sense, clearml deletes artifacts/models after they are uploaded, so I have to assume these are torch internal files
we also provide a custom
aux-config
file. We also had to make sure to update the name inside
config.pbtxt
so that Triton is happy:
Good point, what would be the logic of the auto "config.pbtxt" patching we should employ ?
Hi VivaciousWalrus21
After restarting training huge gaps appear in iteration axis (see the screenshot).
The Task.init actually tries to understand what was the last reported interation and continue from that iteration, I'm assuming that what happens is that your code does that also, which creates a "double shift" that you see as the jump. I think the next version will try to be "smarter" about it, and detect this double gap.
In the meantime, you can do:
` task = Task.init(...)...
Our server is deployed on a kube cluster. I'm not too clear on how Helm charts etc.
The only thing that I can think of is that something is not right the the load balancer on the server so maybe some requests coming from an instance on the cluster are blocked ...
Hmm, saying that aloud that actually could be?! Try to add the following line to the end of the clearml.conf on the machine running the agent:
api.http.default_method: "put"
Hmm this is odd, when you press on the parent dataset in the UI, and go to full-details, then the INFO tab. Can you copy here everything ?
link to the line please 🙂
JitteryCoyote63 you mean from code?
Hurray 🙂
BTW: the next version will have a project level "readme alike" markdown embedded in the UI, so hopefully you will be able to add all the graphs there :)
What is the proper way to change a clearml.conf ?
inside a container you can mount an external clearml.conf, or override everything with OS environment
https://clear.ml/docs/latest/docs/configs/env_vars#server-connection
Hi @<1523711619815706624:profile|StrangePelican34>
Hmm, I think this is missing from the docs, let me ping the guys about that 🙏
Yep, and this is the root cause of the issue (But easily fixable) 🙂
Is there any way to get just one dataset folder of a Dataset? e.g. only "train" or only "dev"?
They are usually stored in the same "zip" so basically you have to download both folders anyhow, but I guess if this saves space we could add this functionality, wdyt?
Hi ClumsyElephant70
extra_docker_shell_script: ["export SECRET=SECRET", ]
I think ${SECRET} will not get resolved you have to specifically have text value there.
That said it is a good idea to resolve it if possible, wdyt?
In both case if I get the element from the list, I am not able to get when the task started. Where is info stored?
If you are using client.tasks.get_all( ...) should be under started field
Specifically you can probably also do:queried_tasks = Task.query_tasks(additional_return_fields=['started']) print(queried_tasks[0]['id'], queried_tasks[0]['started'],)
Worker just installs by name from pip, and it installs not my package!
Oh dear ...
Did you configure additional pip repositories in the Agent's clearml.conf ? https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L77 It might be that (1) is not enough, as pip will first try to search the package in the pip repository, and only then in the private one. To avoid that, in your code you can point directly to an https of your package` Ta...
So when the agent fire up it get's the hostname, which you can then get from the API,
I think it does something like "getlocalhost", a python function that is OS agnostic
SourOx12
Run this example:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py
Once, then change line #26 to:task = Task.init(project_name="examples", task_name="scalar reporting", continue_last_task=True)and run again,