and is it working?
Ohh okay, something seems to half-work in terms of configuration: the agent has enough configuration to register itself, but fails to pass it to the task.
Can you test with the latest agent RC:0.17.2rc4
Hi DeliciousBluewhale87
clearml-agent 0.17.2 was just released with the fix, let me know if it works
DeliciousBluewhale87
Upon ssh-ing into the folders in both the physical node (/opt/clearml/agent) and the pod (/root/.clearml), it seems there are some files there...
Hmm that means it is working...
Do you see any *.conf files there? What do they contain? (Do they point to the correct clearml-server config?)
SubstantialElk6
The CA is picked up automatically by urllib; check the OS environment variables you need to configure: SSL_CERT_FILE and REQUESTS_CA_BUNDLE
https://stackoverflow.com/questions/27835619/urllib-and-ssl-certificate-verify-failed-error
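For example, a minimal sketch of pointing both urllib and requests at a custom CA bundle through those variables (the bundle path here is an assumption, adjust it to your setup):
```py
import os

# assumed path to the internal CA bundle; must be set before any HTTPS call is made
ca_bundle = "/etc/ssl/certs/my-internal-ca.pem"
os.environ["SSL_CERT_FILE"] = ca_bundle       # read by urllib / ssl
os.environ["REQUESTS_CA_BUNDLE"] = ca_bundle  # read by the requests library
```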
Okay let me check if I can test on this git version.
PompousBeetle71, what you are saying is that for some reason --gpus all will not configure the NVIDIA drivers to use all the GPUs when running bare metal (i.e. no docker). Did I understand you correctly?
That would match what add_dataset_trigger and add_model_trigger already have, so it would be good
Sounds good, any chance you can open a github issue, so that we do not forget?
Another parameter for when the task is deleted might also be useful
That actually might be more complicated, because there might be a race condition, basically missing the delete operation...
What would be the use case?
Hi DangerousDragonfly8
You mean you want to trigger something when users archive a Task ?
Hmm that is a good idea, and I think you are correct, it cannot support it. But it will be easy to do, maybe adding an argument trigger_on_archive? wdyt?
Hi DangerousDragonfly8
is it possible to somehow extract the information about the experiment/task whose status has changed?
From the docstring of add_task_trigger
```py
def schedule_function(task_id):
    pass
```
This means you are getting the Task ID that caused the trigger, now you can get all the info that you need with Task.get_task(task_id)
```py
from clearml import Task

def schedule_function(task_id):
    # get the full Task object of the Task that triggered the call
    the_task = Task.get_task(task_id)
    # now we have all the info on the Task that triggered the event
```
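To tie it together, here is a minimal sketch of registering such a function with a TriggerScheduler (the project name, status list and polling frequency are assumptions, and the exact argument names should be verified against your clearml version):
```py
from clearml import Task
from clearml.automation import TriggerScheduler

def schedule_function(task_id):
    # the trigger passes the ID of the Task that fired it
    the_task = Task.get_task(task_id)
    print("triggered by:", the_task.id, the_task.status)

# poll the server every few minutes and fire on matching status changes
trigger = TriggerScheduler(pooling_frequency_minutes=3)
trigger.add_task_trigger(
    schedule_function=schedule_function,
    trigger_project="examples",        # assumed project to watch
    trigger_on_status=["completed"],   # status change that fires the trigger
)
trigger.start()
```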
I get gaps in the graphs.
For example, the first time I run, I create a task and run a loop:
Hi SourOx12
Is this related to this one?
https://github.com/allegroai/clearml/issues/496
Thanks JitteryCoyote63 !
Any chance you want to open github issue with the exact details or fix with a PR ?
(I just want to make sure we fix it as soon as we can 🙂)
the services queue (where the scaler runs) will be automatically exposed to new EC2 instance?
Yes, using this extra_clearml_conf parameter you can add configuration that will be passed to the clearml.conf of the instances it will spin up.
Now an example of the values you want to add: agent.extra_docker_arguments: ["-e", "ENV=value"]
https://github.com/allegroai/clearml-agent/blob/a5a797ec5e5e3e90b115213c0411a516cab60e83/docs/clearml.conf#L149
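For instance, a minimal sketch of such a value, assuming extra_clearml_conf is taken as raw text appended to the instances' clearml.conf (an assumption to verify against your autoscaler version):
```py
# raw clearml.conf content the autoscaler would append on each new EC2 instance
extra_clearml_conf = """
agent.extra_docker_arguments: ["-e", "ENV=value"]
"""
```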
wdyt?
LOL yes 🙂
just make sure it won't be part of the uncommitted changes of the AWS autoscaler 🙂
Yes, that makes sense. Then you would need to use either the AWS vault features, or the ClearML vault features ...
Hi PompousSpider11
Yes "activating" a conda/python environment in a docker is more complicated then it should be ...
To debug, what are you getting when you do:
docker run -it <docker name here> bash -c "set"
PipelineController works with the default image, but it incurs a 4-5 min overhead
You can try to spin the "services" queue without docker support; if there is no need for containers, it will accelerate the process.
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
This error is about failing to clone the pipeline code repo, how is that connected to changing the container ?!
Can you provide the full log?
Can you see that the environment is actually being passed ?
Hmm yes this is exactly what should not happen 🙂
Let me check it
LovelyHamster1 Now I see... Interesting credentials ability. Specifically, all the S3 access on trains is derived from the ~/clearml.conf credentials section:
https://github.com/allegroai/clearml/blob/ebc0733357ac9ead044d0ed32d41447763f5797e/docs/clearml.conf#L73
( or the AWS S3 environment variables )
I'm not sure how this AWS feature works, I suspect it is changing the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY variables on the EC2 instance. If this is the case, it should work out of...
So how can I temporarily fix it?
Try: task.output_uri = task.get_output_destination()
I did see this "publish" option for models, just not for pipelines, is this a new feature?
Kind of hidden in the UI (not sure if on purpose), but if you click on the pipeline then go to details, in the new tab (of the pipeline Task) you can publish the Task (aka the pipeline)
In this example:
https://github.com/allegroai/clearml-actions-train-model/blob/7f47f16b438a4b05b91537f88e8813182f39f1fe/train_model.py#L14
replace with something like:
task = Task.get_tasks(project_name="pipel...
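For completeness, a minimal sketch of what the full replacement could look like (the project/task names are assumptions, and the publish() call should be verified against your clearml version):
```py
from clearml import Task

# assumed project/task names; adjust to where your pipeline controller Task lives
tasks = Task.get_tasks(project_name="pipelines", task_name="my pipeline")
pipeline_task = tasks[-1]   # most recent pipeline run
pipeline_task.publish()     # publish the pipeline Task (verify this call on your version)
```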
Hi RipeGoose2
Could you expand on "inconsistency in the iteration reporting"? Also, regarding "calling trainer.fit multiple times", would you expect it to show as a single experiment or is it kind of a param search?
You can however pass a specific Task ID and it will reuse it "reuse_last_task_id=aabb11", would that help?
Hmm I'm sorry, it might be "continue_last_task", can you try: Task.init(..., continue_last_task="aabb11")
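A minimal sketch of that call (project/task names are assumptions; continue_last_task is the keyword suggested above, check it against your clearml version):
```py
from clearml import Task

task = Task.init(
    project_name="examples",          # assumed project name
    task_name="training",             # assumed task name
    continue_last_task="aabb11",      # ID of the previous Task to keep reporting into
)
```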
Sometimes it is working fine, but sometimes I get this error message
EnormousCormorant39 can I assume there is a gateway at --remote-gateway <internal-ip>?
Could it be that this gateway has some network firewall blocking some of the traffic ?
If this is all local network, why do you need to pass --remote-gateway ?
Looks great, let me see if I can understand what's missing, because it should have worked ...