
Is there documentation for this? I was not able to figure it out, unfortunately.
@<1523701087100473344:profile|SuccessfulKoala55> Thanks for getting back to me. My image contains `clearml-agent==1.9.1`. There is a recent release, 1.9.2, and now on every run the agent installs this newer version because of the `-U` flag being passed. From the docs it looks like there may be a way to prevent this upgrade, but it's not clear to me exactly how to do it. Is it possible?
👍 Thanks for getting back to me.
Another issue I found was that I could only use VPC subnets from the Google project I am launching the VMs in.
I cannot use shared VPC subnets from another project. This would be a useful feature to implement, as GCP recommends segmenting the cloud estate so that the VPC and VMs are in different projects.
It's not immediately obvious from the GCP documentation and you don't need to do this on AWS or Azure so it can catch you out. For what it's worth, the image I used originally was from the same family Marko has referenced above.
Apologies for the delay.
I have obfuscated the private information with `XXX`. Let me know if you think any of it is relevant.
{"gcp_project_id":"XXX","gcp_zone":"XXX","subnetwork":"XXX","gcp_credentials":"{\n \"type\": \"service_account\",\n \"project_id\": \"XXX\",\n \"private_key_id\": \"XXX\",\n \"private_key\": \"XXX\",\n \"client_id\": \"XXX\",\n \"auth_uri\": \"XXX\",\n \"token_uri\": \"XXX\",\n \"auth_provider_x509_cert_url\": \"XXX\",\n \"client_x509_cert_url\": \"...
I ran again without the debug mode option and got this error:
>
> Starting Task Execution:
>
>
> Traceback (most recent call last):
> File "/root/.clearml/venvs-builds/3.6/code/interactive_session.py", line 377, in <module>
> from tcp_proxy import TcpProxy
> ModuleNotFoundError: No module named 'tcp_proxy'
>
> Process failed, exit code 1
No particular reason. This was our first time trying it and it seemed the quickest way to get off the ground. When I try without it, I get a similar error trying to connect, although that could be due to the instance.
@<1537605940121964544:profile|EnthusiasticShrimp49> How do I specify to not attach a gpu? I thought ticking 'Run in CPU Mode' would be sufficient. Is there something else I'm missing?
I don't think there's really a way around this because AWS Lambda doesn't allow for multiprocessing.
Instead, I've resorted to using a clearml Scheduler which runs on a t3.micro instance for jobs which I want to run on a cron.
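For reference, the setup looks roughly like this. A non-runnable sketch, assuming ClearML's `TaskScheduler` from `clearml.automation`; the task id is a placeholder and the argument names are from my reading of the docs, not verified:

```python
from clearml.automation import TaskScheduler

# Runs on the always-on t3.micro; re-enqueues an existing task on a cron-like schedule.
scheduler = TaskScheduler()
scheduler.add_task(
    schedule_task_id="<existing-task-id>",  # placeholder: the task to clone and enqueue
    queue="default",                        # assumed queue name
    minute=5,                               # intended: run every 5 minutes
)
scheduler.start()  # blocks and drives the schedule
```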
Here it is:
Given that `nvidia-smi` is working, you may have already done that. In that case, depending on your Ubuntu version, you may have another problem: Ubuntu 22+ has this issue, which has a workaround. This also caught me out...
I run using the GCP Autoscaler successfully for GPU. Have you included this line in the init-script of the autoscaler? This was a gotcha for me...
`/opt/deeplearning/install-driver.sh`
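In full, the init-script fragment I mean is just this. A minimal sketch, assuming a GCP Deep Learning VM image (which ships the driver installer at that path but does not always run it on boot):

```shell
#!/bin/bash
# Install the NVIDIA driver shipped with the GCP Deep Learning image
# before the agent starts picking up GPU tasks.
/opt/deeplearning/install-driver.sh
```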
Is there a way I can do this with the Python `APIClient`, or even with the `requests` library? According to the documentation, `users.user` should be a valid endpoint?
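For what it's worth, both routes look roughly like this. A non-runnable sketch with placeholder credentials and server URL, assuming the server's standard `auth.login` token flow; endpoint and attribute names are my best understanding, not verified:

```python
from clearml.backend_api.session.client import APIClient

# Route 1: APIClient maps REST services to attributes, e.g. users.get_all
client = APIClient()
users = client.users.get_all()

# Route 2: raw requests - authenticate, then POST the endpoint
import requests

api = "https://api.clear.ml"  # placeholder server URL
resp = requests.post(f"{api}/auth.login", auth=("ACCESS_KEY", "SECRET_KEY"))
token = resp.json()["data"]["token"]
users = requests.post(f"{api}/users.get_all",
                      headers={"Authorization": f"Bearer {token}"}).json()
```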
I have just encountered this. I believe it is because the clearml-agent 1.7.0 release added this as a default: `agent.enable_git_ask_pass: true`.
To fix it, add `agent.enable_git_ask_pass: false` to your config.
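In clearml.conf that looks like this; a minimal fragment, assuming the setting sits in the standard `agent` section:

```
agent {
    # clearml-agent 1.7.0 made this default to true; turn it off to restore
    # the previous git credential behaviour
    enable_git_ask_pass: false
}
```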
Thanks. I am trying to completely minimise the start-up time. Given I am using a Docker image which has `clearml-agent` and `pip` installed, is there a way I can skip the installation of these when a task starts up using the daemon?
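In case it helps anyone else: clearml-agent has environment variables for skipping parts of the environment setup. A sketch, assuming the variable names from the agent docs and an image whose interpreter lives at `/usr/bin/python3` (that path is an assumption):

```shell
# Reuse the image's Python instead of building a fresh venv per task.
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/usr/bin/python3
# Skip the Python environment installation step entirely.
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
```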
I am having the same error since yesterday on Ubuntu. Works fine on Mac.
I cannot ping api.clear.ml
I am using ClearML version 1.9.1. In code, I create a plot using matplotlib. I can see it in TensorBoard, but it is not available in ClearML Plots.
👍 thanks for clearing that up @<1523701087100473344:profile|SuccessfulKoala55>
@<1523701070390366208:profile|CostlyOstrich36> Thank you. Which docker image do you use with this machine image?
Solved for me as well now.
Hi,
I've managed to fix it.
Basically, I had a tracker running on our queues to ensure that none of them were lagging. This was using `get_next_task` from `APIClient().queues`.
If you call `get_next_task`, it removes the task from the queue but does not put it into another state. I think that's because typically `get_next_task` is immediately followed by something to make the task run in the daemon, or to delete it.
Hence you end up in this weird state where the task thinks it's queued bec...
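The fix was to monitor the queues without dequeuing. A non-runnable sketch, assuming `queues.get_by_id` returns the queue object with its `entries` list (names not verified against a live server):

```python
from clearml.backend_api.session.client import APIClient

client = APIClient()
for queue in client.queues.get_all():
    q = client.queues.get_by_id(queue=queue.id)
    # Inspect entries (still-queued tasks) instead of calling get_next_task,
    # which pops a task off the queue without moving it to another state.
    print(queue.name, len(q.entries))
```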
Yep, that's correct. If I have a task which runs every 5 minutes, I don't want a new task every 5 minutes, as that would create a lot of tasks over a day. It would be better if I had just one task.