I don't think there's really a way around this because AWS Lambda doesn't allow for multiprocessing.
Instead, I've resorted to using a ClearML Scheduler running on a t3.micro instance for jobs I want to run on a cron schedule.
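The scheduler script itself is small, roughly something like this (a minimal sketch, assuming a placeholder template task ID and a queue named default; the exact TaskScheduler kwargs may differ slightly between clearml versions):

```python
# Minimal sketch of the cron-style scheduler -- the template task ID and
# queue name are placeholders, adjust both to your setup.
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()

# Re-enqueue a clone of an existing (template) task every day at 06:00.
scheduler.add_task(
    schedule_task_id="<template_task_id>",  # hypothetical placeholder ID
    queue="default",
    hour=6,
    minute=0,
)

# Blocks and keeps polling the schedule -- this is what runs on the t3.micro.
scheduler.start()
```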
This is not working. Please see None, which details the problem.
According to the documentation, users.user should be a valid endpoint?
Thanks. I am trying to minimise the startup time as much as possible. Given I am using a Docker image which already has clearml-agent and pip installed, is there a way I can skip installing these when a task starts up using the daemon?
Nope. But I believe there are steps you can take to prevent this by publishing tasks and reports.
Here it is:
I am using ClearML version 1.9.1. In code, I am creating a plot using matplotlib. I can see the plot in TensorBoard, but it is not available in ClearML Plots.
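For completeness, here is roughly the pattern I'm using, plus an explicit report_matplotlib_figure call which I understand should force the figure into the Plots tab (a sketch only; the project, task, and plot names are placeholders):

```python
# Sketch of the setup -- project/task/plot names are placeholders.
import matplotlib.pyplot as plt
from clearml import Task

task = Task.init(project_name="examples", task_name="matplotlib plot")

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Explicit reporting, independent of the automatic TensorBoard bindings;
# this should make the figure appear under the task's Plots tab.
task.get_logger().report_matplotlib_figure(
    title="my plot",
    series="series A",
    figure=fig,
    iteration=0,
)
```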
👍 thanks for clearing that up @<1523701087100473344:profile|SuccessfulKoala55>
Is there a way I can do this with the Python APIClient or even with the requests library?
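To make the question concrete, this is the kind of thing I mean (a sketch only; I'm assuming the users.get_all endpoint and the standard key/secret login flow, which may not be exactly what's needed here):

```python
# Sketch of both options -- the endpoint name, server URL, and credentials
# below are assumptions/placeholders.
import requests
from clearml.backend_api.session.client import APIClient

# Option 1: the Python APIClient.
client = APIClient()
users = client.users.get_all()

# Option 2: plain requests against the REST API.
api_server = "https://api.clear.ml"
access_key = "<access_key>"
secret_key = "<secret_key>"

# Exchange the key/secret for a token, then call the endpoint with it.
token = requests.get(
    f"{api_server}/auth.login", auth=(access_key, secret_key)
).json()["data"]["token"]

resp = requests.post(
    f"{api_server}/users.get_all",
    headers={"Authorization": f"Bearer {token}"},
    json={},
)
print(resp.json())
```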
Ah, didn’t know that. Yes, in that case that would work 👍
I run the GCP Autoscaler successfully with GPUs. Have you included this line in the init script of the autoscaler? This was a gotcha for me...
/opt/deeplearning/install-driver.sh
@<1523701070390366208:profile|CostlyOstrich36> Thank you. Which docker image do you use with this machine image?
@<1537605940121964544:profile|EnthusiasticShrimp49> How do I specify not to attach a GPU? I thought ticking 'Run in CPU Mode' would be sufficient. Is there something else I'm missing?
👍 Thanks for getting back to me.
Another issue I found was that I could only use VPC subnets from the Google project I am launching the VMs in.
I cannot use shared VPC subnets from another project. This would be a useful feature to implement, as GCP recommends segmenting the cloud estate so that the VPC and VMs are in different projects.
I’ve had some issues with clearml sessions. I’d be interested in seeing a PR. Would you mind posting a link please?
No particular reason. This was our first time trying it and it seemed the quickest way to get off the ground. When I try without it, I get a similar error trying to connect, although that could be due to the instance.
I have managed to connect. Our EC2 instances run in a private subnet, so I believe that is why the SSH connection was not working. Once I connected to my VPN, it worked.
I ran again without the debug mode option and got this error:
>
> Starting Task Execution:
>
>
> Traceback (most recent call last):
> File "/root/.clearml/venvs-builds/3.6/code/interactive_session.py", line 377, in <module>
> from tcp_proxy import TcpProxy
> ModuleNotFoundError: No module named 'tcp_proxy'
>
> Process failed, exit code 1
@<1523701087100473344:profile|SuccessfulKoala55> Thanks for getting back to me. My image contains clearml-agent==1.9.1. There is a recent release, 1.9.2, and now on every run the agent installs this newer version because of the -U flag that is being passed. From the docs it looks like there may be a way to prevent this upgrade, but it's not clear to me exactly how to do it. Is it possible?
@<1523701087100473344:profile|SuccessfulKoala55> Just following up, as I figured out what was happening here and it could be useful for the future.
The prefilled value for Number of GPUs in the GCP Autoscaler is 1.
When one ticks Run in CPU mode (no gpus), it hides the GPU Type and Number of GPUs fields. However, the values from these fields are still submitted in the API request (I'm guessing here) when the Autoscaler is launched.
Hence, to get past this, you need to...
I cannot ping api.clear.ml on Ubuntu. Works fine on Mac though.
Solved for me as well now.
Hi,
I've managed to fix it.
Basically, I had a tracker running on our queues to ensure that none of them were lagging. This was using get_next_task from APIClient().queues.
If you call get_next_task, it removes the task from the queue but does not move it into another state. I think that's because get_next_task is typically followed immediately by something that makes the task run in the daemon, or deletes it.
Hence you end up in this weird state where the task thinks it's queued bec...
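For anyone hitting the same thing, one way to monitor queue depth without dequeuing anything is roughly this (a sketch; I'm assuming queues.get_by_id exposes the queue's entries, and the exact field names may differ):

```python
# Sketch of a non-destructive queue check -- unlike get_next_task,
# this does not remove tasks from the queue.
from clearml.backend_api.session.client import APIClient

client = APIClient()

for queue in client.queues.get_all():
    entries = client.queues.get_by_id(queue=queue.id).entries or []
    print(f"{queue.name}: {len(entries)} task(s) waiting")
```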
