I ran again without the debug mode option and got this error:
>
> Starting Task Execution:
>
>
> Traceback (most recent call last):
> File "/root/.clearml/venvs-builds/3.6/code/interactive_session.py", line 377, in <module>
> from tcp_proxy import TcpProxy
> ModuleNotFoundError: No module named 'tcp_proxy'
>
> Process failed, exit code 1
Thanks. I am trying to completely minimise the start up time. Given I am using a docker image which has clearml-agent and pip installed, is there a way I can skip the installation of this when a task starts up using the daemon?
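For anyone landing here with the same question: the approach that usually works (variable names are from the clearml-agent docs — worth double-checking against your agent version) is to tell the agent to reuse the Python environment already baked into the docker image instead of creating a fresh venv:

```
# Set on the machine/image running the daemon (names per clearml-agent
# docs; verify for your agent version):

export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1      # skip creating a new pip venv
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1    # use the image's python environment as-is
```

With these set, the agent assumes the container already contains every required package, so nothing is reinstalled at task start.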
I believe this was an example report I made for a demo and I've since deleted the tasks which generated it 👍
Nope. But I believe there are steps you can take to prevent this, such as publishing tasks and reports.
👍 thanks for clearing that up @<1523701087100473344:profile|SuccessfulKoala55>
I have managed to connect. Our EC2 instances run in a private subnet so the ssh connection was not working for that reason I believe. Once I connected to my VPN it now worked.
The code is quite nested but I've tried to extract the important parts ( summary_writer is a TensorBoard logger).
self.figure, (ax1, ax2, axc) = plt.subplots(1, 3, figsize=(total_width, total_height), facecolor="white")
self.summary_writer = self.tb_logger.experiment
self.summary_writer.add_figure(Partition.TRAINING.value, train_plot.figure, global_step=self.current_epoch + 1)
The train_plot.figure is a matplotlib figure created using seaborn.
Let me know if this...
Here it is:
No particular reason. This was our first time trying it and it seemed the quickest way to get off the ground. When I try without it, I get a similar error when connecting, although that could be due to the instance.
@<1523701070390366208:profile|CostlyOstrich36> Thank you. Which docker image do you use with this machine image?
It's not immediately obvious from the GCP documentation and you don't need to do this on AWS or Azure so it can catch you out. For what it's worth, the image I used originally was from the same family Marko has referenced above.
Yep that's correct. If I have a task which runs every 5 minutes, I don't want a new task every 5 minutes as that will create a lot of tasks over a day. It would be better if I had just one task.
Further to this, I've investigated more. This is working as expected for ClearML 1.8.3 but not for ClearML 1.9.0.
I looked at the commits and found that a change had been made to the _decode_image method:
This aligns with the error message I'm seeing:
2023-02-08 15:17:25,539 - clearml - WARNING - Error: I/O operation on closed file.
Can this be actioned for the next release plea...
I did not touch the interactive session code at all.
I installed clearml-session using pip and ran the above command with a task id from a task I'd already run.
I don't think there's really a way around this because AWS Lambda doesn't allow for multiprocessing.
Instead, I've resorted to using a clearml Scheduler, which runs on a t3.micro instance, for jobs I want to run on a cron schedule.
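As an aside (not ClearML-specific): what Lambda actually lacks is /dev/shm, so multiprocessing.Pool and multiprocessing.Queue fail with an OSError there, while Process plus Pipe generally do work. A minimal stdlib sketch of that pattern:

```python
import multiprocessing

def worker(conn):
    # Do the work in the child process and send the result back over the pipe.
    conn.send(sum(range(10)))
    conn.close()

def run_in_subprocess():
    # Pool/Queue need POSIX semaphores backed by /dev/shm, which Lambda
    # lacks; a fork-context Process with a Pipe avoids shared-memory locks.
    ctx = multiprocessing.get_context("fork")
    parent_conn, child_conn = ctx.Pipe()
    p = ctx.Process(target=worker, args=(child_conn,))
    p.start()
    result = parent_conn.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(run_in_subprocess())  # prints 45
```

This only sidesteps the shared-memory issue; it doesn't give you a long-running scheduler, which is why the t3.micro approach above still makes sense.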
Thanks Jake. Do you know how I set the GPU count to 0?
I cannot ping api.clear.ml on Ubuntu. Works fine on Mac though.
I am having the same error since yesterday on Ubuntu. Works fine on Mac.
I cannot ping api.clear.ml
I’ve had some issues with clearml sessions. I’d be interested in seeing a PR. Would you mind posting a link please?
I run GPU workloads successfully using the GCP Autoscaler. Have you included this line in the init-script of the autoscaler? This was a gotcha for me...
/opt/deeplearning/install-driver.sh
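For context, a minimal init-script along these lines (a sketch — adjust for your image family) makes sure the NVIDIA driver is actually installed before the agent starts, since GCP Deep Learning VM images ship the installer rather than a pre-installed driver:

```bash
#!/bin/bash
# GCP Deep Learning VM images provide this installer script; run it
# before any GPU work starts, otherwise nvidia-smi (and the agent's
# GPU detection) will fail on first boot.
/opt/deeplearning/install-driver.sh
```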
This is something you can do in the GCP console, one would imagine it can be done using their python library.
I think the limitation is that you can only pass a relative subnet path in the GCP Autoscaler console. Then, by the looks of the error message, the ClearML Autoscaler constructs the full path under the hood: /project/<project_id>/subnet/<subnet_id>.
I'd like the option to specify the full path myself in the Autoscaler which would then allow me to use a shared subnet.
I am using ClearML version 1.9.1. In code, I am creating a plot using matplotlib. I am able to see this in TensorBoard but it is not available in ClearML Plots.
Hi,
I've managed to fix it.
Basically, I had a tracker running on our queues to ensure that none of them were lagging. This was using get_next_task from APIClient().queues.
If you call get_next_task, it removes the task from the queue but does not put it into another state. I think that's because get_next_task is typically followed immediately by something that makes the task run in the daemon or deletes it.
Hence you end up in this weird state where the task thinks it's queued bec...
@<1523701087100473344:profile|SuccessfulKoala55> Thanks for getting back to me. My image contains clearml-agent==1.9.1. There is a recent 1.9.2 release, and now on every run the agent installs this newer version because of the -U flag that is being passed. From the docs it looks like there may be a way to prevent this upgrade, but it's not clear to me exactly how. Is it possible?
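In case it helps anyone else: one way this can be handled (assuming your agent version supports the CLEARML_AGENT_UPDATE_VERSION variable — worth confirming in the clearml-agent docs) is to pin the version the agent installs inside the container:

```
# Pin the clearml-agent version resolved inside the docker container,
# so the -U upgrade matches the version already baked into the image:
export CLEARML_AGENT_UPDATE_VERSION="==1.9.1"
```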
Ah, didn’t know that. Yes in that case that would work 👍
I have just encountered this. I believe it is because of the clearml-agent 1.7.0 release, which added this as a default: agent.enable_git_ask_pass: true
To fix, add agent.enable_git_ask_pass: false to your config.
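Concretely, in clearml.conf that looks like:

```
# clearml.conf on the agent machine
agent {
    # Disable the GIT_ASKPASS credential injection that became the
    # default in clearml-agent 1.7.0
    enable_git_ask_pass: false
}
```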
Solved for me as well now.