I am using ClearML version 1.9.1. In my code, I am creating a plot using matplotlib. I can see it in TensorBoard, but it is not available in ClearML Plots.
Further to this, I have investigated in more detail. This works as expected with ClearML 1.8.3 but not with ClearML 1.9.0.
I looked at the commits and found that a change had been made to the _decode_image method:
This aligns with the error message I'm seeing:
2023-02-08 15:17:25,539 - clearml - WARNING - Error: I/O operation on closed file.
Can this be actioned for the next release plea...
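For context, the warning text matches the message Python's standard library raises when a file-like object is read after it has been closed. The sketch below reproduces that failure mode with an in-memory buffer. It is an illustration only and is not ClearML's _decode_image code; the function names are made up for the example.

```python
import io


def read_after_close() -> str:
    """Return the error message produced by reading a closed in-memory buffer."""
    buffer = io.BytesIO(b"fake image bytes")
    buffer.close()  # after this, nobody holding the buffer can read from it
    try:
        buffer.read()
    except ValueError as exc:
        return str(exc)
    return ""


def read_before_close() -> bytes:
    """Copy the bytes out while the buffer is still open, then let it close."""
    with io.BytesIO(b"fake image bytes") as buffer:
        return buffer.getvalue()
```

The second helper shows the usual remedy in this situation: take a copy of the data before the buffer goes out of scope or is closed.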
The code is quite nested, but I've tried to extract the important parts (summary_writer is a TensorBoard logger).
self.figure, (ax1, ax2, axc) = plt.subplots(1, 3, figsize=(total_width, total_height), facecolor="white")
self.summary_writer = self.tb_logger.experiment
self.summary_writer.add_figure(Partition.TRAINING.value, train_plot.figure, global_step=self.current_epoch + 1)
The train_plot.figure is a matplotlib figure created using seaborn.
Let me know if this...
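As a possible stopgap, a figure can also be sent to ClearML with an explicit report call rather than through the TensorBoard binding. This is a minimal sketch, assuming Logger.report_matplotlib_figure is available in the installed ClearML version. The helper name report_figure and the reuse of the title as the series name are illustrative choices, and this has not been verified to avoid the warning on 1.9.x.

```python
def report_figure(logger, figure, title, iteration):
    """Send a matplotlib figure to ClearML with an explicit report call.

    `logger` is expected to be a clearml Logger, for example the object
    returned by Task.current_task().get_logger().
    """
    logger.report_matplotlib_figure(
        title=title,
        series=title,        # illustrative: reuse the title as the series name
        figure=figure,
        iteration=iteration,
        report_image=False,  # False asks for a plot rather than a static debug image
    )


# Possible usage inside the training code quoted above (not executed here):
#   from clearml import Task
#   report_figure(Task.current_task().get_logger(), train_plot.figure,
#                 Partition.TRAINING.value, self.current_epoch + 1)
```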
I don't think there's really a way around this because AWS Lambda doesn't allow for multiprocessing.
Instead, I've resorted to using a clearml Scheduler which runs on a t3.micro instance for jobs which I want to run on a cron.
This is something you can do in the GCP console, one would imagine it can be done using their python library.
I think the limitation is that you can only pass a relative subnet path in the GCP Autoscaler console. Then, by the looks of the error message, the ClearML Autoscaler constructs the full path under the hood: /project/<project_id>/subnet/<subnet_id>.
I'd like the option to specify the full path myself in the Autoscaler which would then allow me to use a shared subnet.
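To make the request concrete: GCP identifies a subnetwork by a resource path that includes the project owning it, which for a shared VPC is the host project rather than the project the VMs run in. The sketch below builds such a path. The function name and the example values are placeholders, and it says nothing about how the ClearML Autoscaler assembles its own path internally.

```python
def subnetwork_path(host_project: str, region: str, subnet: str) -> str:
    """Build the relative resource path of a GCP subnetwork.

    For a shared VPC, `host_project` is the project that owns the network,
    which may differ from the project the VM instances are created in.
    """
    return f"projects/{host_project}/regions/{region}/subnetworks/{subnet}"


# Placeholder values, for illustration only.
example = subnetwork_path("network-host-project", "europe-west2", "shared-subnet")
```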
👍 Thanks for getting back to me.
Another issue I found is that I can only use VPC subnets from the Google project I am launching the VMs in.
I cannot use shared VPC subnets from another project. This would be a useful feature to implement, as GCP recommends segmenting the cloud estate so that the VPC and the VMs are in different projects.
@<1523701087100473344:profile|SuccessfulKoala55> Just following up, as I figured out what was happening here and it could be useful for the future.
The prefilled value for Number of GPUs in the GCP Autoscaler is 1.
When one ticks Run in CPU mode (no gpus), it hides the GPU Type and Number of GPUs fields. However, the values that were in these fields are still submitted in the API request (I'm guessing here) when the Autoscaler is launched.
Hence, to get past this, you need to...
Thanks Jake. Do you know how I set the GPU count to 0?
Here it is:
@<1537605940121964544:profile|EnthusiasticShrimp49> How do I specify to not attach a gpu? I thought ticking 'Run in CPU Mode' would be sufficient. Is there something else I'm missing?
Is there a way I can do this with the python APIClient or even with the requests library?
Hi,
I've managed to fix it.
Basically, I had a tracker running on our queues to ensure that none of them were lagging. This was using get_next_task from APIClient().queues.
If you call get_next_task, it removes the task from the queue but does not put it into another state. I think this is because get_next_task is typically followed immediately by something that makes the task run in the daemon or deletes it.
Hence you end up in this weird state where the task thinks it's queued bec...
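A lag tracker can avoid this by only reading the queue instead of popping from it. This is a minimal sketch, assuming queues.get_by_id returns an object with an entries list whose items expose an added datetime. Those attribute names reflect my understanding of the response and should be checked against the installed version; queue_lag_seconds and _as_utc are hypothetical helpers.

```python
from datetime import datetime, timezone


def _as_utc(moment):
    """Treat naive datetimes as UTC so they can be compared with aware ones."""
    if moment.tzinfo is None:
        return moment.replace(tzinfo=timezone.utc)
    return moment


def queue_lag_seconds(client, queue_id, now=None):
    """Return the age in seconds of the oldest queued entry (0.0 if empty).

    Only reads the queue; nothing is dequeued, so task states are untouched.
    `client` is expected to behave like clearml's APIClient (assumption).
    """
    queue = client.queues.get_by_id(queue=queue_id)
    entries = getattr(queue, "entries", None) or []
    if not entries:
        return 0.0
    now = _as_utc(now or datetime.now(timezone.utc))
    oldest = min(_as_utc(entry.added) for entry in entries)
    return (now - oldest).total_seconds()


# Possible usage (not executed here):
#   from clearml.backend_api.session.client import APIClient
#   lag = queue_lag_seconds(APIClient(), "<queue_id>")
```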
👍 thanks for clearing that up @<1523701087100473344:profile|SuccessfulKoala55>
Furthermore, when using APIClient(), users is not a valid endpoint at all.
class APIClient(object):
    auth = None  # type: Any
    queues = None  # type: Any
    tasks = None  # type: Any
    workers = None  # type: Any
    events = None  # type: Any
    models = None  # type: Any
    projects = None  # type: Any
This is taken from clearml/backend_api/session/client/client.py
@<1523701070390366208:profile|CostlyOstrich36> Thank you. Which docker image do you use with this machine image?
I believe this was an example report I made for a demo and I've since deleted the tasks which generated it 👍
Nope. But there are steps you can take to prevent this through publishing tasks and reports I believe.
$ curl -H "Authorization: Bearer <TOKEN>" -X GET
{"meta":{"id":"ed6c52d030f240a89f001b447ee64a6b","trx":"ed6c52d030f240a89f001b447ee64a6b","endpoint":{"name":"debug.ping","requested_version":"2.26","actual_version":"1.0"},"result_code":200,"result_subcode":0,"result_msg":"OK","error_stack":null,"error_data":{},"alarms":{}},"data":{"msg":"Hello World"}}%
$ curl -H "Authoriz...
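The same kind of call can be made from Python. This is a minimal sketch using only the standard library, assuming the API server exposes each endpoint at <api_server>/<endpoint> and accepts the bearer token as in the curl call. The names api_server, build_request and call are hypothetical, and the server address is left as a parameter because it is not shown above.

```python
import json
import urllib.request


def build_request(api_server: str, endpoint: str, token: str) -> urllib.request.Request:
    """Prepare a GET request with a bearer token, without sending it."""
    url = f"{api_server.rstrip('/')}/{endpoint}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def call(api_server: str, endpoint: str, token: str) -> dict:
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(build_request(api_server, endpoint, token)) as response:
        return json.load(response)


# Possible usage (not executed here; the address is a placeholder):
#   print(call("<api_server>", "debug.ping", "<TOKEN>"))
```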
Is there documentation for this as I was not able to figure this out unfortunately.
If a Task is in the 'Completed' state, I think the only option is to 'Reset' it (see image). You do clear the previous run's execution, but I think for a repetitive task this is fine.
Maybe this should only be the case if it is in a 'Completed' state rather than 'Failed'. I can see that in this case you would not want to clear the execution because you would want to see why it Failed. Thoughts?
Yep that's correct. If I have a task which runs every 5 minutes, I don't want a new task every 5 minutes as that will create a lot of tasks over a day. It would be better if I had just one task.
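For a scheduled job, something along these lines might give the single-task behaviour described here. This is a minimal sketch, assuming the job is driven by TaskScheduler from clearml.automation and that its add_task accepts reuse_task and treats a lone minute value as "every N minutes", as I read its docstring. The helper name schedule_reused_task is hypothetical.

```python
def schedule_reused_task(scheduler, task_id, queue, every_minutes=5):
    """Register a recurring job that re-enqueues one task rather than cloning it.

    `scheduler` is expected to be a clearml.automation.TaskScheduler (assumption).
    """
    scheduler.add_task(
        schedule_task_id=task_id,
        queue=queue,
        minute=every_minutes,  # assumption: minute alone means "every N minutes"
        recurring=True,
        reuse_task=True,       # assumption: reuse the same task each run, no clone
    )


# Possible usage (not executed here; IDs and names are placeholders):
#   from clearml.automation import TaskScheduler
#   scheduler = TaskScheduler()
#   schedule_reused_task(scheduler, "<task_id>", "<queue_name>")
#   scheduler.start()
```

Reusing the task means each run overwrites the previous execution, which is the trade-off raised above for 'Failed' runs.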
This is not working. Please see None which details the problem
Solved for me as well now.
I cannot ping api.clear.ml on Ubuntu. Works fine on Mac though.