
Reputation
Badges 1
53 × Eureka!So without the flush I got the error apparently at the very end of the script - all commands of my actual Python code had run.
@<1523701070390366208:profile|CostlyOstrich36> , you mean the ClearML server needs access to Cloud Storage in its clearml.conf file?
Just tried it by creating a ~/clearml.conf file and setting the entry as below - unfortunately the same result. I’ve re-started the docker-compose of course.
Did I miss something here?
google.storage {
credentials_json: "/home/.../my-crendetials.json"
}
Ok, I re-checked and saw that the data was indeed cached and re-loaded - maybe I waited a little too long last time and it was already a new instance. Awesome implementation guys!
Well duh, now it makes total sense! Should have checked docs or examples more closely 🙏
Yes if that works reliably then I think that option could make sense, it would have made things somewhat easier in my case - but this is just as good.
I meant maybe me activating offline mode, somehow changes something else in the runtime and that in turn leads to the interruption. Let me try to build a minimal reproducible version 🙂
AgitatedDove14 maybe to come at this from a broader angle:
Is ClearML combined with DataParallel
or DistributedDataParallel
officially supported / should that work without many adjustments? If so, would it be started via python ...
or via torchrun ...
? What about remote runs, how will they support the parallel execution? To go even deeper, what about the machines started via ClearML Autoscaler? Can they either run multiple agents on them and/or start remote distribu...
Hi @<1523703436166565888:profile|DeterminedCrab71> and @<1523701070390366208:profile|CostlyOstrich36> , coming back to this after a while. It actually seems to be related to Google Cloud permissions:
- The images in the ClearML dashboard to not show as discussed above
- If I copy the image url (coming out as something like None and open it in another tab where I’m logged into my Google Account, the image loads
- If I do t...
Hi Jake, yes I’d love to! Just a question: how clean and complete does the example need to be? For example, this code relies on you building a correct Machine Image on GCP (which is somewhat unrelated to ClearML) and it does not get the logs from the agent instances - is that still good enough?
Unfortunately not, task.data.output
just contains <tasks.Output: { "destination": "
s3://some_bucket " }>
and when I convert task.data to a string and search for the desired uri, I cannot find it either.
But on the other hand, putting the url together from its name, id, etc. seems to work - it might be a little unsafe if the task gets re-named or something, but otherwise it should be fine.
Ok I see, that is what I thought. But do you have any idea why I am not seeing these images? I am logged into my Gmail account and into the Google Cloud Console and can access both in another tab of the same browser. Am I missing something here?
Hey guys, really appreciating the help here!
So what I meant by “it does work” is that the environment variables go through to the container, I can use them there, everything runs.
The remaining problem is that this way, they are visible in the ClearML web UI which is potentially unsafe / bad practice, see screenshot below.
Happy to and thanks!
So the container itself gets deleted but everything is still cached because the cache directory is mounted to the host machine in the same place? Makes absolute sense and is what I was hoping for, but I can’t confirm this currently - I can see that the data is reloaded each time, even if the machine was not shut down in between. I’ll check again to find the cached data on the machine
Ah got it - that is already the case though. I’m logged into a Google Account that can access that bucket and I can download the image by clicking on the Download link in the ClearML dashboard and by going through the GCP console to the bucket…
It might be broken for me, as I said the program works without the offline mode but gets interrupted and shows the results from above with offline mode. But there might be another issue in between of course - any idea how to debug?
The environment variable is good to know, I will try with that as well and report back.
Although, some correction here: While the secret is indeed hidden in the logs, it is still visible in the “execution” tab of the experiment, see two screenshots below.
One again I set them withtask.set_base_docker(docker_arguments=["..."])
To recap, the server started up on GCP as expected before migrating the data over. The migration was done by
- deleting the current data
sudo rm -fR /opt/clearml/data/*
- unpacking the backup
sudo tar -xzf ~/clearml_backup_data.tgz -C /opt/clearml/data
- setting permissions
sudo chown -R 1000:1000 /opt/clearml
Hi AgitatedDove14 , so it took some time but I’ve finally managed to reproduce. The issue seems to be related to writing images via Tensorboard:
` from torch.utils.tensorboard import SummaryWriter
import torch
from clearml import Task, Logger
if name == "main":
task = Task.init(project_name="ClearML-Debug", task_name="[Mac] TB Logger, offline")
tb_logger = SummaryWriter(log_dir="tb_logger/demo/")
image_tensor = torch.rand(256, 256, 3)
for iter in range(10):
t...
Sorry to ask again, but the values are still showing up in the WebUI console logs this way (see screenshot.)
Here is the config that I paste into the EC2 Autoscaler Setup:
` agent {
extra_docker_arguments: ["-e AWS_ACCESS_KEY_ID=XXXXXX", "-e AWS_SECRET_ACCESS_KEY=XXXXXX"]
hide_docker_command_env_vars {
enabled: true
extra_keys: ["AWS_SECRET_ACCESS_KEY"]
parse_embedded_urls: true
}
} `Never mind, it came from setting the options wrong, it has to be ...
So my own repo I’m launching with eithertorchrun --nproc_per_node 2 --standalone --master_addr 127.0.0.1 --master_port 29500 -m
http://my_folder.my _script --some_option
orpython3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.1 --master_port 29500 -m
http://my_folder.my _script --some_option
Sorry that these issues go quite deep and chaotic - we would appreciate any help or ideas you can think of!
Won’t they be printed out as well in the Web UI? That shows the full Docker command for running the task right…
Ok so actually if I run task.flush(wait_for_uploads=True)
at the end of the script it works ✔
Hi John, thanks for getting back to me!
So it shows up in the UI like shown below. It happens both when “recording” the local run on Mac and on Linux.
I’m on Safari actually, but I just checked on Chrome (which shows this unsecure connection indicator) and images are activated. Might it still be due to non-HTTPS connection? We should get on that anyhow
That was the missing piece - thank you!
Awesome to all the details you have considered in ClearML 😉
Results of a bit more investigation:
The ClearML example does use the Pytorch dist
package but none of the DistributedDataParallel
functionality, instead, it reduces gradients “manually”. This script is also not prepared for torchrun
as it launches more processes itself (w/o using the multiprocessing of Python or Pytorch.)
When running a simple example (code attached...