I actually wanted to load a specific artifact, but didn't think of looking through the task's output models. I have now changed to that approach, which feels much safer, so we should be all done here. Thanks!
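For anyone finding this later, loading via the task's output models looks roughly like this (a minimal sketch; the project/task names are placeholders, and I'm assuming the model was registered as an output model of the task):
```python
from clearml import Task

# placeholder project/task names
task = Task.get_task(project_name="my_project", task_name="my_task")

# take the most recently registered output model and fetch a local copy
model = task.models["output"][-1]
local_path = model.get_local_copy()
print(local_path)
```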
Unfortunately not, `task.data.output` just contains `<tasks.Output: { "destination": "s3://some_bucket" }>`, and when I convert `task.data` to a string and search for the desired URI, I cannot find it either.
But on the other hand, putting the URL together from its name, ID, etc. seems to work. It might be a little unsafe if the task gets renamed or something, but otherwise it should be fine.
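If the file was registered as an artifact, the URI can presumably also be read off the artifact object instead of being assembled by hand; a minimal sketch, assuming an artifact named "data" was uploaded via `upload_artifact`:
```python
from clearml import Task

# placeholder project/task names
task = Task.get_task(project_name="my_project", task_name="my_task")

# hypothetical artifact name; .url holds the destination URI as registered
artifact = task.artifacts["data"]
print(artifact.url)

# or download it directly into the local cache
local_path = artifact.get_local_copy()
```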
I meant that maybe me activating offline mode somehow changes something else in the runtime, and that in turn leads to the interruption. Let me try to build a minimal reproducible version 🙂
Hi AgitatedDove14, so it took some time but I've finally managed to reproduce. The issue seems to be related to writing images via Tensorboard:
```python
from torch.utils.tensorboard import SummaryWriter
import torch
from clearml import Task, Logger

if __name__ == "__main__":
    task = Task.init(project_name="ClearML-Debug", task_name="[Mac] TB Logger, offline")
    tb_logger = SummaryWriter(log_dir="tb_logger/demo/")
    image_tensor = torch.rand(256, 256, 3)
    for iter in range(10):
        t...
```
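For context, this is roughly how offline mode is toggled in the repro (a minimal sketch of the two documented switches, the API call and the environment variable mentioned further below; the task name is a placeholder):
```python
import os
from clearml import Task

# option 1: the API switch; must be called before Task.init
Task.set_offline(offline_mode=True)

# option 2: the environment variable route (set before the process starts)
# os.environ["CLEARML_OFFLINE_MODE"] = "1"

task = Task.init(project_name="ClearML-Debug", task_name="offline demo")
```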
Happy to and thanks!
So the container itself gets deleted but everything is still cached because the cache directory is mounted to the host machine in the same place? Makes absolute sense and is what I was hoping for, but I can't confirm this currently - I can see that the data is reloaded each time, even if the machine was not shut down in between. I'll check again to find the cached data on the machine
Ok, I re-checked and saw that the data was indeed cached and re-loaded - maybe I waited a little too long last time and it was already a new instance. Awesome implementation guys!
Yes for example, or some other way to get credentials over to the container safely without them showing up in the checked-in code or web UI
SuccessfulKoala55 AgitatedDove14 So I've tried the approach and it does work; however, this of course results in the credentials being visible in the ClearML web interface output, which comes close to just hard-coding them in…
Is there any way to send the secrets safely?
Is there any way to access the clearml.conf file contents from within code? (afaik, the file does not get sent over to the container - otherwise I could just read it in myself…)
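In case it helps anyone later: clearml.conf is HOCON rather than YAML, so where the file does exist locally it can be parsed with pyhocon (a parser ClearML itself depends on); a minimal sketch, assuming the default ~/clearml.conf location:
```python
from pathlib import Path
from pyhocon import ConfigFactory  # HOCON parser for the clearml.conf format

conf = ConfigFactory.parse_file(str(Path.home() / "clearml.conf"))

# dotted paths address nested keys; the second argument is the default
access_key = conf.get("api.credentials.access_key", None)
files_server = conf.get("api.files_server", None)
print(files_server)
```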
Won't they be printed out as well in the Web UI? That shows the full Docker command for running the task, right…
Hey guys, really appreciating the help here!
So what I meant by "it does work" is that the environment variables go through to the container, I can use them there, everything runs.
The remaining problem is that this way, they are visible in the ClearML web UI which is potentially unsafe / bad practice, see screenshot below.
Although, some correction here: While the secret is indeed hidden in the logs, it is still visible in the "execution" tab of the experiment, see two screenshots below.
Once again I set them with `task.set_base_docker(docker_arguments=["..."])`
That was the missing piece - thank you!
Awesome to see all the details you have considered in ClearML 🙂
Hi SuccessfulKoala55, thanks for getting back to me!
In the docs of `Task.set_base_docker()` it says "When running remotely the call is ignored". Does that mean that this function call is executed when running locally to "record" the arguments, and then when I duplicate the experiment and run the clone remotely, the call is ignored and the recorded values are used?
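That reading would suggest a pattern roughly like the following (a minimal sketch of my understanding, not confirmed against the docs; the image and argument values are placeholders):
```python
from clearml import Task

task = Task.init(project_name="demo", task_name="docker-args")

# executed in the local run: the arguments get recorded on the task;
# when an agent re-runs the clone, this call is ignored and the
# recorded values are used to build the container command instead
task.set_base_docker(
    docker_image="python:3.9",                          # placeholder image
    docker_arguments=["--env MY_SECRET=${MY_SECRET}"],  # placeholder argument
)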
It might be broken for me; as I said, the program works without offline mode but gets interrupted and shows the results from above with offline mode. But there might be another issue in between of course - any idea how to debug?
The environment variable is good to know, I will try with that as well and report back.
Well duh, now it makes total sense! Should have checked docs or examples more closely 🙂
Yes, if that works reliably then I think that option could make sense; it would have made things somewhat easier in my case - but this is just as good.
Results of a bit more investigation:
The ClearML example does use the PyTorch `dist` package but none of the `DistributedDataParallel` functionality; instead, it reduces gradients "manually". This script is also not prepared for `torchrun`, as it launches more processes itself (without using the multiprocessing of Python or PyTorch). When running a simple example (code attached...
Hi @<1523701205467926528:profile|AgitatedDove14> , so I've managed to reproduce a bit more.
When I run very basic code via `torchrun` or `torch.distributed.run`, then multiple ClearML tasks are created and visible in the UI (screenshot below). The logs and scalars are not aggregated; instead, the task of each rank reports its own.
If however I branch out via `torch.multiprocessing` like below, everything works as expected. The "script path" just shows the single python script, all logs an...
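The branch-out I'm referring to has roughly this shape (a minimal sketch, since the original snippet is truncated above; the worker body is a placeholder for the real training loop):
```python
import torch.multiprocessing as mp
from clearml import Task

def worker(rank: int):
    # placeholder worker body; in the real script each rank runs the training
    print(f"worker {rank} started")

if __name__ == "__main__":
    # single Task.init in the parent process; child ranks are spawned below
    task = Task.init(project_name="ClearML-Debug", task_name="mp spawn demo")
    mp.spawn(worker, nprocs=2)
```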
Hi @<1523701087100473344:profile|SuccessfulKoala55> , sorry there was a mistake on my end - clearml.conf pointed to the wrong URL 🙂
AgitatedDove14 maybe to come at this from a broader angle:
Is ClearML combined with `DataParallel` or `DistributedDataParallel` officially supported / should that work without many adjustments? If so, would it be started via `python ...` or via `torchrun ...`? What about remote runs, how will they support the parallel execution? To go even deeper, what about the machines started via the ClearML Autoscaler? Can they either run multiple agents on them and/or start remote distribu...
Sorry that these issues go quite deep and get chaotic - we would appreciate any help or ideas you can think of!
Ok I see, that is what I thought. But do you have any idea why I am not seeing these images? I am logged into my Gmail account and into the Google Cloud Console and can access both in another tab of the same browser. Am I missing something here?
If that helps: The URL I get when I copy it out of the ClearML dashboard is the same one as is listed under "Authenticated URL" when looking up the image in Google Cloud Storage. And of course that opens the image if I go to it in another tab
It was related: special characters also prevented some access.
But it was and is also related to some authentication problem with Google: If you open the dashboard in Chrome and go to the developer console, you see a bunch of failed links to authenticate against. If you click one of them in another tab, it shows the Google sign-in screen, and afterwards you can see the Debug Samples in the dashboard.
That all does not work in Safari though, for some reason 🙂
So my own repo I'm launching with either
`torchrun --nproc_per_node 2 --standalone --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option`
or
`python3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option`
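For reference, the kind of script those commands launch looks roughly like this (a minimal sketch of an env://-initialized entry point, not my actual training code):
```python
# my_folder/my_script.py - minimal sketch of a torchrun-launchable entry point
import torch.distributed as dist

def main():
    # torchrun / torch.distributed.launch set RANK, WORLD_SIZE, MASTER_ADDR
    # and MASTER_PORT, so the default env:// initialization picks them up
    dist.init_process_group(backend="gloo")  # "nccl" on multi-GPU machines
    print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```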
Ok great! I will debug starting with a simpler training script.
Just as a last question: is `torchrun` also supported, rather than the (now deprecated but still usable) `torch.distributed.launch`?
Hi John, thanks for getting back to me!
So it shows up in the UI like shown below. It happens both when "recording" the local run on Mac and on Linux.
Yes totally, but we've been having problems getting these GPUs specifically (even manually in the EC2 console and across regions), so I thought maybe it's easier to get one big one than many small ones, but I've never actually checked if that is true 🙂 Thanks anyhow!
More stack trace:
```
clearml-elastic | ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];
clearml-elastic | Likely root cause: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
clearml-elastic | at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
clearml-elastic | at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
clearml-el...
```