CostlyOstrich36, you mean the ClearML server needs access to Cloud Storage in its clearml.conf file?
Just tried it by creating a ~/clearml.conf file and setting the entry as below - unfortunately the same result. I've restarted the docker-compose of course.
Did I miss something here?
google.storage {
    credentials_json: "/home/.../my-crendetials.json"
}
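For reference, a quick way to sanity-check on the SDK side (as opposed to the server) that these credentials are picked up could look roughly like this - just a sketch, the bucket and file names are placeholders:
` from clearml import StorageManager

# Sketch only: "gs://my-bucket/..." and the local file name are placeholders.
# If the google.storage credentials from clearml.conf are picked up, both calls succeed.
remote_url = StorageManager.upload_file(
    local_file="some_local_file.txt",
    remote_url="gs://my-bucket/clearml-debug/some_local_file.txt",
)
print("uploaded to:", remote_url)

# Downloading it back confirms read access with the same credentials
local_copy = StorageManager.get_local_copy(remote_url=remote_url)
print("local copy at:", local_copy) `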
Well duh, now it makes total sense! Should have checked the docs or examples more closely.
Yes, if that works reliably then I think that option could make sense - it would have made things somewhat easier in my case, but this is just as good.
Hi SuccessfulKoala55, sorry there was a mistake on my end - clearml.conf pointed to the wrong URL.
So my own repo I'm launching with either
torchrun --nproc_per_node 2 --standalone --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
or
python3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
AgitatedDove14 maybe to come at this from a broader angle:
Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments? If so, would it be started via python ... or via torchrun ...? What about remote runs, how will they support the parallel execution? To go even deeper, what about the machines started via the ClearML Autoscaler? Can they either run multiple agents on them and/or start remote distribu...
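To make the question more concrete, this is roughly the structure I have in mind - just a sketch of combining ClearML with DistributedDataParallel under torchrun, assuming only rank 0 creates and reports the Task (module, project and task names are placeholders):
` import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from clearml import Task


def main():
    # torchrun sets the rendezvous env variables; "gloo" is used here so the sketch also runs on CPU
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    # Assumption: only rank 0 creates the Task, so a single experiment is reported
    task = Task.init(project_name="ClearML-Debug", task_name="DDP check") if rank == 0 else None

    model = DDP(torch.nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        optimizer.zero_grad()
        loss = model(torch.rand(4, 10)).mean()
        loss.backward()
        optimizer.step()
        if task is not None:
            task.get_logger().report_scalar("loss", "train", value=loss.item(), iteration=step)

    dist.destroy_process_group()


if __name__ == "__main__":
    main() `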
When running on our bigger research repository, which includes saving checkpoints and uploading to S3, the training ends with errors as shown below and a Killed message for the main process (I do not abort the main process manually):
2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver u...
Ok great! I will debug starting with a simpler training script.
Just as a last question, is torchrun also supported, rather than the (now deprecated but still usable) torch.distributed.launch?
Sorry to ask again, but the values are still showing up in the WebUI console logs this way (see screenshot).
Here is the config that I paste into the EC2 Autoscaler Setup:
` agent {
    extra_docker_arguments: ["-e AWS_ACCESS_KEY_ID=XXXXXX", "-e AWS_SECRET_ACCESS_KEY=XXXXXX"]
    hide_docker_command_env_vars {
        enabled: true
        extra_keys: ["AWS_SECRET_ACCESS_KEY"]
        parse_embedded_urls: true
    }
} `
Never mind, it came from setting the options wrong, it has to be ...
Unfortunately not, task.data.output just contains <tasks.Output: { "destination": "s3://some_bucket" }> and when I convert task.data to a string and search for the desired uri, I cannot find it either.
But on the other hand, putting the url together from its name, id, etc. seems to work - it might be a little unsafe if the task gets re-named or something, but otherwise it should be fine.
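A possibly safer alternative (just a sketch - it assumes the checkpoint was registered as an output model of the task) would be to read the URL from the task's output models instead of assembling it by hand:
` from clearml import Task

# Sketch: "..." stands for the actual task id
task = Task.get_task(task_id="...")

# task.models holds "input" and "output" model lists; assumes at least one checkpoint was uploaded
output_models = task.models["output"]
if output_models:
    print(output_models[-1].url)  # e.g. the s3:// URI of the latest uploaded checkpoint `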
Hi AgitatedDove14, so it took some time but I've finally managed to reproduce. The issue seems to be related to writing images via Tensorboard:
` from torch.utils.tensorboard import SummaryWriter
import torch
from clearml import Task, Logger

if __name__ == "__main__":
    task = Task.init(project_name="ClearML-Debug", task_name="[Mac] TB Logger, offline")
    tb_logger = SummaryWriter(log_dir="tb_logger/demo/")
    image_tensor = torch.rand(256, 256, 3)
    for iter in range(10):
        t...  # message truncated here; presumably the image is written via tb_logger.add_image(...) each iteration
Won't they be printed out as well in the Web UI? That shows the full Docker command for running the task, right…
That was the missing piece - thank you!
Awesome to all the details you have considered in ClearML!
Yes, when the WebUI prompted me for them. They also seem to work since images in Debug Samples (also in S3) show up after I entered them.
Also, I can see that the plot is also saved in Debug Samples after explicit reporting, even though I don't set report_interactive=False.
Hi SuccessfulKoala55, thanks for getting back to me!
In the docs of Task.set_base_docker() it says "When running remotely the call is ignored". Does that mean that this function call is executed when running locally to "record" the arguments, and then, when I clone the experiment and run it remotely, the call is ignored and the recorded values are used?
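Just to make sure I understand the intended pattern, something like this (a minimal sketch - the docker image, arguments and queue name are placeholders):
` from clearml import Task

task = Task.init(project_name="ClearML-Debug", task_name="base docker check")

# Recorded when running locally; on the remote run the call is ignored and the recorded values are used
task.set_base_docker(docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04", docker_arguments="--ipc=host")

# Stops local execution and enqueues the task for a clearml-agent ("default" is a placeholder queue)
task.execute_remotely(queue_name="default", exit_process=True)

# ... actual training code, executed only by the agent ... `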
SuccessfulKoala55 AgitatedDove14 So I've tried the approach and it does work; however, this of course results in the credentials being visible in the ClearML web interface output, which comes close to just hard-coding them in…
Is there any way to send the secrets safely?
Is there any way to access the clearml.conf file contents from within code? (afaik, the file does not get sent over to the container - otherwise I could just yml-read it myself…)
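The workaround I can think of is reading the values from environment variables inside the task - a sketch, assuming the agent container gets the credentials injected (e.g. via extra_docker_arguments as above):
` import os

# Assumption: the agent started the container with these variables set, so the
# code itself never hard-codes any secret.
aws_key = os.environ.get("AWS_ACCESS_KEY_ID")
aws_secret = os.environ.get("AWS_SECRET_ACCESS_KEY")

if aws_key is None or aws_secret is None:
    raise RuntimeError("AWS credentials are not set in the environment")

# boto3 (and therefore ClearML's S3 uploads) pick these variables up automatically,
# so usually nothing beyond exporting them is needed. `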
So without the flush I got the error apparently at the very end of the script - all commands of my actual Python code had run.
Ok so actually if I run task.flush(wait_for_uploads=True) at the end of the script it works.
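i.e. roughly this pattern at the end of the training script (a sketch; project and task names are placeholders):
` from clearml import Task

task = Task.init(project_name="ClearML-Debug", task_name="flush check")

# ... training, checkpoint saving, S3 uploads, etc. ...

# Block until all pending uploads (models, debug samples, ...) have finished
# before the process exits, so nothing gets cut off at the end of the script.
task.flush(wait_for_uploads=True) `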
If that helps: The URL I get when I copy it out of the ClearML dashboard is the same one as is listed under "Authenticated URL" when looking up the image in Google Cloud Storage. And of course that opens the image if I go to it in another tab.
Yes and yes - is that the issue, and would it likely go away if we host it via HTTPS?
Ah got it - that is already the case though. I'm logged into a Google Account that can access that bucket, and I can download the image by clicking on the Download link in the ClearML dashboard and by going through the GCP console to the bucket…
I'm on Safari actually, but I just checked on Chrome (which shows the insecure-connection indicator) and images are enabled. Might it still be due to the non-HTTPS connection? We should get on that anyhow.
Ok I see, that is what I thought. But do you have any idea why I am not seeing these images? I am logged into my Gmail account and into the Google Cloud Console and can access both in another tab of the same browser. Am I missing something here?
Hi DeterminedCrab71 and CostlyOstrich36, coming back to this after a while. It actually seems to be related to Google Cloud permissions:
- The images in the ClearML dashboard do not show, as discussed above
- If I copy the image url (coming out as something like None) and open it in another tab where I'm logged into my Google Account, the image loads
- If I do t...
Yes, makes sense, it sounded like that from the start. Luckily, the task.flush(...) way seems to work for now.
More stack trace:
clearml-elastic | ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];
clearml-elastic | Likely root cause: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
clearml-elastic | at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
clearml-elastic | at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
clearml-el...
It might be broken for me, as I said the program works without the offline mode but gets interrupted and shows the results from above with offline mode. But there might be another issue in between of course - any idea how to debug?
The environment variable is good to know, I will try with that as well and report back.
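For reference, this is roughly how I enable offline mode at the moment (a sketch; names are placeholders, and the environment-variable alternative mentioned above is, as far as I understand, CLEARML_OFFLINE_MODE=1):
` from clearml import Task

# Must be called before Task.init(); setting CLEARML_OFFLINE_MODE=1 should have the same effect
Task.set_offline(offline_mode=True)
task = Task.init(project_name="ClearML-Debug", task_name="[Mac] TB Logger, offline")

# ... training / TensorBoard logging ...

task.close()
# The run is stored locally as a zip and can later be imported into the server, e.g.
# Task.import_offline_session("/path/to/offline_session.zip") `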
Hi Jake, yes I'd love to! Just a question: how clean and complete does the example need to be? For example, this code relies on you building a correct Machine Image on GCP (which is somewhat unrelated to ClearML) and it does not get the logs from the agent instances - is that still good enough?