Eureka! It was related: special characters also prevented some access.
But it was and is also related to an authentication problem with Google: if you open the dashboard in Chrome and go to the developer console, you see a bunch of failed requests to authentication URLs. If you open one of them in another tab, it shows the Google sign-in screen, and afterwards you can see the Debug Samples in the dashboard.
That all does not work in Safari though, for some reason 🙂
Ah got it - that is already the case though. I'm logged into a Google Account that can access that bucket, and I can download the image both by clicking the Download link in the ClearML dashboard and by going through the GCP console to the bucket…
Yes and yes - is that the issue, and would it likely go away if we host it via HTTPS?
Ok I see, that is what I thought. But do you have any idea why I am not seeing these images? I am logged into my Gmail account and into the Google Cloud Console and can access both in another tab of the same browser. Am I missing something here?
If that helps: the URL I get when I copy it out of the ClearML dashboard is the same one as is listed under "Authenticated URL" when looking up the image in Google Cloud Storage. And of course that opens the image if I go to it in another tab.
Yes, makes sense - it sounded like that from the start. Luckily, the task.flush(...) way seems to work for now 🙂
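(For reference, this is roughly what that workaround looks like on our side - just a sketch, with placeholder project/task names:)

from clearml import Task

task = Task.init(project_name="examples", task_name="flush-check")  # placeholder names
# ... training / reporting ...
# explicitly flush pending reports and wait for uploads before the process exits
task.flush(wait_for_uploads=True)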
So my own repo I'm launching with either
torchrun --nproc_per_node 2 --standalone --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
or
python3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
When running on our bigger research repository, which includes saving checkpoints and uploading to S3, the training ends with the errors shown below and a Killed message for the main process (I do not abort the main process manually):
2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver u...
Yes, when the WebUI prompted me for them. They also seem to work since images in Debug Samples (also in S3) show up after I entered them.
Also, I can see that the plot is saved in Debug Samples after explicit reporting, even though I don't set report_interactive=False
SuccessfulKoala55 just in case you have any more thoughts, but we could also continue as is 🙂
Hi @<1523701087100473344:profile|SuccessfulKoala55> , sorry there was a mistake on my end - clearml.conf pointed to the wrong URL 🙂
AgitatedDove14 maybe to come at this from a broader angle:
Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments? If so, would it be started via python ... or via torchrun ...? What about remote runs, how will they support the parallel execution? To go even deeper, what about the machines started via the ClearML Autoscaler? Can they either run multiple agents on them and/or start remote distribu...
Unfortunately not, task.data.output just contains <tasks.Output: { "destination": "s3://some_bucket" }> and when I convert task.data to a string and search for the desired URI, I cannot find it either.
But on the other hand, putting the URL together from its name, id, etc. seems to work - it might be a little unsafe if the task gets re-named or something, but otherwise it should be fine.
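(For completeness, this is the kind of manual assembly I mean - a sketch only, assuming the default <project>/<task_name>.<task_id> layout under the destination, which is what I observed in our bucket:)

from clearml import Task

task = Task.get_task(task_id="0123456789abcdef")  # placeholder id
destination = task.data.output.destination  # e.g. "s3://some_bucket"
# assumed layout: <destination>/<project_name>/<task_name>.<task_id>/...
base_uri = f"{destination}/{task.get_project_name()}/{task.name}.{task.id}"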
Hi Jake, yes I'd love to! Just a question: how clean and complete does the example need to be? For example, this code relies on you building a correct Machine Image on GCP (which is somewhat unrelated to ClearML) and it does not get the logs from the agent instances - is that still good enough?
By the way, if we don't wrap other calls in is_offline() we get errors like "DateTime object is not serializable", but that's a secondary issue.
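(What the wrapping looks like on our side, roughly - a sketch with a dummy reporting call:)

from clearml import Task

task = Task.current_task()
if not Task.is_offline():
    # guard calls that failed for us in offline mode with "DateTime object is not serializable"
    task.get_logger().report_text("only reported when online")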
More stack trace:
clearml-elastic | ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];
clearml-elastic | Likely root cause: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
clearml-elastic | at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
clearml-elastic | at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
clearml-el...
Hi @<1523701205467926528:profile|AgitatedDove14> , so I've managed to reproduce a bit more.
When I run very basic code via torchrun or torch.distributed.run then multiple ClearML tasks are created and visible in the UI (screenshot below). The logs and scalars are not aggregated; instead, the task of each rank reports its own.
If however I branch out via torch.multiprocessing like below, everything works as expected. The "script path" just shows the single python script, all logs an...
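(Since the attachment seems to be missing here, the branch-out pattern is roughly the following - a simplified sketch, not the full script:)

import torch.multiprocessing as mp
from clearml import Task

def worker(rank, world_size):
    # each spawned process runs this; only the parent process created the task
    print(f"worker {rank}/{world_size} running")

if __name__ == "__main__":
    task = Task.init(project_name="examples", task_name="mp-spawn-check")  # placeholder names
    mp.spawn(worker, args=(2,), nprocs=2)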
Results of a bit more investigation:
The ClearML example does use the Pytorch dist package but none of the DistributedDataParallel functionality; instead, it reduces gradients "manually". This script is also not prepared for torchrun, as it launches more processes itself (without using the multiprocessing of Python or Pytorch).
When running a simple example (code attached...
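(By "prepared for torchrun" I mean a script that reads rank and world size from the environment instead of spawning its own processes, roughly like this sketch:)

import os
import torch.distributed as dist

# torchrun sets these environment variables for each worker it launches
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

dist.init_process_group(backend="gloo")  # rank/world size are picked up from the env
print(f"rank {rank} of {world_size} initialized")
dist.destroy_process_group()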
Sorry that these issues run quite deep and are a bit chaotic - we would appreciate any help or ideas you can think of!
I'm on Safari actually, but I just checked on Chrome (which shows the insecure-connection indicator) and images are activated. Might it still be due to the non-HTTPS connection? We should get on that anyhow.
To recap, the server started up on GCP as expected before migrating the data over. The migration was done by
- deleting the current data
sudo rm -fR /opt/clearml/data/*
- unpacking the backup
sudo tar -xzf ~/clearml_backup_data.tgz -C /opt/clearml/data
- setting permissions
sudo chown -R 1000:1000 /opt/clearml
@<1523701070390366208:profile|CostlyOstrich36> , you mean the ClearML server needs access to Cloud Storage in its clearml.conf file?
Just tried it by creating a ~/clearml.conf file and setting the entry as below - unfortunately the same result. I've restarted the docker-compose of course.
Did I miss something here?
google.storage {
credentials_json: "/home/.../my-crendetials.json"
}
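(To check whether the credentials are picked up at all, I also tried fetching a file directly - a sketch, with a placeholder bucket path:)

from clearml import StorageManager

# should succeed if the google.storage credentials from clearml.conf are being used
local_path = StorageManager.get_local_copy(remote_url="gs://my-bucket/some/image.png")
print(local_path)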
SuccessfulKoala55 AgitatedDove14 So I've tried the approach and it does work; however, this of course results in the credentials being visible in the ClearML web interface output, which comes close to just hard-coding them in…
Is there any way to send the secrets safely?
Is there any way to access the clearml.conf file contents from within code? (afaik, the file does not get sent over to the container - otherwise I could just yml-read it myself…)
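(What I would naively try is parsing the file myself, e.g. with pyhocon since clearml.conf is in HOCON format - but that only helps where the file actually exists, which is exactly the problem on the remote container. Sketch, with path and key as placeholders:)

from pyhocon import ConfigFactory

conf = ConfigFactory.parse_file("/home/user/clearml.conf")  # placeholder path
s3_key = conf.get("sdk.aws.s3.key", None)  # dotted keys work on the parsed tree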
Yes totally, but we've been having problems getting these GPUs specifically (even manually in the EC2 console and across regions), so I thought maybe it's easier to get one big one than many small ones, but I've never actually checked if that is true 🙂 Thanks anyhow!
Well duh, now it makes total sense! Should have checked the docs or examples more closely 🙂
Yes, if that works reliably then I think that option could make sense; it would have made things somewhat easier in my case - but this is just as good.
Ok great! I will debug starting with a simpler training script.
Just as a last question: is torchrun also supported, rather than the (now deprecated but still usable) torch.distributed.launch?
Hi John, thanks for getting back to me!
So it shows up in the UI as shown below. It happens both when "recording" the local run on Mac and on Linux.
So AgitatedDove14 if we use the CLEARML_OFFLINE_MODE environment variable instead, the program runs through again.
The only thing is that now we get errors of the form
0%| | 0/18 [00:00<?, ?image/s]ClearML running in offline mode, session stored in /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486
2022-11-07 07:49:06,986 - clearml.metrics - WARNING - Failed uploading to /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486/...
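(For reference, how we switch offline mode on - either the environment variable before anything ClearML-related runs, or the SDK call; a sketch with placeholder names:)

import os
os.environ["CLEARML_OFFLINE_MODE"] = "1"  # must be set before clearml initializes

from clearml import Task
# alternatively via the SDK:
# Task.set_offline(offline_mode=True)
task = Task.init(project_name="examples", task_name="offline-check")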