So AgitatedDove14 if we use the CLEARML_OFFLINE_MODE
environment variable instead the program runs through again.
The only thing is that now we get errors of the form
0%| | 0/18 [00:00<?, ?image/s]ClearML running in offline mode, session stored in /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486
2022-11-07 07:49:06,986 - clearml.metrics - WARNING - Failed uploading to /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486/...
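For reference, a minimal sketch of switching to offline mode through the environment variable mentioned above. It assumes that `"1"` is the enabling value and that the variable has to be set before the task is initialized; the helper name `enable_offline_mode`, the function `main`, and the project and task names are made up for illustration.

```python
import os


def enable_offline_mode() -> None:
    """Ask ClearML to record the session locally instead of contacting a server."""
    # Assumption: "1" enables offline mode, and it must be set before Task.init().
    os.environ["CLEARML_OFFLINE_MODE"] = "1"


enable_offline_mode()


def main() -> None:
    try:
        # Imported only after the variable is set.
        from clearml import Task
    except ImportError:
        print("clearml is not installed; nothing to do")
        return
    # Hypothetical project and task names, for illustration only.
    task = Task.init(project_name="offline-demo", task_name="offline-run")
    task.close()
```

Calling `main()` on a machine with `clearml` installed should store the session under the local offline cache folder, as in the log lines above.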
Results of a bit more investigation:
The ClearML example does use the PyTorch dist package but none of the DistributedDataParallel functionality; instead, it reduces gradients “manually”.
This script is also not prepared for torchrun, as it launches more processes itself (without using the multiprocessing of Python or PyTorch).
When running a simple example (code attached...
Hi @<1523701205467926528:profile|AgitatedDove14> , so I’ve managed to reproduce a bit more.
When I run very basic code via torchrun
or torch.distributed.run
then multiple ClearML tasks are created and visible in the UI (screenshot below). The logs and scalars are not aggregated but the task of each rank reports its own.
If however I branch out via torch.multiprocessing
like below, everything works as expected. The “script path” just shows the single python script, all logs an...
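To illustrate the difference described above: torchrun starts one process per rank and exports `RANK`, `LOCAL_RANK` and `WORLD_SIZE` to each of them, so any code that runs in every process, task creation included, runs once per rank. The sketch below only reads those variables. The helper names are hypothetical, and gating work on rank 0 is an assumption about a possible workaround, not something confirmed in this thread.

```python
import os


def get_rank(env=None) -> int:
    """Return the global rank exported by torchrun, or 0 outside torchrun."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def is_main_process(env=None) -> bool:
    """True only for the process with global rank 0."""
    return get_rank(env) == 0


if __name__ == "__main__":
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    print(f"rank {get_rank()} of {world_size}, main process: {is_main_process()}")
```

Launched through torchrun with several processes, each process would print its own rank; run as a plain script, it falls back to rank 0.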
Hi Jake, yes I’d love to! Just a question: how clean and complete does the example need to be? For example, this code relies on you building a correct Machine Image on GCP (which is somewhat unrelated to ClearML) and it does not get the logs from the agent instances - is that still good enough?
It was related, special characters also prevented some access.
But it was and is also related to some authentication problem with Google: If you open the dashboard in Chrome and go to the developer console, you see a bunch of failed links to authenticate to. If you click one of them in another tab, it shows the Google sign-in screen and afterwards you can see the Debug Samples in the dashboard.
None of that works in Safari though for some reason 🙂
@<1523701070390366208:profile|CostlyOstrich36> , you mean the ClearML server needs access to Cloud Storage in its clearml.conf file?
Just tried it by creating a ~/clearml.conf file and setting the entry as below - unfortunately the same result. I’ve re-started the docker-compose of course.
Did I miss something here?
google.storage {
credentials_json: "/home/.../my-crendetials.json"
}
Ok I see, that is what I thought. But do you have any idea why I am not seeing these images? I am logged into my Gmail account and into the Google Cloud Console and can access both in another tab of the same browser. Am I missing something here?
If that helps: The URL I get when I copy it out of the ClearML dashboard is the same one as is listed under “Authenticated URL” when looking up the image in Google Cloud storage. And of course that opens the image if I go to it in another tab
Happy to and thanks!
Hi @<1523703436166565888:profile|DeterminedCrab71> and @<1523701070390366208:profile|CostlyOstrich36> , coming back to this after a while. It actually seems to be related to Google Cloud permissions:
- The images in the ClearML dashboard do not show, as discussed above
- If I copy the image URL (coming out as something like None) and open it in another tab where I’m logged into my Google Account, the image loads
- If I do t...
Ah got it - that is already the case though. I’m logged into a Google Account that can access that bucket and I can download the image by clicking on the Download link in the ClearML dashboard and by going through the GCP console to the bucket…
Yes and yes - is that the issue, and might it go away if we host it via HTTPS?
Yes makes sense, it sounded like that from the start. Luckily, the task.flush(...)
way seems to work for now 🙂
So the container itself gets deleted but everything is still cached because the cache directory is mounted to the host machine in the same place? Makes absolute sense and is what I was hoping for, but I can’t confirm this currently - I can see that the data is reloaded each time, even if the machine was not shut down in between. I’ll check again to find the cached data on the machine
I’m on Safari actually, but I just checked on Chrome (which shows this insecure connection indicator) and images are activated. Might it still be due to the non-HTTPS connection? We should get on that anyhow
Well duh, now it makes total sense! Should have checked docs or examples more closely 🙏
Yes if that works reliably then I think that option could make sense, it would have made things somewhat easier in my case - but this is just as good.
Hi @<1523701087100473344:profile|SuccessfulKoala55> , sorry there was a mistake on my end - clearml.conf pointed to the wrong URL 🙈