I meant maybe me activating offline mode somehow changes something else in the runtime, and that in turn leads to the interruption. Let me try to build a minimal reproducible version 🙂
SuccessfulKoala55 AgitatedDove14 So I've tried the approach and it does work. However, this of course results in the credentials being visible in the ClearML web interface output, which comes close to just hard-coding them in…
Is there any way to send the secrets safely?
Is there any way to access the clearml.conf file contents from within code? (afaik, the file does not get sent over to the container - otherwise I could just read it in myself…)
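For reference, a minimal sketch of reading the file where it does exist locally - this assumes the default `~/clearml.conf` location (or the `CLEARML_CONFIG_FILE` override) and uses the third-party `pyhocon` parser, since clearml.conf is HOCON rather than plain YAML:

```python
import os
from pyhocon import ConfigFactory  # third-party HOCON parser: pip install pyhocon

# Resolve the config path the way clearml does by default
conf_path = os.environ.get("CLEARML_CONFIG_FILE", os.path.expanduser("~/clearml.conf"))

# Parse the HOCON file and read a nested key (None if it is missing)
conf = ConfigFactory.parse_file(conf_path)
access_key = conf.get("api.credentials.access_key", None)
print("found credentials:", access_key is not None)
```

As noted above, this only helps where the file actually exists, not inside a container it was never copied into.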
That was the missing piece - thank you!
Awesome to see all the details you have considered in ClearML 🙂
I'm on Safari actually, but I just checked on Chrome (which shows this insecure connection indicator) and images are activated. Might it still be due to the non-HTTPS connection? We should get on that anyhow.
Ok I see, that is what I thought. But do you have any idea why I am not seeing these images? I am logged into my Gmail account and into the Google Cloud Console and can access both in another tab of the same browser. Am I missing something here?
Ok great! I will debug starting with a simpler training script.
Just as a last question: is `torchrun` also supported, rather than the (now deprecated but still usable) `torch.distributed.launch`?
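For anyone finding this later, a minimal sketch of what a `torchrun`-compatible entry point looks like on the script side - generic PyTorch usage, not ClearML-specific, assuming the environment-variable contract that `torchrun` provides:

```python
import os
import torch
import torch.distributed as dist

def setup_distributed() -> int:
    # torchrun exports RANK / WORLD_SIZE / LOCAL_RANK / MASTER_ADDR / MASTER_PORT,
    # so init_process_group() can use the default env:// rendezvous
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # torch.distributed.launch passed --local_rank instead
    torch.cuda.set_device(local_rank)
    return local_rank

if __name__ == "__main__":
    local_rank = setup_distributed()
    # ... build the model and train on device f"cuda:{local_rank}" ...
    dist.destroy_process_group()
```

Launched with e.g. `torchrun --nproc_per_node=4 train.py`.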
It was related; special characters also prevented some access.
But it was and is also related to some authentication problem with Google: if you open the dashboard in Chrome and go to the developer console, you see a bunch of failed requests to authentication links. If you open one of them in another tab, it shows the Google sign-in screen, and afterwards you can see the Debug Samples in the dashboard.
That all does not work in Safari though, for some reason 🙂
Ah got it - that is already the case though. I'm logged into a Google Account that can access that bucket and I can download the image by clicking on the Download link in the ClearML dashboard and by going through the GCP console to the bucket…
Won't they be printed out as well in the Web UI? That shows the full Docker command for running the task, right…
Ok, I re-checked and saw that the data was indeed cached and re-loaded - maybe I waited a little too long last time and it was already a new instance. Awesome implementation guys!
So without the flush I got the error apparently at the very end of the script - all commands of my actual Python code had run.
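For context, the flush in question is roughly the following - a minimal sketch with placeholder names, assuming `task` is the object returned by `Task.init()`:

```python
from clearml import Task

task = Task.init(project_name="ClearML-Debug", task_name="flush demo")  # placeholder names

# ... training, checkpoint saving, artifact uploads ...

# Block until all pending uploads have finished before the process exits,
# instead of letting interpreter shutdown race against the background uploads
task.flush(wait_for_uploads=True)
```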
It might be broken for me, as I said the program works without the offline mode but gets interrupted and shows the results from above with offline mode. But there might be another issue in between of course - any idea how to debug?
The environment variable is good to know, I will try with that as well and report back.
By the way, if we don't wrap other calls in `is_offline()` we get errors like "DateTime object is not serializable", but that's a secondary issue.
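For completeness, a minimal sketch of such a guard - the `upload_artifact` call is just a placeholder for any server-facing call one might want to skip offline:

```python
from clearml import Task

task = Task.init(project_name="ClearML-Debug", task_name="offline guard demo")  # placeholder names

# Task.is_offline() reports whether offline mode is active; skip calls that
# only make sense against a live server
if not Task.is_offline():
    task.upload_artifact("config", artifact_object={"lr": 1e-4})
```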
Happy to and thanks!
@<1523701070390366208:profile|CostlyOstrich36> thank you, now everything works so far!
Last thing: is there any way to change all the links in the new ClearML server such that an artifact that was previously under `s3://…` is now taken from `gs://…`? The actual data is already available under the `gs://` link of course.
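Until the stored records are migrated server-side, a purely client-side workaround is conceivable - a hypothetical helper that rewrites legacy URIs at load time, assuming the bucket layout is identical on GCS:

```python
# Hypothetical client-side helper: rewrite legacy S3 artifact URIs to their
# GCS counterparts at load time; not a server-side fix
def map_legacy_uri(uri: str) -> str:
    if uri.startswith("s3://"):
        return "gs://" + uri[len("s3://"):]
    return uri

assert map_legacy_uri("s3://my-bucket/path/model.pt") == "gs://my-bucket/path/model.pt"
```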
Sorry that these issues get quite deep and chaotic - we would appreciate any help or ideas you can think of!
Sorry to ask again, but the values are still showing up in the WebUI console logs this way (see screenshot).
Here is the config that I paste into the EC2 Autoscaler Setup:

```
agent {
    extra_docker_arguments: ["-e AWS_ACCESS_KEY_ID=XXXXXX", "-e AWS_SECRET_ACCESS_KEY=XXXXXX"]
    hide_docker_command_env_vars {
        enabled: true
        extra_keys: ["AWS_SECRET_ACCESS_KEY"]
        parse_embedded_urls: true
    }
}
```

Never mind, it came from setting the options wrong, it has to be ...
Results of a bit more investigation:

- The ClearML example does use the PyTorch `dist` package but none of the `DistributedDataParallel` functionality; instead, it reduces gradients "manually" (see the sketch after this list). This script is also not prepared for `torchrun`, as it launches additional processes itself (without using the multiprocessing of Python or PyTorch).
- When running a simple example (code attached...
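The gradient-averaging pattern referenced above, as a minimal illustration - this is the general "manual" reduction idiom, not the ClearML example's exact code, and it assumes `dist.init_process_group()` has already been called in every worker:

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    # Sum each parameter's gradient across all workers, then divide by the
    # world size so every worker holds the averaged gradient
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

Called between `loss.backward()` and `optimizer.step()` on every worker.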
Hi @<1523703436166565888:profile|DeterminedCrab71> and @<1523701070390366208:profile|CostlyOstrich36> , coming back to this after a while. It actually seems to be related to Google Cloud permissions:
- The images in the ClearML dashboard do not show, as discussed above
- If I copy the image URL (coming out as something like None ) and open it in another tab where I'm logged into my Google Account, the image loads
- If I do t...
So AgitatedDove14 if we use the `CLEARML_OFFLINE_MODE` environment variable instead, the program runs through again. The only thing is that now we get errors of the form

```
 0%| | 0/18 [00:00<?, ?image/s]ClearML running in offline mode, session stored in /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486
2022-11-07 07:49:06,986 - clearml.metrics - WARNING - Failed uploading to /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486/...
```
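For reference, a minimal sketch of setting the variable from code - project and task names are placeholders, and setting it before `clearml` is imported is the safe option:

```python
import os

# Enable offline mode before clearml is imported/initialized
# (equivalent to calling Task.set_offline(True) early on)
os.environ["CLEARML_OFFLINE_MODE"] = "1"

from clearml import Task

task = Task.init(project_name="ClearML-Debug", task_name="offline via env var")  # placeholder names
```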
More stack trace:

```
clearml-elastic | ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];
clearml-elastic | Likely root cause: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
clearml-elastic | at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
clearml-elastic | at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
clearml-el...
```
Although, some correction here: while the secret is indeed hidden in the logs, it is still visible in the "execution" tab of the experiment, see two screenshots below.
Once again I set them with `task.set_base_docker(docker_arguments=["..."])`
Yes totally, but we've been having problems getting these GPUs specifically (even manually in the EC2 console and across regions), so I thought maybe it's easier to get one big one than many small ones, but I've never actually checked if that is true 🙂 Thanks anyhow!
Hi AgitatedDove14 , so it took some time but I've finally managed to reproduce. The issue seems to be related to writing images via TensorBoard:

```python
from torch.utils.tensorboard import SummaryWriter
import torch
from clearml import Task, Logger

if __name__ == "__main__":
    task = Task.init(project_name="ClearML-Debug", task_name="[Mac] TB Logger, offline")
    tb_logger = SummaryWriter(log_dir="tb_logger/demo/")
    image_tensor = torch.rand(256, 256, 3)
    for iter in range(10):
        t...
```
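The loop body is truncated above, so for illustration only, here is a self-contained sketch of the kind of TensorBoard image-logging call it presumably makes - hypothetical, not the original script's exact code:

```python
from torch.utils.tensorboard import SummaryWriter
import torch

# Hypothetical stand-in for the truncated loop: log a random HWC image
# tensor to TensorBoard once per step
tb_logger = SummaryWriter(log_dir="tb_logger/demo/")
image_tensor = torch.rand(256, 256, 3)
for step in range(10):
    tb_logger.add_image("random_image", image_tensor, global_step=step, dataformats="HWC")
tb_logger.close()
```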
Hi John, thanks for getting back to me!
So it shows up in the UI as shown below. It happens both when "recording" the local run on Mac and on Linux.
I actually wanted to load a specific artifact, but didn't think of looking through the task's output models. I have now changed to that approach, which feels much safer, so we should be all done here. Thanks!
When running on our bigger research repository, which includes saving checkpoints and uploading to S3, the training ends with errors as shown below and a `Killed` message for the main process (I do not abort the main process manually):

```
2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver u...
```
Hi SuccessfulKoala55 , thanks for getting back to me!
In the docs of `Task.set_base_docker()` it says "When running remotely the call is ignored". Does that mean that this function call is executed when running locally to "record" the arguments, and then when I duplicate the experiment and run it remotely, the call is ignored and the recorded values are used?
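To make the question concrete, a minimal sketch of the call - the image and arguments are placeholders, and the `docker_image`/`docker_arguments` parameters assume a reasonably recent clearml version:

```python
from clearml import Task

task = Task.init(project_name="ClearML-Debug", task_name="base docker demo")  # placeholder names

# Recorded during the local run; when the cloned task is executed remotely,
# the call itself is ignored and the agent uses the recorded values
task.set_base_docker(
    docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04",  # placeholder image
    docker_arguments=["-e", "MY_VAR=value"],                # placeholder env var
)
```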