Hi SuccessfulKoala55, sorry there was a mistake on my end - clearml.conf pointed to the wrong URL.
SuccessfulKoala55 just in case you have any more thoughts, but we could also continue as is.
Hi DeterminedCrab71 and CostlyOstrich36, coming back to this after a while. It actually seems to be related to Google Cloud permissions:
- The images in the ClearML dashboard do not show, as discussed above
- If I copy the image URL (coming out as something like None) and open it in another tab where I'm logged into my Google Account, the image loads
- If I do t...
So my own repo I'm launching with either
torchrun --nproc_per_node 2 --standalone --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
or
python3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
Ok great! I will debug starting with a simpler training script.
Just as a last question, is torchrun also supported rather than the (now deprecated but still usable) torch.distributed.launch?
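One practical difference between the two launchers is how the local rank reaches the script: torchrun exports it as the LOCAL_RANK environment variable, while torch.distributed.launch traditionally passes a --local_rank (newer versions: --local-rank) command-line flag. A minimal sketch of a script-side helper that tolerates both, assuming nothing about my_script beyond that; the function name resolve_local_rank is made up for illustration:

```python
import argparse
import os


def resolve_local_rank(argv=None, env=None):
    """Return the local rank from LOCAL_RANK if set, else from the --local_rank flag."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes the rank as a flag (either spelling).
    parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
    # parse_known_args ignores the script's own options such as --some_option.
    args, _unknown = parser.parse_known_args(argv)
    # torchrun exports LOCAL_RANK instead of passing a flag, so prefer it when present.
    if "LOCAL_RANK" in env:
        return int(env["LOCAL_RANK"])
    return args.local_rank
```

Called with no arguments it reads sys.argv and os.environ, so the same script should start under either launcher.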
Sorry that these issues go quite deep and chaotic - we would appreciate any help or ideas you can think of!
To recap, the server started up on GCP as expected before migrating the data over. The migration was done by
- deleting the current data
sudo rm -fR /opt/clearml/data/*
- unpacking the backup
sudo tar -xzf ~/clearml_backup_data.tgz -C /opt/clearml/data
- setting permissions
sudo chown -R 1000:1000 /opt/clearml
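A quick way to double-check the last step is to walk the directory and list anything that is not owned by 1000:1000, the owner set by the chown above. This is only a sketch: the helper name paths_not_owned_by is made up for illustration, and it needs read access to the whole tree (so probably sudo).

```python
import os


def paths_not_owned_by(root, uid, gid=None):
    """Return paths under root (root included) whose owner is not uid (and gid, if given)."""
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        candidates.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    mismatches = []
    for path in candidates:
        info = os.lstat(path)  # lstat: report on symlinks themselves, do not follow them
        if info.st_uid != uid or (gid is not None and info.st_gid != gid):
            mismatches.append(path)
    return mismatches
```

For the setup above the call would be paths_not_owned_by("/opt/clearml", 1000, 1000); an empty list means the chown reached everything.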
So AgitatedDove14 if we use the CLEARML_OFFLINE_MODE environment variable instead, the program runs through again.
The only thing is that now we get errors of the form
0%| | 0/18 [00:00<?, ?image/s]ClearML running in offline mode, session stored in /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486
2022-11-07 07:49:06,986 - clearml.metrics - WARNING - Failed uploading to /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486/...
By the way, if we don't wrap other calls in is_offline() we get errors like "DateTime object is not serializable", but that's a secondary issue.
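Wrapping every call by hand gets repetitive, so one option is a small decorator that turns a function into a no-op while the offline check is true. This is a rough sketch, not something ClearML provides: skip_when, OFFLINE and report_timestamp are made-up names, and the plain flag stands in for the is_offline() check so the snippet runs without ClearML installed.

```python
import functools


def skip_when(predicate):
    """Decorator factory: make the wrapped call a no-op while predicate() is true."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if predicate():
                return None  # skipped, nothing is reported
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Stand-in for the real offline check (assumption: in a real script the
# predicate would be the is_offline() call mentioned above).
OFFLINE = {"enabled": True}


@skip_when(lambda: OFFLINE["enabled"])
def report_timestamp(sink, value):
    """Hypothetical reporting call that should only run when online."""
    sink.append(value)
```

The predicate is evaluated on every call, so flipping the flag at runtime takes effect immediately.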
SuccessfulKoala55 AgitatedDove14 So I've tried the approach and it does work, however, this of course results in the credentials being visible in the ClearML web interface output, which comes close to just hard-coding them in…
Is there any way to send the secrets safely?
Is there any way to access the clearml.conf file contents from within code? (afaik, the file does not get sent over to the container - otherwise I could just yml-read it myself…)
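One idea that might avoid putting values into the task at all: docker run accepts -e NAME without a value, in which case Docker copies the value from the environment of the process that invokes it. A sketch of building such arguments follows. The big assumptions are that the variables are already set in the environment the agent runs in, and that the agent forwards the arguments to docker unchanged; neither is verified here. The helper name docker_env_passthrough is made up.

```python
def docker_env_passthrough(names):
    """Build `-e NAME` docker arguments that carry variable names but never values."""
    args = []
    for name in names:
        if "=" in name:
            # Refuse NAME=value so a secret cannot slip into the recorded arguments.
            # Only the part before "=" is echoed, to keep the value out of the error.
            raise ValueError("pass only the variable name: " + name.split("=")[0])
        args.extend(["-e", name])
    return args
```

The result would then go into the docker arguments instead of the explicit NAME=value pairs, so the web UI would only ever show the variable names.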
Hi SuccessfulKoala55 , thanks for getting back to me!
In the docs of Task.set_base_docker() it says "When running remotely the call is ignored". Does that mean that this function call is executed when running locally to "record" the arguments, and then when I clone the experiment and run it remotely, the call is ignored and the recorded values are used?
That was the missing piece - thank you!
Awesome to see all the details you have considered in ClearML!
When running on our bigger research repository which includes saving checkpoints and uploading to S3, the training ends with errors as shown below and a Killed message for the main process (I do not abort the main process manually):
2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver u...
Ok, I re-checked and saw that the data was indeed cached and re-loaded - maybe I waited a little too long last time and it was already a new instance. Awesome implementation guys!
More stack trace:
clearml-elastic | ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];
clearml-elastic | Likely root cause: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
clearml-elastic | at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
clearml-elastic | at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
clearml-el...
CostlyOstrich36 thank you, now everything works so far!
Last thing: Is there any way to change all the links in the new ClearML server such that an artifact that was previously under s3://… is now taken from gs://…? The actual data is already available under the gs:// link of course.
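To make concrete what I mean by "changing the links": every stored URL that starts with the old prefix would be rewritten to the new one, and everything else left alone. A tiny sketch of just that mapping; the bucket names in the usage note are placeholders, and how to apply it to whatever the server actually stores is exactly the open question.

```python
def rewrite_storage_url(url, old_prefix, new_prefix):
    """Swap old_prefix for new_prefix at the start of url; leave other URLs untouched."""
    if url.startswith(old_prefix):
        return new_prefix + url[len(old_prefix):]
    return url
```

For example, rewrite_storage_url("s3://old-bucket/models/a.pt", "s3://old-bucket/", "gs://new-bucket/") gives "gs://new-bucket/models/a.pt", while an https:// link passes through unchanged.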
Hi Jake, yes I'd love to! Just a question: how clean and complete does the example need to be? For example, this code relies on you building a correct Machine Image on GCP (which is somewhat unrelated to ClearML) and it does not get the logs from the agent instances - is that still good enough?
Yes totally, but we've been having problems getting these GPUs specifically (even manually in the EC2 console and across regions), so I thought maybe it's easier to get one big one than many small ones, but I've never actually checked if that is true. Thanks anyhow!
So the container itself gets deleted but everything is still cached because the cache directory is mounted to the host machine in the same place? Makes absolute sense and is what I was hoping for, but I can't confirm this currently - I can see that the data is reloaded each time, even if the machine was not shut down in between. I'll check again to find the cached data on the machine.
Happy to and thanks!
Hi AgitatedDove14, so it took some time but I've finally managed to reproduce it. The issue seems to be related to writing images via Tensorboard:
from torch.utils.tensorboard import SummaryWriter
import torch
from clearml import Task, Logger

if __name__ == "__main__":
    task = Task.init(project_name="ClearML-Debug", task_name="[Mac] TB Logger, offline")
    tb_logger = SummaryWriter(log_dir="tb_logger/demo/")
    image_tensor = torch.rand(256, 256, 3)
    for iter in range(10):
        t...
Although, some correction here: While the secret is indeed hidden in the logs, it is still visible in the "execution" tab of the experiment, see two screenshots below.
Once again I set them with task.set_base_docker(docker_arguments=["..."])
Won't they be printed out as well in the Web UI? That shows the full Docker command for running the task, right…
Yes makes sense, it sounded like that from the start. Luckily, the task.flush(...) way seems to work for now.
AgitatedDove14 maybe to come at this from a broader angle:
Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments? If so, would it be started via python ... or via torchrun ...? What about remote runs, how will they support the parallel execution? To go even deeper, what about the machines started via ClearML Autoscaler? Can they either run multiple agents on them and/or start remote distribu...
Results of a bit more investigation:
The ClearML example does use the PyTorch dist package but none of the DistributedDataParallel functionality; instead, it reduces gradients "manually". This script is also not prepared for torchrun as it launches more processes itself (w/o using the multiprocessing of Python or PyTorch).
When running a simple example (code attached...
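For context on the "manual" reduction: it usually means summing each gradient across all processes with an all-reduce and dividing by the number of processes, which DistributedDataParallel otherwise does automatically during the backward pass. The sketch below simulates that averaging in plain Python with made-up numbers. It is not the code of the ClearML example, and all_reduce_mean is an illustrative name.

```python
def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers (all-reduce sum, then divide)."""
    world_size = len(per_worker_grads)
    # zip(*...) groups the i-th gradient of every worker together.
    summed = [sum(values) for values in zip(*per_worker_grads)]
    return [total / world_size for total in summed]


# Two simulated workers, each holding gradients for the same three parameters.
grads = [
    [0.5, -1.0, 2.0],  # worker 0
    [1.5, 1.0, 0.0],   # worker 1
]
averaged = all_reduce_mean(grads)  # every worker would apply these same values
```

After this step every process holds identical gradients, which is what keeps the model replicas in sync.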
Hey guys, really appreciating the help here!
So what I meant by "it does work" is that the environment variables go through to the container, I can use them there, everything runs.
The remaining problem is that this way, they are visible in the ClearML web UI which is potentially unsafe / bad practice, see screenshot below.
Well duh, now it makes total sense! Should have checked docs or examples more closely.
Yes if that works reliably then I think that option could make sense, it would have made things somewhat easier in my case - but this is just as good.