Hi @<1590514584836378624:profile|AmiableSeaturtle81> , I think you are right. We will try to look into this asap
Hi @<1689446563463565312:profile|SmallTurkey79> ! Prior runs of this pipeline worked just fine
What SDK version were you using for the prior runs? Does this still happen if you revert to that version?
Can you provide a script that imitates what you are doing?
In the pipeline you are running, are you creating new tasks/pipelines/datasets?
would it be on the pipeline task itself then, since that's what's disappearing?
that's likely the case
are you running this locally or are you enqueueing the task (controller)?
can you share the logs of the controller?
Hi @<1702492411105644544:profile|YummyGrasshopper29> ! Parameters can belong to different sections. You should prepend the section name to some_parameter. You likely want ${step2.parameters.kwargs/some_parameter}
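A rough sketch of what I mean (assuming you wire the steps with add_function_step; step2, some_parameter and the process function are placeholders based on your setup):
from clearml import PipelineController

def process(some_parameter):
    # placeholder step function
    print(some_parameter)

pipe = PipelineController(name="my pipeline", project="examples", version="1.0.0")
# ... step1 / step2 added above ...
pipe.add_function_step(
    name="step3",
    function=process,
    # note the "kwargs" section prefix before the parameter name
    function_kwargs={"some_parameter": "${step2.parameters.kwargs/some_parameter}"},
)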
Hi @<1724235687256920064:profile|LonelyFly9> ! I think that just adding some retries to exists_file is a good idea, so maybe we will do just that 👍
do you have any STATUS REASON under the INFO section of the controller task?
because I think that what you are encountering now is an NCCL error
can you send the full logs of rank0 and rank1 tasks?
Hi @<1523702652678967296:profile|DeliciousKoala34> ! Looks like this is a bug in set_metadata. The model ID is not set, and set_metadata doesn't set it automatically. I would first upload the model file, then set the metadata to avoid this bug. You can call update_weights to do that. None
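Roughly the workaround I have in mind (a sketch; the file name and metadata key/value are placeholders):
from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="model metadata workaround")
model = OutputModel(task=task)
# upload the weights first so the model gets an ID on the backend...
model.update_weights(weights_filename="model.pt")  # placeholder file name
# ...then setting metadata works as expected
model.set_metadata("framework", "pytorch")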
does it work running this without clearml? @<1578555761724755968:profile|GrievingKoala83>
I will ask internally about this
@<1578555761724755968:profile|GrievingKoala83> did you call task.launch_multi_node(4) or task.launch_multi_node(2)? I think the right value is 4 in this case
Hi @<1689446563463565312:profile|SmallTurkey79> ! I will take a look at this and try to replicate the issue. In the meantime, I suggest you look into other dependencies you are using. Maybe some dependency got upgraded and the upgrade now triggers this behaviour in clearml.
Anyhow, there is a serialization_function argument you could use in upload_artifact. I could imagine that we don't properly serialize your artifacts. You could use the argument to pass a callback that would efficiently serialize the artifact. Notice that getting the artifact back requires a matching deserialization function
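Something along these lines (a sketch; my_object and the pickle-based callbacks are just placeholders for whatever serialization is efficient for your data):
import pickle

from clearml import Task

task = Task.init(project_name="examples", task_name="custom artifact serialization")
my_object = {"some": "data"}  # placeholder

# pass a custom serializer so clearml doesn't fall back to its default handling
task.upload_artifact(
    name="my_artifact",
    artifact_object=my_object,
    serialization_function=lambda obj: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL),
)

# when reading it back, pass the matching deserializer
restored = Task.get_task(task_id=task.id).artifacts["my_artifact"].get(
    deserialization_function=pickle.loads
)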
Hi @<1724235687256920064:profile|LonelyFly9> ! ClearML does not allow for those to be configured, but you might consider setting AWS_RETRY_MODE and AWS_MAX_ATTEMPTS env vars. Docs from boto3: None
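For example, something like this before the clearml/boto3 session is created (the values are just an illustration, see the boto3 retry docs for the options):
import os

# boto3 picks these up when the S3 client is created
os.environ["AWS_RETRY_MODE"] = "standard"
os.environ["AWS_MAX_ATTEMPTS"] = "5"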
Hi @<1603198134261911552:profile|ColossalReindeer77> ! The usual workflow is that you modify the fields of your remote run in either the Hyperparameters section or the Configuration section, but not usually both (as in Hydra's case). When using CLI tools, people mostly modify the Hyperparameters section, so we chose to set allow_omegaconf_edit to False by default for parity.
Hi @<1639799308809146368:profile|TritePigeon86> ! Please see continue_behaviour. You should be able to pass the parameter to your parent step. It is not documented yet, but it should be available in the latest version of clearml. See this for some documentation: None
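Roughly how I'd expect it to be passed to a step (a sketch; the exact dict keys are my assumption since this isn't documented yet, so please check against the clearml source):
from clearml import PipelineController

def train_step():
    # placeholder step function
    print("training")

pipe = PipelineController(name="pipeline with continue_behaviour", project="examples", version="1.0.0")
pipe.add_function_step(
    name="parent_step",
    function=train_step,
    # the "continue_on_fail" key is an assumption, verify against the source
    continue_behaviour={"continue_on_fail": True},
)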
Hi @<1674226153906245632:profile|PreciousCoral74> !
Sadly, Logger.report_matplotlib_figure(…) doesn't seem to log plots. Only the automatic integration appears to behave.
What do you mean by that? report_matplotlib_figure should work. See this example on how to use it: None.
If it still doesn't work for you, could you please share a code snippet that could help us track down...
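For reference, a minimal snippet that should report a figure manually (project/task names are placeholders):
import matplotlib.pyplot as plt
from clearml import Task

task = Task.init(project_name="examples", task_name="manual matplotlib report")

fig = plt.figure()
plt.plot([1, 2, 3], [4, 5, 6])

# report the figure explicitly instead of relying on the automatic integration
task.get_logger().report_matplotlib_figure(
    title="manual plot", series="line", iteration=0, figure=fig
)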
@<1578555761724755968:profile|GrievingKoala83> what error are you getting when using gloo? Is it the same one?
@<1578555761724755968:profile|GrievingKoala83> Looks like something inside NCCL now fails which doesn't allow rank0 to start. Are you running this inside a docker container? What is the output of nvidia-smi inside this container?
@<1578555761724755968:profile|GrievingKoala83> does it work properly when gpus=1? Also, what are the values found under Initializing distributed: GLOBAL_RANK: , MEMBER: in the 2 scenarios, for each task?
One more thing: it's likely that you should do task.launch_multi_node(args.nodes * args.gpus) instead, as I see that the world size set by lightning corresponds to this value
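i.e. something like this (a sketch, assuming args.nodes and args.gpus come from your argument parser):
from argparse import ArgumentParser
from clearml import Task

parser = ArgumentParser()
parser.add_argument("--nodes", type=int, default=2)
parser.add_argument("--gpus", type=int, default=2)
args = parser.parse_args()

task = Task.init(project_name="examples", task_name="multi node")
# the world size lightning expects is nodes * gpus-per-node
config = task.launch_multi_node(args.nodes * args.gpus)
print(config.get("node_rank"))  # rank assigned to this process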
Hi @<1578555761724755968:profile|GrievingKoala83> ! We have released clearml==1.16.3rc1, which should solve the issue now. Just specify task.launch_multi_node(nodes, devices=gpus). For example:
import sys
import os
from argparse import ArgumentParser
import pytorch_lightning as pl
from pytorch_lightning.strategies.ddp import DDPStrategy
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from...
do you have the logs of the agent that is supposed to run your pipeline? Maybe there is a clue there. I would also suggest enqueuing the pipeline to some other queue, and maybe even running the agent on your own machine if you don't already, to see what happens