OutrageousSheep60 1.8.4rc1 is out. Can you please try it? pip install -U clearml==1.8.4rc1
Hi SoreHorse95 ! I think that the way we interact with hydra doesn't account for overrides. We will need to look into this. In the meantime, do you have some sort of stack trace or error log you could share?
Hi @<1719524641879363584:profile|ThankfulClams64> ! What tensorflow/keras version are you using? I noticed that in the TensorBoardImage you are using tf.Summary, which no longer exists since tensorflow 2.2.3, which I believe is too old to work with tensorboard==2.16.2.
Also, how are you stopping and starting the experiments? When starting an experiment, are you resuming training? In that case, you might want to consider setting the initial iteration to the last iteration your prog...
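To illustrate setting the initial iteration when resuming, here is a minimal sketch (the SDK calls are commented out so the offset arithmetic stands alone; `last_iteration` is a hypothetical value saved from the previous run):

```python
# Sketch: offsetting iterations when resuming training. Assumes the clearml
# SDK; the actual calls are commented out and the names are illustrative.
# from clearml import Task
# task = Task.init(project_name="examples", task_name="my-training",
#                  continue_last_task=True)   # resume the previous task
# task.set_initial_iteration(last_iteration)  # reports continue from here

last_iteration = 1000        # hypothetical: where the previous run stopped
step_in_resumed_run = 10     # current step inside the resumed run
effective_iteration = last_iteration + step_in_resumed_run
print(effective_iteration)
```

This way scalars reported in the resumed run line up after the previous run's curves instead of overwriting them from iteration 0.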
Hi EcstaticMouse10 ! Are you using the latest clearml sdk version? If not, can you please upgrade and tell us if you still have this issue?
Hi @<1724235687256920064:profile|LonelyFly9> ! I assume in this case we fail to retrieve the dataset? Can you provide an example when this happens?
Hi @<1534706830800850944:profile|ZealousCoyote89> ! Do you have any info under STATUS REASON? See the screenshot for an example:
Hi @<1578555761724755968:profile|GrievingKoala83> ! The only way I see this error appearing is:
- your process gets forked while launch_multi_node is called
- there has been a network error when receiving the response to Task.enqueue, then the call has been retried, resulting in this error
Can you verify one or the other?
Hi @<1578555761724755968:profile|GrievingKoala83> ! Can you share the logs of all the tasks after setting NCCL_DEBUG=INFO? Also, did it work for you 5 months ago because you were on another clearml version? If it works with another version, can you share that version number?
Hi @<1631102016807768064:profile|ZanySealion18> ! Reporting None is not possible, but you could report np.nan instead.
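A minimal sketch of what I mean (the `report_scalar` call is commented out and assumes an already-initialized task; the title/series names are illustrative):

```python
import math

# Reporting None is not possible, but NaN works as a missing-value marker.
# value = None        # this would fail
value = float("nan")  # np.nan behaves the same way

# Assuming an initialized ClearML task (hypothetical names):
# task.get_logger().report_scalar(
#     title="loss", series="val", value=value, iteration=0)

print(math.isnan(value))
```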
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , I think you are right. We will try to look into this asap
Hi @<1590514584836378624:profile|AmiableSeaturtle81> ! You could get the Dataset Struct configuration object and read the job_size from there, which is the dataset size in bytes. By the way, the task IDs of the datasets are the same as the datasets' IDs, so you can call all the clearml task-related functions on the task you get by doing Task.get_task("dataset_id")
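Roughly, a sketch of the lookup (the SDK calls are commented out, and the exact layout of the Dataset Struct object may differ per version, so treat the field access as an assumption to verify against your own task):

```python
# Sketch (hypothetical field names; check against your own Dataset Struct):
# from clearml import Task
# task = Task.get_task("dataset_id")  # dataset ID == its task ID
# struct = task.get_configuration_object_as_dict("Dataset Struct")
# size_bytes = struct["0"]["job_size"]  # layout may differ per version

size_bytes = 157_286_400            # hypothetical value, in bytes
size_mb = size_bytes / (1024 ** 2)  # convert to MB for readability
print(f"{size_mb:.1f} MB")
```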
Hi @<1590514584836378624:profile|AmiableSeaturtle81> ! Having tqdm installed in your environment might help
Hi @<1590514584836378624:profile|AmiableSeaturtle81> ! Looks like remove_files doesn't support lists indeed. It does support paths with wildcards tho, if that helps.
As a workaround for now, I would remove all the files from the dataset and add back only the ones you need, or just create a new dataset
otherwise, you could run this as a hack:
dataset._dataset_file_entries = {
    k: v
    for k, v in dataset._dataset_file_entries.items()
    if k not in files_to_remove  # you need to define this set of paths
}
then call dataset.remove_files with a path that doesn't exist in the dataset.
@<1590514584836378624:profile|AmiableSeaturtle81> note that we zip the files before uploading them as artifacts to the dataset task. Any chance you are specifying the default output uri as being a local path, such as /tmp?
Hi @<1523701842515595264:profile|PleasantOwl46> ! This looks like a python problem. A useful SO thread: None
First, I would verify that I can access the api server without using the SDK. To do so, run this code after filling the credentials yourself (just login should be enough to verify that the api server is reachable)
api_server = ""
access_key = ""
secret_ke...
Yes, it should work with ClearML if it works with requests
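For the verification step, a small standard-library sketch of reaching the api server (the `auth.login` endpoint path and the Basic-auth scheme are what I'd expect the server to accept; the URL and credentials here are placeholders you must replace):

```python
import base64
import urllib.request

api_server = "https://api.clear.ml"  # placeholder: your own api server URL
access_key = "ACCESS"                # placeholder credentials
secret_key = "SECRET"

# The key pair is sent as HTTP Basic auth to the login endpoint
token = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()
req = urllib.request.Request(
    f"{api_server}/auth.login",
    headers={"Authorization": f"Basic {token}"},
)
# urllib.request.urlopen(req)  # uncomment to actually hit the server
print(req.full_url)
```

A 200 response with a token in the body means the api server is reachable and the credentials are valid, independent of the SDK.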
Hi @<1594863230964994048:profile|DangerousBee35> ! This GH issue might be relevant to you: None
Did you try sudo update-ca-certificates? Maybe that will work
Hi @<1594863230964994048:profile|DangerousBee35> ! This looks like an ok solution, but I would make the package pip-installable and push it to another repo, then add that repo to a requirements file such that the agent can install it. Other than that, I can’t really think of another easy way to use your package
do you have any STATUS REASON under the INFO section of the controller task?
can you share the logs of the controller?
Hi @<1689446563463565312:profile|SmallTurkey79> ! Regarding "Prior runs of this pipeline worked just fine":
What SDK version were you using for the prior runs? Does this still happen if you revert to that version?
Can you provide a script that imitates what you are doing?
In the pipeline you are running, are you creating new tasks/pipelines/datasets?
are you running this locally or are you enqueueing the task (controller)?
Hi DangerousDragonfly8 ! At the moment, this is not possible, but we do have it in plan (we had some prior requests for this feature)
@<1523701304709353472:profile|OddShrimp85> I believe you need to set the repo argument to point to your repository
You could consider downgrading to something like 1.7.1 in the meantime; it should work with that version
hi OutrageousSheep60 ! We didn't release an RC yet, we will a bit later today tho. We will ping you when it's ready, sorry for the delay
Hi @<1546303277010784256:profile|LivelyBadger26> ! You can either manually change it in the UI, or use None to set it in your code. Our modified example:
import hydra
from omegaconf import OmegaConf
from clearml import Task

@hydra.main(config_path="config_files", config_name="config", version_base=None)
def my_app(cfg):
    # type: (DictConfig) -> None
    task = Task.init(project_name="examples", task_name="Hy...