Hi @<1724235687256920064:profile|LonelyFly9> ! I think that just adding some retries to `exists_file` is a good idea, so maybe we will do just that 👍
Hi @<1590514584836378624:profile|AmiableSeaturtle81> ! Looks like `remove_files` doesn't support lists indeed. It does support paths with wildcards though, if that helps.
As a workaround for now, I would remove all the files from the dataset and add back only the ones you need, or just create a new dataset.
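For example, a minimal sketch of the wildcard workaround (the dataset ID, names and path patterns below are placeholders):
```
from clearml import Dataset

# placeholders: dataset ID, project/dataset names and wildcard patterns
parent = Dataset.get(dataset_id="existing-dataset-id")
pruned = Dataset.create(
    dataset_name="my-dataset-pruned",
    dataset_project="my-project",
    parent_datasets=[parent.id],
)

# remove_files accepts a wildcard path rather than a list,
# so call it once per pattern
pruned.remove_files(dataset_path="images/*.png")
pruned.remove_files(dataset_path="labels/*.json")

pruned.upload()
pruned.finalize()
```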
because I think that what you are encountering now is an NCCL error
Hi BoredHedgehog47 ! We tried to reproduce this, but failed. What we tried is running the attached `main.py`, which `Popen`s `sub.py`.
Can you please run `main.py` as well and tell us if you still encounter the bug? If not, is there anything else you can think of that could trigger this bug besides creating a subprocess?
Thank you!
Hi @<1714451218161471488:profile|ClumsyChimpanzee54> ! We will automatically add the cwd of the pipeline controller to the python path when running locally in a future version.
If running remotely, you can approach this in a few ways (a sketch of the first option follows this list):
- add the whole project to a git repo and specify that repo in the pipeline steps
- have a prebuilt docker image that contains your project's code; you may then set the working directory to the path of your project
- if the agent running the docker is running ...
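A minimal sketch of the first option, assuming a `PipelineController` with function steps; the repo URL, queue and names are placeholders:
```
from clearml import PipelineController


def my_train_function(epochs):
    # placeholder step body; your real project code goes here
    print(f"training for {epochs} epochs")


# placeholders: pipeline/project names, repo URL and queue
pipe = PipelineController(name="my-pipeline", project="my-project")

pipe.add_function_step(
    name="train",
    function=my_train_function,
    function_kwargs={"epochs": 10},
    # attach the git repo that contains the whole project, so the step's
    # remote working copy has your package importable
    repo="https://github.com/my-org/my-project.git",
    repo_branch="main",
)

pipe.start(queue="default")
```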
Hi @<1578555761724755968:profile|GrievingKoala83> ! Are you trying to launch 2 nodes, each using 2 GPUs, on only 1 machine? I think that will likely not work because of an NCCL limitation.
Also, I think that you should actually do:
```
current_conf = task.launch_multi_node(nodes)
os.environ["LOCAL_RANK"] = "0"  # this process should fork the other one
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["GLOBAL_RANK"] = str(current_conf.get("node_rank", 0) * gpus)
os.environ["WORLD...
```
does it work running this without clearml? @<1578555761724755968:profile|GrievingKoala83>
Hi JitteryCoyote63 ! Your clearml-agent is likely run with python3.9. Can you try setting this entry https://github.com/allegroai/clearml-agent/blob/ebb955187dea384f574a52d059c02e16a49aeead/docs/clearml.conf#L48 in your clearml.conf to `python3.8`, or to the full path of python3.8 if that doesn't work?
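For reference, a sketch of what that clearml.conf entry could look like (the interpreter path is an assumption; adjust it to your system):
```
agent {
    # use this interpreter (or full path) when the agent builds its environments
    python_binary: "/usr/bin/python3.8"
}
```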
@<1578555761724755968:profile|GrievingKoala83> did you call `task.launch_multi_node(4)` or `task.launch_multi_node(2)`? I think the right value is 4 in this case.
One more thing: it's likely that you should do `task.launch_multi_node(args.nodes * args.gpus)` instead, as I see that the world size set by lightning corresponds to this value.
Hi @<1719524641879363584:profile|ThankfulClams64> ! What tensorflow/keras version are you using? I noticed that in `TensorBoardImage` you are using `tf.Summary`, which no longer exists since tensorflow 2.2.3, which I believe is too old to work with tensorboard==2.16.2.
Also, how are you stopping and starting the experiments? When starting an experiment, are you resuming training? In that case, you might want to consider setting the initial iteration to the last iteration your prog...
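If it helps, a minimal sketch of the TF2-style replacement for the removed `tf.Summary` image logging (the dummy image batch and log directory are assumptions):
```
import numpy as np
import tensorflow as tf

# placeholder: a dummy 64x64 RGB image batch stands in for your real data
images = np.random.rand(1, 64, 64, 3).astype("float32")

# tf.summary.image is the TF2 replacement for the old tf.Summary protobufs
writer = tf.summary.create_file_writer("logs/images")
with writer.as_default():
    tf.summary.image("debug_image", images, step=0)
writer.flush()
```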
@<1578555761724755968:profile|GrievingKoala83> what error are you getting when using gloo? Is it the same one?
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , I think you are right. We will try to look into this asap
Hi @<1726047624538099712:profile|WorriedSwan6> ! At the moment, only the `function_kwargs` and `queue` parameters accept such references. We will consider supporting them for other fields as well in the near future.
Hi @<1523701713440083968:profile|PanickyMoth78> ! Make sure you are calling `Task.init` in `my_function` (this is because the bindings made by clearml will be lost in a spawned process, as opposed to a forked one). Also make sure that, in the spawned process, the `CLEARML_PROC_MASTER_ID` env var is set to the pid of the master process and `CLEARML_TASK_ID` is set to the ID of the task initialized in the master process (this should happen automatically).
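A minimal sketch of that setup, assuming a spawned worker and placeholder project/task names:
```
import multiprocessing as mp

from clearml import Task


def my_function():
    # Re-initialize inside the spawned process: the clearml bindings made in the
    # parent are lost with "spawn" (unlike "fork"). With CLEARML_PROC_MASTER_ID /
    # CLEARML_TASK_ID propagated (normally automatic), this should attach to the
    # master's task rather than create a new one.
    task = Task.init(project_name="my-project", task_name="spawn-example")
    task.get_logger().report_text("hello from the spawned process")


if __name__ == "__main__":
    Task.init(project_name="my-project", task_name="spawn-example")
    mp.set_start_method("spawn", force=True)
    p = mp.Process(target=my_function)
    p.start()
    p.join()
```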
Hi HandsomeGiraffe70 ! You could try setting `dataset.preview.tabular.table_count` to 0 in your clearml.conf file.
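A sketch of that clearml.conf entry, assuming it sits under the sdk section of the file:
```
sdk {
    dataset {
        preview {
            tabular {
                # disable tabular previews for dataset files
                table_count: 0
            }
        }
    }
}
```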
@<1578555761724755968:profile|GrievingKoala83> Looks like something inside NCCL now fails, which doesn't allow rank 0 to start. Are you running this inside a docker container? What is the output of `nvidia-smi` inside of this container?
Thank you! We will take a look and get back to you.
@<1719162259181146112:profile|ShakySnake40> the data is still present in the parent and it won't be uploaded again. Also, when you pull a child dataset you are also pulling the dataset's parent data. `dataset.id` is a string that uniquely identifies each dataset in the system. In my example, you are using the ID to reference a dataset which would be a parent of the newly created dataset (that is, after getting the dataset via `Dataset.get`).
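For illustration, a minimal sketch of that parent/child relationship (the parent ID and names are placeholders):
```
from clearml import Dataset

# placeholder: the ID of the dataset you want to build on top of
parent = Dataset.get(dataset_id="parent-dataset-id")

# the child only stores the delta; the parent's data is reused, not re-uploaded
child = Dataset.create(
    dataset_name="my-dataset-v2",
    dataset_project="my-project",
    parent_datasets=[parent.id],
)
child.add_files("new_samples/")
child.upload()
child.finalize()

# pulling the child also pulls the parent's data
local_path = Dataset.get(dataset_id=child.id).get_local_copy()
```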
Yes, see minio instructions under this: None
Hi @<1523702000586330112:profile|FierceHamster54> ! This is currently not possible, but I have a workaround in mind. You could use the `artifact_serialization_function` parameter in your pipeline. The function should return a byte stream of the zipped content of your data, with whichever compression level you have in mind.
If I'm not mistaken, you wouldn't even need to write a deserialization function in your case, because we should be able to unzip your data just fine.
Wdyt?
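A minimal sketch of such a serialization function, assuming a `PipelineController`-based pipeline, gzip compression and placeholder names:
```
import gzip
import pickle

from clearml import PipelineController


def zip_serializer(obj):
    # serialize the object and return the compressed bytes;
    # the compression level (6 here) is just an example
    return gzip.compress(pickle.dumps(obj), compresslevel=6)


# placeholders: pipeline/project names
pipe = PipelineController(
    name="my-pipeline",
    project="my-project",
    artifact_serialization_function=zip_serializer,
)
```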
Check the `output_uri` parameter in `Task.init`.
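For example, a sketch with a placeholder bucket URI and names:
```
from clearml import Task

# placeholders: project/task names and the destination bucket
task = Task.init(
    project_name="my-project",
    task_name="my-task",
    output_uri="s3://my-bucket/clearml-artifacts",
)
```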
Hi OutrageousSheep60 ! The `list_datasets` function is currently broken and will be fixed in the next release.
Hi @<1570583237065969664:profile|AdorableCrocodile14> ! `get_local_copy` will always copy/download external files to a folder. To get the external files, there is a property on the dataset called `link_entries`, which returns a list of `LinkEntry` objects. Each of these has a `link` attribute, and each such link should point to an external file (in this case, your local paths prefixed with `file://`).
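A short sketch of reading those links (the dataset ID is a placeholder):
```
from clearml import Dataset

# placeholder: the ID of the dataset that contains external files
dataset = Dataset.get(dataset_id="my-dataset-id")

# list the external entries without downloading them
for entry in dataset.link_entries:
    # for locally-added external files this prints file:// paths
    print(entry.link)
```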
If the task is running remotely and the parameters are populated, then the local run parameters will not be used; instead, the parameters that are already on the task will be used. This is because we want to allow users to change these parameters in the UI if they want to, so the parameters that are in the code are ignored in favor of the ones in the UI.
Hi OutrageousSheep60 ! Regarding your questions:
- No, it's not. We will have an RC that fixes that ASAP, hopefully by tomorrow.
- You can use `add_external_files`, which you already do. If you wish to upload local files to the bucket, you can specify the `output_url` of the dataset to point to the bucket you wish to upload the data to (see the sketch after this list). See the parameter here: https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload . Note that you CAN mix external_files and regular files. We don't hav...
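A sketch of that flow, with placeholder names, paths and bucket URLs:
```
from clearml import Dataset

# placeholders: names, local path, source and destination buckets
dataset = Dataset.create(dataset_name="mixed-dataset", dataset_project="my-project")

# external files are only referenced, not uploaded
dataset.add_external_files(source_url="s3://source-bucket/raw-data/")

# regular local files will be uploaded
dataset.add_files(path="local_data/")

# upload the local files to the bucket of your choice
dataset.upload(output_url="s3://my-bucket/datasets")
dataset.finalize()
```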
Hi @<1523703652059975680:profile|ThickKitten19> ! Could you try increasing the `max_iteration_per_job` and check if that helps? Also, any chance that you are fixing the number of epochs to 10, either through a hyperparameter, e.g. `DiscreteParameterRange("General/epochs", values=[10])`, or simply by calling something like `model.fit(epochs=10)`?
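For context, a minimal sketch of where `max_iteration_per_job` fits in an optimizer setup; the base task ID, metric names and ranges are placeholders:
```
from clearml.automation import (
    DiscreteParameterRange,
    HyperParameterOptimizer,
    RandomSearch,
    UniformParameterRange,
)

# placeholders: base task ID, metric title/series, parameter ranges and limits
optimizer = HyperParameterOptimizer(
    base_task_id="base-task-id",
    hyper_parameters=[
        UniformParameterRange("General/lr", min_value=1e-4, max_value=1e-1),
        DiscreteParameterRange("General/epochs", values=[10, 20, 50]),
    ],
    objective_metric_title="validation",
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=RandomSearch,
    execution_queue="default",
    # raise this so jobs are not stopped before reporting enough iterations
    max_iteration_per_job=100000,
    total_max_jobs=20,
)
optimizer.start()
```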
Regarding 1.
, are you trying to delete the project from the UI? (I can't see an attached image in your message)
Regarding number 2.
, that is indeed a bug and we will try to fix it as soon as possible
Hi @<1578555761724755968:profile|GrievingKoala83> ! We have released `clearml==1.16.3rc1`, which should solve the issue now. Just specify `task.launch_multi_node(nodes, devices=gpus)`. For example:
```
import sys
import os
from argparse import ArgumentParser
import pytorch_lightning as pl
from pytorch_lightning.strategies.ddp import DDPStrategy
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from...
```