ShinyPuppy47 does add_task_init_call help your case? https://clear.ml/docs/latest/docs/references/sdk/task/#taskcreate
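For reference, a minimal sketch of what I mean (the project/task names and script path are just examples):
from clearml import Task

# create a task from an existing script; add_task_init_call (default True)
# injects the Task.init call into the script for you
task = Task.create(
    project_name="examples",
    task_name="script without explicit init",
    script="train.py",  # hypothetical script path
    add_task_init_call=True,
)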
Don't call PipelineController functions after start has finished. Use a post_execute_callback instead:
from clearml import PipelineController

def some_step():
    return

def upload_model_to_controller(controller, node):
    print("Start uploading the model")

if __name__ == "__main__":
    pipe = PipelineController(name="Yolo Pipeline Controller", project="yolo_pipelines", version="1.0.0")
    pipe.add_function_step(
        name="some_step",
        function=some_step,
        post_execute_callback=upload_model_to_controller,
    )
    pipe.start()
That is very odd. Is the script above all you're running?
Try examples/.pipelines/custom pipeline logic instead of pipeline_project/.pipelines/custom pipeline logic
Hi @<1668427963986612224:profile|GracefulCoral77> ! The error is a bit misleading. What it actually means is that you shouldn't attempt to modify a finalized clearml dataset (I suppose that is what you are trying to achieve). Instead, you should create a new dataset that inherits from the finalized one and sync that dataset, or leave the dataset in an unfinalized state
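Roughly like this (the dataset ID, names and paths are placeholders):
from clearml import Dataset

# inherit from the finalized dataset instead of modifying it
parent = Dataset.get(dataset_id="FINALIZED_DATASET_ID")
child = Dataset.create(
    dataset_name="my_dataset_v2",
    dataset_project="my_project",
    parent_datasets=[parent.id],
)
child.sync_folder(local_path="/path/to/data")  # sync the new version
child.finalize(auto_upload=True)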
Hi @<1590514584836378624:profile|AmiableSeaturtle81> ! You could get the Dataset Struct configuration object and read the job_size from there, which is the dataset size in bytes. By the way, the task IDs of the datasets are the same as the datasets' IDs, so you can call all the clearml task-related functions on the task you get by doing Task.get_task("dataset_id")
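A rough sketch of what I mean (the dataset ID is a placeholder, and I'm just iterating the struct's entries to show where job_size lives):
from clearml import Task

# the dataset's backing task ID equals the dataset ID
task = Task.get_task(task_id="DATASET_ID")
struct = task.get_configuration_object_as_dict("Dataset Struct")
for entry in struct.values():
    print(entry.get("job_size"))  # dataset size in bytes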
Hi @<1578555761724755968:profile|GrievingKoala83> ! It looks like lightning uses the NODE_RANK env var to get the rank of a node, instead of NODE (which is used by pytorch). We don't set NODE_RANK yet, but you could set it yourself after launch_multi_node:
import os
current_conf = task.launch_multi_node(2)
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
Hope this helps
Any chance you have some uncommitted code changes, such that when they're not included, this works fine?
Hi @<1631102016807768064:profile|ZanySealion18> ! Reporting None is not possible, but you could report np.nan instead.
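For example, a minimal sketch (project/metric names are just examples):
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="nan scalar")
# report np.nan where you would have reported None
task.get_logger().report_scalar(title="metric", series="value", value=np.nan, iteration=0)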
Your object is likely holding some file descriptor or something like that. The pipeline steps all run in separate processes (they can even run on different machines when running remotely), so you need to make sure that the objects you are returning are picklable and can be passed between these processes. You can verify that the logger you are passing around is indeed picklable by calling pickle.dumps on it and then loading it in another run.
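Something along these lines (check_picklable is just a hypothetical helper name):
import pickle

def check_picklable(obj):
    # round-trip through pickle to verify the object can cross process boundaries
    payload = pickle.dumps(obj)  # raises if the object isn't picklable
    return pickle.loads(payload)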
The best practice would ...
Hi @<1590514584836378624:profile|AmiableSeaturtle81> ! Looks like remove_files doesn't support lists indeed. It does support paths with wildcards though, if that helps. As a workaround for now, I would remove all the files from the dataset and add back only the ones you need, or just create a new dataset.
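The wildcard variant would look roughly like this (dataset ID and pattern are placeholders):
from clearml import Dataset

ds = Dataset.get(dataset_id="DATASET_ID")
# remove everything matching a wildcard pattern instead of passing a list
ds.remove_files(dataset_path="images/*.png", recursive=True)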
because I think that what you are encountering now is an NCCL error
Hi @<1714451218161471488:profile|ClumsyChimpanzee54> ! We will automatically add the cwd of the pipeline controller to the python path when running locally in a future version.
If running remotely, you can approach this in a few ways:
- add the whole project to a git repo and specify that repo in the pipeline steps
- have a prebuilt docker image that contains your project's code. you may then set the working directory to the path of your project
- if the agent running the docker is running ...
Hi @<1578555761724755968:profile|GrievingKoala83> ! Are you trying to launch 2 nodes, each using 2 GPUs, on only 1 machine? I think that will likely not work because of an NCCL limitation.
Also, I think that you should actually do
current_conf = task.launch_multi_node(nodes)
os.environ["LOCAL_RANK"] = "0"  # this process should fork the other ones
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["GLOBAL_RANK"] = str(current_conf.get("node_rank", 0) * gpus)
os.environ["WORLD...
Hi JitteryCoyote63 ! Your clearml-agent is likely run with python3.9. Can you try setting this entry https://github.com/allegroai/clearml-agent/blob/ebb955187dea384f574a52d059c02e16a49aeead/docs/clearml.conf#L48 in your clearml.conf to python3.8, or to the full path to python3.8 if that doesn't work?
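i.e. something like this in your clearml.conf (the path is just an example):
agent {
    # force the agent to use a specific interpreter
    python_binary: "/usr/bin/python3.8"
}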
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , I think you are right. We will try to look into this asap
Hi @<1726047624538099712:profile|WorriedSwan6> ! At the moment, only the function_kwargs and queue parameters accept such references. We will consider supporting them for other fields as well in the near future.
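For example, a sketch of passing such a reference through function_kwargs (the names and the pipeline parameter are placeholders):
from clearml import PipelineController

def train(dataset_id):
    print(dataset_id)

pipe = PipelineController(name="example", project="example", version="1.0.0")
pipe.add_parameter(name="dataset_id", default="DATASET_ID")
pipe.add_function_step(
    name="train",
    function=train,
    # "${pipeline.dataset_id}" is resolved to the pipeline parameter at runtime
    function_kwargs={"dataset_id": "${pipeline.dataset_id}"},
)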
@<1578555761724755968:profile|GrievingKoala83> Looks like something inside NCCL now fails, which doesn't allow rank0 to start. Are you running this inside a docker container? What is the output of nvidia-smi inside this container?
Thank you! We will take a look and come back to you
Yes, see minio instructions under this: None
Hi @<1523702000586330112:profile|FierceHamster54> ! This is currently not possible, but I have a workaround in mind. You could use the artifact_serialization_function
parameter in your pipeline. The function should return a bytes stream of the zipped content of your data with whichever compression level you have in mind.
If I'm not mistaken, you wouldn't even need to write a deserialization function in your case, because we should be able to unzip your data just fine.
Wdyt?
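Something along these lines (gzip+pickle is just one option; the names are placeholders):
import gzip
import pickle
from clearml import PipelineController

def compress_artifact(obj):
    # serialize the object, then gzip it with whichever level you prefer
    return gzip.compress(pickle.dumps(obj), compresslevel=9)

pipe = PipelineController(
    name="example",
    project="example",
    version="1.0.0",
    artifact_serialization_function=compress_artifact,
)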
Check the output_uri parameter in Task.init
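For example, a minimal sketch (the bucket URI is a placeholder):
from clearml import Task

# send model/artifact uploads to your own storage
task = Task.init(
    project_name="examples",
    task_name="upload destination",
    output_uri="s3://my-bucket/models",
)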
Hi OutrageousSheep60 ! The list_datasets function is currently broken and will be fixed in the next release
Hi @<1570583237065969664:profile|AdorableCrocodile14> ! get_local_copy will always copy/download external files to a folder. To get the external files, there is a property on the dataset called link_entries, which returns a list of LinkEntry objects. Each contains a link attribute, and each such link should point to an external file (in this case, your local paths prefixed with file://)
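i.e. roughly (the dataset ID is a placeholder):
from clearml import Dataset

ds = Dataset.get(dataset_id="DATASET_ID")
# inspect the links instead of downloading a local copy
for entry in ds.link_entries:
    print(entry.link)  # e.g. file:///original/local/path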
Hi OutrageousSheep60 ! Regarding your questions: No, it's not. We will have an RC that fixes that ASAP, hopefully by tomorrow. You can use add_external_files, which you already do. If you wish to upload local files to the bucket, you can specify the output_url of the dataset to point to the bucket you wish to upload the data to. See the parameter here: https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload . Note that you CAN mix external_files and regular files. We don't hav...
Hi @<1523703652059975680:profile|ThickKitten19> ! Could you try increasing max_iteration_per_job and checking if that helps? Also, any chance you are fixing the number of epochs to 10, either through a hyperparameter, e.g. DiscreteParameterRange("General/epochs", values=[10]), or simply by calling something like model.fit(epochs=10)?
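For reference, the kind of setup I mean (the IDs, metric names and values are all placeholders):
from clearml.automation import DiscreteParameterRange, HyperParameterOptimizer, RandomSearch

optimizer = HyperParameterOptimizer(
    base_task_id="BASE_TASK_ID",
    hyper_parameters=[
        # let epochs vary instead of pinning them to a single value
        DiscreteParameterRange("General/epochs", values=[10, 50, 100]),
    ],
    objective_metric_title="validation",
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=RandomSearch,
    max_iteration_per_job=100000,  # the value I suggested increasing
)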
Hi @<1578555761724755968:profile|GrievingKoala83> ! We have released clearml==1.16.3rc1, which should solve the issue now. Just specify task.launch_multi_node(nodes, devices=gpus). For example:
import sys
import os
from argparse import ArgumentParser
import pytorch_lightning as pl
from pytorch_lightning.strategies.ddp import DDPStrategy
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from...
Hi DilapidatedDucks58 ! Browsers display double spaces as a single space by default; this is a common problem. What we could do is add a "copy to clipboard" button (it would copy the text properly). What do you think?
What do you get when you run this code?
from clearml.backend_api import Session
print(Session.check_min_api_server_version("2.17"))
Hi @<1523701949617147904:profile|PricklyRaven28> ! Thank you for the example. We managed to reproduce. We will investigate further to figure out the issue