One more thing: you should likely call task.launch_multi_node(args.nodes * args.gpus) instead, as I see that the world size set by Lightning corresponds to this value
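For illustration, a rough sketch of what I mean (assuming args.nodes / args.gpus come from your own argparse setup; the project/task names are placeholders):
```python
import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--nodes", type=int, default=2)
parser.add_argument("--gpus", type=int, default=4)   # GPUs (processes) per node
args = parser.parse_args()

task = Task.init(project_name="examples", task_name="multi-node")  # hypothetical names
# pass the total number of workers (nodes * GPUs per node) so it matches Lightning's world size
task.launch_multi_node(args.nodes * args.gpus)
```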
do you have any STATUS REASON under the INFO section of the controller task?
One more question FierceHamster54: what Python/OS/ClearML versions are you using?
You're correct. There are two main entries in the conf file: api and sdk. The dataset entry should go under sdk
SmallGiraffe94 You should use dataset_version=2022-09-07 (not version=...). This should work for your use case. Dataset.get shouldn't actually accept a version kwarg, but it does because it accepts some **kwargs used internally. We will make sure to warn users from now on if they pass values via **kwargs.
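For reference, a minimal sketch of the call (the project and dataset names are placeholders):
```python
from clearml import Dataset

# note: dataset_version, not version=
dataset = Dataset.get(
    dataset_project="my_project",
    dataset_name="my_dataset",
    dataset_version="2022-09-07",
)
```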
Anyway, this issue still exists, but in another form: Dataset.get can't get datasets with a non-semantic version, unless the version is sp...
@<1668427963986612224:profile|GracefulCoral77> You can either create a child or keep the same dataset, as long as it is not finalized.
You can skip the finalization using the --skip-close argument. Anyhow, I can see why the current workflow is confusing. I will discuss it with the team; maybe we should allow syncing unfinalized datasets as well.
PanickyMoth78 Something is definitely wrong here. The fix doesn't seem to be trivial either... we will prioritize this for the next version
@<1719524641879363584:profile|ThankfulClams64> you could try using the compare feature in the UI to compare the experiments from the machine where the scalars are not reported properly with experiments from a machine that reports them properly. I would then suggest replicating the environment exactly on the problematic machine.
"would it be on the pipeline task itself then, since that's what's disappearing?" That is likely the case.
Hi @<1670964662520254464:profile|LonelyFly70> ! FrameGroups are part of the enterprise SDK, so they can only be imported from allegroai
@<1709740168430227456:profile|HomelyBluewhale47> you should be able to upload the images and download them without a problem. You could also use a cloud provider such as S3 to store your files if you believe it would speed things up
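If it helps, here is a rough sketch of that flow (bucket, project and paths are placeholders, not something specific to your setup):
```python
from clearml import Dataset

# upload the images as a dataset, storing the actual files on S3
dataset = Dataset.create(dataset_project="my_project", dataset_name="images")
dataset.add_files("/path/to/images")
dataset.upload(output_url="s3://my-bucket/datasets")  # or omit output_url to use the default storage
dataset.finalize()

# later, download a local copy of the images
local_path = Dataset.get(dataset_project="my_project", dataset_name="images").get_local_copy()
```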
Hi @<1635088270469632000:profile|LividReindeer58> ! parent_task.artifacts[artifact_name].get() should just work to get artifacts from the parent task (the artifact should be automatically unpickled). Are you getting any errors when you do this?
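Just to be explicit about what I mean, a minimal sketch (the task ID and artifact name are placeholders):
```python
from clearml import Task

parent_task = Task.get_task(task_id="<PARENT_TASK_ID>")
obj = parent_task.artifacts["my_artifact"].get()  # unpickled automatically
```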
Yeah, that's always the case with complex systems 😕
FreshParrot56 You could modify this entry in your clearml.conf to point to your drive: sdk.storage.cache.default_base_dir .
Or, if you don't want to touch your conf file, you could set the env var CLEARML_CACHE_DIR to your remote drive before you call get_local_copy. See this example:
```python
import os
from clearml import Dataset

dataset = Dataset.get(DATASET_ID)  # DATASET_ID is the ID of your dataset
os.environ["CLEARML_CACHE_DIR"] = "/mnt/remote/drive"  # change the clearml cache, make it point to your remote drive
copy_path = dataset.get_local_copy()
```
FierceHamster54 "initing the task before the execution of the file like in my snippet is not sufficient?" It is not, because os.system spawns a whole different process than the one you initialized your task in, so no patching is done on the framework you are using. Child processes need to call Task.init because of this, unless they were forked, in which case the patching is already done.
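To illustrate the pattern (script and task names are just placeholders):
```python
# parent.py
import os
from clearml import Task

task = Task.init(project_name="examples", task_name="parent")  # hypothetical names
# os.system starts a brand-new process, so nothing from this Task carries over to it
os.system("python training.py")

# training.py (the child process) therefore needs its own Task.init call, e.g.:
# from clearml import Task
# Task.init(project_name="examples", task_name="training")
```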
"But the training.py has already a ClearML task created under the hood since its integratio...
Hi FlutteringWorm14 ! Looks like we indeed don't wait for report_period_sec when reporting data. We will fix this in a future release. Thank you!
DangerousDragonfly8 you can try to start the pipeline like this: pipe.start(step_task_completed_callback=callback), where callback has the signature def callback(pipeline, node, parameters): print(pipeline, node, parameters). Note that even though the parameter name is step_task_completed_callback, it is actually run before the task is started. This is actually a bug...
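Putting it together, a minimal sketch (assuming pipe is your PipelineController):
```python
def callback(pipeline, node, parameters):
    # despite the name, this currently fires before the step's task starts
    print(pipeline, node, parameters)

pipe.start(step_task_completed_callback=callback)
```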
We will need to review the callbacks, but I think you can work with this for now...
Hi SoreHorse95 ! I think that the way we interact with hydra doesn't account for overrides. We will need to look into this. In the meantime, do you also have some sort of stack trace or similar?
DangerousDragonfly8 I'm pretty sure you can use pre_execute_callback or post_execute_callback for this. You get the PipelineController and the Node in the callback, so you can modify the next step/node. Note that you might need to access the Task object directly to change the execution_queue and docker_args. You can get it from node.job.task. https://clear.ml/docs/latest/docs/references/sdk/automation_controller_pipelinecontroller#add_funct...
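A rough sketch of what that could look like (assuming pipe is your PipelineController; the step function and the docker-arguments call are assumptions, not a verified recipe):
```python
def pre_callback(pipeline, node, parameters):
    # the underlying Task of the step about to run
    step_task = node.job.task
    step_task.set_base_docker(docker_arguments="--ipc=host")  # assumed way to tweak docker args
    return True  # returning False would skip the step

pipe.add_function_step(
    name="my_step",
    function=my_step_function,            # hypothetical step function
    pre_execute_callback=pre_callback,
)
```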
@<1545216070686609408:profile|EnthusiasticCow4> yes, that's true. I would aggregate the tasks by tags (the steps will be tagged with opt: ID), then get the metrics to retrieve the losses, and look into each task's config to get the term you wanted to optimize ( https://clear.ml/docs/latest/docs/references/sdk/task/#get_last... )
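Something along these lines might work as a starting point (the tag value and metric names are placeholders; check the exact names in the docs linked above):
```python
from clearml import Task

task_ids = Task.query_tasks(tags=["opt: <OPTIMIZER_TASK_ID>"])  # aggregate the steps by tag
for task_id in task_ids:
    t = Task.get_task(task_id=task_id)
    last_metrics = t.get_last_scalar_metrics()   # e.g. last_metrics["validation"]["loss"]["last"]
    params = t.get_parameters()                  # the config, including the term you optimized
```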
can you send the full logs of rank0 and rank1 tasks?
Hi @<1714451218161471488:profile|ClumsyChimpanzee54> ! We will automatically add the cwd of the pipeline controller to the python path when running locally in a future version.
If running remotely, you can approach this in a few ways:
- add the whole project to a git repo and specify that repo in the pipeline steps
- have a prebuilt docker image that contains your project's code. you may then set the working directory to the path of your project
- if the agent running the docker is running ...
can you share the logs of the controller?
Hi @<1668427963986612224:profile|GracefulCoral77> ! The error is a bit misleading. What it actually means is that you shouldn't attempt to modify a finalized clearml dataset (I suppose that is what you are trying to achieve). Instead, you should create a new dataset that inherits from the finalized one and sync that dataset, or leave the dataset in an unfinalized state
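For the first option, a rough sketch (project/dataset names and paths are placeholders):
```python
from clearml import Dataset

finalized = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
child = Dataset.create(
    dataset_project="my_project",
    dataset_name="my_dataset",
    parent_datasets=[finalized.id],      # inherit the finalized dataset's contents
)
child.sync_folder("/path/to/local/folder")  # sync changes into the (still unfinalized) child
child.upload()
child.finalize()
```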
Hi @<1523703652059975680:profile|ThickKitten19> ! Could you try increasing the max_iteration_per_job and check if that helps? Also, any chance that you are fixing the number of epochs to 10, either through a hyperparameter, e.g. DiscreteParameterRange("General/epochs", values=[10]), or simply because it is fixed to 10 when you call something like model.fit(epochs=10)?
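For reference, this is roughly where that knob lives; a sketch only, with placeholder task ID, metric names and values:
```python
from clearml.automation import DiscreteParameterRange, HyperParameterOptimizer, RandomSearch

optimizer = HyperParameterOptimizer(
    base_task_id="<BASE_TASK_ID>",
    hyper_parameters=[DiscreteParameterRange("General/epochs", values=[10])],
    objective_metric_title="validation",
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=RandomSearch,
    max_iteration_per_job=100000,   # raise this so individual jobs are not cut short
    total_max_jobs=20,
)
optimizer.start_locally()
```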
this is likely a UI bug. We should have a fix soon. In the meantime, yes, you can edit the configuration under the pipeline task to achieve the same effect
Hi @<1724235687256920064:profile|LonelyFly9> ! ClearML does not allow those to be configured, but you might consider setting the AWS_RETRY_MODE and AWS_MAX_ATTEMPTS env vars, which boto3 picks up from the environment.
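For example, something like this before the boto3/ClearML calls (the values are just illustrative):
```python
import os

os.environ["AWS_RETRY_MODE"] = "adaptive"   # or "standard" / "legacy"
os.environ["AWS_MAX_ATTEMPTS"] = "5"        # maximum attempts, including the initial call
```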
I think that will work, but I'm not sure actually. I know for sure that something like us-east-2 is supported
Hi @<1523705721235968000:profile|GrittyStarfish67> ! Please install the latest RC: pip install clearml==1.12.1rc0 to fix this. We will have an official release soon as well