Hi @<1724235687256920064:profile|LonelyFly9> ! I think that just adding some retries to `exists_file` is a good idea, so maybe we will do just that 👍
Hi @<1590514584836378624:profile|AmiableSeaturtle81> ! Looks like `remove_files` doesn't support lists indeed. It does support paths with wildcards though, if that helps.
As a workaround for now, I would remove all the files from the dataset and add back only the ones you need, or just create a new dataset.
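For example, a minimal sketch of the wildcard workaround (the dataset ID, names and path patterns below are placeholders):
```
from clearml import Dataset

# placeholders: dataset ID, project/dataset names and wildcard patterns
parent = Dataset.get(dataset_id="existing-dataset-id")
pruned = Dataset.create(
    dataset_name="my-dataset-pruned",
    dataset_project="my-project",
    parent_datasets=[parent.id],
)

# remove_files accepts a wildcard path rather than a list,
# so call it once per pattern
pruned.remove_files(dataset_path="images/*.png")
pruned.remove_files(dataset_path="labels/*.json")

pruned.upload()
pruned.finalize()
```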
because I think that what you are encountering now is an NCCL error
Hi BoredHedgehog47 ! We tried to reproduce this, but failed. What we tried is running the attached `main.py`, which `Popen`s `sub.py`.
Can you please run `main.py` as well and tell us if you still encounter the bug? If not, is there anything else you can think of that could trigger this bug besides creating a subprocess?
Thank you!
Hi @<1714451218161471488:profile|ClumsyChimpanzee54> ! We will automatically add the cwd of the pipeline controller to the python path when running locally in a future version.
If running remotely, you can approach this in a few ways (a sketch of the first option follows this list):
- add the whole project to a git repo and specify that repo in the pipeline steps
- have a prebuilt docker image that contains your project's code; you may then set the working directory to the path of your project
- if the agent running the docker is running ...
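A minimal sketch of the first option, assuming a `PipelineController` with function steps; the repo URL, queue and names are placeholders:
```
from clearml import PipelineController


def my_train_function(epochs):
    # placeholder step body; your real project code goes here
    print(f"training for {epochs} epochs")


# placeholders: pipeline/project names, repo URL and queue
pipe = PipelineController(name="my-pipeline", project="my-project")

pipe.add_function_step(
    name="train",
    function=my_train_function,
    function_kwargs={"epochs": 10},
    # attach the git repo that contains the whole project, so the step's
    # remote working copy has your package importable
    repo="https://github.com/my-org/my-project.git",
    repo_branch="main",
)

pipe.start(queue="default")
```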
Hi @<1578555761724755968:profile|GrievingKoala83> ! Are you trying to launch 2 nodes, each using 2 GPUs, on only 1 machine? I think that will likely not work because of an NCCL limitation.
Also, I think that you should actually do:
```
current_conf = task.launch_multi_node(nodes)
os.environ["LOCAL_RANK"] = "0"  # this process should fork the other one
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["GLOBAL_RANK"] = str(current_conf.get("node_rank", 0) * gpus)
os.environ["WORLD...
```
does it work running this without clearml? @<1578555761724755968:profile|GrievingKoala83>
Hi JitteryCoyote63 ! Your clearml-agent is likely run with python3.9. Can you try setting this entry https://github.com/allegroai/clearml-agent/blob/ebb955187dea384f574a52d059c02e16a49aeead/docs/clearml.conf#L48 in your clearml.conf to `python3.8`, or to the full path of python3.8 if that doesn't work?
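For reference, a sketch of what that clearml.conf entry could look like (the interpreter path is an assumption; adjust it to your system):
```
agent {
    # use this interpreter (or full path) when the agent builds its environments
    python_binary: "/usr/bin/python3.8"
}
```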
@<1578555761724755968:profile|GrievingKoala83> did you call `task.launch_multi_node(4)` or `task.launch_multi_node(2)`? I think the right value is 4 in this case.
One more thing: it's likely that you should do `task.launch_multi_node(args.nodes * args.gpus)` instead, as I see that the world size set by lightning corresponds to this value.
Hi @<1719524641879363584:profile|ThankfulClams64> ! What tensorflow/keras version are you using? I noticed that in `TensorBoardImage` you are using `tf.Summary`, which no longer exists since tensorflow 2.2.3, which I believe is too old to work with tensorboard==2.16.2.
Also, how are you stopping and starting the experiments? When starting an experiment, are you resuming training? In that case, you might want to consider setting the initial iteration to the last iteration your prog...
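If it helps, a minimal sketch of the TF2-style replacement for the removed `tf.Summary` image logging (the dummy image batch and log directory are assumptions):
```
import numpy as np
import tensorflow as tf

# placeholder: a dummy 64x64 RGB image batch stands in for your real data
images = np.random.rand(1, 64, 64, 3).astype("float32")

# tf.summary.image is the TF2 replacement for the old tf.Summary protobufs
writer = tf.summary.create_file_writer("logs/images")
with writer.as_default():
    tf.summary.image("debug_image", images, step=0)
writer.flush()
```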
@<1578555761724755968:profile|GrievingKoala83> what error are you getting when using gloo? Is it the same one?
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , I think you are right. We will try to look into this asap
Hi @<1726047624538099712:profile|WorriedSwan6> ! At the moment, only the `function_kwargs` and `queue` parameters accept such references. We will consider supporting them for other fields as well in the near future.
Hi @<1523701713440083968:profile|PanickyMoth78> ! Make sure you are calling `Task.init` in `my_function` (this is because the bindings made by clearml will be lost in a spawned process, as opposed to a forked one). Also make sure that, in the spawned process, the `CLEARML_PROC_MASTER_ID` env var is set to the pid of the master process and `CLEARML_TASK_ID` is set to the ID of the task initialized in the master process (this should happen automatically).
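A minimal sketch of that setup, assuming a spawned worker and placeholder project/task names:
```
import multiprocessing as mp

from clearml import Task


def my_function():
    # Re-initialize inside the spawned process: the clearml bindings made in the
    # parent are lost with "spawn" (unlike "fork"). With CLEARML_PROC_MASTER_ID /
    # CLEARML_TASK_ID propagated (normally automatic), this should attach to the
    # master's task rather than create a new one.
    task = Task.init(project_name="my-project", task_name="spawn-example")
    task.get_logger().report_text("hello from the spawned process")


if __name__ == "__main__":
    Task.init(project_name="my-project", task_name="spawn-example")
    mp.set_start_method("spawn", force=True)
    p = mp.Process(target=my_function)
    p.start()
    p.join()
```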
Hi HandsomeGiraffe70 ! You could try setting `dataset.preview.tabular.table_count` to 0 in your clearml.conf file.
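A sketch of that clearml.conf entry, assuming it sits under the sdk section of the file:
```
sdk {
    dataset {
        preview {
            tabular {
                # disable tabular previews for dataset files
                table_count: 0
            }
        }
    }
}
```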
@<1578555761724755968:profile|GrievingKoala83> Looks like something inside NCCL now fails, which doesn't allow rank 0 to start. Are you running this inside a docker container? What is the output of `nvidia-smi` inside of this container?
Thank you! We will take a look and get back to you.
@<1719162259181146112:profile|ShakySnake40> the data is still present in the parent and it won't be uploaded again. Also, when you pull a child dataset you are also pulling the dataset's parent data. `dataset.id` is a string that uniquely identifies each dataset in the system. In my example, you are using the ID to reference a dataset which would be a parent of the newly created dataset (that is, after getting the dataset via `Dataset.get`).
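For illustration, a minimal sketch of that parent/child relationship (the parent ID and names are placeholders):
```
from clearml import Dataset

# placeholder: the ID of the dataset you want to build on top of
parent = Dataset.get(dataset_id="parent-dataset-id")

# the child only stores the delta; the parent's data is reused, not re-uploaded
child = Dataset.create(
    dataset_name="my-dataset-v2",
    dataset_project="my-project",
    parent_datasets=[parent.id],
)
child.add_files("new_samples/")
child.upload()
child.finalize()

# pulling the child also pulls the parent's data
local_path = Dataset.get(dataset_id=child.id).get_local_copy()
```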
Yes, see minio instructions under this: None
Hi @<1523702000586330112:profile|FierceHamster54> ! This is currently not possible, but I have a workaround in mind. You could use the `artifact_serialization_function` parameter in your pipeline. The function should return a byte stream of the zipped content of your data, with whichever compression level you have in mind.
If I'm not mistaken, you wouldn't even need to write a deserialization function in your case, because we should be able to unzip your data just fine.
Wdyt?
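A minimal sketch of such a serialization function, assuming a `PipelineController`-based pipeline, gzip compression and placeholder names:
```
import gzip
import pickle

from clearml import PipelineController


def zip_serializer(obj):
    # serialize the object and return the compressed bytes;
    # the compression level (6 here) is just an example
    return gzip.compress(pickle.dumps(obj), compresslevel=6)


# placeholders: pipeline/project names
pipe = PipelineController(
    name="my-pipeline",
    project="my-project",
    artifact_serialization_function=zip_serializer,
)
```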
Check the `output_uri` parameter in `Task.init`.
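For example, a sketch with a placeholder bucket URI and names:
```
from clearml import Task

# placeholders: project/task names and the destination bucket
task = Task.init(
    project_name="my-project",
    task_name="my-task",
    output_uri="s3://my-bucket/clearml-artifacts",
)
```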
Hi OutrageousSheep60 ! The `list_datasets` function is currently broken and will be fixed in the next release.
Hi @<1570583237065969664:profile|AdorableCrocodile14> ! `get_local_copy` will always copy/download external files to a folder. To get the external files, there is a property on the dataset called `link_entries`, which returns a list of `LinkEntry` objects. Each of these has a `link` attribute, and each such link should point to an external file (in this case, your local paths prefixed with `file://`).
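A short sketch of reading those links (the dataset ID is a placeholder):
```
from clearml import Dataset

# placeholder: the ID of the dataset that contains external files
dataset = Dataset.get(dataset_id="my-dataset-id")

# list the external entries without downloading them
for entry in dataset.link_entries:
    # for locally-added external files this prints file:// paths
    print(entry.link)
```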
If the task is running remotely and the parameters are populated, then the local run parameters will not be used; instead, the parameters that are already on the task will be used. This is because we want to allow users to change these parameters in the UI if they want to, so the parameters that are in the code are ignored in favor of the ones in the UI.
Hi OutrageousSheep60 ! Regarding your questions:
- No, it's not. We will have an RC that fixes that ASAP, hopefully by tomorrow.
- You can use `add_external_files`, which you already do. If you wish to upload local files to the bucket, you can specify the `output_url` of the dataset to point to the bucket you wish to upload the data to (see the sketch after this list). See the parameter here: https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload . Note that you CAN mix external_files and regular files. We don't hav...
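A sketch of that flow, with placeholder names, paths and bucket URLs:
```
from clearml import Dataset

# placeholders: names, local path, source and destination buckets
dataset = Dataset.create(dataset_name="mixed-dataset", dataset_project="my-project")

# external files are only referenced, not uploaded
dataset.add_external_files(source_url="s3://source-bucket/raw-data/")

# regular local files will be uploaded
dataset.add_files(path="local_data/")

# upload the local files to the bucket of your choice
dataset.upload(output_url="s3://my-bucket/datasets")
dataset.finalize()
```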
Hi @<1523703652059975680:profile|ThickKitten19> ! Could you try increasing the `max_iteration_per_job` and check if that helps? Also, any chance that you are fixing the number of epochs to 10, either through a hyperparameter, e.g. `DiscreteParameterRange("General/epochs", values=[10])`, or simply by calling something like `model.fit(epochs=10)`?
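For context, a minimal sketch of where `max_iteration_per_job` fits in an optimizer setup; the base task ID, metric names and ranges are placeholders:
```
from clearml.automation import (
    DiscreteParameterRange,
    HyperParameterOptimizer,
    RandomSearch,
    UniformParameterRange,
)

# placeholders: base task ID, metric title/series, parameter ranges and limits
optimizer = HyperParameterOptimizer(
    base_task_id="base-task-id",
    hyper_parameters=[
        UniformParameterRange("General/lr", min_value=1e-4, max_value=1e-1),
        DiscreteParameterRange("General/epochs", values=[10, 20, 50]),
    ],
    objective_metric_title="validation",
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=RandomSearch,
    execution_queue="default",
    # raise this so jobs are not stopped before reporting enough iterations
    max_iteration_per_job=100000,
    total_max_jobs=20,
)
optimizer.start()
```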
Regarding 1.
, are you trying to delete the project from the UI? (I can't see an attached image in your message)
Regarding number 2.
, that is indeed a bug and we will try to fix it as soon as possible
Hi @<1578555761724755968:profile|GrievingKoala83> ! We have released `clearml==1.16.3rc1`, which should solve the issue now. Just specify `task.launch_multi_node(nodes, devices=gpus)`. For example:
```
import sys
import os
from argparse import ArgumentParser
import pytorch_lightning as pl
from pytorch_lightning.strategies.ddp import DDPStrategy
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from...
```