YummyFish22 can you point to the huggingface example you are using?
Let me check something
Damn, JitteryCoyote63 seems like a bug in the backend, it will not allow you to change the task type to the new types π
Any idea why the Pipeline Controller is Running despite the task passing?
What do you mean by "the task passing"
maybe I should use explicit reporting instead of Tensorboard
It will do just the same π
there is no method for settingΒ
last iteration
, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
Let me double check that...
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...
That is a very good point
but for the metrics, I explicitly pass th...
Hi SmugLizard24
The question is what is the reason of the issue?
That is a good question, could it be out of memory? (trying to compress or send the file in one chunk?)
@<1523701868901961728:profile|ReassuredTiger98> what are you getting with:
nvidia-smi
And here:
ls -la /usr/local/
Hi @<1631102016807768064:profile|ZanySealion18>
sorry missed that one
The cache doesn't work, it attempts to download the dataset every time.
just making sure the dataset itself contains all the files?
Once I used clearml-data add --folder * CLI everything works correctly (though all files recursively ended up in the root, I had luck all were named differently).
Not sure I follow here, is the problem the creation of the dataset of fetching it? is this a single version or multi...
Okay I have an idea, it could be a lock that another agent/user is holding on the cache folder or similar
Let me check something
Hi Guys, just curious here, what's was the final issue?
Also out of curiosity, what does that mean? "1.12.2 because some bug that make fastai lag 2x" ?
Please do, just so it wont be forgotten (it won't but for the sake of transparency )
Damn, okay I'll make sure we fix the order.
Could you verify the ~= works as intended (if the order id correct)
My understanding is that on remote execution Task.init is supposed to be a no-op right?
Not really a no-op, it would sync Argpasrer and the like, start background reporting services etc.
This is so odd! literally nothing printed
Can you tell me something about the node "mrl-plswh100:0" ?
is this like a sagemaker node? we have seen things similar where Python threads / subprocesses are not supported and instead of python crashing it just hangs there
Hi @<1727497172041076736:profile|TightSheep99>
Yes it can, it will upload the meta-data as well as the files (it will also do de-dup and will not upload files that already exist in the dataset based on the hash of teh file content)
and they don't know how to write code, is this still possible?
well this means there is some standard of the data, right? what is that standard? unfortunately in our space there is no standard fort data, it's just too generic, so everyone always end with custom parsing of a sort.
Does that make sense ?
CleanWhale17 per your request :)
An automated ML Pipeline π Automated Data Source Integration π Data Pooling and Web Interface for Manual Annotation of Images(Seg. / Classif) [Allegro Enterprise] or users integrate with open-source Storage of Annotation output files(versioned JSON) π Online-Training Β Support(for Dataset Shifts) [Not Sure what you mean] Data Pre-processessing (filter/augment) [Allegro Enterprise] or users integrate with open-source Data-set visualization(stats...
I'm running hyper parameter optimzation on LSF cluster where every task is an LSF job running without clearml-agent
WOW this is so cool! π
HI @<1687643893996195840:profile|RoundCat60>
Are you running on AWS ?
So assuming they are all on the same LB IP: You should do:
LB 8080 (https) -> instance 8080
LB 8008 (https) -> instance 8008
LB 8081 (https) -> instance 8081
It might also work with:
LB 443 (https) -> instance 8080
confirmed that the change had been added by
Make sure you see them in the Task log in the UI (the agent print it when it starts)
Any insight on how we can reproduce the issue?
Can this be reproducible using a simple script that we can also run?
Notice: dataset_rgb.list_files()
will list the content of the dataset, Not the local files:
e.g.: /folder/myfile.ext
and not /hone/user/cache/folder/myfile.ext
So basically i think you are just not passing actual files, you should probably do:for local_file in Path(folder_rgb).rglob('*'): ...
Hi @<1569858449813016576:profile|JumpyRaven4>
What's the clearml-serving version you are running ?
This happens even though all the pods are healthy and the endpoints are processing correctly.
The serving pods are supposed to ping "I'm alive" and that should verify the serving control plan is alive.
Could it be no requests are being served ?
So in a simple "all-or-nothing"
Actually this is the only solution unless preemption is supported, i.e. abort running Task to free-up an agent...
There is no "magic" solution for complex multi-node scheduling, even SLURM will essentially do the same ...