Hmm, #790 should be solved in 1.7.2
Yes, I always see the "model uploaded completed" for such stuck tasks. Any chance this is reproducible?
How many processes do you see running (e.g. ps -Af | grep python)?
What is the training framework? Is it multiprocess? How are you launching the process itself? Is it a Linux OS? Is it running inside a specific container?
Is there a quicker way to abort all running experiments in a project? I have over a thousand running anonymous data tasks in a specific project and I want to abort them before debugging them.
We are adding "select all" in the next UI version to do that as quickly as possible 🙂
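Until then, something along these lines with the SDK should do it (the project name and the status filter here are assumptions, adjust them to your setup):
```
# sketch: abort every still-running task in a project via the SDK
from clearml import Task

tasks = Task.get_tasks(
    project_name="my_project",                # placeholder name
    task_filter={"status": ["in_progress"]},  # only tasks still running
)
for t in tasks:
    t.mark_stopped()  # moves the task to stopped/aborted
```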
it overwrites the previous run?
It will overwrite the previous run if it is under 72h from the last execution and no artifact/model was created. You can control it with reuse_last_task_id=False passed to Task.init
The Task name itself is not unique in the system; think of it as a short description
Make sense ?
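For example (project/task names are placeholders):
```
# always create a brand-new task instead of overwriting the previous run
from clearml import Task

task = Task.init(
    project_name="my_project",
    task_name="my_experiment",  # not unique, think short description
    reuse_last_task_id=False,   # never reuse/overwrite the previous run
)
```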
Should I use update_weights_package?
Yes
BTW, the config.pbtxt should be passed when "registering" the endpoint with the CLI
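For the update_weights_package part, roughly this (paths/names are placeholders, a sketch rather than a full recipe):
```
# sketch: upload a whole weights folder as a single packaged model
from clearml import Task, OutputModel

task = Task.init(project_name="my_project", task_name="register model")
model = OutputModel(task=task)
# packages every file under the folder and uploads it as one artifact
model.update_weights_package(weights_path="./model_dir")
```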
Can you share the modified helm/yaml?
Did you run any specific migration script after the upgrade ?
How many apiserver instances do you have ?
How did you configure the elastic container? Is it booting?
Then you have to pass the .ssh folder into the remote server; probably the easiest is to have it in the "extra bash script"
TrickyFox41 are you saying that if you add Task.init in the code it works, but when you call "clearml-task" it does not? (in both cases editing the Args/overrides?)
I can, but that is not a configuration we would want to run with in production
Agreed, I just want to isolate the issue. I think this is the boto Python interface missing some configuration or environment variables
The address is valid. If I just go to the files server address in my browser, ...
@<1729309131241689088:profile|MistyFly99> what is the exact address of those files? (including the http prefix) and what is the address of the web application ?
At runtime; every time add_step needs to create a new Task to be enqueued
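i.e. roughly this shape (all names are placeholders); each add_step clones its base task into a new Task when the step is launched:
```
# sketch: every step is cloned from a template task at runtime
from clearml import PipelineController

pipe = PipelineController(name="my pipeline", project="my_project", version="1.0")
pipe.add_step(
    name="stage_train",
    base_task_project="my_project",
    base_task_name="train template",  # cloned into a new Task when enqueued
)
pipe.start()
```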
Closing the dataset doesn't work: dataset.close() raises AttributeError: 'Dataset' object has no attribute 'close'
Hi @<1523714677488488448:profile|NastyOtter17> could you send the full exception?
Are you suggesting just taking the read_and_process_file function out of the read_dataset method?
Yes 🙂
As for the second option, you mean create the task in the __init__ method of the NetCDFReader class?
correct
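Something like this hypothetical sketch (the NetCDFReader internals here are assumed, not your actual code):
```
# hypothetical: create the Task when the reader is constructed
from clearml import Task

class NetCDFReader:
    def __init__(self, project_name="my_project", task_name="netcdf reader"):
        # one Task per reader instance; names are placeholders
        self.task = Task.init(project_name=project_name, task_name=task_name)

    def read_dataset(self, path):
        # read_and_process_file would be called from here
        ...
```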
It would be a great idea to make the Task picklable.
Adding that to the next version's to-do list 😉
Yes, exactly like a Task (a pipeline is a type of task):
```
from clearml import Task

# pipeline_uid_here: the task ID of the pipeline to clone
cloned_pipeline = Task.clone(source_task=pipeline_uid_here)
# the queue name is an example, use whichever queue your agents listen on
Task.enqueue(cloned_pipeline, queue_name="default")
```
Yeah, you can ignore those; this is some Python GC stuff, and it seems to be related to the OS and Python version
For running the pipeline remotely I want the path to be like /Users/adityachaudhry/.clearml/cache/......
I'm not sure I follow; if you are getting a path with all your folders from get_local_copy, that's exactly what you are looking for, no?
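For reference, this is what I mean (dataset names are placeholders); get_local_copy downloads into the local cache and returns the folder path:
```
# sketch: fetch a dataset into the local clearml cache and get its path
from clearml import Dataset

ds = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
local_path = ds.get_local_copy()  # cache folder holding the full dataset structure
print(local_path)
```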
Seems like it is working (including seaborn)
PlainSquid19 yes, the link is only available in the actual paid product 😞
I don't think they have the documentation open yet...
My recommendation is to fill out the "contact us" form, you'll get a free online tour as well 😉
Thanks @<1523702652678967296:profile|DeliciousKoala34> I think I know what the issue is!
The container has 1.3.0a and you need 1.3.0, which is why it is re-downloading (I'll make sure the agent can sort it out, because this is Nvidia's version and in reality it should be a perfect match)
Hi TrickySheep9
Could you post the pipeline code here?
Also which clearml version are you using ?
Hi @<1691258563357315072:profile|ColorfulKitten60>
I think we need some context for this question 🙂
What's your clearml version (python and server) ?
It seems that once the job has completed, it doesn't accept any new reports...
completed can be forced (back into a reportable state), published cannot ...
What's the error you are getting ?
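If it helps, a rough sketch of forcing a completed (not published) task back into a reportable state (the task id is a placeholder):
```
# sketch: reopen a completed task so it accepts new reports again
from clearml import Task

task = Task.get_task(task_id="<task_id_here>")
task.mark_started(force=True)  # works for completed tasks, not published ones
task.get_logger().report_scalar("title", "series", value=1.0, iteration=0)
task.mark_completed()
```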
😂
I'm trying to create a task that is not in the repository root folder.
JuicyFox94 if the Task is not in a repo folder, you mean it is in a remote repository, right?
This means the repo should be in the form of "https://github.com/" or "ssh://"
It failed to deduce that this is a remote repository (maybe we can improve the auto-detection?!)
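e.g. you can also point the task at the repo explicitly (URL/branch/script below are placeholders):
```
# sketch: create a task that runs from a remote repository
from clearml import Task

task = Task.create(
    project_name="my_project",
    task_name="remote repo task",
    repo="https://github.com/user/project.git",  # or an ssh:// URL
    branch="main",
    script="src/train.py",
)
```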
Hi GrievingTurkey78
Can you test with the latest clearml-agent RC? (I remember a fix just for that)
```
pip install clearml-agent==1.2.0rc0
```
What do you have here in your docker compose?
ModelCheckpoint('best_model', save_best_only=True)
That worked for me now, what's the diff?
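For context, the kind of minimal run I'd expect this to cover (synthetic data, all names are placeholders; newer Keras versions want a file suffix on the checkpoint path); with Task.init active, clearml should pick up the checkpoint automatically:
```
# sketch: Keras ModelCheckpoint output auto-logged by clearml
import numpy as np
from tensorflow import keras
from clearml import Task

task = Task.init(project_name="my_project", task_name="keras checkpoint test")

x, y = np.random.rand(64, 4), np.random.rand(64, 1)
model = keras.Sequential([keras.layers.Dense(8, activation="relu"),
                          keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, validation_split=0.25, epochs=3,
          callbacks=[keras.callbacks.ModelCheckpoint("best_model.h5",  # older TF also accepts a bare path
                                                     save_best_only=True)])
```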
I have to admit, I haven't had the time 😞
Trying to get pip to be twice as fast 🤞
https://github.com/pypa/pip/pull/8215
Please keep pinging me, I would really like to follow up on it.
GreasyPenguin14 GrittyKangaroo27 the new release contains a fix, could you verify it solves the issue in your scenario as well? There is now a smart timeout to detect the inconsistent state, which means the close/exit procedure might be delayed (10 sec) instead of hanging in these specific rare scenarios.
Actually it doesn't matter (systemd and init.d are different ways to spin up services on different Linux distros); you can pick whichever seems more convenient for you and whichever is supported by the Linux you are running (in most cases both are) 🙂