Are you using the community server or did you deploy yourself?
I meant writing a new pipeline controller that will incorporate the previous pipelines as steps. What is the error that you're getting? Can you provide a snippet?
Hi JumpyPig73, can you provide a snippet from the console log? Also, what OS are you running on?
Can you try with blank worker_id/worker_name in your clearml.conf (basically how it was before)?
You can force kill the agent using kill -9 <process_id>, but clearml-agent daemon stop should work.
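If you need to script the force-kill path, here is a minimal Python sketch; a dummy sleep process stands in for a stuck clearml-agent daemon:

```python
import os
import signal
import subprocess

# Dummy long-running process standing in for a stuck clearml-agent daemon
proc = subprocess.Popen(["sleep", "1000"])

# Equivalent of: kill -9 <process_id>
os.kill(proc.pid, signal.SIGKILL)
proc.wait()  # reap the process so the PID is released

print(proc.returncode)  # prints -9 (terminated by SIGKILL)
```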
Also, can you verify that one of the daemons is the clearml-services daemon? This one should be running from inside a docker on your server machine (I'm guessing you're self hosting - correct?).
Hi @<1554638166823014400:profile|ExuberantBat24> , you mean dynamic GPU allocation on the same machine?
What about tasks.get_all? You can specify the ID of the task you want as well:
https://clear.ml/docs/latest/docs/references/api/tasks#post-tasksget_all
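For example, a minimal sketch using the ClearML Python APIClient (assumes clearml is installed and your clearml.conf points at the server; the task ID argument is a placeholder):

```python
def fetch_task(task_id):
    """Fetch a single task by ID via the tasks.get_all endpoint."""
    # Import inside the function so the sketch only needs clearml when called
    from clearml.backend_api.session.client import APIClient

    client = APIClient()  # uses credentials from clearml.conf
    results = client.tasks.get_all(id=[task_id])
    return results[0] if results else None
```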
@<1541954607595393024:profile|BattyCrocodile47> , shouldn't be an issue. The ClearML SDK is resilient to connectivity issues: if the server goes down, the SDK will continue running and just store all the data locally. Once the server is back up, it will send everything that was waiting.
Makes sense?
There is literally a Models tab in each project
Was about to mention it 🙂
look like I created a new task for every epoch ...
What do you mean?
I would also suggest using pipelines if you want to do several actions with a task controlling the progress.
ReassuredTiger98, nothing CLI-based, but you can do it programmatically via the API quite easily.
Also, what happens if you do clearml-data delete --id <TASK_ID>? It's a bet, but it could actually work as well 🙂
You write code for a new pipeline 🙂
Dataset.get only fetches the dataset object; it doesn't try accessing files yet. What else are you doing in your code that reproduces your issue?
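To illustrate the two stages, a small sketch (assumes clearml is installed; the dataset ID is a placeholder):

```python
def materialize_dataset(dataset_id):
    """Show the two stages: fetching the dataset object vs. pulling its files."""
    from clearml import Dataset

    ds = Dataset.get(dataset_id=dataset_id)  # metadata only, no file access yet
    local_path = ds.get_local_copy()         # files are actually downloaded here
    return local_path
```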
Discussion moved to internal channels
And when you run it again under exactly the same circumstances it works fine?
Only new ones after you use SDK 1.6.0 🙂
There is an optimizer in ClearML already.
Here is an example:
https://github.com/allegroai/clearml/tree/master/examples/optimization/hyper-parameter-optimization
and some docs 🙂
https://clear.ml/docs/latest/docs/references/sdk/hpo_optimization_hyperparameteroptimizer
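As a rough sketch of what the optimizer setup looks like (the parameter names and base task ID here are placeholders; see the linked example for the full version):

```python
def build_optimizer(base_task_id):
    from clearml.automation import (
        DiscreteParameterRange,
        HyperParameterOptimizer,
        RandomSearch,
        UniformParameterRange,
    )

    return HyperParameterOptimizer(
        base_task_id=base_task_id,  # the task whose hyperparameters get mutated
        hyper_parameters=[
            UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-1),
            DiscreteParameterRange("General/batch_size", values=[16, 32, 64]),
        ],
        objective_metric_title="validation",
        objective_metric_series="loss",
        objective_metric_sign="min",  # minimize validation loss
        optimizer_class=RandomSearch,
        max_number_of_concurrent_tasks=2,
    )
```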
Also, can you verify that you still have the clearml-agent process running? Check with top / htop.
Hi @<1523701181375844352:profile|ExasperatedCrocodile76> , and now the worker clones the repo correctly?
This is because Datasets have a new view now. Just under 'Projects' on the left bar you have a button for Datasets 🙂
Hi @<1523702496097210368:profile|ScantChimpanzee51> , your steps look OK, but the error pretty much indicates a folder permissions issue. Please navigate manually to the /opt/clearml/data folder and check with "ls -al" what the user and permissions are for the "elastic_7" folder; then enter elastic_7 and check the same for its "nodes" subfolder. If the permissions are correct, try restarting the docker and check if that helps.
Huh, what an interesting issue! I think you should open a github issue for this to be followed up on.
If you remove the tags, does the page resize back?
Hi PerfectMole86 ,
how do I connect it to clearml installed outside my docker container?
Can you please elaborate?
Hi @<1543766544847212544:profile|SorePelican79> , I don't think you can track the data inside the dataset. Maybe @<1523701087100473344:profile|SuccessfulKoala55> , might have an idea
I'm afraid not, but I think it would be a cool feature request on GitHub 🙂
It means nothing is reporting iterations explicitly and no iterations are being reported by any framework. In that case, scalars will show time from start on the x-axis instead of iterations.
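For explicit reporting, a minimal sketch (assumes a reachable ClearML server; the project, task, title, and series names are arbitrary):

```python
def report_training_loss(losses):
    from clearml import Task

    task = Task.init(project_name="examples", task_name="explicit iterations")
    logger = task.get_logger()
    # Passing `iteration` explicitly makes scalar plots use iterations on the x-axis
    for i, loss in enumerate(losses):
        logger.report_scalar(title="loss", series="train", value=loss, iteration=i)
    task.close()
```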
Hi @<1524560082761682944:profile|MammothParrot39> , I think you need to run the pipeline at least once (at least the first step should start) for it to "catch" the configs. I suggest you run once with pipe.start_locally(run_pipeline_steps_locally=True)
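A minimal sketch of that first local run (the project, pipeline, and step names are placeholders):

```python
def run_pipeline_once():
    from clearml import PipelineController

    pipe = PipelineController(
        name="my-pipeline",
        project="examples",
        version="1.0.0",
    )
    # Each step references an existing task to clone and run
    pipe.add_step(
        name="step_one",
        base_task_project="examples",
        base_task_name="task to use as step one",
    )
    # Running locally once lets the pipeline "catch" the step configurations
    pipe.start_locally(run_pipeline_steps_locally=True)
```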
Did the second run, run remotely or locally?
Hi, can you give the error that is printed out?