
Hi @<1715175986749771776:profile|FuzzySeaanemone21>
and then run "clearml-agent daemon --gpus 0 --queue gcp-l4" to start the worker.
I'm assuming the docker service cannot spin up a container with GPU access; usually this means you are missing the nvidia docker runtime component
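As a quick sanity check (the CUDA image tag here is just an example), you can verify the runtime works with:

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

If that fails, installing the nvidia-container-toolkit package on the host usually solves it.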
Thank you @<1689446563463565312:profile|SmallTurkey79> !!!
I think CostlyOstrich36 managed to reproduce?!
I think this was the issue: None
And that caused the TF binding to skip logging the scalars, and from that point it broke the iteration numbering and so on.
UnevenDolphin73 are you positive, is this reproducible? What are you getting?
try these values:
import os

os.environ.update({
    'CLEARML_VCS_COMMIT_ID': '<commit_id>',
    'CLEARML_VCS_BRANCH': 'origin/master',
    'CLEARML_VCS_DIFF': '',
    'CLEARML_VCS_STATUS': '',
    'CLEARML_VCS_ROOT': '.',
    'CLEARML_VCS_REPO_URL': '<repo_url>',  # placeholder for your repository URL
})
task = Task.init(...)
so I guess this could be one reason to start thinking about upgrading ....
Wait you mean the clearml-server ? (there is no reason not to upgrade the python package)
WhimsicalLion91
What would you say is the use case for running an experiment with iterations?
That could be loss value per iteration, or accuracy per epoch (iteration is just a name for the x-axis, in a sense; this is equivalent to a time series)
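For example, manually reporting a loss per iteration looks roughly like this (the project/series names are just placeholders):

from clearml import Task

task = Task.init(project_name='examples', task_name='scalar reporting')
logger = task.get_logger()
for i in range(100):
    loss = train_step()  # hypothetical training-step function returning the loss
    logger.report_scalar(title='train', series='loss', value=loss, iteration=i)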
Make sense?
That sounds like an issue with the "working dir"; check the "Working Directory" field under the "Execution" section.
'.' means the root of the git repository
'subfolder' means run the script from that subfolder, etc. Also make sure that the script path is adjusted accordingly.
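To illustrate with a hypothetical layout: if the repository looks like

my_repo/
    subfolder/
        train.py

then "Working Directory" should be 'subfolder' and the script path should be 'train.py'.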
btw: Trains should have filled in all the correct paths... If you have time, get the latest trains (0.14.3) and run it again to see if the problem persists; we should probably fix that bug 🙂
While I'll look into it, you can do:

from clearml import OutputModel

output_model = OutputModel()
output_model.update_weights("best_model.onnx")
Awesome, PRs are always welcome, and we try to help with any request and feature coming from users. We just added audio support (RC releasing in a few days) based purely on user requests.
https://github.com/allegroai/trains/issues/120
Hi @<1566959357147484160:profile|LazyCat94>
So it seems the arg parser is detecting the configuration YAML
The first thing I would suggest is changing it to a relative path (so that when launched on remote machines it will find the YAML file)
Regardless, how are you launching the HPO? Are you spinning up a new agent?
(as background, argparse arguments are injected in real time by the agent, or by the HPO when running as subprocesses)
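As a sketch of what this usually looks like (the --config argument here is hypothetical), the default should be relative to the repository root:

from argparse import ArgumentParser

parser = ArgumentParser()
# relative path, resolved from the repo root on whatever machine the agent runs on
parser.add_argument('--config', type=str, default='configs/train.yaml')
args = parser.parse_args()

The agent / HPO can then override --config at runtime because it goes through argparse.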
Basically the links to the file server are saved in both mongo and elastic, so as long as these are host/IP based, at least in theory it should work
Yes, that sounds like a good start, DilapidatedDucks58 can you open a github issue with the feature request ?
I want to make sure we do not forget
Sorry @<1524922424720625664:profile|TartLeopard58> 😞 we probably missed it
clearml-session is still being developed 🙂
Which issue are you referring to ?
Hi @<1523703397830627328:profile|CrookedMonkey33>
If you click on the "Task Information" (on the Version Info panel, right-hand side), it will open the Task details page; there you have the "hamburger" menu at the top right, where you can publish
(Maybe we should add that to the main right click menu?!)
... if we have direct access to the Kubernetes worker when we run K8S glue?
Correct, if you have direct access to the node (on your k8s cluster) from your laptop (assuming the clearml-session is running from the laptop), everything should work
Hi LazyLeopard18 ,
So long story short, yes it does.
Longer version: to really accomplish full federated learning with control over data at the "compute points" you need some data abstraction layer. Without a data abstraction layer, federated learning is just averaging derivatives from different locations; this can be easily done with any distributed learning framework, such as Horovod, PyTorch distributed, or TF distributed.
If what you are after is, can I launch multiple experiments with the sam...
where the ui merges the plots just as we want and I was wondering if there is some simple way to do it in the case of all plots.
we can do it for scalars (this is trivial)
We can merge specific plots when they are simple, I think basic histograms.
But for any generic plots we fear the merge will just fail, and this is why it defaults to side by side.
how can I combine two plots in the ui as you mentioned?
The easiest solution is to use "report_scatter2d"; these are specific pl...
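A minimal sketch (the data here is made up); reporting two series under the same title should land them on the same plot:

import numpy as np
from clearml import Task

task = Task.init(project_name='examples', task_name='scatter demo')
logger = task.get_logger()
# two series, same title -> rendered together in the UI
xy_a = np.random.rand(10, 2)
xy_b = np.random.rand(10, 2)
logger.report_scatter2d(title='merged', series='series_a', scatter=xy_a, iteration=0, mode='lines')
logger.report_scatter2d(title='merged', series='series_b', scatter=xy_b, iteration=0, mode='lines')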
what if cleanup service is launched using ClearML-Agent Services container
The easiest is to use the container args and pass the AWS credentials as env variables:
-e AWS_ACCESS_KEY_ID=abcd -e ....
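(A slightly fuller sketch, using the standard AWS variable names; the values are placeholders:)

-e AWS_ACCESS_KEY_ID=<your_key_id> -e AWS_SECRET_ACCESS_KEY=<your_secret_key> -e AWS_DEFAULT_REGION=<region>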
Make sense ?
For example:
python examples/k8s_glue_example.py --queue k8s_gpu --namespace <namespace> --pod-clearml-conf ~/trains.conf --template-yaml example/base.yml
Hi SoreDragonfly16
The warning you mention means that someone changed the state of the experiment to "aborted", which in turn will actually kill the process.
What do you mean by "If I disable the logger," ?
We are here if you need further help 🙂
ScantMoth28 where are you seeing this warning ?
Hi FiercePenguin76
https://allegro.ai/clearml/docs/rst/references/clearml_python_ref/model_module/model_outputmodel.html
Basically:

from clearml import OutputModel

model = OutputModel()
model.update_weights(weights_filename='local_file_here.bin')
EnviousStarfish54 regarding the file server: you have one built into the trains-server, and this will be the default location to store all artifacts. You can also use external solutions like S3, GS, Azure, etc.
Regarding the models, any model store/load is automatically logged as long as you are using one of the supported frameworks (TF, Keras, PyTorch, scikit-learn)
If you want your model to be automatically uploaded, just add output_uri:
task = Task.init('examples', 'model', output_uri='http://trai...
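(For completeness, a full call would look roughly like this; the server URL is a hypothetical placeholder for your own fileserver:)

from clearml import Task

# hypothetical fileserver address; the default trains/clearml fileserver listens on port 8081
task = Task.init('examples', 'model', output_uri='http://my-server:8081')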