I think the non-master processes are trying to log something, but they have no Logger instance because they have no Task instance.
Hmm, is your code calling Logger.current_logger() directly?
Do the logs in the master process include all the training history, or do I need to concatenate the logs from the different nodes somehow?
So the main problem is that you need to pass the TASK ID that the master node creates to the second node, so it can report to the same Task.
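A minimal sketch of that idea (using the clearml package; the --rank / --master-task-id wiring below is just illustrative, share the ID however fits your setup):
```python
import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, default=0)
parser.add_argument("--master-task-id", default=None,
                    help="Task ID created by the master node (illustrative flag)")
args = parser.parse_args()

if args.rank == 0:
    # master node: create the Task and share its ID with the other nodes
    task = Task.init(project_name="examples", task_name="multi-node training")
    print("pass this to the other nodes:", task.id)
else:
    # other nodes: attach to the master's Task instead of creating a new one
    task = Task.get_task(task_id=args.master_task_id)

# every node can now report to the same Task
task.get_logger().report_scalar(
    title="loss", series="node-{}".format(args.rank), value=0.5, iteration=1)
```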
I know that the enterprise version of ClearML support...
Hi EagerOtter28
The agent knows how to do the http->ssh conversion on the fly. In your clearml.conf (on the agent's machine) set force_git_ssh_protocol: true
https://github.com/allegroai/clearml-agent/blob/42606d9247afbbd510dc93eeee966ddf34bb0312/docs/clearml.conf#L25
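For reference, this is roughly where it sits in the agent's clearml.conf (rest of the section omitted):
```
# clearml.conf on the agent's machine
agent {
    # convert http(s) git urls to ssh on the fly when cloning
    force_git_ssh_protocol: true
}
```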
Yes 🙂
BTW: do you guys do remote machine development (i.e. Jupyter / vscode-server) ?
Sounds good to me 🙂
Woot woot! 🤩
Hi AverageBee39
What are the clearml-server and clearml package versions you are using?
(It looks like some capability that is missing from the server, i.e. it needs an upgrade?!)
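If it helps, a quick way to grab the package version (the server version should be listed at the bottom of the WebApp settings page, if I remember correctly):
```
python -c "import clearml; print(clearml.__version__)"
```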
EnviousPanda91 'connect' will log the object properties; the automagic logging is controlled in the Task.init call. Specifically, which framework produces metrics that are not logged? Your sample code manually reports some scalars/values, do you see these as well?
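For example, something along these lines (the framework flags below are just an illustration, adjust per your case):
```python
from clearml import Task

# automagic logging (framework outputs, argparse, etc.) is controlled here
task = Task.init(
    project_name="examples",
    task_name="my experiment",
    auto_connect_frameworks={"tensorflow": True, "pytorch": True, "matplotlib": True},
)

# 'connect' only logs/syncs the object's properties as configuration
params = {"batch_size": 32, "lr": 0.001}
params = task.connect(params)

# manual reporting goes through the logger
task.get_logger().report_scalar(title="val", series="accuracy", value=0.93, iteration=1)
```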
Hi SubstantialBaldeagle49
Yes, you can back up the entire trains-server (see the GitHub docs on how).
You mean upgrading the server?
Yes, you can change the name or add comments (Info tab / description), and you can add key/value descriptions (under the Configuration tab, see User Properties).
CloudySwallow27 okay essentially this defs file is kind of a user "secret vault" for access credentials, is that correct?
Are you running inside a kubernetes cluster ?
I think my main point is that the k8s glue on AKS or GKE basically takes care of spinning up new nodes, since the k8s service does that. The AWS autoscaler is kind of a replacement for that, does that make sense?
Hi WackyRabbit7 ,
Running in Docker mode gives you greater flexibility in terms of environment control, from switching CUDA versions to pre-compiled packages that are needed (think apt-get), etc. Specifically for DL, if you are using multiple TensorFlow versions, they are notorious for compiling against a specific CUDA version, and the only easy way to switch between them is different dockers. If you are a PyTorch user, then you are in luck, they have all the pytorch ver...
@<1535793988726951936:profile|YummyElephant76> oh, you mean the Jupyter server was running, then inside the notebook you would start a new venv, and in that venv the "notebook" package was missing, hence it failed to detect the notebook?
StaleKangaroo85 check https://demoapp.trains.allegro.ai/projects/0e152d03acf94ae4bb1f3787e293a9f5/experiments/193ac2bced184c49a57658fceb4bd7f9/info-output/metrics/plots?columns=type&columns=name&columns=status&columns=project.name&columns=user.name&columns=started&columns=last_update&columns=last_iteration&order=last_update on the demo server, seems okay to me...
Hi WackyRabbit7
First, always check the functions on the Task object, they are the most straightforward access to the system.
Then, if you need general-purpose API calls, currently they are only documented in the doc-strings of the API schema (that said, it should be fairly well documented)
You can check all the endpoints https://github.com/allegroai/trains/tree/master/trains/backend_api/services/v2_8
And finally, if you want to easily use the RestAPI:
` from trains.backend_api.session.client impo...
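For example, something along these lines (the filters below are just placeholders):
```python
from trains.backend_api.session.client import APIClient

client = APIClient()
# e.g. fetch the 10 most recently updated completed tasks
tasks = client.tasks.get_all(
    status=["completed"],
    order_by=["-last_update"],
    page=0,
    page_size=10,
)
for t in tasks:
    print(t.id, t.name, t.status)
```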
LudicrousParrot69
I "think" I have a better handle on what you wish to do.
Is it kind of a generic "serving" solution?
FYI:
A model artifact is, usually, a weights/model file. The idea is that later you will be able to access it and serve it. Now the problem is (and I think this is what you are referring to) that there is usually a specific piece of code tied to that model that can use it (a.k.a. pyfunc)
A few ideas:
These days everyone is trying to build their models with generic interface, so that scik...
UnevenDolphin73 something like this one?
https://github.com/allegroai/clearml/pull/225
Hi JealousParrot68
Spinning up the clearml-agent with docker support (i.e. each experiment runs inside its own container):
https://clear.ml/docs/latest/docs/clearml_agent#docker-mode
Basically you can specify a default docker to use (per agent) and a specific docker container to use per Task (configured in the UI under execution at the bottom)
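For example, something like (the image name here is just an example):
```
clearml-agent daemon --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
```
And if I'm not mistaken, from code you can also set the per-Task container with task.set_base_docker(...), same as editing it in the UI.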
Make sure you follow all the steps:
https://clear.ml/docs/latest/docs/deploying_clearml/upgrade_server_linux_mac
(basically make sure you get the latest docker-compose.yml and then pull it)
curl https://raw.githubusercontent.com/allegroai/clearml-server/master/docker/docker-compose.yml -o /opt/clearml/docker-compose.yml
docker-compose -f /opt/clearml/docker-compose.yml pull
docker-compose -f /opt/clearml/docker-compose.yml up -d
LudicrousParrot69 this is an implementation issue; this entire page is based on "task comparison", and a single Task means a totally different interface for querying the data 🙂
BeefyCow3 if you are trying to optimize a specific metric (i.e. a scalar on a graph), the template Task should report it with the same title/series combination, which should be easy enough to verify in the UI 🙂
You can either report with Tensorboard or with the Trains Logger, either way will work.
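e.g. with the Trains Logger it would look something like this (the title/series names are just placeholders, as long as the optimizer is configured with the same combination):
```python
from trains import Task

task = Task.init(project_name="examples", task_name="hpo template")
logger = task.get_logger()

# the optimizer will look up this exact title/series combination
for iteration, accuracy in enumerate([0.71, 0.78, 0.83]):
    logger.report_scalar(title="validation", series="accuracy",
                         value=accuracy, iteration=iteration)
```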
JitteryCoyote63 it should just "freeze" after a while as it will constantly try to resend logs. Basically you should be fine 🙂
(If for some reason something crashed, please let me know so we can fix it)
JitteryCoyote63 I think that with 0.17.2 we stopped mounting the venv build to the host machine, which means it is all stored inside the docker.
it will constantly try to resend logs
Notice this happens in the background; in theory you will just get stderr messages when it fails to send, but the training should continue