I can definitely feel you!
(I think the implementation is not trivial, metrics data size is collected and stored as commutative value on the account, going over per Task is actually quite taxing for the backend, maybe it should be an async request ? like get me a list of the X largest Tasks? How would the UI present it? As fyi, keeping some sort of book keeping per task is not trivial either, hence the main issue)
Like get the tasks that uses the most metrics API?
Hi @<1571308003204796416:profile|HollowPeacock58>
parameters = task.connect(config, name='config_params')
It seems that your DotDict does not support the python copy
operator?
i.e.
from copy import copy
copy(DotDict())
fails ?
Yes, this seems like the problem, you do not have an agent (trains-agent) connected to your server.
The agent is responsible for pulling the experiments and executing them.pip install trains-agent trains-agent init trains-agent daemon --gpus all
What should have happened is the experiments should have been pending (i.e. in a queue)
(Not sure why they are not).
You can manually send them for execution , right click on an experiment in the able, select enqueue and select the default queue (This will be the one the trains-agent will pull from , by default)
A more detailed instructions:
https://github.com/allegroai/trains-agent#installing-the-trains-agent
ShaggyHare67 notice that the services queue is designed to run CPU based tasks like monitoring etc.
For the actual training you need to run your trains-agent
on a GPU machine.
Did you run the trains-agent init
? it will walk you through the configuration (git user/pass) included.
If you want to manually add them, you can see an example of the configuration file in the link below.
You can find it on ~\trains.conf
https://github.com/allegroai/trains-agent/blob/master/docs/tr...
Hi ShaggyHare67 ,
Yes the trains.conf created by trains-agent
is basically an extension of the trains
usage (specifically it adds a section for the agent)
I'm assuming you are running the agent on the same development machine.
I guess the easiest is to rename the trains.conf to trains.conf.old and run trains-agent init
(No need to worry, the trains package supports it , so the new configuration file that will be generated will work just fine)
ShaggyHare67
Now the
trains-agent
is running my code but it is unable to import
trains
...
What you are saying is you spin the 'trains-agent' inside a docker? but in venv mode ?
On the server I have both python (2.7) and python3,
Hmm make sure that you run the agent with python3 trains-agent
this way it will use the python3 for the experiments
Hi @<1575656665519230976:profile|SkinnyBat30>
Streamlit apps are backend run (i.e. the python code drives the actual web app)
This means running your Tasks code and exposing the web app (i.e. http) streamlit.
This is fully supported with ClearML, but unfortunately only in the paid tiers 😞
You can however run your Task with an agent, make sure the agent's machine is accessible and report the full IP+URL as a hyper-parameter or property, and then use that to access your streaml...
Can the host server's service agent be used?
In theory yes, just make sure you expose the containers network (check the docker compose)
or shall I call the Task.init even from the agent
WorriedParrot51 I think something is lost here.
Task.init() is always called, even when the agent is executing the code. The difference is in what happens inside the Task.init() call. When the codebase itself is executed by the trains-agent, it signals through OS environment to the task.init() that instead of a new created task, it should use the already created one. from this point all data flows from the trains-server back into the c...
Hi WorriedParrot51
Take a look at the Experiment execution section:
there is script
and working directory
working directory is the base of the git repository (which is cloned into the docker file)
So if for some reason trains did not properly detect the current working dir here is what should solve the issue, without changing the PYTHONPATH
script path: ./sub_folder/scripy.py working directory: .
What do you think?
Oh, did you try task.connect_configuration
?
https://allegro.ai/docs/examples/reporting/model_config/#using-a-configuration-file
I am trying to see if the user can submit a list of resource requirements (e.g 4GPUs, 12 cores, 100GB diskspace) for the task when queuing the task and the agents pick these tasks if they have the requested resources. With this, the user need not think about which queue to send the task to. The users just state what they need and the agents do the scheduling for them.
Can I assume we are talking Kubernetes under the hood for the resource allocation ?
SlipperyDove40 Yes there isTRAINS_CONFIG_FILE
https://allegro.ai/docs/faq/faq/#trains-configuration
SmarmySeaurchin8 just so that I don't miss anything.
One machine, two trains-agents each one connected to a different trains-server, correct ?
from the trains-agent --help
trains-agent --config-file /home/user/my_trains_server1.conf daemon trains-agent --config-file /home/user/my_trains_server2.conf daemon
does the clearml server is a worker i can serve on models?
The serving is done by one of the clearml-agents.
Basically you spin an agent, then this agent is spinning the model serving engine container (fully managed).
(1) install run run clearml-agent (2) run clearml-session CLI to configure and spin the serving engine
Hi JitteryCoyote63
The easiest is to inherit the ResourceMonitor class and change the default logging rate (you could also disable some of the metrics).
https://github.com/allegroai/clearml/blob/701fca9f395c05324dc6a5d8c61ba20e363190cf/clearml/task.py#L565
Then pass the new class to Task.init as auto_resource_monitoring
Hi SmallDeer34
I need some help what is the difference between the manual one and the automatic one ?
from your previous log, this is the bash command executed inside the container, can you try to "step by step" try to catch who/what is messing it up ?
` docker run -it --gpus "device=1" -e CLEARML_WORKER_ID=Gandalf:gpu1 -e CLEARML_DOCKER_IMAGE=nvidia/cuda:11.4.0-devel-ubuntu18.04 -v /home/dwhitena/.git-credentials:/root/.git-credentials -v /home/dwhitena/.gitconfig:/root/.gitconfig -v /tmp/...
But a warning instead of an error would be good.
Yes, that makes sense, I'll make sure we do that
Does this sound like a reasonable workflow, or is there a better way maybe?
makes total sense to me, will be part of next RC 🙂
when I run it on my laptop...
Then yes, you need to set the default_output_uri
on Your laptop's clearml.conf (just like you set it on the k8s glue)
Make sense ?
ShinyLobster84
fatal: could not read Username for '
': terminal prompts disabled
This is the main issue, it needs git credentials to clone the repo code, containing the pipeline logic (this is the exact same behaviour as pipeline v1 execute_remotely(), which is now the default, could it be that before you executed the pipeline logic, locally ?)
WackyRabbit7 could the local/remote pipeline logic could apply in your case as well ?
So basically the APIClient is a pythonic interface to the RestAPI, so you can do the following
See if this one works# stats from he last 60 seconds for worker in workers: print(client.workers.get_stats(worker_ids=[worker.id], from_date=int(time()-60),to_date=int(time()), interval=60, ))
I started running it again and it seems to have passed the phase where it failed last time
Yey!
Yes it is a common case....
I have the feeling ShinyLobster84 WackyRabbit7 you are not alone in this one 🙂 let me make sure we change the default value of Yes it is a common case
to False, so the code looks cleaner
Is this a common case? maybe we should change the run_pipeline_steps_locally
argument to False?
(The idea of run_pipeline_steps_locally=True
is that it will be easier to debug the entire pipeline on the same machine)
Yes, I find myself trying to select "points" on the overview tab. And I find myself wanting to see more interesting info in the tooltip.
Yep that's a very good point.
The Overview panel would be extremely well suited for the task of selecting a number of projects for comparing them.
So what you are saying, this could be a way to multi select experiments for detailed comparison (i.e. selecting the "dots" on the overview graph), is this what you had in mind?