From the top:
- trains-agent pulls a service Task
- The Task is marked as running, and the trains-agent worker points to the Task
- A docker is spun up
- The environment is installed inside the docker (results are shown in the service Task log)
- trains-agent inside the docker is launched, and a new node appears in the system as <host_agent_name>:service:<task_id>, with the Task service listed as running on it
- The main trains-agent goes back to idle, and its worker no longer has an experiment listed as running
Where do you think it breaks?
Great, please feel free to share your thoughts here 🙂
Hmm that is odd, can you send an email to support@clear.ml ?
Hi @<1523704667563888640:profile|CooperativeOtter46>
Is there a way to set the name/path of the
requirements.txt
file the agent uses to install packages?
When the agent is installing packages, it takes them from the "Installed Packages" section of the Task. Only if that section is empty will it revert to the "requirements.txt" from the git repository.
That said, you can add the following to your "Installed Packages":
-r my_other_requirements.txt
And the agent will `my_...
JitteryCoyote63 hacky but sure 🙂
`from trains.config import config_obj`
`print(config_obj)`
CurvedHedgehog15 is it plots or scalars you are after ?
Suppose that I have three models and these models can't be loaded simultaneously into GPU memory
Oh!!!
For now, this is the behavior I observe: Suppose I have two models, A and B. ....
Correct
Yes this is a current limitation of the Triton backend BUT!
we are working on a new version that does Exactly what you mentioned (because it is such a common case that some models are not used very frequently)
The main caveat is the loading time, re-loading models from dist...
@<1651395720067944448:profile|GiddyHedgehong81> just to be clear, Dataset.get_local_copy returns a path to your files,
You have to Manually add the additional path to the specific files you need to use. It does Not know that in advance.
That was the initial issue you had, and I assume it is the same one here. Does that make sense?
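For example, a minimal sketch of that flow (the dataset ID and file name below are hypothetical):
```python
import os
from clearml import Dataset

# Hypothetical dataset ID, for illustration only
dataset = Dataset.get(dataset_id="<your_dataset_id>")

# get_local_copy() returns the path of a local (cached) copy of the whole dataset folder
local_folder = dataset.get_local_copy()

# You still have to point at the specific file yourself - the SDK does not know it in advance
my_file = os.path.join(local_folder, "my_file.csv")
```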
Hi @<1569858449813016576:profile|JumpyRaven4>
task.add_requirements()
This is the problem: if you look closely, this is a class method meant to help Task.init better capture python packages; it does Not change the task's requirements.
To do that, use `task.set_packages`
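A rough sketch, assuming a clearml version that exposes `Task.set_packages` (project/task names and the packages below are made up):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="my job")

# add_requirements() only helps Task.init auto-capture packages;
# set_packages() explicitly sets what the agent will install
# (it takes a list of packages, or a requirements file path)
task.set_packages(["numpy>=1.24", "pandas"])
```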
After it finishes the 1st Optimization task, what's the next job that will be pulled?
The one in the highest queue (if you have multiple queues)
If you use fairness it will pull in round robin from all queues (obviously, inside every queue it is based on the order of the jobs).
fyi, you can reorder the jobs inside the queue from the UI 🙂
DeliciousBluewhale87 wdyt?
Hi TroubledJellyfish71
What do you have listed in the Task's execution "Installed Packages" section (of the original Task)?
How did it end up with an http link for pytorch?
Usually it would be torch==1.11 ...
EDIT:
I'm assuming the original Task was executed on a Mac M1; what do you get when calling `pip freeze`?
And where is the agent running ? (and is it venv or docker mode?)
I think it can only run on multiple GPUs on one node
Okay, the first step is to make sure your code is multi-node enabled, there is no magic for that 🙂
Can you test with the latest RC?
`pip install clearml==1.0.3rc0`
Hmm could it be this is on the "helper functions" ?
Hmm Could you check if it makes a difference importing ClearML before shap ?
If this changes nothing, could you put together a standalone script to reproduce the issue?
You can put a breakpoint here, and see what you are sending:
https://github.com/allegroai/trains/blob/17f7d51a93deb52a0e7d6cdd59da7038b0e2dd0a/trains/backend_api/session/session.py#L220
This, however, requires that I slightly modify the clearml helm chart with the aws-autoscaler deployment, right?
Correct 🙂
Hi @<1610083503607648256:profile|DiminutiveToad80>
Yes, it does. They are also cached by default (on the machine with the agent)
None
Done!
Thanks
`fatal: unable to find a suitable socket path; use --socket`
I think that's the root cause; we should probably also add https://github.com/allegroai/trains-agent/issues/16
Is it ClearML best practice to create a draft pipeline, to have the task on the server so that it can be cloned, modified and executed at any time?
Well it is, we just assume that you executed the pipeline somewhere (i.e. made sure it works) 🙂
Correction:
What you are actually looking for (and I will make sure we have it in the docs) is `pipeline.start(queue=None)`
It will just leave it as is, so you can manually enqueue / clone it 🙂
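A minimal sketch of that pattern (pipeline and step names below are made up):
```python
from clearml import PipelineController

pipe = PipelineController(name="my_pipeline", project="examples", version="1.0.0")
pipe.add_step(
    name="step_one",
    base_task_project="examples",
    base_task_name="task A",
)

# Creates the pipeline Task on the server as a draft instead of enqueuing it,
# so it can be cloned / modified / enqueued manually later
pipe.start(queue=None)
```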
However, the pipeline experiment is not visible in the project experiment list.
I mean press on the "full details" in the pipeline page
MotionlessCoral18 so did it solve the issue ?
OHH nice, I thought it was just some kind of job queue on up-and-running machines
It's much more than that, it's a way of life 🙂
But seriously now, it allows you to use any machine as part of your cluster, and send jobs for execution from the web UI (any machine, even just a standalone GPU machine under your desk, or any cloud GPU instance, or any mix of the two 🙂)
Maybe I need to change something here:
apiserver.conf
Not sure, I'm still waiting on an answer... It...
Is this consistent on the same file? Can you provide a code snippet to reproduce (or to understand the flow)?
Could it be two machines are accessing the same cache folder ?
So I think this is a good example of pipelines and data:
Basically Task A generates data stored using clearml-data (see the Dataset class). The output of that is the ID of the Dataset. Then Task B uses that ID to retrieve the Dataset created by Task A (see the sketch after the links below).
documentation
https://github.com/allegroai/clearml/blob/master/docs/datasets.md
Example:
Step A creating Dataset:
https://github.com/alguchg/clearml-demo/blob/main/process_dataset.py
Step B training model using the Dataset created in ...
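A rough end-to-end sketch of that flow (project/dataset names and paths below are made up):
```python
from clearml import Dataset, Task

# --- Task A: create the dataset with clearml-data and pass along its ID ---
dataset = Dataset.create(dataset_name="my_dataset", dataset_project="examples")
dataset.add_files("data/")        # hypothetical local folder with the raw files
dataset.upload()
dataset.finalize()
dataset_id = dataset.id           # this ID is what Task B receives

# --- Task B: retrieve the dataset created by Task A using that ID ---
task = Task.init(project_name="examples", task_name="train")
local_copy = Dataset.get(dataset_id=dataset_id).get_local_copy()
# ... train the model on the files under local_copy ...
```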
You could change infrastructure or hosting, and now your data is associated with the wrong URL
Yeah that makes sense, so have it on a specific dns name? (this is usually the case with k8s deployments)
Do you have two agents pulling from the same queue ?
Maybe one of them is configured differently ?
Hi WackyRabbit7
So I'm assuming after the start_locally is called ?
Which clearml version are you using ?
(just making sure, calling Task.current_task() before starting the pipeline returns the correct Task?)
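i.e. something along these lines, right before the pipeline is started:
```python
from clearml import Task

# Sanity check: this should print the correct Task (not None)
# when called just before the pipeline is started locally
print(Task.current_task())
```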
I want to be able to install the venv on multiple servers and start the "simple" agents on each one of them. You can think of it as some kind of one-off agent for a specific (distributed) hyperparameter search task
ExcitedFish86 Oh if this is the case:
in your clearml.conf:
`agent.package_manager.type: conda`
`agent.package_manager.conda_env_as_base_docker: true`
https://github.com/allegroai/clearml-agent/blob/36073ad488fc141353a077a48651ab3fabb3d794/docs/clearml.conf#L60
https://git...