` from clearml.automation.parameters import LogUniformParameterRange
sampler = LogUniformParameterRange(name='test', min_value=-3.0, max_value=1.0, step_size=0.5)
sampler.to_list()
Out[2]:
[{'test': 1.0},
{'test': 3.1622776601683795},
{'test': 10.0},
{'test': 31.622776601683793},
{'test': 100.0},
{'test': 316.22776601683796},
{'test': 1000.0},
{'test': 3162.2776601683795}] `
thanks @<1715900788393381888:profile|BitingSpider17> for attaching the log, it really helps!
Notice from the log:
'-v', '/home/clearml/.clearml/cache:/clearml_agent_cache'
and as expected we also get:
sdk.storage.cache.default_base_dir = /clearml_agent_cache
Yet I can see the error you pointed:
FileNotFoundError: [Errno 2] No such file or directory: '/clearml_agent_cache/storage_manager/datasets'
Now, could it be that the same folder is used for both root and...
Is there no await/synchronize method to wait for task update?
Yes, but then we will have to relaunch it (not unthinkable), but I'm still looking for the immediate value of doing all that work, wdyt?
Hmmm why don't you use "series" ?
(Notice that with iterations, there is a limit to the number of images stored per title/series, which is configurable in trains.conf, in order to avoid debug sample explosion)
Hmm I would recommend passing it as an artifact, or returning its value from the decorated pipeline function. Wdyt?
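For example, a quick sketch with the decorator syntax (the step and argument names here are just placeholders):
` from clearml.automation.controller import PipelineDecorator

# placeholder step: its return value is stored by the pipeline
@PipelineDecorator.component(return_values=["model_path"])
def train_step(dataset_id):
    model_path = "/tmp/model.pkl"
    # ... training code that writes the model to model_path ...
    return model_path

@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="1.0")
def run_pipeline(dataset_id="dummy"):
    # returning the step's value from the decorated pipeline function
    # makes it available to whoever called run_pipeline()
    return train_step(dataset_id) `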
... I've seen parameters connect and task create in seconds, and other times it takes 4 minutes.
This might be your backend (clearml-server) replying slowly because of load?
Is there a way (at the class level) to control the retry logic on connecting to the API server?
The difference in the two screenshots is literally only the URLs in
clearml.conf
and it went from 30s down to 2-3s.
Yes that could be network, also notice that there is aut...
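If it helps, the retry behavior against the API server can usually be tuned from clearml.conf; the exact keys below are my assumption based on the SDK's default api configuration, so please double-check them against your clearml version:
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0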
We created an account, setup our data pipeline, and now we can't get back in. Nothing is in the project. Can someone from support reach out to help?
Hi @<1545216077846286336:profile|DistraughtSquirrel81>
You mean in the SaaS? (app.clearml.ml) or is it a local installation?
If this is the SaaS, could it be the data is on a different workspace ? (you can switch workspace and refresh the page)
Hi ShakyJellyfish91
It seems clearml is using a single connection, which makes the download take a long time
Hmm, I found this one:
https://github.com/allegroai/clearml/blob/1cb5dbb276026644ae20fef63d58256cdc887818/clearml/storage/helper.py#L1763
Does max_connections=10 mean 10 concurrent connections ?
. I guess this can be built in as a feature into ClearML at some future point.
VexedCat68 you mean referencing an external link?
VexedCat68 both are valid. In case the step was cached (i.e. already executed) the node.job will be None, so it is probably safer to get the Task based on the "executed" field which stores the Task ID used.
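Something along these lines should work (a rough sketch; node here is one of the pipeline controller's nodes):
` from clearml import Task

def get_step_task(node):
    # If the step was cached it was not re-executed, so node.job is None,
    # but node.executed still holds the ID of the Task that was actually used
    if node.executed:
        return Task.get_task(task_id=node.executed)
    # Otherwise the step really ran and the job object holds the Task
    return node.job.task if node.job else None `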
Hi GrittyCormorant73
When I archive the pipeline and go into the archive and delete the pipeline, the artifacts are not deleted.
Which clearml-server version are you using? The artifact delete was only recently added
Just one more question, do you have any idea about how I could change the x-axis label from "Iterations" to "Epochs"
You mean in the UI (i.e. just the title) ? or are you actually reporting iterations instead of epochs? and if so is this auto connected to tensorboard or is it reported manually ?
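If it is manual reporting, a minimal sketch would be to report once per epoch so the x-axis effectively counts epochs (values below are placeholders):
` from clearml import Task

task = Task.init(project_name="examples", task_name="report per epoch")
logger = task.get_logger()

for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)  # placeholder value
    # using the epoch index as the "iteration" makes every x-axis step one epoch
    logger.report_scalar(title="loss", series="train", value=train_loss, iteration=epoch) `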
From the top
- trains-agent pulls a service Task
- the Task is marked as running, and the trains-agent worker points to the Task
- Docker is spun up
- the environment is installed inside the docker (results are shown in the service Task Log)
- trains-agent inside the docker is launched, a new node appears in the system as <host_agent_name>:service:<task_id>, and the Task service is listed as running on it
- the main trains-agent is back to idle and its worker now has no experiment listed as running
Where do you think it breaks?
Great, please feel free to share your thoughts here 🙂
Hmm that is odd, can you send an email to support@clear.ml ?
Hi @<1523704667563888640:profile|CooperativeOtter46>
Is there a way to set the name/path of the
requirements.txt
file the agent uses to install packages?
When the agent is installing packages it takes them from the "Installed Packages" section of the Task. Only if that section is empty will it revert to the "requirements.txt" from the git repository.
That said, you can add the following to your "Installed Packages":
-r my_other_requirements.txt
And the agent will `my_...
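If you prefer doing it from code, here is a minimal sketch (assuming your clearml version supports passing a requirements file path to Task.add_requirements, and that the file sits at the repository root):
` from clearml import Task

# point the "Installed Packages" section at a specific requirements file
# (call this before Task.init, otherwise it has no effect)
Task.add_requirements("my_other_requirements.txt")

task = Task.init(project_name="examples", task_name="custom requirements") `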
JitteryCoyote63 hacky but sure 🙂
` from trains.config import config_obj
print(config_obj) `
Suppose that I have three models and these models can't be loaded simultaneously on GPU memory(
Oh!!!
For now, this is the behavior I observe: Suppose I have two models, A and B. ....
Correct
Yes this is a current limitation of the Triton backend BUT!
we are working on a new version that does Exactly what you mentioned (because it is such a common case that some models are not being used that frequently)
The main caveat is the loading time, re-loading models from dist...
After it finishes the 1st Optimization task, what's the next job that will be pulled?
The one in the highest queue (if you have multiple queues)
If you use fairness it will pull in round robin from all queues, (obviously inside every queue it is based on the order of jobs).
fyi, you can reorder the jobs inside the queue from the UI 🙂
DeliciousBluewhale87 wdyt?
Hi TroubledJellyfish71
What do you have listed in the Task's execution "installed packages" section (of the original Task)?
How did it end up with an http link of pytorch ?
Usually it would be torch==1.11 ...
EDIT:
I'm assuming the original Task was executed on a Mac M1, what are you getting when calling pip freeze ?
And where is the agent running ? (and is it venv or docker mode?)
I think it can only run on multiple GPUs on one node
Okay, the first step is to make sure your code is multi-node enabled, there is no magic for that 🙂
Hmm could it be this is on the "helper functions" ?
Hmm Could you check if it makes a difference importing ClearML before shap ?
If this changes nothing, could you put a standalone script to reproduce the issue ?
You can put a breakpoint here, and see what you are sending:
https://github.com/allegroai/trains/blob/17f7d51a93deb52a0e7d6cdd59da7038b0e2dd0a/trains/backend_api/session/session.py#L220
This, however, requires that I slightly modify the clearml helm chart with the aws-autoscaler deployment, right?
Correct 🙂
Hi @<1610083503607648256:profile|DiminutiveToad80>
Yes, it does. They are also cached by default (on the machine with the agent)
None
Done!
Thanks
fatal: unable to find a suitable socket path; use --socket
)
I think that's the root cause, we should probably also add https://github.com/allegroai/trains-agent/issues/16
ClearML best practice to create a draft pipeline to have the task on the server so that it can be cloned, modified and executed at any time?
Well it is, we just assume that you executed the pipeline somewhere (i.e. made sure it works) 🙂
Correction:
What you are actually looking for (and I will make sure we have it in the doc) is pipeline.start(queue=None). It will just leave it as is, so you can manually enqueue / clone it 🙂
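i.e. something like this sketch (the step definition below is a placeholder):
` from clearml import PipelineController

pipe = PipelineController(name="my pipeline", project="examples", version="1.0.0")
pipe.add_step(name="stage_one", base_task_project="examples", base_task_name="step one task")

# queue=None leaves the pipeline Task as a draft on the server,
# so it can later be cloned / modified / enqueued manually
pipe.start(queue=None) `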
However, the pipeline experiment is not visible in the project experiment list.
I mean press on the "full details" in the pipeline page