MiniatureHawk42 could you re-test with the latest from github?
pip install -U git+
Hi CharmingPuppy6
Basically yes, there is.
The way clearml is designed is to have queues abstract different types of resources: for example, a queue for single gpu jobs (let's name it "single_gpu") and a queue for dual gpu jobs (let's name it "dual_gpu").
Then you spin agents on machines and have the agents pull jobs from specific queues based on the hardware they have. For example we can have a 4 GPU machine with 3 agents, one agent connected to 2xGPUs and pulling Tasks from the "dual_gpu...
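For example, a minimal sketch of sending a Task to one of those queues from code (project, task and queue names just follow the example above):

```python
from clearml import Task

# create the task as usual
task = Task.init(project_name="examples", task_name="train-dual-gpu")

# stop executing locally and enqueue the task on the queue that the
# dual-GPU agents are listening to
task.execute_remotely(queue_name="dual_gpu", exit_process=True)
```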
DistressedGoat23
We are running hyperparameter tuning (using some cv) which might take a long time and might even be aborted unexpectedly due to machine resources.
We therefore want to see the progress
On the HPO Task itself (not the individual experiments, but the one controlling it all) there is the global progress of the optimization metric, is this what you are looking for? Am I missing something?
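For reference, a rough sketch of such an HPO controller (the base task ID, metric names and parameter ranges are placeholders; OptimizerOptuna assumes optuna is installed). Its own scalars show the global optimization progress:

```python
from clearml import Task
from clearml.automation import (
    DiscreteParameterRange, HyperParameterOptimizer, UniformParameterRange,
)
from clearml.automation.optuna import OptimizerOptuna  # requires optuna

# the controller task; its scalars track the global objective progress
task = Task.init(project_name="examples", task_name="HPO controller",
                 task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id="<base-training-task-id>",  # template experiment to clone
    hyper_parameters=[
        UniformParameterRange("General/lr", min_value=1e-4, max_value=1e-1),
        DiscreteParameterRange("General/batch_size", values=[32, 64, 128]),
    ],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    optimizer_class=OptimizerOptuna,
    execution_queue="default",
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()   # blocks until the optimization is done
optimizer.stop()
```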
Hi HelpfulHare30
I mean situations when training is long and its parts can be parallelized in some way like in Spark or Dask
Yes, that makes sense: in both cases the function we are parallelizing is usually bottlenecked on both data & cpu, and both frameworks try to split & stream the data.
ClearML does not do data split & stream, but what you can do is launch multiple Tasks from a single "controller" and collect the results. I think that one of the main differences is that a ClearML Task is ...
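A rough sketch of that controller pattern (the template task ID, parameter name and queue are placeholders):

```python
from clearml import Task

# template experiment that every worker clones
template = Task.get_task(task_id="<template-task-id>")

children = []
for i, split in enumerate(["part0", "part1", "part2"]):
    child = Task.clone(source_task=template, name=f"worker-{i}")
    child.set_parameter("General/data_split", split)  # per-worker input
    Task.enqueue(child, queue_name="default")
    children.append(child)

# wait for the workers and collect their reported scalars
for child in children:
    child.wait_for_status()
    print(child.name, child.get_last_scalar_metrics())
```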
That's the question I want to raise too,
No file size limit
Let me try to run it myself
Build your containers off these two? Or are you building directly from code?
cuda 10.1, I guess this is because no wheel exists for torch==1.3.1 and cuda 11.0
Correct
how can I enforce a specific wheel to be installed?
You mean like a specific CUDA wheel?
you can simply put the http link to the wheel in the "installed packages", it should work
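A sketch of doing the same from code (the wheel URL is a placeholder, and passing a direct-URL requirement line to Task.add_requirements is an assumption here; editing the "installed packages" section in the UI achieves the same):

```python
from clearml import Task

# must be called before Task.init so the line ends up in the task's
# "Installed Packages" (the wheel URL below is a placeholder)
Task.add_requirements(
    "torch @ https://download.pytorch.org/whl/cu101/torch-1.3.1-cp36-cp36m-linux_x86_64.whl"
)
task = Task.init(project_name="examples", task_name="pinned-wheel")
```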
I'm not sure I follow the example... Are you sure this experiment continued a previous run?
What was the last iteration on the previous run ?
BurlyPig26 if this is about Task.init delaying execution, did you check Task.init(..., deferred_init=True)? It will execute the initialization in the background without disturbing execution.
If this is about Model auto logging, see Task.init(..., auto_connect_frameworks): you can specify per framework a wildcard to log the models, or disable it completely https://github.com/allegroai/clearml/blob/b24ed1937cf8a685f929aef5ac0625449d29cb69/clearml/task.py#L370
Make sense?
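Something like this (project/task names are placeholders):

```python
from clearml import Task

# option 1: run Task.init in the background so it does not block startup
task = Task.init(project_name="examples", task_name="demo",
                 deferred_init=True)

# option 2 (instead of the above): control model auto-logging per
# framework, e.g. store only *.pt PyTorch models; pass False to disable
# task = Task.init(project_name="examples", task_name="demo",
#                  auto_connect_frameworks={"pytorch": "*.pt"})
```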
Hmm, it seems as if the task.set_initial_iteration(0) is ignored...
What's the clearml version you are using?
Is it the same one you have on the local machine ?
Thanks SubstantialElk6!
I believe an initial fix was pushed :) A full one (merging Task --env with the k8s template) will be added soon
NaughtyFish36
No module named 'leap.learn.data_tools.merge_data.merge_data'
This seems to be the error, but I cannot see leap in the installed packages. Notice that if the Task has an "Installed Packages" section then the agent will use that, not the "requirements.txt"; only if this section is empty will it revert to the "requirements.txt" in the repo.
How did you create the Task in the first place?
I see that you added "leap" into the initial bash script, actually you should add i...
Hmmm, I'm not sure that you can disable it. But I think you are correct it should be possible. We will add it as another argument to Task.init. That said, FriendlyKoala70 what's the use case for disabling the code detection? You don't have to use it later, but it is always nice to know :)
In my understanding requests still go through clearml-server, which configuration I left
DefiantHippopotamus88 actually this is Not correct.
clearml-server only acts as a control plane, no actual requests are routed to it; it is used to sync model state, stats etc., and is not part of the request processing flow itself.
curl: (56) Recv failure: Connection reset by peer
This actually indicates port 9090 is not being listened on...
What's the final docker-compose you are usi...
And anyway once a model is published you can't update it, right?
Correct, and also the creating Task is published (i.e. locked)
I'm kind of at a point where I don't know a lot of what to even search for.
We feel you :) Yes, there still isn't a very good source of information on where to get started...
This is because the entire field is constantly changing and evolving, and one solution will usually only apply to a specific use case...
I would start with the mlops community slack channel, and youtube talks (specifically those by companies describing how they built their own internal infrastructure, i...
... if we have direct access to the Kubernetes worker when we run K8S glue?
Correct, if you have a direct access to the Node (on your k8s cluster) from your laptop (assuming the clearml-session is running from the laptop), everything should work
Also could you explain the difference between trigger.start() and trigger.start_remotely()
Start will start the trigger process (the one "watching the changes") locally (this makes sense for debugging etc.)
start_remotely will launch the trigger process on the "services" queue, where it should live forever :)
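A minimal sketch of both (task ID, project, queue and tag names are placeholders):

```python
from clearml.automation import TriggerScheduler

trigger = TriggerScheduler(pooling_frequency_minutes=3)

# launch a copy of <task-id-to-launch> whenever a model in the project
# gets one of the listed tags
trigger.add_model_trigger(
    name="retrain-on-tag",
    schedule_task_id="<task-id-to-launch>",
    schedule_queue="default",
    trigger_project="examples",
    trigger_on_tags=["production"],
)

# trigger.start()                         # local, for debugging
trigger.start_remotely(queue="services")  # lives on the services queue
```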
Okay so when I add trigger_on_tags, the repetition issue is resolved.
Nice!
This problem occurs when I'm scheduling a task. Copies of the task keep being put on the queue ...
MuddySquid7 the fix was pushed to GitHub, you can now install directly from the repo:
pip install git+
Ohh "~/trains.conf" is root probably
ReassuredTiger98 I guess this is a plotly feature, nonetheless I think you can shift the Y axis manually (click and drag)
I'm trying to achieve a workflow similar to the one
You mean running everything on a single machine (manually)?
JitteryCoyote63 I meant to store the parent ID as another "hyper-parameter" (under its own section name) not the data itself.
Makes sense?
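A minimal sketch of what I mean (section and parameter names are placeholders):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="child-train")

# store the parent's ID under its own section, not the data itself
task.set_parameter("Datasets/parent_id", "<parent-task-id>")

# later (e.g. in a cloned run) read it back and fetch whatever you need
parent_id = task.get_parameter("Datasets/parent_id")
parent = Task.get_task(task_id=parent_id)
```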
There seems to be a problem with multiprocessing: Although I stopped the task,
You mean you "aborted the task" from the UI?
- There is a memory leak somewhere, please see the screenshot of datadog memory consumption
I'm assuming from the leftover processes?
Python 3.8/Pytorch 1.11/clearml-sdk 1.9.0/clearml-agent 1.4.1
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
I wanted to know what the best way to create and register the SSL keys is.
Oh I see, so basically you need to add nginx with SSL certificates on top of the hosted service (or configure the docker-compose nginx container to add that)
Then you need to add the self-signed SSL into any host machine (I'm assuming these are not "valid" SSL certificates generated by a reputable SSL provider)
But generally speaking if you are using self hosted clearml-server on a local machine that n...
Thanks @doru! BTW if you are running code from outside the trains repo, do you still get the double package?
Funny enough I'm running into a new issue now.
Sorry my bad, I should have known :) yes it probably should be packages=["clearml==1.1.6"]
BTW: do you have any imports inside the pipeline function itself? If you do not, then there is no need to pass "packages" at all, it will just add clearml
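A sketch of where "packages" fits in a decorated pipeline (pipeline/step names are placeholders):

```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["result"],
                             packages=["clearml==1.1.6"])
def step_one(x):
    # no imports inside the body, so "packages" could even be dropped
    # and clearml alone would be added automatically
    return x * 2

@PipelineDecorator.pipeline(name="demo-pipeline", project="examples",
                            version="0.1")
def pipeline_logic():
    print(step_one(16))

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # debug the whole pipeline locally
    pipeline_logic()
```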
ShinyWhale52 any time :)
Feel free to followup with more questions
Hi FiercePenguin76
By default clearml will list only the packages you import, and not derivative packages.
This means that if you import package X and it imports package Y, only package X will be listed.
The way it should work is by statically analyzing the entire repository, but if you import a local package from a different local folder, and that folder is Not in the same repo, it will not get listed (obviously if you install the external local package, it will be...
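If you need such a package listed explicitly (the derivative package Y above, for instance), a sketch; the package name/version are placeholders:

```python
from clearml import Task

# explicitly add a requirement the static analysis would not list;
# must be called before Task.init
Task.add_requirements("package_y", ">=1.0.0")
task = Task.init(project_name="examples", task_name="manual-requirements")
```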