Oh, task_id is the Task ID of step 2.
Basically the idea is: you run your code once (let's call it debugging/programming); that run creates a Task in the system, and the Task stores the environment definition and the arguments used. Then you can clone that Task and launch it on another machine using the Agent (which will set up the environment based on the Task definition and run your code with the new arguments). The Pipeline is basically doing that for you (i.e. cloning a task chan...
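A minimal sketch of that clone-and-enqueue flow, assuming the usual Task API (the task ID, parameter name, and queue name below are placeholders):
from clearml import Task

# The original "debugging" run created a Task; grab it by its ID
template = Task.get_task(task_id="<task_id_of_the_original_run>")

# Clone it: the clone carries the same environment definition and arguments
cloned = Task.clone(source_task=template, name="cloned run")

# Optionally override an argument on the clone before launching it
cloned.set_parameter("Args/batch_size", 64)

# Enqueue it; an agent listening on that queue recreates the environment and runs the code
Task.enqueue(task=cloned, queue_name="default")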
Many thanks LazyLeopard18! 🙂
Well, in that case, just changing the order should solve it (I'll make sure we have that as the default):
conda_channels: ["pytorch", "conda-forge", "defaults", ]
It should solve the issue 🙂
Hi SourOx12
How do you set the iteration when you continue the experiment? Is it with Task.init's continue_last_task argument?
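A minimal sketch of that, assuming continue_last_task is the Task.init argument in question (project/task names are illustrative):
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="resumable-training",
    continue_last_task=True,  # keep logging into the previous task instead of creating a new one
)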
I couldn't change the task status from draft to complete
Task.completed(ignore_errors=True)
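For context, a minimal sketch of how that call might be used, assuming you have the draft task's ID (the ID string is a placeholder):
from clearml import Task

task = Task.get_task(task_id="<draft_task_id>")
task.completed(ignore_errors=True)  # mark the task as completed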
Hi ShallowArcticwolf27
Does the clearml-task CLI command currently support remote repositories that are intended to be used with SSH?
It does 🙂
but with the git@ prefix used for GitLab's SSH, it seems to default to looking for the repository locally
git@ is always the prefix for SSH repositories (it does not actually mean it uses it; it's what git will return when asked for the origin of the repository). The agent knows (if SSH credentials ...
Hmm, maybe this is the issue:
Conda error: UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch~=1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
This makes no sense: conda is saying pytorch 1.8 needs cudatoolkit <10.2 or <10.3, but actually it needs cudatoolkit 11.1.
Has anyone done this exact use case - updates to datasets triggering pipelines?
Hi TrickySheep9, seems like this is following a different thread, am I missing something?
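For reference, one possible way to wire up dataset-update triggers, as a sketch only: it assumes the TriggerScheduler helper in clearml.automation, and the task ID, queue, and project names are placeholders:
from clearml.automation import TriggerScheduler

scheduler = TriggerScheduler(pooling_frequency_minutes=3)
scheduler.add_dataset_trigger(
    name="retrain-on-new-data",
    schedule_task_id="<pipeline_controller_task_id>",  # task to clone and launch on trigger
    schedule_queue="services",
    trigger_project="datasets/my-project",  # fire when a new dataset version appears here
)
scheduler.start()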
I can definitely see your point from the "DevOps" perspective, but from the user perspective it puts the "liability" on me to "optimize" the resource, which to me sounds a bit much to put on my tiny shoulders; I just have general knowledge of what I need. For example, lots of CPUs (because I know my process scales well with more CPUs), or large memory (because I have an entire dataset in memory). Personally (and really only my personal perspective), I'd rather have the option to select from a...
Yes, but I'm not sure that they need to have separate tasks
Hmm okay I need to check if this can be easily done
(BTW, the downside of that is you can only cache a component, not a sub-component)
that does make more sense 🙂
Hi EnviousStarfish54
Verified with the frontend / backend guys.
The backend allows searching for "all" tags, and the frontend will add a toggle button to the UI to choose or/all for the selected tags.
Should be part of the next release
Hi RoughTiger69
One quirk I found was that even with this flag on, the agent decides to install whatever is in the requirements.txt
What's the clearml-agent version you are using?
I just noticed that even when I clear the list of installed packages in the UI, upon startup, clearml agent still picks up the requirements.txt (after checking out the code) and tries to install it.
It can also just skip the entire Python installation with: CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
So I shouldn't even need to call the task.set_initial_iteration function
I think just removing this call should solve it; what's going on is that it is called twice (once internally, once manually by your code)
Yep, and this is the root cause of the issue (but easily fixable) 🙂
Also, don't be shy, we love questions 🙂
Hi @<1687653458951278592:profile|StrangeStork48>
I have good news, v1.0 is out with hashed passwords support.
No worries, let's assume we have:
base_params = dict(
    field1=dict(param1=123, param2='text'),
    field2=dict(param1=123, param2='text'),
    ...
)
Now let's just connect field1:
task.connect(base_params['field1'], name='field1')
That's it 🙂
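Putting it together as a runnable sketch (project/task names are illustrative):
from clearml import Task

task = Task.init(project_name="examples", task_name="nested-params")

base_params = dict(
    field1=dict(param1=123, param2='text'),
    field2=dict(param1=123, param2='text'),
)

# Connect each sub-dict under its own section name; when executed remotely,
# the values are overridden by whatever was edited in the UI
task.connect(base_params['field1'], name='field1')
task.connect(base_params['field2'], name='field2')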
I am trying to see if the user can submit a list of resource requirements (e.g 4GPUs, 12 cores, 100GB diskspace) for the task when queuing the task and the agents pick these tasks if they have the requested resources. With this, the user need not think about which queue to send the task to. The users just state what they need and the agents do the scheduling for them.
Can I assume we are talking Kubernetes under the hood for the resource allocation ?
GreasyLeopard35 from the implementation:
https://github.com/allegroai/clearml/blob/fcad50b6266f445424a1f1fb361f5a4bc5c7f6a3/clearml/automation/parameters.py#L215
Which basically returns "self.base" (default 10) raised to the power of the selected value: 10**-3 = 0.001
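A minimal sketch of that mapping, assuming the default base of 10:
base = 10
selected_value = -3
print(base ** selected_value)  # 0.001 -- the result of base**value is always positive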
So how would I get a negative value ?
Hi CooperativeFly2
is it possible to create multiple trains-agent instances per GPU
Yes you can; that said, memory cannot actually be shared between GPU processes (GPU time is obviously shared), so you have to be careful with the Tasks actually being executed in parallel.
For instance:
TRAINS_WORKER_NAME=host_a trains-agent daemon --gpus 0 --queue default
TRAINS_WORKER_NAME=host_b trains-agent daemon --gpus 0 --queue default
Python 3.8? I can quickly check, give me a minute
AntsyElk37
and when I try to use --output-uri I can't pass true, because obviously I can't pass a boolean, only strings
Hmm, that sounds right. I think we should fix that so when using --output-uri true the value that is passed is actually True, not the string "true".
Regarding the issue itself: are you saying --skip-task-init is being ignored, and it always adds the Task.init call? You can also pass --output-uri https://files.clear.ml (which is the same as True), ...
Hi UnsightlyLion90
from my understanding the agent does the job of SLURM,
That is kind of correct (they overlap in some ways 🙂)
Any guide on how to integrate both of them?
The easiest way is to just add the Task.init() call to your code, and use SLURM to schedule the job. This will make sure all jobs are fully logged (this also includes automatically uploading the models, artifact support, etc.)
Full SLURM support (i.e. similar to the k8s glue support), is currently ou...
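A minimal sketch of that setup: the script SLURM schedules just adds the Task.init call (project/task names are illustrative):
from clearml import Task

task = Task.init(project_name="slurm-experiments", task_name="training-run")

# ...the rest of the training script runs unchanged; console output, metrics,
# and (optionally) models/artifacts are reported to the ClearML server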
Hi WittyOwl57
I think what happens is it auto-logs the joblib load/save calls; these calls track models used/created by the code and attach them to the model repository entries representing these models.
I'm assuming there are multiple load/save calls, and there are multiple model instances pointing to the same local file "file:///tmp/...". The warning basically says it is re-registering existing models.
Make sense ?
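A minimal sketch of the pattern described above (the file path, data, and model are illustrative):
import joblib
from sklearn.linear_model import LogisticRegression
from clearml import Task

task = Task.init(project_name="examples", task_name="joblib-autolog")

model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Each joblib dump/load of the same local file is auto-tracked; saving repeatedly
# to the same path is what produces the "re-registering existing model" warning
joblib.dump(model, "/tmp/model.pkl")
restored = joblib.load("/tmp/model.pkl")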
OK - the issue was the firewall rules that we had.
Nice!
But now there is an issue with the "Setting up connection to remote session" step
OutrageousSheep60 this is just a warning; basically it says we are using the default signed SSH server key (it has nothing to do with the random password, just the identifying key being used for the remote SSH session)
Bottom line, I think you have everything working 🙂