Thanks TroubledJellyfish71, I managed to locate the bug (and indeed it's the new aarch64 package support)
I'll make sure we push an RC in the next few days; until then, as a workaround, you can use the full (http) link to the torch wheel
BTW: 1.11 is the first version to support aarch64, if you request a lower torch version, you will not encounter the bug
It's dead simple to install:
pip install trains-agent
then you can simply do:
trains-agent execute --id myexperimentid
Hi CharmingPuppy6
Basically yes there is.
The way ClearML is designed is to have queues abstract different types of resources, for example a queue for single-GPU jobs (let's name it "single_gpu") and a queue for dual-GPU jobs (let's name it "dual_gpu").
Then you spin agents on machines and have the agents pull jobs from specific queues based on the hardware they have. For example, we can have a 4-GPU machine with 3 agents, one agent connected to 2x GPUs and pulling Tasks from the "dual_gpu...
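The routing idea above can be sketched as a toy model (this is not the ClearML API, just an illustration of queues abstracting resource types, with each agent pulling only from the queue that matches its hardware):

```python
from collections import deque

# Toy sketch: named queues abstract resource types.
queues = {"single_gpu": deque(), "dual_gpu": deque()}

def enqueue(queue_name, task):
    """A user enqueues a Task by resource need, not by machine."""
    queues[queue_name].append(task)

def agent_pull(queue_name):
    """An agent bound to `queue_name` pulls the next Task, if any."""
    q = queues[queue_name]
    return q.popleft() if q else None

enqueue("dual_gpu", "train_resnet")   # needs 2 GPUs
enqueue("single_gpu", "eval_model")   # needs 1 GPU

# The dual-GPU agent only ever sees dual-GPU work:
print(agent_pull("dual_gpu"))    # → train_resnet
print(agent_pull("single_gpu"))  # → eval_model
```

The task names here are made up; the point is only that agents never see work their hardware cannot serve.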
I assume so 🙂 Datasets are kind of agnostic to the data itself; for the Dataset it's basically a file hierarchy
The idea of queues is, on the one hand, not to give users too much freedom, and on the other, to allow for maximum flexibility & control.
The granularity offered by K8s (and as you specified) is sometimes way too detailed for a user. For example, I know I want 4 GPUs, but 100GB disk-space? No idea, just give me 3 levels to choose from (if any; actually I would prefer a default that is large enough, since this is by definition for temp cache only), and the same argument goes for the number of CPUs.
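The "3 levels" idea could look something like the sketch below. All names and numbers here are made up for illustration; nothing in this snippet is a ClearML or K8s API:

```python
# Toy sketch: preset resource tiers instead of full K8s-style granularity.
TIERS = {
    "small":  {"gpus": 1, "cpus": 4,  "disk_gb": 50},
    "medium": {"gpus": 2, "cpus": 8,  "disk_gb": 100},
    "large":  {"gpus": 4, "cpus": 16, "disk_gb": 200},
}

def resolve(tier="medium"):
    """A user picks a coarse tier; the default should be 'large enough',
    since disk here is by definition temp cache only."""
    return TIERS[tier]

print(resolve("large"))  # → {'gpus': 4, 'cpus': 16, 'disk_gb': 200}
```

The design point is that the user expresses intent ("large"), and the ops side owns the mapping to concrete resources.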
Ch...
I can definitely see your point from the "DevOps" perspective, but from the user perspective it puts the "liability" on me to "optimize" the resource, which to me sounds a bit much to put on my tiny shoulders; I just have general knowledge of what I need. For example, lots of CPUs (because I know my process scales well with more CPUs), or large memory (because I have an entire dataset in memory). Personally (and really only my personal perspective), I'd rather have the option to select from a...
Just making sure, pip package installed on your Conda env, correct?
Hi GreasyPenguin14
However the cleanup service is also running in a docker container. How is it possible that the cleanup service has access and can remove these model checkpoints?
The easiest solution is to launch the cleanup script with a mount point from the storage directory to inside the container (-v <host_folder>:<container_folder>)
The other option, which clearml version 1.0 and above supports, is using Task.delete, which now supports deleting the artifacts and mod...
HugePelican43 sure you can; usually the limiting factor is memory, as it cannot be shared among processes, so if one allocates all the memory the second process will crash with an out-of-memory error
Hi MagnificentSeaurchin79
Could you test with the tensorflow toy example?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorboard_toy.py
MagnificentSeaurchin79 making sure the basics work.
Can you see the 3D plots under the Plot section ?
Regarding the Tensors, could you provide a toy example for us to test?
No, by definition the agent will only execute one Task at a time; you can spin a second agent on the same GPU :)
Hi PunyGoose16 ,
I think the website is probably the easiest 🙂
https://clear.ml/contact-us/
I think they get back to you quite quickly
Hi DeliciousBluewhale87
I think we had a docker that does exactly that, and then you would spin the docker as a k8s service , is this what you are referring to?
Hi UnsightlySeagull42
does anyone know how this works with git ssh credentials?
These will be taken from the host ~/.ssh folder
Seems that the API has changed quite a bit since a few versions back.
Correct, notice that your old pipeline Tasks use the older package and will still work.
There seems to be no need in controller_task anymore, right?
Correct, you can just call pipeline.start()
🙂
The pipeline creates the tasks, but never executes/enqueues them (they are all in Draft mode). No DAG graph appears in the RESULTS/PLOTS tab.
Which vers...
Sorry, what I meant is that it is not documented anywhere that the agent should run in docker mode, hence my confusion
This is a good point! I'll make sure we stress it (BTW: it will work with elevated credentials, but probably not recommended)
LOL, thanks!
So would this pseudo code solve the issue?
def pipeline_creator():
    pipeline_a_id = os.system("python3 create_pipeline_a.py")
    print(f"pipeline_a_id={pipeline_a_id}")
something like that?
(obviously the question is how you would get the return value of the new pipeline ID, but I'm getting ahead of myself)
Hi @<1523701079223570432:profile|ReassuredOwl55> let me try to add some color here:
Basically we have two parts: (1) the pipeline logic, i.e. the code that drives the DAG, and (2) the pipeline components, e.g. model verification
The pipeline logic (1), i.e. the code that creates the DAG, the tasks, and enqueues them, will be running in the git actions context, i.e. this is the automation code. The pipeline components themselves (2), e.g. model verification, training, etc., are running using the clearml agents...
It's only on this specific local machine that we're facing this truncated download.
Yes, that's what the log says; makes sense
Seems like this still doesn't solve the problem, how can we verify this setting has been applied correctly?
Hmm, exec into the container? What did you put in clearml.conf?
trains-agent should be deployed to GPU instances, not the trains-server.
The trains-agent purpose is for you to be able to send jobs to a GPU (at least in most cases) instance.
The "trains-server" is a control plane, basically telling the agent what to run (by storing the execution queues and tasks). Make sense?
Or can I enable agent in this kind of local mode?
You just built a local agent
AdventurousRabbit79 you are correct, caching was introduced in v1.0; also notice the default is no caching, you have to specify that you want caching per step.
Hi ContemplativeGoat37
is it a good idea to use ClearML Agent Services for such things?
Yes! It is exactly the kind of thing it was designed to do 🙂
but when I run the same task again it does not map the keys... (edited)
SparklingElephant70 what do you mean by "map the keys" ?
Hi UnsightlySeagull42
But now I need the hyperparameters in every python file.
You can always get the Task from anywhere:
main_task = Task.current_task()
SubstantialElk6 this is odd, how are they passed ? what's the exact setup ?