Then we can figure out what can be changed so CML correctly registers process failures with Hydra
JumpyPig73 quick question: does the state of the Task change immediately when it crashes? Are you running it with an agent (that Hydra triggers)?
If this is vanilla clearml with Hydra runners, what I suspect happens is that Hydra is overriding the signal callback clearml adds (like Hydra, clearml needs to figure out if the process crashed), so clearml's callback is never cal...
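To illustrate the suspicion above, here is a minimal sketch in plain Python (no Hydra or clearml code involved). The handler names `tracker_handler` and `launcher_handler` are hypothetical stand-ins, and this is only an assumption about the mechanism, not how either library actually registers its callbacks. It shows that a second `signal.signal()` call silently replaces the first handler unless the new handler chains to the previous one.

```python
import signal

calls = []  # records which handlers actually ran


def tracker_handler(signum, frame):
    # Hypothetical stand-in for the callback an experiment tracker registers.
    calls.append("tracker")


def launcher_handler(signum, frame):
    # Hypothetical stand-in for the callback a launcher registers later.
    calls.append("launcher")


def install_overriding(signum, handler):
    # Replaces whatever was registered before; the old handler is lost.
    signal.signal(signum, handler)


def install_chained(signum, handler):
    # Registers `handler` but still calls the previously registered callback.
    previous = signal.getsignal(signum)

    def chained(sig, frame):
        handler(sig, frame)
        if callable(previous):
            previous(sig, frame)

    signal.signal(signum, chained)
```

With `install_overriding` only the last handler runs when the signal arrives; with `install_chained` both do, which is the behaviour a crash-detecting callback would need.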
Just to make sure, the first two steps are working ?
Maybe it has to do with the fact the "training" step specifies a docker image, could you try to remove it and check?
BTW: A few pointers
The return_values is used to specify multiple returned objects stored individually, not the type of the object. If there is a single object, no need to specify
The parents argument is optional, the pipeline components optimize execution based on inputs, for example in your code, all pipeline comp...
Ohh, clearml is designed so that you should not worry about that: download_dataset = StorageManager.get_local_copy() is cached, meaning the machine that runs that line the second time will not re-download the file.
This means step 1 is redundant, no?
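A rough sketch of what "cached" means here, using only the standard library. This is not clearml's implementation; `get_cached_copy`, `fake_fetch` and `CACHE_DIR` are hypothetical names made up for the illustration. The point is only that the download happens on the first call and later calls on the same machine reuse the local file.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical local cache folder for this demo.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="demo_cache_"))


def get_cached_copy(remote_url, fetch):
    """Return a local path for remote_url, calling fetch only on a cache miss."""
    key = hashlib.sha256(remote_url.encode("utf-8")).hexdigest()
    local_path = CACHE_DIR / key
    if not local_path.exists():
        # Cache miss: this is the only place where the download happens.
        local_path.write_bytes(fetch(remote_url))
    return local_path


downloads = []  # records every real "download"


def fake_fetch(url):
    # Stand-in for a real download; just records the call.
    downloads.append(url)
    return b"dataset bytes"


first = get_cached_copy("s3://bucket/data.csv", fake_fetch)
second = get_cached_copy("s3://bucket/data.csv", fake_fetch)
```

Both calls return the same local path, and `fake_fetch` is invoked once.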
Usually when data is passed between components it is automatically uploaded as an artifact to the Task (stored on the files server or object storage etc.), then downloaded and passed to the next steps.
How large is the data that you are wo...
Okay, I was able to reproduce it (this is odd) let me check ...
Correct, and that also means the code that runs is not auto-magically logged.
is it normal that it's slower than my device even though the agent is much more powerful than my device? or because it is just a simple code
Could be the agent is not using the GPU for some reason?
Or am I forced to do a get, check if the latest version is finalized,
A Dataset must be finalized before using it. The only situation where it is not is when you are still in the "upload" state.
, then increment the version of that version and create my new version ?
I'm assuming there is a data processing pipeline pushing new data?! How do you know you have new data to push?
Hi NonsensicalSparrow35
however for the remote file it always creates the name with the following pattern:
{filename_prefix}checkpoint{n}.pt
..
Is this the main issue?
Notice that the model name (i.e. the entry on the Task itself) is not directly connected with the stored file name on the target file server (or S3)
Having the ability to pack jobs/tasks onto the same "resource" (underlying server/EC2 instance)
This is essentially a "queue". Basically a queue is a way to abstract a specific type of resource, so that you can achieve exactly what you described.
open up a streaming use case, wherein batch (offline) inference could be done directly inside of a ClearML pipeline in reaction to an event/trigger (like new data landing in your data lake).
Yes, that's exactly how clearml is designed, a...
. I guess this can be built in as a feature into ClearML at some future point.
VexedCat68 you mean referencing an external link?
In your trains.conf, change the value:
files_server: 's3://ip:port/bucket'
now, I need to pass a variable to the Preprocess class
you mean for the construction ?
Hi PompousParrot44
You can check the cleanup service example.
It sleeps for 24 hours then spins up and does its thing.
You can always launch these service tasks on the services queue; its purpose is to run those services on the trains-server as additional CPU services. They will also be registered as service nodes, so you have visibility into which service is running.
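The "sleep, then spin up and do its thing" pattern can be sketched like this. It is a generic illustration and not the code of the cleanup service example; `run_periodic_service`, `max_runs` and the injectable `sleep` argument are assumptions added so the loop can be tried without waiting 24 hours.

```python
import time


def run_periodic_service(job, interval_sec=24 * 60 * 60, max_runs=None, sleep=time.sleep):
    """Sleep for interval_sec, then run job; repeat forever or max_runs times."""
    runs = 0
    while max_runs is None or runs < max_runs:
        sleep(interval_sec)  # idle for the whole interval
        job()                # then wake up and do the work
        runs += 1
    return runs


# Dry run with a fake sleep, so nothing actually waits.
slept = []
cleaned = []
total_runs = run_periodic_service(
    job=lambda: cleaned.append("cleanup"),
    max_runs=2,
    sleep=slept.append,
)
```

In a real service you would leave `max_runs=None` and the default `sleep`, and the process would simply stay alive on the services queue.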
In order to clone a task and wait for its completion, use the TrainsJob https://github.com/allegroai/trains/blob/65a4a...
Oh I see, that kind of makes sense
I think this is the section you should use:
None
But instead of the clearml-services container you should use the regular container (or just have it installed as part of the entry-point on any ubuntu based container)
Notice the important parts here are:
[None](https://github.com/allegroai/clearml-server/blob/6a1fc04d1e8b112fb334c8743d...
in order to work with ssh cloning, one has to manually install openssh-client to the docker image, looks like that
Correct, you have to have SSH inside the container so that git can use it.
You can always install with the following setup inside your agent's clearml.conf:
extra_docker_shell_script: ["apt-get install -y openssh-client", ]
https://github.com/allegroai/clearml-agent/blob/73625bf00fc7b4506554c1df9abd393b49b2a8ed/docs/clearml.conf#L145
If the same Task is run with different parameters...
ShinyWhale52 sorry, I kind of missed that in the explanation
The pipeline will always* create a new copy (clone) of the original Task (step), then modify the step's inputs etc.
The idea is that you have the experiment management (read execution management) to create full transparency into the pipelines and steps. Think of it as the missing part in a lot of pipelines platforms where after you executed the pipeline you need to furthe...
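As a toy model of "clone the original Task, then modify the clone's inputs", here is a sketch with plain dictionaries. It does not use the clearml API; `launch_step` and the dictionary layout are hypothetical and only meant to show that the template stays untouched while every run gets its own copy.

```python
import copy


def launch_step(template_task, overrides):
    """Clone the template and apply the step inputs to the clone only."""
    step = copy.deepcopy(template_task)
    step["parameters"].update(overrides)
    step["cloned_from"] = template_task["id"]
    return step


template = {"id": "base-task", "parameters": {"lr": 0.1, "epochs": 5}}

# The same template run twice with different parameters gives two separate copies.
run_a = launch_step(template, {"lr": 0.01})
run_b = launch_step(template, {"epochs": 20})
```

Each run can then be inspected on its own, and the original template still holds its default parameters.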
# somehow set docker_args and docker_bash_setup_script equivalent??
task.set_base_docker(...)
# somehow setup repo and branch to download to remote instance before running
This is automatically detected based on your local commit/branch as well as uncommitted changes
Also, I just wanted to say thanks for the tool! I'm managing a small data science practice and it's going to be really nice to have a view of all of the experiments we've got and know our GPU utilization, all without having to give every data scientist access to each box where the workflows are run. Incredibly stoked.
♥ ❤ ♥
` param = {'arg': value}
task.connect(param, section='new section')
# create pipeline here
pipeline `
It seems like there is no way to define that a Task requires docker support from an agent, right?
Correct, basically the idea is you either have workers working in venv mode or docker.
If you have a mixture of the two, then you can have the venv agents pulling from one queue (say default_venv) and the docker mode agents pulling from a different queue (say default_docker). This way you always know what you are getting when you enqueue your Task
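A small simulation of the two-queue setup, with in-memory queues instead of a real server. The queue names come from the message above; `enqueue` and `agent_pull` are hypothetical helpers and not clearml or clearml-agent calls.

```python
from collections import deque

# One queue per execution mode, as suggested above.
queues = {"default_venv": deque(), "default_docker": deque()}


def enqueue(task_name, queue_name):
    # Choosing the queue is how you choose the execution mode.
    queues[queue_name].append(task_name)


def agent_pull(mode, queue_name):
    """An agent running in a fixed mode takes the next task from its own queue."""
    if not queues[queue_name]:
        return None
    return (queues[queue_name].popleft(), mode)


enqueue("train_model", "default_docker")
enqueue("plot_results", "default_venv")

docker_job = agent_pull("docker", "default_docker")
venv_job = agent_pull("venv", "default_venv")
```

Since each agent only listens to its own queue, a Task that needs docker never lands on a venv worker.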
Hi MuddySquid7 issue is verified, v1.1.1 will be released in a few hours with a fix.
Thank you for noticing!
I was thinking mainly about AWS.
Meaning S3?
Hmm could it be this is on the "helper functions" ?
JitteryCoyote63 This seems like exactly what you are saying, elastic license issue...
make sure you follow all the steps :
https://clear.ml/docs/latest/docs/deploying_clearml/upgrade_server_linux_mac
(basically make sure you get the latest docker-compose.yml and then pull it):
curl -o /opt/clearml/docker-compose.yml
docker-compose -f /opt/clearml/docker-compose.yml pull
docker-compose -f /opt/clearml/docker-compose.yml up -d
BTW: the agent will resolve pytorch based on the installed CUDA version.
or at least stick to the requirements.txt file rather than the actual environment
You can also force it to log the requirements.txt with:
Task.force_requirements_env_freeze(requirements_file="requirements.txt")
task = Task.init(...)
Ohh I see, could you copy paste what you put there (instead of the secret and key *** will do 🙂 )