Hi WackyRabbit7
I have a pipeline controller task, which launches 30 tasks. Semantically there are 10 applications, and I run 3 tasks for each (those 3 are sequential, so in the UI it looks like 10 lines of 3 tasks).
In one of those 3 tasks that run for every app, I save a dataframe under the name "my_dataframe".
I'm assuming you store it as an artifact:
What I want to achieve is once all tasks are over, to collect all those "my_dataframe" artifacts (10 in number), extract a sin...
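A minimal sketch of one way to do this, assuming the tasks live under one project (the project name here is hypothetical) and the dataframe was stored with upload_artifact:
from clearml import Task

# fetch all tasks in the pipeline's project, then pull the "my_dataframe"
# artifact from every task that has one
tasks = Task.get_tasks(project_name="my-pipeline-project")
frames = [
    t.artifacts["my_dataframe"].get()
    for t in tasks
    if "my_dataframe" in t.artifacts
]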
GrotesqueDog77 when you say "the second issue", do you mean the fact that both step 1 and step 2 should have access to the same filesystem?
clearml sdk (i.e. python client)
The issue is that Task.create did not add the repo link (again, as mentioned above, you need to pass the local folder or repo link to the repo argument of the Task.create function). I "think" it could automatically deduce the repo from the script entry point, but I'm not sure. Hence my question on the clearml package version
Try removing this magic environment that tells the sub-process there was already an Initialized Task.
import os
env = dict(**os.environ)
env.pop('TRAINS_PROC_MASTER_ID', None)
I'm just trying to see what is the default server that is set, and is it responsive
I'm assuming you mean your own server, not the demo server, is that correct ?
and then second part is to check if it is up and alive
Yes, you can curl to the ping endpoint :
https://clear.ml/docs/latest/docs/references/api/debug#post-debugping
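For example, assuming a self-hosted server with the API service on the default port 8008:
curl -X POST http://localhost:8008/debug.ping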
- look at immediate parents for identically-named files
....
UnevenDolphin73 are you saying this will be your way to log the diff between two versions (for increased visibility) ?
If so, how would you visualize it ?
(I really like this idea of visualizing the changeset, trying to think if there is "smart" way to create a callback to make the approach kind of best-practice) wdyt?
Let me try to add some color to this package analysis process.
Basically clearml will try to statically analyze the code (i.e. look for import/from packages)
Then it will list them in a pip requirements.txt format under installed packages.
When running inside conda environment, it will check which packages were installed via "conda install" (instead of pip install) and mark them internally. This process ensures that when the clearml-agent is running with conda package manager, it "knows" whic...
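For illustration, a minimal sketch (the project/task names are hypothetical): any package imported anywhere in the script is picked up by the static analysis, and Task.add_requirements can force-add one it might miss:
from clearml import Task

# optionally force a requirement the static analysis could miss
# (must be called before Task.init)
Task.add_requirements("pandas")

task = Task.init(project_name="demo", task_name="requirements-demo")

import pandas as pd  # imports like this are listed under "Installed Packages"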
what does it mean to run the steps locally?
start_locally: means the pipeline code itself (the logic that runs / controls the DAG) runs on the local machine (i.e. no agent), but this control logic creates/clones Tasks and enqueues them; for those Tasks you need an agent to execute them
run_pipeline_steps_locally=True: means the Tasks the pipeline creates, instead of being enqueued and run by an agent, will be launched on the same local machine (think debugging, other...
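A minimal sketch contrasting the two (the names are hypothetical):
from clearml import PipelineController

pipe = PipelineController(name="demo-pipeline", project="demo", version="1.0.0")
pipe.add_step(name="step1", base_task_project="demo", base_task_name="step 1 task")

# controller logic runs on this machine, steps are enqueued for agents:
pipe.start_locally()
# or run the steps themselves on this machine too (e.g. for debugging):
# pipe.start_locally(run_pipeline_steps_locally=True)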
Command line arguments for the arg parser should be passed via the "Args" section in the Configuration tab.
What is the working directory on the experiment ?
StorageManager is what you need if you want to download/upload files to any server (this is a utility class that takes care of the download/upload and adds caching); StorageHelper is used internally
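For example (the URLs here are hypothetical):
from clearml import StorageManager

# download a remote file, with local caching
local_path = StorageManager.get_local_copy(remote_url="s3://my-bucket/data/file.zip")

# upload a local file to remote storage
StorageManager.upload_file(local_file="./model.bin", remote_url="s3://my-bucket/models/model.bin")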
If you want to rename it (any pipeline), click on "Full details" in the "Run Info" (right-hand side panel); then, in the full details of the Pipeline Task, you will be able to rename the pipeline execution
(Is renaming useful? Should we add a right-click option to rename?)
Maybe something similar to dockers
I like this approach. Maybe we could add --name as well, so it is easier to name them.
trains-agent daemon stop --gpus all
trains-agent daemon stop --cpu-only
trains-agent daemon stop --gpus 0
What do you think?
Also being able to separate their configurations files would be good (maybe there is and I don't know?)
This is already supported: --config-file, see trains-agent --help for details
Hi AmiableFish73
Hi all - is there an easier way to track the set of datasets used by a particular task?
I think the easiest is to give the Dataset an alias, it will automatically appear in the Configuration section:
Dataset.get(..., alias="train dataset")
wdyt?
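A slightly fuller sketch (the project/dataset names are hypothetical):
from clearml import Dataset

# the alias makes the dataset ID appear under the task's Configuration section
ds = Dataset.get(dataset_project="demo", dataset_name="my-dataset", alias="train dataset")
local_folder = ds.get_local_copy()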
BTW: if you make the right column the baseline (i.e. move it to the left), you will get what you probably expected
Exactly, just pointing to the fact that that machine is yours ;)
DS, this way they only need to remember (and I only need to teach them where to find) one id.
Yes, that's the point. This ID is the Model UID (as opposed to the Task ID). The reason I kind of "insist" on it is that the Model ID is built into the system, meaning this is how you register it, as opposed to the Task ID that somehow needs to be hacked/passed externally
TBH the main reason I went with our API is that because of the custom model loading, we need to use the "custom" framew...
Hi OutrageousGrasshopper93
which framework are you using? trains-agent will pull the correct torch based on the cuda version it detects, but no such thing exists for TF. In the default venv mode, trains-agent creates a new venv for the experiment (not conda) and everything is installed there. If you need conda you need to change the package_manager to conda: https://github.com/allegroai/trains-agent/blob/de332b9e6b66a2e7c6736d12614de9870eff48bc/docs/trains.conf#L49 The safest way to control CUDA dri...
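For reference, the relevant excerpt of the conf file would look something like this:
agent {
    package_manager: {
        # switch from the default pip to conda
        type: conda,
    }
}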
Hmm, I still wonder what the "correct" answer is for most people. Is an empty string in argparse redundant anyhow? Will anyone ever use it?
Added -v /home/uname/.ssh:/root/.ssh and it resolved the issue. I assume this is some sort of a bug then?
That is supposed to be automatically mounted. Having SSH_AUTH_SOCK defined means that you have to add a mount for the SSH_AUTH_SOCK socket so that the container can access it.
Try running with SSH_AUTH_SOCK undefined and keep force_git_ssh_protocol (no need to manually add the .ssh mount; it will do that for you)
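Something along these lines (the queue name is hypothetical):
# undefine the forwarded SSH agent socket for this shell
unset SSH_AUTH_SOCK
# with agent.force_git_ssh_protocol: true set in the conf file,
# the agent will mount ~/.ssh into the container for you
clearml-agent daemon --queue default --docker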
EnviousStarfish54 Notice that you can configure it on the agent machine only, so in development you are not "wasting" storage when uploading debug checkpoints/models
Hi @<1523708920831414272:profile|SuperficialDolphin93>
The error seems like nvml fails to initialize inside the container; you can test it with nvidia-smi and check if that works.
Regarding the CUDA version, ClearML Serving inherits from the Triton container, so you could try to build a new one with the latest Triton container (I think 25). The docker compose file is in the clearml-serving git repo. wdyt?
Ok.. so I should generally avoid connecting complex objects? I guess I would create a 'mini dictionary' with a subset of params, and connect this instead.
In theory it should always work, but this specific one fails on a very pythonic paradigm (see below)
from copy import copy
an_object = copy(object)
A good rule of thumb is to connect any object/dict that you want to track or change later
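A minimal sketch (the project/task names are hypothetical):
from clearml import Task

task = Task.init(project_name="demo", task_name="connect-demo")

# connect a small dict with just the parameters you want tracked,
# instead of a complex object; an agent can override these at run time
params = {"learning_rate": 0.001, "batch_size": 32}
params = task.connect(params)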
store_code_diff_from_remote doesn't seem to change anything in regards to this issue
Correct, it is always from remote
I'll be using update_task, that worked just fine, thanks
Sure thing.
ShakyJellyfish91, I took a quick look at the diff between the versions; can you check a non-working version (preferably the latest) and verify the issue for me?
ShakyJellyfish91 what exactly are you passing to Task.create?
Could it be you are only passing script= and leaving repo= None ?
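Something like this should work (the project/repo details here are hypothetical):
from clearml import Task

task = Task.create(
    project_name="demo",
    task_name="remote-run",
    repo="https://github.com/user/repo.git",  # or a local folder path
    branch="main",
    script="train.py",
)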
Thanks StrongHorse8
Where do you think would be a good place to put a more advanced setup? Maybe we should add an entry for DevOps? Wdyt?
You need trains-server support, so if trains v0.15 is working with an older backend it will revert to the "training" type
Hi @<1773158043551272960:profile|PungentRobin32>
1732496915556 lab03:gpuall DEBUG docker: invalid reference format.
So it seems like the docker command is incorrect?! The error you are seeing is the agent failing to spin up the docker container. What's the OS of the host machine?
In a pipeline, can I control the tags of the tasks the pipeline creates?
add_pipeline_tags adds tags from the pipeline to the tasks, I suppose? But I also need to clear existing tags in those created tasks
add_pipeline_tags will add the unique ID of the pipeline execution. If you want to add specific tags you can use task_overrides and provide:
pipe.add_step(..., task_overrides={'tags': ['my', 'tags']})
Additionally, what I found is that the clearml==1.0.5 package is able to find these partial changes; newer versions find nothing at all. Maybe it's because it's always comparing against remote
Hmm it was always from remote...
It is actually doing the following:
git rev-parse --abbrev-ref --symbolic-full-name @{u}
Then, with the branch name output:
git diff --submodule=diff <add_branch_name_here>
If that's the case, check the free space in the monitoring section of the experiment; you will find the free space in GB logged