Hi HungryTurtle13
I'm using Python's joblib library and the Parallel class to run an experiment in multiple parallel threads.
I believe joblib creates subprocesses not threads, but yes you are correct,
Basically once Task.init is called, every forked/spawned process will be automatically logged to the main process Task (you can, and probably should, call either Task.init or Task.current_task() from the forked processes, but this is just a detail)
The mai...
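Something along these lines should work (a minimal sketch, assuming the joblib workers are forked/spawned after Task.init; the function and series names are just placeholders):
```python
from joblib import Parallel, delayed
from clearml import Task

def run_trial(trial_id):
    # inside the forked/spawned worker, re-attach to the main process Task
    task = Task.current_task()
    task.get_logger().report_scalar(
        title="loss", series=f"trial_{trial_id}", value=1.0 / (trial_id + 1), iteration=0
    )
    return trial_id

if __name__ == "__main__":
    task = Task.init(project_name="examples", task_name="joblib parallel experiment")
    results = Parallel(n_jobs=4)(delayed(run_trial)(i) for i in range(8))
```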
JitteryCoyote63 to filter out archived tasks (i.e. exclude them):
Task.get_tasks(project_name="my-project", task_name="my-task", task_filter=dict(system_tags=["-archived"]))
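As a runnable sketch (project/task names are placeholders):
```python
from clearml import Task

# "-archived" in system_tags excludes tasks carrying the "archived" system tag
tasks = Task.get_tasks(
    project_name="my-project",
    task_name="my-task",
    task_filter=dict(system_tags=["-archived"]),
)
for t in tasks:
    print(t.id, t.status)
```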
Also could you explain the difference between trigger.start() and trigger.start_remotely()
Start will start the trigger process (the one "watching the changes") locally (this makes sense for debugging etc.)
start_remotely will launch the trigger process on the "services" queue, where it should live forever 🙂
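Roughly (a minimal sketch, assuming a TriggerScheduler watching dataset tags; the task id, queue names and tag are placeholders):
```python
from clearml.automation import TriggerScheduler

trigger = TriggerScheduler(pooling_frequency_minutes=3)
trigger.add_dataset_trigger(
    schedule_task_id="<template_task_id>",  # task to clone & enqueue when the trigger fires
    schedule_queue="default",
    trigger_project="datasets",
    trigger_on_tags=["ready"],
)

# run the watcher process locally (useful for debugging)
trigger.start()
# ...or launch it on the "services" queue so it keeps running on its own
# trigger.start_remotely(queue="services")
```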
Okay so when I add trigger_on_tags, the repetition issue is resolved.
Nice!
This problem occurs when I'm scheduling a task. Copies of the task keep being put on the queue ...
Great, you can test directly from the master 🙂
pip3 install -U git+
Building the pipeline in runtime from external configuration is very cool!!
I think nested components are exactly the right solution, and it is a great use case.
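For reference, building the DAG at runtime from an external config could look something like this (a rough sketch, assuming the config lists existing template tasks; all names here are placeholders):
```python
from clearml.automation import PipelineController

# hypothetical external configuration describing the steps
config = [
    {"name": "prepare", "task": "prepare data", "parents": []},
    {"name": "train", "task": "train model", "parents": ["prepare"]},
]

pipe = PipelineController(name="dynamic pipeline", project="examples", version="1.0")
for step in config:
    pipe.add_step(
        name=step["name"],
        base_task_project="examples",
        base_task_name=step["task"],
        parents=step["parents"],
    )
pipe.start(queue="services")
```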
Hi EnviousStarfish54
Color coding on the entire UI is stored per user (I think in your local cookies, but I might be wrong). Anyhow, any title/series combination will have the selected color regardless of the project.
This way you can configure once that loss is red and accuracy is green, etc.
VexedCat68
But what's happening is, that I only publish a dataset once but every time it polls,
this seems wrong (i.e. a bug?!). How do you set up the trigger? Is the Trigger Task constantly running, or are you re-launching it?
To store all the debug samples; it can also store all the models (if you configure output_uri='http://file_server_here:8081').
Yes: instead of the file server, use 's3://<ip_of_minio>:9000/bucket', and make sure you add the credentials for the MinIO in the trains.conf.
Yes, basically once you have the credentials in the trains.conf, you can do StorageManager.get_local_copy('s3://<minio>:9000/bucket/file') (and upload as well, of course 🙂)
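As a quick sketch (the MinIO address and bucket are placeholders; the S3 credentials go under the aws.s3 section of the conf file):
```python
from clearml import Task, StorageManager

# send uploaded models/artifacts to the MinIO bucket instead of the file server
task = Task.init(
    project_name="examples",
    task_name="minio output",
    output_uri="s3://<ip_of_minio>:9000/bucket",
)

# with the credentials configured, files can be fetched (or uploaded) directly
local_copy = StorageManager.get_local_copy("s3://<ip_of_minio>:9000/bucket/file")
```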
we have a separate cache
Why? they can share
Is this per Task or for all the Tasks always ?
Hi AbruptHedgehog21
can you send the two models info page (i.e. the original and the updated one) ?
do you see the two endpoints ?
BTW: --version would add a version to the model (i.e. create a new endpoint with version "endpoint/{version}")
it does appear on the task in the UI, just somehow not repopulated in the remote run if it’s not a part of the default empty dict…
Hmm that is the odd thing... what's the missing field ? Could it be that it is failing to Cast to a specific type because the default value is missing?
(also, is the issue present in the latest clearml RC? It seems like a task.connect issue)
maybe I should use explicit reporting instead of Tensorboard
It will do just the same 😞
there is no method for setting "last iteration", which is used for reporting when continuing the same task. Maybe I could somehow change this value for the task?
Let me double check that...
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...
That is a very good point
but for the metrics, I explicitly pass th...
FiercePenguin76 the git repo should detect only clearml as a required python package
Basically the steps are:
decide if the initial python entry script is a standalone script (i.e. no local imports) in the git repo (in your example "task_with_deps.py")
If this is a "standalone script", only look for imports inside the calling python script, and list those packages under "installed packages"
If this is NOT a standalone script, go over ALL the python files inside the repository, look for "i...
Hi HollowFish37
I think I have good news for you, the clearml-agent is only communicating with the api endpoint, so as long as this is secure, you should be fine. Do notice that the default files server endpoint should be secure as well, as by default it will allow any upload/download
Try removing this magic environment variable that tells the sub-process there was already an initialized Task.
import os
env = dict(**os.environ)
env.pop('TRAINS_PROC_MASTER_ID', None)
🙂
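A minimal sketch of how that stripped environment might be handed to the child process (child_script.py is just a placeholder):
```python
import os
import subprocess

# drop the marker so the child starts its own Task instead of attaching to the parent's
env = dict(**os.environ)
env.pop('TRAINS_PROC_MASTER_ID', None)

subprocess.Popen(["python", "child_script.py"], env=env)
```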
Hi ShakyJellyfish91
Check mount default here:
https://github.com/allegroai/clearml-agent/blob/e416ab526ba9fe05daa977b34c9e46b50fb214a0/docs/clearml.conf#L186
Is this what you are after, or do you actually want to change the start up script?
Hi PompousBeetle71 , Trains will log all the torch.save calls; I'm assuming they do not actually use it for the rest of the files in that folder.
If you'd like to share a code snippet, we could see if we could auto-magically log it. You could use artifacts and store the entire folder, it will zip it and upload it. Then you can reuse it from other experiments. https://allegro.ai/docs/task.html?highlight=artifact#trains.task.Task.upload_artifact
Example:
task.upload_artifact('transformer', './my_...
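A fuller sketch along those lines (folder path and names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="store checkpoints folder")

# zips the whole folder and uploads it as a single artifact
task.upload_artifact(name="transformer", artifact_object="./my_checkpoints_folder")

# later, from another experiment, pull the artifact back down
source = Task.get_task(project_name="examples", task_name="store checkpoints folder")
local_folder = source.artifacts["transformer"].get_local_copy()
```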
Could that be the proper way to install ?
https://github.com/facebookresearch/pytorch3d/blob/main/INSTALL.md#3-install-wheels-for-linux
DeliciousBluewhale87 out of curiosity , what do you mean by "deployment functionality" ? is it model serving ?
ClearML automatically gets these reported metrics from TB; since you mentioned seeing the scalars, I assume huggingface reports to TB. Could you verify? Is there a quick code sample to reproduce?
Maybe there is a setting in docker to move the space used to a different location?
Not that I know of...
I can simply increase the storage of the first disk, no problem with that
probably the easiest 🙂
But as you described, it looks like an edge case, so I don't mind 🙂
Specifically your error seems to be an issue with nvidia Triton container upgrade
And other question is clearml-serving ready for serious use?
Define serious use? KFserving support is in the pipeline, if that helps.
Notice that clearml-serving is basically a control plane for the serving engine; not to neglect its importance, but the heavy lifting is done by Triton 🙂 (or any other backend we will integrate with, maybe Seldon)
Hi VexedCat68
Could it be the python version is not the same? (this is the only reason not to find a specific python package version)
SmarmyDolphin68 sadly if this was not executed with trains (i.e. the offline option of trains), this is not really doable (I mean it is, if you write some code and parse the TB 😉 but let's assume this is way too much work)
A few options:
On the next run, use the clearml OFFLINE option (i.e. in your code call Task.set_offline(), or set the env variable CLEARML_OFFLINE_MODE=1)
You can compress and upload the checkpoint folder manually, by passing the checkpoint folder, see https://github.com...
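A minimal sketch of the offline option (project/task names are placeholders; the session is imported later from a machine with server access):
```python
from clearml import Task

# same effect as setting CLEARML_OFFLINE_MODE=1 before the run
Task.set_offline(offline_mode=True)

task = Task.init(project_name="examples", task_name="offline run")
# ... training / logging as usual, everything is stored locally ...
task.close()

# later, on a machine that can reach the server:
# Task.import_offline_session("/path/to/offline_session.zip")
```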
Hi HealthyStarfish45
You can disable the entire TB logging:
Task.init('examples', 'train', auto_connect_frameworks={'tensorflow': False})
but I was wondering if there's any limitation in creating an image with a non root user to use as the actual worker?
SarcasticSquirrel56 non-root pods (containers) are fully supported,
I would recommend using the latest agent RC (that simplified a few things):
clearml-agent==1.4.0rc3
I see... because the problem would be with permissions when creating artifacts to store in the "/shared" folder
You mean as output target for artifacts ?
especially for datasets (for th...