HealthyStarfish45 We are now working on improving the k8s glue (due to be finished next week); after that we can take a stab at slurm, it should be quite straightforward. Will you be able to help with a bit of testing (setting up a slurm cluster is always a bit of a hassle 🙂)?
Do you have a link on how to set up a task scheduler to run in service mode in k8s?
Basically spin up the agent pod and add an argument to the agent itself (this is the --services-mode flag):
https://clear.ml/docs/latest/docs/clearml_agent#services-mode
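If it helps, a minimal sketch of that (the task id, queue names and schedule below are placeholders) would be something like:
```python
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
scheduler.add_task(
    schedule_task_id="<task-id-to-clone-and-launch>",  # placeholder: the Task to re-launch
    queue="default",                                   # queue the scheduled Task is pushed into
    minute=30,                                         # see the TaskScheduler docs for the exact schedule semantics
)
# push the scheduler itself into the "services" queue, so the services-mode
# agent pod picks it up and keeps it running as a long-lived service
scheduler.start_remotely(queue="services")
```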
Oh I see the pipeline controller itself (not the components) is the one with the repo
To fix that, add the following at the top of the script:
```python
from clearml import Task

Task.force_store_standalone_script()


@PipelineDecorator.pipeline(...)
```
That should do the trick.
I think your "files_server" is misconfigured somewhere, I cannot explain how you ended up with this broken link...
Check the clearml.conf on the machines, or the env vars?
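For example (just a sketch, the URL is a placeholder), you can force the file server via an environment variable before Task.init and verify which value actually wins:
```python
import os

# placeholder URL: point it at your real file server
os.environ["CLEARML_FILES_HOST"] = "http://my-clearml-server:8081"

from clearml import Task

task = Task.init(project_name="debug", task_name="files-server-check")
# anything uploaded from here on should go to the file server set above
task.upload_artifact(name="sanity", artifact_object={"ok": True})
```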
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
Any chance they try to store the TensorBoard logs in this folder? This could lead to "No such file or directory: 'runs'" if one process is deleting it while the other is trying to access it, or similar scenarios.
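If that's the scenario, a simple workaround (just a sketch, assuming a plain PyTorch SummaryWriter) is to give each process its own log directory instead of the shared default "runs" folder:
```python
import os
from torch.utils.tensorboard import SummaryWriter

# each training process writes to its own sub-folder, so no process
# deletes a directory another one is still writing to
log_dir = os.path.join("/tmp/tb_logs", f"worker_{os.getpid()}")
writer = SummaryWriter(log_dir=log_dir)
writer.add_scalar("debug/alive", 1.0, 0)
writer.close()
```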
Hi FriendlyKoala70, Trains will report all the TensorBoard graphs; I'm assuming that's what is creating the epoch_lr graph. On top of that, you can always report manually with the logger (as you pointed out). Does that make sense to you?
Hi @<1523701949617147904:profile|PricklyRaven28>
I'm trying to figure out if I have a way to report pipeline-step artifact paths in the main pipeline task (so I don't need to dig into steps to find the artifacts).
Basically this is the monitor_artifacts argument:
:param monitor_artifacts: Optional, log the step's artifacts on the pipeline ...
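As a sketch (the pipeline, step and artifact names are made up for the example), with a PipelineController it would look something like:
```python
from clearml import PipelineController


def step_one(dataset_url: str):
    # imagine this step uploads an artifact named "processed_data" on its own Task
    return dataset_url


pipe = PipelineController(name="example-pipeline", project="examples", version="1.0.0")
pipe.add_function_step(
    name="step_one",
    function=step_one,
    function_kwargs=dict(dataset_url="s3://bucket/data"),
    # also log the step's "processed_data" artifact on the pipeline Task itself,
    # so you don't have to dig into the step to find it
    monitor_artifacts=["processed_data"],
)
pipe.start_locally(run_pipeline_steps_locally=True)
```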
Hi ShakyJellyfish91
Check the default mounts here:
https://github.com/allegroai/clearml-agent/blob/e416ab526ba9fe05daa977b34c9e46b50fb214a0/docs/clearml.conf#L186
Is this what you are after, or do you actually want to change the startup script?
however, this will also turn off metrics
For the sake of future readers, let me clarify this one: turning it off with auto_connect_frameworks={'pytorch': False} only affects the automatic logging of torch.save/torch.load.
(Side note: the reason is that PyTorch does not have built-in metric reporting, i.e. it is usually done manually, and these days most probably with TensorBoard; for example Lightning / Ignite will use TensorBoard as the default metric reporting.)
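So for example (a minimal sketch, project/task names are placeholders):
```python
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="no-torch-artifact-logging",
    # only the automatic capture of torch.save()/torch.load() models is disabled;
    # scalars reported through TensorBoard are still logged automatically
    auto_connect_frameworks={"pytorch": False},
)
```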
I have an idea, can you try with: task = Task.init(..., reuse_last_task_id=False)
I have a suspicion it starts the Tasks in parallel, and the "reuse_last_task_id" causes them to "reuse the same task locally" which makes them overwrite the configuration of one another.
Hi @<1573119955400921088:profile|CloudyPelican46>
On what machine is it best practice to run the clean up service, local machine or should it be on the clearml server ?
The easiest is to run it on the server machine itself. In practice you can put it anywhere, but most of the time this service is sleeping and not using much RAM, so it kind of makes sense to keep it there.
I have the same offset (that appear after each fail on my scalars).
Hmm, I actually would think this is the "correct" behavior, but I see your point:
Any chance you can open a GH issue?
That makes no sense to me?!
Are you absolutely sure the nntrain is executed on the same queue? (Basically, could it be that the nntraining is executed on a different queue in these two cases?)
HealthyStarfish45
No, it should work 🙂
Hmm there was this one:
https://github.com/allegroai/clearml/commit/f3d42d0a531db13b1bacbf0977de6480fedce7f6
Basically it was always caching steps (hence the skip). You can install from the main branch to verify this is the issue; an RC is due in a few days (it was already supposed to be out but got a bit delayed).
So the only difference is how I log into the machine to start clear-ml
The only difference that I can think of is the OS environment in the two login types:
Can you run export in the two cases and check the diff between them?
odd message though ... it should have said something about boto3
I want to keep the above setup; the remote branch that will track my local will be on fork, so it needs to pull from there. Currently it recognizes origin, so it doesn't work because the agent then can't find the commit.
So you do not want to push the change set?
You can basically add the entire change set (uncommitted changes) since the last pushed commit.
In your clearml.conf, set store_code_diff_from_remote: true
https://github.com/allegroai...
Oh I think that I understand what's going on, @<1523701260895653888:profile|QuaintJellyfish58> let me check how to update the cron scheduler while it is running (I really like this idea, so if this is not already supported I'd like us to add this capability 🙂)
If you use this one for example, will the component have pandas as part of the requirements?
```python
def step_two(...):
    import pandas as pd
    # do stuff
```
If so (and it should), what's the difference? How is "internal.repo" different from pandas?
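i.e. (a sketch, the git URL below is a hypothetical stand-in for your internal repo) you can list it in the component's packages exactly like pandas:
```python
from clearml import PipelineDecorator


@PipelineDecorator.component(
    # placeholder URL for the internal repo, listed right next to pandas
    packages=["pandas", "git+https://github.com/your-org/internal_repo.git"]
)
def step_two(csv_path: str):
    import pandas as pd  # resolved on the agent from the `packages` list above
    return pd.read_csv(csv_path).shape
```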
Hi @<1541954607595393024:profile|BattyCrocodile47>
I do have the SSH key placed at /root/.ssh/id_rsa on the machine,
Notice that the .ssh folder is mounted from the host (EC2 / GCP) into the container,
'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh'
This is odd, why is it mounting it to /.ssh and not /root/.ssh?
Basically the links to the file server are saved in both mongo and elastic, so as long as these are host:ip based, at least in theory it should work
Hi MysteriousBee56, do you have Trains installed from the git repo?
Another question, you mentioned "it breaks my execution", I'm assuming you mean trains-agent?!
If that is the case, there is a fix; try installing trains-agent 0.15.2rc0
OH I see. I think you should use the environment variable to override it:
so add to the docker args something like
-e CLEARML_AGENT__AGENT__PACKAGE_MANAGER__POETRY_INSTALL_EXTRA_ARGS=
Hmm that is odd, it seems to have missed the fact that this is a Jupyter notebook.
What's the clearml version you are using ?
Parent makes sense if you are changing the data of the parent version but some data is preserved, which will make the delta-based storage only store the diff.
If everything is different and you call sync, for example, then it will not reference any previous "snapshot", so there will be no redundancy in storage, but you still get a pointer to the "parent" version.
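As a sketch (dataset names, the parent id and the paths are placeholders):
```python
from clearml import Dataset

# create a new version that points at a parent; unchanged files are de-duplicated,
# so only the delta is actually stored
child = Dataset.create(
    dataset_name="my-data",
    dataset_project="examples",
    parent_datasets=["<parent-dataset-id>"],
)
# sync against a local folder: if every file changed there is nothing shared to
# de-duplicate, but the lineage pointer to the parent version is kept
child.sync_folder(local_path="/path/to/updated/data")
child.upload()
child.finalize()
```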
Make sense?
So the way it works: when you run a component, the return value together with the entire function execution is cached. Basically:
this did NOT add the artifact to the pipeline via caching on subsequent runs
you just need to do:
```python
PipelineDecorator.upload_artifact(name='images', artifact_object=img_dir, wait_on_upload=True)
return Task.current_task().artifacts['images'].url
```
This will return the URL of the uploaded images (i.e. S3 bucket)
which means if this is cached you will get it...
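Put together, a sketch (the component name and the cache flag are just for illustration):
```python
from clearml import Task, PipelineDecorator


@PipelineDecorator.component(cache=True)
def prepare_images(img_dir: str) -> str:
    # upload the folder as an artifact of the step's own Task
    PipelineDecorator.upload_artifact(
        name="images", artifact_object=img_dir, wait_on_upload=True
    )
    # return the remote URL; since the return value is what gets cached,
    # a cached re-run still hands the pipeline a valid pointer to the images
    return Task.current_task().artifacts["images"].url
```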
BTW: if you feel like writing a wrapper it could be cool 🙂
using the docker-compose file for the clearml-serving pipeline, do we also have to mount it somehow?
Oh yes, you are correct, the values are passed using environment variables (easier when using docker compose).
You can in addition add a mount from the host machine to a conf file,
```yaml
volumes:
  - ${PWD}/clearml.conf:/root/clearml.conf
```
wdyt?
VexedCat68 yes 🙂 you can also pass the parent folder and it will zip all the subfolders into a single artifact
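For example (a sketch, the paths and names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="upload-folder-artifact")
# passing a directory: the whole tree, including sub-folders, is zipped
# and uploaded as a single artifact
task.upload_artifact(name="dataset_folder", artifact_object="/path/to/parent_folder")
```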