Check the first steps here:
(Basically you have to generate credentials / configure your machine so it knows where the server is and how to access it)
Make sense ?
SarcasticSparrow10 sure see "execute_remotely" it does exactly that:
It will stop the current process (after syncing everything) and launch itself remotely (i.e. enqueue itself)
When the same code is run by the "trains-agent", the execute_remotely call becomes a no-operation and is basically skipped
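A minimal sketch of that flow, assuming the `clearml` package (the renamed `trains`) is installed and configured. The function name, project/task names and the "default" queue are made up for illustration:

```python
def run_experiment(queue_name="default"):
    # Lazy import: assumes `clearml` (formerly `trains`) is installed and
    # that credentials are already configured on this machine.
    from clearml import Task

    # Project and task names are placeholders.
    task = Task.init(project_name="examples", task_name="execute_remotely sketch")

    # Run locally: the task is synced, enqueued on `queue_name`,
    # and this process exits.
    # Run by the agent: the call is skipped and execution continues below.
    task.execute_remotely(queue_name=queue_name, exit_process=True)

    # Everything from here on only runs on the remote machine.
    print("training on the remote machine")
    return task
```

Calling `run_experiment("default")` from your laptop should enqueue the task, provided an agent is listening on that queue.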
it's saved in a
folder where i started the script instead.
It should be saved there + it should upload it to your file server
Can you send the Task log? (this is odd)
We are always looking for additional talented people 😉 DM me...
Yep, this will work. BTW check the new pipeline, it might have a more flexible solution
Can you copy the "Installed Packages" here, and point to the package causing the issue?
In theory it should not, in practice you could run out of space while running the experiment itself...
You can always clean up everything from time to time (maybe worth a flag?)
Can you reproduce this behavior outside of lightning, or in a toy example? (because I could not)
Hi LudicrousDeer3
It should not be a problem, see the iteration argument in Logger.report_scalar
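A small sketch of passing the iteration explicitly, assuming the question is about controlling the x-axis of the scalar plots. The helper name and the "train" series are made up. `logger` is expected to be what `Task.current_task().get_logger()` returns:

```python
def report_epoch_metrics(logger, epoch, metrics):
    """Report every metric against an explicit iteration (here: the epoch)."""
    for name, value in metrics.items():
        # `iteration` sets the x-axis position of the reported point.
        logger.report_scalar(title=name, series="train", value=value, iteration=epoch)
```

For example, `report_epoch_metrics(logger, epoch=3, metrics={"loss": 0.12})` reports one point at x=3 on the "loss" plot.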
GreasyPenguin14 you mean the artifacts/models ?
Hi ExasperatedCrocodile76
This is quite the hack, but doable 🙂
file_path = task.connect_configuration(name='augmentations', configuration='')
import importlib.util
module_name = 'augmentations'
spec = importlib.util.spec_from_file_location(module_name, file_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
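Here is the same trick as a self-contained sketch using only the standard library. A temporary file stands in for the path returned by `connect_configuration`, and the `flip` function inside it is made up for the demo:

```python
import importlib.util
import os
import tempfile


def load_module_from_path(module_name, file_path):
    """Import a Python source file from an arbitrary path."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    # spec is None when no loader matches the file suffix (e.g. not ".py").
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo: write a tiny module to disk, then load it by path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "augmentations.py")
    with open(path, "w") as f:
        f.write("def flip(x):\n    return x[::-1]\n")
    augmentations = load_module_from_path("augmentations", path)
    flipped = augmentations.flip([1, 2, 3])

print(flipped)
```

Note the `None` check: if the file you get back does not end in `.py`, `spec_from_file_location` cannot pick a loader on its own.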
(sure, we can try, conda is sometimes flaky but is supported)
1. specify conda as the package manager:
2. make sure trains-agent is installed on both nodes
3. assuming you already have an experiment in the system, right click on the experiment and clone it. Then press on the ID button next to the experiment name, and copy the task ID
4. ssh to each node and run:
` trains-agent execute --id <...
Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
Okay a bit of theoretical "how it actually works" (and I might be mistaken here...)
Console logging is being reported because the underlying DDP infra (gloo) is piping stdout to the main process, where clearml will catch it (I think). The scalars not working on the subprocess and the flush wait getting stuck are, I think, related, as the wait actually waits for the flush process, and it seems it cannot actually "talk" to i...
now realise that the ignite events callbacks seem to not be fired
So this is an ignite issue ?
Yes 🙂
BTW: do you guys do remote machine development (i.e. Jupyter / vscode-server) ?
Notice that the StorageManager has default configuration here:
Then a per-bucket credentials list, with details:
One last thing: make sure you spin up the pod container in privileged mode, because the trains-agent docker will spin a sibling docker for your actual experiment.
Hi GracefulDog98
The agent will map the ~/.ssh folder automatically into the docker's /root/.ssh
It will also convert http links to ssh pull if you set force_git_ssh_protocol: true in the agent section of your clearml.conf
I want to keep the above setup, the remote branch that will track my local will be on
 so it needs to pull from there. Currently it recognizes
 so it doesn't work because the agent then can't find the commit.
So you do not want to push the change set ?
You can basically add the entire change set (uncommitted changes) from the last pushed commit.
In your clearml.conf, set store_code_diff_from_remote: true
Yes! Thanks so much for the quick turnaround
My pleasure 🙂
BTW: did you see this (it seems like the same bug?!)
Yes, it recreates the venv (or fetches it from cache). If you need your dataset, use the Dataset class (it will cache it persistently, so no need to re-download)
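A minimal sketch of fetching a dataset through the Dataset class, assuming `clearml` is installed and the dataset already exists on the server. The helper name is made up, and the project and dataset names are whatever you registered:

```python
def fetch_dataset(dataset_project, dataset_name):
    # Lazy import: assumes the `clearml` package is installed and configured.
    from clearml import Dataset

    dataset = Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name)
    # Returns the path of a read-only local copy; later calls on the same
    # machine should reuse the cached copy instead of downloading again.
    return dataset.get_local_copy()
```

Inside the task you would then do something like `data_dir = fetch_dataset("my_project", "my_dataset")` and read the files from `data_dir`.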
Hi SpotlessFish46 ,
Is the artifact already in S3 ?
Is the S3 configured as the default files_server in the trains.conf ?
You can always use the StorageManager upload to wherever and register the url on the artifacts.
You can also programmatically change the artifact destination server to S3, then upload the artifact as usual.
What would be the best match for you?
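A rough sketch of the first option (upload with StorageManager, then register the url on the task), assuming `clearml` is installed and the bucket credentials are already configured. The helper name is made up, and treating a URL string passed to `upload_artifact` as a link rather than a file to upload is my understanding of the SDK, so verify it on your version:

```python
def upload_and_register(task, local_file, remote_url, artifact_name):
    # Lazy import: assumes the `clearml` package is installed and configured.
    from clearml import StorageManager

    # Upload the file directly to the target storage (e.g. an s3:// url).
    uploaded_url = StorageManager.upload_file(local_file=local_file, remote_url=remote_url)
    # Register the resulting url on the task as an artifact.
    task.upload_artifact(name=artifact_name, artifact_object=uploaded_url)
    return uploaded_url
```

Here `task` is the object returned by `Task.init` (or `Task.current_task()`), and `remote_url` is the full destination, file name included.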
Thank you SmallTurkey79 !!!
MysteriousBee56 yes, please change the trains code!!! Wee pee, if you think someone else can benefit, feel free to PR :)
Regarding the double entry, that seems like an odd bug, how can I reproduce it?
And do you need to run your code inside a docker, or is venv enough ?
Could you send the "installed packages" section of the Task that was created in the notebook ?