PompousBeetle71 you can check this example:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_torch_distributed.py
I think it should help; if you want a more manual approach, you can check the Popen subprocess example here:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_subprocess.py
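For a rough idea of the manual approach, a minimal sketch (the worker script name, count, and project/task names are made up for illustration): the parent process creates the Task, then spawns worker subprocesses that inherit its environment.

import subprocess
import sys

from clearml import Task  # "trains" in older versions

# parent process creates the main Task
task = Task.init(project_name="examples", task_name="manual multi-process")

# spawn workers; since they inherit this environment, Task.init inside
# "worker.py" should typically attach to this same Task
workers = [
    subprocess.Popen([sys.executable, "worker.py", "--rank", str(rank)])
    for rank in range(2)
]
for w in workers:
    w.wait()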
ScantMoth28 where are you seeing this warning?
does this work for multiple levels?
Yep 🙂
Hi HighCoyote66,
However, we need to allocate resources to ourselves manually, using an srun command or sbatch
Long story short, there is a full SLURM integration: you push a job into the ClearML queue and it produces a SLURM job that uses the agent to set up the venv/container and run your Task, but this is only part of the enterprise version 🙂
You can however do the following (notice this is ...
Hi PricklyRaven28,
Sorry, we missed that one
we need to invoke it with accelerate launch so we use subprocess.run
So you have two options: either you change the script entry of the Task from your "script.py" to "-m accelerate launch script.py", or you manually do that inside your entry point (i.e. call accelerate launch), for example:
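A minimal sketch of the entry-point approach (the script name and flag below are illustrative placeholders, not from this thread):

import subprocess

# launch the real training script through the accelerate CLI
# ("train.py" and --num_processes are placeholders)
subprocess.run(["accelerate", "launch", "--num_processes", "2", "train.py"], check=True)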
BTW, I "think" we added an "auto detect" for it, so that if you launched it manually this wa...
there is a bug wherein both Task.current_task() and Logger.current_logger() return None.
This is not a bug, it means something broke: the CLEARML_TASK_ID environment variable has to be set inside the agent's process.
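A quick sanity check you can drop into the process (just a sketch; it prints the environment variable and whatever Task is attached):

import os

from clearml import Task

# if CLEARML_TASK_ID is missing, Task.current_task() has nothing to attach to
print("CLEARML_TASK_ID =", os.environ.get("CLEARML_TASK_ID"))
task = Task.current_task()
print("current task:", task.id if task else None)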
How are you running it? (Also share the log 🙂 you can DM it so it is not public here.)
BoredHedgehog47 I tried changing the order of imports in the sample code I shared before; it worked in both cases ...
Hi ThoughtfulKitten41
Is it possible to trigger a pipeline run via API?
Yes! A pipeline is, in the end, a Task; you can take the pipeline ID, clone it, and enqueue it:
from clearml import Task

pipeline_task = Task.clone("pipeline_id_here")
Task.enqueue(pipeline_task, queue_name="services")
You can also monitor the pipeline with the same Task interface, e.g.:
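A hedged sketch of what that monitoring could look like (wait_for_status / get_status are standard Task calls; the ID is a placeholder):

from clearml import Task

pipeline_task = Task.get_task(task_id="pipeline_id_here")
pipeline_task.wait_for_status()    # blocks until the pipeline finishes
print(pipeline_task.get_status())  # e.g. "completed"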
wdyt?
Hi SweetShells3
These environment variables are injected into the new process; are you passing them in the vault?
Is there any way to make that increment from last run?
from clearml import Task

pipeline_task = Task.clone("pipeline_id_here", name="new execution run here")
Task.enqueue(pipeline_task, queue_name="services")
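If you want the name to auto-increment from the last run, a hedged sketch (the project name and naming scheme are made up; Task.get_tasks does support name filtering):

from clearml import Task

# count previous runs with the same base name (naming scheme is illustrative)
previous = Task.get_tasks(project_name="my project", task_name="pipeline run")
new_name = "pipeline run #{}".format(len(previous) + 1)

pipeline_task = Task.clone("pipeline_id_here", name=new_name)
Task.enqueue(pipeline_task, queue_name="services")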
wdyt?
How does this work in the context of a pipeline?
Is your pipeline from functions / decorators? Or is it from Tasks?
(If it is Tasks, then just change the entry point in the overrides.)
In the case of functions or decorators, you have to do that manually (i.e. your function needs to call "accelerate launch"):
from accelerate.commands.launch import launch_command, launch_command_parser

# build the same arguments the "accelerate launch" CLI would parse
parser = launch_command_parser()
args = parser.parse_args("-command -here".split())  # placeholder arguments
launch_command(args)
I still don't get resource logging when I run in an agent.
NuttyLobster9 there should be no difference ... are we still talking about <30 sec? Or a sleep test? (No resource logging at all?)
I have a separate task that is logging metrics with TensorBoard. When running locally, I see the metrics appear in the "Scalars" tab in ClearML, but when running in an agent, nothing. Any suggestions on where to look?
This is odd and somewhat consistent with actu...
The first pipeline step is calling init
GiddyPeacock64 Is this enough to track all the steps?
I guess my main question is every step in the pipeline an actual Task/Job or is it a single small function?
Kubeflow is great for simple DAGs but when you need to build more complex logic it is usually a bit limited
(for example the visibility into what's going on inside each step is missing so you cannot make a decision based on that).
WDYT?
SoreDragonfly16 the torchvision warning has nothing to do with the Trains warning.
The Trains warning means that somehow something changed the state of the Task from running (in_progress) to "stopped" (aborted). Could it be one of the subprocesses raised an exception?
strange ...
create inside another task that would again run remotely
This Task will be run on another node, user / permissions will be dealt with by the agent on the other node running the Task
Hi LazyLeopard18,
See details below; are you using the Win10 docker-compose YAML?
https://github.com/allegroai/trains-server/blob/master/docs/install_win.md
You actually have to login/ssh under said user, have another dedicated mountpoint and spin the agent from that user.
CourageousLizard33 column order / specific selection is stored per user. If you press the share button you will get a link with all the definitions embedded in it.
Column resizing and order is in the next version release :)
Hi! I was wondering why ClearML recognizes Scikit-learn scalers as Input Models...
Hi GiganticTurtle0
any joblib.load/save call is logged by ClearML (it cannot actually differentiate what the file is used for ...)
You can of course disable it with Task.init(..., auto_connect_frameworks={'joblib': False}), for example:
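A minimal sketch of that (project/task names are placeholders):

from clearml import Task

# disable joblib auto-logging so scaler save/load is not registered as a Model
task = Task.init(
    project_name="examples",             # placeholder
    task_name="no joblib auto-logging",  # placeholder
    auto_connect_frameworks={"joblib": False},
)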
Hi PanickyMoth78
My local clearml.conf file has the agent's git_user and git_pass defined as in my
In order for the autoscaler to access your git repository, you have to provide the git user/token in the wizard.
The component agent's log has:
Executing task id [90de043e354b4b28a84d5cc0788fe63c]: repository = branch = version_num =
Hmm, what does the decorator of the component look like? Meaning, did you specify a repo/branch/commi...
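For reference, a hedged sketch of a component decorator that pins the repo explicitly (recent clearml versions accept repo/repo_branch on the component, worth verifying against your version; the URL and branch are placeholders):

from clearml.automation.controller import PipelineDecorator

# explicit repo/branch so the agent knows what to clone for this step
@PipelineDecorator.component(
    repo="https://github.com/user/repo.git",  # placeholder URL
    repo_branch="main",
)
def preprocess(data):
    return data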
so i end up having to clone the other ones manually in my code
Hi ConvolutedChicken69
Yes, the problem is that there is no standard for multi-repo environments.
The best solution I can come up with is using git submodules or packaging the auxiliary repo as wheels. wdyt?
the other repos i have are constantly worked on and changing too
Not only will it be cloned automatically, the git diff of the sub-modules is stored as well 🙂
try Hydra/trainer.params.batch_size (Hydra separates nesting with ".")
TrickySheep9 Yes, let's do that!
How do you PR a change?