agent.package_manager.type = pip
...
Using base prefix '/home/machine1/miniconda3/envs/py36'
New python executable in /home/machine1/.trains/venvs-builds/3.6/bin/python3.6
Also creating executable in /home/machine1/.trains/venvs-builds/3.6/bin/python
Installing setuptools, pip, wheel...
I would let the trains team answer this in detail, but as a user moving from MLflow to trains, I can share the following insights:
MLflow and trains overlap when it comes to having a system with a nice web UI to compare/log experiments/models/metrics. But MLflow lacks a crucial feature IMO, which is ML/DevOps: using MLflow, you have to take care of the whole maintenance of your machines yourself, design the interactions between them, etc. This is where trains shines, it provides these features out-of-t...
You mean you "aborted the task" from the UI?
Yes exactly
I'm assuming from the leftover processes ?
Most likely yes, but I don't see how clearml would have an impact here, I am more inclined to think it would be a pytorch dataloader issue, although I don't see why
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
yes in venv mode, I'll try with the latest version as well
Ok, I got the following error when uploading the table as an artifact: ValueError('Task object can only be updated if created or in_progress')
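(For context, a hypothetical minimal repro of that error - project/task names are made up - assuming the upload happens after the task has already been closed or aborted:)
` import pandas as pd
from clearml import Task

task = Task.init(project_name="demo", task_name="artifact-after-close")  # placeholder names
table = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

task.close()  # or the task was aborted from the UI in the meantime

# Uploading against a task that is no longer in_progress is the kind of state mismatch
# that raises ValueError('Task object can only be updated if created or in_progress')
task.upload_artifact(name="table", artifact_object=table) `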
can it be that the merge op takes so much filesystem cache that the rest of the system becomes unresponsive?
Hi PompousParrot44 , you could have a Controller task running in the services queue that periodically schedules the task you want to run
AgitatedDove14 any chance you found something interesting?
clearml doesn't change the matplotlib backend under the hood, right? Just making sure
AgitatedDove14 I see other RCs on PyPI but no corresponding tags in the clearml-agent repo - are these releases legit?
yes, here is the error (the space at the end of the line is there)
` Applying uncommitted changes
Executing: ('git', 'apply'): b'error: corrupt patch at line 13\n'
Failed applying diff
trains_agent: ERROR: Failed applying git diff:
diff --git a/configs/2.2.2_from_scratch.yaml b/configs/2.2.2_from_scratch.yaml
index 9fece48..5816f78 100644
--- a/configs/2.2.2_from_scratch.yaml
+++ b/configs/2.2.2_from_scratch.yaml
@@ -136,7 +136,7 @@ data_processing:
optimizer:
type: 'RMSprop'
args:
- lr: 2.5e...
Awesome, thanks!
Is there any logic on the server side that could change the iteration number?
No, I want to launch the second step after the first one is finished and all its artifacts are uploaded
alright I am starting to get a better picture of this puzzle
But we can easily extend, right?
I don't think there is an example for this use case in the repo currently, but the code should be fairly simple (below is a rough draft of what it could look like)
` import time

from clearml import Task

# template_task_id and TRIGGER_TASK_INTERVAL_SECS are placeholders for your own values
controller_task = Task.init(...)
controller_task.execute_remotely(queue_name="services", clone=False, exit_process=True)
while True:
    periodic_task = Task.clone(source_task=template_task_id)
    # Change parameters of {periodic_task} here if necessary
    Task.enqueue(periodic_task, queue_name="default")
    time.sleep(TRIGGER_TASK_INTERVAL_SECS) `
Also maybe we are not on the same page - by clean up, I mean kill a detached subprocess on the machine executing the agent
Sorry, I didn't get that
AgitatedDove14 Unfortunately no, I already had the problem before using the function, I added it hoping it would fix the issue but it didn't
Btw I monkey-patched ignite's function global_step_from_engine to print the iteration and passed the modified function to ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine)). It prints the correct iteration number when calling ClearMLLogger.OutputHandler.__call__.
` def __call__(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:
    if not isinstance(logger, ClearMLLogger):
...
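(For reference, a minimal sketch of what such a monkey patch could look like - the names are mine, and it assumes ignite's global_step_from_engine essentially returns engine.state.get_event_attrib_value(event_name):)
` from ignite.engine import Engine, Events

def patched_global_step_from_engine(engine: Engine):
    # Same contract as ignite's global_step_from_engine, but prints the value it
    # returns so we can see which iteration the ClearML logger actually receives.
    def wrapper(_engine: Engine, event_name: Events) -> int:
        global_step = engine.state.get_event_attrib_value(event_name)
        print(f"global_step_transform -> {event_name}: {global_step}")
        return global_step
    return wrapper `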
I ended up dropping omegaconf altogether
Interesting idea! (I assume for reporting only, not configuration)
Yes, for reporting only - also to understand which version the agent uses to pick the torch wheel it downloads
Regarding the CUDA check with nvcc, I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from the nvidia-smi interface, worth checking though ...
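(For what it's worth, a rough sketch of what such a check could look like - this is not the agent's actual code, and note that the CUDA version printed by nvidia-smi is the driver's supported version, not necessarily the installed toolkit:)
` import re
import subprocess

def detect_cuda_version():
    # Prefer the toolkit version reported by nvcc, e.g. "release 11.2, V11.2.152"
    try:
        out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True, check=True).stdout
        match = re.search(r"release (\d+\.\d+)", out)
        if match:
            return match.group(1)
    except (OSError, subprocess.CalledProcessError):
        pass  # nvcc not installed or not on PATH
    # Fall back to the "CUDA Version: X.Y" field in the nvidia-smi header (recent drivers only)
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True).stdout
        match = re.search(r"CUDA Version:\s*(\d+\.\d+)", out)
        if match:
            return match.group(1)
    except (OSError, subprocess.CalledProcessError):
        pass
    return None

print(detect_cuda_version()) `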
Ok, but when nvcc is not ava...
Here is the data disk (/opt/clearml) on the left, and the OS disk on the right
