SuccessfulKoala55 I am looking for ways to free some space and I have the following questions:
Is there a way to break down all the documents to identify the biggest ones? Is there a way to delete several :monitor:gpu and :monitor:machine time series? Is there a way to downsample some time series (e.g. loss)?
Seems like it just went unresponsive at some point
I still don't see why you would change the type of the cloned Task, I'm assuming the original Task had the correct type, no?
Because it is easier for me to create a training task out of the controller task by cloning it (so that the parameters are prefilled and I can set the parent task id)
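Roughly what I do (just a sketch assuming clearml's Task.clone; the task name and the parameter key are made up here):
` from clearml import Task

controller_task = Task.current_task()
# clone the controller so its pre-filled parameters are reused,
# and set the controller as the parent of the new training task
training_task = Task.clone(
    source_task=controller_task,
    name="training step",  # hypothetical name
    parent=controller_task.id,
)
# override only what differs for the training run (hypothetical parameter)
training_task.set_parameter("General/epochs", 20)
Task.enqueue(training_task, queue_name="default") `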
What happens is a different error, but it was so weird that I thought it was related to the installed version
` agent.package_manager.type = pip
...
Using base prefix '/home/machine1/miniconda3/envs/py36'
New python executable in /home/machine1/.trains/venvs-builds/3.6/bin/python3.6
Also creating executable in /home/machine1/.trains/venvs-builds/3.6/bin/python
Installing setuptools, pip, wheel... `
I would let the trains team answer this in detail, but as a user moving from MLflow to trains, I can share the following insights:
MLflow and trains overlap when it comes to having a system with a nice web UI to compare/log experiments/models/metrics. But MLflow lacks a crucial feature IMO, which is ML/DevOps: using MLflow, you will have to take care of the whole maintenance of your machines, design the interactions between them, etc. This is where trains shines: it provides these features out-of-t...
You mean you "aborted the task" from the UI?
Yes exactly
I'm assuming from the leftover processes?
Most likely yes, but I don't see how clearml would have an impact here; I am more inclined to think it would be a pytorch dataloader issue, although I don't see why
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
Yes, in venv mode. I'll try with the latest version as well
Ok, I got the following error when uploading the table as an artifact: ValueError('Task object can only be updated if created or in_progress')
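For context, the upload looked roughly like this on my side (sketch; the dataframe and the artifact name are placeholders):
` import pandas as pd
from clearml import Task

task = Task.current_task()
df = pd.DataFrame({"epoch": [1, 2], "loss": [0.9, 0.7]})  # placeholder table
# this is the call that raised the ValueError once the task was no longer in_progress
task.upload_artifact(name="results_table", artifact_object=df) `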
Can it be that the merge op takes up so much filesystem cache that the rest of the system becomes unresponsive?
Hi PompousParrot44, you could have a Controller task running in the services queue that periodically schedules the task you want to run
AgitatedDove14 any chance you found something interesting?
clearml doesn't change the matplotlib backend under the hood, right? Just making sure
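(For what it's worth, I'm checking it with something like this, before and after Task.init; the project/task names are just placeholders:)
` import matplotlib
print(matplotlib.get_backend())  # backend before clearml is involved

from clearml import Task
task = Task.init(project_name="debug", task_name="backend check")
print(matplotlib.get_backend())  # backend after Task.init `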
@<1523701205467926528:profile|AgitatedDove14> I see other RCs on PyPI but no corresponding tags in the clearml-agent repo. Are these releases legit?
Yes, here is the error (the space at the end of the line is there)
` Applying uncommitted changes
Executing: ('git', 'apply'): b'error: corrupt patch at line 13\n'
Failed applying diff
trains_agent: ERROR: Failed applying git diff:
diff --git a/configs/2.2.2_from_scratch.yaml b/configs/2.2.2_from_scratch.yaml
index 9fece48..5816f78 100644
--- a/configs/2.2.2_from_scratch.yaml
+++ b/configs/2.2.2_from_scratch.yaml
@@ -136,7 +136,7 @@ data_processing:
optimizer:
type: 'RMSprop'
args:
- lr: 2.5e... `
Awesome, thanks!
Is there any logic on the server side that could change the iteration number?
No, I want to launch the second step after the first one is finished and all its artifacts are uploaded
Alright, I am starting to get a better picture of this puzzle
But we can easily extend, right?
I don't think there is an example for this use case in the repo currently, but the code should be fairly simple (below is a rough draft of what it could look like)
` from clearml import Task
import time

TRIGGER_TASK_INTERVAL_SECS = 3600  # how often to schedule the task (example value)
template_task_id = "<ID of the task to clone>"  # placeholder

controller_task = Task.init(...)
controller_task.execute_remotely(queue_name="services", clone=False, exit_process=True)
while True:
    periodic_task = Task.clone(template_task_id)
    # Change parameters of {periodic_task} if necessary
    Task.enqueue(periodic_task, queue_name="default")
    time.sleep(TRIGGER_TASK_INTERVAL_SECS) `
Also maybe we are not on the same page - by clean up, I mean kill a detached subprocess on the machine executing the agent
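To be concrete, by "clean up" I mean something along these lines (just a sketch using psutil, not anything clearml provides; the name filter is an assumption):
` import psutil

def kill_leftover_workers(name_filter="python"):
    # detached workers are typically re-parented to PID 1 once their parent dies
    for proc in psutil.process_iter(["pid", "ppid", "name", "cmdline"]):
        if proc.info["ppid"] == 1 and name_filter in (proc.info["name"] or ""):
            print(f"killing leftover process {proc.info['pid']}")
            proc.kill() `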
Sorry, I didn't get that
AgitatedDove14 Unfortunately no, I already had the problem before using the function; I added it hoping it would fix the issue, but it didn't
btw I monkey patched ignite's function global_step_from_engine to print the iteration and passed the modified function to ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine)). It prints the correct iteration number when calling ClearMLLogger.OutputHandler.__call__.
` def __call__(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:
    if not isinstance(logger, ClearMLLogger):
        ... `
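For reference, the monkey patch itself is roughly this (sketch; patched_global_step_from_engine is just my name for a wrapper around ignite's global_step_from_engine):
` from ignite.handlers import global_step_from_engine

def patched_global_step_from_engine(engine):
    wrapped = global_step_from_engine(engine)

    def global_step_transform(*args, **kwargs):
        step = wrapped(*args, **kwargs)
        print(f"global_step_transform -> {step}")  # debug: print the iteration being reported
        return step

    return global_step_transform `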
I ended up dropping omegaconf altogether
