Also I can simply delete the /elastic_7 folder, I don’t use it anymore (I have a remote ES cluster). In that case, I guess I would have enough space?
On the cloned experiment, which by default is created in draft mode, you can change the commit to point to either a specific commit or the latest commit of the branch.
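For reference, roughly what I mean from the SDK side (a minimal sketch, assuming a recent clearml version where Task.clone, Task.set_script and Task.enqueue are available; the task ID, branch, commit and queue name are placeholders):
```
from clearml import Task

# Clone an existing experiment; the clone is created in draft mode.
template_task_id = "<source-task-id>"  # placeholder
cloned = Task.clone(source_task=template_task_id, name="clone pinned to a commit")

# Point the draft at a specific commit, or drop commit= to use the
# latest commit of the branch.
cloned.set_script(branch="main", commit="<commit-sha>")

# Enqueue the draft so an agent picks it up.
Task.enqueue(cloned, queue_name="default")
```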
Well, no luck - using matplotlib.use('agg') in my training codebase doesn't solve the memory leak
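For context, this is roughly the pattern in the training code (a sketch; the explicit plt.close() is the usual generic mitigation for matplotlib-related leaks, not something clearml-specific):
```
import matplotlib
matplotlib.use("agg")  # must be selected before pyplot is imported
import matplotlib.pyplot as plt

def log_plot(values):
    fig, ax = plt.subplots()
    ax.plot(values)
    # Close the figure explicitly; figures left open are the classic
    # source of growing memory when plotting repeatedly.
    plt.close(fig)
```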
Could you please point me to the relevant component? I am not familiar with TypeScript, unfortunately 😞
Yes, but I am not certain how: I just deleted the /data folder and restarted the server
You are right, thanks! I was trying to move /opt/trains/data to an external disk, mounted at /data
Sure, where can I find this file?
I wouldn't do it: this way there is less code to maintain on your side, and honestly too much auto-magic makes it difficult for the user to control the environment (i.e. to understand what happens behind the scenes). I am not sure what switching back would solve; here the wheel should have been correct, it's just that the architecture of the card is incompatible
I did that recently - what are you trying to do exactly?
If the reporting is done in a subprocess, I can imagine that the task.set_initial_iteration(0) call is only effective in the main process, not in the subprocess used for reporting. Could that be the case?
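Something like this is what I have in mind (a sketch; whether Task.current_task() returns the task inside the reporting subprocess is my assumption):
```
from multiprocessing import Process
from clearml import Task

def reporting_worker():
    # Re-apply the offset inside the subprocess, since the call made in
    # the main process may not propagate here.
    task = Task.current_task()
    task.set_initial_iteration(0)
    task.get_logger().report_scalar("loss", "train", value=0.5, iteration=0)

if __name__ == "__main__":
    task = Task.init(project_name="debug", task_name="subprocess reporting")
    task.set_initial_iteration(0)  # possibly only effective in the main process
    p = Process(target=reporting_worker)
    p.start()
    p.join()
```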
I now have a different question: when installing torch from wheel files, I am guaranteed to get the corresponding CUDA library and cuDNN with it, right?
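A quick way to check what actually comes with the wheel (sketch):
```
import torch

# torch.version.cuda / cudnn.version() report the CUDA runtime and cuDNN
# the wheel was built with (torch.version.cuda is None for CPU-only wheels).
print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
```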
Not of the ES cluster, I only created a backup of the clearml-server instance disk, I didn’t think there could be a problem with ES…
Yeah, again I am trying to understand what I can do with what I have 😄 I would like to export, as an environment variable, the runtime (virtualenv) the agent installs into, so that an app I am using inside the Task can use the Python packages installed by the agent and I can easily control the packages through clearml
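To make that concrete, something along these lines inside the Task is what I imagine (a sketch; APP_PYTHON/APP_VENV and my_app are made-up names for whatever the app actually expects):
```
import os
import subprocess
import sys

# sys.executable / sys.prefix point at the interpreter and virtualenv the
# clearml-agent created for this task.
env = dict(os.environ)
env["APP_PYTHON"] = sys.executable  # hypothetical variable the app reads
env["APP_VENV"] = sys.prefix

# Launch the app so it reuses the packages the agent installed.
subprocess.run(["my_app", "--help"], env=env, check=True)
```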
yes, exactly: I run python my_script.py, the script executes, creates the task, calls task.execute_remotely(exit_process=True) and returns to bash. Then, in the bash console, after some time, I see some messages being logged by clearml
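The script is essentially this pattern (sketch; project, task and queue names are placeholders):
```
from clearml import Task

def train():
    print("training...")  # placeholder for the real training loop

# Registers the task, then exits this local process; an agent dequeues the
# task and runs the whole script remotely.
task = Task.init(project_name="my project", task_name="my experiment")
task.execute_remotely(queue_name="default", exit_process=True)

# Only reached when running on the agent, not locally.
train()
```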
yes that makes sense, I will do that. Thanks!
I opened an issue (https://github.com/pytorch/ignite/issues/2343) in ignite’s repo and a PR (https://github.com/pytorch/ignite/pull/2344), could you please have a look? There might be a bug in clearml Task.init in distributed envs
No idea, I also would have expected it to be automatically logged as console output 🤔
Ok, I could reproduce with Firefox and Chromium. Steps:
1. Add creds (either via the popup or in the settings)
2. Go to /settings/webapp-configuration -> creds should be there
3. Hit F5
4. Creds are gone
Thanks for the hint, I’ll check the paid version, but I’d like first to understand how much effort it would be to fix the current situation by myself 🙂
SuccessfulKoala55 I am looking for ways to free some space and I have the following questions:
- Is there a way to break down all the documents to identify the biggest ones? (rough idea of what I mean sketched below)
- Is there a way to delete several :monitor:gpu and :monitor:machine time series?
- Is there a way to downsample some time series (e.g. loss)?
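For the first point, this is roughly what I had in mind (a sketch, assuming direct access to the Elasticsearch instance backing clearml-server and the standard elasticsearch Python client; the index pattern and field name are guesses on my side):
```
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # the ES behind clearml-server

# List indices sorted by on-disk size to spot the biggest ones.
print(es.cat.indices(v=True, s="store.size:desc", bytes="mb"))

# Count how many documents a given family of series contributes, e.g. the
# :monitor: scalars (run against a backup first, the names may differ).
query = {"query": {"prefix": {"metric": ":monitor:"}}}
print(es.count(index="events-training_stats_scalar-*", body=query))
```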
There’s a reason for the ES index max size
Does ClearML enforce a max index size? What typically happens when that limit is reached?
Seems like it just went unresponsive at some point
I still don't see why you would change the type of the cloned Task, I'm assuming the original Task had the correct type, no?
Because it is easier for me to create a training task out of the controller task by cloning it (so that the parameters are prefilled and I can set the parent task id)
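Concretely, something like this (sketch; the controller ID and queue name are placeholders):
```
from clearml import Task

controller_task_id = "<controller-task-id>"  # placeholder

# Cloning the controller prefills its parameters in the new draft, and
# parent= records the controller as the parent of the training task.
training_task = Task.clone(
    source_task=controller_task_id,
    name="training run",
    parent=controller_task_id,
)
Task.enqueue(training_task, queue_name="default")
```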
What happens is a different error, but it was so weird that I thought it was related to the installed version
agent.package_manager.type = pip
...
Using base prefix '/home/machine1/miniconda3/envs/py36'
New python executable in /home/machine1/.trains/venvs-builds/3.6/bin/python3.6
Also creating executable in /home/machine1/.trains/venvs-builds/3.6/bin/python
Installing setuptools, pip, wheel...
I would let the trains team answer this in detail, but as a user moving from MLflow to trains, I can share the following insights:
MLflow and trains overlap when it comes to having a system with a nice web UI to compare/log experiments/models/metrics. But MLflow lacks a crucial feature IMO, which is ML/DevOps: using MLflow, you will have to take care of the whole maintenance of your machines, design the interactions between them, etc. This is where trains shines, it provides these features out-of-the-box…
You mean you "aborted the task" from the UI?
Yes exactly
I'm assuming from the leftover processes?
Most likely yes, but I don't see how clearml would have an impact here; I am more inclined to think it's a pytorch DataLoader issue, although I don't see why
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
yes in venv mode, I'll try with the latest version as well