BTW, is there any specific reason for not upgrading to clearml?
I just didn't have time so far 🙂
I managed to do it by using logger.report_scalar, thanks!
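For context, a minimal sketch of scalar reporting with the ClearML logger (the project/series names and the dummy loss value are just illustrative):

    from clearml import Task

    task = Task.init(project_name="examples", task_name="scalar-report-sketch")
    logger = task.get_logger()
    for i in range(100):
        loss = 1.0 / (i + 1)  # placeholder standing in for the real training loss
        logger.report_scalar(title="loss", series="train", value=loss, iteration=i)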
Thanks! With this I’ll probably be able to reduce the cluster size to be on the safe side for a couple of months at least :)
but according to the disk graphs, the OS disk is being used, but not the data disk
Seems like it just went unresponsive at some point
But you might want to double check
AgitatedDove14 https://clear.ml/docs/latest/docs/apps/clearml_session/#running-in-docker in the docs there is a --docker
option, that’s what confuses me, since the agent should always run in docker mode
SuccessfulKoala55 For the last 2 hours I've been getting 504 errors and I cannot ssh into the machine. AWS reports that the instance health checks fail. Is it safe to restart the instance?
There’s a reason for the ES index max size
Does ClearML enforce a max index size? What typically happens when that limit is reached?
SuccessfulKoala55 I am looking for ways to free some space and I have the following questions:
Is there a way to break down all the documents to identify the biggest ones? Is there a way to delete several :monitor:gpu and :monitor:machine time series? Is there a way to downsample some time series (e.g. loss)?
Well, as long as you’re using a single node, it should indeed alleviate the shard disk size limit, but I’m not sure ES will handle that too well. In any case, you can’t change that for existing indices, but you can modify the mapping template and reindex the existing index (you’ll need to reindex to another name, delete the original and create an alias to the original name, as the new index can’t be renamed...)
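A rough sketch of that reindex-and-alias flow, assuming the elasticsearch 8.x Python client and a local node (the index names are hypothetical, not ClearML's actual ones):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Create the new index with the desired shard count
    es.indices.create(index="events-new", settings={"number_of_shards": 2})

    # Copy all documents from the old index into the new one
    es.reindex(source={"index": "events-old"}, dest={"index": "events-new"}, wait_for_completion=True)

    # Drop the original index and expose the new one under the old name via an alias
    es.indices.delete(index="events-old")
    es.indices.put_alias(index="events-new", name="events-old")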
Ok thanks!
Well, as long as you use a single node, multiple shards offer no sca...
can it be that the merge op takes so much filesystem cache that the rest of the system becomes unresponsive?
The number of documents in the old and the new env are the same though 🤔 I really don’t understand where this extra space usage comes from
Here is the data disk (/opt/clearml) on the left and the OS disk on the right
it also happens without hitting F5 after some time (~hours)
Here is the console with some errors
Yes, I set:
auth {
  cookies {
    httponly: true
    secure: true
    domain: ".clearml.xyz.com"
    max_age: 99999999999
  }
}
It always worked for me this way
"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"
Sorry, it’s actually task.update_requirements(["."])
if I want to resume a training run on multiple GPUs, I will need to call this function in each process to send the weights to each GPU
SuccessfulKoala55 Thanks! If I understood correctly, setting index.number_of_shards = 2 (instead of 1) would create a second shard for the large index, splitting it into two shards? This https://stackoverflow.com/a/32256100 seems to say that it’s not possible to change this value after index creation, is that true?
Would adding a ILM (index lifecycle management) be an appropriate solution?
Ha nice, makes perfect sense thanks AgitatedDove14 !
AgitatedDove14 I made some progress:
In the agent’s clearml.conf, I set sdk.development.report_use_subprocess = false (because I had the feeling that Task._report_subprocess_enabled = False wasn’t being taken into account), and I’ve set task.set_initial_iteration(0).
Now I was able to get the following graph after resuming -
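For reference, a minimal sketch of that resume setup (the continue_last_task flag and the names are my assumptions, not something stated above):

    # In the agent's clearml.conf: sdk.development.report_use_subprocess: false
    from clearml import Task

    # continue_last_task=True resumes reporting into the previous task instead of creating a new one
    task = Task.init(project_name="examples", task_name="resumable-training", continue_last_task=True)
    # Keep reported iterations starting from 0 rather than offsetting by the last recorded iteration
    task.set_initial_iteration(0)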
Opened an issue with the logs here > None
yes, so it does exit the local process (at least, the command returns), but another process is still running in the background and logging things from time to time, such as: ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
yes, exactly: I run python my_script.py, the script executes, creates the task, calls task.execute_remotely(exit_process=True) and returns to bash. Then, in the bash console, after some time, I see some messages being logged from clearml
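For reference, a minimal sketch of that flow with Task.execute_remotely (the queue name is illustrative):

    from clearml import Task

    task = Task.init(project_name="examples", task_name="remote-run")
    # Enqueues the task for an agent and, with exit_process=True, terminates the local process here
    task.execute_remotely(queue_name="default", exit_process=True)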