Yes, here is the error (the space at the end of the line is there):
` Applying uncommitted changes
Executing: ('git', 'apply'): b'error: corrupt patch at line 13\n'
Failed applying diff
trains_agent: ERROR: Failed applying git diff:
diff --git a/configs/2.2.2_from_scratch.yaml b/configs/2.2.2_from_scratch.yaml
index 9fece48..5816f78 100644
--- a/configs/2.2.2_from_scratch.yaml
+++ b/configs/2.2.2_from_scratch.yaml
@@ -136,7 +136,7 @@ data_processing:
optimizer:
type: 'RMSprop'
args:
- lr: 2.5e...
So I want to be able to visualise it quickly as a table in the UI and to download it back as a dataframe. Which of report_media or artifact is better for that?
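To make the comparison concrete, here is a minimal sketch of both directions, using report_table for the UI side, assuming a pandas DataFrame called df (the project and task names below are placeholders):

```
from clearml import Task
import pandas as pd

# Placeholder project/task names, dummy data
task = Task.init(project_name="examples", task_name="table-demo")
df = pd.DataFrame({"epoch": [1, 2, 3], "loss": [0.9, 0.5, 0.3]})

# Option 1: report the dataframe as a table so it renders in the UI
task.get_logger().report_table(
    title="results", series="metrics", iteration=0, table_plot=df
)

# Option 2: upload it as an artifact so it can be downloaded back as a dataframe
task.upload_artifact(name="results_df", artifact_object=df)
```

If I am not mistaken, the artifact can then be fetched back with task.artifacts["results_df"].get(), while the table shows up under the task's plots.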
` # Set the python version to use when creating the virtual environment and launching the experiment
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
# The default is the python executing the clearml_agent
python_binary: ""
# ignore any requested python version (Default: False, if a Task was using a
# specific python version and the system supports multiple python the agent will use the requested python version)
# ignore_requested_python_version: ...
I mean, when sending data from the clearml-agents, does it block the training while sending metrics, or is it done in parallel, separately from the main thread?
That would be awesome, yes; it's just that from my side I have zero knowledge of the pip codebase.
Still investigating: task.data.last_iteration is correct (equal to engine.state["iteration"]) when I resume the training.
You mean it will resolve by itself in the coming days, or should I do something? Or is there nothing to do and it will stay this way?
If the reporting is done in a subprocess, I can imagine that the task.set_initial_iteration(0) call is only effective in the main process, not in the subprocess used for reporting. Could that be the case?
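For reference, this is roughly where the call sits in my resume path, as a sketch (project/task names are placeholders, and continue_last_task is just how I resume here):

```
from clearml import Task

# Resume the previous run; placeholder project/task names
task = Task.init(
    project_name="examples",
    task_name="resume-demo",
    continue_last_task=True,
)

# Reset the iteration offset so reported scalars start from 0 again.
# The question above: does this also take effect in the background
# reporting subprocess, or only in the main process?
task.set_initial_iteration(0)
```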
There is no way to filter on long types? I can't believe it.
Maybe the agent could be adapted to have a max_batch_size parameter?
Something like that?
` curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{
"match": {
"variant": "loss_model"
}
},
{
"match": {
"task": "8f88e4b8cff84f23bde74ed4b7213ec6"
}
}
]
}
},
"aggs": {
"series": {
"terms": { "field": "iter" }
}
}
}...
It indeed has the old commit, so they match; no problem actually.
Although task.data.last_iteration is correct when resuming, there is still this doubling effect when logging metrics after resuming.
with the CLI, on a conda env located in /data
MagnificentSeaurchin79 You could also just fork the tensorflow repo, make changes in a specific branch and specify your forked repo with your custom branch in the install_requires of your setup.py
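Something like this, as a rough sketch only; the fork URL, branch, and package names below are made up:

```
# setup.py (sketch): pin a forked repo + custom branch via a PEP 508 direct reference
from setuptools import setup, find_packages

setup(
    name="my_package",        # placeholder
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # made-up fork and branch, replace with your own
        "tensorflow @ git+https://github.com/your-user/tensorflow.git@custom-branch",
    ],
)
```

As far as I know, pip handles such direct references when installing the package from source or Git, but PyPI rejects uploads that declare them.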
For some reason, when cloning task A, trains sets an old commit in task B. I tried to recreate task A to force a new task id and a new commit id, but I still get the same issue.
Not of the ES cluster; I only created a backup of the clearml-server instance disk. I didn't think there could be a problem with ES…
Now I am trying to restart the cluster with docker-compose while specifying the last volume; how can I do that?
So it could be that, when restarting docker-compose, it used another volume, hence the loss of data.
I see that I have several volumes:
` $ docker volume ls
DRIVER VOLUME NAME
local 5b0bfe5ab1a3d645bd635b2fb6f2aefd2b657d566019343c8305959903996c67
local 43b60287d60db798dc9d1defe1d7d861334c9c8299aefad6da2f20db278cfc5b
local 1406d50aa65ab55d323500d1fb23f19adfc6e721261ab6103a59d20e82146099
local 7367a215bd42a4e888e5d88ce708bf74aedc48a6e9417c72a19739cb80f25e6d
local 7413c39f5e4b6568304832d9d2e925ebdbf47ad31ad22d77830d3618af79237b
local a55cb71edff48c2138a5da9d8d1e26df3b...
I have no idea what's going on
Thanks SuccessfulKoala55! So CLEARML_NO_DEFAULT_SERVER=1 by default, right?
I cannot share the file itself, but here are some potentially helpful points:
- Multiple lines are empty
- One line is empty but has spaces (6 to be exact)
- The last line of the file is empty
As for disk space: I have 21 GB available (8 GB used); the /opt/trains/data folder is about 600 MB.
Also, what is the benefit of having index.number_of_shards = 1 by default for the metrics and logs indices? Having more shards allows scaling and later moving them to separate nodes if needed; with the default heap size being 2 GB, that should be possible, shouldn't it?
Something was triggered; you can see the CPU usage ramping up right when the instance became unresponsive. Maybe a merge operation from ES?