Hi DilapidatedDucks58, I did that already, but I am reusing the same experiment instead of merging two experiments. Step 4 can be seen as:
- Update the experiment status to stopped (if it is failed, you won’t be able to re-enqueue it)
- Set a parameter of that task to point to the latest checkpoint and load it (you can also infer it directly: I simply add a `resume` tag to the task, and check at runtime if this tag exists; if yes, I fetch the latest checkpoint of the task, see the sketch below)
- Use https://clea...
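For context, the runtime check looks roughly like this (a minimal sketch; the `resume` tag name and the output-model lookup are my own convention, and the project/task names are placeholders):

```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="train")  # placeholder names

# If the task carries the "resume" tag, fetch the latest checkpoint it produced
if "resume" in task.get_tags():
    # models["output"] lists the checkpoints/models registered by previous runs
    latest_checkpoint = task.models["output"][-1].get_local_copy()
    # ... load the weights from latest_checkpoint before training continues
```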
Yes, but they are in plain text and I would like to avoid that
That would be amazing!
Hi AgitatedDove14, here is the full log.
Both Python versions (local and remote) are Python 3.6.
Locally (macOS), I get pytorch3d== (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0, 0.4.0, 0.5.0)
Remotely (Ubuntu), I get (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0)
So I guess it’s not really related to clearml-agent, but rather to pip, which cannot find the proper Ubuntu wheel for the latest versions of pytorch3d, right? If yes, is there a way to build the wheel on the remote machine...
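If building from source is the answer, I guess I could force the agent to do it with something like this (a rough sketch; I’m assuming `Task.add_requirements` accepts a git URL here, which I haven’t verified):

```python
from clearml import Task

# Ask the remote agent to install pytorch3d from source so pip builds the
# wheel on the Ubuntu machine (the git-URL form is an assumption on my part);
# note that add_requirements must be called before Task.init
Task.add_requirements("git+https://github.com/facebookresearch/pytorch3d.git")
task = Task.init(project_name="my_project", task_name="train")  # placeholder names
```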
AgitatedDove14 Yes that might work, also the first one (with conda) might work as well, I will give it a try, thanks!
I’ve reindexed the data for the logs; now the mappings are correct, but I am missing one month of data. I have literally no idea where this data is or how it disappeared.
I’ve set `dynamic: "strict"` in the template of the logs index and was able to keep the same mapping after doing the reindex
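Concretely, something like this (using the ES REST API from Python; the cluster address and the `events_log` template name are placeholders for my setup):

```python
import requests

ES = "http://localhost:9200"  # placeholder for my cluster address

# Fetch the existing logs template, force strict dynamic mapping, and put it back
template = requests.get(f"{ES}/_template/events_log").json()["events_log"]
template["mappings"]["dynamic"] = "strict"
requests.put(f"{ES}/_template/events_log", json=template).raise_for_status()
```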
Should I try to disable dynamic mapping before doing the reindex operation?
Not of the ES cluster, I only created a backup of the clearml-server instance disk, I didn’t think there could be a problem with ES…
The reindexing operation showed no errors and copied everything
I reindexed only the logs to a new index afterwards. I am now doing the same with the metrics, since they cannot be displayed in the UI because of their wrong dynamic mappings
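The reindex itself is just the standard ES `_reindex` API, e.g. (index names are placeholders):

```python
import requests

ES = "http://localhost:9200"  # placeholder

# Copy everything from the broken index into a fresh one with the fixed mapping;
# wait_for_completion=false makes ES return a task id so the slow copy runs
# in the background and can be polled via GET /_tasks/<id>
resp = requests.post(
    f"{ES}/_reindex?wait_for_completion=false",
    json={"source": {"index": "events-log-old"}, "dest": {"index": "events-log-new"}},
)
print(resp.json())
```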
I did change the replica setting on the same index, yes; I reverted it back from 1 to 0 afterwards
I see that I have several volumes:
```
$ docker volume ls
DRIVER    VOLUME NAME
local     5b0bfe5ab1a3d645bd635b2fb6f2aefd2b657d566019343c8305959903996c67
local     43b60287d60db798dc9d1defe1d7d861334c9c8299aefad6da2f20db278cfc5b
local     1406d50aa65ab55d323500d1fb23f19adfc6e721261ab6103a59d20e82146099
local     7367a215bd42a4e888e5d88ce708bf74aedc48a6e9417c72a19739cb80f25e6d
local     7413c39f5e4b6568304832d9d2e925ebdbf47ad31ad22d77830d3618af79237b
local     a55cb71edff48c2138a5da9d8d1e26df3b...
```
Now I am trying to restart the cluster with docker-compose while specifying the last volume; how can I do that?
So it could be that, when restarting the docker-compose, it used another volume, hence the loss of data
Is it safe to turn off replication while a reindex operation is happening? The reindexing is rather slow and I am wondering if turning off replication will speed up the process
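For reference, the replica toggle I have in mind is the standard settings API (index name is a placeholder):

```python
import requests

ES = "http://localhost:9200"  # placeholder

# Drop replicas to 0 for the duration of the reindex, set it back afterwards
requests.put(
    f"{ES}/events-log-new/_settings",
    json={"index": {"number_of_replicas": 0}},
).raise_for_status()
```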
Ok, I have a very different problem now. I did the following to restart the ES cluster: `docker-compose down`, then `docker-compose up -d`.
And now the cluster is empty. I think docker simply created a new volume instead of reusing the previous one, which had always been the case so far.
AppetizingMouse58, btw I had to delete the old logs index before creating the alias; otherwise ES won’t let me create an alias with the same name as an existing index.
I made sure, before deleting the old index, that the number of docs matched.
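The sequence was roughly this (index/alias names are placeholders for mine):

```python
import requests

ES = "http://localhost:9200"  # placeholder
old, new = "events-log-old", "events-log-new"

# 1. Verify the copy is complete before destroying anything
assert (
    requests.get(f"{ES}/{old}/_count").json()["count"]
    == requests.get(f"{ES}/{new}/_count").json()["count"]
)

# 2. Delete the old index: ES refuses an alias whose name collides with an index
requests.delete(f"{ES}/{old}").raise_for_status()

# 3. Point the old name at the new index via an alias
requests.post(
    f"{ES}/_aliases",
    json={"actions": [{"add": {"index": new, "alias": old}}]},
).raise_for_status()
```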
SuccessfulKoala55 I am using ES 7.6.2
AppetizingMouse58 Yes and yes
AppetizingMouse58, the events_plot.json template is missing the `plot_len` declaration; could you please give me the definition of this field? (Reindexing with `dynamic: strict` fails with: `mapping set to strict, dynamic introduction of [plot_len] within [_doc] is not allowed`.)
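For reference, this is the kind of declaration I mean; the `long` type below is only my guess (hence the question):

```python
import requests

ES = "http://localhost:9200"  # placeholder

# Add an explicit plot_len field so dynamic: strict stops rejecting it;
# "long" is only my guess at the proper type for this field
template = requests.get(f"{ES}/_template/events_plot").json()["events_plot"]
template["mappings"]["properties"]["plot_len"] = {"type": "long"}
requests.put(f"{ES}/_template/events_plot", json=template).raise_for_status()
```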
Amazon Linux
My docker-compose for the master node of the ES cluster is the following:
```
version: "3.6"
services:
  elasticsearch:
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: clearml-es
      cluster.initial_master_nodes: clearml-es-n1, clearml-es-n2, clearml-es-n3
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      clust...
```
Hi AgitatedDove14, I don’t see any in https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping but I guess I could override it and add one?
I managed to do it by using `logger.report_scalar`, thanks!
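In case it helps anyone, the idea looks like this (a sketch: subclass ignite’s EarlyStopping and report its internal patience counter through the ClearML logger):

```python
from clearml import Logger
from ignite.handlers import EarlyStopping


class ReportingEarlyStopping(EarlyStopping):
    """EarlyStopping that also reports its patience counter to ClearML."""

    def __call__(self, engine):
        super().__call__(engine)
        # counter is an internal attribute of ignite's EarlyStopping
        Logger.current_logger().report_scalar(
            title="early_stopping",
            series="counter",
            value=self.counter,
            iteration=engine.state.epoch,
        )
```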
Seems like it just went unresponsive at some point
Nice, the preview param will do 🙂 btw, I love the new docs layout!
I am doing:
```
try:
    score = get_score_for_task(subtask)
except Exception:
    score = pd.NA
finally:
    df_scores = df_scores.append(
        dict(task=subtask.id, score=score), ignore_index=True
    )
task.upload_artifact("metric_summary", df_scores)
```