TimelyPenguin76 , it is possible that I tried to compare more than 10 experiments. The issue is that the server got very slow and no longer showed the 'console' and 'scalars' results, even for a single experiment.
CostlyOstrich36 , I don't have the ClearML RAM estimate. My machine is running many processes in addition to ClearML.
AgitatedDove14 , I did nothing to generate a command line. I just cloned the experiment and enqueued it, using the server GUI.
I no longer get the error, and the experiments get deleted as expected. So no complaints on my side...
I am running my own server. Those are not example experiments.
I am not sure it matters for the following output, but anyway please note that the clearml dockers are down right now.
sigalr@momo:~$ curl -XGET http://localhost:9200/_cat/indices
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 2F6APbQWSvajTZQ5JxXY1Q 1 1 59 0 26.2kb 26.2kb
yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b bZMKKCaKRXCys6VD_9oDDw 1 1 8556 0 4.1mb 4.1mb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 c85DhB...
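For anyone reading a listing like the one above, here is a minimal Python sketch that splits the text into records. It assumes the default column order of `_cat/indices` (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size). The names `parse_cat_indices` and `non_green` are illustrative helpers, not part of ClearML or Elasticsearch, and lines that do not have all ten columns (such as the truncated last line) are skipped rather than guessed.

```python
# Default column order of `GET _cat/indices` (assumed; no custom `h=` parameter).
COLUMNS = (
    "health", "status", "index", "uuid", "pri", "rep",
    "docs_count", "docs_deleted", "store_size", "pri_store_size",
)

SAMPLE = """\
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 2F6APbQWSvajTZQ5JxXY1Q 1 1 59 0 26.2kb 26.2kb
yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b bZMKKCaKRXCys6VD_9oDDw 1 1 8556 0 4.1mb 4.1mb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 c85DhB...
"""


def parse_cat_indices(text):
    """Return one dict per complete line; incomplete lines are ignored."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # blank, truncated, or otherwise incomplete line
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def non_green(rows):
    """Names of indices whose health is anything other than 'green'."""
    return [row["index"] for row in rows if row["health"] != "green"]


if __name__ == "__main__":
    for row in parse_cat_indices(SAMPLE):
        print(row["health"], row["index"], row["docs_count"])
```

Note that `yellow` with `rep` equal to 1 is the usual state on a single-node Elasticsearch, because the replica shard has no second node to be allocated to; by itself it does not indicate lost data.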
AppetizingMouse58 , SuccessfulKoala55 and AgitatedDove14 , after running the ES migration for the 2nd time the problem is solved 🎉 . Thank you all for your help! 🙏
Just to make sure, by running ES migration you mean running elastic_upgrade.py again. Correct?
It took ~36 hours two days ago.
The ES migration log is attached in the 1st message of this thread. Do you see any problems in it?
Is there any way to verify whether the ES migration results are good?
The upgrade is from /home/orpat/trains/data/elastic into /home/orpat/trains/data/elastic_7. Do you see different paths in the log? Where?
Yes I've performed the ES migration. The data is in clearml/data/elastic_7.
I will try it and keep you posted. Thanks!
My original trains server version was 0.14, if I remember correctly. Is there anywhere I can check it now that the upgrade has been done?
My new clearml server is 1.5. I get that from http://localhost:8080/version.json but if there is somewhere else I should look, let me know.
The clearml dockers are down right now because I started a new ES migration (elastic_upgrade.py). I started it before you contacted me and I don't want to break it now. So I cannot look at the console right now.
It will probably finish 30 hours from now. If the same problems repeat, we will continue this chat then.
In the file docker-compose.yml I replaced every occurrence of the string /opt/clearml/data/elastic_7 with /home/orpat/clearml/data/elastic_7.
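The same kind of path replacement can be sketched in Python. This is only an illustration under stated assumptions: `replace_in_file` is a hypothetical helper (not a ClearML tool), it does a plain substring replacement without understanding YAML, and it keeps a `.bak` copy next to the file before writing.

```python
import shutil
from pathlib import Path


def replace_in_file(path, old, new):
    """Replace every occurrence of `old` with `new` in the file at `path`.

    Returns the number of occurrences found. If there are none, the file
    is left untouched and no backup is written.
    """
    path = Path(path)
    text = path.read_text()
    count = text.count(old)
    if count:
        # Keep a copy of the original, e.g. docker-compose.yml.bak
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(text.replace(old, new))
    return count


if __name__ == "__main__":
    n = replace_in_file(
        "docker-compose.yml",
        "/opt/clearml/data/elastic_7",
        "/home/orpat/clearml/data/elastic_7",
    )
    print(f"replaced {n} occurrence(s)")
```

Returning the count makes it easy to notice when nothing matched, for example because the file had already been edited.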
Is there any log that maybe details the problem?
Is it ok to restore data/mongo from my backup, and leave all the other files that were created by elastic_upgrade.py (e.g., data/elastic_7) untouched?
What I mean is: do I need to run elastic_upgrade.py again, or just the mongo upgrade (clearml-server-1.2.0-migration.py)?
Update: I ran the mongo migration script (clearml-server-1.2.0-migration.py) and now I can see my projects! 👏
Now there is a new problem: I don't see any of the logs: console, artefacts, scalars, plots.
Can you help?
AgitatedDove14 SuccessfulKoala55 , after I ran elastic_upgrade.py (stage 5 as described above), I saw there was a new folder named data/mongo_4. Doesn't that mean MongoDB was already migrated?
The sequence is unclear then:
I followed the instructions in https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_es7_migration/ .
Stage 5 ("python elastic_upgrade.py") ended successfully.
Then I skipped "Upgrading to ClearML Server v.1.2. or Newer" and went straight to "Completing the Installation".
Did I do something wrong? What should I do to fix it?
Attached are the agent log and the task log
Who/What created the initial experiment ?
I created the initial experiment from command-line, with either "python folder/script.py" or "python -m folder.script".
Both end up with the experiment not running. I am attaching an agent daemon log where the initial experiment was called with "python folder/script.py".
Why isn't the entry point just the python script?
The entry point is folder.script and not just the script because I need the 'current' folder while running the script ...
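To illustrate the difference between the two launch styles mentioned here, the following is a small sketch that relies only on standard CPython behavior: with `python folder/script.py` the first `sys.path` entry is the script's own directory, while with `python -m folder.script` it is the current working directory. The temporary layout and the helper names are made up for the demonstration; how the agent itself reconstructs the entry point is not shown.

```python
import os
import subprocess
import sys
import tempfile


def first_sys_path_entry(python_args, cwd):
    """Run `python <python_args>` in `cwd`; return sys.path[0] as a real path."""
    out = subprocess.run(
        [sys.executable, *python_args],
        cwd=cwd, capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Joining with cwd also covers the case where the entry is relative or empty.
    return os.path.realpath(os.path.join(cwd, out))


def compare_entry_points():
    """Create folder/script.py in a temp dir and launch it both ways."""
    with tempfile.TemporaryDirectory() as root:
        root = os.path.realpath(root)
        pkg = os.path.join(root, "folder")
        os.mkdir(pkg)
        open(os.path.join(pkg, "__init__.py"), "w").close()
        with open(os.path.join(pkg, "script.py"), "w") as f:
            f.write("import sys\nprint(sys.path[0])\n")
        as_script = first_sys_path_entry([os.path.join("folder", "script.py")], root)
        as_module = first_sys_path_entry(["-m", "folder.script"], root)
        return root, as_script, as_module


if __name__ == "__main__":
    root, as_script, as_module = compare_entry_points()
    print("python folder/script.py  ->", os.path.relpath(as_script, root))
    print("python -m folder.script  ->", os.path.relpath(as_module, root))
```

So imports that expect the 'current' folder on the path resolve only in the `-m` form, which is the reason for using folder.script as the entry point.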
Could it be the file you are trying to run is not in the repository ?
It is unclear what file is missing. The only hint is "Keyerror: '.'" and I am not sure what that refers to. All my code files are in the repository. Maybe the problem is with some installed package file?
Are you running inside a docker ?
No, I am running inside a conda environment.
Any chance you can send the full log ?
What I sent is the full agent daemon log. If you are asking for the console...
Bingo (I guess). My code is local, with multiple files. I will try to connect it to a git repo and let you know how it worked.
Does the agent support uncommitted changes in multiple files (on top of a git commit)?
I was still having the issue, and then I recalled an old solution that worked again today. Here it is:
F12 --> Application tab --> Storage --> Clear site data --> refresh clearml login screen
CostlyOstrich36 , I cleared the local cache and everything turned black (I guess it's not related to the cache). So I can't even see the list of experiments now.
I get an empty list for the 'XHR' filter.