```
$ ssh my-instance
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:O2++ST5lAGVoredT1hqlAyTowgNwlnNRJrwE8cbM...
```
Sorry, I meant the clearml-session. The error is the one I shared at the beginning of this thread.
AgitatedDove14 Yes, with the command you shared I can now ssh again manually to the agent, but clearml-agent will still raise the same error
SuccessfulKoala55 Am I doing/saying something wrong regarding the problem of flushing every 5 secs? (See my previous message)
Make sure the cloned task is in Draft mode; if not, reset it
Then, in the Execution tab of the task, in the Source Code section (the first one), you can edit the values
But I am not sure it will connect the parameters properly, I will check now
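For what it's worth, the same edit can also be sketched from code (a sketch, not a verified recipe: the task ID is a placeholder, and you should double-check `update_task` and the `script` field names against your SDK version):
```
from clearml import Task

# Clone the original task; a fresh clone is already in Draft mode
cloned = Task.clone(source_task="<source-task-id>", name="edited copy")

# If the task is not a draft (e.g. it already ran), reset it first:
# cloned.reset()

# Edit the Source Code section (what the UI shows under Execution)
cloned.update_task({
    "script": {
        "branch": "my-branch",      # illustrative values
        "entry_point": "train.py",
    }
})
```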
Ok, by setting `PyJWT==1.7.1` in the setup.py of the experiment, pip did not enforce the update
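For context, the pin sits in the experiment's setup.py roughly like this (a sketch; the package name and version are hypothetical):
```
from setuptools import setup

setup(
    name="my_experiment",  # hypothetical package name
    version="0.1.0",
    install_requires=[
        "PyJWT==1.7.1",  # the pin that pip did not enforce
    ],
)
```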
ok, what is your problem then?
If I manually call `report_matplotlib_figure`, yes. If I don't (and just create the figure), there is no mem leak
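Here is roughly the pattern that leaks for me (a minimal sketch; project/task names, loop size and figure content are arbitrary):
```
import matplotlib.pyplot as plt
from clearml import Task

task = Task.init(project_name="debug", task_name="matplotlib leak repro")
logger = task.get_logger()

for i in range(10_000):
    fig = plt.figure()
    plt.plot([0, 1], [0, i])
    # Memory only grows when the figure is reported explicitly:
    logger.report_matplotlib_figure(
        title="leak test", series="series", iteration=i, figure=fig
    )
    plt.close(fig)
```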
CostlyOstrich36 I don’t see such a number, can you please share a screenshot of where to look?
I hit enter too fast ^^
Installing them globally via `pip install numpy opencv torch` will install locally with the warning:
```
Defaulting to user installation because normal site-packages is not writeable
```
Therefore the installation will take place in `~/.local/lib/python3.6/site-packages` instead of the default location. Will this still be considered as global site-packages and still be included in experiment envs? From what I tested it does
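In case it helps, this is how I check where the user site actually is (standard library only):
```
import site

print(site.getusersitepackages())  # e.g. ~/.local/lib/python3.6/site-packages
print(site.getsitepackages())      # the "global" site-packages directories
```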
Yes, but a minor one. I would need to do more experiments to understand what is going on with pip skipping some packages but reinstalling others.
Mmmh, unfortunately not easily… I will try to debug deeper today. Is there a way to resume a task from code to debug locally? Something like replacing `Task.init` with `Task.get_task`, so that `Task.current_task` is the same task as the output of `Task.get_task`
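Something like this is what I have in mind (a sketch; I am not sure `continue_last_task` accepts a task ID in every clearml version, so treat that as an assumption to test):
```
from clearml import Task

# Assumption: continue_last_task can take an existing task ID
task = Task.init(
    project_name="debug",
    task_name="resume locally",
    continue_last_task="<existing-task-id>",  # placeholder
)

# Task.current_task() should now point at the resumed task
assert Task.current_task().id == task.id
```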
I am running on bare metal, and CUDA seems to be installed at `/usr/lib/x86_64-linux-gnu/libcuda.so.460.39`
That said, v1.3.1 is already out, with what seems like a fix:
So you mean 1.3.1 should fix this bug?
Would be very cool if you could include this use case!
`trains==0.16.4`
(I am not part of the awesome ClearML team, just a happy user 🙂 )
Awesome! Unfortunately, calling `artifact["foo"].get()` gave me:
```
Could not retrieve a local copy of artifact foo, failed downloading file:///checkpoints/test_task/test_2.fgjeo3b9f5b44ca193a68011c62841bf/artifacts/foo/foo.json
```
It tries to get it from the local storage, but the json is stored in s3 (it does exist) and I did create both tasks specifying the correct output_uri (to s3)
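For reference, this is roughly my retrieval code (a sketch; the task ID is a placeholder):
```
from clearml import Task

producer = Task.get_task(task_id="<producer-task-id>")
artifact = producer.artifacts["foo"]

local_path = artifact.get_local_copy()  # downloads the file
data = artifact.get()                   # downloads and deserializes it
```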
I see 3 agents in the "Workers" tab
I checked the server code diffs between 1.1.0 (when it was working) and 1.2.0 (when the bug appeared) and I saw many relevant changes that could introduce this bug > https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0
If I remove `security_group_ids` and just leave `subnet_id` in the configuration, it is not taken into account (the instances are created in the default subnet)
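For reference, the relevant slice of my resource configuration looks roughly like this, written as a Python dict (apart from `security_group_ids` and `subnet_id`, the keys and values are illustrative and may differ per autoscaler version):
```
resource_configurations = {
    "gpu_machine": {                     # illustrative resource name
        "instance_type": "g4dn.xlarge",  # illustrative
        "security_group_ids": ["sg-0123456789abcdef0"],  # removing this ...
        "subnet_id": "subnet-0123456789abcdef0",         # ... leaves only this, which is then ignored
    }
}
```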
AgitatedDove14 WOW, thanks a lot! I will dig into that 🚀
Ok, I have a very different problem now: I did the following to restart the ES cluster:
```
docker-compose down
docker-compose up -d
```
And now the cluster is empty. I think docker simply created a new volume instead of reusing the previous one, which it had always done so far.
> automatically promote models to be served from within clearml
Yes!