it looks like clearml-apiserver and clearml-fileserver are continually restarting
no, they are still restarting. I've looked in /opt/clearml/logs/apiserver.log
no errors
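For a restart loop like this, the container logs usually say more than the log files on disk; a quick check, assuming the default clearml-* container names from the standard docker-compose:

docker ps --filter name=clearml --format 'table {{.Names}}\t{{.Status}}'   # STATUS shows the restart count
docker logs --tail 100 clearml-apiserver                                   # last output before the crash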
Incidentally, we turn off the server every evening as it's not used overnight; we've not had issues with it starting up in the morning, and we haven't noticed any data loss.
Hi @<1523701205467926528:profile|AgitatedDove14>
Yes, the clearml-server AMI - we want to be able to back it up and encrypt it in our own account
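Copying an AMI into your own account with encryption can be done from the AWS CLI; a sketch with placeholder IDs and regions (whether a public AMI is copyable depends on how it was shared):

# --encrypted re-encrypts the copy's snapshots; add --kms-key-id to use a specific CMK
aws ec2 copy-image \
    --source-image-id ami-0123456789abcdef0 \
    --source-region us-east-1 \
    --region us-east-1 \
    --name clearml-server-private-copy \
    --encrypted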
thanks Stef - with max-size do you set it for every running service separately, or can you set it once?
Not entirely sure on this as we used the custom AMI solution; is there any documentation on it?
thanks @<1523715084633772032:profile|AlertBlackbird30>, this is really informative. Nothing seems particularly out of the ordinary, though.
3.7G /var/lib/
3.7G /var/lib/docker
3.0G /var/lib/docker/overlay2
followed by a whole load of files that are a few hundred KBs in size, nothing huge though
Is there a way you can allow our account to make a copy of the AMI and store it privately?
that should be the case, we have default_output_uri: set to an S3 bucket
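For reference, a minimal sketch of that setting in clearml.conf (the bucket name is a placeholder):

sdk {
    development {
        # task artifacts and models are uploaded here by default
        default_output_uri: "s3://my-clearml-bucket/outputs"
    }
}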
so am I right in thinking it's just the mount points that are missing, based on the output of df above?
After making the change yesterday to the docker-compose file, the server is completely unusable - this is all I see for the /dashboard screen
Morning, we got to 100% used, which is what triggered this investigation. When we initially looked at overlay2 it was using 8GB, so the current usage seems acceptable.
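If overlay2 fills up again, space held by unused images, stopped containers, and build cache can usually be reclaimed without touching the server's data, since that lives in bind mounts under /opt/clearml rather than in image layers; a sketch:

docker system prune      # removes stopped containers, dangling images, unused networks, build cache
docker image prune -a    # more aggressive: also removes any image not used by a container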
yep, in most of them:
/opt/clearml/config: apiserver.conf, clearml.conf
/opt/clearml/data/elastic_7: /nodes
/opt/clearml/data/fileserver: <empty>
/opt/clearml/data/mongo/configdb: <empty>
/opt/clearml/data/mongo/db: collection/index files, /diagnostic.data, /journal, etc.
/opt/clearml/data/redis: dump.rdb
/opt/clearml/logs: apiserver.log.x, fileserver.log (0 bytes)
Just by chance I set the SSH deploy keys to write access and now we're able to clone the repo. Why would the SSH key need write access to the repo to be able to clone?
Host key verification failed.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Yep, I've done all that; it didn't seem to work until I set the deploy key to write access.
strange, I used one of the publicly available AMIs for ClearML (we did not upgrade from the Trains AMI as we started fresh)
Hey @<1523701205467926528:profile|AgitatedDove14> I am helping Max to get this working. I ran clearml-agent init and now have the correct entries in the clearml.conf file.
Created an SSH key on the agent and uploaded it to the git repo, but I'm still getting this error:
Host key verification failed.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
If I manually SSH onto the agent and run:
`git clon...
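For what it's worth, "Host key verification failed" usually means the agent user has no known_hosts entry for the git host, rather than a problem with the key itself; a sketch of a common fix, assuming GitHub (substitute your git host) and that ~/.ssh already exists:

ssh-keyscan github.com >> ~/.ssh/known_hosts   # run as the same user the agent runs under
ssh -T git@github.com                          # should greet you instead of failing on the host key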
not yet, going to try and fix it today.
if I do a df I see this, which is concerning:
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        3.9G     0  3.9G   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           3.9G  928K  3.9G   1% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/nvme0n1p1   20G  7.9G   13G  40% /
tmpfs           790M     0  790M   0% /run/user/1000
so it looks like the mount points are not created. When do these g...
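One way to check whether the containers actually received their bind mounts, assuming the default clearml-apiserver container name:

# prints each mount as host-path -> container-path
docker inspect -f '{{ range .Mounts }}{{ .Source }} -> {{ .Destination }}{{ "\n" }}{{ end }}' clearml-apiserver

Note that docker's bind mounts live inside each container's mount namespace, so they won't show up in df on the host either way.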
I added this to each of the containers:
logging:
  driver: json-file  # json-file is docker's default logging driver
  options:
    max-file: "5"    # option values must be strings in docker-compose
    max-size: "10m"
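On the earlier question of setting it once: docker-compose has no global logging key, but a YAML anchor via an extension field (compose file format 3.4+) avoids repeating it per service; a sketch, with illustrative service names:

x-logging: &default-logging
  driver: json-file
  options:
    max-size: "10m"
    max-file: "5"

services:
  apiserver:
    logging: *default-logging
  fileserver:
    logging: *default-logging

Alternatively, log-driver and log-opts in /etc/docker/daemon.json apply daemon-wide to containers created after the change.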
All sorted - I somehow missed the documentation about the MongoDB migration.
Hi @<1523701205467926528:profile|AgitatedDove14> I tried this out, but I keep getting connection timeouts in the browser when going through the ELB. The instance is showing as InService and passing the health check. Is there any other configuration I need to do in clearml.conf to make this work?
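A quick way to separate an ELB problem from a server problem, assuming the default ClearML ports (8080 web, 8008 API, 8081 files) and placeholder hostnames:

curl -s -o /dev/null -w '%{http_code}\n' http://<instance-ip>:8080/    # direct to the instance
curl -s -o /dev/null -w '%{http_code}\n' http://<elb-dns-name>:8080/   # through the load balancer

A timeout through the ELB but not direct usually points at the listener or security group configuration rather than anything in clearml.conf.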
our setup currently consists of an EC2 instance for clearml-server and one for clearml-agent. We're not using a load balancer at the moment.
no, that's what i'm trying to do
thanks, i'll try that out