
` [2021-01-24 17:02:25,660] [8] [INFO] [trains.service_repo] Returned 200 for queues.get_all in 2ms
[2021-01-24 17:02:25,674] [8] [INFO] [trains.service_repo] Returned 200 for queues.get_next_task in 8ms
[2021-01-24 17:02:26,696] [8] [INFO] [trains.service_repo] Returned 200 for events.add_batch in 36ms
[2021-01-24 17:02:26,742] [8] [INFO] [trains.service_repo] Returned 200 for events.add_batch in 78ms
[2021-01-24 17:02:27,169] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_al...
From the UI it will work, since it is getting the temp file from there.
I mean from the code (let's say remotely)
Hi SuccessfulKoala55 ,
I took the server down:
` [ec2-user@ip-172-31-26-41 ~]$ sudo docker-compose -f /opt/clearml/docker-compose.yml down
WARNING: The CLEARML_HOST_IP variable is not set. Defaulting to a blank string.
WARNING: The CLEARML_AGENT_GIT_USER variable is not set. Defaulting to a blank string.
WARNING: The CLEARML_AGENT_GIT_PASS variable is not set. Defaulting to a blank string.
Stopping clearml-webserver ... done
Stopping clearml-agent-services ... done
Stopping clearml-apiserver...
OK, looks like it is starting the training...
Thanks 💯
My docker already has my project on it, so I know where to mount. Maybe the agent moves/creates a copy of my project somewhere else?
So for now I am leaving this issue...
Thanks a lot 🙏 🙌
Thanks AgitatedDove14 ,
I need to check with my boss that it is OK to share more code, will let you know..
But I will give 0.16 a try when it is released.
🙏
the index creation:
` [ec2-user@ip-172-31-26-41 ~]$ sudo docker exec -it clearml-mongo /bin/bash
root@3fc365193ed0:/# mongo
MongoDB shell version v3.6.5
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.6.5
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
Questions? Try the support group
Server has startup warnings:
2021-01-25T05:58:37.309+0000 I CONTROL [initandlisten]
2021-01-25T05:58:37.309+0000 I C... `
If I mount the S3 bucket to the trains-server and link the mount to /opt/trains/data/fileserver, will it work?
Thanks for the reply,
I saw that it is preferable to change the fileserver in trains.conf to s3://XXX
So, I changed this as I wrote before.
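For reference, a minimal sketch (assuming the Python SDK in use in this thread; project name and bucket are placeholders, with "s3://XXX" standing in for your own bucket) of directing a task's uploads to S3 from code, as an alternative to editing trains.conf:
`
from trains import Task  # "clearml" in later versions

# output_uri sends this task's artifacts and models to S3
# instead of the default fileserver.
task = Task.init(
    project_name="my_project",
    task_name="s3_output_example",
    output_uri="s3://XXX",
)

# Debug samples follow the logger's default upload destination,
# so point it at the same bucket as well.
task.get_logger().set_default_upload_destination("s3://XXX")
`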
Ohh, I understand. So can you give me a short explanation of how to change the metadata?
Hi, AgitatedDove14 Thanks for the answer.
I think the upload reporting (files over 5MB) was added after version 0.17,
That's what I thought...
I think it could be helpful to add it to the conf, since 5MB is really small and my files are ~300MB, meaning 60 messages for each upload.
Another option might be to configure it as a Task.init() parameter.
I think both are OK 🙂
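Purely to illustrate the suggestion (the parameter below is hypothetical, not part of the real Task.init() signature), something along these lines:
`
from trains import Task

task = Task.init(
    project_name="my_project",
    task_name="upload_report_example",
    # Hypothetical parameter, shown commented out: raise the upload
    # progress-report chunk size from 5MB to e.g. 100MB, so a ~300MB file
    # logs 3 progress messages instead of ~60.
    # upload_report_chunk_size_mb=100,
)
`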
# Create a non-root user (default UID 1000) with passwordless sudo and run as it
ARG USER_ID=1000
RUN useradd -m --no-log-init --system --uid ${USER_ID} appuser -g sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER appuser
WORKDIR /home/appuser
Thanks, I will make sure that all the Python packages are installed as root..
And will let you know if it works
SuccessfulKoala55 Thanks 🙏 ..
Another related question:
My remote job fails because it cannot find the data:
FileNotFoundError: [Errno 2] No such file or directory: './data/XXXXXXXX
I mounted the data to the same place relative to my project inside the docker, using extra_docker_arguments.
I am using execute_remotely() to enqueue the job.
I know it works locally, since the job reads from ./data/XXXX before execute_remotely() and works.
But when the agent creates ...
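To make the failure mode concrete, a minimal sketch of the pattern described above (queue name, project name and the ./data path are placeholders). Locally the relative path resolves against the project directory, but under the agent the working directory is the cloned copy of the repo inside the docker, so the mount passed via extra_docker_arguments has to land at that same relative location (or the code should use an absolute path):
`
import os

from trains import Task  # "clearml" in later versions

task = Task.init(project_name="my_project", task_name="remote_data_example")

# When run locally, execute_remotely() enqueues a clone of the task and exits
# this process; the code below only executes on the agent.
task.execute_remotely(queue_name="default", exit_process=True)

# Relative path: on the agent this resolves against the cloned repo's working
# directory, which is why the local "./data" may not be found there.
data_path = os.path.abspath("./data")
if not os.path.isdir(data_path):
    raise FileNotFoundError(f"No such file or directory: {data_path}")
`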
Hi SuccessfulKoala55 ,
Does running_remotely() return True even if the task was enqueued from the UI and not by execute_remotely()?
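For context, a minimal sketch of that kind of check, assuming Task.running_locally() (the inverse of the behaviour being asked about) is available in the installed SDK version; names are placeholders:
`
from trains import Task

task = Task.init(project_name="my_project", task_name="remote_check_example")

# Assumption: the result is the same whether the task was enqueued from the
# UI or re-launched via execute_remotely() - in both cases an agent, not the
# original local process, is executing it.
if Task.running_locally():
    print("running in the original local process")
else:
    print("running under an agent")
`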
SuccessfulKoala55 and AppetizingMouse58, thank you very much!!
I have a question about the future:
Could this fix cause problems in a future clearml-server upgrade?
Or what is the best practice for upgrading after doing it?
Hi AgitatedDove14 ,
Sorry for the late response, it was late in my country 🙂.
This is what I am getting:
appuser@219886f802f0:~$ sudo su root
root@219886f802f0:/home/appuser# whoami
root
AgitatedDove14 Hi, sorry for the long delay.
I tried to use 0.16 instead of 0.13.1.
I didn't have time to debug it (I am overwhelmed with work right now).
But it doesn't work the same as 0.13.1. I am still getting some hanging in my eval process.
I don't know if it is just slower or really stuck, since I killed it and moved back to 0.13.1 until my busy period passes.
Thanks
I am running trains-server on AWS with your AMI (instance type t3.large).
The server runs very well and works amazingly!
Until we start to run more trainings in parallel (around 20).
Then the UI becomes very slow and we often get timeouts.
Can upgrading the instance type help here? Or is there some limit on parallel runs?
Thanks, I just want to avoid giving the credentials to every user.
If it isn't possible, I will do it..
Thanks!! you are the best..
I will give it a try when the runs finish
OK, thanks for the answer.. I will use task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)
as you suggested for now..
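For completeness, a minimal sketch of that call in context (the 1800-second value is the one from this thread; the project and task names are placeholders):
`
from trains import Task

task = Task.init(project_name="my_project", task_name="resource_monitor_example")

# Per the suggestion above: give the resource monitor an 1800-second window
# (measured from the start of the task) before it switches its machine-stats
# reporting from wall-clock time to iterations.
task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)
`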
If you add something like I suggested, can you notify me?
WOW.. Thanks 💯
Yes this is what we are doing 👍
I am sure you added this timeout for a reason.
Probably since increasing the timeout can affect other functionality.
Am I wrong?
I updated to the new version 0.16.1 a few weeks ago and it works, using elastic_upgrade.py