hooo now I understand, thanks for clarifying AgitatedDove14 !
So previous_task actually ignored the output_uri
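For context, a minimal sketch (not from this thread, all names are placeholders) of forcing the destination explicitly via output_uri when the task is created:

```python
from clearml import Task

# output_uri set explicitly at init time so artifacts and models are uploaded
# to the intended destination instead of the default files server.
task = Task.init(
    project_name="examples",              # placeholder project
    task_name="train",                    # placeholder task name
    output_uri="s3://my-bucket/clearml",  # placeholder bucket/prefix
)
```

The same property can also be set after the fact with task.output_uri = "s3://my-bucket/clearml" if the task already exists.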
The number of documents in the old and the new env is the same though 🤔 I really don’t understand where this extra used space comes from
but I also make sure to write the trains.conf to the root directory in this bash script:

echo "
sdk.aws.s3.key = ***
sdk.aws.s3.secret = ***
" > ~/trains.conf

...

python3 -m trains_agent --config-file "~/trains.conf" ...
Hi @<1523701205467926528:profile|AgitatedDove14>, I want to circle back on this issue. It is still relevant, and I could collect the following on an EC2 instance where a clearml-agent was running a stuck task:
- There seems to be a problem with multiprocessing: Although I stopped the task, there are still so many processes forked from the main training process. I guess these are zombies. Please check the htop tree.
- There is a memory leak somewhere, please see the screenshot of datadog mem...
If I remove security_group_ids and leave only subnet_id in the configuration, it is not taken into account (the instances are created in the default subnet)
I also tried setting ebs_device_name = "/dev/sdf" - didn't work
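For reference, a hedged sketch of how these fields could sit together in the AWS autoscaler's resource configuration - the exact key names and nesting are assumptions based on this thread, not a verified schema:

```python
# Hypothetical resource configuration for the ClearML AWS autoscaler.
# security_group_ids, subnet_id and ebs_device_name mirror the fields
# discussed above; all IDs below are placeholders.
resource_configurations = {
    "aws_gpu_machine": {
        "instance_type": "g4dn.xlarge",
        "availability_zone": "us-east-1b",
        "ami_id": "ami-0123456789abcdef0",
        "ebs_device_name": "/dev/sdf",
        "ebs_volume_size": 100,
        "security_group_ids": ["sg-0123456789abcdef0"],
        "subnet_id": "subnet-0123456789abcdef0",
    }
}
```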
SuccessfulKoala55 I found the issue thanks to you: I changed the domain a bit but didn’t update the apiserver.auth.cookies.domain setting - I updated it, restarted, and now it works 🙂 Thanks!
but not as much as the ELB reports
but if you do that and the package is already installed, it will not install from the git repo - this is an issue with pip
Exactly, that’s my problem: I want to remove it to make sure it is reinstalled (because the version can change)
I think that since the agent installs everything from scratch it should work for you. Wdyt?
With env caching enabled, it won’t reinstall this private dependency, right?
So probably only the main process (rank=0) should attach the ClearMLLogger?
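A minimal sketch of that idea, assuming a standard PyTorch DDP launch where the rank is exposed through the RANK environment variable (the import path may differ between pytorch-ignite versions):

```python
import os

from ignite.contrib.handlers.clearml_logger import ClearMLLogger

# Only the main process (rank 0) creates and attaches the ClearMLLogger;
# the other workers skip it so they don't all report to the same task.
rank = int(os.environ.get("RANK", "0"))

clearml_logger = None
if rank == 0:
    clearml_logger = ClearMLLogger(
        project_name="examples",   # placeholder project
        task_name="ddp-training",  # placeholder task name
    )
    # e.g. clearml_logger.attach_output_handler(trainer, ...) goes here,
    # guarded by the same rank check.
```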
yes -> but I still don't understand why the post_packages didn't work, could be worth investigating
Both ^^, I already adapted the code for GCP and I was planning to adapt to Azure now
Awesome, thanks WackyRabbit7 , AgitatedDove14 !
That said, v1.3.1 is already out, with what seems like a fix:
So you mean 1.3.1 should fix this bug?
Yes that’s correct - the weird thing is that the error shows the right detected region
After I started clearml-session
SmugDolphin23 Actually adding agent.python_binary didn't work, it was not read by the clearml agent (in the logs dumped by the agent, agent.python_binary = (no value))
automatically promote models to be served from within clearml
Yes!
AgitatedDove14 Yes, with the command you shared I can now ssh manually to the agent again, but clearml-agent will still raise the same error
- Stopping the server
- Editing the docker-compose.yml file, adding the logging section to all services
- Restarting the server
Docker-compose freed 10GB of logs
I am confused now because I see in the master branch, the clearml.conf file has the following section:

# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
So it states that IAM role using metadata service should be supported, right?
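For illustration, a small sketch (not from the thread) of what the Boto3 credentials chain does when no explicit key/secret is configured:

```python
import boto3

# With use_credentials_chain enabled, no key/secret is passed explicitly, so
# boto3 falls back to its default resolution order: environment variables,
# shared credentials file, and finally the EC2/ECS metadata service (IAM role).
session = boto3.Session()
credentials = session.get_credentials()

# credentials.method reports which source was picked, e.g. "env",
# "shared-credentials-file" or "iam-role".
print(credentials.method if credentials else "no credentials resolved")
```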
the api-server shows when starting:

clearml-apiserver | [2021-07-13 11:09:34,552] [9] [INFO] [clearml.es_factory] Using override elastic host
clearml-apiserver | [2021-07-13 11:09:34,552] [9] [INFO] [clearml.es_factory] Using override elastic port 9200
...
clearml-apiserver | [2021-07-13 11:09:38,407] [9] [WARNING] [clearml.initialize] Could not connect to ElasticSearch Service. Retry 1 of 4. Waiting for 30sec
clearml-apiserver | [2021-07-13 11:10:08,414] [9] [WARNING] [clearml.initia...
So it is there already, but commented out, any reason why?
Trying your code now… should take a couple of mins
but if the task is now running on an agent, isn’t that a possible source of conflict? I would expect that after calling Task.enqueue(exit=True), the local task is closed and no processes related to it are running