Eureka! The problem is resolved by doing a git push. Somehow git diff didn't capture the difference in requirements.txt in the project. I can't reproduce the issue after this either.
I'm not familiar with Elasticsearch. What role does it play in ClearML?
[root@2c7498711bef elasticsearch]# curl
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-05-22T11:33:38.932Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisi...
Likely network. Can you run a curl against the ClearML API server from the Jenkins stage and see if that gets through?
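Something like this from the Jenkins stage would confirm basic reachability (a sketch assuming the default API server port 8008; substitute your own server host):

  curl -sI http://<clearml-server-host>:8008/

Any HTTP response at all means the network path from Jenkins to the API server is open; a timeout or "connection refused" points back at networking.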
So the clearml-agent daemon needs higher privileges?
Thanks that did solve the problem, the tasks are running again.
And yes, there is stuff in there. In fact, it's been running for a few weeks with no issues. This appears to have happened after I added new workers, though I can't be sure that's the cause. Is there a limit to the number of workers I can add in the community edition?
I would say it's intermittent.
Ok, let me check this out first thing on Monday, thanks AgitatedDove14 .
Ok thanks.
Hi SuccessfulKoala55 , just to add, my clearml.conf (client) and clearml.agent.conf (agent) can have differing values. I'm not sure which one takes precedence, and whether this could be the cause.
Hi AgitatedDove14 , I've got the same error. It would appear that the code references clearml_agent/helper/base.py,
which I believe is part of clearml-agent v0.17.1. Could that be the issue?
python k8s_glue_example.py --queue gpu --namespace default
Traceback (most recent call last):
  File "k8s_glue_example.py", line 86, in <module>
    main()
  File "k8s_glue_example.py", line 80, in main
    namespace=args.namespace,
  File "/home/administrator/clearml-agent-k8s/venv/lib/python3.6/site-packages/clearml_agent/helper/base.py", line 239, in __call__
    cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'base_pod...
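If the older agent is indeed the cause, a hedged first step would be upgrading clearml-agent inside the same virtualenv the glue script runs from (path taken from the traceback above):

  /home/administrator/clearml-agent-k8s/venv/bin/pip install -U clearml-agent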
Where should I indicate it in the configuration?
Any idea?
Nice, what are the names of the talks?
Yeah, that sounds good. But from a user's perspective, especially an untrained one, they wouldn't know what to point to. For example, some may think it's an exe, some a zip bundle, and others any GitHub repo with the word vscode in it.
Hi, we did a check. Only 7.16.1, and 6.8.21 and above, mitigate the attack. What's the current version that ClearML is using?
Hi, building a container with VS Code in it is not possible. If I have an alternative location for the VS Code package, where should I indicate it in the configuration?
Can you please verify that you have all the required packages installed locally?
It's not installed on the image that runs the experiment, but it is reflected in the requirements.txt.
What is the setting of
agent.package_manager.system_site_packages
True.
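For reference, that flag sits under the agent section of clearml.conf on the worker; a minimal sketch:

  agent {
      package_manager {
          # reuse packages already installed in the system environment / docker image
          system_site_packages: true
      }
  }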
It would make sense on a very large resource cluster. Unfortunately we have fewer than 50 GPUs to share. A multi-tenant SaaS would cut the resources into even smaller clusters and not help with efficiency. Or would you have a suggestion?
Hi SuccessfulKoala55 , thanks, tested the patch and it's working as expected now.
Which clearml.conf is it referring to? I'm executing on my client, and the task is then executed remotely by the agent. Both of them have a ~/clearml.conf.
Got that, thanks. Just to better understand: when clearml-data uploads my recursive folder of image data, it converts it into a compressed form with a different folder structure than the original dataset.
When my software pulls the data, I'm returned a str. How would we manipulate the data from there?
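In case it helps, a minimal sketch of how that str is typically used (the dataset ID is a placeholder): get_local_copy() returns a plain string path to a local folder with the original structure restored, so from there it's ordinary filesystem access.

  from pathlib import Path
  from clearml import Dataset

  # The returned value is just a string path to a local, cached copy of the dataset.
  dataset_path = Dataset.get(dataset_id="<your-dataset-id>").get_local_copy()

  # From here it is plain filesystem access, e.g. collecting all images recursively.
  images = sorted(Path(dataset_path).rglob("*.jpg"))
  print(f"{len(images)} images under {dataset_path}")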
Can this issue be solved with vault? It doesn't make sense to expose secrets like that.
Hi, we are still not getting the model repo to work, mainly due to clearml.storage failing to save the models.
We tried vanilla boto3 code and it works, but we can't figure out why we get ConnectionResetError 104 when ClearML does it.
How do we configure ClearML to match the following boto3 code?
s3 = boto3.resource('s3', endpoint_url='https://ecs.ai', aws_access_key_id='mykey', aws_secret_access_key='mysecret', config=Config(signature_version='s3v4'), region_name='us-east-1', ve...
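In case it helps, a rough clearml.conf sketch that mirrors that boto3 call; host, key and secret are placeholders, and the exact values may need adjusting for your ECS deployment:

  sdk {
      aws {
          s3 {
              credentials: [
                  {
                      host: "ecs.ai:443"       # endpoint_url, as host:port
                      key: "mykey"             # aws_access_key_id
                      secret: "mysecret"       # aws_secret_access_key
                      region: "us-east-1"
                      multipart: false
                      secure: true             # https endpoint
                  }
              ]
          }
      }
  }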
So the context I'm asking in is: I realise I'll need to catalogue all the dataset IDs created by people separately in a spreadsheet. And for each experiment, I'll need to go into the code commit to see which ID is being used. But on the other hand, I thought I've seen advertised use cases where the experiment can be directly linked to the dataset ID being used. My brain's a bit rusty on how it was done.
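One way that linkage is usually achieved (a sketch; the parameter name is my own choice) is to register the dataset ID as a task parameter when the experiment fetches the data, so each experiment carries a record of the dataset it used instead of it living only in a spreadsheet:

  from clearml import Task, Dataset

  task = Task.init(project_name="my-project", task_name="train")

  # Connecting the dict makes dataset_id visible (and searchable) on the experiment
  # in the UI, and it can be overridden when the task is cloned and re-run.
  params = {"dataset_id": "<your-dataset-id>"}
  task.connect(params)

  data_path = Dataset.get(dataset_id=params["dataset_id"]).get_local_copy()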