Reputation
Badges 1
282 × Eureka!Thanks SuccessfulKoala55 , how might I do this clean up? Does this increase with more use of ClearML? And to add, we save all artifacts onto a remote S3 server.
yes its on purpose, each user would have their own AWS credentials for default_output_uri.
what feature on this paid roadmap are you referring to? I am indeed communicating with Noem on paid features.
Hi SuccessfulKoala55 , thanks. Opened issue on the CLearml-Agent GH at https://github.com/allegroai/clearml-agent/issues/67
python k8s_glue_example.py --queue gpu --namespace default
Traceback (most recent call last):
File "k8s_glue_example.py", line 86, in <module>
main()
File "k8s_glue_example.py", line 80, in main
namespace=args.namespace,
File "/home/administrator/clearml-agent-k8s/venv/lib/python3.6/site-packages/clearml_agent/helper/base.py", line 239, in _ call _
cls. instances[cls] = super(Singleton, cls). call_(*args, **kwargs)
TypeError: _ init _() got an unexpected keyword argument 'base_pod...
the hackathon is 3 days.
What type of pipeline steps are you running? From task, decorator or function?
We were trying with 'from task' at the moment. But the question apply to all methods.
If they're all running on the same container why not make them the same task and do things in parallel?
The tasks were created by different teams and their tasks content is rather independent and modular. Usage of them is usually optional. For example, task1 performs 'image whitening', task2 performs 'image resize'.
Ok that works. thanks.
Then you pass the tolerations definition through a different pod template?
Yup.
Thanks this would be a good alternative before the enterprise version comes in. How is this different from argparser btw?
I'm using this feature, in this case i would create 2 agents, one with cpu only queue and the other with gpu queue. And then at the code level decide with queue to send to.
Some breakthrough. The problem is because we switched the web, api and files server to use https (ssl) endpoint instead. I had switched back to http end points to test this theory.
Although its not printing the error, i suspect its not able to connect due to lack of the self signed cert. Previously this wasn't an issue, not sure what changed in clearml_agent=1.1.0.
There's a secondary issue resulting, i will put this on a new thread.
Hi, we are still not getting the model repo to work, mainly due to clearml.storage failing to save the models.
We tried a vanilla boto3 code and it works, but we can't figure out why we get connectionreseterror 104 when clearml does it.
How do we configure clearml in correspondence to following boto code?
S3= boto3.resource('s3', endpoint_url=' https://ecs.ai ', aws_access_key_id='mykey', aws_secret_access_key='mysevret', config=Config(signature_version='s3v4'), region_name='us-east-1', ve...
Likely network. Can you run a curl on ClearML server api server from jenkin stage and see if that gets through?
Ok thanks, looking forward to it. Would you advise on the bug you encountered?
Hi, it make sense if i only had to change hyperparameters, but it's not so when i am still changing the model architecture (training code) and train and repeat.
Thanks. That's easy to miss as its not quite apparent in the main docs. How should i pass in env variables with Task?
Hi SuccessfulKoala55 I was refering to the Task.init() or any other SDK API that we use in our training codes.
and out of curiosity, what did you think we were talking about? cos i didn't see anywhere else that might print the secrets.
alright thanks. Its impt we clarify it works before we migrate the ifra.
Hi just wondering if I did something wrong here. Would k8s-glue be the reason is not working? I'm purchasing the enterprise version and if vault has the same problem it'll be a big issue.
Sorry i don't quite understand this. The task itself was submitted as I run the code on the client. I suppose the dependancies requirements would be copied over as the experiment is cloned?
I've been reading the documentation for a while and I'm not getting the following very well.
Given an open source codes say, huggingface. I wanted to do some training and i wanted to track my experiments using ClearML. The obvious choice would be to use Explicit Reporting in ClearML. But the part on sending my training job. and let ClearML orchestrate is vague. Would appreciate if i can be guided to the right documentation on this.
Hi we did a check. Only 7.16.1 and 6.8.21 and above mitigates the attack. What's the current version that ClearML is using?
Thought this looked familiar.
https://clearml.slack.com/archives/CTK20V944/p1635323823155700?thread_ts=1635323823.155700&cid=CTK20V944
where should i indicate in the configuration?
Any idea?
Hi. Anything that can point to activity by user.
Hi, by agent logs i suppose you meant the logs from the ClearML server console panel?