Reputation
Badges 1
282 × Eureka!Thanks SuccessfulKoala55 . I can try my hand on a patch. But the pod spinning is handled by the k8s glue, which has no link to the client side. How should the client pass the key over to k8s glue during runtime via clearml server?
Hi, i will have to get back to you again. Need to check every client's repo to determine your hypothesis.
No i didn't indicate this particular issue on the git issue. Only the apply template.yml is on the issue.
[root@2c7498711bef elasticsearch]# curl -XGET
`
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 4hAFNtGkRr-CHNGnUYfbTA 1 1 4724 271 660.9kb 660.9kb
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb
red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1
red open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_...
What's the diff between template-yaml and --overrides-yaml? I used the latter to ensure the gpu is passed in.
This is probably the whole script.
kubectl get nodes
pip install clearml-agent
python k8s_glue_example.py
We are using k8s glue to spawn the job. Would you be able to advise in detail of steps on what goes on when the above code executes?
Yes it is! But ClearML didn't support multi node training out of the box in a way that it streamline the process. So we are trying to figure out a way to do it.
Space is way above nominal. What created this folder that it's trying to process? What processing is this?Processing /tmp/build/80754af9/attrs_1604765588209/work
Is there any paths in the agent machine that i can clear out to remove any possible issues from previous versions?
Ok sure. Thanks.
thanks GrumpyPenguin23 , i'll look deeper on that. This kinda fits what i am looking for but its for TRAINS and there's no technical how-to.
https://clear.ml/blog/stop-using-kubernetes-for-ml-ops/
Thanks. That's easy to miss as its not quite apparent in the main docs. How should i pass in env variables with Task?
Ok, i guess i will have to kill the whole thing and refresh it.
I meant the dataset id.
What type of pipeline steps are you running? From task, decorator or function?
We were trying with 'from task' at the moment. But the question apply to all methods.
If they're all running on the same container why not make them the same task and do things in parallel?
The tasks were created by different teams and their tasks content is rather independent and modular. Usage of them is usually optional. For example, task1 performs 'image whitening', task2 performs 'image resize'.
and yes, there are stuff in there. In fact its been running for a few weeks with no issue. This appears to have happened after i added new workers, though i can't be sure this is the cause. Is there a limit to the number of workers that i can add for community edition?
Does the enterprise version support natively?
Yes of cos, its a long one.
I've been reading the documentation for a while and I'm not getting the following very well.
Given an open source codes say, huggingface. I wanted to do some training and i wanted to track my experiments using ClearML. The obvious choice would be to use Explicit Reporting in ClearML. But the part on sending my training job. and let ClearML orchestrate is vague. Would appreciate if i can be guided to the right documentation on this.
Having same issues. Looks like Google DNS can't resolve the DNS at all.
` %nslookup app.clear.ml - 8.8.8.8
Server: 8.8.8.8
Address: 8.8.8.8#53
** server can't find app.clear.ml: SERVFAIL `
Hi. Anything that can point to activity by user.
Hi, is this currently not working? http://app.community.clear.ml ? I noticed that cleaml UI will cache on the browser and if the backend is not running, its not clear to user that something is wrong (except for broken pages).
For example, it would useful to integrate https://github.com/whylabs/whylogs#features into ClearML as part of data and model monitoring. WhyLogs would have their own static page that would preferably be displayed as a new custom tab (besides logs, scalars and plots.).
Thanks SuccessfulKoala55 . Just pm'ed him.
Ok that worked. So every time i have changes in codes, i will have to rerun the experiment on my own machine that doesn't have any GPUs?
Kinda defeat the purpose of using ClearML Agent.
Any idea where i can find the relevant API calls for this?
[root@2c7498711bef elasticsearch]# curl
`
{
"index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
"shard" : 0,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2021-05-22T11:33:38.932Z",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisi...
Ok thanks, that worked.
Sorry AgitatedDove14 i missed your reply. So this means that in the community version, when i have an experiment using clearml and it uses clearml datasets SDK, the dataset id that was used will not be reflected on the clearml experiment in any way, thus making it impossible to establish Data Lineage/Provenance. (E.g. Link data used to experiment). This feature is however available in the Enterprise Version as HyperDatasets. Am i correct?
Code example.
` from clearml import Task, Logger
tas...
[root@2c7498711bef elasticsearch]# curl
`
{
"cluster_name" : "clearml",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 4,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 8,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" ...