Ok that works. Thanks.
Any idea where i can find the relevant API calls for this?
Hi, any idea if i can achieve this? I just need a list of usernames.
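For the username list asked about above, here is a minimal sketch. It assumes the `APIClient` wrapper that ships inside the `clearml` package, that its `users.get_all` endpoint is reachable with your credentials, and that each returned record exposes a `name` field. `extract_names` and `list_usernames` are hypothetical helper names, not part of the SDK.

```python
from typing import Any, Iterable, List


def extract_names(users: Iterable[Any]) -> List[str]:
    """Collect the `name` field from user records (objects or dicts)."""
    names = []
    for user in users:
        if isinstance(user, dict):
            name = user.get("name")
        else:
            name = getattr(user, "name", None)
        if name:
            names.append(name)
    return sorted(names)


def list_usernames() -> List[str]:
    """Query the server for users. Needs a configured clearml.conf."""
    # Imported lazily so extract_names can be used without clearml installed.
    # Assumption: APIClient and users.get_all exist in your SDK/server version.
    from clearml.backend_api.session.client import APIClient

    client = APIClient()
    return extract_names(client.users.get_all())
```

Calling `list_usernames()` on a machine with valid credentials should return the names sorted alphabetically; records without a name are skipped.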
If you can directly access the machine running the agent, yes you could. If not, a reverse proxy is in the works
Hi AgitatedDove14 , i might have misunderstood your previous comment above. Do you mean that, regardless of whether X forwarding is configured, clearml-session can only work if we have direct access to the Kubernetes worker when we run the K8S glue?
We did some testing today and clearml-session tried to tunnel with a k8s cluster ip, and thus failed.
If we set up an ingress with Me...
Is there any way to see an error log from that?
Unfortunately, due to security, clients can't have direct access to the nodes. Are there any possible workarounds at the moment?
Hi AgitatedDove14 , do you mean the configuration tab in the UI? No, i don't see it.
Hi yes, still getting the SSL errors. It looks like some incompatibility with the OS SSL libraries.
It didn't work as expected.
` task init
task report iter 10
task init
task report iter 10 `
The second task pushed the reporting iteration to 20 instead.
Hi, for both of them, args.lastiter is the exact same value. But when plotted out, they are actually 2 iterations apart.
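One possible explanation for the shift described above is that a continued task adds an initial iteration offset to every later report. The sketch below assumes that this is the cause, that `Task.set_initial_iteration` is available in your SDK version, and that resetting the offset to 0 is what you want. `shifted_iteration` and `report_without_offset` are hypothetical helpers.

```python
def shifted_iteration(reported: int, initial_offset: int) -> int:
    """Iteration as it would be stored if an offset is added to each report."""
    return reported + initial_offset


def report_without_offset(project, task_name, task_id, title, series, value, iteration):
    """Continue an existing task and report a scalar at the given iteration."""
    # Imported lazily so the arithmetic helper works without clearml installed.
    from clearml import Task

    task = Task.init(
        project_name=project,
        task_name=task_name,
        continue_last_task=task_id,
    )
    # Assumption: clearing the offset makes `iteration` land where requested.
    task.set_initial_iteration(0)
    task.get_logger().report_scalar(
        title=title, series=series, value=value, iteration=iteration
    )
    return task
```

With an offset of 10, a report at iteration 10 would be stored at `shifted_iteration(10, 10)`, which matches the jump to 20 seen earlier.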
Just to put a ping for those on this side of the timezone to look at. Thanks.
Hi TimelyPenguin76 , i am adding a debug sample to an existing task using the above method. What should i put for the iteration? I do not want to overwrite existing ones but i do not know what's the last count. This is for both scalar and media reporting.
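For appending to an existing task without knowing the last count, here is a minimal sketch. It assumes `Task.get_task` and `Task.get_last_iteration` are available in your SDK version and that the task is in a state that still accepts reports. `next_free_iteration` and `append_scalar` are hypothetical helpers, and using "last + 1" is only one possible convention.

```python
from typing import Optional


def next_free_iteration(last_iteration: Optional[int]) -> int:
    """Pick an iteration after the last one reported, or 0 if unknown."""
    if last_iteration is None or last_iteration < 0:
        return 0
    return last_iteration + 1


def append_scalar(task_id: str, title: str, series: str, value: float) -> int:
    """Report one scalar after the last known iteration of an existing task."""
    # Imported lazily so next_free_iteration works without clearml installed.
    from clearml import Task

    task = Task.get_task(task_id=task_id)
    iteration = next_free_iteration(task.get_last_iteration())
    task.get_logger().report_scalar(
        title=title, series=series, value=value, iteration=iteration
    )
    return iteration
```

The same iteration value could be passed to a media report so the debug sample and the scalar line up.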
Thanks TimelyPenguin76 , let me try it out now.
Ok thanks.
Hi AgitatedDove14 , thanks.
In this case i am running k8s glue (machine glue), which will then spawn off pods in the kubernetes worker (machine worker). So when you say direct access, are you referring to the Glue machine or the K8S Worker machine?
In the ClearML config that's being run by the ClearML container?
clearml=1.0.3
python=3.8.10
clearml-data upload --id 12314jhg42342j4j --storage
http://ecs.ai is an on-prem DELL EMC ECS that serves as our S3 storage, configured with a self-signed cert.
Sorry, by "dev end" I was referring to my developers.
I didn't think Horovod needs to be as complicated as you described. It can also work by running on multiple known nodes. How would i add a glue for multinode?
Horovod also works with other similar products such as yours (e.g. Polyaxon).
Hi,
basically i ran this block first and ended the script.
` task = Task.init(project_name="afro-nmt", task_name=args.taskname, continue_last_task=args.taskid)
Logger.current_logger().report_scalar(title="BLEU",series="JW300",value=args.jwbleu, iteration=args.lastiter) `
Then i ran another script, with a different series.
` task = Task.init(project_name="afro-nmt", task_name=args.taskname, continue_last_task=args.taskid)
Logger.current_logger().report_scalar(title="BLEU",series="SS900",value=arg...
Hi, currently the ClearML SDK only supports python. If i want to run my ML in other languages, can i use an SDK in that language? Or are there other means, such as Web API calls, that do the same as the SDK?
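To illustrate the Web API route, here is a minimal sketch of plain HTTP calls, shown in Python but using nothing language-specific. It assumes the server exposes `auth.login` with HTTP basic auth (access key and secret key) returning a token under `data.token`, and that later calls are POSTs with a bearer token. The helper names, the `http://localhost:8008` address, and the example payload are assumptions for illustration only.

```python
import base64
import json
import urllib.request


def build_login_request(api_server: str, access_key: str, secret_key: str):
    """Build the request that exchanges key/secret for a session token."""
    raw = f"{access_key}:{secret_key}".encode("utf-8")
    headers = {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    url = api_server.rstrip("/") + "/auth.login"
    return urllib.request.Request(url, headers=headers, method="POST")


def build_call_request(api_server: str, token: str, endpoint: str, payload: dict):
    """Build an authenticated JSON POST to an endpoint such as 'tasks.get_all'."""
    headers = {
        "Authorization": "Bearer " + token,
        "Content-Type": "application/json",
    }
    url = api_server.rstrip("/") + "/" + endpoint
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def call(api_server, access_key, secret_key, endpoint, payload):
    """Log in, then call one endpoint. Response layout is an assumption."""
    login = build_login_request(api_server, access_key, secret_key)
    with urllib.request.urlopen(login) as resp:
        token = json.load(resp)["data"]["token"]
    request = build_call_request(api_server, token, endpoint, payload)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["data"]
```

Any language with an HTTP client and a JSON library could follow the same two steps.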
Thanks could you share the URL to this full API documentation?
Hi. Yup, with ip:port the model was not physically uploaded into the bucket. ClearML does indicate that it's there, but I can't download it. I also verified with another S3 client that the model was not there.
Hi, when i tried ip:port, it references the right host and bucket....BUT... the file is not found on the ECS S3 even though i can see from the logs that it states Completed model upload to s3://ecs.ai:80/clearml-models/artifacts/ ...
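When debugging a case like this, it can help to check how the destination URL splits into endpoint, bucket, and key. The sketch below assumes the `s3://host:port/bucket/key` layout that the log line above suggests. `split_s3_url` is a hypothetical helper, and the object key used in the usage note is made up for illustration.

```python
from urllib.parse import urlsplit


def split_s3_url(url: str) -> dict:
    """Split an s3://host:port/bucket/key URL into its parts."""
    parts = urlsplit(url)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3 URL: {url!r}")
    # The first path segment is taken as the bucket, the rest as the key.
    bucket, _, key = parts.path.lstrip("/").partition("/")
    return {
        "host": parts.hostname,
        "port": parts.port,
        "bucket": bucket,
        "key": key,
    }
```

For example, `split_s3_url("s3://ecs.ai:80/clearml-models/some/hypothetical/key.bin")` gives host `ecs.ai`, port `80`, bucket `clearml-models`, and key `some/hypothetical/key.bin`, which can then be compared against what the other S3 client sees.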
Hi,
I'm running on Dell ECS storage appliance, which offers S3 compatibility.
yes http://ECS.ai is the DNS name of the server.
ClearML-models is the bucket.
Let me try with ip:port.
No, i can't see the files. But i can see if i don't use ':port' in the URL when uploading. I can't access the machine today, i'll try to check the S3 logs when i'm back.
ah ok, so if i see Jax's workspace on https://app.community.clear.ml/dashboard , then i'm on the right track? How regularly does this reset then?
like create multiple datasets?
create parent (all) - upload to S3
create child1 (first 100k)
create child2 (second 100k)...blah blah
Then only pull indices from children. Technically workable but not sure if it's the best approach since different ppl have different batch sizes in mind.
Hi thanks for the examples! I will look into them. Quite a few of my teams use tf datasets to pull data directly from object stores, so tfrecords and stuff are heavily involved. I'm trying to figure out whether they should version the raw data or the tfrecords with ClearML, and whether downloading the entire set of data to local can be avoided, as tf datasets is able to handle batch downloading quite well.
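The split into 100k children described above can be sketched as plain index ranges. This only illustrates the chunking arithmetic; how each range would map onto a ClearML dataset is left open, and `chunk_ranges` is a hypothetical helper.

```python
from typing import Iterator, Tuple


def chunk_ranges(total: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs, end exclusive, covering `total` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total, chunk_size):
        # The last chunk is shorter when total is not a multiple of chunk_size.
        yield start, min(start + chunk_size, total)
```

Since the chunk size is a parameter, the same helper covers teams that have different batch sizes in mind, though each distinct size would imply its own set of children.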