Yes of course, it's a long one.
Thanks SuccessfulKoala55. I verified your last comment and it works.
Thanks, it's attached.
I also noted that the status in the ClearML UI is always 'Pending', unlike others which show 'Running'. Is this a side effect of using the k8s glue?
It didn't work as expected.
` task init
task report iter 10
task init
task report iter 10 `
The second task pushed the reporting iteration to 20 instead.
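Roughly what I mean in actual code (a minimal sketch; project/task names and the scalar title/series are just placeholders):
` from clearml import Task, Logger

# First task: report a scalar at iteration 10
task1 = Task.init(project_name='demo', task_name='first')
Logger.current_logger().report_scalar(title='metric', series='series', value=1.0, iteration=10)
task1.close()

# Second task in the same process: report at iteration 10 again
task2 = Task.init(project_name='demo', task_name='second')
Logger.current_logger().report_scalar(title='metric', series='series', value=2.0, iteration=10)
# Observed: the second task's scalar ends up at iteration 20 instead of 10
task2.close() `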
Any comments on using the global Python libraries without the need to 'pip install' anything?
Hi SuccessfulKoala55, I managed to install clearml-agent==1.0.1rc5. However, the same issues occur.
Hi, I tried the k8s glue on my k8s setup and need some clarification on some of the arguments.
- --queue: Does this only refer to default and service? How can I create a new queue for it to sync with the ClearML server? (see the sketch after this list)
- --ports-mode: I'm not sure what ports mode does. The doc says "add a label to the pod which can be used as service". Which pod is it referring to in the first place?
- All args pertaining to --ports-mode (e.g. base-pod-num, gateway-address, etc.)
- --overrides-yaml: What is the ...
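For the --queue part, I'm assuming a new queue can also be created programmatically via the APIClient before pointing the glue at it (the queue name below is just an example):
` from clearml.backend_api.session.client import APIClient

# Create a new queue on the ClearML server; the k8s glue would then be started with that queue name
client = APIClient()
client.queues.create(name="k8s_gpu_queue") `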
Thanks TimelyPenguin76 , let me try it out now.
Sorry AgitatedDove14, I missed your reply. So this means that in the community version, when I have an experiment using ClearML and it uses the ClearML Datasets SDK, the dataset ID that was used will not be reflected on the ClearML experiment in any way, thus making it impossible to establish data lineage/provenance (e.g. link the data used to the experiment). This feature is, however, available in the Enterprise version as HyperDatasets. Am I correct?
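In the meantime I'm thinking of recording the lineage manually, something like this (a sketch; project and dataset names are placeholders):
` from clearml import Task, Dataset

task = Task.init(project_name='DETECTRON2', task_name='Train')

# Fetch the dataset and record its ID on the experiment so the link is at least visible
dataset = Dataset.get(dataset_project='my_datasets', dataset_name='train_set')
task.connect({'dataset_id': dataset.id})
local_path = dataset.get_local_copy() `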
Code example.
` from clearml import Task, Logger
tas...
[root@2c7498711bef elasticsearch]# curl `
{
"index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
"shard" : 0,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2021-05-22T11:33:38.932Z",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisi...
This would be solved if --env GIT_SSL_NO_VERIFY=true were passed to the k8s pod that's spawned to run the job. Currently it's not.
I thought of another potential way but not sure if the SDK supports it.
We will perform a manual save and upload of the model using vanilla boto3, with credentials passed in as env vars, then use the ClearML SDK to update the Model Repo with the location of the model, without ClearML uploading it explicitly. Would the above work?
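Roughly what I have in mind (a sketch; endpoint, bucket and key names are placeholders, and I'm assuming OutputModel.update_weights(register_uri=...) only registers the location without uploading):
` import os
import boto3
from clearml import Task, OutputModel

task = Task.init(project_name='DETECTRON2', task_name='Train')

# 1. Upload the weights ourselves with vanilla boto3, credentials from env vars
s3 = boto3.client(
    's3',
    endpoint_url=os.environ['S3_ENDPOINT'],
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)
s3.upload_file('model.pt', 'clearml-models', 'detectron2/model.pt')

# 2. Register the already-uploaded location in the ClearML Model Repo (no upload by ClearML)
output_model = OutputModel(task=task)
output_model.update_weights(register_uri='s3://clearml-models/detectron2/model.pt') `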
Hi, the problem is the same.
I noticed that it's not checking out the latest version in GitLab. This latest version would contain the requirements.txt.
` Using cached repository in "/root/.clearml/vcs-cache/pytorchmnist.f220373e7227ec760b28c7f4cd99b534/pytorchmnist"
warning: redirecting to
Note: checking out 'cfb833bcc70f3e10d3b6a96cfad3225ed682382b'. `
But I'm guessing the block below applied the diff... does it include the requirements.txt though?
` HEAD is now at cfb833b Upload New Fil...
Hi, scenario as follows.
1. client.py runs task.execute_remotely(queue_name='myqueue', exit_process=True).
2. The api section of clearml.conf on the client side is read in.
3. The client calls the ClearML server and inserts the task into the queue.
4. The k8s glue retrieves the task from the queue and spawns a k8s pod.
5. The k8s pod performs a git clone.
6. Error: ssh keys not found.
Each individual has their own key in their GitLab profile, and GitLab is configured to only work via ssh.
We can't place the key in the image as this is as good as ...
Ok, I'll wait till I get my hands on vault then. Thanks.
[root@2c7498711bef elasticsearch]# curl `
{
"cluster_name" : "clearml",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 4,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 8,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" ...
Hi SuccessfulKoala55, I was referring to Task.init() or any other SDK API that we use in our training code.
Hi. Nice read. Your permalink is wrong though; here's the right one.
https://cpatrickalves.com/mlops-what-it-is-and-why-does-it-matter
Hi CostlyOstrich36, nothing in particular. I was doing some research and noticed that ML Pipelines were not mentioned even once in the literature, so I wonder if one should be done. I'm looking at other aspects as well, but I'll gradually ask about those.
Hi, this is the setup.
client:
` from clearml import Task, Logger

task = Task.init(project_name='DETECTRON2', task_name='Train', task_type='training')
task.set_base_docker("quay.io/fb/detectron2:v3 --env GIT_SSL_NO_VERIFY=true --env TRAINS_AGENT_GIT_USER=testuser --env TRAINS_AGENT_GIT_PASS=testuser")
task.execute_remotely(queue_name="single_gpu", exit_process=True) `
k8s_glue_example.py spawned a pod and it started running.
ClearML UI -> Experiment -> Results -> Console.
` At the top it will pri...
Hi,
I'm running on a Dell ECS storage appliance, which offers S3 compatibility.
Yes, http://ECS.ai is the DNS name of the server.
ClearML-models is the bucket.
Let me try with ip:port.
My assumption is that the agent will have pulled that off the client's clearml.conf.
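For reference, this is roughly the clearml.conf section I'd expect the agent to pick up (a sketch; host, bucket and keys are placeholders, and I'm assuming this is the sdk.aws.s3.credentials layout):
` sdk {
    aws {
        s3 {
            credentials: [
                {
                    # S3-compatible endpoint (Dell ECS); can also be ip:port
                    host: "ecs.ai:443"
                    bucket: "clearml-models"
                    key: "ACCESS_KEY"
                    secret: "SECRET_KEY"
                    multipart: false
                    secure: true
                }
            ]
        }
    }
} `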
Hi, for both of them, args.lastiter is the exact same value. But when plotted out, they are actually 2 iterations apart.
So these (PIP_INDEX_URL) weren't used when ClearML started running pip.
I think the default behaviour of the clearml-agent k8s glue when running a task is to create a virtual env and install the dependencies. So I'm just checking how to change that behaviour to use the global packages instead.
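Something like this in the agent section of clearml.conf is what I had in mind (a sketch; I'm assuming system_site_packages and extra_index_url are the relevant settings, and the index URL is a placeholder):
` agent {
    package_manager {
        # reuse the packages already installed in the system/global Python
        system_site_packages: true
        # extra pip index, instead of relying on PIP_INDEX_URL being picked up
        extra_index_url: ["https://my.private.pypi/simple"]
    }
} `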
Hi SuccessfulKoala55, just wondering how I can follow up on this.
The server is running only the ClearML components. Could you advise on the ELB part and how we should optimise it?
Hi, it looks like the entire http://clear.ml domain has been offline for more than 12 hours. The main pages and documentation are inaccessible as well.
Oh, this means I have been using the latest agent, which is v1.0.0. The problems were still there.
Can this issue be solved with vault? It doesn't make sense to expose secrets like that.