Hi SuccessfulKoala55, I managed to install clearml-agent==1.0.1rc5. However, the same issues occur.
Hi, I tried the k8s-glue on my k8s setup and needed some clarifications on some of the arguments.
--queue: Does this only refer to default and service? How can I create a new queue for it to sync with the ClearML server?
--ports-mode: I'm not sure what ports mode does. The doc says "add a label to the pod which can be used as service". Which pod is it referring to in the first place? Same question for all the args pertaining to --ports-mode (e.g. base-pod-num, gateway-address, etc.).
--overrides-yaml: What is the ...
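On the --queue question specifically, besides the web UI I understand a queue can also be created programmatically; a minimal sketch, assuming clearml's APIClient exposes the queues.create endpoint (the queue name is just an example):
`
# A sketch, assuming APIClient wraps the server's queues.create call;
# "myqueue" is a hypothetical queue name.
from clearml.backend_api.session.client import APIClient

client = APIClient()
client.queues.create(name="myqueue")
# the k8s glue could then be started with --queue myqueue
`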
Thanks TimelyPenguin76, let me try it out now.
Sorry AgitatedDove14, I missed your reply. So this means that in the community version, when I have an experiment using ClearML and it uses the ClearML Datasets SDK, the dataset ID that was used will not be reflected on the ClearML experiment in any way, thus making it impossible to establish data lineage/provenance (e.g. linking the data used to the experiment). This feature is, however, available in the Enterprise version as HyperDatasets. Am I correct?
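If so, a workaround I'm considering is recording the dataset ID on the task manually; a minimal sketch, assuming the open-source Dataset SDK (the project and dataset names are hypothetical):
`
# A sketch: fetch the dataset and record its ID on the experiment
# so the data-to-experiment link is at least visible in the UI.
from clearml import Task, Dataset

task = Task.init(project_name='examples', task_name='train-with-lineage')

ds = Dataset.get(dataset_project='my-project', dataset_name='my-dataset')
task.connect({'dataset_id': ds.id})
local_path = ds.get_local_copy()
`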
Code example:
`
from clearml import Task, Logger
tas...
[root@2c7498711bef elasticsearch]# curl `
{
"index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
"shard" : 0,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2021-05-22T11:33:38.932Z",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisi...
This would be solved if --env GIT_SSL_NO_VERIFY=true were passed to the k8s pod that's spawned to run the job. Currently it's not.
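Until then, a possible workaround might be to inject the variable through the glue's pod template; a sketch, assuming --overrides-yaml content is merged into the spawned pod spec (the container name is hypothetical):
`
# hypothetical overrides.yaml for the k8s glue's --overrides-yaml argument
spec:
  containers:
    - name: clearml-task   # hypothetical container name
      env:
        - name: GIT_SSL_NO_VERIFY
          value: "true"
`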
I thought of another potential way, but I'm not sure if the SDK supports it.
We would perform a manual save and upload of the model using vanilla boto3 and credentials passed in as env vars, then use the ClearML SDK to update the Model Repo with the location of the model, without ClearML uploading it explicitly. Would the above work?
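Something like this is what I have in mind; a minimal sketch, assuming OutputModel.update_weights can register an already-uploaded URI via register_uri (bucket, key, and env var names are hypothetical):
`
import os
import boto3
from clearml import Task, OutputModel

task = Task.init(project_name='examples', task_name='manual-model-upload')

# Upload the model ourselves with vanilla boto3, credentials from env vars
s3 = boto3.client(
    's3',
    endpoint_url=os.environ['S3_ENDPOINT'],          # hypothetical env var names
    aws_access_key_id=os.environ['S3_ACCESS_KEY'],
    aws_secret_access_key=os.environ['S3_SECRET_KEY'],
)
s3.upload_file('model.pt', 'clearml-models', 'detectron2/model.pt')

# Register the already-uploaded location with ClearML, without ClearML uploading anything
output_model = OutputModel(task=task)
output_model.update_weights(register_uri='s3://clearml-models/detectron2/model.pt')
`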
Hi, the problem is the same.
I noticed that it's not checking out the latest version in GitLab. This latest version would contain the requirements.txt.
`
Using cached repository in "/root/.clearml/vcs-cache/pytorchmnist.f220373e7227ec760b28c7f4cd99b534/pytorchmnist"
warning: redirecting to
Note: checking out 'cfb833bcc70f3e10d3b6a96cfad3225ed682382b'.
`
But I'm guessing this block below applied the diff... does it include the requirements.txt though?
`
HEAD is now at cfb833b Upload New Fil...
Hi, scenario as follows:
1. client.py runs task.execute_remotely(queue='myqueue', exit_process=True).
2. The API section of clearml.conf on the client side is read in.
3. The client side calls the ClearML server and inserts the task into the queue.
4. The K8S glue retrieves the task from the queue and spawns a K8S pod.
5. The K8S pod performs a git clone. Error: ssh keys not found.
Each individual has their own key in their GitLab profile, and GitLab is configured to only work via SSH.
We can't place the key in the image as this is as good as ...
OK, I'll wait till I get my hands on Vault then. Thanks.
[root@2c7498711bef elasticsearch]# curl `
{
"cluster_name" : "clearml",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 4,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 8,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" ...
Hi, nice read. Your permalink is wrong though, here's the right one:
https://cpatrickalves.com/mlops-what-it-is-and-why-does-it-matter
Hi, this is the setup.
client:
`
from clearml import Task, Logger
task = Task.init(project_name='DETECTRON2', task_name='Train', task_type='training')
task.set_base_docker("quay.io/fb/detectron2:v3 --env GIT_SSL_NO_VERIFY=true --env TRAINS_AGENT_GIT_USER=testuser --env TRAINS_AGENT_GIT_PASS=testuser")
task.execute_remotely(queue_name="single_gpu", exit_process=True)
`
k8s_glue_example.py spawned a pod and it started running.
ClearML UI -> Experiment -> Results -> Console:
`
At the top it will pri...
Hi,
I'm running on a Dell ECS storage appliance, which offers S3 compatibility.
Yes, http://ECS.ai is the DNS name of the server.
ClearML-models is the bucket.
Let me try with ip:port.
My assumption is that the agent will have pulled that off the client's clearml.conf.
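For reference, this is the shape of the clearml.conf section I'm assuming applies here; a sketch of the standard sdk.aws.s3 credentials layout for S3-compatible endpoints (host, port, and keys are placeholders):
`
sdk {
    aws {
        s3 {
            credentials: [
                {
                    # placeholder host:port for the S3-compatible endpoint
                    host: "ecs.ai:9000"
                    bucket: "clearml-models"
                    key: "ACCESS_KEY"
                    secret: "SECRET_KEY"
                    multipart: false
                    secure: true
                }
            ]
        }
    }
}
`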
Hi, for both of them, args.lastiter is the exact same value. But when plotted out, they are actually 2 iterations apart.
I think the default action of the clearml-agent k8s glue when running a task is to create a virtual env and install the dependencies. So I'm just checking how to change that behaviour to look at the globally installed packages instead.
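Concretely, what I'm hoping for is something like this in the agent's clearml.conf; a sketch, assuming the standard package_manager settings also apply to glue-spawned pods:
`
agent {
    package_manager {
        # let the task's virtualenv see the packages already installed
        # in the docker image (e.g. torch from an nvcr pytorch image)
        system_site_packages: true
    }
}
`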
Hi SuccessfulKoala55, just wondering how I can follow up on this.
The server is running only the ClearML components. Could you advise on the ELB part, how should we optimise it?
Hi, it looks like the entire http://clear.ml domain is offline for more than 12 hours. Main pages and documentation are inaccessible as well.
Oh, this means I have been using the latest agent, which is v1.0.0. The problems were still there.
Can this issue be solved with Vault? It doesn't make sense to expose secrets like that.
I see. Can I take it that when the client uses task.execute_remotely(queue_name="1gpu", exit_process=True), none of the content in its clearml.conf will be used except for the API part, and ClearML simply uses whatever is on the agent side?
`
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server:
    web_server:
    files_server:
    # Credentials are generated using the webapp,
    # Override with os environment: ...
Yes! I definitely think this is important, and hopefully we will see something there
(or at least in the docs)
Hi AgitatedDove14, any updates in the docs to demonstrate this yet?
It's hard to tell, but the agent change was a significant one. Unless Python versions have something to do with it.
I used an nvcr PyTorch image and instructed ClearML to inherit global dependencies. No need to install torch, and it works well.
Do you mean this?
Removing containers section: [{'image': 'clearml-agent:latest"', 'env': [{'name': 'PIP_INDEX_URL', 'value': ' '},
I'm also noticing a lot of this while the k8s glue is running:
Ex: Expecting value: line 1 column 1 (char 0)
K8S Glue pods monitor: Failed parsing kubectl output:
I see, I understand better now. Thanks.