[root@2c7498711bef elasticsearch]# curl ...
{
"cluster_name" : "clearml",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 4,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 8,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" ...
Hi, nice read. Your permalink is wrong though; here's the right one:
https://cpatrickalves.com/mlops-what-it-is-and-why-does-it-matter
Hi, this is the setup.
Client:

from clearml import Task, Logger

task = Task.init(project_name='DETECTRON2', task_name='Train', task_type='training')
task.set_base_docker("quay.io/fb/detectron2:v3 --env GIT_SSL_NO_VERIFY=true --env TRAINS_AGENT_GIT_USER=testuser --env TRAINS_AGENT_GIT_PASS=testuser")
task.execute_remotely(queue_name="single_gpu", exit_process=True)
k8s_glue_example.py spawned a pod, which started running.
ClearML UI -> Experiment -> Results -> Console.
At the top it will pri...
Hi,
I'm running on a Dell ECS storage appliance, which offers S3 compatibility.
Yes, http://ECS.ai is the DNS name of the server.
ClearML-models is the bucket.
Let me try with ip:port.
My assumption is that the agent will have pulled that off the client's clearml.conf.
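For reference, this is the sort of S3 section I assume gets picked up from the client's clearml.conf (host, key and secret below are placeholders; the bucket matches ours):

sdk {
    aws {
        s3 {
            credentials: [
                {
                    # non-AWS endpoint, e.g. the ECS appliance as ip:port
                    host: "<ip>:<port>"
                    bucket: "clearml-models"
                    key: "<access_key>"
                    secret: "<secret_key>"
                    multipart: false
                    secure: true
                }
            ]
        }
    }
}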
Hi, for both of them, args.lastiter is the exact same value. But when plotted out, they are actually 2 iterations apart.
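In case it matters, this is roughly how I could pin the x value when reporting, so both tasks land on the same iteration (a sketch; lastiter stands in for our args.lastiter):

from clearml import Logger

lastiter = 100  # stand-in for args.lastiter from our script

# Report at an explicit iteration instead of relying on the auto-detected
# counter, which can end up a couple of iterations apart between tasks.
Logger.current_logger().report_scalar(
    title="loss", series="train", value=0.123, iteration=lastiter
)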
I think the default action of the clearml-agent k8s glue when running a task is to create a virtual env and install the dependencies. So I'm just checking how to change that behaviour to use the global (system) packages instead.
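If I understand the docs correctly, something like this in the agent's clearml.conf should do it (hedged; I haven't verified it with the k8s glue yet):

agent {
    package_manager {
        # reuse the packages already installed in the system python
        # instead of installing everything into a fresh venv
        system_site_packages: true
    }
}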
Hi SuccessfulKoala55, just wondering how I can follow up on this.
The server is running only the ClearML components. Could you advise on the ELB part: how should we optimise it?
Hi, it looks like the entire http://clear.ml domain has been offline for more than 12 hours. The main pages and documentation are inaccessible as well.
Oh, this means I have been using the latest agent, which is v1.0.0. The problems were still there.
Can this issue be solved with vault? It doesn't make sense to expose secrets like that.
I see. Can I take it that when the client uses task.execute_remotely(queue_name="1gpu", exit_process=True), none of the content in its clearml.conf will be used except for the api section, and ClearML simply uses whatever is on the agent side?

api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server:
    web_server:
    files_server:
    # Credentials are generated using the webapp, ...
    # Override with os environment: ...
Yes! I definitely think this is important, and hopefully we will see something there
(or at least in the docs)
Hi AgitatedDove14, any updates in the docs to demonstrate this yet?
It's hard to tell, but the agent change was a significant one. Unless Python versions have something to do with it.
I used an nvcr PyTorch image and instructed ClearML to inherit the global dependencies. No need to install torch, and it works well.
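For reference, this is roughly the client-side setup (the image tag is just an example; it assumes the agent is configured to use system site packages, as above):

from clearml import Task

task = Task.init(project_name='DETECTRON2', task_name='Train', task_type='training')
# NGC PyTorch image: torch is already inside the container, so nothing
# needs to be reinstalled when the agent inherits the global packages.
task.set_base_docker("nvcr.io/nvidia/pytorch:21.03-py3")
task.execute_remotely(queue_name="single_gpu", exit_process=True)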
Do you mean this? Removing containers section: [{'image': 'clearml-agent:latest"', 'env': [{'name': 'PIP_INDEX_URL', 'value': ' '},
I'm also noticing a lot of this while the k8s glue is running:
Ex: Expecting value: line 1 column 1 (char 0)
K8S Glue pods monitor: Failed parsing kubectl output:
I see, I understand better now. Thanks.
Hi, thanks.
So I suppose ClearML makes use of the information in the .git folder at the root of the script folder to gather that info.
I have yet to go through ClearML Agent thoroughly. TimelyPenguin76, so if I run a training with uncommitted changes and don't commit/push afterwards, when I clone the task, isn't the ClearML agent unable to pull that script from the git repo?
Hi, I don't think the clearml agent actually ran at that point in time. All I can see in the pod is:
apt install of the libpthread-stubs, libx11, libxau and libxcb1 packages, then pip install of clearml-agent. After the above are successful, the pod just hangs there.
Thanks. Gonna try that out. But I hit another snag: strangely, the agent is not creating the right venv. This is what the agent created:
pip:
- asn1crypto==0.24.0
- attrs==20.3.0
- certifi==2020.12.5
- chardet==4.0.0
- cryptography==2.1.4
- Cython==0.29.22
- furl==2.1.0
- future==0.18.2
- humanfriendly==9.1
- idna==2.6
- importlib-metadata==3.7.0
- jsonschema==3.2.0
- keyring==10.6.0
- keyrings.alt==3.0
- orderedmultidict==1.0.1
- pathlib2==2.3.5
- psutil==5.8.0
- pycrypto==2.6.1
- pygobject...
Hi, so this means that if I want to use Kubernetes, I would have to 'manually' install multiple agents on all the worker nodes?
Which clearml.conf is it referring to? I'm executing on my client, and the task is then remotely executed by the agent. Both of them have a ~/clearml.conf.
Is there any way to see an error log from that?
Got that, thanks. Just to understand better: when clearml-data uploads my recursive folder of image data, it converts it into a compressed form with a different folder structure than the original dataset.
When my software pulls the data, I'm returned a str. How would we manipulate the data from there?
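To make the question concrete, this is what I'm doing now. My understanding is that the returned str is a local path to an extracted copy of the dataset, so it can be walked like any directory (the project/name below are placeholders):

from pathlib import Path
from clearml import Dataset

dataset = Dataset.get(dataset_project="DETECTRON2", dataset_name="images")
local_path = Path(dataset.get_local_copy())  # get_local_copy() returns a str

# Walk the extracted folder like a normal directory tree.
for image_file in local_path.rglob("*.jpg"):
    print(image_file)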
I'm also beginning to think this is related to https://clearml.slack.com/archives/CTK20V944/p1620664770492400. Previously, when I set force_repo_requirements_txt=true and system_site_packages: true, it seemed to work. Upgrading to v1.0.2 seems to change things.
If we run all the rank 0 and rank n tasks individually, it defeats the purpose of using ClearML.