
Hi Jake, sorry I left the office yesterday. On my laptop I have clearml==1.6.4
actually there are some network issues right now, I'll share the output as soon as I manage to run it
sure, give me a couple of minutes to make the changes
Thanks Martin.. I'll add this and check whether it fixes the issue, but I don't quite get this part.. The local code doesn't need to import pandas, because the get method returns a DataFrame object that has a .loc
method.
I was expecting the remote experiment to behave the same way, so why do I need to import pandas there?
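For what it's worth, this is my understanding of the workaround (a minimal sketch; the artifact name "data" and the task id are just placeholders): importing pandas in the script is what lets the agent detect it as a requirement for the remote run.
```python
import pandas as pd  # noqa: F401  -- imported only so the agent adds pandas to the remote env
from clearml import Task

source_task = Task.get_task(task_id="<source_task_id>")  # placeholder task id
df = source_task.artifacts["data"].get()  # returns the stored pandas DataFrame
print(df.loc[0])
```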
thanks a lot 🙂 that was quick 🙂
At this point, I see a new queue in the UI:
If I now reset the experiment and enqueue it to the gpu queue (but in the experiment, the user-properties configuration for k8s-glue is still set to cpu), the experiment is left in a Pending state... and in the K8sGlue Agent for the gpu queue, I can see a similar error to the one in the cpu agent....
` No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
FATAL ERROR:
Traceback (most recent call...
Hi folks, I think I found the issue: the documentation mentions setting NVIDIA_VISIBLE_DEVICES to "", when in reality it should be "none" according to the code:
if Session.get_nvidia_visible_env() == 'none':
    # NVIDIA_VISIBLE_DEVICES set to none, marks cpu_only flag
    # active_gpus == False means no GPU reporting
    self._active_gpus = False
the same version that is available in the agent: clearml==1.6.4
OK, it wasn't the clearml.conf settings...
In the deployment I was referring to the fileserver, apiserver, etc. with the internal Kubernetes DNS names.
I changed them to the ones exposed to the users (the same ones I have in my local clearml.conf) and things work.
But I can't really figure out why that would be the case...
is there a way I can check whether the apiserver etc. are reachable?
(like: https://clearml-apiserver.ds.bumble.dev/health )
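Something like this quick script is what I had in mind (a rough sketch; the apiserver URL is the one from my clearml.conf, the fileserver host below is just an assumption, and the exact paths may not matter as long as the hosts answer):
```python
import requests

# external endpoints to probe (adjust hosts/paths to your deployment)
endpoints = [
    "https://clearml-apiserver.ds.bumble.dev/health",
    "https://clearml-fileserver.ds.bumble.dev",  # assumed fileserver host
]

for url in endpoints:
    try:
        resp = requests.get(url, timeout=5)
        print(f"{url} -> HTTP {resp.status_code}")
    except requests.RequestException as exc:
        print(f"{url} -> unreachable: {exc}")
```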
perfect, let me try 🙂 (thanks a lot for all the help!)
Ah sorry, I thought you meant what the names of the queues I created were (in case I used some weird character or something like that)
and in the logs of the K8s Glue I see an exception occurred:
` No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
FATAL ERROR:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", l...
Right now I see the default agent that comes with the helm chart...
About .get_local_copy... would that then work in the agent though?
Because I understand that there might not be a local copy in the Agent?
And if instead I want to force get() to return the path (e.g. I want to read the csv with a library other than pandas), is there an option for that?
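Something along these lines is what I'm after (a sketch; the artifact name "data.csv" and the task id are placeholders):
```python
import csv
from clearml import Task

source_task = Task.get_task(task_id="<source_task_id>")  # placeholder task id
# get_local_copy() downloads the artifact (if not cached) and returns a local file path
csv_path = source_task.artifacts["data.csv"].get_local_copy()

# the path can then be read with any csv library, not just pandas
with open(csv_path, newline="") as f:
    rows = list(csv.reader(f))
```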
the queues already exist, I created them through the UI.
If you need to know the value of some of them, let me know CostlyOstrich36. I wanted to avoid leaking access keys etc., so I removed the values
And yes these appear in the dropdown menu when I want to enqueue an experiment
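To double-check on my side, I can also list the queues programmatically (a sketch using the APIClient from the SDK, just to confirm the queues created in the UI exist on the server):
```python
from clearml.backend_api.session.client import APIClient

client = APIClient()
# print every queue the server knows about (name and id)
for queue in client.queues.get_all():
    print(queue.name, queue.id)
```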
But shouldn't the path of the artifacts be a setting of the file server, and not of the agent?
using the --set you advised above, right?
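For context, my understanding (possibly wrong) is that the upload destination is decided on the SDK/agent side rather than by the fileserver itself, roughly like this (hypothetical project/task names and an assumed external fileserver URL):
```python
from clearml import Task

task = Task.init(
    project_name="demo",                                    # hypothetical project
    task_name="artifact-destination",                       # hypothetical task
    output_uri="https://clearml-fileserver.ds.bumble.dev",  # assumed external fileserver URL
)
# the artifact goes to output_uri (or to the files_server from clearml.conf if output_uri is unset)
task.upload_artifact("data.csv", artifact_object="local_file.csv")
```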
OK I could connect with the SDK, so everything is working, I'd just like to get the right hosts shown in the UI when a new token is created
Hi Martin, I'll try to get the logs on Monday, though the K8s configuration doesn't "scare" me, I can solve that with my colleagues.
But I'll share it if it helps debug the issue
Yes, the queue is created when I enqueue the experiment. I took some screenshots and got the logs (there is indeed an error).
Let me share them with you...
And this is the list of variables defined in the K8SGlue pod:
` CLEARML_REDIS_MASTER_PORT_6379_TCP_PROTO
CLEARML_REDIS_MASTER_SERVICE_HOST
CLEARML_REDIS_MASTER_PORT
CLEARML_MONGODB_PORT_27017_TCP
CLEARML_ELASTIC_MASTER_PORT_9300_TCP_PROTO
CLEARML_WEBSERVER_SERVICE_HOST
K8S_GLUE_EXTRA_ARGS
CLEARML_ELASTIC_MASTER_PORT_9300_TCP_PORT
CLEARML_FILESERVER_PORT_8081_TCP_PROTO
HOSTNAME
CLEARML_MONGODB_PORT_27017_TCP_PORT
CLEARML_MONGODB_PORT
CLEARML_ELASTIC_MASTER_SERVICE_PORT
CLEARML_FILESERVER_PORT_...
By the way, after fixing the agentservice issue and having the pod configured correctly, I now see an error in the agentgroup-cpu pod, because it says that the token is not the correct one:
http://:8081 http://:8080
`
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b00438d0>, 'Connection to pypi.org timed out. (connect timeout=15)')':...