Hi FloppyDeer99
What is the meaning of "no real scheduling"?
I think it means that from the moment a k8s job is created, k8s is in charge of actually spinning up the container. Since k8s has no real priority/order, the scheduling order is not guaranteed from this point.
The idea of the clearml-k8s-glue is that the glue will launch a job on the k8s cluster only if it is sure there are enough resources to actually spin up the job now (as opposed to sometime in the future), this mea...
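Conceptually, the glue's decision loop is something like the following (purely illustrative Python, not the actual clearml-agent code; the queue/cluster objects and their methods are hypothetical):

    import time

    def glue_loop(queue, cluster):
        # illustrative only: poll the clearml queue, and hand a job to k8s
        # only when the cluster can actually run it right now
        while True:
            task = queue.peek()  # hypothetical: next pending task, or None
            if task is not None and cluster.has_free_resources(task):  # hypothetical check
                queue.pop()
                cluster.launch_job(task)  # only now does k8s receive the job
            time.sleep(5.0)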
LuckyRabbit93 We do!!!
Failed to initialize NVML: Unknown Error
Yeah, this is a driver issue. I think you need to check whether the drivers in the VM image match the GPU on that machine.
That is odd, can you send the full Task log? (Maybe some oddity with conda/pip ?!)
Hi FriendlyKoala70, you can edit the installed packages section and add the missing package. See more details on how trains-agent works here (although it's about conda, the same rules apply for pip): https://github.com/allegroai/trains-agent/issues/8
True, this is exactly the reason. That said, you can always manually add it. You can see the default values: https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf
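For reference, the relevant defaults sit under the agent.package_manager section of the conf file; roughly like this (see the linked file for the exact current values):

    agent {
        package_manager {
            # package manager to use: pip or conda
            type: pip,
            # additional artifact repositories to use when installing packages
            extra_index_url: []
        }
    }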
I want to build a real time data streaming anomaly detection service with clearml-serving
Oh, so the way it currently works, clearml-serving will push the data in real-time into Prometheus (you can control the stats/input/output), then you can build the anomaly detection in Grafana (for example, alerts on histograms over time are available out-of-the-box, and clearml creates the histograms over time).
Would you also need access to the stats data in Prometheus ? or are you saying you need to process it ...
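For context, telling clearml-serving which stats to push to Prometheus is done through its CLI; roughly along these lines (the service id, endpoint name and bucket values are examples, check the clearml-serving docs for the exact syntax):

    # report input "x0" and output "y" as histograms over time
    clearml-serving --id <service_id> metrics add --endpoint "my_model" \
        --variable-scalar x0=0,0.1,0.5,1 y=0,0.25,0.5,0.75,1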
Seems like passing the Task object is not working as expected (I'll make sure it is fixed).
Try:
    dataset._task.set_parent(Task.current_task().id)
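In context, the workaround would look roughly like this (project/dataset names are examples; note that dataset._task is an internal attribute):

    from clearml import Dataset, Task

    task = Task.init(project_name="examples", task_name="create dataset")  # example names
    dataset = Dataset.create(dataset_name="my_dataset", dataset_project="examples")
    # workaround: explicitly set the current task as the parent of the dataset's task
    dataset._task.set_parent(Task.current_task().id)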
is there a way to visualize the pipeline such that this step is "stuck" in executing?
Yes there is, the pipeline plot (see the Plots section on the Pipeline Task) will show the current state of the pipeline.
But I have a feeling you have something else in mind?
Maybe add a Tag on the pipeline Task itself (then remove it when it continues)?
I'm assuming you need something that is quite prominent in the UI, so that someone notices?
(BTW I would think of integrating it with the slack monitor, to p...
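The Tag idea from above could look roughly like this (the tag name is arbitrary):

    from clearml import Task

    pipeline_task = Task.current_task()
    pipeline_task.add_tags(["step-stuck"])  # make the state prominent in the UI
    # ... once the step continues, drop the tag again
    pipeline_task.set_tags([t for t in pipeline_task.get_tags() if t != "step-stuck"])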
If I put pipe.start earlier in the code, the pipeline fails to execute the actual steps.
pipe.start should be called after the pipeline was constructed and should be the "last" call of the script.
Not sure I follow what is "before" the code?
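To illustrate the ordering: a minimal skeleton (project, step and task names are placeholders):

    from clearml import PipelineController

    pipe = PipelineController(name="my-pipeline", project="examples", version="1.0.0")
    pipe.add_step(name="step_one",
                  base_task_project="examples", base_task_name="step one")
    pipe.add_step(name="step_two", parents=["step_one"],
                  base_task_project="examples", base_task_name="step two")
    pipe.start()  # last call in the script: hands control over to the pipeline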
For setting up trains-server I would recommend the docker-compose; it is very easy to set up, and you just need a single fixed compute instance. Details: https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md
With regards to the "low prio clusters", are you asking how they could be connected with the trains-agent, or if running code that uses trains will work on them?
Thanks @<1694157594333024256:profile|DisturbedParrot38> !
Nice catch.
Could you open a github issue so that at least we output a more informative error?
Oh, in that case add --remote-gateway <external_ip>
It will connect to the provided address instead of the local one. (You can also add --public-ip, which will automatically resolve the public IP of the server.)
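Assuming this refers to the clearml-session CLI, usage would be something like (the IP and queue name are examples):

    clearml-session --queue default --remote-gateway 34.82.13.77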
Yes, as long as the client is served from http://app.something.com it will look for the api server at http://api.something.com
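In clearml.conf terms this maps to the api section (hostnames are examples):

    api {
        web_server: http://app.something.com
        api_server: http://api.something.com
        files_server: http://files.something.com
    }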
If i were to push the private package to, say artifactory, is it possible to use that do the install?
Yes, that's the recommended way
You add the private repo here, for the agent to use:
https://github.com/allegroai/clearml-agent/blob/e93384b99bdfd72a54cf2b68b3991b145b504b79/docs/clearml.conf#L65
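i.e., something along these lines (the URL is a placeholder for your artifactory repository):

    agent {
        package_manager {
            # extra PyPI-compatible index for the agent to install private packages from
            extra_index_url: ["https://artifactory.example.com/api/pypi/my-repo/simple"]
        }
    }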
setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).
This seems like a question for GCS (Google Cloud Storage); maybe we should open an issue there, since their backend does the rate limiting.
My main concern now is that this may happen within a pipeline leading to unreliable data handling.
I'm assuming the pipeline code will have max_workers, but maybe we could have a configuration value so that we can set it across all workers, wdyt?
If
...
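For reference, the call in question would be along these lines (assuming this is Dataset.upload; names are examples):

    from clearml import Dataset

    dataset = Dataset.create(dataset_name="my_dataset", dataset_project="examples")
    dataset.add_files("./data")
    # workaround for the GCS rate-limit error: serialize the uploads
    dataset.upload(max_workers=1)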
Okay, that makes sense. If this is the case, I'm assuming you have set the files server to point to your S3 bucket, is that correct?
Could it be you are missing the credentials for that? (It is trying to upload the preprocessing code there, so the clearml-serving container will be able to pull it later.)
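If that's the case, the credentials go under the sdk.aws.s3 section of clearml.conf; roughly (values are placeholders):

    sdk {
        aws {
            s3 {
                key: "AWS_ACCESS_KEY"
                secret: "AWS_SECRET_KEY"
                region: "us-east-1"
            }
        }
    }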
Ex: Expecting value: line 1 column 1 (char 0)
K8S Glue pods monitor: Failed parsing kubectl output:
Run with --debug as the first parameter
Are you running the latest from the git repo ?
Maybe we should have "sub nodes" as just visual functions running inside the same actual pipeline component?
Hi ExuberantParrot61
Is the pipeline logic code running from inside the repo?
Hi @<1541954607595393024:profile|BattyCrocodile47>
Do you mean to start a remote session, instead of the CLI, directly from the VSCode UI and connect to it? If so, that would be awesome!! We have a remote session from the web where it spins up your remote session and launches vscode inside the container, so you work on it in your browser. But a VSCode plugin is a great idea, do you have a reference to the code of similar plugins?
Hi OddShrimp85
If you pass output_uri=True to Task.init, it will upload the model automatically, or, as you said, manually with the OutputModel class
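Both variants as a quick sketch (project/task/file names are examples):

    from clearml import Task, OutputModel

    # automatic: models saved by the framework are uploaded to the files server
    task = Task.init(project_name="examples", task_name="train", output_uri=True)

    # manual: register and upload a weights file yourself
    output_model = OutputModel(task=task)
    output_model.update_weights("model.pt")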
Great!
BTW: you can take some inspiration from here:
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
Or from the full pipeline:
https://github.com/allegroai/trains/blob/master/examples/pipeline/pipeline_controller.py
This makes no sense to me
Both are reading the exact same file, and using the same session / flow ...
Maybe there is an error with the "verify_certificate" on the agent ?
Are you running the agent in docker mode or venv mode?
Okay, found the issue. To disable SSL verification globally, add the following env variable:
    CLEARML_API_HOST_VERIFY_CERT=0
(I will make sure we fix the actual issue with the config file)
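For example, when launching the agent (the queue name is an example):

    CLEARML_API_HOST_VERIFY_CERT=0 clearml-agent daemon --queue default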
Yes, I can communicate with the server; I managed to put tasks in the queue and retrieve them, as well as run tasks with metrics reporting
Through the UI or python code ?
ChubbyLouse32 could it be the configuration file is not passed to the agent machine itself ?
(Were you able to run anything against this internal server? I mean connecting to it from code, clearml/clearml-agent?)
Is there any known issue with Amazon SageMaker and ClearML?
On the contrary, it actually works better on SageMaker...
Here is what I did on SageMaker:
- created a new SageMaker instance
- opened a Jupyter notebook
- started a new notebook (conda_python3 / conda_py3_pytorch)
Then I just did "!pip install clearml" and Task.init
Is there any difference ?
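In notebook form, roughly (project/task names are examples; Task.set_credentials is one way to provide credentials when there's no clearml.conf on the instance):

    # cell 1
    !pip install clearml

    # cell 2
    from clearml import Task
    Task.set_credentials(api_host="https://api.clear.ml",
                         key="YOUR_KEY", secret="YOUR_SECRET")  # placeholders
    task = Task.init(project_name="sagemaker-tests", task_name="notebook example")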
Hi UnevenDolphin73
If you "remove" the lock file the agent will default to pip.
You can hack it with the uncommitted changes section?