So the clearml server already contains an authentication layer (JWT token), and you do have full user management on top:
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_config#web-login-authentication
Basically what I'm saying is: if you add HTTPS on top of the communication and only open the three ports, you should be good to go. Now, if you really need SSO (AD included) for user login etc., unfortunately that is not part of the open source, but I know they have it in the scale/ent...
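A minimal sketch of the fixed-user web login described in the docs link above; the file path, usernames and passwords are placeholders, not a definitive setup:
# /opt/clearml/config/apiserver.conf (path assumes the default docker-compose layout)
auth {
    fixed_users {
        enabled: true
        users: [
            {
                username: "jane"
                password: "change_me"
                name: "Jane Doe"
            }
        ]
    }
}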
Could it be that someone deleted the file? It is inside the temp venv folder, but it should not get there.
Understood, then I would use Task.execute_remotely()
Basically:
task = Task.init(...)
# configure some stuff
task.execute_remotely(queue_name="queue_name_here")
# everything below this line will be executed on the remote machine only
This will both automatically log your code/repo with Task.init, and the call to task.execute_remotely will stop the local process (on your machine that runs the Hydra sweep) and continue on the remote machine.
This lets you both use the Hydra sweep and schedule/run on a remote machine ...
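A minimal runnable sketch of the idea, assuming a queue named "default" and a hypothetical Hydra config under conf/config.yaml:
import hydra
from clearml import Task
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # each sweep run gets its own Task
    task = Task.init(project_name="examples", task_name="hydra sweep step")
    # stop the local process here and re-launch the Task on an agent
    # listening on the "default" queue (the queue name is an assumption)
    task.execute_remotely(queue_name="default")
    # from this point on, the code only runs on the remote machine
    print(cfg)

if __name__ == "__main__":
    main()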
trains-agent doesn't run the clone, it is pip...
basically calling "pip install git+https://..."
Not sure you can pass extra arguments
Also, this is not a setup problem, otherwise it would have been failing consistently ... this actually looks like a network issue.
The only thing I can think of is retrying the install if we get a network error (not sure what pip's exit code is in that case, though, maybe 9?)
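A rough sketch of that retry idea (the retry count, delay, and treating any non-zero exit code as retryable are assumptions, not what the agent actually does):
import subprocess
import time

def pip_install_with_retry(package: str, retries: int = 3, delay: float = 5.0) -> None:
    # retry "pip install git+https://..." style installs on transient network errors
    for attempt in range(1, retries + 1):
        result = subprocess.run(["python", "-m", "pip", "install", package])
        if result.returncode == 0:
            return
        if attempt < retries:
            time.sleep(delay)
    raise RuntimeError(f"pip install failed after {retries} attempts: {package}")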
Hi ScaryLeopard77
You can probably do:
Task.init(..., continue_last_task='task_id_here')
This will continue a previously executed Task and log both steps in the same place.
Does that help?
BTW: you can also of course manually report to any Task while it is still running with:
aux_task = Task.get_task(task_id_here)
aux_task.get_logger().report_scalar(...)
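Putting the two together, a small sketch (the task ID, project/task names and metric values are placeholders):
from clearml import Task

# continue logging into a previously executed Task
task = Task.init(project_name="examples", task_name="resumed run",
                 continue_last_task="aabbccddee00112233445566778899aa")
task.get_logger().report_scalar(title="loss", series="train", value=0.1, iteration=100)

# or report into another, still running Task from a separate process
aux_task = Task.get_task(task_id="aabbccddee00112233445566778899aa")
aux_task.get_logger().report_scalar(title="loss", series="aux", value=0.2, iteration=101)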
Thanks for answering, Yes, this is exactly what I wanted
Hmm, should be possible. How slow is the update that we want to save time on?
BTW: What's the TF / Keras version?
Uninstall the current clearml-agent and reinstall this wheel, I hacked it to have ==, let's see if that works
ZanyPig66 it sounds like you need to add the docker args for the bind mount; just add to Task.create the argument: docker_args="-v /mnt/host:/mnt/container"
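A minimal sketch of that, with placeholder repo, image and mount paths:
from clearml import Task

# create a Task that will run inside a docker image with a host folder mounted
task = Task.create(
    project_name="examples",
    task_name="docker bind mount",
    repo="https://github.com/user/repo.git",  # hypothetical repository
    script="train.py",
    docker="python:3.9",
    docker_args="-v /mnt/host:/mnt/container",
)
Task.enqueue(task, queue_name="default")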
ShallowGoldfish8 the models are uploaded in the background; task.close() actually waits for them, but wait_for_upload is also a good solution.
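A small sketch of waiting explicitly before exiting (flush with wait_for_uploads is one way to do it; project/task names are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="upload wait demo")
# ... training code that saves model checkpoints ...

# make sure metrics and background model uploads are done before exiting;
# task.close() waits for them as well
task.flush(wait_for_uploads=True)
task.close()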
where it seems to be waiting for the metrics, etc but never finishes. No retry message is shown as well.
From the description it sounds like there is a problem with sending the metrics. task.close() waits for all the metrics to be sent, and it seems that for some reason they are not, which is why close is waiting on them.
A...
MelancholyElk85 assuming we are running with clearml 1.1.1, let's debug the pipeline, and instead of pipeline start/wait/stop:
Let's do:
pipeline.start_locally(run_pipeline_steps_locally=False)
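For context, a hedged sketch of what that debug run could look like (project, step names and base tasks are placeholders):
from clearml import PipelineController

pipeline = PipelineController(name="debug pipeline", project="examples", version="0.0.1")
pipeline.add_step(name="step_one",
                  base_task_project="examples", base_task_name="step one task")
pipeline.add_step(name="step_two", parents=["step_one"],
                  base_task_project="examples", base_task_name="step two task")

# run the pipeline logic locally, while the steps themselves are still
# enqueued and executed by the agents (instead of start()/wait()/stop())
pipeline.start_locally(run_pipeline_steps_locally=False)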
Hi ReassuredTiger98
To separate between minio and S3 we use:
s3://bucket/file for the AWS S3 service, and s3://server:port/bucket/file for minio.
This means that if your S3 links had been s3://<minio-address>:<port>/bucket/file.bin, the UI would have popped the credentials window.
Make sense ?
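For example, a small sketch of pointing a task's output at minio vs. AWS S3 (server, port and bucket names are placeholders):
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="minio output demo",
    output_uri="s3://my-minio-server:9000/my-bucket",  # minio: host:port form
    # output_uri="s3://my-bucket",                     # AWS S3: bucket only
)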
Hi @<1523703397830627328:profile|CrookedMonkey33>
If you click on the "Task Information" link (on the Version Info panel, right-hand side), it will open the Task details page; there you have the "hamburger" menu at the top right, where you have publish.
(Maybe we should add that to the main right click menu?!)
I have it deployed successfully with Istio.
Nice!
The only thing we had to do to get it to work was to modify the nginx.conf in the webserver pod to allow HTTP 1.1.
I was under the impression we fixed that, let me check
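A heavily hedged sketch of the kind of nginx.conf change being described; the location block and upstream below are assumptions, the point is only the proxy_http_version directive:
# inside the webserver pod's nginx.conf (exact block and upstream names will differ)
location /api {
    proxy_http_version 1.1;        # force HTTP/1.1 on the proxied connection
    proxy_set_header Connection "";
    proxy_pass http://apiserver:8008;
}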
JitteryCoyote63 I remember something with "!" in the name or maybe "/" in the name that might cause this behavior. May I suggest checking with clearml-server 1.3 ?
Sorry my bad:
config_obj['sdk']['stuff']['here'] = value
Right! I just noticed that! This is odd... and yes, it definitely has something to do with the multi pipeline executed on the agent. I think I know what to look for ...
(just making sure (again), running_locally produced exactly what we were expecting, is that correct?)
There seems to be a problem with multiprocessing: Although I stopped the task,
You mean you "aborted the task" from the UI?
- There is a memory leak somewhere, please see the screenshot of datadog memory consumption
I'm assuming from the leftover processes?
Python 3.8/Pytorch 1.11/clearml-sdk 1.9.0/clearml-agent 1.4.1
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
Well (yes, I think). The environment section is used mostly for logging; the next version of the clearml-agent (due next week) will have full support for it, and the next release of clearml-server will add bash-script support.
Task.create will create a new Task (and return an object) but it does not do any auto-magic (like logging the console, tensorboard etc.)
Let me rerun the code and check
GiganticTurtle0 this is exactly what I did, and ended up with two pipelines, comparing them produced what I expected (different arguments as passed by the script).
What are you getting ?
What is the difference to file_history_size
Number of unique files per title/series combination (i.e. how many images to store in the history when the iteration is constantly increasing).
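A quick sketch of where that applies, with placeholder project/task names and a random image payload:
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="file history demo")
logger = task.get_logger()

# with file_history_size = N, only N unique files are kept per title/series
# combination even though the iteration keeps increasing
for i in range(100):
    img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
    logger.report_image(title="debug", series="random", iteration=i, image=img)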
hmm that is odd.
Can you send the full log ?
I'm trying to get a task to run using a specific docker image and to source a bash script before execution of the python script.
Are you running an agent in docker mode? If so, you should be able to see the output of your bash script first thing in the log
(and it will appear in the docker CMD)
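A hedged sketch of setting this up from the task side, assuming a recent clearml SDK where set_base_docker accepts a setup script (the image name, script path and queue are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="docker + setup script")
# request a specific docker image and a few shell lines the agent
# (running in docker mode) executes before the python script starts
task.set_base_docker(
    docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04",
    docker_setup_bash_script=["source /opt/my_env/setup.sh"],  # hypothetical script
)
task.execute_remotely(queue_name="default")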