I found the issue: on the first run it jumps over the first day (let me check if we can quickly fix that)
Hi @<1541954607595393024:profile|BattyCrocodile47>
But the files API is still open to the world, right?
No, of course not 🙂 (i.e. the API is authenticated with a JWT header, which is why you need to generate the secret/key in the UI)
That said, the login process itself is user/pass stored on the server, but other than that the web/API are secured. The file server, on the other hand, is plain HTTP storage and does not verify the connection the way the API does. So if you are going the self-ho...
OutrageousSheep60 so this should work, no?
ds.upload(output_url='gs://<BUCKET>/', compression=0, chunk_size=100000000000)
Notice the chunk size is the maximum size (in bytes) per chunk, so it should basically be very large
@<1523701868901961728:profile|ReassuredTiger98> thank you so much for testing it!
and you have clearml v0.17.2 installed on the "system" packages level, and 0.17.5rc6 installed inside the pyenv venv ?
Could it be it checks the root target folder and you do not have permissions there, only on the subfolders?
But in credentials creation it still shows 8008. Are there any other places in docker-compose.yml where the port should be changed from 8008 to 8011?
I think there is a way to "tell" it what to put there, not sure:
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_config#configuration-files
I think that the first model saved gets the task name as its name and the following models take
f"{task_name} - {file_name}"
Hmm, I'm not sure what would be a good way to make it consistent, would it make sense to always have the model file name?
I guess it takes some time before the correct names are assigned?
Hmm that is odd, I have a feeling it has to do with calling Task.close()?!
I just tried with the latest clearml version and it seemed to work as expected
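If you want full control over the name, you can also register the model explicitly, something along these lines (a sketch, project/model names are placeholders):
from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="model naming")
# create the output model with an explicit name, instead of relying on
# the automatic "{task_name} - {file_name}" naming
model = OutputModel(task=task, name="my-model")
model.update_weights(weights_filename="model.pt")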
Hi @<1716987933514272768:profile|SuccessfulPuppy43>
How to make remote ClearML agent do
pip install -e .
in theory there is no need to do that; clearml-agent adds the repo root folder to the python path.
If you insist on actually installing it, try to add a "requirements.txt"-compatible line to your "installed packages" section:
-e .
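i.e. the "installed packages" section would end up looking something like (a sketch):
# requirements.txt syntax, "-e ." installs the repo itself
clearml
-e .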
Hi @<1715175986749771776:profile|FuzzySeaanemone21>
and then run "clearml-agent daemon --gpus 0 --queue gcp-l4" to start the worker.
I'm assuming the docker service cannot spin up a container with GPU access; usually this means you are missing the nvidia docker runtime component
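You can quickly verify it outside the agent (a sketch, the cuda image tag is just an example):
# if this fails, the nvidia container runtime/toolkit is not installed on the host
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi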
GrievingTurkey78
Both are now supported, they basically act the same way 🙂
and log overrides + the final omegaconf
Hi @<1720249416255803392:profile|IdealMole15>
I'm assuming you mean on a remote machine with clearml-agent running ?
If you do, then you either use clearml-task
to create a Task (Job) and specify the container and script, or click on "Create New Experiment" in the UI, fill out the git repo / script, and specify the docker image.
Make sense?
Oh what if the script is in the container already?
Hmm, the idea of clearml is that the container is a "base environment" and the code is "injected"; this makes it easy to reuse the container.
The easiest way is to add an "entry point" script that just calls the existing script inside the container.
You can have this initial python script on your local machine; then when you call clearml-task
it will upload the local "entry point" script directly to the Task, and then on the remote machin...
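For example, the local "entry point" script could be as small as this (a sketch; the in-container path /opt/code/train.py is an assumption):
# entry_point.py - lives on your local machine and is uploaded with the Task
from clearml import Task
import subprocess

task = Task.init(project_name="examples", task_name="run in-container script")
# the actual script is assumed to already exist inside the container image
subprocess.run(["python", "/opt/code/train.py"], check=True)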
so first, yes, I totally agree. This is why clearml-serving
has a dedicated statistics module that creates histograms over time, then we push it into Prometheus and connect grafana to it for dashboards and alerts.
To be honest, I would just use it instead of reporting manually, wdyt?
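For reference, registering a metric on a served endpoint looks roughly like this (a sketch of the clearml-serving CLI; the endpoint/variable names are placeholders):
clearml-serving --id <service_id> metrics add --endpoint test_model --variable-scalar x0=0,0.5,1 y=0,0.5,1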
I prefer serving my models in-house and only performing the monitoring via ClearML.
clearml-serving
is an infrastructure for you to run models 🙂
to clarify, clearml-serving
is running on your end (meaning this is not SaaS where a 3rd party is running the model)
By the way, I saw there is a project dashboard app which might support the visualization I am looking for. Is it suitable for such use case?
Hmm interesting, actually it might, it does collect metrics over time ...
The versions don't need to match, any combination will work.
I see TrickyFox41 try the following:
--args overrides="param=value"
Notice this will change the Args/overrides argument that will be parsed by hydra to override its params
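e.g. the full call would be something like (a sketch, project/script names are placeholders):
clearml-task --project examples --name hydra-run --script train.py --args overrides="param=value"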
Check the log, the container has torch 1.13.0 but the task requires torch==1.13.1
Now, the torch package inside those nvidia prepackaged containers is compiled a bit differently. What I suspect happens is that the torch wheel from pytorch is not compatible with this container. Easiest fix: change the task requirements to 1.13.0
Wdyt ?
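If it is easier to do from code, something along these lines before Task.init should work (a sketch):
from clearml import Task

# record torch 1.13.0 in the task requirements, matching the version
# already shipped inside the nvidia container
Task.add_requirements("torch", "1.13.0")
task = Task.init(project_name="examples", task_name="torch pin")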
Hi @<1730396272990359552:profile|CluelessMouse37>
However, the caching doesn't seem to be working correctly. Despite not changing the configuration, the first step runs every time.
How are you creating the cached component?
is this a standalone script or a git repo link?
These parameters are dictionaries of specific configurations (dict of dict) that are the same but might not be taken into account properly by the caching mechanism.
hmm for the component to be cached (or reuse...
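For reference, a cached component looks something like this (a minimal sketch, names are placeholders):
from clearml import PipelineDecorator

# cache=True reuses a previous run only if the code and *all* input arguments
# hash identically - including nested dict-of-dict configurations
@PipelineDecorator.component(cache=True, return_values=["result"])
def preprocess(config: dict):
    result = {"processed": config}
    return result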
DAG which get scheduled at given interval and
Yes, exactly. That will be part of the next iteration of the controller/service
an example achieving what i propose would be greatly helpful
Would this help?
from trains.automation import TrainsJob

job = TrainsJob(base_task_id='step1_task_id_here')
job.launch(queue_name='default')
job.wait()

job2 = TrainsJob(base_task_id='step2_task_id_here')
job2.launch(queue_name='default')
job2.wait()
Ohh then you do docker sibling:
Basically you map the docker socket into the agent's docker container; that lets the agent launch another docker container on the host machine.
You can see an example here:
https://github.com/allegroai/clearml-server/blob/6434f1028e6e7fd2479b22fe553f7bca3f8a716f/docker/docker-compose.yml#L144
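The relevant part is just mounting the socket, e.g. (a sketch, the image name is only an example):
# give the agent container access to the host docker daemon, so any container
# it launches becomes a "sibling" running directly on the host
docker run -v /var/run/docker.sock:/var/run/docker.sock allegroai/clearml-agent-services:latest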
Hi @<1541229818828296192:profile|HurtHedgehog47>
plots we create in the notebook are not saved as they were made.
I'm assuming these are matplotlib plots ?
Notice that ClearML tries to convert the plot into an interactive plot; in that process colors and legends are sometimes lost (they become generic).
You can however manually report the plot, and force it to store it as non-interactive:
task.logger.report_matplotlib_figure(
    title="Manual Reporting",
    series="Just a plot",
    iteration=0,
    figure=plt,
    report_interactive=False,  # store as a static image instead of an interactive plot
)
BeefyCow3 see this https://allegroai-trains.slack.com/archives/CTK20V944/p1593077204051100 :)
Thanks @<1523704157695905792:profile|VivaciousBadger56> ! great work on the docstring, I also really like the extended example. Let me make sure someone merges it
@<1734020162731905024:profile|RattyBluewhale45> could you attach the full Task log? Also what do you have under "installed packages" in the original manual execution that works for you?
(i.e. importing the trains package is enough to patch the argparser; the arguments are only logged when you call Task.init, until then they are stored in memory)
1724924574994 g-s:gpu1 DEBUG WARNING:root:Could not lock cache folder /root/.clearml/venvs-cache: [Errno 9] Bad file descriptor
You have an issue with your OS / mount. Specifically, "/mnt/clearml/" is the base folder for all the cached stuff, and it fails to create the lock files there. Either use a local folder, or try to understand what the issue is with the host machine's /mnt/ mounts (it looks like a network mount)
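If you go with a local folder, it is just a matter of pointing the cache there in clearml.conf (a sketch, only the venvs cache is shown):
# ~/clearml.conf on the worker machine
agent {
    venvs_cache: {
        # move the cache off the network mount onto a local disk
        path: ~/.clearml/venvs-cache
    }
}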