Is there a quicker way to abort all running experiments in a project? I have over a thousand running anonymous data tasks in a specific project and I want to abort them before debugging them.
We are adding "select all" in the next UI version so that this can be done as quickly as possible.
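Until the UI option is available, something along these lines might work from the SDK. This is only a sketch: it assumes that Task.get_tasks with a status filter and task.mark_stopped have the same effect as the UI abort on your server version. The helper names are made up for illustration.

```python
def running_task_filter():
    """Filter for Task.get_tasks that selects only tasks that are currently running."""
    return {"status": ["in_progress"]}


def abort_running_tasks(project_name, dry_run=True):
    """Mark every running task in `project_name` as stopped and return their IDs.

    With dry_run=True nothing is changed; the matching task IDs are only listed.
    """
    from clearml import Task  # imported lazily so the filter helper works without clearml

    tasks = Task.get_tasks(project_name=project_name, task_filter=running_task_filter())
    for task in tasks:
        if not dry_run:
            # Assumption: marking the task as stopped is an acceptable stand-in for the UI abort
            task.mark_stopped(force=True)
    return [task.id for task in tasks]
```

Run it once with dry_run=True to check which tasks would be affected, then again with dry_run=False.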
The pipeline itself is also a task, so this line works in a pipeline. Task.current_task is a class method that returns the running task (the pipeline in our case), and from there it is the usual interface. BTW, what do you have in the conf file?
Any chance there is an env variable you set to get 1.5.0rc0? Because this is the version that is being used
Suppose I have an S3 bucket where my data is stored and I wish to transfer it to the ClearML file server.
Then you first have to download the entire bucket locally, then register the local copy.
Basically:
StorageManager.download_folder("
", "/target/folder")
# now register the local "/target/folder" with Dataset.add_files
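Put together, the whole flow could look roughly like this. It is a sketch, not a tested recipe: the function name and all arguments are placeholders, and it assumes the S3 credentials are already configured for the clearml SDK and that the dataset should go to the default files server.

```python
def s3_to_clearml_dataset(bucket_url, local_folder, dataset_name, dataset_project):
    """Download a remote folder locally, then register the local copy as a ClearML dataset."""
    from clearml import Dataset, StorageManager  # lazy import, needs the clearml package

    # Step 1: download the entire bucket (or bucket sub-folder) to the local machine
    StorageManager.download_folder(bucket_url, local_folder)

    # Step 2: register the local copy and upload it (by default to the configured files server)
    dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)
    dataset.add_files(path=local_folder)
    dataset.upload()
    dataset.finalize()
    return dataset.id
```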
Can the user be overwritten during task configuration (I don't see such an option in the documentation)?
Hm, not really. This is tied to the security features on top.
That said,
stored them as a k8s secret and they are reused whenever anyone from our ML team starts a new ML model training
Does that mean you are running an Agent on the k8s cluster? What exactly is the flow that causes your k8s credentials to be used?
BoredHedgehog47 I tried changing the order of imports on the sample code I shared before, it worked in both cases ...
BoredHedgehog47 if you are running it on K8s, then the setup script is running before everything else, even before an agent appears on the machine, unfortunately this means the output is not logged yet, hence the missing console lines (I think the next version of the glue will fix that)
In order to test, you can do:
export TEST_ME
then inside your code you will be able to see it:
os.environ['TEST_ME']
Make sense?
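A small check like the following could confirm whether the variable reaches the code. It assumes the variable was exported with a value (the "1" below is an arbitrary stand-in, set from Python only so the snippet runs on its own); the helper name is made up.

```python
import os


def read_env_flag(name="TEST_ME"):
    """Return the value of the environment variable, or None if it never reached this process."""
    return os.environ.get(name)


# Stand-in for exporting TEST_ME in the setup script / shell
os.environ["TEST_ME"] = "1"

value = read_env_flag()
print("TEST_ME is visible" if value is not None else "TEST_ME is missing", value)
```

Using os.environ.get instead of os.environ['TEST_ME'] avoids a KeyError when the variable is missing, which makes the test output easier to read.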
Can the host server's service agent be used?
In theory yes, just make sure you expose the container's network (check the docker compose)
Just call Task.init before you create the subprocess, that's it. They will all automatically log to the same Task. You can also call Task.init again from within the subprocess; it will not create a new experiment but will use the main process experiment.
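As a rough sketch of that order of operations (the project/task names, the helper names, and the worker script are hypothetical):

```python
import subprocess
import sys


def child_command(script_path):
    """Build the command line that runs `script_path` with the current interpreter."""
    return [sys.executable, script_path]


def run_with_workers(script_path, num_workers=2):
    """Call Task.init once in the main process, then launch the worker subprocesses."""
    from clearml import Task  # lazy import, needs the clearml package

    # Must happen before any subprocess is created
    Task.init(project_name="examples", task_name="multi-process run")  # hypothetical names

    workers = [subprocess.Popen(child_command(script_path)) for _ in range(num_workers)]
    # Wait for all workers and collect their exit codes
    return [worker.wait() for worker in workers]
```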
I still can't get it to work... I couldn't figure out how can I change the clearml version in the runtime of the Cleanup Service as I'm not in control of the agent that executes it
Let's take a step back. Let's remove the clearml-services from the docker compose for a second, and run it manually (then you can control everything). Once you have it running manually, let's try to replicate the setup back to the docker compose, make sense ?
Hi ApprehensiveSeaturtle9
I send a request to the endpoint but the models never unload (the GPU memory keeps increasing when I infer with a new model).
They are not unloaded after the request is done.
You can however remove the model from the serving session (but I do not think this is what you meant)
I'm assuming you want to run multiple models on a single GPU with not en...
I think the main issue is that, for some reason, the running container changed one of the files inside the temp folder. Then the host machine is "stuck" with a file that the root user owns/changed, and now it cannot reuse or delete the temp folder.
I think the fix is to make sure the container deletes the temp folder when it is done
Hi BattyCrocodile47
But the files API is still open to the world, right?
No, of course not (i.e. the API is authenticated with a JWT header, this is why you need to generate the secret/key in the UI)
That said, the login process itself is user/pass stored on the server, but other than that the web/api are secured. The file server on the other hand is plain http storage and does not verify the connection like the API does. So if you are going the self-ho...
Hi UnevenHorse85
As far as I understand, users use logins and passwords specified in config/apiserver.conf to access webserver UI and key/secret key from their local ~/clearml.conf to access apiserver.
Correct
access apiserver. What is the use of all other security keys
To be able to configure the SDK client (i.e. clearml package) from OS environment and not clearml.conf file
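For example, a sketch of setting the credentials through the environment instead of clearml.conf. The helper name is made up and the values in the usage note are placeholders; the assumption is that the variables are set before the clearml package is imported.

```python
import os


def configure_clearml_env(api_host, access_key, secret_key, environ=None):
    """Store the ClearML connection settings in the given environment mapping."""
    environ = os.environ if environ is None else environ
    environ["CLEARML_API_HOST"] = api_host
    environ["CLEARML_API_ACCESS_KEY"] = access_key
    environ["CLEARML_API_SECRET_KEY"] = secret_key
    return environ
```

Typical use would be calling configure_clearml_env("http://localhost:8008", "<access key>", "<secret key>") at the very top of the script, or exporting the same three variables in the shell or container definition.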
containing the Extension module
Not sure I follow, what is the Extension module? What were you running manually that is not just pip install /opt/keras-hannd?
Or can it also be right after Task.init()?
That would work as well
AstonishingRabbit13 so is it working now ?
because comparing experiments using graphs is very useful. I think it is a nice to have feature.
So currently, when you compare the graphs, you can select the specific scalars to compare, and it updates in real time!
You can also bookmark the actual URL and it is fully reproducible (i.e. full state is stored)
You can also add custom columns to the experiment table (with the metrics) and sort / filter based on them, and create a summary dashboard (again like all pages in the web app, URL is...
Not really sure that's easily done ... I mean you could query the data, but I'm not sure how you would import it. Btw why would you move from pro to self hosted?
Hi EnviousStarfish54
After the pop up do you see the plot on the web UI?
What are the Windows version, Python version, and clearml version you are using?
Great, you can test directly from the master:
pip3 install -U git+
The main issue is applying the patch requires git clone and that would fail on local (not pushed) commits.
What's the use case itself ?
(btw, if you copy the uncommitted changes into a file and git apply it, it will work)
There is a version coming out next week, the one after it (probably 2/3 weeks later) will have this feature
that is because my own machine has 10.2 (not the docker, the machine the agent is on)
No that has nothing to do with it, the CUDA is inside the container. I'm referring to this image https://allegroai-trains.slack.com/archives/CTK20V944/p1593440299094400?thread_ts=1593437149.089400&cid=CTK20V944
Assuming this is the output from your code running inside the docker, it points to CUDA version 10.2
Am I missing something ?
Is the code in this "other" repo downloaded to the agent's machine? Or is the component's code pushed to the machine on which the repository is?
Yes this repo is downloaded into the agent, so your code has access to it
Just fixed, will be merged later. Basically, there are some fields you are not supposed to change post execution (but system tags should be exempt from that). The SDK checks before the backend does, so you get a nice error. Anyhow, the backend will obviously allow it
SmallDeer34 No worries, I'm happy to hear the issue disappeared