
I can probably have a python script that checks if there are any tasks running/pending, and if not, run docker-compose down to stop the clearml-server, then use boto3 to trigger the creation of an EBS snapshot, then wait until it finishes, then restart the clearml-server, wdyt?
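Something along these lines is a rough, untested sketch of that idea (the volume id, the compose directory and the status filter are placeholders/assumptions, adjust to your setup):
```python
import subprocess
import boto3
from clearml import Task

VOLUME_ID = "vol-0123456789abcdef0"   # placeholder: EBS volume backing the clearml-server data
COMPOSE_DIR = "/opt/clearml"          # placeholder: folder containing docker-compose.yml

# Anything still queued or running? If so, skip this backup round.
active = Task.get_tasks(task_filter={"status": ["queued", "in_progress"]})
if not active:
    # Stop the clearml-server containers
    subprocess.check_call(["docker-compose", "down"], cwd=COMPOSE_DIR)

    # Trigger the EBS snapshot and wait for it to finish
    ec2 = boto3.client("ec2")
    snap = ec2.create_snapshot(VolumeId=VOLUME_ID, Description="clearml-server backup")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Restart the clearml-server
    subprocess.check_call(["docker-compose", "up", "-d"], cwd=COMPOSE_DIR)
```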
I'm pretty sure there is a nice way, let me check something
From the top:
- trains-agent pulls a service Task; the Task is marked as running and the trains-agent worker points to the Task
- Docker is spun up and the environment is installed inside the docker (results are shown in the service Task Log)
- trains-agent inside the docker is launched and a new node appears in the system as <host_agent_name>:service:<task_id>, with the Task service listed as running on it
- the main trains-agent is back to idle and its worker now has no experiment listed as running
Where do you think it breaks?
CooperativeFox72 btw, are you guys running those 20 experiments manually or through trains-agent ?
I want to be able to install the venv on multiple servers and start the "simple" agents on each of them. You can think of it as some kind of one-off agent for a specific (distributed) hyperparameter search task
ExcitedFish86 Oh if this is the case:
in your clearml.conf:
```
agent.package_manager.type: conda
agent.package_manager.conda_env_as_base_docker: true
```
https://github.com/allegroai/clearml-agent/blob/36073ad488fc141353a077a48651ab3fabb3d794/docs/clearml.conf#L60
https://git...
Have to get glue setup, which I couldn't understand fully, so that's a different topic
I suggest using the apply template setup (basically you provide a Job/Service template, and it uses that to set up k8s jobs based on the Tasks coming in from the specific queue)
Notice that the StorageManager has default configuration here:
https://github.com/allegroai/trains/blob/f27aed767cb3aa3ea83d8f273e48460dd79a90df/docs/trains.conf#L76
Then a per-bucket credentials list, with details:
https://github.com/allegroai/trains/blob/f27aed767cb3aa3ea83d8f273e48460dd79a90df/docs/trains.conf#L81
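For reference, once the matching bucket entry is in that configuration, something like this minimal sketch should pick up the credentials transparently (the bucket and paths below are placeholders):
```python
from clearml import StorageManager

# Downloads using the per-bucket credentials defined in the conf file
local_copy = StorageManager.get_local_copy(remote_url="s3://my-bucket/data/file.csv")

# Uploads work the same way
StorageManager.upload_file(local_file=local_copy, remote_url="s3://my-bucket/data/file_copy.csv")
```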
OddAlligator72 FYI you can also import / export an entire Task (basically allowing you to create it from scratch/json, even without calling Task.create): Task.import_task(...) / Task.export_task(...)
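A minimal sketch (the task id is a placeholder):
```python
from clearml import Task

# Serialize an existing Task into a plain dict (can be stored as JSON)
task_data = Task.export_task(task_id="<source_task_id>")

# Recreate a Task from that dict, e.g. in another project or on another server
new_task = Task.import_task(task_data)
```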
Hi RoughTiger69
A. Yes, makes total sense. Basically you can use Task.export_task / Task.import_task to achieve this (notice we assume the dataset artifact links are available on both servers, which is usually the case)
B. The easiest way would be to use Process: one subprocess exports from dev, with the credentials and configuration passed via os environment variables, and another subprocess imports it to the prod server (again with the os environment pointing to the prod server). Make sense?
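A rough sketch of that two-subprocess idea, assuming the standard CLEARML_API_HOST / CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY environment variables, and two hypothetical helper scripts that call Task.export_task / Task.import_task (hosts, keys and script names are placeholders):
```python
import os
import subprocess

def run_against_server(script, host, key, secret):
    # Point the child process at a specific clearml-server via environment variables
    env = dict(os.environ,
               CLEARML_API_HOST=host,
               CLEARML_API_ACCESS_KEY=key,
               CLEARML_API_SECRET_KEY=secret)
    subprocess.check_call(["python", script], env=env)

# export_from_dev.py would dump Task.export_task(...) to a JSON file,
# import_to_prod.py would load it and call Task.import_task(...)  (both hypothetical)
run_against_server("export_from_dev.py", "https://dev-api.example.com", "<dev_key>", "<dev_secret>")
run_against_server("import_to_prod.py", "https://prod-api.example.com", "<prod_key>", "<prod_secret>")
```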
I understand, but then the toml file needs to be parsed to ensure poetry is used. It's just a tool entry in the pyproject.toml.
Probably too much for the agent... and specifically it seems poetry actually managed to parse it?! what are you getting in the log?
Maybe this is part of the paid version, but would be cool if each user (in the web UI) could define their own secrets,
Very cool (and actually how it works), but at the end someone needs to pay for salaries 🙂
The S3 bucket credentials are defined on the agent, as the bucket is also running locally on the same machine - but I would love for the code to download and apply the file automatically!
I have an idea here, why not use the "docker bash script" argument for that?...
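For example, a minimal sketch (assuming a clearml version that exposes docker_setup_bash_script; the image, bucket and path are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="docker setup script example")

# Lines the agent runs inside the container before launching the code,
# e.g. fetching the credentials/config file onto the machine
task.set_base_docker(
    "nvidia/cuda:11.8.0-runtime-ubuntu22.04",  # placeholder image
    docker_setup_bash_script=[
        "aws s3 cp s3://my-bucket/config/credentials.json /root/credentials.json",  # placeholder path
    ],
)
```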
Hi JitteryCoyote63
So that I could simply do
task._update_requirements(".[train]")
but when I do this, the clearml agent (latest version) does not try to grab the matching cuda version, it only takes the cpu version. Is it a known bug?
The easiest way to go about it is to add:
```
Task.add_requirements("torch", "==1.11.0")
task = Task.init(...)
```
Then it will auto detect your custom package, and will always add the torch version. The main issue with relying on the package...
Hi MortifiedCrow63
saw ...
By default ClearML will only log the exact local path where you stored the file, I assume this is it.
If you pass output_uri=True to Task.init it will automatically upload the model to the files_server, and the model repository will then point to the files_server (you can also use any object storage as model storage, e.g. output_uri="s3://bucket").
Notice yo...
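For example (project/task names and the bucket are placeholders):
```python
from clearml import Task

# Upload model snapshots to the files_server
task = Task.init(project_name="examples", task_name="output_uri example", output_uri=True)

# or point model storage at any object storage instead, e.g.:
# task = Task.init(project_name="examples", task_name="output_uri example",
#                  output_uri="s3://my-bucket/models")
```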
Hi @<1564422644407734272:profile|DistressedCoyote60>
I have an ML project that is not on git. It is separated into several files:
Yes you are correct, if your code is composed of multiple files, you have to have a git repo for the agent to be able to run it.
That said, it is free on any of these services: github/bitbucket/gitlab 🙂
IntriguedRat44 If the monitoring only shows a single GPU (the selected one) it means it reads the correct CUDA_VISIBLE_DEVICES (this is how it knows that you are only using a selected GPU not all of them).
There is nothing else in the code that will change the OS environment.
Could you print os.environ['CUDA_VISIBLE_DEVICES'] while running the code to verify?
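i.e. something as simple as:
```python
import os

# The agent sets this variable based on the GPUs assigned to the worker
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
```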
It actually started executing your code, but it did not capture it correctly:
/root/.clearml/venvs-builds/3.10/bin/python -u /root/.clearml/venvs-builds/3.10/code/colab_kernel_launcher.py
Which I assume means the actual Task had bad code.
What do you have under the Task's Execution tab in the UI (the one you were launching, i.e. enqueuing)?
Hi WickedGoat98
I try to write an article on medium about ClearML and face a problem with plotly figures.
This is awesome !
I ran the plotly_reporting.py example locally and the uploaded plot was ok.
So are you saying the same example code from the repository worked okay on your server but showed nothing on the hosted server?
Hi PerfectChicken66
every X iterations and delete the older ones with
I have to ask, why not just overwrite the artifact? it is basically the same, no ?!
older ones with delete_artifacts from Task
I think you are correct: when you delete the entire Task you can specify to delete its artifacts, but it does not do that on delete_artifact 🙂
You can manually do that with:
```
task._delete_uri(task.artifacts["artifact"].url)
task.delete_artifact() ...
```
Hi WorriedParrot51
Assuming you run the code "manually" once (i.e. without the agent). Then when you call Task.init it will register the argparser.
When running with the agent, the first time you will call parse, it will automatically override the argparse defaults with the values stored in the Task.
Make sense?
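For example, a minimal sketch (the argument and names are placeholders):
```python
from argparse import ArgumentParser
from clearml import Task

parser = ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)

# Registers the argparser; when the agent runs the Task,
# the stored values override these defaults at parse_args() time
task = Task.init(project_name="examples", task_name="argparse example")
args = parser.parse_args()
```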
I am getting None for Task.current_task() at the beginning of my script.
Task.init() is doing the magic, only after this call will you have a current_task (either running manua...
preempting lower priority tasks to allow a higher priority task to come in
Well this is usually outside of the scope of "single researcher" / "tiny team"...
This is typically a large-scale problem.
That said, it would be fairly easy to write a service that aborts Tasks, tags them to be "continued", then later (at night?!) pushes them back into a queue... wdyt?
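As a very rough sketch of such a service (the project, queue and tag names are placeholders, and depending on the task state you might need force=True or a reset before re-enqueueing):
```python
from clearml import Task

def preempt(project_name="research", tag="continue_later"):
    # Abort currently running (lower priority) tasks and tag them for later
    for t in Task.get_tasks(project_name=project_name,
                            task_filter={"status": ["in_progress"]}):
        t.add_tags([tag])
        t.mark_stopped(force=True)

def resume(queue_name="default", tag="continue_later"):
    # Later (at night?) push the tagged aborted tasks back into a queue
    for t in Task.get_tasks(task_filter={"tags": [tag], "status": ["stopped"]}):
        Task.enqueue(t, queue_name=queue_name)
```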
DeliciousBluewhale87
Upon ssh-ing into the folders on both the physical node (/opt/clearml/agent) and the pod (/root/.clearml), it seems there are some files there...
Hmm that means it is working...
Do you see any *.conf files there? What do they contain? (do they point to the correct clearml-server config?)
Hi TartBear70
I'm setting up reproducibility myself but when I call Task.init() the seed is changed
Correct
. Is it possible to tell clearml not to initialize any rng? It appears that task.set_random_seed() doesn't change anything.
I think this is now fixed (meaning should be part of the post weekend release)
. Is this documented?
Hmm, I'm not sure (actually we should write it, maybe in the Task.init docstring?)
Specifically the function that is being called is:
https://gi...
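If the fix exposes it the way I expect, disabling the seed override should look roughly like this (an assumption, verify against your installed clearml version):
```python
from clearml import Task

# Passing None should ask clearml not to touch the RNG state
# (assumption: available once the fix is released)
Task.set_random_seed(None)
task = Task.init(project_name="examples", task_name="no seed override")
```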
So assuming they are all on the same LB IP: You should do:
LB 8080 (https) -> instance 8080
LB 8008 (https) -> instance 8008
LB 8081 (https) -> instance 8081
It might also work with:
LB 443 (https) -> instance 8080
I would just add git+
None to your requirements (either in the requirements.txt or even better as part of the pipeline/component where you also specify the repo to be used)
The agent will automatically push the credentials when it installs the repo as a wheel.
wdyt?
btw: you might also get away with adding -e .
into the requirements.txt (but you will need to test that one)
That works AND the feature works!
YEY
Quick follow up question, is there any way to abort a pipeline and all of the tasks it ran?
Hmm yes, currently if you abort the pipeline it has no "time" to abort the running Tasks (the DAG itself will stop, because the pipeline controller was aborted, but the running Tasks will continue).
In order to have better support, we need to add a previously requested feature for an "abort" callback. This is actually not as straightforward as it sounds...
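In the meantime, a possible workaround sketch is to abort the controller's children yourself (assumes the steps are linked to the controller via the parent field; the id is a placeholder):
```python
from clearml import Task

pipeline_id = "<pipeline_controller_task_id>"  # placeholder

# Abort any step of the pipeline that is still queued or running
for child in Task.get_tasks(task_filter={"parent": pipeline_id,
                                         "status": ["queued", "in_progress"]}):
    child.mark_stopped(force=True)
```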
If there is new issue will let you know in the new thread
Thanks! I would really like to understand what is the correct configuration
Hi @<1524922424720625664:profile|TartLeopard58>
Yes, this is the default; it is designed to serve multiple models and scale horizontally
This one seems to work:
```
from clearml import Task
task = Task.init(...)

import matplotlib.pyplot as plt
import numpy as np

plt.style.use('_mpl-gallery')

# make data:
np.random.seed(10)
D = np.random.normal((3, 5, 4), (0.75, 1.00, 0.75), (200, 3))

# plot:
fig, ax = plt.subplots()
vp = ax.violinplot(D, [2, 4, 6], widths=2,
                   showmeans=False, showmedians=False, showextrema=False)

# styling:
for body in vp['bodies']:
    body.set_alpha(0.9)
ax.set(xlim=(0, 8), xticks=np.arang...
```
If I edit the OmegaConf directly in the UI then the port changes correctly
This will only work if you change the Hydra/allow_omegaconf_edit to True in the UI. Did you?
Hmm, maybe the original Task was executed with older versions? (before the section names were introduced)
Let's try: DiscreteParameterRange('epochs', values=[30])
Does that give a warning?
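For context, a minimal sketch of where that line would sit (the base task id, metric names and queue are placeholders; newer tasks usually need the 'Args/' section prefix, older ones may not):
```python
from clearml.automation import DiscreteParameterRange, HyperParameterOptimizer, RandomSearch

optimizer = HyperParameterOptimizer(
    base_task_id="<base_task_id>",                      # placeholder
    hyper_parameters=[
        DiscreteParameterRange('epochs', values=[30]),  # or 'Args/epochs' on newer tasks
    ],
    objective_metric_title="validation",                # placeholder metric
    objective_metric_series="accuracy",                 # placeholder series
    objective_metric_sign="max",
    optimizer_class=RandomSearch,
    execution_queue="default",
)
```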