DefeatedCrab47 If I remember correctly, v1+ has its arguments coming from argparse.
Are you using this feature? 2. How do you set the TB HParam? Currently Trains does not support TB HParams; the reason is that the set of HParams needs to match a single experiment. Is that your case?
clearml-agent deployment file
What do you mean by that? Is that the Helm chart of the agent?
HandsomeCrow5
client.events.debug_images(metrics=[dict(task='6adb929f66d14731bc76e3493ab89d80', metric='image')])
So to conclude: it has to be executed manually first, then with trains agent?
Yes. That said, as you mentioned, you can always edit the "installed packages" once manually. From that point on you are basically cloning the experiment, including the "installed packages", so it should work if the original worked.
Does that make sense?
GiganticTurtle0 BTW, this mock example worked out of the box (python 3.6 on Ubuntu):
` from typing import Any, Dict, List, Tuple, Union
from clearml import Task
from dask.distributed import Client, LocalCluster

def start_dask_client(
    n_workers: int = None, threads_per_worker: int = None, memory_limit: str = "2Gb"
) -> Client:
    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
        memory_limit=memory_limit,
    )
    client = Cli...
Hi JitteryCoyote63, when you run the trains-agent it tells you where it puts the logs. It's a temporary, auto-generated filename, usually under /tmp/:
Running TRAINS-AGENT daemon in background mode, writing stdout/stderr to /tmp/.trains_agent_daemon_out4uahki3i.txt
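If you lost the startup message, here is a minimal sketch for locating the daemon log afterwards. It assumes the log name keeps the `.trains_agent_daemon_out` prefix and `.txt` extension seen in the message above (the random part in between changes on every run); the function name is just for illustration.

```python
import glob
import os


def find_agent_daemon_logs(tmp_dir="/tmp"):
    """Return candidate trains-agent daemon log files in tmp_dir, newest first."""
    # Assumption: the prefix and extension match the example path printed by the agent.
    pattern = os.path.join(tmp_dir, ".trains_agent_daemon_out*.txt")
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)


if __name__ == "__main__":
    for path in find_agent_daemon_logs():
        print(path)
```

The first path printed should be the most recently modified log, which is usually the daemon you started last.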
WittyOwl57 this is what I'm getting on my console (notice the new lines! Not a single line being overwritten, as I would expect):
` Training: 10%|█ | 1/10 [00:00<?, ?it/s]
Training: 20%|██ | 2/10 [00:00<00:00, 9.93it/s]
Training: 30%|███ | 3/10 [00:00<00:00, 9.89it/s]
Training: 40%|████ | 4/10 [00:00<00:00, 9.87it/s]
Training: 50%|█████ | 5/10 [00:00<00:00, 9.87it/s]
Training: 60%|██████ | 6/10 [00:00<00:00, 9.88it/s]
Training: 70%|███████ | 7/10 [00:00<00...
AttributeError: 'NoneType' object has no attribute 'base_url'
Can you print the model object?
(I think the error is a bit cryptic, but generally it might be that the model is missing an actual URL link?)
print(model.id, model.name, model.url)
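If the plain print itself raises because something is missing, a slightly more defensive version might help. This is only a sketch: `describe_model` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in for your real model object, assuming it exposes the same `id`, `name` and `url` attributes as in the print above.

```python
from types import SimpleNamespace


def describe_model(model):
    """Build a one-line summary without raising if a field is missing or None."""
    if model is None:
        return "model is None"
    fields = ("id", "name", "url")
    parts = ["{}={!r}".format(f, getattr(model, f, None)) for f in fields]
    return ", ".join(parts)


if __name__ == "__main__":
    # Stand-in object for illustration only; pass your real model object instead.
    fake = SimpleNamespace(id="abc123", name="my-model", url=None)
    print(describe_model(fake))
```

A `url=None` in the output would be consistent with the guess that the model has no actual URL link.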
ResponsiveHedgehong88 so I would suggest using execute_remotely in your code. Basically, you start locally and make sure everything is passed as intended, then from within the code you call task.execute_remotely(...),
which will stop the current process and enqueue the Task on the selected queue for the agent to execute.
https://github.com/allegroai/clearml/blob/0397f2b41e41325db2a191070e01b218251bc8b2/examples/advanced/execute_remotely_example.py#L127
This way you can both easily test...
It should have been: output_uri="s3://company-clearml/artifacts/bethan/sales_journeys/artifacts/examples/load_artifacts.f0f4d1cd5eb54795b11508dd1e739145/artifacts/filename.csv.gz/filename.csv.gz"
I "think" the IAM should only have the ability to create an EC2 instance (querying instances is done through the trains platform)
See here:
https://pip.pypa.io/en/stable/user_guide/#environment-variables
Pass these environment variables as part of the YAML template you are using with k8s.
Should work for both 🙂
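To make the naming rule from the pip docs concrete: a long option such as `--default-timeout` becomes `PIP_DEFAULT_TIMEOUT` (prefix `PIP_`, upper case, dashes replaced by underscores). Below is a small sketch that builds the entries you would then copy into the container's `env:` list in the template. The helper names and the index URL are made up for illustration.

```python
def pip_option_to_env_name(option):
    """'--index-url' -> 'PIP_INDEX_URL', following pip's documented naming rule."""
    return "PIP_" + option.lstrip("-").replace("-", "_").upper()


def k8s_env_entries(pip_options):
    """Turn {'--index-url': '...'} into name/value dicts for a container `env:` list."""
    return [
        {"name": pip_option_to_env_name(opt), "value": str(val)}
        for opt, val in pip_options.items()
    ]


if __name__ == "__main__":
    # Hypothetical values; replace with the options you actually need.
    options = {
        "--index-url": "https://pypi.example.com/simple",
        "--default-timeout": 60,
    }
    for entry in k8s_env_entries(options):
        print(entry)
```

Values are converted to strings because k8s expects env values to be strings in the YAML.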
Unfortunately that is correct. It continues as if nothing happened!
oh dear, let me make sure this is taken care of
And thank you for the reproduce code!!!
Can the host server's service agent be used?
In theory yes, just make sure you expose the containers network (check the docker compose)
Check the log, the container has torch 1.13.0 but the task requires torch==1.13.1
Now, the torch package inside those NVIDIA prepackaged containers is compiled a bit differently. What I suspect happens is that the torch wheel from PyTorch is not compatible with this container. Easiest fix: change the task requirements to 1.13.
Wdyt?
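To see why relaxing the pin should help, here is a simplified sketch of how an exact pin compares against the installed version. It follows the zero-padding rule from PEP 440 (so `1.13` and `1.13.0` count as the same release), but it only handles plain numeric versions; local labels such as a CUDA suffix or pre-releases are out of scope, and the real resolution is of course done by pip, not by this helper.

```python
def release_tuple(version):
    """'1.13.0' -> (1, 13, 0). Simplified: numeric release segments only."""
    return tuple(int(part) for part in version.split("."))


def satisfies_exact_pin(installed, pinned):
    """True if `installed` matches `==pinned`, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(pinned)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a == b


if __name__ == "__main__":
    # Versions taken from the log discussed above.
    print(satisfies_exact_pin("1.13.0", "1.13.1"))  # container vs. current requirement
    print(satisfies_exact_pin("1.13.0", "1.13"))    # container vs. relaxed requirement
```

With the requirement at 1.13.1 the preinstalled 1.13.0 does not match, so a new wheel gets installed; with 1.13 the container's own torch already satisfies it.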
Hi DeliciousBluewhale87
This sounds like a great workflow to implement.
I guess my first question is: how do you imagine the manager/director interacting with the system? What will they be shown to allow them to approve/decline the model promotion?
This depends on how you spun up the server. Basically, as long as you configure the clients (i.e. Python clients) correctly, there is no issue.
But the auto-generated configuration might be off (in the UI, when you create credentials, it tells clearml-init where the server is and which ports to use).
I would actually recommend subdomains if this is possible
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_config#sub-domain-configuration
wdyt?
So if any step corresponding to 'inference_orchestrator_1' fails, then 'inference_orchestrator_2' keeps running.
GiganticTurtle0 I'm not sure it makes sense to halt the entire pipeline if one step fails.
That said, how about using the post_execution callback: check whether the step failed, and if so stop the entire pipeline (and any running steps). What do you think?
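A rough sketch of what such a callback could look like. The assumptions here should be checked against your clearml version: that the callback is called with `(pipeline, node)`, that it is also invoked for failed steps, that `node.job` exposes `is_failed()`, and that the pipeline controller has a `stop()` method. The function name is made up.

```python
def stop_pipeline_on_failure(pipeline, node):
    """Post-execution callback sketch: stop the whole pipeline if this step failed.

    Assumes `node.job.is_failed()` and `pipeline.stop()` exist (see note above).
    Returns True if a stop was requested, False otherwise.
    """
    job = getattr(node, "job", None)
    if job is not None and job.is_failed():
        pipeline.stop()
        return True
    return False
```

You would pass this function as the post-execution callback when adding each 'inference_orchestrator' step, so a failure in one of them stops the others too.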
Let's try:
` echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ; [ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && b...
Hi NaughtyFish36
c++ module fails to import, anyone have any insight? required c++ compilers seem to be installed on the docker container.
Can you provide log for the failed Task?
BTW: if you need build-essential, you can add it as the Task startup script:
apt-get install build-essential
How can I get the loaded model in the Preprocess class in ClearML Serving?
ComfortableShark77
You mean your preprocess class needs a Python package, or is it your own module?
So I see this in the build, which means it works and compiles. What is missing?
` Building wheels for collected packages: leap
Building wheel for leap (setup.py) ... [?25l- \ |
1667848450770 UH-LPT371:0 DEBUG / - \ | / - done
[?25h Created wheel for leap: filename=leap-0.4.1-cp38-cp38-linux_x86_64.whl size=1052746 sha256=1dcffa8da97522b2611f7b3e18ef4847f8938610180132a75fd9369f7cbcf0b6
Stored in directory: /root/.cache/pip/wheels/b4/0c/2c/37102da47f10c22620075914c8bb4a9a2b1f858263021...
Ohh that cannot be pickled... how would you suggest to store it into a file?
Hi DashingCentipede5
Notice that you called "start_locally": it tries to run the code locally inside your Jupyter notebook, and it assumes everything, including the code, already exists. Is that your case?
Hi SilkyFlamingo57
It is not taking a new pull from Git repository.
When you say it's not trying to get the latest, are you referring to a new run of the pipeline where the component being pulled is not the latest from the branch? Is that the issue?
When you click on the component Task details (i.e. right hand side panel "Full details"), what's the commit ID you have?
Lastly, is the component running on the same machine as the prev...
I think it's supposed to be out early Nov 🙂
Also, is there a way to reproduce this issue of not capturing the model?