In theory yes, but in practice you will be using the same docker image for all the services, and they will never interfere with one another. You also have the option to do more sophisticated stuff, like mapping the file-server data for a clean-up service (should be out in a few days :)), so it is a balance. Also remember that, relatively speaking, docker containers are quite lightweight; this is not like running a VM per service...
Does what you suggested here >
Yes, it is basically the same underlying mechanism, only instead of 1-to-1 it's 1-to-many
trains-agent build --docker nvidia/cuda --id myTaskId --target base_env_services
It's building a gpu enabled docker...
You might want a different container, or to specify --cpu-only
... indicate the job needs to be run remotely? I'm imagining something like
clearml-task
and you need to specify the queue to push your Task into.
See here: https://clear.ml/docs/latest/docs/apps/clearml_task
Hmm I think the easiest is using the helm chart:
https://github.com/allegroai/clearml-server-helm-cloud-ready
I know there is work on a Terraform template, not sure about Istio.
Is helm okay for you ?
Oh that makes sense:
` # Create a child process using os.fork() method
pid = os.fork()
if pid > 0:
    # pid greater than 0 represents the parent process
    print("I am parent process:")
    print("Process ID:", os.getpid())
    print("Child's process ID:", pid)
else:
    # pid equal to 0 represents the created child process
    print("\nI am child process - this is still fully auto logged")
    print("Process ID:", os.getpid())
    print("Parent's process ID:", o...
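Since the snippet above is cut off, here is a minimal self-contained sketch of the same fork pattern using only the standard library. It assumes a POSIX system (os.fork is not available on Windows), the function name fork_and_report is made up for illustration, and it does not involve ClearML at all, so it says nothing about the auto-logging itself.

```python
import os
import sys


def fork_and_report():
    """Fork once. In the parent, wait for the child and return (child_pid, exit_code)."""
    # Flush first so buffered output is not duplicated into the child
    sys.stdout.flush()
    pid = os.fork()
    if pid > 0:
        # pid greater than 0: we are in the parent process
        print("I am parent process:", os.getpid(), "child:", pid)
        _, status = os.waitpid(pid, 0)
        return pid, os.WEXITSTATUS(status)
    # pid equal to 0: we are in the newly created child process
    print("I am child process:", os.getpid(), "parent:", os.getppid())
    sys.stdout.flush()
    # Exit immediately so the child never continues into the caller's code
    os._exit(0)


if __name__ == "__main__":
    child_pid, exit_code = fork_and_report()
    print("child", child_pid, "exited with", exit_code)
```

The child leaves through os._exit so that only the parent returns from the function.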
Hi JitteryCoyote63
could you check if the problem exists in the latest RC?
pip install clearml==1.0.4rc1
ComfortableShark77 it seems clearml-serving is trying to upload data to a different server (not download the model)
I'm assuming this has to do with the CLEARML_FILES_HOST, and missing credentials. It has nothing to do with downloading the model (which, as you posted, will come from the S3 bucket).
Does that make sense ?
Interesting, if this is the issue, a simple sleep after reporting should prove it. Wdyt?
BTW are you using the latest package? What's your OS?
WackyRabbit7 the auto detection only picks up packages you import directly (so that we do not end up with bloated venvs)
It seems that the transformers library does not list it as a requirement, otherwise it would have been pulled in...
In your code you can always do either:
import torch
or
Task.add_requirements('torch')
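As a rough sketch of the second option: Task.add_requirements needs to be called before Task.init, since that is when the requirements get recorded. The helper below is hypothetical (not part of clearml); it takes the Task class as an argument only so the ordering can be tried without a server, and the project and task names in the comment are placeholders.

```python
def init_with_requirements(task_cls, project_name, task_name, packages):
    """Declare extra packages, then initialize the task.

    task_cls is expected to be clearml.Task; it is passed in so the
    call ordering can be exercised without a ClearML server.
    """
    for package in packages:
        # Must be declared before init() is called
        task_cls.add_requirements(package)
    return task_cls.init(project_name=project_name, task_name=task_name)


# Typical use (requires the clearml package and a configured server):
#   from clearml import Task
#   task = init_with_requirements(Task, "examples", "transformers run", ["torch"])
```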
Do you want to open an issue in pip?
Funny enough this works in:
pip3 install "torch >=2.1.0.*, <2.1.1.*" --extra-index-url
@<1523701868901961728:profile|ReassuredTiger98> how did you install the nightly locally ?
Can you also provide the full log?
From the top
- trains-agent pulls a service Task
- The Task is marked as running, and the trains-agent worker points to the Task
- Docker is spun up
- The environment is installed inside the docker (results are shown in the service Task log)
- trains-agent inside the docker is launched, and a new node appears in the system, <host_agent_name>:service:<task_id>, with the service Task listed as running on it
- The main trains-agent is back to idle, and its worker now has no experiment listed as running
Where do you think it breaks?
Hi SubstantialElk6
What if I have OS library dependencies as well? (Apt install, rpm install...etc).
If these are OS libraries that you always need you can put them here:
https://github.com/allegroai/clearml-agent/blob/d9b9b4984bb8a83914d0ec6d53c86c68bb847ef8/docs/clearml.conf#L136
agent.extra_docker_shell_script: ["apt-get install -y bindfs", ]
In the next version, this could be controlled on a per Task basis.
FYI: the default apt packages that are installed:
` apt-get update
a...
Hi SquareFish25
Sure, here are a few:
HPO
https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
Pipeline
https://github.com/allegroai/trains/blob/master/examples/pipeline/pipeline_controller.py
Automation:
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
Is there still an issue? Could it be the browser cannot access the file server directly?
JuicyDog96 Yes please!
Let me check what's the status with the docs repository, and I'll get back to you soon :)
Ohh, two options:
From the script itself you can do:
from clearml import Task
task = Task.init(...)
task.execute_remotely(queue_name='default')
Then run the script locally; it will run until the execute_remotely call, quit the process, and re-launch it on the "default" queue.
Option B:
Use clearml-task
$ clearml-task --folder <where the script is> --project ...
See https://github.com/allegroai/clearml/blob/master/docs/clearml-task.md#launching-a-job-from-a-local-script
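If you end up driving Option B from Python, something like the sketch below could assemble the command line. The helper name and all the values (project, task name, folder, script) are placeholders for illustration; it only builds the argument list and does not run anything, and the flags used are the ones described in the clearml-task docs linked above.

```python
import shlex


def build_clearml_task_command(project, name, folder, script, queue):
    """Return the argv list for a clearml-task invocation (nothing is executed)."""
    return [
        "clearml-task",
        "--project", project,
        "--name", name,
        "--folder", folder,   # local folder containing the code
        "--script", script,   # entry point inside that folder
        "--queue", queue,     # queue the new Task is pushed into
    ]


if __name__ == "__main__":
    argv = build_clearml_task_command(
        "examples", "remote run", "./my_project", "train.py", "default"
    )
    # Print a shell-safe version of the command for inspection
    print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run(argv, check=True) once clearml is installed and configured.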
Hi ShallowArcticwolf27
However, the AMI for version 0.16.1 has the following docker-compose file
I think we moved the docker-compose yaml when we upgraded from trains to clearml. Any reason you are installing the old docker-compose?
Hi MortifiedDove27
I think you can resize the plot area in the UI (try to drag the horizontal separator)
is removed from the experiment list?
You mean archived ?
Yes, I think the writer.add_figure somehow crops the image
or by trains
We just upload the image as is ... I think this is a SummaryWriter issue