Hi PunyPigeon71
Can you send the log from the remote execution?
Can you see on the Task in the UI, under the Execution tab, the correct git repo reference, commit ID, and uncommitted changes?
I am actually saving a dictionary that contains the model as a value (+ training datasets)
How are you specifically doing that? pickle?
DeliciousBluewhale87
node.base_task_id is the base task, which will always be in draft mode. Instead we should use node.executed, which references the currently executed node.
YES, maybe we should add that to the example, so it is clearer? WDYT?
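For context, a minimal sketch of reading the executed task from a node (the callback name and signature here are illustrative, not the exact pipeline API):

from clearml import Task

def on_step_completed(pipeline, node):
    # node.base_task_id -> the draft template task (always in draft mode)
    # node.executed     -> ID of the task instance that actually ran
    executed_task = Task.get_task(task_id=node.executed)
    print(executed_task.get_status())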
Hi DeliciousBluewhale87
Yes, that should have worked. Can you verify the task status?
print(Task.get_task(...).get_status())
Thanks for answering, Yes, this is exactly what I wanted
Hmm, should be possible. How slow is the update that we want to save the time on?
It is currently only enabled when using ports mode; it should be enabled by default, i.e. a new feature :)
ETA for the next release is end of the month/early March, it is planned to include many other improvements 🙂
mostly by using Task.create instead of Task.init.
UnevenDolphin73, now I'm confused. Task.create is not meant to be used as a replacement for Task.init; it exists so you can manually create an additional Task (not the current process Task). How are you using it?
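To illustrate the difference (project and task names here are made up):

from clearml import Task

# Task.init attaches to the *current* process and tracks this run:
task = Task.init(project_name="examples", task_name="current run")

# Task.create only registers an additional, separate Task entry
# (e.g. a draft to enqueue later); it does not track this process:
extra = Task.create(project_name="examples", task_name="extra task")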
Regarding the second - I'm not doing anything per se. I'm running in offline mode and I'm trying to create a dataset, and this is the error I get...
I think the main thing we need to...
Great to hear SourSwallow36, contributions are always appreciated 🙂
Regarding (3), MongoDB was not built for large-scale logging; Elasticsearch, on the other hand, was built and designed to log millions of reports and give you the ability to search over them. For this reason we use each DB for what it was designed for: MongoDB to store the experiment documents (a.k.a. env, meta-data etc.) and Elasticsearch to log the execution outputs.
Also, I would like to add some other plots t...
Hi SourSwallow36
- The same docker image is used for all three jobs, just because it is easier to manage and faster to download. The full code is available on the trains-server GitHub. If you want to spin up the containers manually, check the docker-compose.yml in the main repo; it has all the commands there.
- Fork the trains-server, commit the changes and don't forget to PR them ;)
- Elasticsearch is a database; we use it to log all the experiment outputs, console logs, metrics etc. This...
Hi LovelyHamster1
You mean when used as a section name or a variable?
Could you change this example to include a variable that breaks the support?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/hydra/hydra_example.py
For example, could you test if this one works:
https://github.com/allegroai/clearml/blob/master/examples/frameworks/hydra/hydra_example.py
The second seems like a botocore issue:
https://github.com/boto/botocore/issues/2187
We actually added a specific call to stop the local execution and continue remotely, see it here: https://github.com/allegroai/trains/blob/master/trains/task.py#L2409
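The call in question is presumably Task.execute_remotely; a minimal sketch (the queue name is an example):

from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")
# ... quick local setup / debug iterations ...
# stop executing locally and enqueue the task to continue on an agent:
task.execute_remotely(queue_name="default", exit_process=True)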
pytorch DDP
With what backend? Gloo? NCCL? OpenMPI?
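For reference, the backend is whatever was passed when the process group was initialized; a minimal sketch assuming the usual environment-variable rendezvous:

import torch.distributed as dist

# common DDP backends: "nccl" (multi-GPU), "gloo" (CPU / fallback), "mpi"
dist.init_process_group(backend="nccl", init_method="env://")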
Yeah, I am getting logs, but they are extremely puzzling to me. I would appreciate actually having access to the whole package structure...
The actual packages are updated back into the "Installed Packages" section (under the Execution tab).
Indeed. Can you maybe point to where the docker command is composed?
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/clearml_agent/commands/worker.py#L3694
🙂
BTW: you can run/build the entire thing on your machin...
the first runs perfectly fine,
Just making sure, running in an agent?
the second crashes
Running inside the same container as the first one?
ReassuredTiger98
How can I make clearml-agent use the pre-installed version from the nvidia/pytorch container?
If the same version is required, the agent will not try to reinstall it (the new venv the agent creates inside the container inherits from the preinstalled system packages).
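This behavior maps to the agent's package manager settings in clearml.conf; a sketch of the relevant option (check your config for the exact defaults):

agent {
    package_manager {
        # when true, the venv the agent creates inherits the
        # container's preinstalled system packages
        system_site_packages: true,
    }
}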
Comes with PyTorch version 1.12 based on a commit. I tried torch >= 1.11, torch == 1.12.
If in your installed packages you have torch==1.12, the agent should not tr...
Thanks JitteryCoyote63, let me double-check if there is a reason for that (there might be one, not sure)
ReassuredTiger98 quick update, the issue was located, next RC will already contain a fix.
In the meantime, you can avoid it by limiting the pip version:
https://github.com/allegroai/clearml-agent/blob/715f102f6d98a44131d5bee909ee779b456c6229/docs/clearml.conf#L67
pip_version: "<20.2"
Hi MuddySquid7
You can only add reports (scalars, plots etc.), though not to a published Task.
If you want to add an artifact, this should work:
prev_task = Task.get_task(task_id='112233')
prev_task.mark_started(force=True)
prev_task.reload()
prev_task.upload_artifact(..., wait_for_upload=True)
prev_task.mark_stopped(force=True)
thought the agent created a new conda env and installed all packages
It does, but I was asking what is written on the original Task (the one created when you executed the code on your laptop, not when the agent executed it). When the agent executes the Task, it writes back all the packages of the entire venv it created; when the Task is run manually, it lists only the packages you import directly (i.e. from package ... or import package; it actually analyses the code).
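A hypothetical illustration of that direct-import analysis: running this script manually would list only numpy and pandas, not their transitive dependencies:

import numpy                    # detected -> "numpy" is listed
from pandas import DataFrame    # detected -> "pandas" is listed
# packages pulled in indirectly by pandas are not listed on a manual run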
My point...
Oh no, you are absolutely correct, it is broken (I mean I have no idea why it lists Hydra, or how it got there). I will let the guys know and fix it.
Bottom line, after you clone it, please edit the installed packages: remove the "Hydra" line and replace it with just "hydra-core" (no need for a version).
The format is the same as "requirements.txt" and will affect the venv created by the agent.
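For illustration, the edit in the Installed Packages box would look something like this (the second line is a made-up package, just to show other lines stay untouched):

before:
Hydra
numpy==1.21.0

after:
hydra-core
numpy==1.21.0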
I found the issue, the first run it jumps over the first day (let me check if we can quickly fix that)
Looking at the supervisor method of the base AutoScaler class, where are the worker IDs kept?
Is it in the class attribute queues?
Actually the supervisor passes a fixed prefix, then asks the clearml-server for workers whose names start with this prefix.
This way we can have a fixed init script for all agents, while still being able to differentiate them from the other agent instances in the system. Make sense?
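A rough sketch of that idea using the APIClient (the prefix value is made up, and the exact filtering the supervisor performs may differ):

from clearml.backend_api.session.client import APIClient

client = APIClient()
prefix = "aws_autoscaler:"  # hypothetical fixed prefix set at spin-up
# keep only the workers this autoscaler spun up:
workers = [w for w in client.workers.get_all() if w.id.startswith(prefix)]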
OutrageousSheep60
I found the task in the UI, and in the UNCOMMITTED CHANGES section (under the Execution tab) it shows: No changes logged
This is the issue.
and then run the session via docker:
clearml-session --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 \
    --packages "clearml" "tensorflow>=2.2" "keras" \
    --queue MY_QUEUE \
    --verbose
Are you running the "clearml-session" from your machine (i.e. not from inside a docker)?...
Sorry, the point where you select the interpreter for PyCharm.
Oh I see...