send the agent's logs to log management and monitoring service
These are stored in ELK; it was built to store large amounts of logs, so I cannot see any reason why one would want to remove it?
Maybe if there were a way to change their format, it could also help with filtering them on my side.
You mean in the UI?
DeterminedToad86 I suspect that since it was executed on SageMaker, it registered a specific package that is unique to SageMaker (not to worry, installed packages can be edited after you clone/reset the Task)
no need for it actually
BTW: just making sure, 74 was not supposed to be the last checkpoint (in other words, it is not stuck on exiting the training process, but is actually in the middle of it)?
btw, I launch the agent daemon outside docker (with --docker), that's the way it is supposed to work, right?
Yep that should work
is it?
Yes, the easiest is an os.environ call before the import.
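A minimal sketch of that pattern (the specific variable and values below are placeholders, pick whichever setting you actually need):

```python
import os

# Must be set before the clearml import, so it is picked up when the SDK loads its configuration
os.environ["CLEARML_FILES_HOST"] = "https://files.clear.ml"  # placeholder value

from clearml import Task

task = Task.init(project_name="examples", task_name="env-config-example")
```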
Regarding azure blob
General Azure env vars should work because they configure the underlying Azure SDK, but I would double check
Generally speaking
Generic Override Format
ClearML allows you to override any config entry using this format:
```bash
CLEARML__<section>__<key>=<value>
```
Double underscores (__) separate the hierarchy levels.
All keys and values are treated as strings.
This works for nested entries in clearml.conf.
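For example, overriding the sdk.development.default_output_uri entry of clearml.conf (the value here is just a placeholder):

```bash
export CLEARML__sdk__development__default_output_uri="s3://my-bucket/clearml"
```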
clearml-agent daemon --detached --queue manual_jobs automated_jobs --docker --gpus 0
If the user running this command can run "docker run", then you should be fine
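A quick sanity check (any trivial image will do):

```bash
# If this runs without sudo, the same user can launch the agent's docker tasks
docker run --rm hello-world
```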
might it be related to the docker socket not being mounted to the agent daemon running inside a docker container?
Oh yes, if the daemon is running inside a docker container then you need both --privileged and mounting of the docker socket to get it to work
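Roughly along these lines; the image name, credentials and remaining flags are placeholders, the point here is only the --privileged flag and the socket mount:

```bash
docker run -d --privileged \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -e CLEARML_API_HOST="https://api.clear.ml" \
    -e CLEARML_API_ACCESS_KEY="<key>" \
    -e CLEARML_API_SECRET_KEY="<secret>" \
    allegroai/clearml-agent   # placeholder image, adjust to your own setup
```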
Hi SkinnyPanda43
No idea what the ImageId actually is.
That's the AMI image string that the new EC2 instance will be started with, makes sense?
Hi, is there a possibility to use one GPU card with 2 agents concurrently?
RoundMosquito25 / EnviousPanda91
You need to change the WORKER_ID (no two workers can share the same ID):
```bash
CLEARML_WORKER_ID="machine:gpu01" clearml-agent daemon ...
```
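For the two-agents-on-one-GPU question above, that would look roughly like this (queue names and worker IDs are made up):

```bash
# Two agents pulling from different queues, both bound to GPU 0,
# each with its own unique worker ID
CLEARML_WORKER_ID="machine:gpu0_a" clearml-agent daemon --queue queue_a --gpus 0 --detached
CLEARML_WORKER_ID="machine:gpu0_b" clearml-agent daemon --queue queue_b --gpus 0 --detached
```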
But I do not have anything linked correctly, since I rely on conda installing cuda/cudnn for me
From the log it installed: cudatoolkit==11.1.1
based on the CUDA it found on the host machine: agent.cuda_version = 110
But for some reason it installed PyTorch from the conda "pytorch" channel without CUDA support.
Hi SmarmyDolphin68
Maybe the plot_report can help?
See here:
https://github.com/allegroai/trains/blob/a28a97b16067fd5c548ec73b061badde2515aa9f/examples/reporting/pandas_reporting.py#L32
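If it helps, here is a short sketch of reporting a pandas DataFrame as a table, in the spirit of the linked example (project/task names and data are made up):

```python
import pandas as pd
from clearml import Task

task = Task.init(project_name="examples", task_name="pandas table reporting")

df = pd.DataFrame({"metric": ["accuracy", "loss"], "value": [0.92, 0.13]})

# report_table accepts a pandas DataFrame via the table_plot argument
task.get_logger().report_table(
    title="results", series="run 1", iteration=0, table_plot=df
)
```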
Hi @<1576381444509405184:profile|ManiacalLizard2>
You can also use env vars, it might be easier; I'm assuming this is some kind of CI/CD process
```bash
export CLEARML_API_ACCESS_KEY="your-public-key"
export CLEARML_API_SECRET_KEY="your-private-secret"
export CLEARML_API_HOST="https://api.clear.ml"
export CLEARML_WEB_HOST="https://app.clear.ml"
export CLEARML_FILES_HOST="https://files.clear.ml"
```
[https://clear.ml/do...
Have to get the glue set up, which I couldn't fully understand, so that's a different topic
I suggest using the apply-template setup (basically you provide a Job/Service template, and it uses that to set up k8s jobs based on the Tasks coming in from the specific queue)
FiercePenguin76
So running the Task.init from the jupyter-lab works, but running the Task.init from the VSCode notebook does not work?
Well it is there, do you have it in your docker-compose as well?
https://github.com/allegroai/trains-server/blob/master/docker-compose.yml#L55
ReassuredTiger98 in theory it should work, do you know what is actually stored? (I mean, re-encoding it means you have to have opencv/ffmpeg, which might be too much to ask)
It seems to try to pull with SSH credentials; add your user/pass (or better, an API key) to the clearml.conf
(look for git_user / git_pass)
Should solve the issue
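For reference, a sketch of the relevant section of the clearml.conf used by the agent (values are placeholders):

```
agent {
    # Git credentials used when the agent clones the repository
    git_user: "my-git-username"
    git_pass: "my-personal-access-token"
}
```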
Hi PanickyFish98
It verifies it has access to it when actually creating the Task, maybe it should be a warning?!
fyi: you can also change the value from the UI (under Execution output) or have a default one set in the clearml.conf used by the agent
So the "packages" are the packages you need in the steps themselves ?
GentleSwallow91 notice that on the Task you have "Installed Packages"; this is the equivalent of requirements.txt. You can edit it and add a missing package, or programmatically add it in code (though usually directly imported packages are automatically registered, how come this one is missing?)
To add a package in code:
```python
Task.add_requirements(package_name="my_package", package_version=">=1")
task = Task.init(...)
```
base docker image but clearML has not determined it during the script ru...
These paths are pathlib.Path. Would that be a problem?
No need to worry, it should work (I'm assuming "/src/clearml_evaluation/" actually exists on the remote machine, otherwise it's useless 🙂)
Merged, is it working for you now?
It should be the last line (or almost) of the log. Is it there? Also, it seems from the log that you are using trains 0.14.3; try with trains 0.15, and let me know if you are still missing packages
BTW: latest PyCharm plugin with 2022 support was just released:
https://github.com/allegroai/clearml-pycharm-plugin/releases/tag/1.1.0
Actually this is the default behavior for any multi-node training framework (torch DDP / openmpi, etc.)