that does happen when you create a normal local task, that's why I was confused
The parts that are not passed in both cases are the configurations from the conf file. Only the environment is passed (e.g. git, python packages, etc.). For example, if you have storage credentials in your conf file, they are not passed to a remote agent; instead, the credentials from the remote agent are used when it runs the task.
make sense?
Hi, I changed it to 1.13.0, but it still threw the same error.
This is odd. Just so we can make the agent better, any chance you can send the Task log?
you mean in the enterprise
Enterprise with the smarter GPU scheduler. This is an inherent problem of sharing resources, there is no perfect solution: you either have fairness, but then you get idle GPUs, or you have races, where you can get starvation.
If this is the case, then you have to set a shared PV for the pods, this way they can actually have a persistent cache, which would also be shared.
BTW: a single function call might not be a perfect match for a pipeline component; the overhead of starting a node might not be negligible, as it needs to install the required python packages, bring the code, etc.
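If a few lightweight steps are involved, grouping them into a single component lets that start-up cost be paid only once. A rough sketch using the PipelineDecorator (the function and the csv path are just placeholders):
from clearml.automation.controller import PipelineDecorator
@PipelineDecorator.component(return_values=["summary"], cache=True)
def preprocess_and_summarize(csv_path):
    # several small steps grouped into one component, so the node start-up
    # cost (venv creation, package install, code checkout) is paid only once
    import pandas as pd
    df = pd.read_csv(csv_path).dropna()
    return df.describe().to_dict()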
Thanks MinuteGiraffe30 , fix will be pushed later today
ShallowGoldfish8 how did you get this error?
self.Node(**eager_node_def)
TypeError: __init__() got an unexpected keyword argument 'job_id'
, but it seems like I can only trigger a task using a Task scheduler but not a pipeline.
@<1523701132025663488:profile|SlimyElephant79> Maybe we should state it better, but a Pipeline is "just" another type of Task, so triggering a Task with the Pipeline ID is essentially triggering the pipeline (do notice you need to select the "services" queue so that the pipeline runs on the correct resource). Make sense?
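For example, a rough sketch of scheduling an existing pipeline with the TaskScheduler (the pipeline ID and the schedule are placeholders):
from clearml.automation import TaskScheduler
scheduler = TaskScheduler()
# a pipeline is just another Task, so we schedule it by its Task ID
scheduler.add_task(
    schedule_task_id="<pipeline_task_id>",
    queue="services",  # the pipeline controller should run on the services queue
    hour=6, minute=0,  # e.g. run it daily at 06:00
    recurring=True,
)
scheduler.start_remotely(queue="services")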
TartSeal39 please let me know if it works, conda is a strange beast and we do our best to tame it.
Specifically when you execute manually on a conda env we collect (separately) the conda packages & the python packages (so later we can replicate on both conda & pip, or at least do our best)
Are you running both development env and agent with conda ?
GentleSwallow91 notice this part:
Hi Martin. Sorry - missed your reply.
Yeap, I am aware that docker_internal_mounts is inside the agent section.
'-v', '/tmp/ssh-XXXXXXnfYTo5/agent.8946:/tmp/ssh-XXXXXXnfYTo5/agent.8946', '-e', 'SSH_AUTH_SOCK=/tmp/ssh-XXXXXXnfYTo5/agent.8946',
It is creating a copy of the ssh folder and setting the SSH_AUTH_SOCK env to it. You can just map the entire ssh folder automatically by un-setting SSH_AUTH_SOCK before running the agent:
SSH_AUTH_SOCK= clearml-agent ...
we also provide a custom aux-config file. We also had to make sure to update the name inside config.pbtxt so that Triton is happy:
Good point, what would be the logic of the auto "config.pbtxt" patching we should employ?
Hi TartSeal39
So the thing is, the agent does not support a yaml env for conda. Currently, if the requirements section is empty, the agent will use the requirements.txt of the repo. We first need to add support for conda yaml, and then allow you to disable the auto requirements or push a specific yaml. Would that work? Also, is there a reason the auto package detection is not working?
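In the meantime, one possible workaround (a sketch, assuming a pip-style requirements.txt exists next to the code) is to point the Task at a specific requirements file before calling Task.init:
from clearml import Task
# use this requirements file instead of the automatically analysed packages
Task.force_requirements_env_freeze(requirements_file="requirements.txt")
task = Task.init(project_name="examples", task_name="conda-env-run")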
Could that be the proper way to install?
https://github.com/facebookresearch/pytorch3d/blob/main/INSTALL.md#3-install-wheels-for-linux
Also this message suggests that I can change the configuration, but as said I can't find it anywhere and wouldn't know how to change the configuration.
This means that you can launch a new one (i.e. abort, clone, edit, enqueue) directly from the web UI and edit the configuration in the UI. Unfortunately it does not support changing the configuration "live".
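The same abort/clone/edit/enqueue flow can also be scripted, roughly like this (the task ID, parameter name and queue are placeholders):
from clearml import Task
base = Task.get_task(task_id="<task_id>")
cloned = Task.clone(source_task=base, name="clone with new config")
cloned.set_parameter("General/learning_rate", 0.001)  # edit the draft's configuration
Task.enqueue(cloned, queue_name="default")  # an agent will pick it up from the queue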
MysteriousBee56 what do you mean "save Scalars on the machine"? All metrics are sent to the trains server. You can later retrieve them from code, if you need.
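For example, something along these lines should pull the reported scalars back from the server (the task ID is a placeholder):
from clearml import Task
task = Task.get_task(task_id="<task_id>")
scalars = task.get_reported_scalars()
# returns a dict of {title: {series: {"x": [...], "y": [...]}}}
print(scalars)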
Correct, but do notice that (1) task names are not unique and you can change them after the Task was executed, and (2) when you clone the Task, you can actually rename it. When an agent is running the Task, the init function is basically ignored, because the Task already exists. Make sense?
I'm not sure this is configurable from the outside
Thanks BattyLizard6, fix is on its way
Do you have a specific numpy version you are installing? Why is it trying to install the wheel from code?
Hi ConvolutedSealion94
Just making sure, did you also spin up the docker-compose of clearml-serving?
Hi BattyLizard6
Not that I'm aware of, which TF version are you using, and which clearml version?
"erasing" all the packages that had been set in the base task I'm cloning from. I
Set is not add; if you are calling set_packages, you are overwriting all of them with that single call.
You can however do:
# export the Task's stored definition, including the pip requirements
task_data = task.export_task()
requirements = task_data["script"]["requirements"]["pip"]
# the requirements are stored as requirements.txt-style text, so add new entries on their own lines
requirements += "\nmy-new-package"  # placeholder for the packages you want to add
task.set_packages(requirements.splitlines())
I guess we should have a get_requirements?!
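Until something like that exists, a tiny helper in user code could wrap export_task (get_requirements here is a hypothetical name, not part of the current API):
def get_requirements(task):
    # hypothetical helper: return the Task's stored pip requirements as a list of lines
    task_data = task.export_task()
    return (task_data.get("script", {}).get("requirements", {}).get("pip") or "").splitlines()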
Yea the "-e ." seems to fit this problem the best.
It seems like whatever I add to docker_bash_setup_script is having no effect.
If this is running with the k8s glue, the console output of the docker_bash_setup_script is currently not logged into the Task (this bug will be solved in the next version), but the code is being executed. You can see the full logs with kubectl, or test with a simple export in the docker_bash_setup_script, e.g. export MY...
What's the difference between the example pipeline and this code?
Could it be the "parents" argument? What is it?
give me a minute to test
Oh dear, I think your theory might be correct, and this is just the mongo preallocating storage.
Which means the entire /opt/trains just disappeared
Eg, I'm creating a task using clearml.Task.create, and often it doesn't get the git diff correctly,
ShakyJellyfish91 Task.create does not store any "git diff" automatically, is there a reason not to use Task.init?
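In other words, when the uncommitted changes matter, something like this (project/task names are placeholders) lets ClearML capture the repository state, including the git diff, automatically:
from clearml import Task
# Task.init inspects the current repository and stores the commit, branch and uncommitted diff
task = Task.init(project_name="examples", task_name="train with git diff")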
Hi ElegantCoyote26, yes I did
It seems comet_ml registers their default callback logger for you, that's it.
Hi WittyOwl57
I think what happens is it auto-logs the joblib load/save calls; these calls track models used/created by the code, and attach them to the model repository entries representing these models.
I'm assuming there are multiple load/save calls, and there are multiple model instances pointing to the same local file "file:///tmp/...". The warning basically says it is re-registering existing models.
Make sense?
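If the repeated registration is just noise, one option (a sketch; the exact framework key is an assumption to verify) is to switch off the joblib binding when initializing the Task:
from clearml import Task
task = Task.init(
    project_name="examples",
    task_name="no joblib auto-logging",
    auto_connect_frameworks={"joblib": False},  # keep the rest of the auto-logging enabled
)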
Hi TrickyRaccoon92
If you are reporting to tensor-board, then "iteration" equals step. Is this the case?
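For reference, reporting through the ClearML Logger uses the same notion, the iteration argument is the step (project/task names below are placeholders):
from clearml import Task
task = Task.init(project_name="examples", task_name="scalar report")
logger = task.get_logger()
for step in range(10):
    # "iteration" here corresponds to the TensorBoard global step
    logger.report_scalar(title="loss", series="train", value=1.0 / (step + 1), iteration=step)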