Hmm yes this is exactly what should not happen 🙂
Let me check it
nice @<1724960458047229952:profile|EnergeticKoala33> !
The issue was that the agent was trying to start the Docker container but had no credentials to do so; your solution is exactly what was needed.
Or am I forced to do a get, check if the latest version is finalized,
A Dataset must be finalized before using it. The only situation where it is not is when you are still in the "upload" state.
, then increment that version and create my new version?
I'm assuming there is a data processing pipeline pushing new data?! How do you know you have new data to push?
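If it helps, here is a minimal sketch of the flow being described (project/dataset names and the data folder are hypothetical placeholders): get the latest dataset version, verify it is finalized, then create a new child version on top of it.
```python
# Hedged sketch of the "get latest -> check finalized -> create new version" flow.
# Project and dataset names below are hypothetical placeholders.
from clearml import Dataset

latest = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
if not latest.is_final():
    raise RuntimeError("Latest dataset version is still uploading / not finalized")

# Create the next version as a child of the latest one
new_version = Dataset.create(
    dataset_project="my_project",
    dataset_name="my_dataset",
    parent_datasets=[latest.id],
)
new_version.add_files("path/to/new_data")  # hypothetical folder with the new data
new_version.upload()
new_version.finalize()
```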
One issue that I see is that the Dockerfile inside the agent container
Not sure I follow, these are settings for the default container to be used when the agent spins a Task for you.
How are you running the agent itself ?
Hi VexedCat68
Yes the serving is a bit complicated. Let me try to explain the underlying setup, before going into more details.
clearml-serving CLI is a tool to launch / set up (it does the configuration and enqueuing, not the actual serving).
Control plane Task -> stores the state of the serving (i.e. which endpoints need to be served, what models are used, collects stats). This Task has no actual communication with the serving requests/replies (running on the services queue).
Serving Task...
Hi SubstantialElk6
I think you are absolutely correct, it seems the glue pops all the arguments, when in fact it should maybe process them and convert the --env/-e
What do you think?
Also, I assume that if these are the default arguments, they should actually be part of the k8s apply.yaml template, no?
CheerfulGorilla72
yes, IP-based access,
Hmm, so this is the main downside of using an IP-based server: the links (debug images, models, artifacts) store the full URL (e.g. http://IP:8081/...). This means that if you switch the IP they will no longer work. Any chance to fix the new server to the old IP?
(the other option is somehow edit the DB with the links, I guess doable but quite risky)
UpsetTurkey67 are you saying there is a sym link in the original repository, and when it copies it, it breaks the symlink ?
Are they expanded in the "api_server" ? (I verified on a linux machine, same error, the env in the api_server is not being resolved)
Thanks!
I think this one will cover both cases (the issue is with files on the root of the dataset):
`if not (fnmatch(k, path) and fnmatch(k if '/' in k else '/{}'.format(k), '*/' + wildcard))}`
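For context, a small self-contained sketch of that condition (the file list, path and wildcard are made up) showing how prepending '/' lets root-level files match the '*/<wildcard>' pattern:
```python
from fnmatch import fnmatch

# Hypothetical dataset file list; 'a.png' sits at the root, the others in subfolders.
files = {"a.png": 1, "train/b.png": 2, "train/c.txt": 3}
path = "*"          # hypothetical path pattern
wildcard = "*.png"  # hypothetical wildcard to filter out

kept = {
    k: v for k, v in files.items()
    if not (fnmatch(k, path) and fnmatch(k if '/' in k else '/{}'.format(k), '*/' + wildcard))
}
print(kept)  # only 'train/c.txt' survives; the root-level 'a.png' is matched as well
```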
not really 😞
Why would you want to set it up manually? It makes sense to have it in the cache folder, no?
JitteryCoyote63 maybe this is an old example of the PyTorch DDP code? It is basically copy-pasted from the PyTorch website:
https://pytorch.org/tutorials/intermediate/dist_tuto.html
I could take a look and figure that out.
This will greatly accelerate integration 😉
They don't give an in app notification.
Oh I see, I assume this is because the GitHub account is not connected to any email, so no invite is sent.
Basically they should just be able to re-login, and then they can switch to your workspace (with the link you generated).
Hi, I changed it to 1.13.0, but it still threw the same error.
This is odd, just so we can make the agent better, any chance you can send the Task log ?
So you are uploading a local file (stored in a Dataset) into GS bucket? may I ask why ?
Regarding usage (I might have a typo but this is the gist):
`StorageManager.upload_file(local_file=separated_file_posix_path, remote_url=remote_file_path + separated_file_posix_path.relative_to(files_rgb))`
Notice that you need to provide the full upload URL (including the path and file name to be used on your GS storage).
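To make the "full upload URL" point concrete, here is a hedged, self-contained sketch (the bucket name and local paths are hypothetical); the remote_url includes the destination path and the file name itself:
```python
# Hedged sketch: the remote_url passed to StorageManager.upload_file must be the
# complete destination URL, including the target path and file name.
# Bucket name and local paths below are hypothetical.
from pathlib import Path
from clearml import StorageManager

files_rgb = Path("/data/files_rgb")                      # hypothetical dataset root
separated_file_posix_path = files_rgb / "scene_01/frame_0001.png"
remote_file_path = "gs://my-bucket/rgb/"                 # hypothetical GS destination prefix

StorageManager.upload_file(
    local_file=str(separated_file_posix_path),
    remote_url=remote_file_path + str(separated_file_posix_path.relative_to(files_rgb)),
)
```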
I should manually copy it to the remote services agents?
The code itself needs to run somewhere; currently this has to be your machine, either you manually run the AWS autoscaler or an agent runs it for you. Make sense ?
Hi SquareFish25
Sure, here are a few:
HPO
https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
Pipeline
https://github.com/allegroai/trains/blob/master/examples/pipeline/pipeline_controller.py
Automation:
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
Hi SkinnyPanda43
Yes, I think you are right, the documentation might be missing it. I'll make sure they know 🙂
In the meantime: `task.update_output_model`
https://github.com/allegroai/clearml/blob/d3929033c016476c580557639ff44f900e65904a/clearml/backend_interface/task/task.py#L734
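A minimal usage sketch, assuming a recent clearml SDK where update_output_model accepts the weights file path (the file and model names below are hypothetical):
```python
# Hedged sketch: manually registering a weights file as the task's output model.
# Assumes a recent clearml SDK where update_output_model accepts model_path;
# file name and model name are hypothetical.
from clearml import Task

task = Task.init(project_name="examples", task_name="manual output model")
# ... training code that writes my_model.pt ...
task.update_output_model(model_path="my_model.pt", name="my model")
```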
(without having to execute it first on Machine C)
Someone somewhere has to create the definition of the environment...
The easiest way to go about it is to execute it once.
You can add the following line to your code:
`task.execute_remotely(queue_name='default')`
This will cause your code to stop running and enqueue itself on a specific queue.
Quite useful if you want to make sure everything works (like running a single step), then continue on another machine.
Notice that switching between cpu...
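Putting it together, a short sketch of that flow (the project, queue name and train() function are hypothetical):
```python
# Hedged sketch of the execute_remotely flow described above
# (project, queue name and train() are hypothetical).
from clearml import Task


def train():
    print("training on the remote machine")


task = Task.init(project_name="examples", task_name="remote execution demo")

# Runs locally up to this point (environment, repo and config are captured),
# then the local process stops and the Task is enqueued on "default".
task.execute_remotely(queue_name="default")

# Only executed when an agent pulls the Task from the queue.
train()
```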
JitteryCoyote63 I think I failed explaining myself.
- I think the problem with the controller is that you are interacting (aka changing hyperparameters) with a Task created using the new SDK version, from an older SDK version. Specifically, we added section names to the hyperparameters, and only the new version of the SDK is aware of them. Make sense?
- Regarding the actual problem: it seems like this is somehow related to the first one, the task at run time is using an older SDK version, and I t...
JitteryCoyote63 What did you have in mind?
WickedGoat98 sorry, I missed the thread...
that the trains.conf has to be located on the node running the trains-agent.
Correct 🙂
The easiest way to check is to see if you can curl to the ip:port from inside the docker.
If it fails, it is probably the wrong IP.
The IP you need to use is the IP of the machine running the docker-compose (not the IP of the docker container inside that machine).
Make sense ?
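If curl is not handy inside the container, a rough Python equivalent of that check (the IP below is a placeholder for the docker-compose host):
```python
# Hedged sketch of the reachability check described above; the IP is a placeholder.
import requests

server_ip = "192.168.1.10"  # replace with the IP of the machine running docker-compose
for port in (8080, 8008, 8081):  # web server, API server, files server
    try:
        r = requests.get(f"http://{server_ip}:{port}", timeout=5)
        print(port, "reachable, HTTP", r.status_code)
    except requests.RequestException as err:
        print(port, "NOT reachable:", err)
```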
link with "localhost" in it Oo
Hmm, I think this is the main issue: for some reason the dataset's default upload destination is "localhost". What do you have configured in your clearml.conf under files_server?
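As a side note, one way to bypass the default destination entirely (not necessarily the fix here) is to pass an explicit output_url when uploading the dataset; the names and bucket below are hypothetical:
```python
# Hedged sketch: overriding the default upload destination per dataset,
# instead of relying on files_server from clearml.conf. Names are hypothetical.
from clearml import Dataset

ds = Dataset.create(dataset_name="my_dataset", dataset_project="examples")
ds.add_files("data/")
ds.upload(output_url="s3://my-bucket/datasets")  # or your files server, e.g. http://<server-ip>:8081
ds.finalize()
```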
I think I found something,
https://github.com/allegroai/clearml/blob/e3547cd89770c6d73f92d9a05696018957c3fd62/clearml/storage/helper.py#L1442
What's the boto version you have installed?
When exactly are you getting this error ?
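For reference, a quick way to check which boto3 version is installed:
```python
# Quick check of the installed boto3 version (what the question above asks about).
import boto3

print(boto3.__version__)
```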
I imagine that these phantom dependencies will prevent parallelization. Is there a workaround?
Yes, they might... a workaround (a bit ugly) would be copy-pasting the functions and changing the names.
BTW: I'll check when the next RC is scheduled; maybe it will already contain a fix 🤞