I think the main risk is that ClearML upgrades to MongoDB vX.Y, Mongo changes the API (which they did because of Amazon), and then the API calls (i.e. the Mongo driver) stop working.
Long story short, I would not recommend it 🙂
Hi MinuteWalrus85
This is a great question, and super important when training models. This is why we designed a whole system to manage datasets (including storage querying, balancing data, and caching). Unfortunately this is only available in the paid tier of Allegro... You are welcome to contact the sales guys at https://allegro.ai/enterprise/
🙂
Hmm, you are missing the entry point in the execution (script path).
Also, as I mentioned, you can either have a git repo or a script in the uncommitted changes, but not both (if you have a git repo, then the uncommitted changes are the git diff)
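Just to illustrate the two setups, a rough sketch (everything here is a placeholder, not from this thread) of creating a Task that points at a git repo plus an entry-point script:
from clearml import Task

task = Task.create(
    project_name="examples",                         # placeholder project
    task_name="remote run",                          # placeholder name
    repo="https://github.com/example/project.git",   # git repo; its diff becomes the uncommitted changes
    branch="main",
    script="train.py",                               # entry point (script path) relative to the repo root
)
A standalone script without a repo would instead end up stored as the uncommitted changes.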
DeterminedToad86
Yes, I think this is the issue: on SageMaker a specific compiled version of torchvision was installed (probably as part of the image)
Edit the Task (before enqueuing) and replace the torchvision URL entry with: torchvision==0.7.0
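If you prefer doing it from code rather than the UI, something along these lines should work (a sketch, assuming a recent clearml where Task.set_packages() is available; the task id is a placeholder):
from clearml import Task

task = Task.get_task(task_id="<cloned_task_id>")   # the cloned Task, before enqueuing
# note: set_packages() replaces the whole requirements list, so include everything you need
task.set_packages(["torchvision==0.7.0"])
Task.enqueue(task, queue_name="default")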
Let me know if it worked
Hi JitteryCoyote63
experiments logs ...
You mean the console outputs ?
In the case of scalars it is easy to see (the maximum number of iterations is a good starting point)
BTW: UnevenDolphin73 you should never actually do "task = clearml.Task.get_task(clearml.config.get_remote_task_id())"
You should just do "Task.init()"; it will automatically pick up the "get_remote_task_id" and do all sorts of internal setup, and you will end up with the same object, but in an orderly fashion
Yes, even without any arguments given to Task.init(), it has everything from the server
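In other words, the recommended pattern is simply this (project/task names below are just placeholders):
from clearml import Task

# When executed by a clearml-agent, Task.init() picks up the remote task id by
# itself and returns the existing Task, with all the internal hooks set up.
task = Task.init(project_name="examples", task_name="my training")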
Thanks JitteryCoyote63 !
Any chance you want to open a GitHub issue with the exact details, or a fix with a PR ?
(I just want to make sure we fix it as soon as we can 🙂 )
tf datasets is able to handle batch downloading quite well.
SubstantialElk6 I was not aware of that; I was under the impression tf datasets are accessed at the file level, no?
Hi @<1657918706052763648:profile|SillyRobin38>
I have included some print statements
you should see those under the Task of the inference instance.
You can also do:
import clearml
...
def preprocess(...):
    clearml.Logger.current_logger().report_text(...)
    clearml.Logger.current_logger().report_scalar(...)
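For example, a slightly fuller sketch (the function body and argument names are illustrative; report_scalar() needs a title, series, value and iteration):
import clearml

def preprocess(body: dict) -> dict:
    logger = clearml.Logger.current_logger()
    logger.report_text("preprocess called")
    logger.report_scalar(title="inference", series="requests", value=1, iteration=0)
    return body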
, specifically within the containers where the inferencing occurs.
it might be that fastapi is capturing the prints...
https://github.com/tiangolo/uvicor...
Yes, there was a bug where it was always cached; just upgrade the clearml package: pip install git+
No worries, just wanted to make sure it doesn't slip away 🙂
But this will require some code changes...
Is trains-agent using docker-mode or virtual-env ?
Hmm, I guess we should state that better; I'll pass it on 🙂
UnevenDolphin73 something like this one?
https://github.com/allegroai/clearml/pull/225
IrritableJellyfish76 point taken, suggestions on improving the interface ?
using caching where specified but the pipeline page doesn't show anything at all.
What do you mean by " the pipeline page doesn't show anything at all."? are you running the pipeline ? how ?
Notice that PipelineDecorator.component needs to be top level, not nested inside the pipeline logic, like in the original example
@PipelineDecorator.component(
    cache=True,
    name=f'append_string_{x}',
)
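Something like this (a minimal sketch with made-up names), with the component defined at module level and only called from inside the pipeline logic:
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(cache=True, name='append_string')
def append_string(x: str) -> str:
    # defined at the top level of the module, not inside the pipeline function
    return x + '_suffix'

@PipelineDecorator.pipeline(name='example pipeline', project='examples', version='0.1')
def pipeline_logic():
    # the pipeline logic only calls the component, it does not define it
    print(append_string('hello'))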
Hi IrritableGiraffe81
You can access the model object with task.models['output']
To set the model metadata I would recommend making sure you have the latest clearml package; I think this is a relatively new addition
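A rough sketch of what that looks like (the task id is a placeholder, and set_metadata() assumes a recent clearml version):
from clearml import Task

task = Task.get_task(task_id="<your_task_id>")
model = task.models['output'][-1]               # the last reported output model
model.set_metadata("input_size", "3x224x224")   # key/value here are just examples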
Hi SubstantialElk6
where exactly in the log do you see the credentials ?
/tmp/.clearml_agent.234234e24s.cfg
What's the exact setup ? (I mean, are you using the glue? If that's the case, I think the temp config file is only created inside the pod/docker, so upon completion it will be deleted alongside the pod.)
IrritableJellyfish76 hmm, maybe we should add an extra argument partial_name_matching=False to maintain backwards compatibility?
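Something along these lines (illustrative only; partial_name_matching is the proposed argument from this thread, not an existing parameter):
from clearml import Task

tasks = Task.get_tasks(
    project_name="examples",
    task_name="my_exact_task_name",
    # partial_name_matching=False,  # proposed: require an exact match instead of partial matching
)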
What are you seeing in the Task that was cloned (i.e. the one the HPO created not the original training task)?
By that I mean the configuration section: do you have the Args there ? (seems like you do from the pic you attached, but I just want to make sure)
Also, in the train.py file, do you also have Task.init ?
This looks exactly like the timeout you are getting.
I'm just not sure what the difference is between the Model auto-upload and the manual upload.
When exactly are you getting this error ?
CheerfulGorilla72
yes, IP-based access,
hmm, so this is the main downside of using an IP-based server: the links (debug images, models, artifacts) store the full URL (e.g. http://IP:8081/... ). This means that if you switch IPs, they will no longer work. Any chance to fix the new server to the old IP?
(the other option is to somehow edit the DB with the links; I guess doable, but quite risky)
I'm running agent inside docker.
So this means venv mode...
Unfortunately, right now I can not attach the logs, I will attach them a little later.
No worries, feel free to DM them if you feel this is too much to post here
Nicely done DeterminedToad86 🙂
Wasn't this issue resolved by torch?
Hmm, not a bad idea 🙂
Could you please open a GitHub issue, so it will not get forgotten ?
(BTW: I'm not sure how trivial it is to implement, nonetheless it's obviously possible 🙂 )