Hi @<1795626098352984064:profile|SoggyElk61>
Were you able to pass the ClearMLVisBackend line in your code?
This needs to be added before your actual code
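If you are configuring visualization through an mmengine-style config (an assumption; the thread does not say which config system you use), the backend is usually registered in the visualizer entry, roughly like the sketch below. The `Visualizer` type and the extra `LocalVisBackend` entry are illustrative choices, not something confirmed in this thread.

```python
# Minimal sketch of an mmengine-style config fragment (assumes mmengine with
# ClearMLVisBackend is installed). This entry has to be in place before the
# training code builds the runner, otherwise nothing is forwarded to ClearML.
visualizer = dict(
    type="Visualizer",  # illustrative; downstream repos may use their own visualizer type
    vis_backends=[
        dict(type="LocalVisBackend"),    # keep local logging as well
        dict(type="ClearMLVisBackend"),  # forward scalars/images to ClearML
    ],
)
```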
My main question is: do I wait until I have a sufficient batch size, or do I just send each image for training as soon as it comes in?
This is usually a cost-optimization question. Generally speaking, if GPU uptime is not an issue, then since the process is stochastic anyhow, whether or not you wait for a full batch is not the most important factor (unless you use a batchnorm layer, in which case batching is basically a must).
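A minimal sketch of the "wait for a batch" option, assuming your images arrive as an iterable stream (the function name is hypothetical). With `batch_size=1` it degenerates into "send each image as soon as it comes in", so you can switch between the two strategies with one parameter.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group an incoming stream of samples into batches of `batch_size`.

    A trailing partial batch is yielded at the end so no sample is dropped.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[T] = []
    for sample in stream:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch  # full batch: hand it to the training step
            batch = []
    if batch:
        yield batch  # leftover samples at the end of the stream
```

For example, `list(micro_batches(range(5), 2))` gives `[[0, 1], [2, 3], [4]]`. If you use batchnorm, you would probably want to drop or pad that last short batch rather than train on it.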
I would not be able to split the data into train test splits, and that it would be very expensiv...
It should have been: output_uri="s3://company-clearml/artifacts/bethan/sales_journeys/artifacts/examples/load_artifacts.f0f4d1cd5eb54795b11508dd1e739145/artifacts/filename.csv.gz/filename.csv.gz"
@<1533620191232004096:profile|NuttyLobster9> I think we found the issue: when you pass a direct link to the Python venv, the agent fails to detect the Python version, and since the Python version is required to fetch the correct torch build, it fails to install it. This is why passing CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE=none works: it skips resolving the torch/CUDA version (which requires parsing the Python version).
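If you start the agent from a Python wrapper script, a sketch of setting that variable for the agent process could look like this. The helper name and the `default` queue are assumptions for illustration; the `clearml-agent daemon --queue` command line is the standard one.

```python
import os
from typing import Dict, List, Tuple


def agent_launch_spec(queue: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for starting clearml-agent with torch resolving disabled."""
    env = dict(os.environ)  # copy, so the current process is not modified
    # Skip torch/CUDA wheel resolution, which needs the Python version that
    # the agent could not detect from a direct venv link.
    env["CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE"] = "none"
    argv = ["clearml-agent", "daemon", "--queue", queue]
    return argv, env
```

You would then pass both to `subprocess.run(argv, env=env)`; exporting the variable in the shell before running `clearml-agent` has the same effect.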
Is there GPU support?
That basically depends on your template YAML resources. You can have multiple templates, each one "connected" to a different glue instance pulling from a different queue. This way the user can enqueue a Task in a specific queue, say single_gpu; the glue listens on that queue and, for each ClearML Task, creates a k8s job with a single GPU, as specified in the pod template YAML.
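To make the "one template per queue" idea concrete, here is a sketch that builds the pod templates as Python dicts (you would dump them to YAML for the glue). The container name, image, and the `dual_gpu` queue are hypothetical; `nvidia.com/gpu` is the standard Kubernetes resource name for NVIDIA GPUs. This is not a complete pod spec.

```python
from typing import Any, Dict


def gpu_pod_template(num_gpus: int, image: str) -> Dict[str, Any]:
    """Minimal pod template with a GPU limit (illustrative, not a full spec)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "containers": [
                {
                    "name": "clearml-task",  # hypothetical container name
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": num_gpus}},
                }
            ]
        },
    }


# One template per queue; each queue would be served by its own glue instance.
QUEUE_TEMPLATES = {
    "single_gpu": gpu_pod_template(1, "my-registry/train:latest"),  # hypothetical image
    "dual_gpu": gpu_pod_template(2, "my-registry/train:latest"),
}
```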
The RC should be out later today (I hope), and this will already be included. I'll ping here when it is out.
Hi GiganticTurtle0
ClearML will only list the directly imported packages (not their requirements), meaning in your case it will only list "tf_funcs" (which you imported).
But I do not think there is a package named "tf_funcs", right?
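To illustrate why only "tf_funcs" shows up, here is a stdlib-only sketch of static import detection. It is not ClearML's actual implementation, just the idea: only the top-level names your script imports are seen, so whatever tf_funcs imports internally (e.g. tensorflow) does not appear.

```python
import ast
from typing import Set


def direct_imports(source: str) -> Set[str]:
    """Return the top-level package names imported directly by `source`."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Relative imports (level > 0) refer to local modules, so skip them.
            names.add(node.module.split(".")[0])
    return names
```

If a required package is missing from the list, calling `Task.add_requirements("<package>")` before `Task.init` should let you add it explicitly.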
As I'm a full-stack developer at Core, I'd be looking to extend the TRAINS frontend and backend APIs to suit my needs: on-prem data storage integration and lots of other customization for a job scheduler (CRON), dataset augmentation, a custom annotation tool, etc.
That is awesome! Feel free to post a specific question here, and I'll try to direct you to the right place 🙂
Can you point me to a tutorial that shows, with an example, how to customize the backend/frontend?
You mean l...
Hi EnviousPanda91
You mean, like, collect plots and then generate a PDF?
It also seems that PipelineDecorator.upload_artifact is not compatible with caching, sadly.
Both use the exact same artifact-upload mechanism (including caching for downloaded artifacts). As for caching pipeline components, that works at the component level (i.e. same code/task and same arguments equals a cache hit).
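A toy sketch of what "same code, same arguments equals a cache hit" means. This is not ClearML's implementation (in ClearML you would, as far as I know, turn it on with `cache=True` on `PipelineDecorator.component`); it just hashes the function's bytecode and constants together with its JSON-serializable keyword arguments and reuses the stored result on a match.

```python
import functools
import hashlib
import json
from typing import Any, Callable, Dict

_RESULTS: Dict[str, Any] = {}  # cache key -> stored component result


def component_cache_key(func: Callable[..., Any], kwargs: Dict[str, Any]) -> str:
    """Hash the function's code plus its (JSON-serializable) arguments."""
    code = func.__code__
    payload = (
        code.co_code
        + repr(code.co_consts).encode()
        + json.dumps(kwargs, sort_keys=True).encode()
    )
    return hashlib.sha256(payload).hexdigest()


def cached_component(func: Callable[..., Any]) -> Callable[..., Any]:
    """Reuse a stored result when the same code runs with the same arguments."""

    @functools.wraps(func)
    def wrapper(**kwargs: Any) -> Any:
        key = component_cache_key(func, kwargs)
        if key not in _RESULTS:
            _RESULTS[key] = func(**kwargs)  # cache miss: actually run the step
        return _RESULTS[key]

    return wrapper
```

Calling a decorated step twice with `n=3` runs it once; calling it with `n=4` runs it again, because the arguments (and therefore the key) changed.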
What exactly are you getting? How is it that PipelineDecorator.upload_artifact uploads to a different storage? Is that reproducible?
JitteryCoyote63, this is the standard removal of a server's entry from SSH's known_hosts file:
https://superuser.com/a/30089
Specifically, you can try: ssh-keygen -R 10.105.1.77
Hi JitteryCoyote63
NVIDIA_VISIBLE_DEVICES is set automatically for the process that trains-agent spins up, so it is transparent to your code: you can only "see" GPU 0.
(Obviously, when not using Docker you can forcefully change the OS environment at runtime, but you should avoid that ;))
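If you want to check from inside the Task which physical GPU your "GPU 0" actually is, a small sketch like this reads the variable instead of changing it. It assumes the value is a comma-separated list of device ids; the helper name is hypothetical.

```python
from typing import Dict, Mapping


def visible_gpu_map(env: Mapping[str, str]) -> Dict[int, str]:
    """Map the device index your code sees to the id listed in NVIDIA_VISIBLE_DEVICES.

    Returns an empty dict when the variable is unset or not a plain id list
    (e.g. "all"), meaning no renumbering can be inferred from it.
    """
    value = env.get("NVIDIA_VISIBLE_DEVICES", "").strip()
    if not value or value.lower() in {"all", "none", "void"}:
        return {}
    ids = [part.strip() for part in value.split(",") if part.strip()]
    return dict(enumerate(ids))
```

For example, with `NVIDIA_VISIBLE_DEVICES=3` you get `{0: "3"}`: physical GPU 3 shows up to your code as device 0. In practice you would call it as `visible_gpu_map(os.environ)`.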
Hi RoundMosquito25
How did you spin up the agent (what's the command line? Is it in Docker mode or venv mode?)
From the console, it seems the pip installation inside the container (based on the log, this is what I assume) is stuck?!
My question is whether there is an easy way to track gradients, similar to wandb.watch
@<1523705099182936064:profile|GrievingDeer61> not at the moment, but it should be fairly easy to add.
Usually, torch examples just use TensorBoard (TB) for default logging, which goes directly to ClearML, but this is a great idea to add.
It could probably go straight into the next version 🙂
wdyt?
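Until something like that exists, a workaround sketch is to log gradient norms yourself through TensorBoard, which ClearML already captures. The function below is framework-agnostic and hypothetical: it takes `(name, flat_gradient)` pairs and any writer with an `add_scalar(tag, scalar_value, global_step)` method, which `torch.utils.tensorboard.SummaryWriter` provides. `RecordingWriter` is only a stand-in for trying it without TensorBoard.

```python
import math
from typing import Iterable, List, Protocol, Tuple


class ScalarWriter(Protocol):
    def add_scalar(self, tag: str, scalar_value: float, global_step: int) -> None: ...


def log_gradient_norms(
    named_grads: Iterable[Tuple[str, List[float]]], writer: ScalarWriter, step: int
) -> float:
    """Log the L2 norm of each parameter's gradient plus the global norm."""
    total_sq = 0.0
    for name, grad in named_grads:
        sq = sum(g * g for g in grad)
        total_sq += sq
        writer.add_scalar(f"grad_norm/{name}", math.sqrt(sq), step)
    total = math.sqrt(total_sq)
    writer.add_scalar("grad_norm/total", total, step)
    return total


class RecordingWriter:
    """Stand-in writer that just records calls (for testing without TensorBoard)."""

    def __init__(self) -> None:
        self.logged: List[Tuple[str, float, int]] = []

    def add_scalar(self, tag: str, scalar_value: float, global_step: int) -> None:
        self.logged.append((tag, scalar_value, global_step))
```

With PyTorch, after `loss.backward()` you could pass `((n, p.grad.flatten().tolist()) for n, p in model.named_parameters() if p.grad is not None)` together with a `SummaryWriter`; converting to lists is slow for big models, so treat this as a sketch rather than something to run every step.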
Hi @<1572395184505753600:profile|GleamingSeagull15>
Try adjusting:
None
to 30 sec
It will reduce the number of log reports (i.e. API calls)
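Why a longer reporting period means fewer API calls, as a toy sketch: log lines are buffered and sent in one call at most once per period. The class is hypothetical and not ClearML's reporter; the injectable clock just makes the behavior easy to test.

```python
import time
from typing import Callable, List


class PeriodicReporter:
    """Buffer log lines and send them at most once per `period_sec`."""

    def __init__(
        self,
        send: Callable[[List[str]], None],
        period_sec: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._period = period_sec
        self._clock = clock
        self._buffer: List[str] = []
        self._last_flush = clock()

    def log(self, line: str) -> None:
        self._buffer.append(line)
        if self._clock() - self._last_flush >= self._period:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._send(self._buffer)  # one "API call" for the whole buffer
            self._buffer = []
        self._last_flush = self._clock()
```

With a 30-second period, ten lines logged 5 seconds apart result in two send calls instead of ten.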
P.S. You should remove this line 🙂 extra_index_url: ["git@github.com:salimmj/xxxx"]
Of course, I used "localhost"
Do not use "localhost"; use your IP address. Then it will be registered with a URL that points to the IP, and it will work.
"Requested version: 2.28, Used version 1.0" for some reason
This is fine; it means there is no change in that API.
Hi WackyRabbit7
Yes, we definitely need to work on the wording there...
"Dynamic" means you register a pandas object that you keep logging into during training; think, for example, of the image files you are feeding into the network. Trains will then make sure it is constantly updated and uploaded, so you can later verify/compare different runs and detect dataset contamination, etc.
"Static" simply means: here is my object/file, upload it and store it as an artifact for me...
Makes sense?
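In ClearML terms, I believe the dynamic case maps to `Task.register_artifact` (a pandas DataFrame kept in sync) and the static case to `Task.upload_artifact`. The toy model below only illustrates the difference in behavior; it uses plain lists/dicts instead of pandas and is not the actual implementation.

```python
import json
from typing import Any, Dict, List


class ArtifactStoreSketch:
    """Toy model of 'dynamic' vs 'static' artifacts (not the ClearML implementation)."""

    def __init__(self) -> None:
        self.uploads: Dict[str, List[str]] = {}  # artifact name -> uploaded snapshots
        self._dynamic: Dict[str, Any] = {}

    def _upload(self, name: str, obj: Any) -> None:
        snapshot = json.dumps(obj, sort_keys=True)
        history = self.uploads.setdefault(name, [])
        if not history or history[-1] != snapshot:
            history.append(snapshot)  # only upload when the content changed

    def upload_static(self, name: str, obj: Any) -> None:
        # Static: snapshot the object once; later changes are not tracked.
        self._upload(name, obj)

    def register_dynamic(self, name: str, obj: Any) -> None:
        # Dynamic: keep a reference and re-upload whenever it changes.
        self._dynamic[name] = obj
        self._upload(name, obj)

    def sync(self) -> None:
        # A real system would call this periodically in the background.
        for name, obj in self._dynamic.items():
            self._upload(name, obj)
```

If you register a list of seen image files as dynamic and keep appending to it, each `sync()` uploads a new snapshot; a config uploaded as static keeps its original value even if you modify the object afterwards.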
ERROR: torch-1.12.0+cu102-cp38-cp38-linux_x86_64.whl is not a supported wheel on this platform
TartBear70, could it be you are running on a new Mac M1/M2?
Also, quick question: any chance you can test with the latest RC? pip3 install clearml-agent==1.3.1rc6
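The error above is what pip says when a wheel's platform tag does not match the machine. A simplified sketch for checking that yourself: parse the last three tags from the wheel filename and compare the platform tag with the one derived from `sysconfig.get_platform()`. This is an approximation; the `packaging.tags` module is the robust way to get the full list of supported tags.

```python
import sysconfig
from typing import NamedTuple


class WheelTags(NamedTuple):
    python: str
    abi: str
    platform: str


def parse_wheel_tags(filename: str) -> WheelTags:
    """Extract the python/abi/platform tags (the last three dash-separated fields)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file")
    parts = filename[:-4].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel filename")
    return WheelTags(*parts[-3:])


def current_platform_tag() -> str:
    """Approximate this interpreter's platform tag, e.g. 'linux_x86_64'."""
    return sysconfig.get_platform().replace("-", "_").replace(".", "_")
```

For `torch-1.12.0+cu102-cp38-cp38-linux_x86_64.whl` the platform tag is `linux_x86_64`, while on an Apple Silicon Mac `current_platform_tag()` returns something like `macosx_11_0_arm64`, so the wheel cannot be installed there.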
Hi VexedCat68
One of my steps just finds the latest model to use. I want the task to output the id, and the next step to use it. How would I go about doing this?
When you say "I want the task to output the id", do you mean passing it to the next step?
Something like this one:
https://github.com/allegroai/clearml/blob/c226a748066daa3c62eddc6e378fa6f5bae879a1/clearml/automation/controller.py#L224
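As far as I recall, the add_step docstring linked above describes parameter overrides that reference earlier steps with placeholders such as `${<step_name>.id}`. The sketch below is a hypothetical stand-alone resolver that only illustrates the substitution idea (the step name `find_latest_model` and the recorded outputs are made up); it is not the controller's code.

```python
import re
from typing import Dict

# Matches placeholders of the form ${step_name.field}
_REF = re.compile(r"\$\{([A-Za-z0-9_]+)\.([A-Za-z0-9_]+)\}")


def resolve_refs(value: str, step_outputs: Dict[str, Dict[str, str]]) -> str:
    """Replace ${step.field} placeholders with outputs recorded for earlier steps."""

    def _sub(match: "re.Match[str]") -> str:
        step, field = match.group(1), match.group(2)
        return step_outputs[step][field]  # KeyError if the step/field is unknown

    return _REF.sub(_sub, value)
```

So a next-step parameter set to `${find_latest_model.id}` would receive whatever id the `find_latest_model` step recorded.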
You can, however, pass a specific Task ID and it will reuse it: reuse_last_task_id="aabb11". Would that help?
Hmm, I'm sorry, it might be "continue_last_task". Can you try: Task.init(..., continue_last_task="aabb11")
TrickySheep9, yes, let's do that!
How do you PR a change?
HurtWoodpecker30, could it be you hit a limit of some sort?
Since you are running in venv mode, setting the OS environment variable before launching clearml-agent will make sure it propagates to the process itself.
ReassuredTiger98, makes sense?
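A generic demonstration of why this works: a child process inherits the environment of the process that starts it. The sketch below launches a child Python interpreter with an extra variable and reads it back; in venv mode the agent plays the parent role for the Task process. The helper and variable names are hypothetical.

```python
import os
import subprocess
import sys


def child_sees_env(name: str, value: str) -> str:
    """Launch a child Python process with name=value set and return what it reads back."""
    env = dict(os.environ)
    env[name] = value  # set in the parent's copy before launching
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ.get({name!r}, ''))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `child_sees_env("MY_TASK_SETTING", "42")` returns `"42"`, the same way a variable exported before `clearml-agent daemon ...` ends up visible inside the Task.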
The main clearml repo?
Yep that sounds right 🙂 thank you!
Hi RoundMosquito25
What do you mean by "local commits"?