Hi @<1569496075083976704:profile|SweetShells3>
Try to do:
```python
import torch.distributed as dist
from clearml import Task

if dist.get_rank() == 0:
    task = Task.init(...)
```
This will make sure only the "master" process is logged
or:
```python
import os

if int(os.environ.get('RANK', 0)) == 0:
    task = Task.init(...)
```
Try:
```python
task.update_requirements('\n'.join([".", ]))
```
the issue was related to task.connect being called multiple times, I guess.
This is odd?! How would that affect the crash?
Do notice that when you connect objects, each time you call connect you are basically deserializing the configuration from the backend into the code; maybe this somehow affected the object?
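A minimal sketch of what I mean, assuming a plain dict (the project/task names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="connect demo")  # placeholder names

config = {"lr": 0.1, "batch_size": 32}
task.connect(config)  # registers the dict; values may be overridden from the backend (e.g. when cloned)
task.connect(config)  # a second call deserializes the stored configuration back into the same object
```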
JitteryCoyote63 yes, this is very odd, seems like a PyPI flop?!
On the website they do say there is 0.5.0... I do not get it
https://pypi.org/project/pytorch3d/#history
Hi JitteryCoyote63
I would like to switch to using a single auth token.
What is the rationale behind that?
Hi CourageousDove78
Not the cleanest, but you can basically pass everything here:
https://allegro.ai/clearml/docs/rst/references/clearml_api_ref/index.html#post--tasks.get_all
Reasoning is that it is passed almost as is to the server for the actual query.
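For example, from Python you can pass the same filter fields via Task.get_tasks (the project name and filter values below are just placeholders):
```python
from clearml import Task

# task_filter is forwarded (almost as-is) to the tasks.get_all endpoint
tasks = Task.get_tasks(
    project_name="examples",  # placeholder project
    task_filter={"status": ["completed"], "order_by": ["-last_update"]},
)
```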
That wasn't scheduled by ClearML).
This means that from the ClearML perspective they are "manual", i.e. the job itself (by calling Task.init) creates the experiment in the system and fills in all the fields.
But for a k8s job, I'm still unsuccessful.
HelpfulDeer76 When you say "unsuccessful" what exactly do you mean ?
Could it be they are reported to the clearml demo server (the default server if no configuration is found) ?
I think the non-master processes are trying to log something, but have no Logger instance because they have no Task instance.
Hmm is your code calling Logger.current_logger() directly ?
Do the logs in the master process include all training history, or do I need to concatenate logs from different nodes somehow?
So the main problem is that you need to pass the TASK ID that the master node creates to the second node, so it can report to the same Task.
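A rough sketch of the idea, assuming the RANK env variable from before (MASTER_TASK_ID is a placeholder name, and you would still need to actually share the id between nodes, e.g. via a broadcast or shared storage):
```python
import os
from clearml import Task

if int(os.environ.get("RANK", 0)) == 0:
    task = Task.init(project_name="examples", task_name="distributed run")  # placeholder names
    os.environ["MASTER_TASK_ID"] = task.id  # share the task id with the other nodes somehow
else:
    # on the other nodes, attach to the very same Task and report through its logger
    task = Task.get_task(task_id=os.environ["MASTER_TASK_ID"])
    logger = task.get_logger()
```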
I know that the enterprise version of ClearML support...
Hi DisturbedWalrus17
This is a bit of a hack, but will work:
```python
from clearml.backend_interface.metrics.events import UploadEvent
UploadEvent._file_history_size = 10
```
Maybe we should expose it somewhere, what do you think?
While I'll look into it, you can do:
```python
from clearml import OutputModel
output_model = OutputModel()
output_model.update_weights("best_model.onnx")
```
Yes, could you send the full log? screen grab ?
Could you verify the Task.init call is inside the main function and not in the global scope? We have noticed some issues with global-scope calls in some cases
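Something like this, as a minimal sketch (project/task names are placeholders):
```python
from clearml import Task

def main():
    # Task.init inside the entry point, not at import time
    task = Task.init(project_name="examples", task_name="my experiment")
    ...

if __name__ == "__main__":
    main()
```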
Which would also mean that the system knows which datasets are used in which pipelines etc
Like input artifacts per Task ?
It's a good abstraction for monitoring the state of the platform and callbacks, if this is what you are after.
If you just need "simple" cron, then you can always just loop/sleep 🙂
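i.e. something as simple as (the job function is hypothetical):
```python
import time

while True:
    run_scheduled_job()  # hypothetical: whatever you need to run periodically
    time.sleep(60 * 60)  # wake up once an hour
```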
Are you getting the error from boto failing to launch additional ec2 instances ?
```python
task.connect(model_config)
task.connect(DataAugConfig)
```
If these are separate dictionaries, you should probably use two sections:
```python
task.connect(model_config, name="model config")
task.connect(DataAugConfig, name="data aug")
```
It is still getting stuck.
I notice that one of the scalars that gets logged early reports the epoch, while the remaining scalars seem to be reported per iteration, because the iteration value is 1355 instead of 26
Wait, so you are seeing some scalars?...
NastyOtter17 can you provide some more info ?
Hi MysteriousBee56 ,
what do you mean by:
Can we upload our project repository to trains server?
I tried specifying helper functions but it still gives the same error.
What's the error you are getting ?
Hi Guys,
I hear you guys, and I know this is planned, but it probably got bumped down in priority.
I know the main issue is the "Execution Tab" comparison; the rest is not an issue.
Maybe a quick hack: only compare the first 10 in the Execution tab, and remove the limit on the others? (The main issue with the execution is the git-diff / installed packages comparison, which is quite taxing on the FE)
Thoughts ?
Ok, but it must be somewhere in the bst class
It is the XGBoost callback feature, basically just reporting everything XGBoost reports:
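Roughly along these lines, as a hedged sketch of such a callback (the class name is mine, not the actual ClearML implementation; you would pass it via callbacks=[...] to xgb.train):
```python
import xgboost as xgb

class ClearMLReportCallback(xgb.callback.TrainingCallback):  # hypothetical name
    def __init__(self, logger):
        self.logger = logger  # a clearml Logger, e.g. task.get_logger()

    def after_iteration(self, model, epoch, evals_log):
        # evals_log looks like {"validation_0": {"rmse": [..values..]}, ...}
        for data_name, metrics in evals_log.items():
            for metric_name, values in metrics.items():
                self.logger.report_scalar(
                    title=data_name, series=metric_name, value=values[-1], iteration=epoch
                )
        return False  # returning True would stop training
```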
Hi SkinnyPanda43
Every "commit" is a new version, so sync changes you need to either create a new version (with parent version as the previous one), and sync the local folder (or manually add/remove files).
If you do not need to actually store the "current" version, you can just reset the Task, and sync it again.
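A minimal sketch of the first option, assuming the Dataset API (project/name/path and the parent id are placeholders):
```python
from clearml import Dataset

# create a new version whose parent is the previous version
ds = Dataset.create(
    dataset_project="examples",
    dataset_name="my dataset",
    parent_datasets=["<previous_dataset_id>"],  # placeholder id
)
ds.sync_folder(local_path="/path/to/local/folder")  # add/remove changed files
ds.upload()
ds.finalize()
```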
wdyt?
Hi @<1557899668485050368:profile|FantasticSquid9>
There is some backwards-compatibility issue with 1.2 (I think).
Basically what you need is to spin up a new one on a new session ID and re-register the endpoints
Seems correct.
I'm assuming something is wrong with the key/secret quoting ?!
Could you generate another one and test it ?
(you can have multiple key/secret pairs on the same user)
can someone show me an example of how PipelineController.create_draft works?
I think the idea is to store a draft version of the pipeline (not the decorator type, I think, but the one launching pre-executed Tasks).
GiganticTurtle0 I'm not sure I fully understand how / why you are using it, can you expand?
EDIT:
However, my intention is ONLY to create it to be executed later on.
Hmm, so maybe like enqueue it?
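If it helps, a hedged sketch of what I have in mind (project/task names are placeholders, and the steps assume existing pre-executed Tasks):
```python
from clearml import PipelineController

pipe = PipelineController(name="my pipeline", project="examples", version="1.0.0")
pipe.add_step(
    name="stage_one",
    base_task_project="examples",        # placeholder
    base_task_name="pre executed task",  # placeholder: an existing Task to clone
)
# store the pipeline as a draft Task without running it; enqueue it later from the UI/API
pipe.create_draft()
```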
EmbarrassedPeacock82 are you using keras/pytorch etc for serving (i.e. Triton) ?
Thanks FlutteringWorm14 , checking 🙂