Move from Google Drive to S3 bucket storage. This mounts the S3 bucket storage as a network drive. ClearML does not work easily with Google Drive.
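As a rough illustration of the direction we took (a minimal sketch, not our exact setup; the project name, task name, and bucket path are placeholders, and bucket credentials are assumed to be configured in clearml.conf or via standard AWS environment variables), pointing ClearML at S3 can be as simple as setting the task's default output URI:

```python
from clearml import Task

# Hypothetical names/paths; substitute your own. With output_uri set,
# models and artifacts are uploaded to the S3 bucket instead of local disk.
task = Task.init(
    project_name="my_project",
    task_name="opennmt_training",
    output_uri="s3://my-bucket/clearml-artifacts",
)
```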
No, TB (TensorBoard) is not enabled. I just googled it and found this: https://forum.opennmt.net/t/running-tensorboard/4242 . I will try enabling TB and see if that fixes it.
In TensorFlow's `__init__.py`, TensorBoard appears to be initialized (including `tf.summary`):
```python
# Hook external TensorFlow modules.
# Import compat before trying to import summary from tensorboard, so that
# reexport_tf_summary can get compat from sys.modules. Only needed if using
# lazy loading.
_current_module.compat.v2  # pylint: disable=pointless-statement
try:
  from tensorboard.summary._tf import summary
  _current_module.__path__ = (
      [_module_util.get_parent_dir(summary)] + _current_m...
```
So, according to the article (and the code, as far as I could tell), OpenNMT-tf automatically enables TensorBoard. That is, it auto-logs the relevant metrics through tf.summary ( https://www.tensorflow.org/api_docs/python/tf/summary ). This shows up on the command line as something like:
```
INFO:tensorflow:Evaluation result for step 9000: loss = 1.190986 ; perplexity = 3.290324 ; bleu = 63.569644
INFO:tensorflow:Step = 9100 ; steps/s = 2.17, source words/s = 28293, target words/s = 39388 ; Lea...
```
It worked! I added this call shortly after Task.init: `tf.summary.create_file_writer("C:/mypath/logs")`
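For anyone hitting the same issue, this is roughly the pattern that worked for me (a minimal sketch; the project name and task name are placeholders, and I'm assuming ClearML's TensorBoard auto-binding is left enabled in Task.init):

```python
import tensorflow as tf
from clearml import Task

# Initialize ClearML first so it can hook TensorFlow/TensorBoard logging.
task = Task.init(project_name="my_project", task_name="opennmt_tb_test")

# Create a default tf.summary writer; anything OpenNMT-tf (or we) write via
# tf.summary is then picked up by ClearML's auto-logging and shown as scalars.
writer = tf.summary.create_file_writer("C:/mypath/logs")
with writer.as_default():
    tf.summary.scalar("sanity_check/loss", 1.23, step=0)
```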
Thanks! I'll look forward to it. If there is a draft that outlines the general scope and function of the integration, that would also be helpful for this phase of our project.
The only place I see subprocess being called in opennmt is to determine the batch size, but not for the primary training task.
That is great to hear! Is there any documentation on how it works, and if it can be configured?
If ClearML does not implement this, we may have to implement it ourselves.
We are at Server: 3.18.2-1126 and PyPI version 1.12.2.
It seems that once the job has completed, it doesn't accept any new reports...
@<1523701205467926528:profile|AgitatedDove14> - for some reason none of those solutions are working. I am forcing mark_started, but it doesn't register. Models don't have the report_* endpoints, and even when trying with an artifact, once the job finishes the artifact will no longer update.
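For reference, this is roughly the sequence I'm attempting (a sketch only; the task ID, scalar names, and artifact name are placeholders), reopening the finished task and then trying to report to it:

```python
from clearml import Task

# Hypothetical ID of the already-completed task.
task = Task.get_task(task_id="abc123")

# Force the task back into a "started" state so it should accept new reports.
task.mark_started(force=True)

# Try to push a new scalar and a refreshed artifact onto the reopened task.
task.get_logger().report_scalar(
    title="post_hoc_eval", series="bleu", value=63.57, iteration=9000
)
task.upload_artifact(name="eval_results", artifact_object={"bleu": 63.57})
```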
Let's see if I understand:
- Triton server deployments only have manual, static deployment of models for inferencing (without enterprise)
- ClearML can load and unload models based upon usage, but has to do so from the hard drive
- Triton server does not support saving models off to normal RAM for faster loading/unloading
- Therefore, currently, we can deploy 100 models when only 5 can be concurrently loaded, but when they are unloaded/loaded (automatically by ClearML), it will take a few sec...
But with Poetry running in a Docker container, the cache is not persistent. Should I map the Poetry cache volume to a location on the host?
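In case it helps frame the question, this is the kind of mapping I have in mind (a sketch only, assuming the agent runs the task in Docker mode, that Poetry's default cache lives at ~/.cache/pypoetry inside the container, and that the host path and image are placeholders):

```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="poetry_cache_test")

# Ask the agent to bind-mount a host directory over Poetry's cache directory
# inside the task's container, so downloaded wheels survive between runs.
task.set_base_docker(
    docker_image="python:3.10",
    docker_arguments="-v /opt/clearml/poetry-cache:/root/.cache/pypoetry",
)
```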
@<1523701205467926528:profile|AgitatedDove14> - after the model_trainer.train step, the task is marked as complete. This is done using our own repo. The extra reporting steps are not added there yet (I am working on that locally), but it is what marks the job complete.
At least I made S3 enough like Google Drive to make our team happy.