
Thanks for letting me know, I'd be very happy to update.
Turns out the step I missed (maybe it should be mentioned in the docs...) was configuring the Security Group for the EC2 machine to allow inbound connections on ports 8080, 8008, and 8081, and limiting the source to my IP (or my office IP) only.
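For anyone hitting the same thing, the Security Group step can be sketched with boto3 roughly as follows. This is a minimal sketch, not an official ClearML procedure: the helper names `build_ingress_rules` and `apply_rules`, the rule description text, and the example CIDR are all hypothetical, and the group id must be your own.

```python
from typing import Dict, List, Sequence

# Ports named in the message above.
CLEARML_PORTS = (8080, 8008, 8081)


def build_ingress_rules(source_cidr: str,
                        ports: Sequence[int] = CLEARML_PORTS) -> List[Dict]:
    """Build one TCP ingress rule per port, restricted to a single source CIDR."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            # Description is free text; this wording is an assumption.
            "IpRanges": [{"CidrIp": source_cidr,
                          "Description": "clearml-server access"}],
        }
        for port in ports
    ]


def apply_rules(group_id: str, source_cidr: str) -> None:
    """Attach the rules to an existing Security Group (needs AWS credentials)."""
    import boto3  # imported lazily so the rule builder works without boto3

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=build_ingress_rules(source_cidr),
    )
```

Calling `apply_rules("<your-group-id>", "203.0.113.7/32")` would open the three ports to that single address only; a `/32` suffix limits the rule to exactly one IP.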
yes, that solved the errors, however the two lines "could not detect iteration reporting" and "reporting detected" a few moments later, still show up
the train_loss is in the second column from the left (the leftmost column is the epoch number, 30-36)
TRAINS Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
TRAINS Monitor: Reporting detected, reverting back to iteration based reporting
no, I meant to change the way it is reported. I'm still interested in the train_loss graph, naturally 🙂 but obviously it is reporting something that is the inverse of the train_loss, since in the graph it is exploding, and in reality (as reported in the terminal) it is decaying to 9e-2
can this give us a clue? I'm getting this error:
in the meantime, I got this error message, this time regarding Trains:
The valid_loss and Accuracy are showing on the Tboard with the same values as they show up in the terminal, but the train_loss is showing on a different scale and I can't figure out why. I did not change anything in the core files of torch, Tboard or fastai, and used the initialization the same way that you showed, and as it appears in the fastai docs, using learn.callback_fns.append(partial(LearnerTensorboardWriter, base_dir=tboard_path, name=taskName))
Good morning Alon, since you helped me so much getting tensorboard to show results yesterday, I'm hoping you can help me understand why some results I'm getting are strange:
Traceback (most recent call last):
  File "/home/ubuntu/MultiClassLabeling/myenv/lib/python3.6/site-packages/torch/utils/tensorboard/__init__.py", line 2, in <module>
    from tensorboard.summary.writer.record_writer import RecordWriter  # noqa F401
  File "/home/ubuntu/MultiClassLabeling/myenv/lib/python3.6/site-packages/trains/binding/import_bind.py", line 59, in __patched_import3
    level=level)
ModuleNotFoundError: No module named 'tensorboard'
During handling of the above exception, ...
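The `ModuleNotFoundError` above means the `tensorboard` package is not importable from the active environment. A quick way to confirm that before starting a training run is sketched below; the helper name `is_installed` is hypothetical, and it only handles top-level package names.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("torch", "tensorboard"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If `tensorboard` is reported as missing, installing it into the same virtual environment that runs the training script is the usual fix.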
tried with both Firefox and Chrome; results are also similar across computers and OSes (Ubuntu and Windows)
the "Payload" tab contains the project id info, so it shouldn't be the cause of the delete call failing
this is an error during training that points to an Elasticsearch error. This might also be the cause of the delete error, what do you think SuccessfulKoala55?
this is what I got:{"meta":{"id":"7cd78b67e5384e739b9aec6cdc030e6d","trx":"7cd78b67e5384e739b9aec6cdc030e6d","endpoint":{"name":"projects.delete","requested_version":"2.20","actual_version":"1.0"},"result_code":400,"result_subcode":12,"result_msg":"Validation error (error for field 'project'. field is required!)","error_stack":null,"error_data":{}},"data":{}}
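To make a response like this easier to read when comparing attempts, the `meta` block can be reduced to one line. The sketch below uses a trimmed copy of the response above (ids and empty fields omitted); the function name `summarize_error` is hypothetical and this is not part of any ClearML client.

```python
import json

# Trimmed copy of the response body quoted above.
RESPONSE_BODY = """
{"meta": {"endpoint": {"name": "projects.delete",
                       "requested_version": "2.20",
                       "actual_version": "1.0"},
          "result_code": 400,
          "result_subcode": 12,
          "result_msg": "Validation error (error for field 'project'. field is required!)"},
 "data": {}}
"""


def summarize_error(body: str) -> str:
    """Reduce an API response body to 'endpoint failed: code/subcode message'."""
    meta = json.loads(body)["meta"]
    endpoint = meta["endpoint"]["name"]
    code = meta["result_code"]
    subcode = meta["result_subcode"]
    message = meta["result_msg"]
    return f"{endpoint} failed: {code}/{subcode} {message}"


if __name__ == "__main__":
    print(summarize_error(RESPONSE_BODY))
```

The message says the server did not see a `project` field in the request it validated, which is worth comparing against what the browser's "Payload" tab shows for the same call.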
What's interesting is that SOMETIMES (rarely) it succeeds
Thanks Jake for your help, it's highly appreciated. This is an AWS EC2 running the clearml-server AMI (region of EC2 is us-east-1)