BroadMole98 thank you for noticing!
I'll make sure it is fixed (a few other properties are also missing there, not sure why, I'll ask them to take a look)
FYI: ssh -R 8080:localhost:8080 -R 8008:localhost:8008 -R 8081:localhost:8081 replace_with_username@ubuntu_ip_here
solved the issue 🙂
Maybe the only thing to worry about is making sure the IP address is stable, so if k8s replaces the node, you do not have to reconfigure the clients 🙂
Trains is fully open-source. That said, properly publishing and maintaining the web client is still on our to-do list (I mean, there is totally readable JavaScript code packaged in the trains-server and the dockers). It keeps getting pushed back because there are generally fewer contributions on the front-end with these kinds of projects. That said, if you guys are willing to help, it will greatly help in pushing it forward... LivelyLion31 what do you think, would you guys like to help with the fronte...
What I want is to manually provide a name to each series equal to the subject name (Subject 1, Subject 2, etc.)
They appear as they are reported to TB. I think this is a PyTorchLightning thing... If you look at the TB output produced, you will get the same naming scheme, no?!
Awesome, PRs are always welcome, and we try to help with any request and feature coming from users. We just added audio support (RC releasing in a few days) based solely on user requests.
https://github.com/allegroai/trains/issues/120
Hi UnsightlySeagull42
Basically you can get the agent to always add additional arguments for the docker run, such as -v for mounting:
https://github.com/allegroai/clearml-agent/blob/948fc4c6ce1ecf33a74619ad570d69b8188f6db9/docs/clearml.conf#L133
` from clearml.automation.parameters import LogUniformParameterRange
sampler = LogUniformParameterRange(name='test', min_value=-3.0, max_value=1.0, step_size=0.5)
sampler.to_list()
Out[2]:
[{'test': 1.0},
{'test': 3.1622776601683795},
{'test': 10.0},
{'test': 31.622776601683793},
{'test': 100.0},
{'test': 316.22776601683796},
{'test': 1000.0},
{'test': 3162.2776601683795}] `
Hmm, you either need to run with sudo or make sure the running user has docker run permissions.
SmarmyDolphin68 if you can reproduce the behavior in a standalone script, it will really accelerate fixing this issue.
Hi GrievingKoala83
Two tasks are created, but the training does not begin, both tasks are in perpetual running.
Can you print something after the task.launch_multi_node(args.nodes) call? I'm assuming the two Tasks are running and are blocked on the "Trainer" class.
If specified args.gpus=2 and args.nodes=2, three tasks are created.
This is really odd, can you add some prints with task id and rank after the ...
Hi SillyRobin38
You should either disable certificate verification or add the self-signed certificate to your urllib, or set:
export REQUESTS_CA_BUNDLE="/path/to/cert/file"
export SSL_CERT_FILE="/path/to/cert/file"
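If exporting the variables in the shell is inconvenient, the same two settings can also be applied from Python before any HTTPS connection is opened. This is only a minimal sketch: the helper name `use_custom_ca_bundle` is made up for illustration and is not part of ClearML, and the path is a placeholder.

```python
import os


def use_custom_ca_bundle(cert_path, environ=None):
    """Point requests and OpenSSL-based clients at a custom CA bundle.

    `environ` defaults to os.environ; passing a plain dict keeps the helper testable.
    """
    env = os.environ if environ is None else environ
    # Read by the `requests` library when it verifies HTTPS certificates.
    env["REQUESTS_CA_BUNDLE"] = cert_path
    # Read by OpenSSL (and so by Python's ssl module) for its default verify locations.
    env["SSL_CERT_FILE"] = cert_path
    return env
```

Calling `use_custom_ca_bundle("/path/to/cert/file")` at the very top of the script, before importing anything that opens a connection, should have the same effect as the two exports.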
Basically it gives it direct access to the host, this is why it is considered less safe (access on other levels as well, like network)
Regarding the missing packages, you might want to test with: force_analyze_entire_repo: false
https://github.com/allegroai/trains/blob/c3fd3ed7c681e92e2fb2c3f6fd3493854803d781/docs/trains.conf#L162
Or, if you have a full venv you would like to store instead:
https://github.com/allegroai/trains/blob/c3fd3ed7c681e92e2fb2c3f6fd3493854803d781/docs/trains.conf#L169
BTW: which package is missing?
Regarding resetting it via code, if you need I can write a few lines for you to do that, although that might be a bit hacky.
Maybe we should just add a flag saying, use requirements.txt ?
What do you think?
Hi GiddyTurkey39
Are you referring to an already executed Task or the current running one?
(Also, what is the use case here? Is it because the "installed packages" are inaccurate?)
Hmm, makes sense. In that case I would call export_task once (it is the easiest way to get the entire Task object description pre-filled for you); with that, you can create as many Tasks as needed by calling import_task.
Would that help?
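A rough sketch of that flow, in case it helps. The helper `make_task_descriptions` is hypothetical (not a ClearML function), and it assumes the exported description is a plain dict with a top-level "name" field; the ClearML calls themselves are only shown in comments because they need a configured server.

```python
import copy


def make_task_descriptions(exported, names):
    """Return one independent copy of an exported Task description per name."""
    descriptions = []
    for name in names:
        description = copy.deepcopy(exported)  # never mutate the original export
        description["name"] = name  # assumption: the export has a top-level "name"
        descriptions.append(description)
    return descriptions


# Intended use with ClearML (not executed here):
#   from clearml import Task
#   exported = Task.get_task(task_id="<template task id>").export_task()
#   for description in make_task_descriptions(exported, ["run 1", "run 2"]):
#       Task.import_task(description)
```

Exporting once and deep-copying keeps every new Task independent of the template, so editing one description does not leak into the others.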
WickedGoat98
I will try to collect the installation steps in a document and share it to the community once ready
Thank you! This will be awesome!
We're here if you need anything 🙂
Correct, but do notice that (1) task names are not unique and you can change them after the Task was executed, and (2) when you clone the Task, you can actually rename it. When an agent is running the Task, the init function is basically ignored, because the Task already exists. Make sense?
Also what do you have in the "Configuration" section of the serving inference Task?
BulkyTiger31 could it be that there is some issue with the elastic container?
Can you see any experiment's metrics?
Hi RattyBat71
Do you tend to create separate experiments for each fold?
If you really want to parallelize the workload, then splitting it into multiple executions (i.e. passing the index of the CV fold as an argument) makes sense; then you can compare / sort the results based on a specific metric. That said, if speed is not important, just having a single script with multiple CVs might be easier to implement?!
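Something along these lines could work for the split-per-execution option. It is only a sketch: the argument names `--fold` and `--n-folds` and the helper `fold_indices` are made up for illustration, and the interleaved split is just one simple way to pick a fold.

```python
import argparse


def fold_indices(n_samples, n_folds, fold):
    """Split range(n_samples) into train/validation indices for one fold."""
    if not 0 <= fold < n_folds:
        raise ValueError("fold must be in [0, n_folds)")
    validation = list(range(fold, n_samples, n_folds))  # every n_folds-th sample
    held_out = set(validation)
    train = [i for i in range(n_samples) if i not in held_out]
    return train, validation


def parse_args(argv=None):
    """Each execution receives the index of the fold it should run."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--fold", type=int, default=0)
    parser.add_argument("--n-folds", type=int, default=5)
    return parser.parse_args(argv)
```

Each execution then differs only in its `--fold` value, which makes the resulting experiments easy to compare side by side on the same metric.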
Hi HelplessCrocodile8
Yes, there is:
In the first case, the new_key will be automatically logged:
` a_dict = {}
a_dict = task.connect(a_dict)
a_dict['new_key'] = 42 `
In the second example, changes to the "object" passed to connect are not tracked.
Make sense?
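To illustrate why the returned object matters, here is a toy model of the idea. It is not ClearML's implementation; `TrackedDict` and this stand-in `connect` are hypothetical names used only to show that a wrapper can see assignments made through it, while writes to the original object go unnoticed.

```python
class TrackedDict(dict):
    """Toy dict that reports every item assignment to a callback."""

    def __init__(self, data, on_change):
        super().__init__(data)
        self._on_change = on_change

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._on_change(key, value)


def connect(data, log):
    """Toy stand-in for task.connect: returns a tracked copy of `data`."""
    return TrackedDict(data, lambda key, value: log.append((key, value)))
```

In this toy version, assigning through the object returned by `connect` appends to `log`, while assigning to the dict that was passed in does not.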
Hi SmugTurtle78
Unfortunately there is no actual filtering for these logs, because they are so important for debugging and visibility. I have to ask, what's the use case for removing some of them?
BTW: I'm assuming that args is not the ArgParser object, as the ArgParser is automatically "connected"?
Hi ReassuredTiger98
Could you send the logs of both runs?
(I'm not sure whether this is a bug or some misconfiguration, but the scenario should have worked...)
I do not think this is the upload timeout. It makes no sense to me for the GCP package to include a 60 sec timeout for upload (we do not pass any timeout, so it is their internal default for the argument)...
I'm also not sure where the timeout originates (I'm assuming the initial GCP handshake connection could not actually time out, as the response should be relatively quick, so 60 sec is more than enough).