Sorry for pinging you on this old thread.
...
And what was the optimizer? Adam? RMSProp?
Sorry, missed it...
I would actually use the HPO to test various setups (it uses Optuna under the hood, so really SOTA Hyperband/Bayesian optimization on top of it)
https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
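A minimal sketch of wiring it up (project/task names, parameter ranges and the base task ID below are placeholders):
```python
from clearml import Task
from clearml.automation import (
    DiscreteParameterRange,
    HyperParameterOptimizer,
    UniformParameterRange,
)
from clearml.automation.optuna import OptimizerOptuna

# the optimizer run itself is tracked as a Task
task = Task.init(project_name="examples", task_name="HPO demo",
                 task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id="<base_task_id>",  # the experiment to clone & optimize
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-1),
        DiscreteParameterRange("General/batch_size", values=[16, 32, 64]),
    ],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    optimizer_class=OptimizerOptuna,  # Optuna under the hood
    max_number_of_concurrent_tasks=2,
    total_max_jobs=20,
)
optimizer.start()  # launches the clones via agents
optimizer.wait()
optimizer.stop()
```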
You need to set CLEARML_DEFAULT_BASE_SERVE_URL so it knows how to access itself.
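For example (the exact host/port/path are deployment-specific, this is just a placeholder):
```
export CLEARML_DEFAULT_BASE_SERVE_URL="http://<serving-host>:8080/serve"
```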
Looks like at the end of the day we removed proxy_set_header Host $host; and used the FQDN for the proxy_pass line.
And did that solve the issue?
Hi RoughTiger69
I like the direction this is taking, let me add some more complexity.
My thinking is that if we have “input datasets”, I'd also like to be able to clone the Task and automagically change them (without the need to export the dataset_id as an argument). Basically I'm thinking:
train = Dataset.get('aabbcc1', name='train')
valid = Dataset.get('aabbcc2', name='validation')
custom = Dataset.get('aabbcc3', name='custom')
Then you end up with HyperParameter Section: "Input Datas...
You’ll just need the user to name them as part of loading them in the code (in case they are loading multiple datasets/models).
Exactly! (and yes UI visualization is coming 🙂 )
This is odd. Can you send the full Task log? (remove any password/user/repo details that you think are sensitive)
Hmm, maybe we should add a test once the download is done, comparing the expected file size and the actual file size, and if they are different we should redownload?
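Something along these lines (a rough sketch, not the actual clearml internals, assuming the server reports Content-Length):
```python
import os
import requests

def download_with_size_check(url, dest, retries=3):
    # re-download when the size on disk does not match
    # the Content-Length reported by the server
    for _ in range(retries):
        resp = requests.get(url, stream=True)
        expected = int(resp.headers.get("Content-Length", -1))
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
        if expected < 0 or os.path.getsize(dest) == expected:
            return dest
    raise RuntimeError("size mismatch after {} attempts: {}".format(retries, url))
```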
EnviousStarfish54 we just fixed an issue that relates to "installed packages" on Windows.
The RC is due to be released in the upcoming days, I'll keep you posted.
LudicrousParrot69 this is an implementation issue; this entire page is based on "task comparison", and a single Task means a totally different interface for querying the data 🙂
Just so I understand:
the scheduler executes main every 60 sec,
main spins up X sub-processes,
and each subprocess needs to report scalars?
However, there is still a delay of approximately 2 minutes between the completion of setup,
Where is that delay in the log?
(btw: it seems your container is missing clearml-agent & git, installing those might add some time)
(This is why we recommend using pip, because it is stable and clearml-agent takes care of pytorch/cuda versions)
os.system
Yes, that's the culprit. It actually runs a new process, and clearml assumes that there are no other scripts in the repository being used, so it does not analyze them.
A few options:
1. Manually add the missing requirement with Task.add_requirements('package_name') (make sure you call it before Task.init)
2. Import the second script from the first script. This will tell clearml to analyze it as well.
3. Force clearml to analyze the whole repository: https://g...
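For options 1 and 2, something like this (the package and module names are just placeholders):
```python
from clearml import Task

# Option 1: declare the requirement the analysis missed
# (must be called before Task.init)
Task.add_requirements("some_package")

task = Task.init(project_name="examples", task_name="os.system demo")

# Option 2: importing the second script (instead of only
# launching it via os.system) lets clearml analyze it too
import second_script  # noqa: F401
```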
Hmmm, yes we should definitely add --debug (if you can, please add a GitHub issue so it is not forgotten).
FiercePenguin76 Specifically, are you able to ssh manually to <external_address>:<external_ssh_port>?
Or can it also be right after Task.init()?
That would work as well 🙂
Hi @<1541954607595393024:profile|BattyCrocodile47>
But the files API is still open to the world, right?
No, of course not 🙂 (i.e. the API is authenticated with a JWT header, which is why you need to generate the secret/key in the UI)
That said, the login process itself is user/pass stored on the server, but other than that the web/api are secured. The file server on the other hand is plain HTTP storage and does not verify the connection like the API does. So if you are going the self-ho...
So this should be easier to implement, and would probably be safer.
You can basically query all the workers (i.e. agents) and check if they are running a Task; if they are not (for a while), remove the "protection flag".
wdyt?
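Something along these lines, using the APIClient (the "protection flag" handling itself is hypothetical):
```python
from clearml.backend_api.session.client import APIClient

client = APIClient()
for worker in client.workers.get_all():
    # a worker with no associated task is idle
    if not getattr(worker, "task", None):
        remove_protection_flag(worker.id)  # hypothetical helper
```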
BoredHedgehog47 if you are running it on K8s, then the setup script runs before everything else, even before an agent appears on the machine. Unfortunately this means the output is not logged yet, hence the missing console lines (I think the next version of the glue will fix that).
In order to test you can do:
export TEST_ME
then inside your code you will be able to see it:
os.environ['TEST_ME']
Make sense?
Is the agent itself registered on the clearml-server (a.k.a can you see it in the UI?)
Sure, you can pass ${stage_data.id} as an argument, and the actual Task will get the referenced step's Task ID of the current execution.
Make sense?
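A minimal sketch (project/task names and the parameter key are placeholders):
```python
from clearml import PipelineController

pipe = PipelineController(name="pipeline demo", project="examples", version="1.0")
pipe.add_step(name="stage_data",
              base_task_project="examples", base_task_name="data task")
pipe.add_step(
    name="stage_train",
    parents=["stage_data"],
    base_task_project="examples",
    base_task_name="train task",
    # resolved at runtime to the executed stage_data step's Task ID
    parameter_override={"General/dataset_task_id": "${stage_data.id}"},
)
pipe.start()
```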
Do you think the local agent will be supported someday in the future?
We can take this code sample and extend it. I can't see any harm in that.
It would make it very easy to run "sweeps" without any "real agent" installed.
I'm thinking of rolling out multiple experiments at once
You mean as multiple subprocesses? Sure, if you have the memory for it.
Yep 🙂 but only in the RC (or on GitHub)
Hi SourOx12
I think that you do not actually need this one:
step = step - cfg.start_epoch + 1
you can just do:
step += 1
ClearML will take care of the offset itself.
It looks somewhat familiar ... 😞
SuccessfulKoala55 any idea?
DilapidatedDucks58 trains-agent adds the artifactory URL as --extra-index-url. Are you sure you are getting the correct torch version in the container? Because the torch HTML page is not an artifactory page, it is a list of links, and I just want to make sure you are getting the correct version, because otherwise it can default to the CPU version, which we don't want 🙂 Anyhow, you can use the direct link in the "installed packages" and just put there https://download.pytorch.org/whl/nightly/cu101...
Quick update: Nexus supports direct HTTP upload, which means that, as CostlyOstrich36 mentioned, just pointing to the Nexus HTTP upload endpoint would work:
output_uri="http://<nexus>:<port>/repository/something/"
See docs:
https://support.sonatype.com/hc/en-us/articles/115006744008-How-can-I-programmatically-upload-files-into-Nexus-3-
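In code that would look like this (host/port/repository path are placeholders):
```python
from clearml import Task

# all artifact/model uploads go to the Nexus raw repository endpoint
task = Task.init(
    project_name="examples",
    task_name="nexus upload demo",
    output_uri="http://<nexus>:<port>/repository/something/",
)
```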
AstonishingRabbit13
https://github.com/googleapis/google-cloud-python/issues/4941#issuecomment-369472576
Check the openssl version and the system date; this seems like a low-level SSL error (even before authentication).