Hi @<1566596960691949568:profile|UpsetWalrus59>
just wondering - shouldn't the job still work if I didn't push the commit yet
How would that work? It would not know which commit to take, and it would also fail on git diff apply, no?
Thanks for checking NastyFox63
I double checked with both front/backend, there should not be any limit...
Could you maybe provide a toy demo to reproduce the issue ?
BTW: seems like conda doesn't support git+git:// packages
How about switching to pip ? you can still run the entire thing from conda env, it will just use pip & venv to install everything, other than that it should work as expected.
Hi DeliciousBluewhale87
Yes that should have worked, can you verify the task status ?
print(Task.get_task(...).get_status())
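If it helps, here is a minimal sketch of that status check. It assumes the `clearml` package (older installs would import from `trains` instead), and `is_finished` / `print_task_status` are made-up helper names. The set of "final" status strings is my assumption of what the server reports, so adjust it if yours differ.

```python
# Statuses after which a task is not expected to change any more (assumed set).
FINAL_STATUSES = {"completed", "stopped", "failed", "closed", "published"}


def is_finished(status):
    """Return True if the status string means the task is no longer running."""
    return str(status).lower() in FINAL_STATUSES


def print_task_status(task_id):
    """Fetch a task by ID and print its current status."""
    # Imported lazily so is_finished() can be used without clearml installed.
    from clearml import Task

    status = Task.get_task(task_id=task_id).get_status()
    print(status, "(finished)" if is_finished(status) else "(still active)")
    return status
```

Calling `print_task_status("<your task id>")` needs a configured server; `is_finished` is just string logic.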
EnviousStarfish54 a fix is already available in the latest RC
Could you verify it solves your issue as well?
pip install trains==0.16.2rc0
My main query is do I wait for it to be a sufficient batch size or do I just send each image as soon as it comes to train
This is usually a cost optimization issue. Generally speaking, if GPU up time is not an issue, then the process is stochastic anyhow, so waiting for a batch or not is not the most important factor (unless you use a batchnorm layer, in which case this is basically a must).
I would not be able to split the data into train test splits, and that it would be very expensiv...
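On the "wait for a batch or send each image" question above: if you do end up waiting for a batch (e.g. because of batchnorm), one common pattern is a size-or-timeout buffer. This is only a sketch in plain Python; `MicroBatcher` and its parameters are hypothetical names, not part of trains.

```python
import time


class MicroBatcher:
    """Collect items until the batch is full or the oldest item waited too long."""

    def __init__(self, max_batch_size=8, max_wait_sec=0.5, clock=time.monotonic):
        self.max_batch_size = max_batch_size
        self.max_wait_sec = max_wait_sec
        self._clock = clock  # injectable so the timeout can be tested
        self._items = []
        self._first_arrival = None

    def add(self, item):
        """Add one item; return a batch if one is ready, else None."""
        if not self._items:
            self._first_arrival = self._clock()
        self._items.append(item)
        return self.flush_if_ready()

    def flush_if_ready(self):
        """Return the pending batch if it is full or has timed out, else None."""
        if not self._items:
            return None
        full = len(self._items) >= self.max_batch_size
        timed_out = self._clock() - self._first_arrival >= self.max_wait_sec
        if full or timed_out:
            batch, self._items = self._items, []
            return batch
        return None
```

Setting `max_batch_size=1` gives the "send each image as soon as it comes" behaviour, so the same code covers both options and the trade-off becomes a config value.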
okay that's good, that means the agent could run it.
Now it is a matter of matching the TF with cuda (and there is no easy solution for that). Basically I think that what you need is "nvidia/cuda:10.2-cudnn7-runtime-ubuntu16.04"
Hi FriendlyKoala70, trains will report all the tensorboard graphs, I'm assuming that's what is creating the epoch_lr graph. On top of it, you can always report manually with the logger (as you pointed out). Does that make sense to you?
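For the manual reporting part, a rough sketch could look like this. It assumes the `clearml` package (`trains` on older installs) and uses `report_scalar` on the task logger; the project/task names and the halving learning-rate schedule are placeholders I made up.

```python
def report_learning_rate(logger, lr, epoch):
    """Report the learning rate as a scalar, alongside the auto-logged TensorBoard graphs."""
    logger.report_scalar(title="epoch_lr", series="lr", value=lr, iteration=epoch)


def main():
    # Imported lazily; needs a configured server to actually run.
    from clearml import Task

    task = Task.init(project_name="examples", task_name="manual lr reporting")
    logger = task.get_logger()
    lr = 0.1  # placeholder schedule: halve the learning rate every epoch
    for epoch in range(3):
        report_learning_rate(logger, lr, epoch)
        lr *= 0.5
```

Using a different `title` than the TensorBoard one keeps the manual graph separate from the auto-logged one.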
Hi GrievingTurkey78
Could you provide some more details on your use case, and what's expected?
We created an account, setup our data pipeline, and now we can't get back in. Nothing is in the project. Can someone from support reach out to help?
Hi @<1545216077846286336:profile|DistraughtSquirrel81>
You mean in the SaaS? (app.clearml.ml) or is it a local installation?
If this is the SaaS, could it be the data is on a different workspace ? (you can switch workspace and refresh the page)
This is an example of how one can clone an experiment and change it from code:
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
A full HPO optimization process (basically the same idea only with optimization algorithms deciding on the next set of parameters) is also available:
https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
I ended up using
task = Task.init(continue_last_task=task_id)
to reload a specific task and it seems to work well so far.
Exactly, this will initialize and auto log the current process into the existing task (task_id). Without the argument continue_last_task it will just create a new Task and auto log everything to it 🙂
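A small sketch of the two behaviours, assuming the `clearml` package (`trains` on older installs). `resume_task` and the `task_cls` parameter are hypothetical; the parameter only exists so the call can be exercised without a server.

```python
def resume_task(task_id, task_cls=None):
    """Attach the current process to an existing task instead of creating a new one."""
    if task_cls is None:
        # Imported lazily; needs a configured server to actually run.
        from clearml import Task as task_cls
    # Passing a task ID to continue_last_task continues that specific task.
    # Without the argument, Task.init() would create a brand new task.
    return task_cls.init(continue_last_task=task_id)
```

Everything auto-logged after this call should land on the existing task rather than on a fresh one.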
Hi @<1523702932069945344:profile|CheerfulGorilla72>
I think more details are needed here :)
Hi TenseOstrich47
Does the .ssh folder on the user running the agent contain the correct credentials ?
Basically, as the user running the agent on the agent's machine, can you clone the repo with: ssh://git@github.com/15gifts/py-db.git
GiddyTurkey39 Okay, can I assume "Installed packages" contains the packages you need?
If so, you can set up trains-agent on a machine (see instructions on the GitHub)
And then clone the experiment, and enqueue it into the "default" queue (or any other queue your agent is connected to)
https://github.com/allegroai/trains-agent
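The clone and enqueue step can also be done from code instead of the UI. A rough sketch, assuming the `clearml` package (`trains` on older installs); `clone_and_enqueue`, `overrides` and `task_cls` are names I made up for illustration, and the parameter keys depend on how your experiment logged them.

```python
def clone_and_enqueue(template_task_id, queue_name="default", overrides=None, task_cls=None):
    """Clone an existing experiment, optionally change parameters, and enqueue the clone."""
    if task_cls is None:
        # Imported lazily; needs a configured server to actually run.
        from clearml import Task as task_cls

    template = task_cls.get_task(task_id=template_task_id)
    cloned = task_cls.clone(source_task=template, name="clone of " + template.name)

    if overrides:
        # Read the cloned parameters, change the requested ones, write them back.
        params = cloned.get_parameters()
        params.update(overrides)
        cloned.set_parameters(params)

    # The agent listening on this queue will pick the task up and run it.
    task_cls.enqueue(cloned, queue_name=queue_name)
    return cloned
```

The agent only picks the clone up if it is listening on the same queue name.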
Usually in the /tmp folder under a temp filename (it is generated automatically when spun up)
In case of the services, this will be inside the docker itself
git config --system credential.helper 'store --file /root/.git-credentials'
Maybe we should use this hack for cloning with user/token in general ...
I thought the dataset was only linked to the fileserver and not to the specific url used to upload it.
ShinyRabbit94 yep exactly! the idea is that you can actually do the storage on any solution (S3/GS etc.) the file server is just the default one 🙂
However, this one should be a feature to work on, and should be fairly easy to implement.
Feel free to add as GitHub issue 🙂
Main challenge is understanding what needs to be added as "uncommitted changes"
I think that what you need is to create an OutputModel, then call update_weights with the weights file when you have the better model. This will also allow you to tag the model object. Would that help? Or would it make sense to use Task.models and count on the auto logging?
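Something along these lines might work for the "only when you have the better model" part. It is a sketch, assuming the `clearml` package and that a higher score is better; `BestModelUploader`, the project/task names, the tag and `model.pt` are placeholders.

```python
class BestModelUploader:
    """Upload weights only when the monitored metric improves (higher is better)."""

    def __init__(self, output_model):
        # Expected to behave like clearml.OutputModel (needs update_weights()).
        self.output_model = output_model
        self.best_score = None

    def maybe_update(self, score, weights_path):
        """Register new weights if `score` beats the best so far; return True if uploaded."""
        if self.best_score is not None and score <= self.best_score:
            return False
        self.best_score = score
        self.output_model.update_weights(weights_filename=weights_path)
        return True


def main():
    # Imported lazily; needs a configured server to actually run.
    from clearml import Task, OutputModel

    task = Task.init(project_name="examples", task_name="best model upload")
    uploader = BestModelUploader(OutputModel(task=task, name="best", tags=["best"]))
    uploader.maybe_update(0.81, "model.pt")  # assumes model.pt was already saved locally
```

Since the same OutputModel object is updated each time, you end up with one tagged model entry rather than one per checkpoint.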
Notice the configuration parameters:
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/services/monitoring/slack_alerts.py#L160
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/services/monitoring/slack_alerts.py#L162
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/services/monitoring/slack_alerts.py#L156
Hi EnviousPanda91
You mean like collect plots, then generate a pdf?
I'm a bit confused between the distinction / how to use these appropriately -- Task.init does not have repo / branch args to set what code the task should be running.
It detects it automatically at run time 🙂 based on what is actually being used
My ideal is that I do exactly what Task.create does, but the task only goes into the pipeline section rather than making a new one in the experiments section.
Do y...
Okay this is indeed reported in the UI, but the trains-agent is running the experiment, and seems to be failing to clone the repository in question.
Seems like a "https" error, git is actually failing to clone the repository error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function.
Can you manually run the clone command on that machine ? I would guess there is some kind of firewall sitting in the middle of the https connection, and that is causing the git to ...
Hi CynicalBee90
Sorry, I missed the reply.
1. "I think we’ll leave the checkmark and the warning and just write SSPL below," Sounds like a good solution 👍
2. I have to admit, I would just write "language agnostic", but I will not insist further, so if you feel "platform" helps in explaining the reasoning, I'm with you.
3. "... to do smart analysis on my logged data easily, ..."
If this is the criteria, none of the options is Very easy, but they all have an interface.. not sure how to com...
Interesting... TrickyRaccoon92 could it be the validation phase was creating a new Tensorboard file ?
VexedCat68
So the checkpoints just added up. I've stopped the training for now. I need to delete all of those checkpoints before I start training again.
Are you uploading the checkpoints manually with artifacts? or is it autologged & uploaded ?
Also, why not reuse and overwrite older checkpoints?
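By "reuse and overwrite" I mean something like cycling through a fixed set of filenames, so the number of checkpoints stays bounded. A sketch in plain Python; `checkpoint_path`, the directory name and the `.pt` extension are placeholders, and whether the auto-logging replaces the matching model entry on overwrite is my assumption, so please verify on your setup.

```python
import os


def checkpoint_path(step, directory="checkpoints", keep=3, save_every=1):
    """Return a filename that cycles through `keep` slots, so old checkpoints get overwritten."""
    slot = (step // save_every) % keep
    return os.path.join(directory, "checkpoint_{}.pt".format(slot))
```

With `keep=3`, steps 0, 3, 6, ... all map to the same file, so at most three checkpoint files exist at any time.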