JitteryCoyote63 What did you have in mind?
but instead, they cannot be run if the files they produce were not committed.
The thing with git is that if you have new files and you did not add them, they will not appear in the git diff, hence they will be missing when running from the agent. Does that sound like your case?
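For example (a minimal sketch; the file name is hypothetical):
git add path/to/new_file.py    # untracked files never show up in git diff until staged
git diff --cached              # the staged new file now appears in the diff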
Hi JitteryCoyote63
Wait a few hours, there is a new fix, I'll make sure we upload it later today (scheduled to be there anyhow, I'll push it forward)
(some packages that are not inside the cache seem to be missing, and then everything fails)
How did that happen?
However, it's very interesting why the ability to cache the step impacts the artifacts' behavior
From your log:
videos_df = StorageManager.download_file(videos_df)
Seems like "videos_df" is the DataFrame itself. Why are you trying to download the DataFrame? I would expect you to download the pandas file, not a DataFrame object.
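Something along these lines is what I'd expect (a minimal sketch; the remote path is hypothetical):
import pandas as pd
from clearml import StorageManager

# download the stored file locally, then load it into a DataFrame
local_csv = StorageManager.download_file('s3://my_bucket/videos_df.csv')
videos_df = pd.read_csv(local_csv)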
It's the safest way to run multiple processes and make sure they are cleaned afterwards ...
but the logger info is missing.
What do you mean? Can I reproduce it?
BTW: The code sample you shared is very similar to how you create pipelines in ClearML, no?
(also, could you expand on how you create the Kedro node? From the face of it, it looks like another function in the repo, but I have a feeling I'm missing something)
The main issue is that applying the patch requires a git clone, and that would fail on local (not pushed) commits.
What's the use case itself?
(btw, if you copy the uncommitted changes into a file and git apply it, it will work)
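Something like (a minimal sketch):
git diff > uncommitted.patch    # on the machine with the local changes
git apply uncommitted.patch     # inside the freshly cloned repo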
If you run each "main" process as a single experiment, just don't call Task.init in the scheduler
Hi JitteryCoyote63 ,
The easiest would probably be to list the experiment folder, and delete its content.
I might be missing a few things but the general gist should be:
from trains.storage import StorageHelper

h = StorageHelper.get('s3://my_bucket')
files = h.list(prefix='s3://my_bucket/task_project/task_name.task_id')
for f in files:
    h.delete(f)
Obviously you should have the right credentials 🙂
are models technically Tasks and can they be treated as such? If not, how to delete a model permanently (both from the server and from AWS storage)?
When you call Task.delete() it actually goes over all the models/artifacts and deletes them from the storage
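A minimal sketch (assuming the clearml SDK; the task ID is hypothetical):
from clearml import Task

task = Task.get_task(task_id='<task_id>')
task.delete()   # deletes the task together with its models/artifacts from storage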
Okay, I think I understand, but I'm missing something. It seems you call get_parameters from the old API. Is your code actually calling get_parameters? The trains-agent runs the code externally; whatever happens inside the agent should have no effect on the code. So who exactly is calling task.get_parameters, and well, why? :)
Hmm GreasyLeopard35 can you specify the range you are passing to the HPO, as well as the type of optimization class? (grid/random/optuna etc.)
It might be that the worker was killed before it unregistered; you will see it there, but the last update will be stuck (after 10 min it will be automatically removed)
Hi DeliciousBluewhale87
clearml-agent 0.17.2 was just released with the fix, let me know if it works
I'm guessing the extra index URL can be a URL to the github repo of interest?
The extra index URL is exactly what you would be passing to pip install, meaning it has to comply with the PyPI repository API.
Make sense?
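For example (the index URL here is hypothetical):
pip install --extra-index-url https://my-registry.example.com/simple/ my-package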
BTW MagnificentSeaurchin79 just making sure here:
but I don't see the loss plot in scalars
This is only with the Detect API?
and the inet of the same card?
SlipperyDove40
FYI:
args = task.connect(args, name="Args")
The "Args" section is "kind of" reserved for argparse. Meaning you can always use it, but argparse will also push/pull things from there. Is there any specific reason for not using a different section name?
SmarmySeaurchin8 what's the mount command you are using?
For setting up trains-server I would recommend the docker-compose; it is very easy to set up, and you just need a single fixed compute instance. Details: https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md
With regards to the "low prio clusters", are you asking how they could be connected with the trains-agent, or whether running code that uses trains will work on them?
and I found our lab seems to only have a shared user file, because I installed trains on one node but it doesn't appear on the others
Do you mean there is no shared filesystem among the different machines?
And do you need to run your code inside a docker, or is venv enough?
ShaggyHare67 in the HPO the learning rate parameter should be (based on the above):
General/training_config/optimizer_params/Adam/learning_rate
Notice the "General" prefix (it is case sensitive)
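e.g. a minimal sketch (assuming the clearml.automation API; the base task ID and the value range are hypothetical):
from clearml.automation import HyperParameterOptimizer, UniformParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id='<base_task_id>',
    hyper_parameters=[
        # note the case-sensitive "General" section prefix
        UniformParameterRange(
            'General/training_config/optimizer_params/Adam/learning_rate',
            min_value=1e-5, max_value=1e-2),
    ],
    objective_metric_title='validation',
    objective_metric_series='loss',
    objective_metric_sign='min',
)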
Yes, this seems like the problem, you do not have an agent (trains-agent) connected to your server.
The agent is responsible for pulling the experiments and executing them:
pip install trains-agent
trains-agent init
trains-agent daemon --gpus all
trains-agent doesn't run the clone, it is pip...
basically calling "pip install git+https://..."
Not sure you can pass extra arguments
Also, this is not a setup problem, otherwise it would have been consistently failing... this actually looks like a network issue.
The only thing I can think of is retrying the install if we get a network error (not sure what the exit code of pip is though, maybe 9?)