RipeGoose2 That sounds familiar. Could you test with the latest RC?
pip install trains==0.16.4rc0
I think we added it somewhere in 0.14; anyhow, I just checked the Logger doc, it is there now 🙂
@<1651395720067944448:profile|GiddyHedgehong81> just to be clear, Dataset.get_local_copy returns a path to your files.
You have to manually add the additional path to the specific files you need to use; it does not know that in advance.
That was the initial issue you had, and I assume it is the same one here. Does that make sense?
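A minimal sketch of what that looks like in practice (dataset, project and file names here are hypothetical):
from clearml import Dataset
import os

# get_local_copy() returns the root folder of the local dataset copy
ds_path = Dataset.get(dataset_project="my_project", dataset_name="my_dataset").get_local_copy()
# joining the specific file you need is up to you
file_path = os.path.join(ds_path, "images/train_0001.jpg")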
One more question: in the second log, the trains agent is configured with conda; in the first it is configured with pip, or at least that is what it looks like. Can you confirm?
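For reference, the package manager the agent uses is selected in trains.conf; a sketch:
agent {
    package_manager {
        type: conda   # the default is pip
    }
}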
but I have no idea what's behind 1, 2 and 3 compared to the first execution
This is why I would think multiple experiments, since each one will store all the arguments (and I think these arguments are somehow being lost).
wdyt?
clearml_agent: ERROR: Can not run task without repository or literal script in script.diff
This is odd ...
OutrageousSheep60 when you launch clearml-session
it tells you the session ID (which is also a Task ID). Can you look for it in the UI and check whether there is something in the repo / uncommitted-changes section?
I think this is the temp requirements file it creates, not your requirements file. If you attach a log here with the "installed packages" section, maybe we could help debug it.
And is "requirements-dev.txt" in your git root folder?
What is your clearml-agent version?
Are you inheriting from their Dockerfile?
In venv mode yes; in docker mode you can pass them by setting the -e flag via docker_extra_flags:
https://github.com/allegroai/trains-agent/blob/121dec2a62022ddcbb0478ded467a7260cb60195/docs/trains.conf#L98
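A sketch of the idea, assuming docker_extra_flags from the linked trains.conf takes standard docker run flags (the variable name here is hypothetical):
agent {
    # pass an environment variable into the container
    docker_extra_flags: ["-e", "MY_VAR=my_value"]
}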
OddAlligator72 FYI you can also import / export an entire Task (basically allowing you to create it from scratch / from json, even without calling Task.create):
Task.import_task(...)
Task.export_task(...)
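For example, a hedged sketch assuming a recent clearml/trains version (the task ID is hypothetical):
from clearml import Task

# export the full task definition as a dict (can be serialized to json)
task = Task.get_task(task_id="aabbccdd")
task_data = task.export_task()

# create a brand-new task from that definition
new_task = Task.import_task(task_data)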
EnviousStarfish54 could you send the conda / pip environment?
Maybe that's the diff between machine A/B ?
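e.g. with the standard commands:
pip freeze > pip_env.txt
conda env export > conda_env.yml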
Hi GiddyTurkey39
Glad to see that you are already diving into the controllers (the stable release will be out early next week)
A bit of background on how the pipeline controllers are designed:
All steps in the pipeline are experiments already registered in the system (i.e. you can see them in the UI). Regardless on how you created those experiments they have to be there prior to the pipeline launch. The pipeline itself can be executed on any machine (it does very little, and...
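For reference, a minimal sketch of a controller (assuming the clearml PipelineController API; project and task names are hypothetical):
from clearml.automation import PipelineController

# every step references an experiment that is already registered in the system
pipe = PipelineController(name="my-pipeline", project="examples", version="1.0.0")
pipe.add_step(name="stage_data", base_task_project="examples", base_task_name="data prep")
pipe.add_step(name="stage_train", parents=["stage_data"],
              base_task_project="examples", base_task_name="train model")
pipe.start(queue="services")  # the controller itself does very little work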
task.project is the project ID (not the name); task.get_project_name() will return the project name.
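i.e.:
from clearml import Task

task = Task.current_task()
print(task.project)             # the project ID (a unique id string)
print(task.get_project_name())  # the human-readable project name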
And still a difference between A/B , one detecting the repo the other does not?
Hi @<1541954607595393024:profile|BattyCrocodile47>
is this on your self hosted machine ?
just got the pipeline to run
Nice!
using the default queue okay?
Using the default queue is fine. The different queue is the "services" queue, where by default the trains-server runs an agent that pulls jobs.
In "services" mode, an agent will pull jobs one right after the other (not waiting for the previous job to finish), as opposed to a regular queue (any other), where the trains-agent will pull a new job only after the previous one has completed.
I'm looking into the savefig issue; meanwhile you can disable the popup by adding the following at the top of your code:
import matplotlib
matplotlib.rcParams['backend'] = 'agg'
import matplotlib.pyplot
matplotlib.pyplot.switch_backend('agg')
tried it and restarted the agent, but not working properly
What do you mean by not working? Can you provide logs?
Hi @<1523701304709353472:profile|OddShrimp85>
the venv setup is totally based on my requirements.txt instead of adding on to what the image has before. Why?
Are you using the agent in docker mode? If this is the case, it creates a venv inside the docker, inheriting from the preinstalled docker system packages.
I use yaml config for data and model. Each of them would be a nested yaml (could be more than 2 layers), so it won't be a flexible solution and I would need to manually flatten the dictionary
Yes, you are correct, the recommended option would be to store it with task.connect_configuration;
its goal is to store these types of configuration files/objects.
You can also store the yaml file itself directly; just pass a Path object instead of a dict/string.
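A minimal sketch (project, task and file names are hypothetical):
from pathlib import Path
from clearml import Task

task = Task.init(project_name="examples", task_name="config demo")
# pass the yaml file as a Path object; the returned path points to the config
# actually used (your local file when running locally, the stored copy under an agent)
config_path = task.connect_configuration(Path("configs/model.yaml"))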
Yes EnviousStarfish54, the comparison is line by line and always against the left experiment (like any multi-comparison, you have to set the baseline, which is always the left column here; do notice you can reorder the columns and the comparison will be updated).
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit from
JitteryCoyote63 What do you mean by that?
Hmmm, make sure the task doing the cloning is using 0.16.1 or above, because with 0.16 we added sections, and compatibility is per version. Meaning, if you have tasks generated with trains 0.16, you need trains 0.16 to clone them from code (so you can properly control the arguments).
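A hedged sketch of cloning from code with argument overrides (IDs and values are hypothetical):
from clearml import Task

template = Task.get_task(task_id="aabbccdd")
cloned = Task.clone(source_task=template, name="cloned with new args")
# with 0.16+ sections, argparse arguments live under the "Args" section
cloned.set_parameter("Args/batch_size", "64")
Task.enqueue(cloned, queue_name="default")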
(obviously if you have dependencies, they will be installed first, and then the correct torch will be installed over the previous version)
JitteryCoyote63 in the UI, what's the value of "config"? Is it empty, or is it a string?
Also, could you check if removing the 'type=str' from the add_argument changes the behavior?
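i.e. something along these lines, just for the check:
import argparse

parser = argparse.ArgumentParser()
# before: parser.add_argument('--config', type=str, default=None)
parser.add_argument('--config', default=None)
args = parser.parse_args()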
Hi @<1541592204353474560:profile|GhastlySeaurchin98>
During our first large hyperparameter run, we noticed that some tasks get aborted with the following console log:
This looks like the HPO algorithm doing early stopping. Which algo are you using?