I see, something like:
```
from clearml import Task
from mystandalone import my_func_that_also_calls_task_init

def task_factory():
    task = Task.create(project_name="my_project", task_name="my_experiment",
                       script="main_script.py", add_task_init_call=False)
    return task
```
if the pipeline and the my_func_that_also_calls_task_init are in the same repo, this should actually work.
You can quickly test this pipeline with
```
pipe = PipelineController()
pipe.add_step(preprocess, ...)
pipe.add_step(base_task_facto...
```
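A fuller sketch of that quick test might look like this (project, step names and the factory function are placeholders, and I'm assuming the PipelineController / base_task_factory API):
```
from clearml import Task
from clearml.automation import PipelineController

def task_factory(node):
    # build the step Task without injecting an extra Task.init call
    return Task.create(project_name="my_project", task_name="my_experiment",
                       script="main_script.py", add_task_init_call=False)

pipe = PipelineController(name="test-pipeline", project="my_project", version="1.0.0")
pipe.add_step(name="preprocess", base_task_project="my_project", base_task_name="preprocess")
pipe.add_step(name="train", parents=["preprocess"], base_task_factory=task_factory)

# run the controller logic locally for quick debugging
pipe.start_locally(run_pipeline_steps_locally=True)
```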
Right, you need to pass "repo" and direct it to the repository path
(BTW, what's the clearml version?)
Hi CleanPigeon16
can I make the steps in the pipeline use the latest commit in the branch?
Yes:
manually: clone the step's Task (in the UI), then in the UI edit the Execution section, change it to "last commit on branch" and specify the branch name
programmatically: same as the above, clone + edit
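If the step Tasks are created from code, a minimal sketch (repo URL, project and names are placeholders) would be to point the Task at a branch without pinning a commit, so the agent checks out the latest commit on that branch:
```
from clearml import Task

task = Task.create(
    project_name="my_project",
    task_name="my_step",
    repo="https://github.com/user/repo.git",
    branch="main",      # no commit specified -> latest commit on the branch is used
    script="main_script.py",
)
```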
ValueError: Could not parse reference '${run_experiment.models.output.-1.url}', step run_experiment could not be found
Seems like the "run_experiment" step is not defined. Could that be ...
Hi CrookedWalrus33
docker_setup_bash_script= ["export PATH=""/workspace/miniconda/bin:$PATH"])
Oh I think you are correct, this should do the trick:
```
docker_setup_bash_script=[
    "export PATH=/workspace/miniconda/bin:$PATH",
    "export LOCAL_PYTHON=/workspace/miniconda/bin/python3",
]
```
This will make sure both the agent and the script execute with the same python.
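A minimal sketch of how that could be wired up from code, assuming the parameter belongs to Task.set_base_docker (the image name, queue and paths are placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="docker env")

# run inside a docker image that already has the environment built in,
# and make sure both the agent and the script pick up the miniconda python
task.set_base_docker(
    docker_image="my_registry/my_image:latest",
    docker_setup_bash_script=[
        "export PATH=/workspace/miniconda/bin:$PATH",
        "export LOCAL_PYTHON=/workspace/miniconda/bin/python3",
    ],
)

task.execute_remotely(queue_name="default")
```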
but to run a script inside a docker which already has the environment built in.
If this is already activated, the latest agent w...
Hi SmilingFrog76
Great question, sadly multi-node is never simple 🙂
Let's start with the basics: let's assume one worker is available and the other is not, what would you want to happen? (p.s. I'm not aware of flexible multi-node training frameworks, i.e. a framework that can detect another node is available and connect with it mid-training; that said, it might exist 🙂 )
SuperiorDucks36 from code ? or UI?
(You can always clone an experiment and change the entire thing, the question is how will you get the data to fill in the experiment, i.e. repo / arguments / configuration etc)
There is a discussion here, I would love to hear another angle.
https://github.com/allegroai/trains/issues/230
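For the "from code" route, a minimal sketch might look like this (project, names, the parameter key and the queue are placeholders):
```
from clearml import Task

# clone an existing experiment, override what you need, then enqueue it
template = Task.get_task(project_name="my_project", task_name="my_experiment")
cloned = Task.clone(source_task=template, name="my_experiment (modified)")
cloned.set_parameters({"General/learning_rate": 0.001})
Task.enqueue(cloned, queue_name="default")
```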
Just to clarify, where do I run the second command?
Anywhere, just open a python console and import the offline task:
```
from trains import Task
Task.import_offline_session('./my_task_aaa.zip')
```
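For reference, a minimal sketch of the side that produces that zip (project/task names are placeholders); the path of the offline session is reported when the task closes:
```
from trains import Task

# offline mode: nothing is sent to the server,
# everything is packed into a local zip when the task ends
Task.set_offline(offline_mode=True)

task = Task.init(project_name="examples", task_name="offline test")
# ... training / reporting code ...
task.close()
```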
Related, how do I specify in my code the cache_dir where the zip is saved?
This is the Trains cache folder, you can set it in the trains.conf file:
https://github.com/allegroai/trains/blob/10ec4d56fb4a1f933128b35d68c727189310aae8/docs/trains.conf#L24
I was hoping that there's a universal flag somewhere. Asking this because I want all the Models and Artifacts to be stored in one place and the users shouldn't have to edit their configuration files.
You mean like make sure all models/artifacts are always uploaded?
Hi SmarmyDolphin68
You have two options:
1. Automatically upload the models when training: pass output_uri to Task.init. For example output_uri=True will upload to the clearml-server, output_uri='s3://bucket/folder' will upload to S3, etc.
2. Manually upload a model that you have locally: https://github.com/allegroai/clearml/blob/9ff52a8699266fec1cca486b239efa5ff1f681bc/examples/reporting/model_config.py#L37
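A minimal sketch of both options (the bucket path and file names are placeholders; the manual upload assumes the OutputModel interface used in that example):
```
from clearml import Task, OutputModel

# option 1: auto-upload anything the framework saves (checkpoints / models)
task = Task.init(
    project_name="examples",
    task_name="model upload",
    output_uri="s3://bucket/folder",   # or output_uri=True for the clearml file server
)

# option 2: manually register and upload a local weights file
model = OutputModel(task=task)
model.update_weights(weights_filename="model.pkl")
```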
I'm sorry, I mean if the queue name is not provided to the agent, the agent will look for the queue with the "default" tag. If you are specifying the queue name, there is no need to add the tag.
Is it working now?
User/pass should be enough,
Could it be the specific commit ID is not pushed?
I'll try to find the link...
FlutteringWorm14 Can you verify that even with the clearml.conf it has no effect?
I commented out the upload_artifact at the end of the code and it finishes correctly now
upload_artifact caused the "failed" issue?
Hey WickedGoat98
I found the bug, it is due to the fact the numpy (passed to plotly) contains both datetime and nan, and plotly.js does not like it. I'll make sure this is fixed, in the meantime you can just remove the first row (it contains the nan):
```
df = pd.concat([tickerDf.Close, tickerDf_Change.Close_pcent], axis=1)
df = df[1:]
```
CurvedHedgehog15 there is no need for:
```
task.connect_configuration(
    configuration=normalize_and_flat_config(hparams),
    name="Hyperparameters",
)
```
Hydra is automatically logged for you, no?!
Hi JumpyPig73
Funny enough this is being fixed as we speak 🙂
The main issue is that as you mentioned, ClearML does not "detect" the exit code when os.exit() is called, and this is why it is "missing" the failed test (because as mentioned, all exceptions are caught). This should be fixed in the next RC
Because we are working with very big files, having them stored at multiple locations is something we try to avoid
Just so I better understand, is this for storing files as part of a dataset, or as debug samples ?
In other words, can two different processes create the exact same file (image)?
Hi @<1671689437261598720:profile|FranticWhale40>
You mean the download just fails on the remote serving node because it takes too long to download the model?
(basically not a serving issue per-se but a download issue)
I just called exit(0) in a notebook and it closed it (the kernel), no exception
Try to manually edit the "Installed Packages" (right click the Task, select "reset", now you can edit the section)
and change it to:
```
-e git+ssh://git@github.com/user/private_package.git@57f382f51d124299788544b3e7afa11c4cba2d1f#egg=private_package
```
(assuming "pip install -e git+ssh://git@github.com/user/..." will work, it should solve the issue)
No worries 🙂 glad it worked
Now I need to figure out how to export that task id
You can always look it up 🙂
How come you do not have it?
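In case it helps, a minimal sketch of looking a task id up from code (project/experiment names are placeholders):
```
from clearml import Task

# look an existing experiment up by project / name and grab its id
task = Task.get_task(project_name="my_project", task_name="my_experiment")
print(task.id)
```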
Hi @<1546303269423288320:profile|MinuteStork43>
Failed uploading: cannot schedule new futures after interpreter shutdown
This is odd where / when exactly are you trying to upload it?
LOL I keep typing clara without noticing (maybe it's the nvidia thing I keep thinking about)
Carla makes much more sense 😄
okay that makes sense, if this is the case I would just use clearml-agent execute --id <task_id here> to continue the training Task.
Do notice you have to reload your last checkpoint from the Task's models/artifacts to continue 🙂
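A minimal sketch of what reloading that checkpoint could look like inside the re-executed script (assuming the checkpoint was registered as an output model on the same Task):
```
from clearml import Task

# running under `clearml-agent execute --id <task_id>`,
# so the current task is the one being continued
task = Task.current_task()

# grab the latest checkpoint registered on this Task and fetch a local copy
last_model = task.models["output"][-1]
checkpoint_path = last_model.get_local_copy()
# ... load checkpoint_path with your framework and resume training ...
```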
Last question, what is the HPO optimization algorithm, is it just grid/random search or Optuna/BOHB? If it is the latter, how do you make it "continue"?
Hmm yes, we should probably provide metrics:
```
client.workers.get_stats(..., items=[dict(key='cpu_usage'), dict(key='gpu_usage')])
```
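Until then, a rough sketch of pulling those stats through the APIClient (I'm assuming the from_date / to_date / interval arguments of the workers.get_stats endpoint; adjust to the actual signature):
```
from time import time
from clearml.backend_api.session.client import APIClient

client = APIClient()

# last hour of per-worker CPU / GPU usage, one sample per minute
stats = client.workers.get_stats(
    from_date=time() - 60 * 60,
    to_date=time(),
    interval=60,
    items=[dict(key="cpu_usage"), dict(key="gpu_usage")],
)
print(stats)
```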