Why would that require refactoring? The Dataset class should take care of it internally, no?
The reason my_name is a subproject is so that every version can be a "Task" inside that project; it's just easier to manage (or at least that was the idea).
The reasoning is that simultaneous processes will most likely fail on the GPU due to the memory limit.
I still have name `my_name`, but the project name is `my_project/.datasets/my_name` rather than `my_project/.datasets`
Yes, this is the expected behavior
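For reference, a minimal sketch of creating such a dataset (names taken from this thread; the ./data folder is made up):
` from clearml import Dataset

# creates a Dataset whose backing Task lives under "my_project/.datasets/my_name"
ds = Dataset.create(dataset_name="my_name", dataset_project="my_project")
ds.add_files(path="./data")  # hypothetical local folder
ds.upload()
ds.finalize() `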
And I don't see any new projects / subprojects where that dataset creation Task is stored
They are marked "hidden", hence by default you cannot see them in the UI (so they will only appear in the Datasets page).
You can toggle the UI hidden flag by going to your settings page and selecting "Con...
Hi QuaintJellyfish58
This is odd, this "undefined" project is also marked as "Example" which would explain why you cannot delete it, but not how you ended up with one
Any idea what changed on your server?
QuaintJellyfish58 this is very odd, and the "undefined" is always marked as example?
it is shown in the recording above
It was so odd, I had to ask :) okay let me see if we can reproduce
I don't have any error message in the browser console - just an empty array returned on events.get_task_logs. This bug didn't exist on version 1.1.0 and is quite annoying…
Meaning the REST API returns nothing, is that correct?
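One way to check that directly against the server (a sketch, assuming the events.get_task_log endpoint; "<task_id>" is a placeholder):
` from clearml.backend_api.session.client import APIClient

client = APIClient()
# if this also comes back empty, the problem is on the server side, not the UI
res = client.events.get_task_log(task="<task_id>")
print(res.events) `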
I was not able to reproduce with the example code :)
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
Weird?! I see this in the code:
https://github.com/allegroai/clearml/blob/382d361bfff04cb663d6d695edd7d834abb92787/clearml/automation/controller.py#L2871
Hi ScaryBluewhale66
TaskScheduler I created. The status is still `running`. Any idea?
The TaskScheduler needs to actually run in order to trigger the jobs (think cron daemon)
Usually it will be executed on the clearml-agent services queue/machine.
Make sense?
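Roughly along these lines (a sketch; the task id, queue names and schedule arguments are placeholders):
` from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
# "<task_id>" is a placeholder for the Task to re-launch; see add_task
# for the exact semantics of the minute/hour schedule arguments
scheduler.add_task(schedule_task_id="<task_id>", queue="default", minute=30)
# enqueue the scheduler itself, e.g. on the services queue (think cron daemon)
scheduler.start_remotely(queue="services") `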
DilapidatedDucks58 long story short:
if you do:
` from clearml import StorageManager
from clearml.storage.helper import StorageHelper

# the bucket URL was elided in the original message (the " " placeholder);
# the s3:// bucket link goes as the first argument
StorageHelper.get(" ", retries=5) `
It should make sure that all the other s3:// links of this bucket will use the same original configuration (i.e. retries)
If this workaround works let's make sure we add it into the conf file, wdyt ?
Yes, I do have a GOOGLE_APPLICATION_CREDENTIALS environment variable set, but nowhere do we save anything to GCS. The only usage is in the code which reads from BigQuery.
Are you certain you have no artifacts on GCS?
Are you saying that if GOOGLE_APPLICATION_CREDENTIALS is set and clearml.conf contains no "project" section, it crashes when starting?
Hi PanickyMoth78
` import torch
from clearml import Task

# net / PATH are defined elsewhere in your code
torch.save(net.state_dict(), PATH)  # auto-uploads to GCS

# get all the models from the Task
output_models = Task.current_task().models["output"]

# get the last one
last_model = output_models[-1]

# set meta-data
last_model.set_metadata(key="my key", value="my value", type="str") `
just to check. Does the k8s glue install torch by default?
SubstantialElk6 what do you mean the glue installs torch?
The glue will take a Task from the queue and create a k8s job (basically use the same docker, and inside the docker run, have the agent execute the requested Task). Where would the "torch" come into play?
SubstantialElk6 "Execution Tab" scroll down you should have "Installed Packages" section, what do you have there?
JitteryCoyote63 oh dear, let me see if we can reproduce (version 1.4 is already in internal testing, I want to verify this was fixed)
Hmm let me check something
- At its simplest, this could just mean checking that all of the steps and the pipeline itself have completed successfully (by checking their "Task status").
If a pipeline step ends with a "failed" status, an exception will be raised in the pipeline execution function; if the exception is not caught, the pipeline itself will also fail.
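To make that concrete, here's a rough sketch (decorator style, all names made up) of catching a failed step so the rest of the pipeline can continue:
` from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["result"])
def shaky_step():
    raise ValueError("something broke inside the step")

@PipelineDecorator.pipeline(name="sketch", project="examples", version="0.1")
def my_pipeline():
    try:
        result = shaky_step()
        print(result)  # touching the result is where a failed step surfaces
    except Exception:
        # without this try/except the pipeline itself would also end as "failed"
        print("step failed, continuing with the rest of the pipeline")

if __name__ == "__main__":
    PipelineDecorator.run_locally()
    my_pipeline() `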
run `pipeline_script.py` which contains the pipeline code as decorators.
So in theory the following should actually work.
Let's assume you ...
task.connect(model_config)
task.connect(DataAugConfig)
If these are separate dictionaries, you should probably use two sections:
task.connect(model_config, name="model config")
task.connect(DataAugConfig, name="data aug")
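Putting it together, a small sketch (the dict contents are made up):
` from clearml import Task

task = Task.init(project_name="examples", task_name="two config sections")

model_config = {"lr": 0.001, "batch_size": 32}
DataAugConfig = {"flip": True, "rotate_deg": 15}

# each dict now shows up as its own section in the Configuration tab
task.connect(model_config, name="model config")
task.connect(DataAugConfig, name="data aug") `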
It is still getting stuck.
I notice that one of the scalars that gets logged early is logging the epoch, while the remaining scalars seem to be reported per iteration, because the iteration value is 1355 instead of 26.
Wait, so you are seeing some scalars?...
Should have worked, the error you are getting is docker-compose parsing the yml file
Is this exactly the one from the trains-server repo ?
Hi JitteryCoyote63 you can, but obviously you should be careful: they might both try to allocate more GPU memory than the HW actually has.
` TRAINS_WORKER_NAME=machine_gpu0A trains-agent daemon --gpus 0 --queue default --detached
TRAINS_WORKER_NAME=machine_gpu0B trains-agent daemon --gpus 0 --queue default --detached `
PompousHawk82 unfortunately this is kind of binary, either you have full tracking of load/save operations or you do not.
This warning message will disappear in the next version as we will be able to log multiple models under the same Task :)
ReassuredTiger98
Can you explain what you meant by "entropy point file"?
There is no need to specify an entry point file.
It is automatically detected when you run the code manually on your machine.
My assumption was that the file "src/run_task.py" (based on your log) is just a test file, and hence was not added to the repository. So the agent failed to actually restore it from the git (files that are not added are not considered part of the git diff, this is usually git behavio...
No -- that section is blank,
This is the main issue; it should be filled with the auto-detected requirements.
The entire script was executed from within vscode, and the Task was created, but it was not prefilled with anything?
Just making sure, you called `Task.init` inside your code?
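i.e. something like this at the very top of the script (project/task names are placeholders):
` from clearml import Task

# called before anything else, so the repo, entry point and packages get auto-detected
task = Task.init(project_name="my project", task_name="my experiment") `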
ReassuredTiger98
Okay, but you should have had the prints: `...uploading artifact` and `done uploading artifact`
So I suspect something is going on with the agent.
Did you manage to run any experiment on this agent?
EDIT: Can you try with the artifacts example we have in the repo:
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
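Or an even smaller sanity check along the same lines (names are made up):
` from clearml import Task

task = Task.init(project_name="examples", task_name="artifact sanity check")
# this should print the "uploading artifact" / "done uploading artifact" messages
task.upload_artifact(name="my dict", artifact_object={"a": 1}) `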
Okay, there should not be any difference ... :)
Okay, that looks good. Now in the UI, start here and then go to the Artifacts tab.
Is it there?
Can clearml-agent currently detect this?
Hmm, you mean will the agent clean itself up?
In theory it should not, in practice you could run out of space while running the experiment itself...
You can always clean up everything from time to time (maybe worth a flag?)