I think you are correct, None values should be listed as empty values, not the string "None".
What's the clearml version you are using? And could you retest with the latest RC?
is the base Task a file or a notebook ?
Hi, is there a possibility to use one GPU card with 2 agents concurrently?
RoundMosquito25 / EnviousPanda91
You need to change the WORKER_ID (no two workers can share the same ID):
CLEARML_WORKER_ID="machine:gpu01" clearml-agent daemon ....
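For example, a minimal sketch of two agents sharing GPU 0 (queue name, worker IDs and the extra flags are illustrative, not from this thread):
# illustrative: two agents on the same GPU, each with a unique worker ID
CLEARML_WORKER_ID="machine:gpu0-a" clearml-agent daemon --queue default --gpus 0 --detached
CLEARML_WORKER_ID="machine:gpu0-b" clearml-agent daemon --queue default --gpus 0 --detached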
what format should I specify it in?
requirements.txt format e.g. ["package >= 1.2.3"]
Would this enforce that package on various components
This is a per-component control, so you can have different packages / containers based on the component
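For example, a minimal sketch assuming the decorator-based pipeline API (package name and docker image are just illustrative):
from clearml.automation.controller import PipelineDecorator

# illustrative: each component pins its own packages / container,
# overriding the auto-detected requirements for that component only
@PipelineDecorator.component(
    packages=["pandas >= 1.2.3"],   # requirements.txt style entries
    docker="python:3.9-slim",       # optional per-component container
)
def preprocess(csv_path: str):
    import pandas as pd
    return pd.read_csv(csv_path).dropna()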
Would it then no longer capture import statements?
This replaces the auto-detected packages, but obviously the auto-detection fails to pick up your internal repo package, which is the main issue here.
How is "internal package" installed, in o...
Hi @<1600661428556009472:profile|HighCoyote66>
However, we need to allocate resources to ourselves manually, using an srun command or sbatch
Long story short, there is a full SLURM integration: you push a job into the ClearML queue and it produces a SLURM job that uses the agent to set up the venv/container and run your Task. However, this is only part of the enterprise version 😞
You can however do the following (notice this is ...
EnviousStarfish54 something is also off in the git detection, it has no remote address, it just says "origin"
Any chance you have no git server ?
Regarding the installed packages, any chance you can send some sample code for me to debug ?
Hi HandsomeCrow5 hmm interesting use case,
we have seen html reports as artifacts, then you can press "download" and it should open in another tab, what would you expect on "debug samples" ?
with tensorboard logging, it works fine when running from my machine, but not when running remotely in an agent.
This is odd, could you send the full Task log?
So the way it works: when you run a component, the return value together with the entire function execution is cached. Basically:
this did NOT add the artifact to the pipeline via caching on subsequent runs ❌
you just need to do:
PipelineDecorator.upload_artifact(name='images', artifact_object=img_dir, wait_on_upload=True)
return Task.current_task().artifacts['images'].url
This will return the URL of the uploaded images (i.e. S3 bucket)
which means if this is cached you will get it...
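Putting it together, a minimal sketch of such a component (the component name, the cache flag and img_dir are illustrative):
from clearml import Task
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(cache=True)  # illustrative component
def generate_images(img_dir: str) -> str:
    # ... code that writes images into img_dir ...
    # upload explicitly, then return the artifact URL so that even a cached
    # run hands the next step a valid link to the stored images
    PipelineDecorator.upload_artifact(name='images', artifact_object=img_dir, wait_on_upload=True)
    return Task.current_task().artifacts['images'].url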
Hi FierceHamster54
Thanks for bringing it up 🙂
... in terms of secret management / key-value stores
Currently the open-source version does not include Vault support (i.e. secret management); this was added to the enterprise version a few releases ago. As far as I understand it is a per user/project/company granularity feature (i.e. company-wide settings merging with project settings merging with user-specific ones).
Is this what you are looking for or am I missing something ?
I'm not sure I'm the right person to answer that, but yes my understanding is that this is a Scale/Enterprise tier feature, at least for the time being.
Hi @<1624941407783358464:profile|GrievingTiger47>
I think you should try to contact the sales guys here: None
Task.running_locally()
Should do the trick
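For example, a minimal sketch (project/task names are illustrative):
from clearml import Task

task = Task.init(project_name="examples", task_name="demo")  # illustrative names
if Task.running_locally():
    # runs only when executed manually, not when a clearml-agent runs the Task
    print("running locally")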
MysteriousBee56 Edit your ~/trains.conf, changing
api_server: http://localhost:8008
to
api_server: http://192.168.1.11:8008
and obviously do the same for the web & files servers
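i.e. something like this in ~/trains.conf (a sketch assuming the default ports; the IP is the example from above):
api {
    # illustrative values, adjust to your server's address
    api_server: http://192.168.1.11:8008
    web_server: http://192.168.1.11:8080
    files_server: http://192.168.1.11:8081
}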
I'll make sure we fix the trains-agent to output an error message instead of trying to silently keep accessing the API server
Getting your machine IP:
just run: ifconfig | grep 'inet addr:'
Then you should see a bunch of lines; pick the one that does not start with 127 or 172.
Then to verify, run: ping <my_ip_here>
FranticCormorant35 DeterminedCrab71 please continue the discussion in this thread
Hi JumpyPig73
Funny enough this is being fixed as we speak 🙂
The main issue is that, as you mentioned, ClearML does not "detect" the exit code when os.exit() is called, which is why it "misses" the failed test (because, as mentioned, all exceptions are caught). This should be fixed in the next RC
This looks good to me...
I will have to look into it, because it should not download it...
Is there any documentation on versioning for Datasets?
You mean how to select the version name ?
HappyLion37 did you check the https://github.com/allegroai/trains/tree/master/examples/services/hyper-parameter-optimization ?
You can very quickly get it distributed as well
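If it helps, a minimal sketch of the optimizer (the base task id, parameter name, metric names and queue are all illustrative):
from clearml.automation import HyperParameterOptimizer, UniformIntegerParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id="<your_base_task_id>",   # the template experiment to clone
    hyper_parameters=[
        UniformIntegerParameterRange("General/batch_size", min_value=16, max_value=128, step_size=16),
    ],
    objective_metric_title="validation",
    objective_metric_series="loss",
    objective_metric_sign="min",
    execution_queue="default",            # any agent listening on this queue picks up the clones
    max_number_of_concurrent_tasks=4,
)
optimizer.start()
optimizer.wait()
optimizer.stop()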
the Task scheduler itself is a Task. What we did is add a new parameter section on the Task (via the task.connect call), so that we can later clone it, modify the values, and use the new values at runtime
(Task.connect will put the data from the Task/UI back into the dict when the agent is running the Scheduler)
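For example, a minimal sketch (section name and values are illustrative):
from clearml import Task

task = Task.init(project_name="services", task_name="my scheduler")  # illustrative names
params = {"schedule_hour": 3, "target_queue": "default"}
task.connect(params, name="scheduler_args")
# when a clearml-agent runs a clone of this Task, `params` is filled with
# whatever was edited in the UI, and the code below sees the new values
print(params["target_queue"])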
Does that make sense?
Should work out of the box, maybe the only thing to notice is that you will get a Task for every local_rank 0 process
does that make sense ?
That means I need to pass a single zip file to the path argument in add_files, right?
Actually the opposite: you pass a folder (of files) to add_files. add_files then remembers the files' location (and pre-calculates the hash of each file's content). When you call upload, it will actually compress the files that changed into a zip file (or several, depending on the chunk size) and upload them to the destination (as specified in the upload call...
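i.e. a minimal sketch (project/dataset names and the folder path are illustrative):
from clearml import Dataset

ds = Dataset.create(dataset_project="examples", dataset_name="images")  # illustrative names
ds.add_files(path="./my_images_folder")   # pass the folder; file hashes are pre-calculated here
ds.upload()                               # only changed files are compressed and uploaded
ds.finalize()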
Is the error consistent? Meaning, does it happen with other integer values as well?
Hi @<1533620191232004096:profile|NuttyLobster9>
First nice workaround!
Second could you send the full log? When the venv is skipped then pytorch resolving should be skipped as well, and no error should be raised...
And lastly, could you also send the log of the task that executed correctly (the one you cloned)? Because you are correct, it should have been the same
RipeGoose2 yes that will work 🙂
That said, we should probably fix the S3 credentials popup 😉
this is very odd, can you post the log?
Hi RipeGoose2
Can you try with the latest from git?
pip install -U git+