@<1570220844972511232:profile|ObnoxiousBluewhale25> it creates a new Model here
If you want it to log to something other than the default file server create the clearml Task before starting the training:
task = Task.init(..., output_uri="file:///home/karol/data/")
# now training
It will use the existing Task and upload to the destination folder
OddAlligator72 okay, that is possible, how would you specify the main python script entry point? (wouldn't that make more sense rather than a function call?)
How do you determine which packages to require now?
Analysis of the actual repository (i.e. it will actually look for imports 🙂 ) this way you get the exact versions you have, but not the clutter of the entire virtual environment
The driver script (the one that initializes the models and starts the training sequence) was not in a git repo; besides that one, everything is.
Yes, there is an issue when you have both a git repo and a totally uncommitted file: since clearml can store either a standalone script or a git repository, the mix of the two is not actually supported. Does that make sense?
Hi @<1526371965655322624:profile|NuttyCamel41>
I do that because I do not know how to get the pickle file into the docker container
What would the pickle file do?
and load the MinMaxScaler within the script, as the sklearn dependency is missing
what do you mean by that? are you getting an error when loading your model ?
I see the problem now: conda is failing to install the package from the git, then it reverts to pip install, and pip just fails... " //github.com/ajliu/pytorch_baselines "
Ohh yes, if the execution script is not in git but a git repo exists, it will not add it (it will add it if it is a tracked file, via the uncommitted changes section)
ZanyPig66 in order to expand the support to your case, can you explain exactly which files are in git and which are not?
Hi TightElk12
would like to understand the limitations of
Task.current_task()
Basically this will always get you an instance of the current Task. This will work from sub-processes as well as the main process. Is there a specific scenario you have in mind, or a challenge with the use case ?
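A minimal sketch of the typical usage (project/task names are hypothetical):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="current-task-demo")  # hypothetical names

def report_from_anywhere():
    # works from the main process as well as sub-processes spawned after Task.init
    current = Task.current_task()
    current.get_logger().report_text("reporting from a helper function")

report_from_anywhere()
```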
Notice that the StorageManager has default configuration here:
https://github.com/allegroai/trains/blob/f27aed767cb3aa3ea83d8f273e48460dd79a90df/docs/trains.conf#L76
Then a per-bucket credentials list, with details:
https://github.com/allegroai/trains/blob/f27aed767cb3aa3ea83d8f273e48460dd79a90df/docs/trains.conf#L81
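For reference, a sketch of that per-bucket section in trains.conf / clearml.conf (bucket name and keys are placeholders):
```
sdk {
    aws {
        s3 {
            # default credentials
            key: ""
            secret: ""
            region: ""

            credentials: [
                {
                    # per-bucket credentials override the defaults above
                    bucket: "my-bucket"
                    key: "<access_key>"
                    secret: "<secret_key>"
                }
            ]
        }
    }
}
```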
UnevenDolphin73 if you have the time to help fix / make it work it will be greatly appreciated 🙂
the services queue (where the scaler runs) will be automatically exposed to new EC2 instance?
Yes, using this extra_clearml_conf
parameter you can add configuration that will be passed to the clearml.conf
of the instances it will spin.
Now an example of the values you want to add:
agent.extra_docker_arguments: ["-e", "ENV=value"]
https://github.com/allegroai/clearml-agent/blob/a5a797ec5e5e3e90b115213c0411a516cab60e83/docs/clearml.conf#L149
wdyt?
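For reference, a sketch of what could be pasted into the extra_clearml_conf field of the autoscaler; the output_uri line is just a hypothetical extra override:
```
# appended to the clearml.conf of every EC2 instance the autoscaler spins up
agent.extra_docker_arguments: ["-e", "ENV=value"]
# any other clearml.conf override can go here as well, for example:
sdk.development.default_output_uri: "s3://my-bucket/artifacts"
```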
it looks like nvidia is going to come up with a UI for TAO too
Interesting, any reference we could look at ?
In our case, we have a custom YAML instruction
!include
, i.e.
Hmm interesting, in theory this might work since configuration encoding (when passing dicts) is handled with HOCON, which does support referencing.
That said currently it is not aware of "remote configurations" only ENV variables and local files.
It would be cool to add, do we have a github issue on that? (would you like to see if you can PR such a thing?)
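For context, a minimal sketch of how a configuration dict or a local file is connected today (project/task names and the file path are hypothetical):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="config-demo")  # hypothetical names

# a plain dict is encoded as HOCON when stored on the task
params = task.connect_configuration(
    configuration={"batch_size": 32, "data_root": "/data"},
    name="my_config",
)

# a local YAML/HOCON file can also be connected as-is
config_path = task.connect_configuration("config.yaml", name="yaml_config")
```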
RoughTiger69 I think this could work, a pseudo example:
```python
from time import sleep
from clearml import Task
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(...)
def the_last_step_before_external_stuff():
    print("doing some stuff")

@PipelineDecorator.pipeline(...)
def logic():
    the_last_step_before_external_stuff()
    # check_if_data_was_ingested_to_the_system is a placeholder for your own check
    if not check_if_data_was_ingested_to_the_system:
        print("aborting ourselves")
        Task.current_task().abort()
        # we will not get here, the agent will make sure we are stopped
        sleep(60)
        # better safe than sorry
        exit(0)
```
wdyt? (the...
Apparently the error comes when I try to access from
get_model_and_features
the pipeline component
load_model
. If it is not set as a pipeline component but only as a helper function it works (provided it is declared before the component that calls it; I already understood that and fixed it, different from the code I sent above).
ShallowGoldfish8 so now I'm a bit confused, are you saying that now it works as expected ?
Notice the order here:
Task.add_requirements("tensorflow")
task = Task.init(...)
StaleMole4 you are printing the values before Task.init had the chance to populate it.
Basically try moving the print after closing the Task (closing the task waits for the async update)
Make sense ?
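A minimal sketch of that ordering (project/task names are hypothetical):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="check-params")  # hypothetical names
# ... training / connecting parameters happens here ...

task.close()  # close() waits for the async updates to finish
print(task.get_parameters())  # populated only after the update completes
```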
Do we support GPUs in a) docker mode b) k8s glue?
yes on both
Is there a good reference to get started with k8s glue?
A few folks here already set it up, do you have a k8s cluster with GPU support ?
owning the agent helps, but still it's much better if the credentials don't show up in logs,
They are not, they are always filtered out,
- how does
force_git_ssh_protocol
help please? it doesn't solve the issue of the agent simply not having access
It automatically maps the host .ssh into the container, so that git can use SSH to clone.
What exactly is not working?
and how are you configuring it?
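For reference, a sketch of the relevant clearml.conf setting on the agent machine (assuming the host has working SSH keys in ~/.ssh):
```
agent {
    # convert http(s) git URLs to SSH so the mounted ~/.ssh keys are used for cloning
    force_git_ssh_protocol: true
}
```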
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit from
JitteryCoyote63 What do you mean by that?
Hmmm, make sure the task doing the cloning is using 0.16.1 and above, because with 0.16 we added sections and compatibility is per version. Meaning if you have tasks generated with trains 0.16 you need trains 0.16 to clone them from code (so you could properly control the arguments)
MysteriousBee56 and please this one: "when you run the trains-agent with --foreground, before it starts the docker it prints the full command line"
🙂 DilapidatedDucks58 how exactly are you "relaunching/continue" the execution? And what exactly are you setting?
Is there an option to do this from a pipeline, from within the
add_step
method? Can you link a reference to cloning and editing a task programmatically?
Hmm, I think there is an open GitHub issue requesting a similar ability , let me check on the progress ...
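In the meantime, a minimal sketch of cloning and editing a task from code (the task ID, parameter name, and queue name are hypothetical):
```python
from clearml import Task

# clone an existing template task
template = Task.get_task(task_id="<template_task_id>")  # hypothetical ID
cloned = Task.clone(source_task=template, name="cloned task")

# edit a hyperparameter in the Args section before enqueueing
cloned.set_parameter("Args/batch_size", 64)

# send it to an agent queue
Task.enqueue(cloned, queue_name="default")
```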
nope, it works well for the pipeline when I don't choose to continue_pipeline
Could you send the full log please?
actually no it is not, alpine is not a good baseline, it is very very slim, missing a ton of stuff.
I would use bullseye or slim (depending on how many aux things you need in the container)
https://hub.docker.com/_/python/tags?page=1&name=bullseye
https://hub.docker.com/_/python/tags?page=1&name=slim-bullseye
Okay I found it, this is due to the fact the newer versions are sending the events/images in a subprocess (it used to be a thread).
The creation of the object is done on the main process, updating the file index (round robin manner), but the check itself happens on the subprocess, which is not "aware" of the used indexes (i.e. it is always 0, hence when exceeding the history size, it skips it)
Oh I see, this seems like Triton configuration issue, usually dim -1 means flexible. I can also mention that serving 1.1 should be released later this week with better multiple input support for triton. Does that make sense?
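For illustration, a hypothetical Triton config.pbtxt input entry; a dim of -1 marks that axis as flexible (name, dtype, and shape are made up):
```
input [
  {
    name: "input_tensor"
    data_type: TYPE_FP32
    dims: [ -1, 28, 28 ]
  }
]
```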
GrievingTurkey78 can you send the entire log?
can the ClearML File server be configured to any kind of storage ? Example hdfs or even a database etc..
DeliciousBluewhale87 long story short, no 🙂 the file server will just store/retrieve/delete files from a local/mounted folder
Is there any way we can scale this file server when our data volume explodes. Maybe it wouldn't be an issue in the K8s environment anyway. Or can it also be configured such that all data is stored in the hdfs (which helps with scalability). I would su...
I am creating this user
Please explain, I think this is the culprit ...
GrievingTurkey78 I see,
Basically the arguments after the -m src.train
in the remote execution should be ignored (they are not needed).
Change the m in the Args section under the configuration. Let me know if it solved it.
Actually it is better to leave it as is, it will just automatically mount the .ssh folder into the container. I will make sure the docs point to this option first