DilapidatedDucks58 trains-agent adds the artifactory URL as --extra-index-url. Are you sure you are getting the correct torch version in the container? The torch html is not an artifactory html, it is a list of links, and I just want to make sure you get the correct version, because otherwise it can default to the CPU version, which we don't want 🙂 Anyhow, you can use the direct link in the "installed packages" and just put there https://download.pytorch.org/whl/nightly/cu101...
Hi PompousBeetle71, I'm with SteadyFox10 on this one. Unless you choose a file name based on epoch or step, you are literally overwriting the model file, which Trains will reflect. If you use the epoch in the filename you will end up with all your models logged by Trains. BTW we are actively working on integration with PyTorch Ignite, so if you have any suggestions, now is the time :)
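For the checkpoint naming, something along these lines should do it (a minimal PyTorch sketch; project/task names and filenames are placeholders):
```python
import torch
from trains import Task

task = Task.init(project_name='examples', task_name='checkpoint per epoch')
model = torch.nn.Linear(10, 2)

for epoch in range(3):
    # ... training step goes here ...
    # unique filename per epoch -> Trains logs every snapshot as a separate model
    torch.save(model.state_dict(), 'model_epoch_{}.pt'.format(epoch))

# a fixed filename would keep overwriting the same model entry:
# torch.save(model.state_dict(), 'model.pt')
```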
Hi EnviousStarfish54
Artifacts are stored per experiment, that means that storage wise every experiment uploading an artifact (even if it is the same file content as previous execution) will create a new file on the central storage (default being the trains-server)
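For reference, this is roughly what uploading an artifact from code looks like (a minimal sketch; names and the file path are placeholders):
```python
from trains import Task

task = Task.init(project_name='examples', task_name='artifact upload')
# every run that uploads this file stores its own copy on the configured
# storage (trains-server by default), even if the content is identical
task.upload_artifact(name='dataset', artifact_object='data/train.csv')
```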
As for the preferred way to share data / artifacts: where is your trains server hosted? Is it local? Cloud? How do you access it from home, VPN?
AstonishingSeaturtle47 I think there's a workaround for the GitHub multiple repo issue. See https://gist.github.com/gubatron/d96594d982c5043be6d4
Are you hosting your own server? Is it on http://app.clear.ml ?
PompousParrot44 please try to reply on the thread, so we do not create a mess in the main channel 🙂
What's the "working directory" in the execution section? Do you have package "test" in the installed packages?
PompousBeetle71 just making sure, and changing the name solved it?
Hi JitteryCoyote63
What do you have in agent.cuda_version? (You can see it printed at the beginning of the log.)
Is mark_completed used to complete a task from a different process, and close from the same process - is that the idea?
Yes
However, when I tried them out, mark_completed terminated the process that called mark_completed.
Yes, if you are changing the state of the Task, externally or internally, the SDK will kill the process. If you are calling task.close() from the process that created the Task it will gra...
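Roughly, the intended usage looks like this (a minimal sketch; the task ID is a placeholder):
```python
from trains import Task

# same process that created the Task: close() flushes and detaches
# without terminating the calling process
task = Task.init(project_name='examples', task_name='lifecycle example')
task.close()

# different process: fetch the Task and change its state explicitly.
# Calling this on the currently running Task will terminate that process.
other_task = Task.get_task(task_id='aaabb111')
other_task.mark_completed()
```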
Hmm, I still wonder what the "correct" answer is for most people - is an empty string in argparse redundant anyway? Will anyone ever use it?
CooperativeFox72 we are aware of Pool throwing an exception that causes things to hang. The fix will be deployed in 0.16 (due to be released tomorrow).
Do you have a code to reproduce it, so I can verify the fix solves the issue?
SlipperyDove40 Yes, there is: TRAINS_CONFIG_FILE
https://allegro.ai/docs/faq/faq/#trains-configuration
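For example, pointing the SDK at an alternate configuration file before the Task is created (a minimal sketch; the path is a placeholder):
```python
import os

# must be set before trains loads its configuration (i.e. before Task.init)
os.environ['TRAINS_CONFIG_FILE'] = '/path/to/alternate/trains.conf'

from trains import Task
task = Task.init(project_name='examples', task_name='custom config')
```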
@<1523701079223570432:profile|ReassuredOwl55> did you try adding it manually?
./path/to/package
You can also do that from code:
Task.add_requirements("./path/to/package")
# notice you need to call Task.add_requirements before Task.init
task = Task.init(...)
So now for it to take place you need to enqueue the Task and set an agent to pick it up and run it.
When the agent is running the Task the new parameter will be passed.
Does that make sense?
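If it is easier, you can also enqueue it from code instead of the UI (a sketch; the task ID and queue name are placeholders):
```python
from trains import Task

# an agent listening on this queue will pick up the task and run it
# with the updated parameters
Task.enqueue(task='aaabb111', queue_name='default')
```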
I mean, the Python package, not the trains-server version.
SuccessfulKoala55 please post here once the code is available in your pytorch_ignite 🙂
I simplified the code just so I could test it. This one seems to work - feel free to add the missing argparser parts :)
` from argparse import ArgumentParser
from trains import Task
model_snapshots_path = 'mnt/trains'
task = Task.init(project_name='examples', task_name='test argparser', output_uri=model_snapshots_path)
logger = task.get_logger()
def main(args):
print('Got args: %s' % args)
if __name__ == '__main__':
parent_parser = ArgumentParser(add_help=False)
parent_parser....
Hi @<1539055479878062080:profile|FranticLobster21>
hey, how do I use local files as dependencies?
You mean like a repository ?
Can I specify in task what local files do I use that should be packaged?
In a git repo?
Basically the agent can do two things, either replicate a single script or clone a git repo + uncommitted changes
AstonishingSeaturtle47 How would the code run without the sub-modules? And what is the problem we are trying to solve? (Because unfortunately there is no switch to disable it)
Hi @<1540142641931358208:profile|FancyBaldeagle86>
You mean in the UI? i.e. clone an experiment, hover over the Configuration / Hyperparameters section, and click edit?
PompousParrot44 That should be very easy to do, basically a service mode code that clones a base task and puts it into a queue:
This should more or less do what you need :)
` from trains import Task
task = Task.init('devops', 'daily train', task_type='controller')
# stop the local execution of this code, and put it into the services queue, so we have a remote machine running it.
task.execute_remotely('services')
while True:
a_task = Task.clone(source_task='aaabb111')
Task.enqueu...
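Filled in, the loop would look roughly like this (a sketch under the same assumptions; the base task ID, target queue and daily interval are placeholders):
```python
import time
from trains import Task

task = Task.init('devops', 'daily train', task_type='controller')
# stop the local execution and run this controller from the 'services' queue
task.execute_remotely('services')

while True:
    a_task = Task.clone(source_task='aaabb111')
    Task.enqueue(a_task, queue_name='default')
    time.sleep(60 * 60 * 24)  # once a day
```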
I think that just backing up /opt/clearml and moving it should be just fine 🤞
Hi @<1657918724084076544:profile|EnergeticCow77>
Can I launch training with the Hugging Face accelerate package using multi-gpu?
Yes,
It detects torch distributed but I guess I need to setup main task?
It should 🤞
Under the Execution tab, in the script path, you should see something like -m torch.distributed.launch ...
Ohh okay, something seems to half-work in terms of configuration: the agent has enough configuration to register itself, but fails to pass it to the task.
Can you test with the latest agent RC: 0.17.2rc4?