I Use

Answered

I use ignite and its ModelCheckpoint handler so that I can resume training after an unexpected failure (a power outage shut down our cluster last week).
At first I thought the hook on torch.save uploaded the file to the trains ecosystem, but after inspection I found that it does not. The only method I could find was to upload the file manually with task.upload_artifact(f"name_{i}", f"my_file_{i}.pth"). Is there an automated way to do that?
As it stands, this requires a custom DiskSaver that calls task.upload_artifact from ignite's callback system: either a mostly copy-pasted version with only the task.upload_artifact call added, or an almost empty class created through inheritance (which is my solution for now).
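The idea of saving and then uploading can be sketched without copying DiskSaver at all. The following is only an illustration, and it uses composition rather than the inheritance described above. UploadingSaver is a hypothetical helper, not part of ignite or trains. It assumes the checkpoint handler accepts any callable taking (checkpoint, filename) as its save handler, and that the upload function takes (name, path) like task.upload_artifact.

```python
import os
from typing import Any, Callable, Mapping


class UploadingSaver:
    """Save a checkpoint with an existing handler, then pass the file to an uploader.

    Hypothetical helper for illustration; not part of ignite or trains.
    """

    def __init__(self, saver: Callable[..., Any], dirname: str,
                 upload: Callable[[str, str], Any]) -> None:
        self.saver = saver      # assumption: e.g. an ignite DiskSaver writing into dirname
        self.dirname = dirname  # directory the wrapped saver writes into
        self.upload = upload    # assumption: e.g. task.upload_artifact

    def __call__(self, checkpoint: Mapping, filename: str,
                 *args: Any, **kwargs: Any) -> None:
        # Let the wrapped handler write the file first.
        self.saver(checkpoint, filename, *args, **kwargs)
        path = os.path.join(self.dirname, filename)
        # Use the file name without its extension as the artifact name.
        name = os.path.splitext(filename)[0]
        self.upload(name, path)

    def remove(self, filename: str) -> None:
        # Delegate cleanup of old checkpoints if the wrapped handler supports it.
        remover = getattr(self.saver, "remove", None)
        if remover is not None:
            remover(filename)
```

In a real script this might be built as UploadingSaver(DiskSaver(dirname), dirname, task.upload_artifact) and given to the checkpoint handler as its save handler. Whether your ignite version accepts a plain callable there is something to verify.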

  				
Posted 4 years ago by SteadyFox10

Answers 6

Hi SteadyFox10
I promised to mention here once we started working on the ignite integration. You can check it here:
https://github.com/jkhenning/ignite/tree/trains-integration
Feel free to provide insights / requests 🙂

As for the model upload: by default, torch.save() calls are only logged, nothing more. But if you pass the output_uri argument to Task.init, all your models will be uploaded automatically. For example:
task = Task.init('examples', 'model upload test', output_uri='s3://bucket/stuff/')
This will cause any model stored locally to be uploaded (in the background) to sub-folders (project/experiment) under bucket/stuff on your S3 account.

The really cool thing is that even if your code does not include the output_uri argument, and you are running your experiment with trains-agent, you can go to the "Execution" tab in the Web UI and look for the "Output : Destination" field. Anything you write there is treated as if you had passed it to Task.init as output_uri. So all you have to do is write there, for example, "http://trains-server-ip:8081/" and it will upload all the models to the trains-server (you can also write s3://, gs:// or azure:// destinations). Then, in the models listed under artifacts, you'll see the link to the model itself. Also note that you can retrieve the models when you continue training from the previous model weights, all from code. Here is an example: https://github.com/allegroai/trains/issues/50#issuecomment-607541112

  				
Posted 4 years ago by AgitatedDove14

Hi SteadyFox10

"I'll use your version instead and put any comment if I find something."

Feel free to join the discussion 🙂 https://github.com/pytorch/ignite/issues/892

"Thanks for the output_uri, can I put it in the ~/trains.conf file?"

Sure you can 🙂
https://github.com/allegroai/trains/blob/master/docs/trains.conf#L152
You can add it to the conf file on the trains-agent machine, on your development machine, or both. Note that once you run an experiment with "default_output_uri" set (or with output_uri in Task.init), the Web UI will show the value used in "Output : Destination", so you have better visibility.

  				
Posted 4 years ago by AgitatedDove14

Wow, that's really nice. I wrote almost the same code: TrainsLogger, TrainsSaver, and all the OutputHandler classes. I'll use your version instead and comment if I find anything.

  				
Posted 4 years ago by SteadyFox10

Thanks for the output_uri. Can I put it in the ~/trains.conf file?

  				
Posted 4 years ago by SteadyFox10

Pretty good job. I'll try to use it as soon as possible.

  				
Posted 4 years ago by SteadyFox10

Hi SteadyFox10, the TrainsLogger PR was just merged (see https://github.com/pytorch/ignite/pull/1020).

  				
Posted 4 years ago by SuccessfulKoala55
