Object detection models are typically trained in two stages:
- First, train only the backbone on a classification task so that it learns to produce reasonable features.
- Then, train the whole model on the object detection task.
This is because the detection head cannot learn properly if the backbone is not providing useful embeddings of the input image: an untrained backbone feeds essentially random features to the head, which pushes the head's training in the wrong direction.
Backbone pre-training is usually done on the ImageNet classification dataset.
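As a rough illustration, here is a minimal PyTorch/torchvision sketch of the two stages, assuming a ResNet-50 backbone and a Faster R-CNN head (the class count, learning rate, and loader name are placeholders, not a definitive recipe):

```python
import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Stage 1: a backbone trained on ImageNet classification. Here we simply reuse
# torchvision's pretrained weights as the result of that stage instead of
# running the classification training ourselves.
backbone = resnet_fpn_backbone(
    backbone_name="resnet50",
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1,
)

# Stage 2: attach a detection head and train the whole model on detection.
model = FasterRCNN(backbone, num_classes=91)  # 91 classes as in COCO, just an example
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

model.train()
# `detection_loader` is a hypothetical DataLoader yielding (images, targets),
# where images is a list of 3xHxW tensors and targets a list of dicts with
# "boxes" and "labels".
# for images, targets in detection_loader:
#     losses = model(images, targets)   # detection losses from the RPN + head
#     loss = sum(losses.values())
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
```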
Looking at the ImageNet Dataset
- Each image has only one annotation (a single object class).
- Images come in varying widths and heights.
- Annotations also include a bounding box (bndbox), as in the example below.
<annotation>
  <folder>n01440764</folder>
  <filename>n01440764_96</filename>
  <source>
    <database>ILSVRC_2012</database>
  </source>
  <size>
    <width>500</width>
    <height>375</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>n01440764</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>34</xmin>
      <ymin>128</ymin>
      <xmax>430</xmax>
      <ymax>305</ymax>
    </bndbox>
  </object>
</annotation>
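For reference, a small parsing sketch using the standard-library `xml.etree.ElementTree` that pulls these fields out of one annotation file (the function name is just illustrative):

```python
import xml.etree.ElementTree as ET

def parse_imagenet_annotation(xml_path):
    """Extract the class label, image size, and bounding boxes from one
    ImageNet (ILSVRC 2012) annotation file like the example above."""
    root = ET.parse(xml_path).getroot()

    size = root.find("size")
    width = int(size.find("width").text)
    height = int(size.find("height").text)

    objects = []
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        objects.append({
            "name": obj.find("name").text,  # WordNet ID, e.g. n01440764
            "bbox": (int(bbox.find("xmin").text),
                     int(bbox.find("ymin").text),
                     int(bbox.find("xmax").text),
                     int(bbox.find("ymax").text)),
        })

    return {"filename": root.find("filename").text,
            "width": width, "height": height, "objects": objects}

# Example: parse_imagenet_annotation("n01440764_96.xml")
# -> {'filename': 'n01440764_96', 'width': 500, 'height': 375,
#     'objects': [{'name': 'n01440764', 'bbox': (34, 128, 430, 305)}]}
```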