Authors: Alexey Perminov, Tatiana Khanova
When we start using DL models in the applications it becomes evident pretty quickly that pre- and post-processing of the DL model pose a potential challenge. Everything works fine during the research stage when every part of the pipeline runs on the same machine in the same environment. Problems arise the moment we start integrating our model into the solution.
For example, the resize operation (as well as many others) may be implemented slightly differently on different platforms and in different libraries, thus leading to different results. Post-processing more complex than argmax is bound to be implemented incorrectly.
We can solve this issue by merging everything into a single computational graph to run in the OpenVINO framework. This way, all resizing, normalization and any post-processing will be consistent and run in the same engine. We just need to feed an input image as is, into the graph to get the desired results and we can even get a speed-up as a bonus.
Selecting the model for conversion
For this article, we have selected the HRNet model (proposed in Deep High-Resolution Representation Learning for Visual Recognition). This quite recent architecture has proven itself as a state of the art model for the wide range of computer vision tasks like image classification, object detection, segmentation or human pose estimation. Its application for the human pose estimation task fits quite well for our intention with pre- and post-processing merging. We will use this implementation from the authors of the original paper. You also can find more architectural details in this blog.
In modern machine learning development, we usually use some DL frameworks (TF, PyTorch, etc.) for model design and training but the inference framework may often differ (OpenVINO, TensorRT). Such a deployment process requires one or several conversion stages between different deep learning model formats and the selected model architecture may impose some restrictions on that. It happens when the model contains some unusual operations (proposed in a quite recent paper) that the conversion framework doesn’t yet support. In that case, the developers either need to redesign the model or implement some custom operation support for the selected converter or even inference backend. Usually, inference frameworks provide such possibility (with pretty cool tutorials). Luckily, the HRNet architecture is effective and simple enough so we should not face such obstacles.
Setting up the environment
Before we can proceed, we need to prepare and set up the environment. The procedure is the following:
- Create a new Python virtual environment.
- Install the OpenVINO Toolkit (version 2020.4 is used in our case).
- Download the prepared repository.
- Install Python dependencies from provided requirements.txt.
- Download the COCO 2017 Validation set (images & annotations) and place it under the ‘data’ folder as described in Readme.md.
- Download the model checkpoint (pose_hrnet_w32_256x192.pth) and place it under the ‘models/pytorch’ path.
Here you can find a Google Colab notebook with the required environment installation steps as well as reproduction steps that you can find below in this article.
Now we are ready to start.
Preparing the model for inference
First of all, we need a simple script with the PyTorch inference for the model we have. All scripts we talk about are located in the ‘tools’ directory. Let’s take a look at the pytorch_cpu_inference.py script that is based on test.py.
We use PyTorch-based dataset loader and COCO dataset binding for image loading and input pre-transformations. COCO 2017 validation set contains 5000 images in total but we run inference only on the first 100 images with human pose key points for our experiments.
Then, we load the model from the checkpoint in evaluation mode and run it for inference in a loop for all input images provided by the data loader. For the output, the HRNet pose detection model returns heatmap arrays for each key point class. We have to post-process them to get the actual coordinates of detected keypoints.
Finally, we save debug image representations to the files so we can compare the results. In the final production deployment we also usually have to use additional post-processing like object keypoint similarity-based non-maximum suppression to get rid of excess key points. We will not cover this stage now since such operation is not strictly required as isn’t supported in the inference framework just yet.
Typically, before feeding an image to the model, most computer vision task pipelines assume similar data pre-processing steps like:
- image reading from the input device (camera, disk, network);
- color schema swapping (RGB to BGR and vice versa, or RGB to YUV);
- image resizing to the input size required by the model (optional);
- channel order swapping: NxCxHxW to NxHxWxC and vice versa, where N – number of images in a batch, C – number of color channels, H and W are image height and width correspondingly;
- image data normalization which consists of subtracting the dataset-wide color channel mean value and dividing by standard deviation values, again per channel.
The proposed HRNet model is not an exception. Dataset binding reads images from the disk in BGR format, then converts them to the RGB order. Then the data loader performs image normalization with the help of torchvision’s ‘transforms’ augmentations toolset. Let’s merge the normalization step into our model. This PyTorch pipeline with merged processing is defined in the pytorch_cpu_inference_merged_processing.py script.
To merge this pre-processing normalization we need to extend the model’s graph, i.e. we have to edit the model. Fortunately, we don’t need to re-train the model after that, since pre- and post-processing operations don’t have trainable parameters. We define the new model class PoseHRNetWithProcessing inherited from PoseHighResolutionNet and override the ‘forward’ method (please see lib/models/pose_hrnet.py for the reference).
Firstly, we define the model normalization parameters:
self.mean = torch.FloatTensor([0.485, 0.456, 0.406]) self.std = torch.FloatTensor([0.229, 0.224, 0.225])
Then, at the beginning of the input data processing we insert the corresponding normalize handling:
x = x - self.mean.view(1, -1, 1, 1) x = x / self.std.view(1, -1, 1, 1)
Accordingly, we should remove this action from the inference pipeline, i.e. from data loader transformations.
As we’ve mentioned before, the model output is just unprocessed heatmaps specifying the most probable pose keypoint locations and we have to post-process them to obtain the actual coordinates. Let’s turn this stage into a part of the model’s computational graph. This is how post-processing looks like in the original pipeline it is defined in lib/core/inference.py:
def get_max_preds(batch_heatmaps): ''' get predictions from score maps heatmaps: numpy.ndarray([batch_size, num_joints, height, width]) ''' assert isinstance(batch_heatmaps, np.ndarray), \ 'batch_heatmaps should be numpy.ndarray' assert batch_heatmaps.ndim == 4, 'batch_images should be 4-ndim' batch_size = batch_heatmaps.shape num_joints = batch_heatmaps.shape width = batch_heatmaps.shape heatmaps_reshaped = batch_heatmaps.reshape((batch_size, num_joints, -1)) idx = np.argmax(heatmaps_reshaped, 2) maxvals = np.amax(heatmaps_reshaped, 2) maxvals = maxvals.reshape((batch_size, num_joints, 1)) idx = idx.reshape((batch_size, num_joints, 1)) preds = np.tile(idx, (1, 1, 2)).astype(np.float32) preds[:, :, 0] = (preds[:, :, 0]) % width preds[:, :, 1] = np.floor((preds[:, :, 1]) / width) pred_mask = np.tile(np.greater(maxvals, 0.0), (1, 1, 2)) pred_mask = pred_mask.astype(np.float32) preds *= pred_mask return preds, maxvals
Again, as for the pre-processing, we extend the original model code with similar logic, rewriting it on PyTorch (we just append it to the end of the model’s ‘forward’ method):
width = x.shape heatmaps_reshaped = torch.flatten(x, start_dim=2, end_dim=-1) maxvals, preds_0 = torch.max(heatmaps_reshaped, dim=2, keepdim=True) preds_0 = preds_0.float() preds_1 = torch.floor(preds_0 / width) preds_0 = torch.remainder(preds_0, width) preds = torch.cat((preds_0, preds_1), dim=2) * torch.gt(maxvals, 0.0) # x is returned just for the debugging purpose return x, preds
First, we flatten our heatmaps and extract values and positions of the maximal elements on them. Then we transform positions to 2-dimensional values and leave only those that correspond to positive local maximum values.
Model conversion to ONNX and Intermediate Representation
Now we have two pipelines for PyTorch inference. It is time to convert the model and run it within the OpenVINO pipeline. The process is the same as before: we convert the model to ONNX format, then we use Model Optimizer to convert it to Intermediate Representation.
The previously prepared pytorch_cpu_inference.py and pytorch_cpu_inference_merged_processing.py scripts have the key ‘–convert_onnx’ that allows converting the corresponding PyTorch model to the ONNX format:
if args.convert_onnx: x_tensor = torch.rand(1, 3, 256, 192) torch.onnx.export( model.cpu(), x_tensor.cpu(), 'model.onnx', export_params=True, operator_export_type=torch.onnx.OperatorExportTypes.ONNX, opset_version=9, verbose=False) logger.info('Model is converted to ONNX')
Next step, convert it to IR:
mo.py --input_model model.onnx
When you are converting a custom model, you may face compatibility issues related to unsupported operations. First of all, it is worth checking package versions and corresponding supported layers and operations for the used version and upgrading to the latest available one. In such cases, we may try to use another version of PyTorch, another onnx operation set version in torch.onnx.export call, or a newer OpenVINO version if it is not the latest one.
Prepare and run OpenVINO inference pipeline
In this step, we prepare two OpenVINO Inference Engine pipelines based on the prepared PyTorch inference scripts. They are openvino_cpu_inference.py and openvino_cpu_inference_merged_processing.py. In these scripts, we replace PyTorch model loading and inference calls with Inference Engine initialization with corresponding IR model and synchronous inference requests.
Now let’s run all the scripts one by one:
python tools/pytorch_cpu_inference.py --cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml --convert_onnx TEST.MODEL_FILE models/pytorch/pose_hrnet_w32_256x192.pth OUTPUT_DIR pytorch_output
Convert resulting ONNX model:
mo.py --input_model model.onnx
Run OpenVINO Inference Engine based pipeline:
python tools/openvino_cpu_inference.py --cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml TEST.MODEL_FILE ./model.xml OUTPUT_DIR openvino_output
The same steps for the model with merged processing:
python tools/pytorch_cpu_inference_merged_processing.py --cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml --convert_onnx TEST.MODEL_FILE models/pytorch/pose_hrnet_w32_256x192.pth OUTPUT_DIR pytorch_w_processing_output mo.py --input_model model_with_processing.onnx --output Conv_746,Mul_772 python tools/openvino_cpu_inference_merged_processing.py --cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml -m ./model_with_processing.xml OUTPUT_DIR openvino_w_processing_output
You can find average inference time printed at the end of output for each script. Resulting debug images with keypoints and heatmaps are located in the folders specified by OUTPUT_DIR parameter.
You can find results from our test setup (CPU: Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz) on the below chart:
Aside from the considerable speed-up, we can see that the version with merged pre- and post-processing is only slightly slower while it comes with all benefits from the single computation graph.
We have successfully converted the proposed HRNet human pose estimation network from the original PyTorch to the OpenVINO Inference Engine pipeline. Moreover, we took a look at how we can further optimize the inference pipeline with the merging of data pre- and post-processing into the inference graph. As you can see by the comparison chart above with the OpenVINO toolkit we have achieved great inference speed-up and were able to merge all required processing steps into a single computational graph without any significant performance degradation. For me, the final result and calculations look impressive and are worth trying to be reproduced with other networks during deployment to OpenVINO Inference Engine.
Contribute – If you have any ideas in ways we can improve the product, we welcome contributions to the open-sourced OpenVINO™ toolkit.
Want to learn more? Join the conversation to discuss all things Deep Learning and OpenVINO™ toolkit in Intel’s community forum.
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Testing date: December 23, 2020
Complete system configuration details: Ubuntu 18.04, Intel® Core™ i5-8265U CPU @ 1.60GHz x 8
Setup details: OpenVINO™ toolkit version 2020.4
Who did testing: Alexey Perminov, OpenCV.AI
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.