OpenMP handling asynchronous platform

March 16, 2017, at 01:52 AM

I am working on optimizing a library using OpenMP. I benchmark the library on two different platforms:

  • My x86 workstation, which has a 4-core Intel Core i5-6500 CPU @ 3.20GHz
  • A rooted Honor 5c phone, which has 4 ARM Cortex-A53 cores @ 2.0GHz and 4 ARM Cortex-A53 cores @ 1.7GHz

To run code on the phone, I cross-compile everything on my workstation and drive the benchmarks with scripts that use adb. However, I have had trouble getting the optimization I wanted, i.e. close to the theoretical 8x speedup on the phone. The explanation seems to lie in the CPU usage during a simple matrix multiplication. I have this basic code that helps me measure the usage:

#include <cstdio>
#include <cstdlib>
// For custom types (int32, float32)
#include "smu/core.h"
int main(void) {
  long double cpua[4], cpub[4], loadavg;
  FILE *fp;
  char dump[50];
  // Setting matrices
  int32 nr = 500;
  int32 nc = 500;
  float32 *a = (float32*)malloc(nr * nc * sizeof(float32));
  float32 *b = (float32*)malloc(nr * nc * sizeof(float32));
  float32 *c = (float32*)malloc(nr * nc * sizeof(float32));
  for (int32 i = 0; i < nr; ++i) {
    float32 *adata = a + i * nc;
    float32 *bdata = b + i * nc;
    int32 cache_nc = nc;
    for (int32 j = 0; j < cache_nc; ++j) {
      adata[j] = (float32)rand() / (float32)RAND_MAX * 100.;
      bdata[j] = (float32)rand() / (float32)RAND_MAX * 100. - 50.;
    }
  }
  for(;;) {
    fp = fopen("/proc/stat", "r");
    fscanf(fp,"%*s %Lf %Lf %Lf %Lf", &cpua[0], &cpua[1], &cpua[2], &cpua[3]);
    fclose(fp);
    for (int32 i = 0; i < nr ; ++i) {
      int32 cache_nc = nc;
      float32 *adata = a + i * cache_nc;
      float32 *cdata = c + i * cache_nc;
      for (int32 j = 0; j < cache_nc; ++j) {
        cdata[j] = 0.;
        for (int32 k = 0; k < cache_nc; ++k)
          cdata[j] += adata[k] * b[k * cache_nc + j];
      }
    }
    fp = fopen("/proc/stat", "r");
    fscanf(fp,"%*s %Lf %Lf %Lf %Lf", &cpub[0], &cpub[1], &cpub[2], &cpub[3]);
    fclose(fp);
    loadavg = ((cpub[0] + cpub[1] + cpub[2]) - (cpua[0] + cpua[1] + cpua[2])) /
        ((cpub[0] + cpub[1] + cpub[2] + cpub[3]) - (cpua[0] + cpua[1] + cpua[2] + cpua[3]));
    printf("CPU usage             : %Lf\n", loadavg);
    fp = fopen("/proc/stat", "r");
    fscanf(fp,"%*s %Lf %Lf %Lf %Lf", &cpua[0], &cpua[1], &cpua[2], &cpua[3]);
    fclose(fp);
    #pragma omp parallel for num_threads(8) schedule(dynamic, 1)
    for (int32 i = 0; i < nr ; ++i) {
      int32 cache_nc = nc;
      float32 *adata = a + i * cache_nc;
      float32 *cdata = c + i * cache_nc;
      for (int32 j = 0; j < cache_nc; ++j) {
        cdata[j] = 0.;
        for (int32 k = 0; k < cache_nc; ++k)
          cdata[j] += adata[k] * b[k * cache_nc + j];
      }
    }
    fp = fopen("/proc/stat", "r");
    fscanf(fp,"%*s %Lf %Lf %Lf %Lf", &cpub[0], &cpub[1], &cpub[2], &cpub[3]);
    fclose(fp);
    loadavg = ((cpub[0] + cpub[1] + cpub[2]) - (cpua[0] + cpua[1] + cpua[2])) /
        ((cpub[0] + cpub[1] + cpub[2] + cpub[3]) - (cpua[0] + cpua[1] + cpua[2] + cpua[3]));
    printf("CPU usage with OpenMP : %Lf\n", loadavg);
  }
  // Not reached: the loop above runs until the process is killed.
  free(a);
  free(b);
  free(c);
  return(0);
}

On my x86 workstation, the results are as expected:

CPU usage             : 0.267606
CPU usage with OpenMP : 1.000000
CPU usage             : 0.271429
CPU usage with OpenMP : 1.000000

On the phone, however, it seems that it cannot use all the cores at once:

CPU usage             : 0.143388
CPU usage with OpenMP : 0.495968
CPU usage             : 0.129955
CPU usage with OpenMP : 0.496626

That is strange, because the usage without OpenMP (about 0.13, roughly 1/8) suggests that only 1 of the 8 cores is busy, so I would expect the OpenMP run to get close to 1.0. I checked the OpenMP platform info, and it correctly sees 8 cores on the Honor 5c.
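The measurement above only uses the aggregate first line of /proc/stat, so it cannot tell which cores are busy. A per-core breakdown can be computed from the cpu0, cpu1, ... lines of the same file. The following is a minimal sketch, not the code that produced the numbers above: CoreTimes, parse_cores, usage, read_file and print_core_usage are hypothetical names, and it assumes the first four fields of each line are user, nice, system and idle, as in the fscanf calls above. Cores that are offline may not be listed at all, which is why the snapshots are keyed by core index.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

// Jiffy counters taken from one "cpuN" line of /proc/stat.
struct CoreTimes {
  unsigned long long busy;   // user + nice + system
  unsigned long long total;  // busy + idle
};

// Parses the per-core lines ("cpu0 ...", "cpu1 ...") of /proc/stat-style
// text, keyed by core index. The aggregate "cpu ..." line is skipped.
std::map<unsigned, CoreTimes> parse_cores(const std::string &text) {
  std::map<unsigned, CoreTimes> cores;
  std::istringstream in(text);
  std::string line;
  while (std::getline(in, line)) {
    // Keep only lines that start with "cpu" immediately followed by a digit.
    if (line.size() < 4 || line.compare(0, 3, "cpu") != 0 ||
        !std::isdigit(static_cast<unsigned char>(line[3])))
      continue;
    unsigned index;
    unsigned long long user, nice, sys, idle;
    if (std::sscanf(line.c_str(), "cpu%u %llu %llu %llu %llu",
                    &index, &user, &nice, &sys, &idle) == 5) {
      const unsigned long long busy = user + nice + sys;
      cores[index] = CoreTimes{busy, busy + idle};
    }
  }
  return cores;
}

// Fraction of time one core was busy between two snapshots (0.0 to 1.0).
double usage(const CoreTimes &before, const CoreTimes &after) {
  assert(after.total >= before.total && after.busy >= before.busy);
  const unsigned long long elapsed = after.total - before.total;
  if (elapsed == 0) return 0.0;
  return static_cast<double>(after.busy - before.busy) /
         static_cast<double>(elapsed);
}

// Reads a whole file into a string (empty if it cannot be opened).
std::string read_file(const char *path) {
  std::ifstream file(path);
  std::stringstream buffer;
  buffer << file.rdbuf();
  return buffer.str();
}

// Prints one line per core that is present in both snapshots.
void print_core_usage(const std::map<unsigned, CoreTimes> &before,
                      const std::map<unsigned, CoreTimes> &after) {
  for (const auto &entry : after) {
    const auto it = before.find(entry.first);
    if (it == before.end()) continue;  // core not listed in the first snapshot
    std::printf("cpu%u usage: %f\n", entry.first,
                usage(it->second, entry.second));
  }
}
```

To use it, take one snapshot with parse_cores(read_file("/proc/stat")) before the multiplication and one after, then pass both to print_core_usage. If only four of the eight cores ever show non-zero usage, that would point at the OS rather than at OpenMP.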

My questions are:

  • How does OpenMP handle this kind of asynchronous platform?
  • Is there any way to get close to 100% core usage anytime, anywhere?
  • How does OpenMP handle virtualized cores? (Not directly related to this topic, but I would still be interested in an answer.)

EDIT:

I tried to see directly how the OS handles the cores by running this simple script:

#!/system/bin/sh
i=0
while : ; do
  i=$(($i + 1))
done

Even with 8 of these running at once, CPU usage tops out at 50%.

I read this article explaining that there could be several OSes on a phone, with only one of them usable at a time. In my case that would be one per group of 4 cores. But then I don't understand why OpenMP would see 8 cores...
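One thing that could be checked here is whether the number of cores the system knows about differs from the number it currently keeps online. The following is a minimal sketch, assuming a Linux or Android C library that accepts the _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN names for sysconf. CoreCounts, query_core_counts and print_core_counts are hypothetical names, and the sketch makes no claim about which of the two values an OpenMP runtime reports.

```cpp
#include <unistd.h>

#include <cstdio>

// Core counts as reported by the C library.
struct CoreCounts {
  long configured;  // cores the system is configured with
  long online;      // cores currently online, i.e. available for scheduling
};

// Queries both counts; sysconf returns -1 for a name it does not support.
CoreCounts query_core_counts() {
  return CoreCounts{sysconf(_SC_NPROCESSORS_CONF),
                    sysconf(_SC_NPROCESSORS_ONLN)};
}

// Prints both counts so they can be compared on the workstation and the phone.
void print_core_counts() {
  const CoreCounts counts = query_core_counts();
  std::printf("configured cores: %ld\n", counts.configured);
  std::printf("online cores    : %ld\n", counts.online);
}
```

Calling print_core_counts() while the benchmark is running, rather than once at startup, would show whether the online count changes under load.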
