SPO600 Stage 3

The Culmination of the Course:

It has been a while since my last blog post: exam week fell in the same week that all the assignments were due, as opposed to having a separate exam week appended to the end of the semester, so I and many others were even busier than usual.

I have been able to run some tests, analyze some more code and do some more research but have unfortunately not had the time to blog it.

So as a quick refresher from the previous blog post for Stage 2, I was able to get gprof as well as the display working, but was getting odd results, as displayed here:

[Screenshot: gprof output with odd results]

As well as the gprof call-graph display:

[Screenshot: gprof call-graph display]

I discussed how the run time, cumulative seconds and self seconds columns were all 0.00 despite the program having a runtime of roughly 0m20.000s, and why gprof showed every function at 100%. I looked into why this may be the case and found a couple of possible explanations. One explanation I found online stated: "This is due to the fact that Linux uses a libc compatibility routine to fake the profile() system call for gprof. This needs the SIGPROF-based ITIMER_PROF interval timer but should generally work." The other explanation is that the actual computation is so fast that the 20-odd seconds the program takes to finish is dominated by disk read speed, an idea that was suggested by my professor and that I believe to be the case with bzip2.

I mentioned the five most frequently called functions in the program: BZ2_indexIntoF (120245 calls), BZ2_bzDecompress (25), BZ2_bzRead (25), myfeof (25) and unRLE_obuf_to_output_SMALL (25). I pasted the smaller functions into my post, and for the larger functions I simply described their makeup. In my last post I also discussed the one optimization I could find in those functions: I removed an extra nested while loop from unRLE_obuf_to_output_SMALL (which runs 25 times). Not only did I get the same benchmarking results from gprof, but upon further analysis I realized that the nested while needs to be in place to preserve data integrity during decompression. The section of the function I am referring to is this:

while (True) {
   /* try to finish existing run */
   while (True) {
      if (s->strm->avail_out == 0) return False;
      if (s->state_out_len == 0) break;
      *( (UChar*)(s->strm->next_out) ) = s->state_out_ch;
      BZ_UPDATE_CRC ( s->calculatedBlockCRC, s->state_out_ch );
      s->state_out_len--; s->strm->next_out++; s->strm->avail_out--;
      s->strm->total_out_lo32++;
      if (s->strm->total_out_lo32 == 0) s->strm->total_out_hi32++;
   }
   /* ... the rest of the outer loop, which decodes the next run, is omitted ... */

I have come to the conclusion that this program cannot be further optimized with changes to a few lines here and there; you would need to make substantial changes to the core compression algorithm. I thought I might be able to optimize bzip2 because of what the Wikipedia page states: "bzip2 compresses most files more effectively than the older LZW (.Z) and Deflate (.zip and .gz) compression algorithms, but is considerably slower. LZMA is generally more space-efficient than bzip2 at the expense of even slower compression speed, while having much faster decompression." I assumed this meant I would be able to further optimize the program, but upon further investigation, I would have to change the entire compression algorithm.

I discovered an in-depth article that supports my point: https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/. "Gzip vs Bzip2 vs XZ Performance Comparison" does just that: it benchmarks and compares the compression rates and runtimes of the three compression programs. The compression speed of the three programs is as shown below:

[Chart: compression speed of gzip, bzip2 and xz]

It seems that xz is the better choice; however, that chart shows the compression performance at the higher levels, not the time it takes to compress, which is displayed here:

[Chart: compression time of gzip, bzip2 and xz]

While at compression levels 3 and above xz compresses more effectively, it still takes more time to actually compress; the opposite holds for gzip, whose compression is weaker but which takes less time to run than bzip2. If we now delve into decompression we see something similar:

[Chart: decompression speed of gzip, bzip2 and xz]

The decompression time for bzip2 seems to be the worst of the three, while gzip's decompression performs the best of the three:

[Chart: decompression time of gzip, bzip2 and xz]

The article's conclusion from its benchmarking and testing is that bzip2 is a good middle ground for compression: gzip is only a little faster, while xz may not be worth it at its higher default compression level of 6, as it takes much longer to complete for a little extra gain. There are certainly faster compression programs than bzip2, but to optimize bzip2 enough to make a notable difference I would have to edit the well-documented Burrows-Wheeler algorithm. A program like gzip, which does perform notably faster, uses DEFLATE, a completely different compression algorithm built on LZ77 and Huffman coding.

Further research reveals that bzip2 is implemented by many programs, such as: 7-Zip, micro-bzip2, PBZIP2, bzip2smp, smpbzip2, pyflate, bz2, Arnaud Bouchez's BZip, lbzip2, mpibzip2, Apache Commons Compress, jbzip2, DotNetZip, and DotNetCompression (a streaming implementation of bzip2 in managed C# that conforms to the API of System.IO.Compression and includes assemblies for .NET Framework, .NET Compact Framework, Xamarin.iOS, Xamarin.Android, Xamarin.Mac, Windows Phone, Xbox 360, Silverlight, Mono, as well as a Portable Class Library). It is safe to say that it is fairly popular, possibly due to its reliable compression, or the fact that it has a BSD-style license and is patent free.

I researched how to get changes upstream for bzip2, and found that the only way is to contact Julian Seward at jseward@bzip.org (from the GitHub documentation) or julian@bzip.org (from the bzip2 website), who owns the copyright to bzip2 as of September 2010. I have emailed him regarding my benchmarking, research, and results, in hopes of opening lines of communication; hopefully he can point me toward a section of the code that is in need of optimization that I could not find, or perhaps a section that could use more substantial reworking. Seeing that I have now finished the semester, it would be good practice to keep coding, and getting involved in optimizing an open source project would be great for that.

Reflection:

The project process was a simple three stages; however, I would have liked Stage 1 to be longer, to allow us more time to research and find a project not only in need of optimization, but also to properly benchmark the software. I believe falling behind on Stage 1 benchmarking led me to many mistakes later, since I had to somewhat rush to find an open source software package to optimize. That being said, I did spend a lot of time trying to properly benchmark the software using gprof, as well as the display for gprof, and while I was able to get it to work, I was still unsure of many things as I used gprof with two pieces of software: groonga as well as bzip2.

I did feel this project was a great learning experience, though: it got us to delve into patching and developing in open source, finding software, functions and code whose optimization had been overlooked by other people, and benchmarking software to see where there are deficiencies and areas to improve. Getting students to implement optimizations in widely used code is a great learning tool, as in industry we will have to keep these optimization methods in mind to ensure the efficiency of our code. I am not happy that I fell a bit behind with the project and was not able to find areas to optimize as I would have liked, to noticeably optimize the software and get a change upstream, but I still feel it was a good learning experience overall.

-Adam C

 

More Updates regarding Stage 2

In my previous post I noted that the gprof display showed every function running at 100.00%. I forgot to also include the gprof flat profile to help contextualize things further (it is also a bit easier to read the call counts there).

[Screenshot: gprof flat profile]

Now as I stated before, it is hard to see how long everything runs when it is all listed at 0. I did run bzip2 correctly, but I have a feeling bzip2 runs so quickly that the time I reported in the previous post reflects the time it takes to read from the disk, not for the actual functions to operate. However, you can observe how many times each function is called. BZ2_indexIntoF is called 120245 times; BZ2_bzDecompress is called 25 times, as are BZ2_bzRead, myfeof, and unRLE_obuf_to_output_SMALL. Because of how often they run, these are the functions I should really look at changing.

The function that is called the most, BZ2_indexIntoF, is fairly simple: a binary search over the cumulative-frequency table cftab. Despite it being called so many times, I do not believe there is room for any optimization here. I have included the function in case I find a way to optimize it later.

__inline__ Int32 BZ2_indexIntoF ( Int32 indx, Int32 *cftab )
{
   /* binary search: find nb such that cftab[nb] <= indx < cftab[nb+1] */
   Int32 nb, na, mid;
   nb = 0;
   na = 256;
   do {
      mid = (nb + na) >> 1;
      if (indx >= cftab[mid]) nb = mid; else na = mid;
   }
   while (na - nb != 1);
   return nb;
}

I also do not think the "bzip2 runs too fast to sample" theory tells the whole story: looking at BZ2_bzDecompress, there are close to 50 variables declared in the function, a massive switch statement, close to 20 if statements, and so on. There are a lot of operations going on, so I think some optimization can still be performed. I may simply have to resort to timing the program myself to measure any optimization.

For BZ2_bzRead I could only locate documentation in the manual.ps file, despite there also being documentation for it on the bzip2 website; I will have to come back to that.

The myfeof function consists of only four lines: a declaration, an if statement, an ungetc call and a return statement. I am almost certain there is no room for optimization here; however, I will include it in case a classmate finds a resolution, or I can revisit it at a later date for consideration.

static
Bool myfeof ( FILE* f )
{
   Int32 c = fgetc ( f );
   if (c == EOF) return True;
   ungetc ( c, f );
   return False;
}

ungetc() seems like it performs no function, since there is a return False statement immediately after it. But here is how the function actually operates: fgetc() reads one character from the file f into c; if that character is EOF, the function returns True and exits. Otherwise ungetc() pushes the character c back onto the stream f, so the end-of-file test does not consume any input, and the function returns False.

 

Okay, so in the Bool unRLE_obuf_to_output_SMALL ( DState* s ) function, which runs 25 times, I was able to find an oddly placed while statement:

while (True) {

   /* try to finish existing run */
   while (True) {

      if (s->strm->avail_out==0) return False;
      if (s->state_out_len==0) break;

      *( (UChar*)(s->strm->next_out) ) = s->state_out_ch;

      BZ_UPDATE_CRC ( s->calculatedBlockCRC, s->state_out_ch );

      s->state_out_len--;
      s->strm->next_out++;
      s->strm->avail_out--;
      s->strm->total_out_lo32++;
      if (s->strm->total_out_lo32==0) s->strm->total_out_hi32++;

   }

Then there is another closing brace further down for the outer while loop. I thought it was odd to have two loops nested inside one another with the same condition, so I simply removed the nested while loop, ran make and make install again, then measured with time.

I decided to run with time: three compression runs, and three decompression runs (with the -d flag). The results are as follows:

For Compression(with my alterations):

run1
real 0m22.131s
user 0m20.853s
sys 0m2.259s

run2
real 0m21.601s
user 0m21.664s
sys 0m0.919s

run3
real 0m22.866s
user 0m21.483s
sys 0m2.380s

For Decompression(with my alterations):

Run1
real 0m7.585s
user 0m5.462s
sys 0m2.117s

Run2
real 0m7.527s
user 0m5.489s
sys 0m2.030s

Run3
real 0m10.740s
user 0m5.379s
sys 0m2.380s

Now to run bzip2 before my alterations. With the time command the results are as follows:

For Compression(without my alterations):

run1
real 0m22.357s
user 0m21.269s
sys 0m1.070s

run2 
real 0m22.024s
user 0m21.047s
sys 0m0.960s

run3
real 0m21.987s
user 0m21.051s
sys 0m0.919s

For Decompression(without my alterations):

run1
real 0m7.515s
user 0m5.180s
sys 0m2.330s

run2
real 0m7.521s
user 0m4.977s
sys 0m2.539s

run3
real 0m7.510s
user 0m5.086s
sys 0m2.419s

I really did not anticipate a difference with decompression, but I did expect a bigger difference with compression from not having to run a while loop nested inside another while loop with the same condition. The file I used again was generated with:

dd if=/dev/zero of=zero bs=1 count=0 seek=1000M

Bzip2 also ships a bzcmp script to check that two compressed files hold the same data. Since I am using dd, I was not sure bzcmp would confirm that the files remain intact and that my alterations have not broken the code. I tried bzcmp with the -l flag, which makes the comparison continue to the end and list every differing byte, and with the -s flag, which suppresses output and returns a non-zero exit status if the files differ. With either flag, or with no flags, I get no output at all; in cmp's conventions, though, no output generally means the files are identical, and the exit status (echo $?) would confirm it. Still, I may test the validity of the file another way, though I may not even bother, because the optimization I tested is not nearly enough of an optimization to be content with.

 

A bit of progress regarding the SPO600 Stage 2

So, after attending class last Wednesday (Friday was Good Friday) and learning how to create dummy files of a specific size for consistent benchmarking, I created a "zero" file, originally 100M in size, but it did not produce a long enough run time for my instructor's liking. Thus I decided to make the file 1000M with the command:

dd if=/dev/zero of=zero bs=1 count=0 seek=1000M

I got the following results after compressing with bzip2, and then decompressing with bzip2 -d:

[Screenshot: time output for bzip2 compression]

And now for the decompress:

[Screenshot: time output for bzip2 -d decompression]

As we can see, the decompression takes significantly less time to run, probably because the compressed file is much smaller; a file of zeros compresses down to almost nothing. But all in all we are making some nice progress. I was also finally able to get gprof to behave and give me the display output.

[Screenshot: gprof call-graph display]

This does help display which functions call which, how many times each function runs, and so on; however, for the runtime percentages it shows 100% for each of them, which is really puzzling. I tried to run the gprof command two ways:

gprof bzip2|less|gprof2dot|dot -Tps|display
(without the compressed data file; the display renders properly)

gprof bzip2 zero|less|gprof2dot|dot -Tps|display
(with the data file; this consistently throws "gprof: out of memory allocating 17179869144 bytes after a total of 688128 bytes" followed by an "unexpected end of file" error)

This is truly puzzling. I will ask classmates why this may be, and may have to keep fighting with gprof to get it working correctly. Worst case scenario, I may have to find a workaround or simply resort to the time command, as that does give me an actual measurement.

So I found out why everything is showing 100%. At the top of the flat profile it says: "Each sample counts as 0.01 seconds. No time accumulated." Apparently this is because Linux uses a libc compatibility routine to fake the profile() system call for gprof; this needs the SIGPROF-based ITIMER_PROF interval timer but should generally work. The same Stack Overflow post that explained this issue recommends OProfile, and I also saw Valgrind recommended, so I may switch to one of those two for a less fussy, more detailed profiler to ensure I can properly benchmark all changes in the future.

Issues with the SPO600 Project. Leading up to Stage 2

Trouble in the Trenches regarding Stage 1

This has been much more troublesome and time consuming than I anticipated, and it is really taking away from the intended experience of the final project: actually analyzing and optimizing the software. I really cannot get gprof, along with the gmon.out file, to operate properly.

I decided to delete and clone the repository again, as I was for some reason encountering issues with its comparison step when running make. Even after that, and using both executables it provides, I am still unable to generate or find the gmon.out file to run gprof on. I'm going to attempt to resolve this by adjusting the CFLAGS again and recompiling, but I am not optimistic. At this point I may choose a different open source software package, considering the troubles I am running into.

Seeing as we are moving onto part two, I really need to figure this out and move on as I already have a ton of ideas on what to do to optimize this code. Hopefully the next update will be with the benchmarking working.

Project Plan

After getting gprof somewhat working, I was able to see that the decompress function is called 25 times. While I'm still having issues benchmarking the particular time each function takes to run, I think it is safe to say that optimizing a function called 25 times per run of the program is a good course of action. The servers are currently down, so hopefully I can take a closer look when they are back up.

SPO600 Stage 1

Stage 1

Find an Open Source Package

For this project, we will be analyzing and optimizing an open-source package, studying its source code to determine CPU-intensive aspects and possible ways to increase performance. We will also be interacting with the open source community to discuss possible issues with the code as well as its compatibility with different processors, and then try to have the optimization accepted upstream.

We need to choose an open source package to optimize: one that includes a CPU-intensive function or method implemented in a language that compiles to machine code (such as C, C++, or assembler) so we can optimize the code. I have decided to choose the open source compression software bzip2, which provides compression and decompression among other utilities.

Benchmark the performance on AArch64

There are a few different files I considered focusing on. I am going to look at the void compress (char* name) function in the bzip2.c file. I picked it because it is a long function with many switch, for, while, do-while and if statements that have a lot of potential for various optimizations: loop hoisting, loop unswitching, loop interchange, and others.

It is important to note that you can use the -1 (or --fast) through -9 (or --best) options to trade speed against compression, but I will run it with the defaults. The default CFLAGS are -Wall -Winline -O2 -g, so I had to adjust the CFLAGS in the Makefile to add -pg to allow me to use gprof. However, since this is compression software, the files I was initially compressing to run bzip2 and generate the needed gmon.out file were consistently too small. After generating a big enough file, compressing it, creating the gmon.out file and running gprof on it, I ran into a nice "unexpected end of file" message; another wall. I am going to confer with a classmate and friend to see if we can resolve the error or come up with another course of action for the benchmarking.

 

SPO600 Lab6 Inline Assembler Code

This post again deals with scaling audio code, but this time the code is given to us by our instructor. We are to run it, view the runtime, and analyze the differences when using one array, two arrays, one-dimensional arrays and two-dimensional arrays.

Question 1: These variables will be used in our assembler code, so we’re going to hand-allocate which register they are placed in. What is the alternative approach?

Answer 1: The alternative approach is to simply declare the variables and let the compiler allocate registers automatically; the program runs faster when more variables can live in the CPU's registers.

Question 2: Set vol_int to fixed-point representation of 0.75. Should we use 32767 or 32768 in the next line? Why?

Answer 2: The number we will use is 32767, not 32768. 32767 is the largest value a signed 16-bit integer can hold (one bit is the sign bit), so it represents 1.0 in this fixed-point format; 32768 would overflow.

Question 3: What does it mean to duplicate values in the next line?

Answer 3: The dup (duplicate) instruction broadcasts the volume factor into every 16-bit lane of vector register v1.8h, so that sqdmulh has the value in each lane when it scales the audio samples.

Question 4: What happens if we remove the following two lines? Why?

Answer 4: If the input and output operands are excluded from the inline assembly statement, the assembly template has no way of knowing where 'in' and 'out' are.

Question 5: Are the results usable? Are they correct?

Answer 5: Yes, the results are usable, though slightly inaccurate due to the fixed-point rounding. The runtime of the program is 0.030s, which is about 0.003s faster than the third program from the previous workshop.

 

For Part B, the individual task, we were assigned to select an open source package (I have chosen groonga), find the assembly-language code in that software, and determine:

  • How much assembly-language code is present
  • Is the assembly code in its own file (.s or .S) or inline
  • Which platform(s) the assembler is used on
  • What happens on other platforms
  • Why it is there (what it does)
  • Your opinion of the value of the assembler code, especially when contrasted with the loss of portability and increase in complexity of the code.

 

How much assembly code is present? There is quite a lot: __asm__ is used more than 30 times across more than 10 files.

Is the assembly code in its own file or inline? Most of the code is written in C; I was not able to find any assembler code in its own .s file, and only located inline assembler code.

Which platforms is the assembler used on? There are many #ifdef statements sectioning out code for WIN32, WIN64, X86_64 and AMD64, so it is compatible with quite a few different platforms.

What happens on other platforms? From what I tested, it seemed to work fine on the platforms I tried; it appears quite compatible across differing platforms.

Why is it there? It is there for numerous reasons. It is often used for bsrl, which computes the position of the most significant set bit. When add is used, it is often paired with the lock prefix, which ensures the CPU has exclusive ownership of the appropriate cache line for the duration of the operation and provides certain additional ordering guarantees. Many registers are moved around manually with mov. The set and add functions are almost entirely written in assembler, using volatile, for the express purpose of working directly with registers and optimizing the software.

Your opinion of the value of the assembler code, especially when contrasted with the loss of portability and increase in complexity of the code. Of the 10+ files that contain assembler code, most consist of very simple snippets of 1 to 5 lines. Only 3 of the files contain functions written entirely in assembler. The simple assembler code can certainly help performance without bogging down development with overly complex code, so that is not an issue. However, there are about 6 functions written entirely in assembler; I will display an example here:

static ngx_inline ngx_atomic_uint_t
ngx_atomic_cmp_set(ngx_atomic_t *lock, ngx_atomic_uint_t old,
    ngx_atomic_uint_t set)
{
    ngx_atomic_uint_t res, temp;

    __asm__ volatile (

    " li %0, 0 \n"                          /* preset "0" to "res" */
    " lwsync \n"                            /* write barrier */
    "1: \n"
    " ldarx %1, 0, %2 \n"                   /* load from [lock] into "temp" */
                                            /* and store reservation */
    " cmpd %1, %3 \n"                       /* compare "temp" and "old" */
    " bne- 2f \n"                           /* not equal */
    " stdcx. %4, 0, %2 \n"                  /* store "set" into [lock] if reservation */
                                            /* is not cleared */
    " bne- 1b \n"                           /* the reservation was cleared */
    " isync \n"                             /* read barrier */
    " li %0, 1 \n"                          /* set "1" to "res" */
    "2: \n"

    : "=&b" (res), "=&b" (temp)
    : "b" (lock), "b" (old), "b" (set)
    : "cc", "memory");

    return res;
}

The other functions written entirely in assembly, as I mentioned previously, are of very similar complexity. While assembler is harder to write than C, Groonga is a full-text search engine, and I do believe one of their goals is to be as optimized as possible, so utilizing assembler code to achieve that is justified in this instance.

SPO600 Lab5 Algorithm Selection

This is my Lab5 Algorithm Selection for my SPO600 class.

For this lab I will be analyzing audio code: its performance, the compression and size of the data, and the scaling of sound samples.

After building vol1.c (with vol.h), the program does produce the same output every time: -86.

The runtime for the program is 0m0.038s.

The code itself:

int main() {

   // Allocate memory for large in and out arrays
   int16_t* in;
   int16_t* out;
   in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
   out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

   int x;
   int ttl = 0;

   // Seed the pseudo-random number generator
   srand(-1);

   // Fill the array with random data
   for (x = 0; x < SAMPLES; x++) {
      in[x] = (rand() % 65536) - 32768;
   }

   // ######################################
   // This is the interesting part!
   // Scale the volume of all of the samples
   for (x = 0; x < SAMPLES; x++) {
      out[x] = scale_sample(in[x], 0.75);
   }
   // ######################################

   // Sum up the data
   for (x = 0; x < SAMPLES; x++) {
      ttl = (ttl + out[x]) % 1000;
   }

   // Print the sum
   printf("Result: %d\n", ttl);

   return 0;
}

Multiple runs do not take the same time, though. The first run took 0m0.038s, the next 0.035s, then 0.036s; run times varied anywhere from 0.035s to 0.038s.

Next we change the code to pre-calculate a lookup table of all possible sample values multiplied by the volume factor. After compiling and running with these changes, it now consistently runs in 0.032s, roughly 0.004s or more faster, so the pre-calculation does improve the runtime of the program.

The code for this:

static int16_t lut[65536];

int main() {

   // Allocate memory for large in and out arrays
   int16_t* in;
   int16_t* out;
   in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
   out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

   int x;
   int ttl = 0;

   // Pre-calculate every possible sample value multiplied by the volume factor
   for (x = 0; x < 65536; x++) {
      lut[x] = (int16_t) ((x - 32768) * 0.75);
   }

   // Seed the pseudo-random number generator
   srand(-1);

   // Fill the array with random data
   for (x = 0; x < SAMPLES; x++) {
      in[x] = (rand() % 65536) - 32768;
   }

   // Scale each sample by looking up the precomputed result
   for (x = 0; x < SAMPLES; x++) {
      out[x] = lut[in[x] + 32768];
   }

   // Sum up the data
   for (x = 0; x < SAMPLES; x++) {
      ttl = (ttl + out[x]) % 1000;
   }

   // Print the sum
   printf("Result: %d\n", ttl);

   return 0;
}

 

Next we are to convert the volume factor 0.75 to a fixed-point integer by multiplying by a binary number representing the fixed-point value "1". For example, you could use 0b100000000 (= 256 in decimal) to represent 1.00, and then shift the result right by the required number of bits after the multiplication (>>8 if you're using 256 as the multiplier).

The code is as such:

int main() {

   // Allocate memory for large in and out arrays
   int16_t* in;
   int16_t* out;
   in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
   out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

   int x;
   int ttl = 0;
   int scale = 256 * 0.75;   // 0.75 as a fixed-point integer (192)

   // Seed the pseudo-random number generator
   srand(-1);

   // Fill the array with random data
   for (x = 0; x < SAMPLES; x++) {
      in[x] = (rand() % 65536) - 32768;
   }

   // Scale each sample with an integer multiply and shift
   for (x = 0; x < SAMPLES; x++) {
      out[x] = (in[x] * scale) >> 8;
   }

   // Sum up the data
   for (x = 0; x < SAMPLES; x++) {
      ttl = (ttl + out[x]) % 1000;
   }

   // Print the sum
   printf("Result: %d\n", ttl);

   return 0;
}

The run time with this code is 0.030s, slightly faster than both the first and second implementations. Of the three, the third implementation is (slightly) the fastest, followed by the second, and lastly the first.

Approach 1 multiplies each element by the volume using floating point; it is slower than Approach 2, which pre-calculates all possible sample values multiplied by the volume factor and then only does a table lookup per sample. Approach 3 is the fastest because it replaces the floating-point multiply with an integer multiply and shift.

SPO600 Lab 4 Vectorization

This is the fourth lab for my SPO600 class. For this lab we have to write a program that fills two arrays, each 1000 elements long, with random numbers from -1000 to 1000. The two arrays are added, element by element, into a third array, and that array is summed. We compile this program and use objdump to analyze the assembly code of the original program, and then of the new program once we rearrange the code to allow for auto-vectorization.

Original code for the C program: https://pastebin.com/4Rtghd4p

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void main(){

   int arr1[1000];
   int arr2[1000];
   int arr3[1000];

   int max = 1000;
   int min = -1000;

   time_t t;
   int sum = 0;
   int i;

   srand(time(&t));

   for(i=0;i<1000;i++){
      arr1[i] = rand() % 2000 - 1000;
      arr2[i] = rand() % 2000 - 1000;
      arr3[i] = arr1[i] + arr2[i];
      sum += arr3[i];
   }
   printf("The sum of the third array is %d \n", sum);
}
As you can see, I store the random numbers in the arrays, add the arrays together element by element, and then sum the third array, all in one for loop.

The assembly code after running the command gcc -O3 -ftree-vectorize -o array2 array.c:
0000000000400560 <main>:
 400560: a9bb7bfd stp x29, x30, [sp, #-80]! // save frame pointer and link register, grow the stack
 400564: 910003fd mov x29, sp // set up the frame pointer
 400568: a9025bf5 stp x21, x22, [sp, #32] // save callee-saved register pair at sp+32
 40056c: 5289ba75 mov w21, #0x4dd3 // #19923, low half of the "magic" divide constant
 400570: a90153f3 stp x19, x20, [sp, #16] // save callee-saved register pair at sp+16
 400574: 72a20c55 movk w21, #0x1062, lsl #16 // insert the upper 16 bits, keeping the lower bits
 400578: f9001bf7 str x23, [sp, #48] // save x23
 40057c: 52807d13 mov w19, #0x3e8 // #1000, the loop counter
 400580: 5280fa14 mov w20, #0x7d0 // #2000, the modulus for rand() % 2000
 400584: 910123a0 add x0, x29, #0x48 // address of t for the time() call
 400588: 52800017 mov w23, #0x0 // #0, the running sum

/* The time, seed random and rand functions */
 40058c: 97ffffd9 bl 4004f0 <time@plt>
 400590: 97ffffec bl 400540 <srand@plt>
 400594: 97ffffdf bl 400510 <rand@plt>

/* The loop    */
 400598: 2a0003f6 mov w22, w0
 40059c: 97ffffdd bl 400510 <rand@plt> // random number for the array
 4005a0: 9b357c03 smull x3, w0, w21 // signed multiply long by the magic constant
 4005a4: 71000673 subs w19, w19, #0x1 // decrement the loop counter and set flags
 4005a8: 9b357ec2 smull x2, w22, w21 // signed multiply long for the other rand() value
 4005ac: 9367fc63 asr x3, x3, #39 // arithmetic shift right 39: completes the divide by 2000
 4005b0: 4b807c63 sub w3, w3, w0, asr #31 // correct the quotient for negative values
 4005b4: 9367fc42 asr x2, x2, #39 // same shift for the second value
 4005b8: 4b967c42 sub w2, w2, w22, asr #31 // same sign correction
 4005bc: 1b148060 msub w0, w3, w20, w0 // w0 = w0 - (w3 * w20), i.e. rand() % 2000
 4005c0: 1b14d842 msub w2, w2, w20, w22 // w2 = w22 - (w2 * w20)
 4005c4: 0b000040 add w0, w2, w0 // add the two remainders
 4005c8: 511f4000 sub w0, w0, #0x7d0 // subtract 2000 (the two -1000 offsets at once)
 4005cc: 0b0002f7 add w23, w23, w0 // accumulate the sum
 4005d0: 54fffe21 b.ne 400594 <main+0x34> // loop until the counter reaches zero

/* Prints the sum of the array */
 4005d4: 2a1703e1 mov w1, w23
 4005d8: 90000000 adrp x0, 400000 <_init-0x4b8>
 4005dc: 911ee000 add x0, x0, #0x7b8
 4005e0: 97ffffdc bl 400550 <printf@plt>
 4005e4: a94153f3 ldp x19, x20, [sp, #16]
 4005e8: a9425bf5 ldp x21, x22, [sp, #32]
 4005ec: f9401bf7 ldr x23, [sp, #48]
 4005f0: a8c57bfd ldp x29, x30, [sp], #80
 4005f4: d65f03c0 ret
As you can see, vectorization has not taken place in the objdump output, so we must edit the C code to allow for auto-vectorization.

I changed the code so that the three operations mentioned previously (assigning random numbers from -1000 to 1000, adding the first and second arrays together element by element into a third array, and summing the third array) are each done in their own separate for loop: https://pastebin.com/qa3mrv4a
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void main(){

   int arr1[1000];
   int arr2[1000];
   int arr3[1000];

   int max = 1000;
   int min = -1000;

   time_t t;
   int sum = 0;
   int i;

   srand(time(&t));

   for(i=0;i<1000;i++){
      arr1[i] = rand() % 2000 - 1000;
      arr2[i] = rand() % 2000 - 1000;
   }

   for(i=0;i<1000;i++){
      arr3[i] = arr1[i] + arr2[i];
   }

   for(i=0;i<1000;i++){
      sum += arr3[i];
   }
   printf("The sum of the third array is %d \n", sum);
}

 

The assembly code of the new C program after running the command gcc -O3 -ftree-vectorize -o array2 array.c:

 

0000000000400560 <main>:
 400560: d285e610 mov x16, #0x2f30 // #12080
 400564: cb3063ff sub sp, sp, x16
 400568: a9007bfd stp x29, x30, [sp]
 40056c: 910003fd mov x29, sp
 400570: 910123a0 add x0, x29, #0x48
 400574: a90153f3 stp x19, x20, [sp, #16]
 400578: 5289ba74 mov w20, #0x4dd3 // #19923
 40057c: a9025bf5 stp x21, x22, [sp, #32]
 400580: 72a20c54 movk w20, #0x1062, lsl #16
 400584: f9001bf7 str x23, [sp, #48]
 400588: 910143b6 add x22, x29, #0x50
 40058c: 913fc3b5 add x21, x29, #0xff0
 400590: 5280fa13 mov w19, #0x7d0 // #2000
 400594: d2800017 mov x23, #0x0 // #0
 400598: 97ffffd6 bl 4004f0 <time@plt>
 40059c: 97ffffe9 bl 400540 <srand@plt>
 4005a0: 97ffffdc bl 400510 <rand@plt>
 4005a4: 9b347c01 smull x1, w0, w20
 4005a8: 9367fc21 asr x1, x1, #39
 4005ac: 4b807c21 sub w1, w1, w0, asr #31
 4005b0: 1b138020 msub w0, w1, w19, w0
 4005b4: 510fa000 sub w0, w0, #0x3e8
 4005b8: b8376ac0 str w0, [x22, x23]
 4005bc: 97ffffd5 bl 400510 <rand@plt>
 4005c0: 9b347c01 smull x1, w0, w20
 4005c4: 9367fc21 asr x1, x1, #39
 4005c8: 4b807c21 sub w1, w1, w0, asr #31
 4005cc: 1b138020 msub w0, w1, w19, w0
 4005d0: 510fa000 sub w0, w0, #0x3e8
 4005d4: b8376aa0 str w0, [x21, x23]
 4005d8: 910012f7 add x23, x23, #0x4
 4005dc: f13e82ff cmp x23, #0xfa0
 4005e0: 54fffe01 b.ne 4005a0 <main+0x40> // b.any
 4005e4: d283f202 mov x2, #0x1f90 // #8080
 4005e8: 8b0203a1 add x1, x29, x2
 4005ec: d2800000 mov x0, #0x0 // #0
 4005f0: 3ce06ac0 ldr q0, [x22, x0]
 4005f4: 3ce06aa1 ldr q1, [x21, x0]
 4005f8: 4ea18400 add v0.4s, v0.4s, v1.4s
 4005fc: 3ca06820 str q0, [x1, x0]
 400600: 91004000 add x0, x0, #0x10
 400604: f13e801f cmp x0, #0xfa0
 400608: 54ffff41 b.ne 4005f0 <main+0x90> // b.any
 40060c: 4f000400 movi v0.4s, #0x0
 400610: aa0103e0 mov x0, x1
 400614: d285e601 mov x1, #0x2f30 // #12080
 400618: 8b0103a1 add x1, x29, x1
 40061c: 3cc10401 ldr q1, [x0], #16
 400620: 4ea18400 add v0.4s, v0.4s, v1.4s
 400624: eb01001f cmp x0, x1
 400628: 54ffffa1 b.ne 40061c <main+0xbc> // b.any
 40062c: 4eb1b800 addv s0, v0.4s
 400630: 90000000 adrp x0, 400000 <_init-0x4b8>
 400634: 91208000 add x0, x0, #0x820
 400638: 0e043c01 mov w1, v0.s[0]
 40063c: 97ffffc5 bl 400550 <printf@plt>
 400640: f9401bf7 ldr x23, [sp, #48]
 400644: a94153f3 ldp x19, x20, [sp, #16]
 400648: d285e610 mov x16, #0x2f30 // #12080
 40064c: a9425bf5 ldp x21, x22, [sp, #32]
 400650: a9407bfd ldp x29, x30, [sp]
 400654: 8b3063ff add sp, sp, x16
 400658: d65f03c0 ret
 40065c: 00000000 .inst 0x00000000 ; undefined

The assembly code has grown since making the changes to the C code. In addition, we now see our first evidence of auto-vectorization: instructions such as ldr q0, add v0.4s, v0.4s, v1.4s and addv s0, v0.4s. The q and v register names show the vector registers being accessed, and vectorization finally taking place.

My experience with this lab was interesting; I almost had to reverse engineer my code to allow the compiler to auto-vectorize, by changing the program from completing all operations in one for loop to doing so in separate loops. The combined loop could not be vectorized because each iteration calls rand(), a function call with side effects, and the running sum creates a dependency between iterations; once the pure element-by-element addition is split into its own loop, the compiler can vectorize it (GCC's -fopt-info-vec option can report this). So even though three loops look less optimal than doing all the work in one loop, the change does actually help with optimization.

 

SPO600 Lab3

First, for this lab, I will build and compile three different versions of a C program provided by my professor. I will build and compile the code on both aarch64 and x86_64, run objdump on both, and then analyze the differences.

I will start with aarch64.

For hello1, using the standard printf function:

0000000000400624 <main>:
 400624: a9bf7bfd stp x29, x30, [sp, #-16]!
 400628: 910003fd mov x29, sp
 40062c: 90000000 adrp x0, 400000 <_init-0x4a8>
 400630: 911c0000 add x0, x0, #0x700
 400634: 97ffffb7 bl 400510 <puts@plt>
 400638: 52800000 mov w0, #0x0 // #0
 40063c: a8c17bfd ldp x29, x30, [sp], #16
 400640: d65f03c0 ret
 400644: 00000000 .inst 0x00000000 ; undefined

The defining feature here is the puts call: even though hello1 uses the standard printf function to print hello world to the screen, the compiler substitutes the simpler puts since no formatting is needed.

For hello2, using a direct write to stdout:

0000000000400624 <main>:
 400624: a9bf7bfd stp x29, x30, [sp, #-16]!
 400628: 910003fd mov x29, sp
 40062c: 90000000 adrp x0, 400000 <_init-0x4a8>
 400630: 911c2000 add x0, x0, #0x708
 400634: d28001a2 mov x2, #0xd // #13
 400638: aa0003e1 mov x1, x0
 40063c: 52800020 mov w0, #0x1 // #1
 400640: 97ffffb4 bl 400510 <write@plt>
 400644: 52800000 mov w0, #0x0 // #0
 400648: a8c17bfd ldp x29, x30, [sp], #16
 40064c: d65f03c0 ret

For the hello2 C program using write, we see in the assembler code that write, not puts, is called, and that there are three mov operations setting up arguments before the call, making it slightly less streamlined.

For hello3, using a direct kernel system call to write to the file descriptor:

0000000000400624 <main>:
 400624: a9bf7bfd stp x29, x30, [sp, #-16]!
 400628: 910003fd mov x29, sp
 40062c: 90000000 adrp x0, 400000 <_init-0x4a8>
 400630: 911c4000 add x0, x0, #0x710
 400634: 528001a3 mov w3, #0xd // #13
 400638: aa0003e2 mov x2, x0
 40063c: 52800021 mov w1, #0x1 // #1
 400640: d2800800 mov x0, #0x40 // #64
 400644: 97ffffb3 bl 400510 <syscall@plt>
 400648: 52800000 mov w0, #0x0 // #0
 40064c: a8c17bfd ldp x29, x30, [sp], #16
 400650: d65f03c0 ret
 400654: 00000000 .inst 0x00000000 ; undefined

For this hello3 program, we see the assembler code uses syscall, which entails four mov operations before the call, one more mov than the previous program using write, for an even less streamlined version.

 

Now I will build and compile the three C programs on x86_64 and analyze how the assembly behaves.

For Hello1:

0000000000400507 <main>:
 400507: 55 push %rbp
 400508: 48 89 e5 mov %rsp,%rbp
 40050b: bf b0 05 40 00 mov $0x4005b0,%edi
 400510: e8 0b ff ff ff callq 400420 <puts@plt>
 400515: b8 00 00 00 00 mov $0x0,%eax
 40051a: 5d pop %rbp
 40051b: c3 retq
 40051c: 0f 1f 40 00 nopl 0x0(%rax)

So on xerxes (x86_64), for hello1 we can see more operations surrounding the call than on aarch64: the push, the movs, and the callq. While it still uses puts, the additional setup operations make the code slightly less lean.

For Hello2:

0000000000400507 <main>:
 400507: 55 push %rbp
 400508: 48 89 e5 mov %rsp,%rbp
 40050b: ba 0d 00 00 00 mov $0xd,%edx
 400510: be c0 05 40 00 mov $0x4005c0,%esi
 400515: bf 01 00 00 00 mov $0x1,%edi
 40051a: e8 01 ff ff ff callq 400420 <write@plt>
 40051f: b8 00 00 00 00 mov $0x0,%eax
 400524: 5d pop %rbp
 400525: c3 retq
 400526: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
 40052d: 00 00 00

For hello2, which uses write instead of printf, there are three more lines of code, the same delta we saw between hello1 and hello2 on aarch64, with more argument setup occurring before the call takes effect. This is a further decrease in leanness.

For Hello3:

0000000000400507 <main>:
 400507: 55 push %rbp
 400508: 48 89 e5 mov %rsp,%rbp
 40050b: b9 0d 00 00 00 mov $0xd,%ecx
 400510: ba c0 05 40 00 mov $0x4005c0,%edx
 400515: be 01 00 00 00 mov $0x1,%esi
 40051a: bf 01 00 00 00 mov $0x1,%edi
 40051f: b8 00 00 00 00 mov $0x0,%eax
 400524: e8 f7 fe ff ff callq 400420 <syscall@plt>
 400529: b8 00 00 00 00 mov $0x0,%eax
 40052e: 5d pop %rbp
 40052f: c3 retq

For hello3, which uses the syscall function, there is, as when we compiled the programs on aarch64, an extra mov in the assembly code. This is the least streamlined code we have documented, with the extra setup operations before the call as well as the additional mov.

For the last part of the group assignment we had to create an assembler program that loops to 30, printing each number, which requires converting the loop counter into ASCII digits. We were provided with an initial Hello World program, as well as a basic loop program, to work from.

I firstly created the program for x86_64:

For the sake of not making the post too long, I will only include snippets of code. We start by using the index register as the counter for the loop. We then choose a register for the ASCII digit and initialize it to 0x30, which is the ASCII code for the digit '0'; we must also initialize the remainder register:

mov $0x30, %r12 /*Initialize to 0 in ASCII*/
mov $0, %rdx /*Initialize the remainder*/

We then set the dividend to the index counter and the divisor to 10, and perform the division:

 mov %r15, %rax /*Set dividend*/
 mov $10, %r10 /*Set the divisor*/
 div %r10 /*divide*/
 mov %rax, %r14 /*store quotient*/
 mov %rdx, %r13 /*store remainder*/
 add $0x30, %r14 /*convert first digit from quotient into ascii*/
 add $0x30, %r13 /*convert second digit from remainder into ascii*/

After the division, we store the quotient and remainder in two registers and convert their values to ASCII.

We then write the remainder digit into the message, compare the initial value in r12 with r14 (which holds the first digit), and then write the quotient digit into the message as well.

Within the print section we set the length and address of the string, then make the write syscall. The code then increments the loop register and compares it with the maximum; if they are not equal it jumps back to the loop, and once they are equal it issues the exit syscall:

mov $len, %rdx /*Length*/
mov $msg, %rsi /*Message*/
mov $1, %rdi /*stdout*/
mov $1, %rax /*change syscall to 1*/
syscall
 
inc %r15 /*Increments r15*/
cmp $max, %r15 /*compare r15 with the max value*/
jne loop /*goes to the loop section*/
 
mov $0, %rdi /*exit status*/
mov $60, %rax /*syscall 60 = exit*/
syscall

 

Now we had to do the same thing for aarch64. It is similar to the x86_64 code, following the same logic for moving data around, with minor differences; for example, the x86_64 version needs a separate increment instruction every pass through the loop, while the aarch64 version does not.

Programming in assembly was exceedingly difficult, not only because it meant learning a new language, but also because the heavy reliance on manually managing registers and memory adds many more lines of code and opens the door to many more memory errors. The fact that it is so low level was a learning curve, as in the Seneca College CPA program the lowest-level languages we had been exposed to were RPG, CLLE or C. The logic is completely different, and the very archaic feel of the language was also a hurdle to overcome; oddly enough, though, the assembler's error messages were similar to what you would see debugging with gcc (which is an option, although I avoided it as it requires the use of main instead of _start).

I definitely did not enjoy coding in assembly, but I do understand it is a valuable learning experience to see how code operates with the machine at a much lower level than we are used to.

Link to x86_64: https://www.pastiebin.com/5a80b40e2a53f

Link to aarch64: https://www.pastiebin.com/5a80b45741ffe

SPO600 Lab2

Adam Chimienti

107904156

SPO600

Lab2 Compiler C Lab

For this lab I will write a very simple "Hello World!" program in C. I will then compile the program several times with different options on the command line and analyze the different results each option yields. I will also test these results on different architectures, using our SPO600 servers: aarchie, which is a standard setup, and ccharlie, which has 8GB of RAM and 40GB of space. The baseline compiler command is "gcc -g -O0 -fno-builtin -o hello hello.c". The different compiler options are listed below, each with a description.

1) Add -static to the command line

The file size is almost 10X as large, the last section header is "free_mem" instead of "init", and the function call is "_IO_printf". The -static option instructs the compiler to link statically against the C runtime libraries.

 

2) Remove -fno-builtin from the command line

The file is the same size as with the default command, but the function call is "puts" instead of "printf". This is because -fno-builtin tells the compiler not to recognize built-in functions that do not begin with the __builtin_ prefix, which covers a whole host of library functions. Removing -fno-builtin therefore lets the compiler substitute the simpler puts call for printf as an optimization.

 

3) Remove -g from the command line.

The file is a bit smaller than the default one. The section header is listed as “init”. I did not happen to notice any other changes from the main compiler command. The command -g produces debugging information in the operating system’s native format. Thus the shortcuts taken by optimized code may occasionally produce surprising results: some variables you declared may not exist at all; flow of control may briefly move where you did not expect it; some statements may not be executed because they compute constant results or their values were already at hand; some statements may execute in different places because they were moved out of loops. Although I did not notice these changes.

 

4) Add additional arguments to the printf() function in your program (go up to 10 arguments)

Looking at the objdump, the values are loaded into registers up to the seventh additional argument; after that they no longer appear in w1, w2, w3 and so on, and the str (store register) instruction shows up instead, storing the remaining arguments to the stack via sp before the call.

 

5) Move the printf call to a separate function named output() and call that function from main().

The function call is "output", not "printf" or "puts". The adrp, add and .inst entries are no longer in main either. The file seems to be ever so slightly larger, probably due to the few extra lines of code.

 

6) Remove -O0 and add -O3 to the gcc options.

The -O0 option before meant "do not optimize", which is the default. The -O3 option turns on all optimizations specified by -O2 and additionally turns on the -finline-functions and -frename-registers options. Compared with the baseline, the -O3 build no longer has the mov, and it also no longer has the .inst entry, presumably due to the more aggressive optimization.