Algorithm
The GI algorithm in this work is designed to work in the optimization loop of an emergent software system. This oversall GI framework is illustrated in Fig.~ref{fig:concept}. An emergent software system is assembled from a large collection of small building blocks, such as stream processors, memory cache implementations, hash tables, and so on, and is deployed into a real environment. Once that system has learned the best composition of blocks for a given environment, it will select one of those building blocks and capture a short trace of the method calls that are issued to that block within the present environment. This trace is then sent to a GI system, along with the source code of the building block from which it was captured, so that the GI system can attempt to generate an improved variation of that building block which has higher performance for the given input data sample. If the GI system is successful, the improved building block is pushed back to the emergent software system which uses real-time learning to determine if the proposed improvement does yield higher performance for the intended environment conditions in deployment. Our GI implementation is designed to operate on source code written in the Dana programming language, which supports sound hot-swapping of code for emergent software systems citep{porter2021sound}.
TODO: add image
For the purposes of this study we examine only the GI element of this overall concept. We focus on one particular building block throughout our experiments (a hash table), and we assume that short traces of function calls to this block have already been captured by the emergent system. We also assume that the GI process is able to identify the textit{hash function} of the hash table implementation as the specific area in which to focus when generating improved variations; in practice this focus area could be determined using function call frequency or CPU intensity analysis.
\begin{algorithm} \begin{algorithmic} \FOR{i=0 to generations} \IF{i==0} \STATE create initial population of clones \ENDIF \\ mutate a \% of the population\\ check fitness of all population members\\ \IF{i\%5 == 0} \STATE check performance on unseen data of all population members \\ \ENDIF \\ select new population (roulette wheel) \\ crossover a \% of the population\\ \ENDFOR \end{algorithmic} \end{algorithm}
The core of our system is based on a typical genetic improvement process, using mutation, fitness, selection, and crossover. We include crossover in this work to make use of existing code to expand the code base of members of our population. Because our volume of starting genetic material is very small, we also include new code synthesis in our set of available mutations, in combination with more typical mutation types.
Mutation
In this section we present the detail of how mutations work, including how we mix code synthesis with more traditional mutation types. We first describe our approach to synthesizing new genetic material for insertion, then present our general mutation approach.
Synthesis
Whenever an insert mutation is selected, we synthesize basic code of three types: variable declarations (Listing ref{cd:dec}), operations (Listing ref{cd:op}), and control structures (Listing ref{cd:con}). All three are restricted to the use of primitive types (integers (int) range: 0 - 1000, characters (char) range: a - z, decimal (dec) range: 0 - 1 and booleans (bool) 0 or 1) or for operations and control structures existing variables, and are inserted either before or after a code token in the existing code (including potentially inside existing control structures).
Example variable declarations are shown in Listing ref{cd:dec}, and are composed of a type declaration and a variable name using a lowercase letter string (generated in alphabetical order, e.g. x, y, z, aa, ab, ac). We have an equals operator and a randomly generated constant which matches the declared type.
int a = 26
dec b = 0.029817845987
bool c = false
char d = "g"
Example synthesised operations are shown in Listing ref{cd:op}. These are synthesised using pre-existing variables that are in-scope at the intended point of insertion. Depending on the types of variables that are in-scope an operation is then chosen at random (to make sure we have variables of the type needed). Once an operation has been chosen, variables are chosen at random to serve as operands from among all of the possible variables that are in-scope. Operations use between one and three variables, for example \(a = b * c\) or \(a ++\).
result = a + result
a += result
result ++
result = result % a
An example control structure is shown in Listing ref{cd:con}; we support for-loops, while-loops, and if-statements. In the case of for-loops the variable name for the loop is chosen from the same list as other variable declarations, and the limiter on the loop is set using either a variable that is in-scope or from a randomly generated integer. In the case of if and while structure the Boolean condition is generated using existing variables. All control structures are generated with an empty body (which can be filled in with future mutations).
for(e = 0; e < result; e++)
{
}
Mutations
Overall we use three types of mutation: insertion, deletion, and modification, each on which has different conditions placed on them. We select which mutation to apply randomly based on a parameterised weighting.
Insert mutations are always possible, but the inserted code must not cause scope problems in terms of variable usage. An example of an insertion mutation applied to the original code in Listing ref{cd:hf1} would be:
int hash(char key[])
{
int result = 1
int a = 36
for (int i = 0; i < key.arrayLength; i++)
{
result = a + key[i]
}
return result % HT_LEN
where the variable \(a\) has been added with the integer type and a random integer as its initial value.
Modifications can be performed on operations or conditions in control constructs; in both cases they can modify either the operator or the operands. In the case of operands we replace an operand with a variable of the same type. In the case of operators, we select an operator that works on the quantity and types of operands. In the example above we have mutated the operator and one of the variables to use the inserted \(a\) variable. Thus in the operation within the for-loop, the operation has changed from \(*\) to \(+\) and \(result\) is now \(a\).
Deletions have conditions on their use to ensure that the resulting code is still compilable. Simple operations can always be deleted, but control structures and declarations have restrictions. Specifically, we can only delete a variable declaration if that variable is not used later in the code; and we can be only delete control structures if there is no code in the body of the control structure. Two separate deletions are therefore needed to remove the control structure in the above code: the first to delete the content:
int hash(char key[])
{
int result = 1
int a = 36
for (int i = 0; i < key.arrayLength; i++)
{
}
return result % HT_LEN
and the second to delete the structure itself:
int hash(char key[])
{
int result = 1
int a = 36
return result % HT_LEN
}
This partciular example series of deletions does not create a good hash function (it always returns 1 and so places everything in a single bucket); by comparison, one of the best population members from our experiments after 200 generations is:
int hash (char key[])
{
...
result = 1
result = result * result
result = result * result
result = result * result
i = 0 i = 0 i = 0 i = 0
e = 843 bool a = 1 i = 0
h = 823 i = 0 i = 0
result = result * key[i]
result = result * result
result = result * result
result = result + result
for (i=0; i<key.arrayLength; i++)
{
result = result * key[i]
result = result * key[i]
}
return result % HT_LEN
By this stage the hash function has gone from the simple original hash function:
To become a more complex hash function which projects key values into a larger space before the modulus is taken:
Crossover
Genetic crossover works in much the same way as insertion, except that the code inserted is taken from another member of the population. This creates additional issues surrounding scope. Since all population members use the same set of names for synthesised variables, and have the same original source code, there is a very high likelihood of variable name conflict in the inserted code. If the code is inserted without consideration then variables could be: undeclared at the point of insertion, declared twice in the same scope or have different types at different points in the code.
To resolve issues of this sort we will often rename the variable in the code being inserted as well as adding additional declarations with the inserted code for it’s renamed variables. Renaming of variables uses the same name list as synthesis of variable declarations.
An example of crossover is visible in Listing ref{cd:opt199} where a second copy of the \(result=results*key[i]\) line has been inserted. In this case all the variables used were declared in the insert location scope with the same type so there was no need to rename. If this was inserted before the declaration of \(result\) then we would copy the declaration for the insert code and insert it with the line. This would mean there were two declarations for the variable so the second one would be deleted.
int hash (char key[])
{
...
result = 1
result = result * result
result = result * result
result = result * result
i = 0 i = 0 i = 0 i = 0
e = 843 bool a = 1 i = 0
h = 823 i = 0 i = 0
result = result * key[i]
result = result * result
result = result * result
result = result + result
for (i=0; i<key.arrayLength; i++)
{
result = result * key[i]
result = result * key[i]
}
return result % HT_LEN
Selection
Our selection process, where we choose which individuals to select for copying from one generation for inclusion in the next (with associated crossover/mutation), uses a rank-weighted roulette wheel approach. Each population is ordered by fitness and ranked. This rank is then used such that the fittest individuals have the highest probability of selection for the next generation. Selection is done with replacement, so some individuals will appear multiple times in the following generation, and some will be completely absent, with a (small) possibility of even the worst individual being selected for the next generation. The exception is elites which are the x top individuals who pass to the next generation and do not undergo mutation or crossover. This preserves known good code.
where \(w_i\) is the weight for sampling of the \(i^{th}\) individual, \(n\) the number of individual in the generation and \(r_i\) is the rank of the individual.