50.3. Genetic Query Optimization (GEQO) in PostgreSQL
The GEQO module approaches the queryoptimization problem as though it were the well-known traveling salesmanproblem (TSP).Possible query plans are encoded as integer strings. Each stringrepresents the join order from one relation of the query to the next.For example, the join tree
is encoded by the integer string '4-1-3-2',which means, first join relation '4' and '1', then '3', andthen '2', where 1, 2, 3, 4 are relation IDs within thePostgreSQL optimizer.
Specific characteristics of the GEQOimplementation in PostgreSQLare:
Usage of a steady state GA (replacement of the least fit individuals in a population, not whole-generational replacement) allows fast convergence towards improved query plans. This is essential for query handling with reasonable time;
Usage of edge recombination crossover which is especially suited to keep edge losses low for the solution of the TSP by means of a GA;
Mutation as genetic operator is deprecated so that no repair mechanisms are needed to generate legal TSP tours.
Parts of the GEQO module are adapted from D. Whitley'sGenitor algorithm.
The GEQO module allowsthe PostgreSQL query optimizer tosupport large join queries effectively throughnon-exhaustive search.
50.3.1. Generating Possible Plans with GEQO
The GEQO planning process uses the standard plannercode to generate plans for scans of individual relations. Then joinplans are developed using the genetic approach. As shown above, eachcandidate join plan is represented by a sequence in which to jointhe base relations. In the initial stage, the GEQOcode simply generates some possible join sequences at random. For eachjoin sequence considered, the standard planner code is invoked toestimate the cost of performing the query using that join sequence.(For each step of the join sequence, all three possible join strategiesare considered; and all the initially-determined relation scan plansare available. The estimated cost is the cheapest of thesepossibilities.) Join sequences with lower estimated cost are considered"more fit" than those with higher cost. The genetic algorithmdiscards the least fit candidates. Then new candidates are generatedby combining genes of more-fit candidates — that is, by usingrandomly-chosen portions of known low-cost join sequences to createnew sequences for consideration. This process is repeated until apreset number of join sequences have been considered; then the bestone found at any time during the search is used to generate the finishedplan.
This process is inherently nondeterministic, because of the randomizedchoices made during both the initial population selection and subsequent"mutation" of the best candidates. To avoid surprising changesof the selected plan, each run of the GEQO algorithm restarts itsrandom number generator with the current geqo_seedparameter setting. As long as geqo_seed and the otherGEQO parameters are kept fixed, the same plan will be generated for agiven query (and other planner inputs such as statistics). To experimentwith different search paths, try changing geqo_seed.
50.3.2. Future Implementation Tasks forPostgreSQL GEQO
Work is still needed to improve the genetic algorithm parameter settings. In file src/backend/optimizer/geqo/geqo_main.c, routines gimme_pool_size and gimme_number_generations, we have to find a compromise for the parameter settings to satisfy two competing demands:
In the current implementation, the fitness of each candidate join sequence is estimated by running the standard planner's join selection and cost estimation code from scratch. To the extent that different candidates use similar sub-sequences of joins, a great deal of work will be repeated. This could be made significantly faster by retaining cost estimates for sub-joins. The problem is to avoid expending unreasonable amounts of memory on retaining that state.
At a more basic level, it is not clear that solving query optimization with a GA algorithm designed for TSP is appropriate. In the TSP case, the cost associated with any substring (partial tour) is independent of the rest of the tour, but this is certainly not true for query optimization. Thus it is questionable whether edge recombination crossover is the most effective mutation procedure.