您正在查看: 标签 字符串 下的文章

Genetic Query OptimizerFurther Reading

50.4. Further Reading

The following resources contain additional information about genetic algorithms:

  • The Hitch-Hiker's Guide to Evolutionary Computation, (FAQ for news://comp.ai.genetic)

  • Evolutionary Computation and its application to art and design, by Craig Reynolds

  • Fundamentals of Database Systems

  • POSTGRES查询优化器的设计与实现

Genetic Query OptimizerGenetic Query Optimization (GEQO) in PostgreSQL

50.3. Genetic Query Optimization (GEQO) in PostgreSQL

The GEQO module approaches the queryoptimization problem as though it were the well-known traveling salesmanproblem (TSP).Possible query plans are encoded as integer strings. Each stringrepresents the join order from one relation of the query to the next.For example, the join tree

   /\
  /\ 2
 /\ 3
4  1

is encoded by the integer string '4-1-3-2',which means, first join relation '4' and '1', then '3', andthen '2', where 1, 2, 3, 4 are relation IDs within thePostgreSQL optimizer.

Specific characteristics of the GEQOimplementation in PostgreSQLare:

  • Usage of a steady state GA (replacement of the least fit individuals in a population, not whole-generational replacement) allows fast convergence towards improved query plans. This is essential for query handling with reasonable time;

  • Usage of edge recombination crossover which is especially suited to keep edge losses low for the solution of the TSP by means of a GA;

  • Mutation as genetic operator is deprecated so that no repair mechanisms are needed to generate legal TSP tours.

Parts of the GEQO module are adapted from D. Whitley'sGenitor algorithm.

The GEQO module allowsthe PostgreSQL query optimizer tosupport large join queries effectively throughnon-exhaustive search.

50.3.1. Generating Possible Plans with GEQO

The GEQO planning process uses the standard plannercode to generate plans for scans of individual relations. Then joinplans are developed using the genetic approach. As shown above, eachcandidate join plan is represented by a sequence in which to jointhe base relations. In the initial stage, the GEQOcode simply generates some possible join sequences at random. For eachjoin sequence considered, the standard planner code is invoked toestimate the cost of performing the query using that join sequence.(For each step of the join sequence, all three possible join strategiesare considered; and all the initially-determined relation scan plansare available. The estimated cost is the cheapest of thesepossibilities.) Join sequences with lower estimated cost are considered"more fit" than those with higher cost. The genetic algorithmdiscards the least fit candidates. Then new candidates are generatedby combining genes of more-fit candidates — that is, by usingrandomly-chosen portions of known low-cost join sequences to createnew sequences for consideration. This process is repeated until apreset number of join sequences have been considered; then the bestone found at any time during the search is used to generate the finishedplan.

This process is inherently nondeterministic, because of the randomizedchoices made during both the initial population selection and subsequent"mutation" of the best candidates. To avoid surprising changesof the selected plan, each run of the GEQO algorithm restarts itsrandom number generator with the current geqo_seedparameter setting. As long as geqo_seed and the otherGEQO parameters are kept fixed, the same plan will be generated for agiven query (and other planner inputs such as statistics). To experimentwith different search paths, try changing geqo_seed.

50.3.2. Future Implementation Tasks forPostgreSQL GEQO

Work is still needed to improve the genetic algorithm parameter settings. In file src/backend/optimizer/geqo/geqo_main.c, routines gimme_pool_size and gimme_number_generations, we have to find a compromise for the parameter settings to satisfy two competing demands:

  • Optimality of the query plan

  • Computing time

In the current implementation, the fitness of each candidate join sequence is estimated by running the standard planner's join selection and cost estimation code from scratch. To the extent that different candidates use similar sub-sequences of joins, a great deal of work will be repeated. This could be made significantly faster by retaining cost estimates for sub-joins. The problem is to avoid expending unreasonable amounts of memory on retaining that state.

At a more basic level, it is not clear that solving query optimization with a GA algorithm designed for TSP is appropriate. In the TSP case, the cost associated with any substring (partial tour) is independent of the rest of the tour, but this is certainly not true for query optimization. Thus it is questionable whether edge recombination crossover is the most effective mutation procedure.

Genetic Query OptimizerGenetic Algorithms

50.2. Genetic Algorithms

The genetic algorithm (GA) is a heuristic optimization method whichoperates through randomized search. The set of possible solutions for theoptimization problem is considered as apopulation of individuals.The degree of adaptation of an individual to its environment is specifiedby its fitness.

The coordinates of an individual in the search space are representedby chromosomes, in essence a set of characterstrings. A gene is asubsection of a chromosome which encodes the value of a single parameterbeing optimized. Typical encodings for a gene could be binary orinteger.

Through simulation of the evolutionary operations recombination,mutation, andselection new generations of search points are foundthat show a higher average fitness than their ancestors.

According to the comp.ai.genetic FAQ it cannot be stressed toostrongly that a GA is not a pure random search for a solution to aproblem. A GA uses stochastic processes, but the result is distinctlynon-random (better than random).

Figure 50-1. Structured Diagram of a Genetic Algorithm

P(t)generation of ancestors at a time t
P''(t)generation of descendants at a time t
+=========================================+
|>>>>>>>>>>>  Algorithm GA  <<<<<<<<<<<<<<|
+=========================================+
| INITIALIZE t := 0   |
+=========================================+
| INITIALIZE P(t)|
+=========================================+
| evaluate FITNESS of P(t) |
+=========================================+
| while not STOPPING CRITERION do    |
|   +-------------------------------------+
|   | P'(t)  := RECOMBINATION{P(t)}  |
|   +-------------------------------------+
|   | P''(t) := MUTATION{P'(t)} |
|   +-------------------------------------+
|   | P(t+1) := SELECTION{P''(t) + P(t)}  |
|   +-------------------------------------+
|   | evaluate FITNESS of P''(t)|
|   +-------------------------------------+
|   | t := t + 1 |
+===+=====================================+

Genetic Query OptimizerQuery Handling as a Complex Optimization Problem

50.1. Query Handling as a Complex Optimization Problem

Among all relational operators the most difficult one to processand optimize is the join. The number ofpossible query plans grows exponentially with thenumber of joins in the query. Further optimization effort iscaused by the support of a variety of joinmethods (e.g., nested loop, hash join, merge join inPostgreSQL) to process individual joinsand a diversity of indexes (e.g.,B-tree, hash, GiST and GIN in PostgreSQL) asaccess paths for relations.

The normal PostgreSQL query optimizerperforms a near-exhaustive search over thespace of alternative strategies. This algorithm, first introducedin IBM's System R database, produces a near-optimal join order,but can take an enormous amount of time and memory space when thenumber of joins in the query grows large. This makes the ordinaryPostgreSQL query optimizerinappropriate for queries that join a large number of tables.

The Institute of Automatic Control at the University of Mining andTechnology, in Freiberg, Germany, encountered some problems whenit wanted to use PostgreSQL as thebackend for a decision support knowledge based system for themaintenance of an electrical power grid. The DBMS needed to handlelarge join queries for the inference machine of the knowledgebased system. The number of joins in these queries made using thenormal query optimizer infeasible.

In the following we describe the implementation of agenetic algorithm to solve the joinordering problem in a manner that is efficient for queriesinvolving large numbers of joins.

Native Language SupportFor the Programmer

48.2. For the Programmer

48.2.1. Mechanics

This section describes how to implement native language support in a program or library that is part of the PostgreSQL distribution. Currently, it only applies to C programs.

Adding NLS support to a program

  • Insert this code into the start-up sequence of the program:

    #ifdef ENABLE_NLS
    #include <locale.h>
    #endif
    

    ...

    #ifdef ENABLE_NLS
    setlocale(LC_ALL, "");
    bindtextdomain("progname", LOCALEDIR);
    textdomain("progname");
    #endif

    (The progname can actually be chosen freely.)

  • Wherever a message that is a candidate for translation is found, a call to gettext() needs to be inserted. E.g.:

    fprintf(stderr, "panic level %d\n", lvl);

    would be changed to:

    fprintf(stderr, gettext("panic level %d\n"), lvl);

    (gettext is defined as a no-op if NLS support is not configured.)

    This tends to add a lot of clutter. One common shortcut is to use:

    #define _(x) gettext(x)

    Another solution is feasible if the program does much of its communication through one or a few functions, such as ereport() in the backend. Then you make this function call gettext internally on all input strings.

  • Add a file nls.mk in the directory with the program sources. This file will be read as a makefile. The following variable assignments need to be made here:

    CATALOG_NAME

    The program name, as provided in thetextdomain() call.

    AVAIL_LANGUAGES

    List of provided translations — initially empty.

    GETTEXT_FILES

    List of files that contain translatable strings, i.e., thosemarked with gettext or an alternativesolution. Eventually, this will include nearly all sourcefiles of the program. If this list gets too long you canmake the first "file" be a +and the second word be a file that contains one file name perline.

    GETTEXT_TRIGGERS

    The tools that generate message catalogs for the translatorsto work on need to know what function calls containtranslatable strings. By default, onlygettext() calls are known. If you used_ or other identifiers you need to listthem here. If the translatable string is not the firstargument, the item needs to be of the formfunc:2 (for the second argument).If you have a function that supports pluralized messages,the item should look like func:1,2(identifying the singular and plural message arguments).

  • The build system will automatically take care of building and installing the message catalogs.

    48.2.2. Message-writing guidelines

    Here are some guidelines for writing messages that are easily translatable.

    • Do not construct sentences at run-time, like:

      printf("Files were %s.\n", flag ? "copied" : "removed");

      The word order within the sentence might be different in other languages. Also, even if you remember to call gettext() on each fragment, the fragments might not translate well separately. It's better to duplicate a little code so that each message to be translated is a coherent whole. Only numbers, file names, and such-like run-time variables should be inserted at run time into a message text.

    • For similar reasons, this won't work:

      printf("copied %d file%s", n, n!=1 ? "s" : "");

      because it assumes how the plural is formed. If you figured you could solve it like this:

      if (n==1)printf("copied 1 file");
      elseprintf("copied %d files", n):

      then be disappointed. Some languages have more than two forms, with some peculiar rules. It's often best to design the message to avoid the issue altogether, for instance like this:

      printf("number of copied files: %d", n);

      If you really want to construct a properly pluralized message, there is support for this, but it's a bit awkward. When generating a primary or detail error message in ereport(), you can write something like this:

      errmsg_plural("copied %d file","copied %d files",n,n)

      The first argument is the format string appropriate for English singular form, the second is the format string appropriate for English plural form, and the third is the integer control value that determines which plural form to use. Subsequent arguments are formatted per the format string as usual. (Normally, the pluralization control value will also be one of the values to be formatted, so it has to be written twice.) In English it only matters whether n is 1 or not 1, but in other languages there can be many different plural forms. The translator sees the two English forms as a group and has the opportunity to supply multiple substitute strings, with the appropriate one being selected based on the run-time value of n.

      If you need to pluralize a message that isn't going directly to an errmsg or errdetail report, you have to use the underlying function ngettext. See the gettext documentation.

    • If you want to communicate something to the translator, such as about how a message is intended to line up with other output, precede the occurrence of the string with a comment that starts with translator, e.g.:

      /* translator: This message is not what it seems to be. */

      These comments are copied to the message catalog files so that the translators can see them.