# To unbundle, run this file echo pubuse sed 's/.//' >pubuse <<'//GO.SYSIN DD pubuse' -..... use tbl and troff \-ms -.if \nP=0 .IM -.TL -Updating Publication Lists -.AU -M. E. Lesk -.NH -Introduction. -.PP -.\".if \nP>0 .pn 14 -This note describes several commands to update the -publication lists. -The data base consisting of these lists is kept in -a set of files in -the directory -.I /usr/dict/papers -on the Version 7 -.UX -system. -The reason for having special commands to update these files is -that they are indexed, and the only reasonable way to find the -items to be updated is to use the index. -However, altering the files -destroys the usefulness of the index, -and makes further editing difficult. -So the recommended procedure is to -.IP (1) -Prepare additions, deletions, and changes in separate files. -.IP (2) -Update the data base and reindex. -.LP -Whenever you make changes, etc. it is necessary to run -the ``add & index'' step before logging off; otherwise the -changes do not take effect. -The next section shows the format of the files -in the data base. -After that, the procedures for -preparing additions, preparing changes, preparing deletions, -and updating the public data base are given. -.NH -Publication Format. -.PP -The format of a data base entry is given completely in ``Some Applications -of Inverted Indexes on UNIX'' by M. E. Lesk, -the first part of this report, -.if \nP=0 (also TM 77-1274-17) -and is summarized here via a few examples. -In each example, first the output format for an item is shown, -and then the corresponding data base entry. -.LP -.DS -.ti 0 -Journal article: -.fi -.ll 5i -A. V. Aho, D. J. Hirschberg, and J. D. Ullman, ``Bounds -on the Complexity of the Maximal Common Subsequence Problem,'' -.I -J. Assoc. Comp. Mach., -.R -vol. 23, no. 1, pp. 1-12 (Jan. 1976). -.nf -.ll -.sp -%T Bounds on the Complexity of the Maximal Common -Subsequence Problem -%A A. V. Aho -%A D. S. Hirschberg -%A J. D. Ullman -%J J. Assoc. Comp. Mach. -%V 23 -%N 1 -%P 1-12 -%D Jan. 1976 -.if \nP=0 %M TM 75-1271-7 -.if \nP>0 %M Memo abcd... -.DE -.DS -.ti 0 -Conference proceedings: -.fi -.ll 5i -B. Prabhala and R. Sethi, ``Efficient Computation of Expressions with Common -Subexpressions,'' -.I -Proc. 5th ACM Symp. on Principles of Programming Languages, -.R -pp. 222-230, Tucson, Ariz. (January 1978). -.nf -.ll -.sp -%A B. Prabhala -%A R. Sethi -%T Efficient Computation of Expressions with -Common Subexpressions -%J Proc. 5th ACM Symp. on Principles -of Programming Languages -%C Tucson, Ariz. -%D January 1978 -%P 222-230 -.DE -.DS -.ti 0 -Book: -.fi -.ll 5i -B. W. Kernighan and P. J. Plauger, -.I -Software Tools, -.R -Addison-Wesley, Reading, Mass. (1976). -.nf -.ll -.sp -%T Software Tools -%A B. W. Kernighan -%A P. J. Plauger -%I Addison-Wesley -%C Reading, Mass. -%D 1976 -.DE -.DS -.ti 0 -Article within book: -.fi -.ll 5i -J. W. de Bakker, ``Semantics of Programming Languages,'' -pp. 173-227 in -.I -Advances in Information Systems Science, Vol. 2, -.R -ed. J. T. Tou, Plenum Press, New York, N. Y. (1969). -.nf -.ll -.sp -%A J. W. de Bakker -%T Semantics of programming languages -%E J. T. Tou -%B Advances in Information Systems Science, Vol. 2 -%I Plenum Press -%C New York, N. Y. -%D 1969 -%P 173-227 -.DE -.DS -.ti 0 -Technical Report: -.fi -.ll 5i -F. E. Allen, ``Bibliography on Program Optimization,'' -Report RC-5767, IBM T. J. Watson Research Center, -Yorktown Heights, N. Y. (1975). -.nf -.ll -.sp -%A F. E. Allen -%D 1975 -%T Bibliography on Program Optimization -%R Report RC-5767 -%I IBM T. J. Watson Research Center -%C Yorktown Heights, N. Y. -.DE -.DS -.di xx -.ti 0 -Technical Memorandum: -.fi -.ll 5i -A. V. Aho, B. W. Kernighan and P. J. Weinberg, -``AWK \- Pattern Scanning and Processing Language'', -TM 77-1271-5, TM 77-1273-12, TM 77-3444-1 (1977). -.nf -.ll -.sp -%T AWK \- Pattern Scanning and Processing Language -%A A. V. Aho -%A B. W. Kernighan -%A P. J. Weinberger -%M TM 77-1271-5, TM 77-1273-12, TM 77-3444-1 -%D 1977 -.di -.if \nP=0 .xx -.rm xx -.DE -.LP -Other forms of publication can be entered similarly. -Note that conference -proceedings are entered as if journals, -with the conference name on a -.I %J -line. -This is also sometimes appropriate for obscure publications -such as series of lecture notes. -When something is both a report and an article, or -both a memorandum and an article, enter all necessary information -for both; see the first article above, for example. -Extra information (such as ``In preparation'' or ``Japanese translation'') -should be placed on a line beginning -.I %O . -The most common use of %O lines now is for ``Also in ...'' to give -an additional reference to a secondary appearance of the same paper. -.PP -Some of the possible fields of a citation are: -.TS -c c 5 c c -a l a l . -Letter Meaning Letter Meaning -A Author K Extra keys -B Book including item N Issue number -C City of publication O Other -D Date P Page numbers -E Editor of book R Report number -I Publisher (issuer) T Title of item -J Journal name V Volume number -.TE -Note that -.I %B -is used to indicate the title -of a book containing the article being entered; -when an item is an entire book, the title should -be entered with a -.I %T -as usual. -.PP -Normally, the order of items does not matter. The only exception is -that if there are multiple authors (%A lines) the order of authors -should be that on the paper. -If a line is too long, it may be continued on to the next line; -any line not beginning with % or . (dot) is assumed to be -a continuation of the previous line. -Again, see the first article above for an example of a long title. -Except for authors, do not repeat any items; if two %J lines are -given, for example, the first is ignored. -Multiple items on the same file should be separated by blank lines. -.PP -Note that in formatted printouts of the file, the -exact appearance of the items is determined by -a set of macros and the formatting programs. -Do not try to adjust fonts, punctuation, etc. by editing -the data base; it is wasted effort. In case someone has -a real need for a differently-formatted output, a new set -of macros can easily be generated to provide alternative -appearances of the citations. -.NH -Updating and Re-indexing. -.PP -This section describes the commands that are used to manipulate -and change the data base. -It explains the procedures for (a) finding references in the data base, -(b) adding new references, (c) changing existing references, and (d) -deleting references. -Remember that all changes, additions, and deletions are done by preparing -separate files and then running an `update and reindex' step. -.PP -.I -Checking what's there now. -.R -Often you will want to know what is currently in the data base. -There is a special command -.I lookbib -to look for things and print them -out. -It searches for articles based on words in the title, or the author's name, -or the date. -For example, you could find the first paper above with -.DS -lookbib aho ullman maximal subsequence 1976 -.DE -or -.DS -lookbib aho ullman hirschberg -.DE -.LP -If you don't give enough words, several items will be found; -if you spell some wrong, nothing will be found. -There are around 4300 papers in the public file; you should -always use this command to check when you are not sure -whether a certain paper is there or not. -.PP -.I -Additions. -.R -To add new papers, just type in, on one or more files, the citations -for the new papers. -Remember to check first if the papers are already in the data base. -For example, if a paper has a previous memo version, this should -be treated as a change to an existing entry, rather than -a new entry. -If several new papers are being typed on the same file, be -sure that there is a blank line between each two papers. -.PP -.I -Changes. -.R -To change an item, it should be extracted onto a file. -This is done with the command -.DS -pub.chg key1 key2 key3 ... -.DE -where the items key1, key2, key3, etc. are -a set of keys that will find the paper, -as in the -.I lookbib -command. -That is, if -.DS -lookbib johnson yacc cstr -.DE -will find a item (to, in this case, Computing Science Technical Report -No. 32, ``YACC: Yet Another Compiler-Compiler,'' -by S. C. Johnson) -then -.DS -pub.chg johnson yacc cstr -.DE -will permit you to edit the item. -The -.I pub.chg -command -extracts the item onto a file named ``bibxxx'' where ``xxx'' -is a 3-digit number, e.g. ``bib234''. -The command will print the file name it has chosen. -If the set of keys finds more than one paper (or no papers) an -error message is printed and no file is written. -Each reference to be changed must be extracted with a separate -.I pub.chg -command, and each will be placed on a separate file. -You should then edit the ``bibxxx'' file as desired to change the item, -using the UNIX editor. -Do not delete or change the first line of the file, however, which begins -.I %# -and is a special code line to tell the update program -which item is being altered. -You may delete or change other lines, or add lines, as you wish. -The changes are not actually made in the public data -base until you run the update command -.I pub.run -(see below). -Thus, if after extracting an item and modifying it, you decide -that you'd rather leave things as they were, delete the -``bibxxx'' file, and your change request will disappear. -.PP -.I -Deletions. -.R -To delete an entry from the data base, -type the command -.DS -pub.del key1 key2 key3 ... -.DE -where the items key1, key2, etc. are a set -of keys that will find the paper, as with the -.I lookbib -command. -That is, if -.DS -lookbib aho hirschberg ullman -.DE -will find a paper, -.DS -pub.del aho hirschberg ullman -.DE -deletes it. -Upper and lower case are equivalent in keys; -the command -.DS -pub.del Aho Hirschberg Ullman -.DE -is an equivalent -.I pub.del -command. -The -.I pub.del -command will print the entry being deleted. -It also gives the name of a ``bibxxx'' file on which the deletion -command is stored. -The actual deletion is not done until the changes, additions, etc. -are processed, as with the -.I pub.chg -command. -If, after seeing the item to be deleted, you change your -mind about throwing it away, delete the ``bibxxx'' file -and the delete request disappears. -Again, if the list of keys does not uniquely identify one paper, -an error message is given. -.PP -Remember that the default versions of the commands described here -edit a public data base. -Do not delete -items unless you are sure deletion is proper; usually this -means that there are duplicate entries for the same paper. -Otherwise, view requests for deletion with skepticism; even -if one person has no need for a particular item in the data base, -someone else may want it there. -.PP -If an item is correct, but should not appear in the ``List of Publications'' -as normally produced, add the line -.DS -%K DNL -.DE -to the item. -This preserves the item intact, but implies ``Do Not List'' to the -to the commands that print publication lists. -The DNL line is normally used for some technical reports, -minor memoranda, or other -low-grade publications. -.PP -.I -Update and reindex. -.R -When you have completed a session of changes, you should -type the command -.DS -pub.run file1 file2 ... -.DE -where the names ``file1'', ... are the new files of additions you -have prepared. -You need not list the ``bibxxx'' files representing changes and -deletions; they are processed automatically. -All of the new items are edited into the standard -public data base, and then a new index is made. This process -takes about one minute of processor time. -The index is not made by re-analyzing and re-sorting the -entire data base; the new or changed items are indexed -in the usual way and then merged with the previous data files. -A complete re-index would be much slower for small changes. -.PP -Normally, you should execute -.I pub.run -just before you logoff after performing -some edit requests. -However, if you don't, the various change request files remain -in your directory -until you finally do execute -.I pub.run. -When the changes are processed, the ``bibxxx'' files are deleted. -It is not desirable to wait too long before processing changes, -however, to avoid conflicts with someone else who wishes to change -the same file. -If executing -.I pub.run -produces the message ``File bibxxx too old'' -it means that someone else has been editing the same file -between the time you prepared your changes, and the time you typed -.I pub.run. -You must delete such old change files and -re-enter them. -.PP -Note that although -.I pub.run -discards the ``bibxxx'' files after -processing them, your files of additions are left around -even after -.I pub.run -is finished. -If they were typed in only for purposes of -updating the data base, you may delete them -after they have been processed by -.I pub.run. -.PP -.I -Example. -.R -Suppose, for example, that you wish to -.IP (1) -Add to the data base the memos ``The Dilogarithm Function of a Real Argument'' -by R. Morris, and -``UNIX Software Distribution by Communication Link,'' -by M. E. Lesk and A. S. Cohen; -.IP (2) -Delete from the data base the item -``Cheap Typesetters'', by M. E. Lesk, -SIGLASH Newsletter, 1973; -and -.IP (3) -Change ``J. Assoc. Comp. Mach.'' to ``Jour. ACM'' in the citation -for Aho, Hirschberg, and Ullman shown above. -.LP -The procedure would be as follows. -First, you would make a file -containing the additions, -here called ``new.1'', in the normal way using the UNIX editor. -In the script shown below, the computer prompts are -.if n underlined. -.if t in italics. -.DS -.if n _$ ed new.1 -.if t \f2$\f1 ed new.1 -.if n _? -.if t \f2?\f1 -a -%T The Dilogarithm Function of a Real Argument -%A Robert Morris -.if \nP=0 %M TM 78-1271-1 -.if \nP>0 %M abcd -%D 1978 - -%T UNIX Software Distribution by Communication Link -%A M. E. Lesk -%A A. S. Cohen -.if \nP=0 %M TM 78-1274-1, 78-8234-1 -.if \nP>0 %M abcd -%D 1978 -. -w new.1 -.if n _1_9_9 -.if t \f2199\f1 -q -.DE -Next you would specify the deletion, which would be done with the -.I pub.del -command: -.DS -.if n _$ pub.del lesk cheap typesetters siglash -.if t \f2$\f1 pub.del lesk cheap typesetters siglash -.ti 0 -to which the computer responds: - -.if n W_i_l_l_ d_e_l_e_t_e_:_ (_f_i_l_e_ b_i_b_1_7_6_)_ -.if t \f2Will delete: (file bib176)\f1 - -.if n %_T_ C_h_e_a_p_ T_y_p_e_s_e_t_t_e_r_s_ -.if t \f2%T Cheap Typesetters\f1 -.if n %_A_ M_._ E_._ L_e_s_k_ -.if t \f2%A M. E. Lesk\f1 -.if n %_J_ A_C_M_ S_I_G_L_A_S_H_ N_e_w_s_l_e_t_t_e_r_ -.if t \f2%J ACM SIGLASH Newsletter\f1 -.if n %_V_ 6_ -.if t \f2%V 6\f1 -.if n %_N_ 4_ -.if t \f2%N 4\f1 -.if n %_P_ 1_4_-_1_6_ -.if t \f2%P 14-16\f1 -.if n %_D_ O_c_t_o_b_e_r_ 1_9_7_3_ -.if t \f2%D October 1973\f1 -.DE -And then you would extract the Aho, Hirschberg and Ullman paper. -The dialogue involved is shown below. -First run -.I pub.chg -to extract the paper; it responds by printing -the citation and informing you that it was placed on file \f2bib123\f1. -That file is then edited. -.DS -.if n _$ pub.chg aho hirschberg ullman -.if t \f2$\f1 pub.chg aho hirschberg ullman -.if n _E_x_t_r_a_c_t_i_n_g _a_s _f_i_l_e _b_i_b_1_2_3 -.if t \f2Extracting as file bib123\f1 -.if n _%_T _B_o_u_n_d_s _o_n _t_h_e _C_o_m_p_l_e_x_i_t_y _o_f _t_h_e _M_a_x_i_m_a_l -.if t \f2%T Bounds on the Complexity of the Maximal\f1 -.if n _C_o_m_m_o_n _S_u_b_s_e_q_u_e_n_c_e _P_r_o_b_l_e_m -.if t \f2Common Subsequence Problem\f1 -.if n _%_A _A_. _V_. _A_h_o -.if t \f2%A A. V. Aho\f1 -.if n _%_A _D_. _S_. _H_i_r_s_c_h_b_e_r_g -.if t \f2%A D. S. Hirschberg\f1 -.if n _%_A _J_. _D_. _U_l_l_m_a_n -.if t \f2%A J. D. Ullman\f1 -.if n _%_J _J_. _A_s_s_o_c_. _C_o_m_p_. _M_a_c_h_. -.if t \f2%J J. Assoc. Comp. Mach.\f1 -.if n _%_V _2_3 -.if t \f2%V 23\f1 -.if n _%_N _1 -.if t \f2%N 1\f1 -.if n _%_P _1_-_1_2 -.if t \f2%P 1-12\f1 -.if \nP=0 .if n _%_M _T_M _7_5_-_1_2_7_1_-_7 -.if \nP>0 .if n %_M_ M_e_m_o_ n_u_m_b_e_r_ -.if \nP=0 .if t \f2%M TM 75-1271-7\f1 -.if \nP>0 .if t \f2%M abcd\f1 -.if n _%_D _J_a_n_. _1_9_7_6 -.if t \f2%D Jan. 1976\f1 - -.if n _$ ed bib123 -.if t \f2$\f1 ed bib123 -.if n _3_1_2 -.if t \f2312\f1 -/Assoc/s/ J/ Jour/p -.if n _%_J _J_o_u_r_. _A_s_s_o_c_. _C_o_m_p_. _M_a_c_h_. -.if t \f2%J Jour. Assoc. Comp. Mach.\f1 -s/Assoc.*/ACM/p -.if n _%_J _J_o_u_r_. _A_C_M -.if t \f2%J Jour. ACM\f1 -1,$p -.if n _%_# _/_u_s_r_/_d_i_c_t_/_p_a_p_e_r_s_/_p_7_6 _2_3_3 _2_4_5 _c_h_a_n_g_e -.if t \f2%# /usr/dict/papers/p76 233 245 change\f1 -.if n _%_T _B_o_u_n_d_s _o_n _t_h_e _C_o_m_p_l_e_x_i_t_y _o_f _t_h_e _M_a_x_i_m_a_l -.if t \f2%T Bounds on the Complexity of the Maximal\f1 -.if n _C_o_m_m_o_n _S_u_b_s_e_q_u_e_n_c_e _P_r_o_b_l_e_m -.if t \f2Common Subsequence Problem\f1 -.if n _%_A _A_. _V_. _A_h_o -.if t \f2%A A. V. Aho\f1 -.if n _%_A _D_. _S_. _H_i_r_s_c_h_b_e_r_g -.if t \f2%A D. S. Hirschberg\f1 -.if n _%_A _J_. _D_. _U_l_l_m_a_n -.if t \f2%A J. D. Ullman\f1 -.if n _%_J _J_o_u_r_. _A_C_M -.if t \f2%J Jour. ACM\f1 -.if n _%_V _2_3 -.if t \f2%V 23\f1 -.if n _%_N _1 -.if t \f2%N 1\f1 -.if n _%_P _1_-_1_2 -.if t \f2%P 1-12\f1 -.if \nP=0 .if n _%_M _T_M _7_5_-_1_2_7_1_-_7 -.if \nP>0 .if n _%_M _M_e_m_o _n_u_m_b_e_r -.if \nP=0 .if t \f2%M TM 75-1271-7\f1 -.if \nP>0 .if t \f2%M abcd\f1 -.if n _%_D _J_a_n_. _1_9_7_6 -.if t \f2%D Jan. 1976\f1 - -w -.if n _2_9_2 -.if t \f2292\f1 -q -.if n _$ -.if t \f2$\f1 -.DE -Finally, execute -.I pub.run , -making sure to remember that you -have prepared a new file ``new.1'': -.DS -\f2$\f1 pub.run new.1 -.DE -Currently, this takes about 1 minute of 11/70 processor time. -.NH -Printing a Publication List -.PP -There are two commands for printing a publication list, -depending on whether you want to print one person's list, -or the list of many people. -To print a list for one person, use the -.I pub.indiv -command: -.DS -pub.indiv M Lesk -.DE -This runs off the list for M. Lesk and puts it in file ``output''. -Note that no `.' is given after the initial. -In case of ambiguity two initials can be used. -Similarly, to get the list for group of people, say -.DS -pub.org xxx -.DE -which prints all the publications of the members of organization -.I xxx , -taking the names for the list in the file -.I /usr/dict/papers/centlist/xxx . -This command should normally be run in the background; it takes -perhaps 15 minutes. -Two options are available with these commands: -.DS -pub.indiv \-p M Lesk -.DE -prints only the papers, leaving out unpublished notes, patents, etc. -Also -.DS -pub.indiv \-t M Lesk | gcat -.DE -prints a typeset copy, instead of a computer printer copy. -In this case it has been directed to an alternate typesetter with the -`gcat' command. -These options may be used together, and may be used with the -.I pub.org -command as well. -For example, to print -only the papers for all of organization zzz and typeset them, -you could type -.DS -pub.center \-t \-p zzz | gcat & -.DE -These publication lists are printed double column with a citation style -taken from a set of publication list macros; the macros, of course, can be -changed easily to adjust the format of the lists. //GO.SYSIN DD pubuse echo refer sed 's/.//' >refer <<'//GO.SYSIN DD refer' -.... refer | tbl | nroff -ms -.de UC -\\s-2\\$1\\s0\\$2 -.. -.ds . \&\s+2.\s0 -.if t .ds -- \(em -.if n .ds -- -- -.TR 69 -.TM 77-1274-17 39199 39199-11 -.ND October 27, 1977 -.ND June 21, 1978 -.TL -Some Applications of Inverted Indexes on the UNIX System -.AU "MH 2C-572" 6377 -M. E. Lesk -.AI -.MH -.AB -.LP -.ft B -I. Some Applications of Inverted Indexes \- Overview -.ft R -.PP -This memorandum describes a set of programs which -make inverted indexes to -UNIX* -text files, and their -application to -retrieving and formatting citations for documents prepared using -.I troff. -.PP -These indexing and searching programs make keyword -indexes to volumes of material too large for linear searching. -Searches for combinations of single words can be performed quickly. -The programs are divided into -two phases. The first makes an index from the original -data; the second searches the index and retrieves -items. -Both of these phases are further divided into two parts -to separate the data-dependent and algorithm dependent -code. -.PP -The major current application of these programs is -the -.I troff -preprocessor -.I refer. -A list of 4300 references is maintained on line, -containing primarily papers written and cited by -local authors. -Whenever one of these references is required -in a paper, a few words from the title or author list -will retrieve it, and the user need not bother to re-enter -the exact citation. -Alternatively, authors can use their own lists of papers. -.PP -This memorandum is of interest to -those who are interested in facilities for searching large -but relatively unchanging text files on -the -UNIX -system, -and those who are interested in handling bibliographic -citations with -UNIX -.I troff. -.LP -.ft B -II. Updating Publication Lists -.PP -This section is a brief note describing the -auxiliary programs for managing the updating -processing. -It is written to aid clerical users in -maintaining lists of references. -Primarily, the programs described permit a large -amount of individual control over the content -of publication lists while retaining the -usefulness of the files to other users. -.LP -.ft B -III. Manual Pages -.PP -This section contains the pages from the -UNIX programmer's manual -for the -.I lookall, -.I pubindex, -and -.I refer -commands. -It is useful for reference. -.sp -\l'3i' -.br -* UNIX is a Trademark of Bell Laboratories. -.AE -.CS 10 4 14 0 0 4 -.NH -Introduction. -.PP -The -.UX -system -has many utilities -(e.g. \fIgrep, awk, lex, egrep, fgrep, ...\fR) -to search through files of text, -but most of them are based on a linear scan through the -entire file, using some deterministic automaton. -.ev 1 -.ps 8 -.vs 10p -.ev -This memorandum discusses a program which uses inverted -indexes -.[ -%A D. Knuth -%T The Art of Computer Programming: Vol. 3, Sorting and Searching -%I Addison-Wesley -%C Reading, Mass. -%D 1977 -%O See section 6.5. -.] -and can thus be used on much larger data bases. -.PP -As with any indexing system, of course, there are some disadvantages; -once an index is made, the files that have been indexed can not be changed -without remaking the index. -Thus applications are restricted -to those making many searches -of relatively stable data. -Furthermore, these programs depend on hashing, and can only -search for exact matches of whole keywords. -It is not possible to look for -arithmetic or logical expressions (e.g. ``date greater than 1970'') or -for regular expression searching such as that in -.I lex. -.[ -lex lesk cstr -.] -.PP -Currently there are two uses of this software, -the -.I refer -preprocessor to format references, -and the -.I lookall -command to search through all text files on -the -.UX -system. -.PP -The remaining sections of this memorandum discuss -the searching programs and their uses. -Section 2 explains the operation of the searching algorithm and describes -the data collected for use with the -.I lookall -command. -The more important application, -.I refer -has a user's description in section 3. -Section 4 goes into more detail on -reference files -for the benefit of those who -wish to add references to data bases or -write new -.I troff -macros for use with -.I refer. -The options to make -.I refer -collect identical citations, or otherwise relocate and adjust references, -are described in section 5. -The -.UX -manual sections for -.I "refer, lookall," -and associated commands are attached as appendices. -.NH -Searching. -.PP -The indexing and searching process is divided into two phases, -each made of two parts. -These are -shown below. -.IP A. -Construct the index. -.RS -.IP (1) -Find keys \*(-- turn the input files into a sequence of tags and keys, -where each tag identifies a distinct item in the input -and the keys for each such item are the strings under which it is -to be indexed. -.IP (2) -Hash and sort \*(-- -prepare a set of inverted indexes from which, given a set of keys, -the appropriate item tags can be found quickly. -.RE -.IP B. -Retrieve an item in response to a query. -.RS -.IP (3) -Search \*(-- -Given some keys, look through the files prepared by the hashing -and sorting facility and derive the appropriate tags. -.IP (4) -Deliver \*(-- -Given the tags, find the original items. This completes the -searching process. -.RE -.LP -The first phase, making the index, is presumably done relatively infrequently. -It should, of course, be done whenever the data being -indexed change. -In contrast, the second phase, retrieving items, -is presumably done often, and must be rapid. -.PP -An effort is made to separate code which depends on the data -being handled from code which depends on the searching procedure. -The search algorithm is involved only in steps -(2) and (3), while knowledge of the actual data files is -needed only by steps (1) and (4). -Thus it is easy to adapt to different data files or different -search algorithms. -.PP -To start with, it is necessary to have some way of selecting -or generating keys from input files. -For dealing with files that are basically English, we have -a key-making program which automatically selects words -and passes them to the hashing and sorting program (step 2). -The format used has one line for each input item, -arranged -as follows: -.DS -name:start,length (tab) key1 key2 key3 ... -.DE -where -.I name -is the file name, -.I start -is the starting byte number, -and -.I length -is the number of bytes in the entry. -.PP -These lines are the only input used to make the -index. -The first field (the file name, byte position, and byte count) -is the tag of the item -and can be used to retrieve it quickly. -Normally, an item is either a whole file or a section of a file -delimited by blank lines. -After the tab, the second field contains the keys. -The keys, if selected by the automatic program, are -any alphanumeric strings which -are not among the 100 most frequent words in English -and which are not entirely numeric (except for four-digit -numbers beginning 19, which are accepted as dates). -Keys are truncated to six characters and converted to lower case. -Some selection is needed if the original items are very large. -We normally just take the first -.I n -keys, with -.I n -less than 100 or so; this replaces any attempt at intelligent selection. -One file in our system is -a complete English dictionary; it would presumably be retrieved for all queries. -.PP -To generate an inverted index to the list of record tags and keys, -the keys -are hashed -and sorted to produce an index. -What is wanted, ideally, is a series of lists showing the tags associated -with each key. -To condense this, -what is actually produced is a list showing the tags associated -with each hash code, and thus with some set of keys. -To speed up access and further save space, -a set of three or possibly four files is produced. -These files are: -.KS -.bd 2 2 -.TS -center; -c c -lI l. -File Contents -entry Pointers to posting file - for each hash code -posting Lists of tag pointers for - each hash code -tag Tags for each item -key Keys for each item - (optional) -.TE -.bd 2 -.KE -The posting file comprises the real data: it contains a sequence of lists -of items posted under each hash code. To speed up searching, -the entry file is an array of pointers into the posting file, one per potential -hash code. -Furthermore, the items in the lists in the posting file are not referred to by their -complete tag, but just by an address in the tag file, which -gives the complete tags. -The key file is optional and contains a copy of the keys -used in the indexing. -.PP -The searching process starts with a query, containing several keys. -The goal is to obtain all items which were indexed under these keys. -The query keys are hashed, and the pointers in the entry file used -to access the lists in the posting file. These lists -are addresses in the tag file of documents posted under the -hash codes derived from the query. -The common items from all lists are determined; -this must include the items indexed by every key, but may also -contain some items which are false drops, since items referenced by -the correct hash codes need not actually have contained the correct keys. -Normally, if there are several keys in the query, there are not -likely to be many false drops in the final combined list even though -each hash code is somewhat ambiguous. -The actual tags are then obtained from the tag file, and to guard against -the possibility that an item has false-dropped on some hash code -in the query, the original items are normally obtained from the delivery -program (4) and the query keys checked against them -by string comparison. -.PP -Usually, therefore, the check for bad drops is made against the original file. -However, if the key derivation procedure is complex, it may be preferable -to check against the keys fed to program (2). -In this case the optional key file which contains the -keys associated with each item is generated, and the item tag is supplemented -by a string -.DS -;start,length -.DE -which indicates the starting byte number in the key file and the length of -the string of keys for each item. -This file is not usually necessary with the present -key-selection program, since the keys -always appear in the original document. -.PP -There is also an option -(\f3-C\f2n\|\f1) -for coordination level searching. -This retrieves items which match all but -.I n -of the query keys. -The items are retrieved in the order of the number -of keys that they match. -Of course, -.I n -must be less than the number of query keys (nothing is -retrieved unless it matches at least one key). -.PP -As an example, consider one set of 4377 references, comprising -660,000 bytes. -This included 51,000 keys, of which 5,900 were distinct -keys. -The hash table is kept full to save space (at the expense of time); -995 of 997 possible hash codes were used. -The total set of index files (no key file) included 171,000 bytes, -about 26% of the original file size. -It took 8 minutes of processor time to -hash, sort, and write the index. -To search for a single query with the resulting index took 1.9 seconds -of processor time, -while to find the same paper -with a sequential linear search -using -.I grep -(reading all of the tags and keys) -took 12.3 seconds of processor time. -.PP -We have also used this software to index all of the English stored on our -.UX -system. -This is the index searched by the -.I lookall -command. -On a typical day there were -29,000 files in our user file system, containing about 152,000,000 -bytes. -Of these -5,300 files, containing 32,000,000 bytes (about 21%) -were English text. -The total number of `words' (determined mechanically) -was 5,100,000. -Of these 227,000 were selected as keys; -19,000 were distinct, hashing to 4,900 (of 5,000 possible) different hash codes. -The -resulting inverted file indexes used 845,000 bytes, or about -2.6% of the size of the original files. -The particularly small indexes are caused by the -fact that keys are taken from only the first 50 non-common words of -some very long input files. -.PP -Even this large \f2lookall\f1 index can be searched quickly. -For example, to find this document -by looking for the keys -``lesk inverted indexes'' -required -1.7 seconds of processor time -and system time. -By comparison, just to search the 800,000 byte dictionary (smaller than even -the inverted indexes, let alone the 32,000,000 bytes of text files) with -.I grep -takes 29 seconds of processor time. -The -.I lookall -program is thus useful when looking for a document which you believe -is stored on-line, but do not know where. For example, many memos -from the Computing Science Research Center are in its -.UX -file system, but -it is often -difficult to guess where a particular memo might be (it might have several -authors, each with many directories, and have been worked on by -a secretary with yet more directories). -Instructions for the use of the -.I lookall -command are given in the manual section, shown -in the appendix to this memorandum. -.PP -The only indexes maintained routinely are those of publication lists and -all English files. -To make other indexes, the programs for making keys, sorting them, -searching the indexes, and delivering answers must be used. -Since they are usually invoked as parts of higher-level commands, -they are not in the default command -directory, but are available to any user in the directory -.I /usr/lib/refer . -Three programs are of interest: -.I mkey , -which isolates keys from input files; -.I inv , -which makes an index from a set of keys; -and -.I hunt , -which searches the index and delivers the items. -Note that the two parts of the retrieval phase are combined into -one program, to avoid the excessive system work and delay which -would result from running these as separate processes. -.PP -These three commands have a large number of options to adapt to different -kinds of input. -The user not interested in the detailed description that now follows may -skip to section 3, which describes the -.I refer -program, a packaged-up version of these tools specifically -oriented towards formatting references. -.PP -.B -Make Keys. -.R -The program -.I mkey -is the key-making program corresponding to step (1) in phase A. -Normally, it reads its input from the file names given as arguments, -and if there are no arguments it reads from the standard input. -It assumes that blank lines in the input delimit -separate items, for each of which a different line of -keys should be generated. -The lines of keys are written on the standard output. -Keys are any alphanumeric string in the input not -among the most frequent words in English and not entirely numeric -(except that all-numeric strings are acceptable if they -are between 1900 and 1999). -In the output, keys are translated to lower case, and truncated -to six characters in length; any associated punctuation is removed. -The following flag arguments are recognized by -.I mkey: -.TS -center; -lB lw(4i). -\-c \f2name T{ -Name of file of common words; -default is -.I /usr/lib/eign. -T} -\-f \f2name T{ -Read a list of files from -.I name -and take each as an input argument. -T} -\-i \f2chars T{ -Ignore all lines which begin with `%' followed by any character -in -.I chars . -T} -\-k\f2n T{ -Use at most -.I n -keys per input item. -T} -\-l\f2n T{ -Ignore items shorter than -.I n -letters long. -T} -\-n\f2m T{ -Ignore as a key any word in the first -.I m -words of the list of common English words. -The default is 100. -T} -\-s T{ -Remove the labels -.I (file:start,length) -from the output; just give the keys. -Used when searching rather than indexing. -T} -\-w T{ -Each whole file is a separate item; -blank lines in files are irrelevant. -T} -.TE -.PP -The normal arguments for indexing references are -the defaults, which are -.I "\-c /usr/lib/eign" , -.I \-n100 , -and -.I \-l3 . -For searching, the -.I \-s -option is also needed. -When the big -.I lookall -index of all English files is run, -the options are -.I \-w , -.I \-k50 , -and -.I "\-f (filelist)" . -When running on textual input, -the -.I mkey -program processes about 1000 English words per processor second. -Unless the -.I \-k -option is used (and the input files are long enough for -it to take effect) -the output of -.I mkey -is comparable in size to its input. -.PP -.B -Hash and invert. -.R -The -.I inv -program computes the hash codes and writes -the inverted files. -It reads the output of -.I mkey -and writes the set of files described earlier -in this section. -It expects one argument, which is used as the base name for -the three (or four) files to be written. -Assuming an argument of -.I Index -(the default) -the entry file is named -.I Index.ia , -the posting file -.I Index.ib , -the tag file -.I Index.ic , -and the key file (if present) -.I Index.id . -The -.I inv -program recognizes the following options: -.TS -center; -lB lw(4i). -\-a T{ -Append the new keys to a previous set of inverted files, -making new files if there is no old set using the same base name. -T} -\-d T{ -Write the optional key file. -This is needed when you can not check for false drops by looking -for the keys in the original inputs, i.e. when the key derivation -procedure is complicated and -the output keys are not words from the input files. -T} -\-h\f2n T{ -The hash table size is -.I n -(default 997); -.I n -should be prime. -Making \f2n\f1 bigger saves search time and spends disk space. -T} -\-i[u] \f2name T{ -Take input from file -.I name , -instead of the standard input; -if -.B u -is present -.I name -is unlinked when the sort is started. -Using this option permits the sort scratch space -to overlap the disk space used for input keys. -T} -\-n T{ -Make a completely new set of inverted files, ignoring -previous files. -T} -\-p T{ -Pipe into the sort program, rather than writing a temporary -input file. -This saves disk space and spends processor time. -T} -\-v T{ -Verbose mode; print a summary of the number of keys which -finished indexing. -T} -.TE -.PP -About half the time used in -.I inv -is in the contained sort. -Assuming the sort is roughly linear, however, -a guess at the total timing for -.I inv -is 250 keys per second. -The space used is usually of more importance: -the entry file uses four bytes per possible hash (note -the -.B \-h -option), -and the tag file around 15-20 bytes per item indexed. -Roughly, the posting file contains one item for each key instance -and one item for each possible hash code; the items are two bytes -long if the tag file is less than 65336 bytes long, and the -items are four bytes wide if the tag file is greater than -65536 bytes long. -To minimize storage, the hash tables should be -over-full; -for most of the files indexed in this way, there is no -other real choice, since the -.I entry -file must fit in memory. -.PP -.B -Searching and Retrieving. -.R -The -.I hunt -program retrieves items from an index. -It combines, as mentioned above, the two parts of phase (B): -search and delivery. -The reason why it is efficient to combine delivery and search -is partly to avoid starting unnecessary processes, and partly -because the delivery operation must be a part of the search -operation in any case. -Because of the hashing, the search part takes place in two stages: -first items are retrieved which have the right hash codes associated with them, -and then the actual items are inspected to determine false drops, i.e. -to determine if anything with the right hash codes doesn't really have the right -keys. -Since the original item is retrieved to check on false drops, -it is efficient to present it immediately, rather than only -giving the tag as output and later retrieving the -item again. -If there were a separate key file, this argument would not apply, -but separate key files are not common. -.PP -Input to -.I hunt -is taken from the standard input, -one query per line. -Each query should be in -.I "mkey \-s" -output format; -all lower case, no punctuation. -The -.I hunt -program takes one argument which specifies the base name of the index -files to be searched. -Only one set of index files can be searched at a time, -although many text files may be indexed as a group, of course. -If one of the text files has been changed since the index, that file -is searched with -.I fgrep; -this may occasionally slow down the searching, and care should be taken to -avoid having many out of date files. -The following option arguments are recognized by -.I hunt: -.TS -center; -lB lw(4i). -\-a T{ -Give all output; ignore checking for false drops. -T} -\-C\f2n T{ -Coordination level -.I n; -retrieve items with not more than -.I n -terms of the input missing; -default -.I C0 , -implying that each search term must be in the output items. -T} -\-F[yn\f2d\f3\|] T{ -``\-Fy'' gives the text of all the items found; -``\-Fn'' suppresses them. -``\-F\f2d\|\f1'' where \f2d\f1\| is an integer -gives the text of the first \f2d\f1 items. -The default is -.I \-Fy. -T} -\-g T{ -Do not use -.I fgrep -to search files changed since the index was made; -print an error comment instead. -T} -\-i \f2string T{ -Take -.I string -as input, instead of reading the standard input. -T} -\-l \f2n T{ -The maximum length of internal lists of candidate -items is -.I n; -default 1000. -T} -\-o \f2string T{ -Put text output (``\-Fy'') in -.I string; -of use -.I only -when -invoked from another program. -T} -\-p T{ -Print hash code frequencies; mostly -for use in optimizing hash table sizes. -T} -\-T[yn\f2d\|\f3] T{ -``\-Ty'' gives the tags of the items found; -``\-Tn'' suppresses them. -``\-T\f2d\f1\|'' where \f2d\f1\| is an integer -gives the first \f2d\f1 tags. -The default is -.I \-Tn . -T} -\-t \f2string T{ -Put tag output (``\-Ty'') in -.I string; -of use -.I only -when invoked from another program. -T} -.TE -.PP -The timing of -.I hunt -is complex. -Normally the hash table is overfull, so that there will -be many false drops on any single term; -but a multi-term query will have few false drops on -all terms. -Thus if a query is underspecified (one search term) -many potential items will be examined and discarded as false -drops, wasting time. -If the query is overspecified (a dozen search terms) -many keys will be examined only to verify that -the single item under consideration has that key posted. -The variation of search time with number of keys is -shown in the table below. -Queries of varying length were constructed to retrieve -a particular document from the file of references. -In the sequence to the left, search terms were chosen so as -to select the desired paper as quickly as possible. -In the sequence on the right, terms were chosen inefficiently, -so that the query did not uniquely select the desired document -until four keys had been used. -The same document was the target in each case, -and the final set of eight keys are also identical; the differences -at five, six and seven keys are produced by measurement error, not -by the slightly different key lists. -.TS -center; -c s s s5 | c s s s -cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8 -cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8 -n n n n | n n n n . -Efficient Keys Inefficient Keys -No. keys Total drops Retrieved Search time No. keys Total drops Retrieved Search time - (incl. false) Documents (seconds) (incl. false) Documents (seconds) -1 15 3 1.27 1 68 55 5.96 -2 1 1 0.11 2 29 29 2.72 -3 1 1 0.14 3 8 8 0.95 -4 1 1 0.17 4 1 1 0.18 -5 1 1 0.19 5 1 1 0.21 -6 1 1 0.23 6 1 1 0.22 -7 1 1 0.27 7 1 1 0.26 -8 1 1 0.29 8 1 1 0.29 -.TE -As would be expected, the optimal search is achieved -when the query just specifies the answer; however, -overspecification is quite cheap. -Roughly, the time required by -.I hunt -can be approximated as -30 milliseconds per search key plus 75 milliseconds -per dropped document (whether it is a false drop or -a real answer). -In general, overspecification can be recommended; -it protects the user against additions to the data base -which turn previously uniquely-answered queries -into ambiguous queries. -.PP -The careful reader will have noted an enormous discrepancy between these times -and the earlier quoted time of around 1.9 seconds for a search. The times -here are purely for the search and retrieval: they are measured by -running many searches through a single invocation of the -.I hunt -program alone. -Usually, the UNIX command processor (the shell) must start both -the -.I mkey -and -.I hunt -processes for each query, and arrange for the output of -.I mkey -to be fed to -the -.I hunt -program. -This adds a fixed overhead of about 1.7 seconds -of processor time -to any single search. -Furthermore, remember that all these times are processor times: -on a typical morning on our \s-2PDP\s0 11/70 system, with about one dozen -people logged on, -to obtain 1 second of processor time for the search program -took between 2 and 12 seconds of real time, with a median of -3.9 seconds and a mean of 4.8 seconds. -Thus, although the work involved in a single search may be only -200 milliseconds, after you add the 1.7 seconds of startup processor -time -and then assume a 4:1 elapsed/processor time -ratio, it will be 8 seconds before any response is printed. -.NH -Selecting and Formatting References for T\s-2ROFF\s0 -.PP -The major application of the retrieval software -is -.I refer, -which is a -.I troff -preprocessor -like -.I eqn . -.[ -kernighan cherry acm 1975 -.] -It scans its input looking for items of the form -.DS -\*.[ -imprecise citation -\*.\^] -.DE -where an imprecise citation is merely a string -of words found in the relevant bibliographic citation. -This is translated into a properly formatted reference. -If the imprecise citation does not correctly identify -a single paper -(either -selecting no papers or too many) a message is given. -The data base of citations searched may be tailored to each -system, and individual users may specify their own -citation -files. -On our system, the default data base is accumulated from -the publication lists of the members of our organization, plus -about half a dozen personal bibliographies that were collected. -The present total is about 4300 citations, but this increases steadily. -Even now, -the data base covers a large fraction of local citations. -.PP -For example, the reference for the -.I eqn -paper above was specified as -.DS -\&\*.\*.\*. -\&preprocessor like -\&.I eqn. -\&.[ -\&kernighan cherry acm 1975 -\&.] -\&It scans its input looking for items -\&\*.\*.\*. -.DE -This paper was itself printed using -.I refer. -The above input text was processed by -.I refer -as well as -.I tbl -and -.I troff -by the command -.DS -.ft I -refer memo-file | tbl | troff \-ms -.ft R -.DE -and the reference was automatically translated into a correct -citation to the ACM paper on mathematical typesetting. -.PP -The procedure to use to place a reference in a paper -using -.I refer -is as follows. -First, use the -.I lookbib -command to check that the paper is in the data base -and to find out what keys are necessary to retrieve it. -This is done by typing -.I lookbib -and then typing some potential queries until -a suitable query is found. -For example, had one started to find -the -.I eqn -paper shown above by presenting the query -.DS - $ lookbib - kernighan cherry - (EOT) -.DE -.I lookbib -would have found several items; experimentation would quickly -have shown that the query given above is adequate. -Overspecifying the query is of course harmless; it is even desirable, -since it decreases the risk that a document added to the publication -data base in the future will be retrieved in addition to the -intended document. -The extra time taken by even a grossly overspecified query is -quite small. -A particularly careful reader may have noticed that ``acm'' does not -appear in the printed citation; -we have supplemented some of the data base items with -extra keywords, such as common abbreviations for journals -or other sources, to aid in searching. -.PP -If the reference is in the data base, the query -that retrieved it can be inserted in the text, -between -.B \*.[ -and -.B \*.\^] -brackets. -If it is not in the data base, it can be typed -into a private file of references, using the format -discussed in the next section, and then -the -.B \-p -option -used to search this private file. -Such a command might read -(if the private references are called -.I myfile ) -.DS -.ft 2 -refer \-p myfile document | tbl | eqn | troff \-ms \*. \*. \*. -.ft 1 -.DE -where -.I tbl -and/or -.I eqn -could be omitted if not needed. -The use -of the -.I \-ms -macros -.[ -lesk typing documents unix gcos -.] -or some other macro package, however, -is essential. -.I Refer -only generates the data for the references; exact formatting -is done by some macro package, and if none is supplied the -references will not be printed. -.PP -By default, -the references are numbered sequentially, -and -the -.I \-ms -macros format references as footnotes at the bottom of the page. -This memorandum is an example of that style. -Other possibilities are discussed in section 5 below. -.NH -Reference Files. -.PP -A reference file is a set of bibliographic references usable with -.I refer. -It can be indexed using the software described in section 2 -for fast searching. -What -.I refer -does is to read the input document stream, -looking for imprecise citation references. -It then searches through reference files to find -the full citations, and inserts them into the -document. -The format of the full citation is arranged to make it -convenient for a macro package, such as the -.I \-ms -macros, to format the reference -for printing. -Since -the format of the final reference is determined -by the desired style of output, -which is determined by the macros used, -.I refer -avoids forcing any kind of reference appearance. -All it does is define a set of string registers which -contain the basic information about the reference; -and provide a macro call which is expanded by the macro -package to format the reference. -It is the responsibility of the final macro package -to see that the reference is actually printed; if no -macros are used, and the output of -.I refer -fed untranslated to -.I troff, -nothing at all will be printed. -.PP -The strings defined by -.I refer -are taken directly from the files of references, which -are in the following format. -The references should be separated -by blank lines. -Each reference is a sequence of lines beginning with -.B % -and followed -by a key-letter. -The remainder of that line, and successive lines until the next line beginning -with -.B % , -contain the information specified by the key-letter. -In general, -.I refer -does not interpret the information, but merely presents -it to the macro package for final formatting. -A user with a separate macro package, for example, -can add new key-letters or use the existing ones for other purposes -without bothering -.I refer. -.PP -The meaning of the key-letters given below, in particular, -is that assigned by the -.I \-ms -macros. -Not all information, obviously, is used with each citation. -For example, if a document is both an internal memorandum and a journal article, -the macros ignore the memorandum version and cite only the journal article. -Some kinds of information are not used at all in printing the reference; -if a user does not like finding references by specifying title -or author keywords, and prefers to add specific keywords to the -citation, a field is available which is searched but not -printed (\f3K\f1). -.PP -The key letters currently recognized by -.I refer -and -.I \-ms, -with the kind of information implied, are: -.KS -.TS -center; -c c6 c c -c l c l. -Key Information specified Key Information specified -A Author's name N Issue number -B Title of book containing item O Other information -C City of publication P Page(s) of article -D Date R Technical report reference -E Editor of book containing item T Title -G Government (NTIS) ordering number V Volume number -I Issuer (publisher) -J Journal name -K Keys (for searching) X or -L Label Y or -M Memorandum label Z Information not used by \f2refer\f1 -.TE -.KE -For example, a sample reference could be -typed as: -.DS -%T Bounds on the Complexity of the Maximal -Common Subsequence Problem -%Z ctr127 -%A A. V. Aho -%A D. S. Hirschberg -%A J. D. Ullman -%J J. ACM -%V 23 -%N 1 -%P 1-12 -.\"%M TM 75-1271-7 -%M abcd-78 -%D Jan. 1976 -.DE -Order is irrelevant, except that authors are shown in the order -given. The output of -.I refer -is a stream of string definitions, one -for each of the fields of each reference, as -shown below. -.DS -\*.]- -\*.ds [A authors' names \*.\*.\*. -\*.ds [T title \*.\*.\*. -\*.ds [J journal \*.\*.\*. -\*.\*.\*. -\*.]\|[ type-number -.DE -The -.I refer -program, in general, does not concern itself with the significance -of the strings. -The different fields are treated identically by -.I refer , -except that the -X, Y and Z fields are ignored (see the -.B \-i -option of -.I mkey\^ ) -in indexing and searching. -All -.I refer -does is select the appropriate citation, based on the keys. -The macro package must arrange the strings so as to produce an appropriately -formatted citation. -In this process, it uses the convention that the `T' field is the title, -the `J' field the journal, and so forth. -.PP -The -.I refer -program does arrange the citation to simplify the macro package's -job, however. -The special macro -.B \&\*.]\- -precedes the string definitions -and the special macro -.B \*.]\|[ -follows. -These are changed from the input -.B \*.[ -and -.B \*.\^] -so that running the same file through -.I refer -again is harmless. -The -.B \*.]\- -macro can be used by the macro package to -initialize. -The -.B \*.]\|[ -macro, which should be used -to print the reference, is given an -argument -.I type-number -to indicate the kind of reference, as follows: -.KS -.TS -center; -c c -n l. -Value Kind of reference -1 Journal article -2 Book -3 Article within book -4 Technical report -5 Bell Labs technical memorandum -0 Other -.TE -.KE -The type is determined by the presence or absence of particular fields -in the citation (a journal article must have a `J' field, a book must have -an `I' field, and so forth). -To a small extent, this violates -the above rule that -.I refer -does not concern itself with the contents of the citation; -however, the classification of the citation in -.I troff -macros would require a relatively expensive and obscure -program. -Any macro writer may, of course, preserve consistency by ignoring -the argument to the -.B \*.]\|[ -macro. -.PP -The reference is flagged in the text -with the sequence -.DS -\e*\|([\*.number\e*\|(\*.\^] -.DE -where -.I number -is the footnote number. -The strings -.B [\*. -and -.B \*.\^] -should be used by the macro package -to format the reference flag in the text. -These strings can be replaced for a particular -footnote, as described in section 5. -The footnote number (or other signal) is available -to the reference macro -.B \*.]\|[ -as the -string register -.B [F . -To simplify dealing with a text reference that occurs -at the end of a sentence, -.I refer -treats a reference which follows a period in a special way. -The period is removed, and the reference is preceded by -a call for the string -.B <. -and followed by a call for the string -.B >. -For example, if a reference follows ``end.'' it will appear -as -.DS -end\e*(<\*.\e*([\*.number\e*(\*.]\e*(>\*. -.DE -where -.I number -is the footnote number. -The macro package should turn either the string -.B >. -or -.B <. -into a period and delete the other one. -This permits the output to have either the form -``end[31].'' or ``end.\s-3\u31\d\s0'' as the -macro package wishes. -Note that in one case the period precedes the number and in the -other it follows the number. -.PP -In some cases users wish to suspend the searching, and merely -use the reference macro formatting. -That is, the user doesn't want to provide a search key -between -.B \*.[ -and -.B \*.\^] -brackets, but merely -the reference lines for the appropriate document. -Alternatively, the user -can wish -to add a few fields to those in the reference -as in the standard file, or -override some fields. -Altering or replacing fields, or supplying whole references, is easily done -by inserting lines beginning -with -.B % ; -any such line is taken as -direct input to the reference -processor rather than keys to be searched. -Thus -.DS -\*.[ -key1 key2 key3 \*.\*.\*. -%Q New format item -%R Override report name -\*.\^] -.DE -makes the indicates changes to the result of searching for -the keys. -All of the search keys must be given before the first -\f3%\f1 line. -.PP -If no search keys are provided, an entire citation can -be provided in-line in the text. -For example, if the -.I eqn -paper citation were to be inserted in -this way, rather than by searching for it in the data base, -the input would read -.DS -\&\*.\*.\*. -\&preprocessor like -\&.I eqn. -\&.[ -\&%A B. W. Kernighan -\&%A L. L. Cherry -\&%T A System for Typesetting Mathematics -\&%J Comm. ACM -\&%V 18 -\&%N 3 -\&%P 151-157 -\&%D March 1975 -\&.] -\&It scans its input looking for items -\&\*.\*.\*. -.DE -This would produce a citation of the same appearance as that -resulting from the file search. -.PP -As shown, fields are normally turned into -.I troff -strings. -Sometimes users would rather have them defined as macros, -so that other -.I troff -commands can be placed into the data. -When this is necessary, simply double the control character -.B % -in the data. -Thus the input -.DS -\&.[ -%V 23 -%%M -Bell Laboratories, -Murray Hill, N.J. 07974 -\&.] -.DE -is processed by -.I refer -into -.DS -\&.ds [V 23 -\&.de [M -Bell Laboratories, -Murray Hill, N.J. 07974 -\&.. -.DE -The information after -.B %%M -is defined as a macro to be invoked by -.B .[M -while the information after -.B %V -is turned into a string to be invoked by -.B \e\(**([V . -At present -.I \-ms -expects all information as strings. -.NH -Collecting References and other Refer Options -.PP -Normally, the combination of -.I refer -and -.I \-ms -formats output as -.I troff -footnotes which are consecutively numbered and placed -at the bottom of the page. However, -options exist to -place the references at the end; to arrange references alphabetically -by senior author; and to indicate references by strings in the text of the form -[Name1975a] -rather than by number. -Whenever references are not placed at the bottom of a page -identical references are coalesced. -.PP -For example, the -.B \-e -option to -.I refer -specifies that references are to be collected; in this case -they are output whenever the sequence -.DS -\*.[ -$LIST$ -\*.\^] -.DE -is encountered. -Thus, to place references at the end of a paper, the user would run -.I refer -with the -.I \-e -option and place the above $LIST$ commands after the last -line of the text. -.I Refer -will then move all the references to that point. -To aid in formatting the collected references, -.I refer -writes the references preceded by the line -.DS -.B .]< -.DE -and -followed by the line -.DS -.B .]> -.DE -to invoke special macros before and after the references. -.PP -Another possible option to -.I refer -is the -.B \-s -option to specify -sorting of references. The default, -of course, is to list references in the order presented. -The -.B \-s -option implies the -.B \-e -option, and thus requires -a -.DS -\*.[ -$LIST$ -\*.\^] -.DE -entry to call out the reference list. -The -.B \-s -option may be followed by a string of letters, numbers, and `+' signs indicating how -the references are to be sorted. -The sort is done using the fields whose key-letters are -in the string as sorting keys; the numbers indicate how many -of the fields are to be considered, with `+' -taken as a large number. -Thus the default is -.B \-sAD -meaning ``Sort on senior author, then date.'' To -sort on all authors and then title, specify -.B \-sA+T . -And to sort on two authors and then the journal, -write -.B \-sA2J . -.PP -Other options to -.I refer -change the signal or label inserted in the text for each reference. -Normally these are just sequential numbers, -and their exact placement (within brackets, as superscripts, etc.) is determined -by the macro package. -The -.B \-l -option replaces reference numbers by -strings composed of the senior author's last name, the date, -and a disambiguating letter. -If a number follows the -.B l -as in -.B \-l3 -only that many letters of the last name are used -in the label string. -To abbreviate the date as well the form -\f3-l\f2m,n\f1 -shortens the last name to the -first -.I m -letters and the date to the -last -.I n -digits. -For example, the option -.B \-l3,2 -would refer to the -.I eqn -paper (reference 3) by the signal -.I Ker75a , -since it is the first cited reference by Kernighan in 1975. -.PP -A user wishing to specify particular labels for -a private bibliography may use the -.B \-k -option. -Specifying -\f3\-k\f2x\f1 -causes the field \f2x\f1 to be used as a label. -The default is \f3L\f1. -If this field ends in \f3\-\f1, that character -is replaced by a sequence letter; otherwise the field -is used exactly as given. -.PP -If none of the -.I refer -produced -signals are desired, -the -.B \-b -option entirely suppresses automatic text signals. -.PP -If the user wishes to override the -.I \-ms -treatment of the reference signal (which is normally to -enclose the number in brackets in -.I nroff -and make it a superscript in -.I troff\\| ) -this can be done easily. -If the lines -.B \&.[ -or -.B \&.] -contain anything following these characters, -the remainders of these lines are used to surround -the reference signal, instead of the default. -Thus, for example, to say ``See reference (2).'' -and avoid -``See reference.\s-3\u2\d\s+3'' the -input might appear -.DS -\&See reference -\&\*.[ ( -imprecise citation ... -\&\*.\^])\*. -.DE -Note that blanks are significant in this construction. -If a permanent change is desired in the style of reference -signals, however, it is probably easier to redefine the strings -.B \&[. -and -.B \&.] -(which are used to bracket each signal) -than to change each citation. -.PP -Although normally -.I refer -limits itself to retrieving the data for the reference, -and leaves to a macro package the job of arranging that -data as required by the local format, there are two -special options for rearrangements that can not be -done by macro packages. -The -.B \-c -option puts fields into all upper case -(C\s-2APS\s+2-S\s-2MALL\s+2 C\s-2APS\s+2 -in -.I troff -output). -The key-letters indicated what information is to be translated -to upper case follow the -.B c , -so that -.B \-cAJ -means that authors' names and journals are to be in caps. -The -.B \-a -option writes the names of authors last name first, that is -.I "A. D. Hall, Jr." -is written as -.I "Hall, A. D. Jr" . -The citation form of -the -.I "Journal of the ACM" , -for example, would require -both -.B \-cA -and -.B \-a -options. -This produces authors' names in the style -.I -K\s-2ERNIGHAN\s0, B. W. \s-2AND\s0 C\s-2HERRY\s0, L. L.\& -.R -for the previous example. -The -.B \-a -option may be followed by a number to indicate how many -author names should be reversed; -.B \-a1 -(without any -.B \-c -option) -would produce -.I -Kernighan, B. W. and L. L. Cherry, -.R -for example. -.PP -Finally, there is also the previously-mentioned -.B \-p -option to let the user specify -a private file of references to be searched before the public files. -Note that -.I refer -does not insist on a previously made index for these files. -If a file is named which contains reference -data but is not indexed, it will be searched -(more slowly) -by -.I refer -using -.I fgrep. -In this way -it is easy for users to keep small files of -new references, which can later be added to the -public data bases. -.SG MH-1274-MEL-\s8UNIX\s0 //GO.SYSIN DD refer