Web pages
---------

Data file: pages
Pattern file (ESA 2011): MSN
Pattern file (Algorithmica): yahoo

The weight file (e.g. MSN.found for MSN) can be computed by using an RLCSA of the data file. After building the index, the command is

  rlcsa_test -i8 -l -w pages MSN

Terms is the set of search terms with frequency >= 100 that were used for constructing the data file (March 2011).

original/MSN is the original query log. The script split can be used to generate the pattern file from it. Stop words must be removed separately.


DBLP
----

Data file: dblp
Pattern file (ESA 2011): authors_terms
Pattern file (Algorithmica): mend

The ESA 2011 pattern file was created by scripts/parse_dblp.py. The initial patterns were all author names, as well as all terms occurring inside that start and end with an alphanumeric character. Short patterns and stop words (stopwords.txt) were then removed from the file.

The final pattern file contains the above terms, sorted by the number of occurrences in descending order. The frequencies are the same as for web pages (combine.py). As there were more terms than for web pages, the remaining ones were removed.

dblp: version 2011-03-29
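
The filtering described for the DBLP pattern file can be illustrated with a minimal sketch. This is not the code in scripts/parse_dblp.py. The function names, the MIN_LENGTH cutoff, and the case-insensitive stop word comparison are assumptions made for illustration, since the notes above do not specify them.

```python
from collections import Counter

# Hypothetical threshold: the notes do not state what counts as a "short" pattern.
MIN_LENGTH = 3


def is_candidate(term):
    """True if the term starts and ends with an alphanumeric character."""
    return bool(term) and term[0].isalnum() and term[-1].isalnum()


def build_patterns(terms, stopwords, min_length=MIN_LENGTH):
    """Return (term, count) pairs, most frequent first.

    terms     -- iterable of candidate strings (author names and terms)
    stopwords -- set of lowercase stop words to drop (assumed lowercase)
    """
    # Count only the terms that pass the alphanumeric start/end check.
    counts = Counter(t for t in terms if is_candidate(t))
    # Drop short patterns and stop words.
    kept = [
        (term, n)
        for term, n in counts.items()
        if len(term) >= min_length and term.lower() not in stopwords
    ]
    # Sort by descending count; ties are broken alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept
```

Calling build_patterns with a list of extracted strings and the contents of a stop word list would give the patterns in the descending order described above. Trimming the result to the same number of terms as the web pages pattern file is left out, as the notes do not give that count.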