Help:Searching/Regex


Indexed search

First, all pages are scanned by the search engine. The entire wiki is treated as one "full text" kept in a separate database built just for search indexes. It's like the index in a book, except that practically every word and every number is indexed to every page on which it appears.[1]

Since each word in the prebuilt search index already points to the pages that contain it, almost any word you search for is really a single record lookup in that index. (This is also true for phrases, to a certain extent.)
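The lookup described above can be pictured with a toy inverted index. This is only a minimal sketch: the page titles, page text, and the function names build_index and lookup are made up for illustration and say nothing about how the wiki's search engine is actually implemented.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page titles that contain it."""
    index = defaultdict(set)
    for title, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(title)
    return index

def lookup(index, word):
    """One dictionary lookup; no page text is scanned at query time."""
    return index.get(word.lower(), set())

# Hypothetical sample pages.
pages = {
    "Page A": "Nicotine pouches and snus",
    "Page B": "Snus use in Sweden",
}
index = build_index(pages)
print(sorted(lookup(index, "snus")))
```

The work happens once, when the index is built; each later query only reads one entry.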

There are separate indexes kept updated for the

  • titles
  • visual content
  • wikitext
  • templates

All the words each template outputs are indexed to all the pages onto which the template is transcluded. In other words, any text transcluded by a template is indexed to its target page.[2] "Index searches" take basically no time to execute. They are cheap and plentiful.

Preparing and maintaining the search indexes is done in the background in near real time. A few seconds after you save a page, you can search for the changes you just made. For templates that are transcluded onto a great many pages, propagating those changes to the index entries of all those pages might take a minute.

Indexes are based on alphanumeric characters; they store no information on non-alphanumeric characters. Although an indexed search can ask for punctuation, brackets, math and other symbolic characters of the keyboard, these are ignored without warning. Now you understand why search indexes are so fast.
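To see why punctuation in an indexed query is silently ignored, here is a rough sketch of the kind of tokenizing the paragraph above describes. The function name index_tokens is hypothetical, and restricting tokens to ASCII letters and digits is a simplifying assumption.

```python
import re

def index_tokens(text):
    """Keep only runs of letters and digits, lowercased; everything else is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Brackets, pipes, equals signs and dots leave no trace in the tokens.
print(index_tokens("{{cite web|url=...}}"))
print(index_tokens("cite web url"))
```

Both calls produce the same tokens, so an indexed search cannot tell the two queries apart. Distinguishing them is what a regex search is for.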

Each regex search needs an indexed search filter to give it a search domain of under 10,000 pages.

Indexed search

A basic indexed search

  • searches only article space. That is the default.
  • matches only letters and numbers. This is usually not a problem.
  • works basically the same way for all public search engines. You can usually find the information you are looking for near the top of the search results by relying on page ranking software.
  • lands a lot of search results. You rely heavily on page ranking rules. You then refine search results based on the topmost pages. This is done with the not filter, signified by a minus sign attached to the front of the unwanted word to filter out page-hit noise you could not have predicted. This is the first thing you learn.
  • is an "aggressive matcher" including as many pages as it can by matching all forms of each word you enter.

Basically, why would anyone ever want to learn "how to search", since it is just key words, and these are obviously known?

An advanced index search

  • targets specific pages, instead of seeking general information.
  • doesn't need page ranking at all and cannot accept myriad results.
  • cares about the quantity of page-hits shown on the right hand side of the search results page.

Regular expressions

Regular expressions are a small, special-purpose language for specifying sequences of characters that define a search pattern. A regular expression is referred to as a "regex" for short.

A regex search actually scours each page in the search domain character by character. By contrast, an indexed search queries a few records from a database maintained separately from the wiki database, and provides nearly instant results. So when using insource:// (a regexp of any kind), consider adding other search terms that limit the regex search domain as much as possible. Many search terms use an index and so instantly provide a more refined search domain for the /regexp/. In order of general effectiveness:

  • insource:"" with quotation marks, duplicating the regexp except without the slashes or escape characters, is ideal.
  • intitle, incategory, and linksto are excellent filters.
  • hastemplate: is a very good filter.
  • "word1 word2 word3", with or without the quotation marks, are good.
  • namespace: is an advanced filter, but practically useless for regex, except that it may enable a slow regexp search to complete instead of timing out.
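The first filter in the list, a quoted insource: copy of the regexp without slashes or escape characters, can be derived mechanically when the regexp is just an escaped literal. The following is a hedged sketch: the helper names companion_filter and build_query are made up, and the approach assumes the regexp contains no unescaped metacharacters and no double quotes.

```python
import re

def companion_filter(regexp):
    """Drop each escaping backslash to recover the literal text the regexp targets.

    Only sensible when the regexp is an escaped literal with no live metacharacters.
    """
    return re.sub(r"\\(.)", r"\1", regexp)

def build_query(regexp):
    """Pair the slow /regexp/ with a fast, indexed insource:"..." filter."""
    return f'insource:"{companion_filter(regexp)}" insource:/{regexp}/'

print(build_query(r"\{\{cite web"))
```

The quoted term narrows the domain through the index, and the slashed term then checks the exact characters on the pages that remain.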

The prefix operator is especially useful with a {{FULLPAGENAME}} in a search template, a search link, or an input box, because it automatically searches any subdirectories. To develop a new regexp, or refine a complex regexp, use prefix:{{FULLPAGENAME}} on a page with a sample of the target data.

Search terms that do not increase the efficiency of a regexp search are the page-weighting operators: morelike, boost-template, and prefer-recent. A regex search's main concern is to first limit the search domain with an indexed search employing filters, not such "search engine queries". Only then does it search character by character, examining each page of a narrowly defined search domain.
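The two stages can be simulated in a few lines. This is a sketch under stated assumptions, not the wiki's implementation: the sample pages and function names are hypothetical, and Python's re module stands in for the wiki's regex dialect, which differs in several metacharacters.

```python
import re

def build_index(pages):
    """Toy index: word -> set of page titles."""
    index = {}
    for title, text in pages.items():
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, set()).add(title)
    return index

def two_stage_search(pages, index, filter_word, pattern):
    """Stage 1: an index lookup picks the search domain.
    Stage 2: the regex scans only the pages in that domain."""
    domain = index.get(filter_word.lower(), set())
    regex = re.compile(pattern)
    hits = sorted(title for title in domain if regex.search(pages[title]))
    return hits, len(domain)

# Hypothetical sample pages.
pages = {
    "Page A": "See {{cite web|url=x}} here",
    "Page B": "Nothing about cite in braces",
    "Page C": "Unrelated text",
}
index = build_index(pages)
hits, scanned = two_stage_search(pages, index, "cite", r"\{\{cite web")
print(hits, scanned)
```

Here the filter word reduces three pages to two before the regex runs; on a real wiki the same step is what keeps the domain small enough to finish.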

A basic regex search

  • can pattern any character string exactly using a regexp.
  • excludes as many pages as it can.
  • quotes each regexp in double quotes to turn off metacharacters.

An advanced regex search

  • uses metacharacters.
  • benefits by being developed in a sandbox.

The regexp can be a thousand words matching every character literally, or a few symbols of a regex metacharacter language, or any combination of the two. It can match any character from any keyboard.

Regex thus have the power to produce exactitude, but are slow (expensive), and come with the responsibility to add filters to increase speed (reduce costs).

Developing a regex search virtually always requires trial and error, an iterative development process supported by {{regex}} and {{template usage}}. The easiest filters to add are a namespace, a prefix, or a copy of the regex with the slashes removed. Such filters all use an index to search, and so are much faster. This is one of the first things we learn about regex searches: accompany them with filters.[3]

Search domain size

Filters that limit the size of the search domain

  • are voluntary, not automatic
  • protect accessibility to regex
  • sustain the open use of the most advanced search feature possible
  • help avoid an HTML timeout, which will kill a search
  • create {{regex}} developmental sandboxes
  • are considerate, taking only the processing power they need

Running a bare regex can't hurt the wiki's performance, but regex searches run without basic search techniques applied to them can limit other regex searchers, and become an issue of contention.



This covers enough of regular expressions to get started answering questions about wikitext contents on the wiki. Regex are about using metacharacters to create patterns that match any literal characters. The pattern you give will match a target, character by character. To make some positions match multiple possibilities, metacharacters are needed, and they come from the same keyboard characters that also appear in the wikitext.

Metacharacters

The left curly bracket is a metacharacter, so a regexp pattern intended to match a template in the wikitext must "escape" each opening curly bracket "{" in the target as \{. All target text (all wikitext) is literal text, but we backslash-"escape" the regex metacharacters \. \? \+ \* \| \{ \[ \] \( \) \" \\ \# \@ \< \~ when we refer to them as literal characters in the wikitext we are interested in mining. Search will ignore the backslash wherever it is meaningless or unnecessary: \n matches n, and so on. So although you don't need to backslash-escape & or > or }, it is safe to do so. An unnecessary backslash will not cause your pattern to fail; what will cause it to fail is using any of these characters literally without escaping it: [ ] . * + ? | { ( ) " \ # @ < ~
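Escaping by hand is error-prone, so a small helper can do it. The sketch below backslash-escapes exactly the sixteen metacharacters listed above, plus the slash, which needs escaping because of the /slash delimiters/. The function name escape_wiki_regex is hypothetical and is not part of any wiki tool.

```python
# The sixteen metacharacters listed in the text above.
WIKI_METACHARS = set('.?+*|{[]()"\\#@<~')
# The slash delimits the regexp in insource:/.../, so it is escaped as well.
DELIMITER = "/"

def escape_wiki_regex(literal):
    """Backslash-escape every metacharacter so the text matches itself literally."""
    return "".join(
        "\\" + ch if ch in WIKI_METACHARS or ch == DELIMITER else ch
        for ch in literal
    )

print(escape_wiki_regex("{{cite web|url=[x]}}"))
```

The closing curly brackets are left alone, which matches the observation that } does not need escaping, though escaping it would also be safe.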

  • [0-9] will match any digit, [a-y] any lowercase letter except z, [zZ] any z, (and so on). So square brackets mean "character class".
  • Dot . will match a newline, or any character in the targeted position

The number of sequential digits or characters these symbols match is expressed by following them with a quantifying metacharacter:

  • * means zero or more
  • + means one or more
  • ? means zero or one

of the character it follows. The number of times it matches can also be given as a range: a{2}, a{2,}, and a{2,5} match exactly 2, 2 or more, or 2–5 a's. So curly brackets mean "quantifier".
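These quantifiers behave the same way in most regex dialects, so they can be tried out offline. The sketch below uses Python's re module as a stand-in for the wiki's engine; the helper name matches is made up, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

def matches(pattern, text):
    """True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

cases = [
    ("ab*", "a"),          # * allows zero b's
    ("ab+", "a"),          # + requires at least one b
    ("ab?", "abb"),        # ? allows at most one b
    ("a{2}", "aaa"),       # exactly two a's
    ("a{2,}", "aaaaaa"),   # two or more a's
    ("a{2,5}", "aaaaaa"),  # between two and five a's
]
for pattern, text in cases:
    print(pattern, repr(text), matches(pattern, text))
```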

  • Parentheses are a grouping mechanism, so we can quantify more than just the previous character, and so we can make boundaries for a set of alternative matches. (See alternation below.)
  • The quotation marks are an escape mechanism, like square brackets or the backslash.
  • The angle brackets stand for numerals, not digits. Say <5-799> to match 5–799, in one to three positions. Compare this with the alternative: [0-9]{1,3} matches any one to three digits, so it could match ones, tens, or hundreds, as 0-9, 00-99, or 000-999.
  • Tilde ~ looks ahead and negates the next character. In other words, if the pattern matches in this position, then un-match it if the next character is ~character.
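The difference between a numeral range such as <5-799> and a digit class such as [0-9]{1,3} can be illustrated offline. Python's re module has no angle-bracket syntax, so the sketch below emulates it with a hypothetical helper, in_numeral_range. How the wiki's engine treats leading zeros is not covered here, and the examples avoid them.

```python
import re

def in_numeral_range(text, low, high):
    """Emulate <low-high>: text must be a run of digits whose value lies in [low, high]."""
    return re.fullmatch(r"[0-9]+", text) is not None and low <= int(text) <= high

def is_one_to_three_digits(text):
    """The digit-class alternative: any one to three digits, whatever their value."""
    return re.fullmatch(r"[0-9]{1,3}", text) is not None

for candidate in ["3", "42", "799", "800"]:
    print(candidate, in_numeral_range(candidate, 5, 799), is_one_to_three_digits(candidate))
```

The digit class accepts all four candidates, while the numeral range rejects 3 and 800 because it compares values rather than positions.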

It is not safe to search for a lone @ because that single metacharacter matches literally everything; you can use \@ to find all pages that use an "at" symbol.

Similarly, to find all pages that use the number zero, do not search for a lone 0, because Search returns an error. Use one of the three escape mechanisms, which work for 0 or @:

  • "0"
  • \0
  • [0]

or find a larger pattern around the zero you seek. Although zero is not a metacharacter, these escape mechanisms work.

The rest of wiki regex is pretty straightforward. Characters stand for themselves unless they are metacharacters. If they are metacharacters they are escaped if outside of a character class.

Character classes

A character class means "literal characters", plural. Because it means "literal", you normally don't have to escape a metacharacter inside a character class; it is already escaped by the square brackets. Because of the /slash delimiters/, a slash must always be escaped, even inside a character class. No other character in a character class always needs escaping, but ] and - have special meaning (are metacharacters) to a character class, so they must sometimes be escaped: each is literal if it is the first character of the class, but anywhere else it must be escaped. Only the backslash works as the escape mechanism inside a character class.

A character class can serve to escape metacharacters, so [-|*\/.{\]] or []|*\/.{\-] means "either a dash OR pipe OR star OR slash OR dot OR left curly bracket or a right square bracket". So [][.?+*|\/{}()\-]" or [-[.?+*|\/{}()\]]" works to find all the metacharacters in the wikitext, all of them except the backslash. Neither [\] nor [\\] allows us to OR a literal backslash. To OR a backslash character, there's alternation with the pattern \\ to handle that case. (See below.)

A character class understands the "inverse" of itself, [^abc] is "not a or b or c". A character class stands for a single character in a targeted position, so it's not really an inverse of a set, but rather a NOT of a character.
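The two equivalent classes given above, and the negated form, happen to be valid in Python's re module as well, so their behaviour can be checked offline. This is a sketch using Python as a stand-in for the wiki's regex dialect; the sample string is made up.

```python
import re

# Dash first (literal), right square bracket escaped at the end.
dash_first = re.compile(r"[-|*\/.{\]]")
# Right square bracket first (literal), dash escaped at the end.
bracket_first = re.compile(r"[]|*\/.{\-]")

sample = "a-b|c*d/e.f{g]h"
print(dash_first.findall(sample))
print(bracket_first.findall(sample))

# A negated class matches any single character that is NOT listed.
print(re.findall(r"[^abc]", "abcd"))
```

Both classes pick out the same seven punctuation characters, which is the point of the "first character or escaped" rule for ] and -.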

Currently character classes are limited to an expansion of four characters, so [0-9] would require three searches [0-3], [4-7], and [8-9]. The alphabet would require seven searches. This is to guarantee regex will work without overloading the search engine. See task T106685.
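Splitting a wide range into classes of at most four characters is mechanical. The sketch below reproduces the split described above; the function name split_class is hypothetical and the limit of four is taken from the paragraph above.

```python
def split_class(first, last, limit=4):
    """Split the range first-last into character classes of at most `limit` characters."""
    classes = []
    start, end = ord(first), ord(last)
    while start <= end:
        stop = min(start + limit - 1, end)
        classes.append(f"[{chr(start)}-{chr(stop)}]")
        start = stop + 1
    return classes

print(split_class("0", "9"))
print(len(split_class("a", "z")))
```

Each resulting class would go into its own search, which is why the digits need three searches and the alphabet seven.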

Note that constructs such as \d (digit) or \a (alphabetic), used in some other regex implementations, are not accepted.

Alternation

Finally, alternation is a class of regex that contains alternative possibilities for a match, say an AA or a BB, or a CC:

  • "AA" OR "BB" OR "CC" to Word search an entire page
  • AA|BB|CC to regexp search a two-character position
  • (AA|BB|CC) where used within a larger regexp because an alternation finds the longest pattern, and so the parentheses define that boundary, but it's a boundary you don't have to make if an alternation is the entire regexp pattern.
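The boundary that parentheses give an alternation can be seen by comparing a grouped and an ungrouped pattern. The sketch uses Python's re module as a stand-in; note that Python tries alternatives from left to right rather than looking for the longest one, which makes no difference in these examples because re.fullmatch requires the whole string to match.

```python
import re

def matches(pattern, text):
    """True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# Grouped: x, then one of AA, BB, CC, then y.
print(matches(r"x(AA|BB|CC)y", "xBBy"))
# Ungrouped: the alternatives are "xAA", "BB" and "CCy".
print(matches(r"xAA|BB|CCy", "xBBy"))
print(matches(r"xAA|BB|CCy", "BB"))
# An alternation that is the entire pattern needs no parentheses.
print(matches(r"AA|BB|CC", "CC"))
```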

Notes

  1. ^ When you search you are not scanning pages, you are looking up an entry in an index (a database). All content is at all times "known" and resides in indexes. So when you read "searches namespace" or "searches transcluded content on a page", you can mentally replace "search" with "searches the index for".
  2. ^ This is also said "the template on a page is expanded before the search of a page is done", but that is just an abstraction.
  3. ^ Because there are 821 users, and a well-filtered regex search only takes milliseconds, while a bare, wiki-wide regex can take tens of seconds, the benefits of adding a filter are enormous.