Buckwheat is a multi-language tokenizer for extracting identifiers from source code.
First, the languages of the project are recognized with enry. This step returns a dictionary that maps each language to the list of files written in it. Only files in supported languages are passed on to the next step (see the full list below).
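As a rough illustration of the shape of that dictionary and the filtering step, here is a minimal sketch. It does not call enry: the extension table, the set of supported languages, and both function names are hypothetical stand-ins chosen for the example.

```python
from collections import defaultdict
from pathlib import PurePath

# Hypothetical stand-in for enry's language detection: extension -> language.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".js": "JavaScript",
    ".java": "Java",
    ".md": "Markdown",
}

# Hypothetical subset of supported languages, for illustration only.
SUPPORTED_LANGUAGES = {"Python", "JavaScript", "Java"}


def group_files_by_language(paths):
    """Return a dictionary with languages as keys and lists of files as values."""
    groups = defaultdict(list)
    for path in paths:
        language = EXTENSION_TO_LANGUAGE.get(PurePath(path).suffix)
        if language is not None:  # files with unrecognized extensions are skipped
            groups[language].append(path)
    return dict(groups)


def keep_supported(groups):
    """Drop languages that the later steps cannot parse."""
    return {
        language: files
        for language, files in groups.items()
        if language in SUPPORTED_LANGUAGES
    }
```

With this sketch, a Markdown file is recognized but filtered out by `keep_supported`, while a file with an unknown extension never enters the dictionary at all.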
Every file is parsed with one of two parsers. The most popular languages are parsed with tree-sitter, and languages that do not yet have a tree-sitter grammar are parsed with pygments. At this point, identifiers are extracted, and every identifier is passed on to the next step. For tree-sitter languages, class-level and function-level parsing is also available.
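To make "extracting identifiers" concrete, the sketch below pulls identifiers and their positions out of Python source. It deliberately uses Python's standard `tokenize` module as a stand-in, not tree-sitter or pygments, so it only handles Python input; the function name `extract_identifiers` is an assumption made for the example.

```python
import io
import keyword
import tokenize


def extract_identifiers(source):
    """Return (name, line, column) for every NAME token that is not a keyword."""
    identifiers = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            line, column = tok.start  # line is 1-based, column is 0-based
            identifiers.append((tok.string, line, column))
    return identifiers
```

A real lexer or grammar distinguishes more token kinds (for example, function names from variables), which this stand-in does not attempt.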
Every identifier can be split into subtokens by camelCase and snake_case; short subtokens are joined to longer ones, and the subtokens are stemmed. In general, the preprocessing is carried out as described in this paper.
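The splitting step can be sketched with a regular expression. This is a minimal illustration under assumptions, not the project's actual implementation: the pattern and the name `split_identifier` are hypothetical, subtokens are lowercased for simplicity, and the joining of short subtokens and the stemming are left out.

```python
import re

# Matches, in order: an acronym followed by a capitalized word ("HTTP" in
# "HTTPResponse"), a word with an optional leading capital, a trailing
# acronym, or a run of digits.
_SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(identifier):
    """Split an identifier by snake_case and camelCase into lowercase subtokens."""
    subtokens = []
    for part in identifier.split("_"):  # snake_case boundaries
        # camelCase boundaries inside each underscore-separated part
        subtokens.extend(m.group(0).lower() for m in _SUBTOKEN.finditer(part))
    return subtokens
```

For example, `split_identifier("parseHTTPResponse")` yields `["parse", "http", "response"]`, and `split_identifier("snake_case_name")` yields `["snake", "case", "name"]`.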
The subtoken counts are aggregated at the chosen granularity (project, file, class, or function) and saved to a file. Alternatively, sequences of tokens are saved in order of appearance within each bag (file, class, or function), optionally with the coordinates of every identifier.
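The counting mode can be pictured with `collections.Counter`. The sketch below covers only the file and project granularities and assumes the input is a dictionary from file path to its list of subtokens; the function name, the `"project"` key, and the input shape are all hypothetical, and the on-disk output format is not shown.

```python
from collections import Counter


def count_subtokens(files, granularity="file"):
    """Aggregate subtoken counts per file or over the whole project.

    `files` maps a file path to the list of subtokens found in that file.
    """
    if granularity == "file":
        # One bag of subtokens per file.
        return {path: Counter(tokens) for path, tokens in files.items()}
    if granularity == "project":
        # A single bag that merges the subtokens of every file.
        total = Counter()
        for tokens in files.values():
            total.update(tokens)
        return {"project": total}
    raise ValueError(f"unsupported granularity: {granularity!r}")
```

Class-level and function-level bags would follow the same pattern, with the parser supplying one list of subtokens per class or function instead of per file.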