r-tokenizers
Fast, consistent tokenization of natural language text
This is a package for converting natural language text into tokens. It includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, tweets, Penn Treebank, and regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the stringi and Rcpp packages for fast yet correct tokenization in UTF-8 encoding.
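As a minimal sketch of that consistent interface: each tokenizer takes a character vector and returns a list with one element per input document. The function names below follow the tokenizers package's documented API; the input text is illustrative.
library(tokenizers)
text <- "Tokenizers split text into tokens. They share one consistent interface."
tokenize_words(text)          # list of lowercased word tokens
tokenize_sentences(text)      # one element per sentence
tokenize_ngrams(text, n = 2)  # shingled bigrams
count_words(text)             # counting helper; count_characters() and count_sentences() work alike
chunk_text(text, chunk_size = 5)  # split a longer text into documents of equal word length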
- Versions: 0.3.0
- Website: https://lincolnmullen.com/software/tokenizers/
- Licenses: Expat
- Package source: gnu/packages/cran.scm
Installation
Install the latest version of r-tokenizers as follows:
guix install r-tokenizers
Or install a particular version:
guix install r-tokenizers@0.3.0
You can also install packages in augmented, pure, or containerized environments for development, or simply to try them out without polluting your user profile. See the guix shell documentation for more information.
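For instance, a throwaway environment with this package might look like the following. This is a sketch that assumes you also want the r package itself in the environment so you can start an R session; --pure and --container are standard guix shell options.
guix shell r r-tokenizers                   # augment the current environment
guix shell --pure r r-tokenizers            # start from a clean environment
guix shell --container r r-tokenizers -- R  # isolated container, launching R directly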
Badge code
HTML: <a href='http://127.0.0.1:3000/packages/r-tokenizers/'><img src='http://127.0.0.1:3000/packages/r-tokenizers/badges/latest-version.svg' /></a>
Markdown: [![GNU Guix](http://127.0.0.1:3000/packages/r-tokenizers/badges/latest-version.svg)](http://127.0.0.1:3000/packages/r-tokenizers/)
Org: [[http://127.0.0.1:3000/packages/r-tokenizers/][http://127.0.0.1:3000/packages/r-tokenizers/badges/latest-version.svg]]