Text-cleaning resource now available

Beta version of SAGE Texti open to social science researchers

SAGE Publishing has made a free beta version of Texti available to NC State students and faculty. Texti is a web interface that helps social science researchers clean and prepare corpora for text mining.

The Texti beta supports .pdf and .txt files and very basic scraping (.html). Load a document into the Texti web interface and build a cleaning and pre-processing workflow. You can preview how each cleaner works on your document and change the order in which the cleaners run. Once you are happy with the output, you can extract the text.
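For readers new to the idea, a cleaning workflow of this kind can be thought of as an ordered list of small text transformations. The Python sketch below is only an illustration of that general pattern, not Texti's code or API: the cleaner functions and the `run_workflow` and `preview` helpers are hypothetical names invented for this example.

```python
import re
from typing import Callable, List, Tuple

# A "cleaner" here is any function that takes text and returns cleaned text.
# All names below are hypothetical; they are not taken from Texti.
Cleaner = Callable[[str], str]


def strip_html_tags(text: str) -> str:
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def lowercase(text: str) -> str:
    """Convert the text to lowercase."""
    return text.lower()


def collapse_whitespace(text: str) -> str:
    """Squeeze runs of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def run_workflow(text: str, cleaners: List[Cleaner]) -> str:
    """Apply each cleaner in order and return the final text."""
    for cleaner in cleaners:
        text = cleaner(text)
    return text


def preview(text: str, cleaners: List[Cleaner]) -> List[Tuple[str, str]]:
    """Return (cleaner name, text after that cleaner) for every step."""
    steps = []
    for cleaner in cleaners:
        text = cleaner(text)
        steps.append((cleaner.__name__, text))
    return steps


# Reordering this list changes the order in which the cleaners run.
workflow = [strip_html_tags, lowercase, collapse_whitespace]
raw = "<p>Text   Mining for <b>Social</b> Science</p>"
cleaned = run_workflow(raw, workflow)
print(cleaned)
```

Keeping each cleaner as a separate function is what makes it easy to inspect the intermediate result of every step and to reorder or drop steps before extracting the final text.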

Texti is free to use for building a workflow. You can access a list of cleaners in SAGE’s GitHub repository. Future plans for Texti include batching capabilities, that is, the ability to process entire corpora of thousands or millions of documents.

SAGE is interested in feedback on the cleaners, recommendations on which cleaners researchers frequently use, and suggestions for features or capabilities researchers would like to see in Texti. SAGE is also looking for testers with large corpora in order to understand how to scale the application appropriately.

If you are interested in taking advantage of this free resource, contact Darby Orcutt (dcorcutt@ncsu.edu), the Libraries’ Assistant Head, Collections & Research Strategy.