Google Books Ngrams Recompressed and Searchable

Title:: Google Books Ngrams Recompressed and Searchable
Authors:: Grabowski, S.
Swacha, J.
Publication date:: 2012
Keywords:: data compression
random access
n-gram language model
Language:: English
Content provider:: BazTech
Type:: Article

One of the research fields significantly affected by the emergence of “big data” is computational linguistics. A prominent example of a large dataset targeting this domain is the collection of Google Books Ngrams, made freely available, for several languages, in July 2009. There are two problems with Google Books Ngrams: the textual format (compressed with Deflate) in which they are distributed is highly inefficient, and we are not aware of any tool facilitating search over those data, apart from the Google viewer, which, as a Web tool, is of seriously limited use. In this paper we present a simple preprocessing scheme for Google Books Ngrams that also enables searching for an arbitrary n-gram (i.e., its associated statistics) in an average time below 0.2 ms. The obtained compression ratio, with Deflate (zip) left as the backend coder, is over 3 times higher than in the original distribution.
