emily.index_classes.CooccurrenceIndex

Created on Thu May 7 08:43:03 2026

@author: Dr Peter J Bleackley

Attributes

stop

Classes

CooccurrenceIndex

Uses information theory to identify documents in which the query terms tend to occur in the same sentences.

Module Contents

emily.index_classes.CooccurrenceIndex.stop
class emily.index_classes.CooccurrenceIndex.CooccurrenceIndex(path: pathlib.Path | None = None)

Uses information theory to identify documents in which the query terms tend to occur in the same sentences. See this page for details of the algorithm.

async add_documents(corpus: collections.abc.AsyncIterable[tuple[str, list[list[str]]]])

Adds documents to the index.

Parameters:

corpus (AsyncIterable[tuple[str,list[list[str]]]]) – Async iterable of (filename, document) tuples, where each document is a list of parsed sentences and each sentence is a list of strings

Return type:

None.
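The corpus shape described above can be illustrated with a small async generator. This is a minimal sketch, not part of the library: `tokenised_corpus` and `collect` are hypothetical helpers, and the naive split on full stops and whitespace stands in for whatever sentence parser is actually used upstream.

```python
import asyncio
from collections.abc import AsyncIterator


async def tokenised_corpus(
    texts: dict[str, str],
) -> AsyncIterator[tuple[str, list[list[str]]]]:
    """Yield (filename, sentences) tuples in the shape add_documents expects.

    Hypothetical helper: splits on "." and whitespace purely for illustration.
    """
    for filename, text in texts.items():
        # Each sentence becomes a list of tokens; empty fragments are dropped.
        sentences = [part.split() for part in text.split(".") if part.strip()]
        yield filename, sentences


async def collect(corpus) -> list[tuple[str, list[list[str]]]]:
    """Drain an async iterable into a list so its contents can be inspected."""
    return [item async for item in corpus]


if __name__ == "__main__":
    sample = {"a.txt": "the cat sat. the dog ran."}
    print(asyncio.run(collect(tokenised_corpus(sample))))
```

Given an existing `CooccurrenceIndex` instance named `index`, such a generator would be passed as `await index.add_documents(tokenised_corpus(texts))` from inside a coroutine.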

__call__(query: list[str]) → polars.LazyFrame

Finds candidate documents in which the words in the query tend to co-occur. Uses the Pareto principle to threshold the results automatically: for N candidate documents, Np results are returned, such that they account for (1-p) of the total relevance of the sample.

Parameters:

query (list[str]) – The query as a list of strings

Returns:

Contains a single column, “filename”

Return type:

pl.LazyFrame
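The thresholding rule can be sketched in plain Python. This is one possible reading of the description above, not the library's actual implementation: `pareto_cutoff` is a hypothetical function that ranks relevance scores in descending order and returns the smallest k for which the top k of N documents (p = k/N) carry at least (1-p) of the total relevance.

```python
def pareto_cutoff(scores: list[float]) -> int:
    """Return how many top-ranked documents to keep.

    Assumption: scores are non-negative relevance values, one per candidate.
    """
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    total = sum(ranked)
    if n == 0 or total <= 0:
        return 0
    running = 0
    for k, score in enumerate(ranked, start=1):
        running += score
        # running / total >= 1 - k / n, rearranged to avoid division.
        if running * n >= total * (n - k):
            return k
    return n


if __name__ == "__main__":
    # The top 2 of 5 documents (p = 0.4) hold 80% of the relevance, which is >= 60%.
    print(pareto_cutoff([50, 30, 10, 5, 5]))
```

Under this reading, a sharply skewed score distribution yields a short result list, while uniform scores keep about half of the candidates.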

save(path: pathlib.Path)

Saves the indices to Parquet files.

Parameters:

path (Path) – Directory to save indices in

Return type:

None.

async clear()

Clears the indices.

Return type:

None.
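Because add_documents and clear are coroutines while save and __call__ are ordinary methods, a typical workflow mixes awaited and plain calls. The sketch below shows that pattern using only the signatures documented on this page. `rebuild_save_and_query` is a hypothetical helper, and clearing before adding, as well as the target directory already existing, are assumptions made for the example.

```python
import asyncio
from pathlib import Path


async def rebuild_save_and_query(index, corpus, query: list[str], out_dir: Path):
    """Rebuild an index from scratch, persist it, and run one query.

    `index` is expected to behave like a CooccurrenceIndex instance.
    """
    await index.clear()                # coroutine: must be awaited
    await index.add_documents(corpus)  # coroutine: consumes the async iterable
    index.save(out_dir)                # synchronous: writes Parquet files
    # __call__ returns a polars LazyFrame; collect() materialises it.
    return index(query).collect()
```

From synchronous code this would be driven with `asyncio.run(rebuild_save_and_query(index, corpus, ["cat", "mat"], Path("indices")))`, where the query terms and directory name are placeholders.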