emily.index_classes.CooccurrenceIndex
Created on Thu May 7 08:43:03 2026
@author: Dr Peter J Bleackley
Attributes
Classes
CooccurrenceIndex: Uses information theory to identify documents in which the query terms tend to occur in the same sentences.
Module Contents
- class emily.index_classes.CooccurrenceIndex.CooccurrenceIndex(path: pathlib.Path | None = None)[source]
Uses information theory to identify documents in which the query terms tend to occur in the same sentences. See this page for details of the algorithm.
- async add_documents(corpus: collections.abc.AsyncIterable[tuple[str, list[list[str]]]])[source]
Adds documents to the index
- Parameters:
corpus (AsyncIterable[tuple[str,list[list[str]]]]) – Asynchronous iterable of (filename, document) tuples, where each document is a list of parsed sentences and each sentence is a list of strings
- Return type:
None.
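The shape of the corpus argument can be illustrated with a minimal sketch. It only builds an asynchronous iterable of the documented type; the names SAMPLE_DOCS, iter_corpus and collect are hypothetical and not part of emily, and the commented-out call at the end is an assumption based on the signature above.

```python
import asyncio
from collections.abc import AsyncIterator

# Hypothetical in-memory corpus: filename -> parsed sentences,
# where each sentence is a list of strings.
SAMPLE_DOCS: dict[str, list[list[str]]] = {
    "a.txt": [["the", "cat", "sat"], ["on", "the", "mat"]],
    "b.txt": [["dogs", "bark"]],
}


async def iter_corpus(
    docs: dict[str, list[list[str]]],
) -> AsyncIterator[tuple[str, list[list[str]]]]:
    """Yield (filename, sentences) tuples, matching the documented corpus type."""
    for filename, sentences in docs.items():
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield filename, sentences


async def collect(
    corpus: AsyncIterator[tuple[str, list[list[str]]]],
) -> list[tuple[str, list[list[str]]]]:
    """Drain an async iterable into a list (for inspection only)."""
    return [item async for item in corpus]


# Assumed usage with the real class, inside a coroutine:
#     index = CooccurrenceIndex(path)
#     await index.add_documents(iter_corpus(SAMPLE_DOCS))
```

Because add_documents is a coroutine, it has to be awaited from within a running event loop, for example via asyncio.run.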
- __call__(query: list[str]) polars.LazyFrame[source]
Finds candidate documents where the words in the query tend to co-occur. Uses the Pareto principle to automatically threshold the results. For N candidate documents, Np results will be returned, such that they account for (1-p) of the total relevance of the sample.
- Parameters:
query (list[str]) – The query as a list of strings
- Returns:
Contains a single column, “filename”
- Return type:
pl.LazyFrame
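The thresholding rule described for __call__ can be sketched in plain Python. This is one reading of the description, not the library's implementation: the function pareto_cutoff and its stopping condition are assumptions, and the relevance scores are taken to be non-negative numbers.

```python
def pareto_cutoff(scores: list[float]) -> int:
    """Return k, the number of top-scoring items to keep.

    Assumed rule: keep the smallest k such that the top k of n items
    hold at least a (1 - k/n) share of the total score.
    """
    ranked = sorted(scores, reverse=True)
    total = sum(ranked)
    n = len(ranked)
    if n == 0 or total <= 0:
        return 0
    running = 0.0
    for k, score in enumerate(ranked, start=1):
        running += score
        # Same test as running / total >= 1 - k / n, written without division.
        if running * n >= total * (n - k):
            return k
    return n


# One dominant document: the top 1 of 5 already holds 90% of the relevance.
print(pareto_cutoff([90, 4, 3, 2, 1]))
# Uniform relevance: half of the documents hold half of the relevance.
print(pareto_cutoff([1, 1, 1, 1]))
```

With a skewed score distribution the cutoff keeps only a few documents, while a flat distribution keeps about half of them, which matches the intent of returning Np results that account for (1-p) of the total relevance.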