emily.DocSearch

Created on Thu May 7 12:26:18 2026

@author: Dr Peter J Bleackley

Attributes

splitter

Classes

DocSearch

Overall search component

Functions

clean(corpus)
parse(sentence) → list[str]	Splits sentence into lower-case words with punctuation removed

Module Contents

emily.DocSearch.splitter[source]

async emily.DocSearch.clean(corpus)[source]

emily.DocSearch.parse(sentence: str) → list[str][source]

Splits sentence into lower-case words with punctuation removed

Parameters: sentence (str) – Sentence to be parsed
Returns: Sentence as a list of lower-case words
Return type: list[str]
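
The behaviour described for parse can be sketched as follows. This is a minimal illustration, not the module's implementation: the names splitter_sketch and parse_sketch are hypothetical, and the regular expression is an assumption about how the module-level splitter attribute might be defined.

```python
import re

# Assumption: split on runs of non-word characters. The real
# emily.DocSearch.splitter may use a different pattern.
splitter_sketch = re.compile(r"\W+")


def parse_sketch(sentence: str) -> list[str]:
    """Lower-case a sentence and split it into words, dropping punctuation."""
    # Splitting can leave empty strings at either end; filter them out.
    return [word for word in splitter_sketch.split(sentence.lower()) if word]


words = parse_sketch("Hello, World! Search works.")
```

With this pattern, apostrophes also act as separators, so a contraction such as "it's" would be split into two tokens.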

class emily.DocSearch.DocSearch(embedding_url: str, reranking_url: str, vector_dir: str, collection_name: str, index_dir: str | None = None)[source]

Overall search component

vector_index[source]

reranker[source]

sentence_tokenizer[source]

sentences(text: str) → list[list[str]][source]

Splits a text into sentences, and then each sentence into a list of lower-case strings

Parameters: text (str) – A document to be split into sentences.
Returns: Each sentence in the document is represented by a list of strings
Return type: list[list[str]]
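
A rough sketch of the two-stage split described above, under stated assumptions: the class uses its own sentence_tokenizer attribute, whose type is not documented here, so a simple regular expression stands in for it. The names sentences_sketch and parse_sketch are hypothetical.

```python
import re

# Assumption: sentences end with '.', '!' or '?' followed by whitespace.
# The real sentence_tokenizer is likely more robust (e.g. for abbreviations).
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
WORD_SPLITTER = re.compile(r"\W+")


def parse_sketch(sentence: str) -> list[str]:
    """Stand-in for parse: lower-case words with punctuation removed."""
    return [word for word in WORD_SPLITTER.split(sentence.lower()) if word]


def sentences_sketch(text: str) -> list[list[str]]:
    """Split text into sentences, then each sentence into lower-case words."""
    parsed = (parse_sketch(s) for s in SENTENCE_BOUNDARY.split(text.strip()))
    # Drop sentences that contain no words at all.
    return [words for words in parsed if words]


result = sentences_sketch("The cat sat. Did it purr? Yes!")
```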

async add_documents(corpus: collections.abc.AsyncIterable[tuple[str, str]])[source]

Adds documents to the database

Parameters: corpus (AsyncIterable[tuple[str,str]]) – Documents as tuples of (filename,text)
Return type: None.
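
Because corpus must be an AsyncIterable of (filename, text) tuples, an async generator is one convenient way to supply it. The sketch below shows such a generator built from an in-memory dictionary; corpus_from_dict and collect are hypothetical helpers, not part of the module, and the commented call assumes an already constructed DocSearch instance named search.

```python
import asyncio
from collections.abc import AsyncIterable, AsyncIterator


async def corpus_from_dict(docs: dict[str, str]) -> AsyncIterator[tuple[str, str]]:
    """Yield (filename, text) tuples in the shape add_documents expects."""
    for filename, text in docs.items():
        yield filename, text


async def collect(corpus: AsyncIterable[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drain an async corpus into a list, to inspect what would be added."""
    return [pair async for pair in corpus]


docs = {"a.txt": "Alpha text.", "b.txt": "Beta text."}
pairs = asyncio.run(collect(corpus_from_dict(docs)))

# With a real instance (hypothetical variable name `search`):
#     await search.add_documents(corpus_from_dict(docs))
```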

async __call__(query: str, top_k: int = 10) → pandas.Series[source]

Searches for documents relevant to the query. Finds the 2*top_k best matches from the vector database and the automatically thresholded responses from each of the Okapi and Cooccurrence indices, then retrieves the text of these candidates and reranks them.

Parameters:

query (str) – Text to query for.
top_k (int, optional) – Number of results to return. The default is 10.

Returns:

The reranker scores of the top_k best matching documents, indexed by their filenames.

Return type:

pd.Series
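
The candidate-merging and reranking flow described for __call__ might look roughly like the sketch below. It is an illustration only: merge_and_rerank, its parameters, and the toy scoring function are all hypothetical, the three hit lists stand in for the vector, Okapi and Cooccurrence lookups, and a plain list of (filename, score) pairs is returned where the real method returns a pandas.Series indexed by filename.

```python
from collections.abc import Callable, Iterable


def merge_and_rerank(
    query: str,
    vector_hits: Iterable[str],
    okapi_hits: Iterable[str],
    cooccurrence_hits: Iterable[str],
    texts: dict[str, str],
    score: Callable[[str, str], float],
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Pool candidate filenames, score each text against the query, keep top_k."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    candidates = list(dict.fromkeys([*vector_hits, *okapi_hits, *cooccurrence_hits]))
    scored = [(name, score(query, texts[name])) for name in candidates]
    # Highest score first; ties keep their first-seen order (stable sort).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def overlap(query: str, text: str) -> float:
    """Toy stand-in for a reranker: number of words shared with the query."""
    return float(len(set(query.split()) & set(text.split())))


texts = {"a": "cats purr", "b": "dogs bark", "c": "cats and dogs"}
ranked = merge_and_rerank("cats dogs", ["a", "b"], ["b", "c"], ["c"], texts, overlap, top_k=2)
```

Pooling before reranking means a document found by only one index can still reach the final results if the reranker scores it highly.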

save(path: str)[source]

Saves search indices

Parameters: path (str) – Directory in which to save indices.
Return type: None.

async clear()[source]

Clears indices

Return type: None.