emily.DocSearch

Created on Thu May 7 12:26:18 2026

@author: Dr Peter J Bleackley

Attributes

splitter

Classes

DocSearch

Overall search component

Functions

parse(→ list[str])

Splits sentence into lower-case words with punctuation removed

Module Contents

emily.DocSearch.splitter
emily.DocSearch.parse(sentence: str) → list[str]

Splits sentence into lower-case words with punctuation removed

Parameters:

sentence (str) – Sentence to be parsed

Returns:

Sentence as a list of lower-case words

Return type:

list[str]
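The behaviour described above can be sketched as follows. This is a minimal illustration, not the module's code: it assumes that `splitter` is a compiled regular expression matching runs of non-word characters, and the names `splitter_sketch` and `parse_sketch` are hypothetical.

```python
import re

# Assumption: the real `splitter` attribute may use a different pattern.
splitter_sketch = re.compile(r"\W+")


def parse_sketch(sentence: str) -> list[str]:
    """Lower-case a sentence and split it into words, dropping punctuation."""
    words = splitter_sketch.split(sentence.lower())
    # Splitting leaves empty strings at the edges when the sentence
    # starts or ends with punctuation, so filter them out.
    return [word for word in words if word]


print(parse_sketch("Hello, world! How are you?"))
```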

class emily.DocSearch.DocSearch(embedding_url: str, reranking_url: str, vector_dir: str, collection_name: str, index_dir: str | None = None)

Overall search component

vector_index
reranker
sentence_tokenizer
sentences(text: str) → list[list[str]]

Splits a text into sentences, then splits each sentence into a list of lower-case words

Parameters:

text (str) – A document to be split into sentences.

Returns:

The document as a list of sentences, each represented by a list of lower-case words

Return type:

list[list[str]]
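A rough sketch of the nested return shape follows. It does not use the class's `sentence_tokenizer`; instead it assumes a naive rule that a sentence ends at `.`, `!` or `?` followed by whitespace. The function name `sentences_sketch` and both patterns are hypothetical stand-ins.

```python
import re

# Assumption: a naive sentence boundary, standing in for `sentence_tokenizer`.
_sentence_end = re.compile(r"(?<=[.!?])\s+")
# Assumption: words are separated by runs of non-word characters.
_non_word = re.compile(r"\W+")


def sentences_sketch(text: str) -> list[list[str]]:
    """Split text into sentences, then each sentence into lower-case words."""
    result = []
    for sentence in _sentence_end.split(text.strip()):
        words = [w for w in _non_word.split(sentence.lower()) if w]
        if words:  # skip fragments that contain no words at all
            result.append(words)
    return result


print(sentences_sketch("The cat sat. Did it purr? Yes!"))
```

A naive rule like this mis-splits abbreviations such as "Dr. Smith", which is one reason a trained sentence tokenizer is usually preferred.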

async add_documents(corpus: collections.abc.AsyncIterable[tuple[str, str]])

Adds documents to the database

Parameters:

corpus (AsyncIterable[tuple[str,str]]) – Documents as tuples of (filename,text)

Return type:

None
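Because `corpus` is an asynchronous iterable, one convenient way to supply it is an async generator. The sketch below shows only how such a corpus might be built; `corpus_from_dict` and `collect` are hypothetical helpers, not part of the module, and the documents are made up for illustration.

```python
import asyncio
from collections.abc import AsyncIterator


async def corpus_from_dict(documents: dict[str, str]) -> AsyncIterator[tuple[str, str]]:
    """Yield (filename, text) tuples, the shape add_documents expects."""
    for filename, text in documents.items():
        # Yield control to the event loop, as a real loader would during I/O.
        await asyncio.sleep(0)
        yield filename, text


async def collect(corpus) -> list[tuple[str, str]]:
    """Drain an async iterable into a list (for demonstration only)."""
    return [item async for item in corpus]


docs = {"a.txt": "First document.", "b.txt": "Second document."}
print(asyncio.run(collect(corpus_from_dict(docs))))
```

Given a constructed `DocSearch` instance named `search`, such a corpus would be passed as `await search.add_documents(corpus_from_dict(docs))`.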

async __call__(query: str, top_k: int = 10) → pandas.Series

Searches for documents relevant to query. Retrieves the 2*top_k best matches from the vector database, together with the automatically thresholded responses from each of the Okapi and Cooccurrence indices, then fetches the text of these candidates and reranks them.

Parameters:
  • query (str) – Text to query for.

  • top_k (int, optional) – Number of results to return. The default is 10.

Returns:

The reranker scores of the top_k best matching documents, indexed by their filenames.

Return type:

pandas.Series
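The pool-then-rerank flow described for this method can be sketched in plain Python. Everything here is an illustrative assumption: the function `pool_and_rerank` is hypothetical, the three hit lists stand in for the vector, Okapi and Cooccurrence results, and a simple word-overlap score stands in for the reranking service. It returns a dict rather than a pandas.Series to stay dependency-free.

```python
def pool_and_rerank(
    query: str,
    vector_hits: list[str],
    okapi_hits: list[str],
    cooccurrence_hits: list[str],
    texts: dict[str, str],
    top_k: int = 10,
) -> dict[str, float]:
    """Pool candidates from three sources, score them, keep the top_k."""
    # Take at most 2*top_k vector matches, then add the other sources.
    # dict.fromkeys removes duplicates while preserving first-seen order.
    pooled = [*vector_hits[: 2 * top_k], *okapi_hits, *cooccurrence_hits]
    candidates = list(dict.fromkeys(pooled))

    query_words = set(query.lower().split())

    def score(filename: str) -> float:
        # Stand-in reranker: fraction of query words found in the document.
        doc_words = set(texts[filename].lower().split())
        return len(query_words & doc_words) / len(query_words) if query_words else 0.0

    ranked = sorted(candidates, key=score, reverse=True)[:top_k]
    return {filename: score(filename) for filename in ranked}


texts = {"a": "red apple", "b": "green pear", "c": "red green apple"}
print(pool_and_rerank("red apple", ["a", "b"], ["c"], ["a"], texts, top_k=2))
```

Pooling before reranking lets documents found by only one index still compete for the final top_k places. A dict of this shape converts directly to a Series with `pandas.Series(result)`.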

save(path: str)

Saves search indices

Parameters:

path (str) – Directory in which to save indices.

Return type:

None

async clear()

Clears indices

Return type:

None