Concepts

Introduction

A search engine is a complex software system that helps you find information in a given body of data. The search process involves numerous steps, and at Joyspace we take care of all the heavy lifting. Our search engine is truly multimodal: you can search inside your text, audio, video, and images with ease.

Below, we will discuss the various components of the search engine briefly. Then we introduce some concepts that you should be familiar with before you start using the Joyspace Search Engine.

Note

You don't have to do all of these steps to get started with the search engine. To use Joyspace Search, all you have to do is call our APIs and provide us with the data you want to search.

These steps are written for you to understand the problem better.

Important Concepts

Files

Joyspace Search Engine supports indexing of text, audio, video, and images. Within each modality we support numerous file formats. For example, for text search we support indexing of plain text, HTML, PDF, CSV, XML, JSON, Word, and so on.

Files are the raw files that you provide us with. For example, if you want to index a document in PDF format, then the .pdf file(s) that you provide us with are the Files.

The same logic extends to audio, video, and image data.

File Size

Now that you understand what a File is, file size is easy to understand: it is the number of bytes that the file contains.

Distinction

Note that most search engines measure your data by the size of the indices. This is usually a bad idea because you cannot know how big the index will be until you have indexed everything. Customers often get a surprise invoice once indexing is done. By that time they are also locked in, with no option but to pay hefty prices every month.

At Joyspace, we believe in transparency. Joyspace Search Engine measures the size of the data that you provide us with. This means that you know exactly how much data you are indexing and which plan is right for you.

Number of Files

Each index is associated with a number of files. This number is the number of files that you have provided us with. A single file you provide us is counted as one file.
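
As a rough illustration of the two measurements described above, the following Python sketch counts a set of local files and sums their sizes in bytes before you send anything for indexing. The helper name summarize_files is made up for this example and is not part of any Joyspace API; it relies only on the Python standard library.

```python
import os
import tempfile


def summarize_files(paths):
    """Return (number_of_files, total_size_in_bytes) for the given file paths."""
    sizes = [os.path.getsize(p) for p in paths]
    return len(sizes), sum(sizes)


# Demo with two temporary files so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, payload in [("a.txt", b"hello"), ("b.txt", b"search me")]:
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    file_count, total_bytes = summarize_files(paths)

print(file_count, total_bytes)  # prints: 2 14
```

Because both numbers depend only on the raw files, you can compute them on your side before choosing a plan.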

OCR

OCR is short for Optical Character Recognition. It is a technology that converts images of text into machine-readable text. Joyspace Search Engine supports OCR for text documents (such as scanned PDFs), images, and videos.

OCR From Text Data

Often, PDF files are scanned copies of the original document. This means that the text inside the PDF file is not available for searching. This is where OCR comes into play. OCR converts the scanned PDF file into text that is available for searching.

You do not have to do anything extra. We automatically detect whether the file is a scanned PDF file or not. If it is a scanned PDF file, then we automatically convert it into a text file.

OCR From Images Data

If you are indexing images, then you have to provide us with the scanned images. Joyspace Search Engine automatically detects whether the image contains any text. If it does contain text, then we automatically extract it, and it is available for searching.

Note that image search is powered by Embeddings. So for a given query, unless you specify otherwise, we search both the text extracted from the image and the image itself.

OCR From Videos Data

This is one of the pioneering features of Joyspace Search Engine. We support OCR for videos. We automatically detect whether the video contains any text. If it does contain text, then we automatically extract it, and it is available for searching.

In the results there is an ocr section, which contains the text extracted from the video. It is always accompanied by the start and end time of the Chapter.

Table Search

If the data you provide contains a table, the table is extracted and made available for searching. Note that each table is indexed as a single unit. We find it more useful to return the entire table in response to a query than a smaller chunk of it.

Topic Search

This is applicable only for audio and video data formats. We automatically detect the topics in the audio or video. You or your users can search for the topics directly and get the results.

Embeddings

At Joyspace, we have developed a cutting-edge AI technology to convert text, images, audio, and video into vector representations. These vector representations are called Embeddings. Embeddings have immense benefits. For example, they make it easy to compare two documents: even if the documents do not share the exact same words but express similar concepts, we can detect a high similarity between them. This allows us to craft search results with higher relevancy.
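
To make the idea of comparing vectors concrete, here is a minimal Python sketch that scores toy vectors with cosine similarity. Both the three-dimensional vectors and the choice of cosine similarity are assumptions made for illustration; this page does not state which similarity measure or vector size Joyspace actually uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy, hand-written "embeddings" (hypothetical values, not real model output).
running_shoes = [0.9, 0.1, 0.0]
jogging_sneakers = [0.8, 0.2, 0.1]
tax_forms = [0.0, 0.1, 0.9]

similar = cosine_similarity(running_shoes, jogging_sneakers)
unrelated = cosine_similarity(running_shoes, tax_forms)

# The two shoe-related vectors score higher even though the labels share no words.
print(similar > unrelated)  # prints: True
```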

Personalization

Joyspace Search Engine provides a personalized search experience. This means that you can get relevant results for a given query for each group. Learn more about how to use this feature in the Personalization section.

Personalization can be further broken down into two parts: Group Level Personalization and User Level Personalization.

Group Level Personalization is supported by default for all users of Joyspace Search Engine.

For User Level Personalization, you need to provide us with extra data at indexing and query time. Please contact us if you want to know more about user level personalization.

Query Insights

In the Joyspace analytics dashboard, we provide you with insights into the search queries and results. You can answer general questions such as "How many times was a particular query searched?" or "How many times was a particular result clicked?"

Moreover, you can also get insights derived by Joyspace AI technology. Our AI clusters the queries and search results, and provides you with insights into each cluster. For example, if you have a shoe store, you will get a cluster of queries and search results that are similar to "trending shoes".
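
The two general questions quoted above boil down to counting. The Python sketch below shows that counting over a small, invented search log. The log format and field names (query, clicked_result) are hypothetical and are not the format used by the Joyspace dashboard.

```python
from collections import Counter

# Hypothetical search log: one entry per search, with the clicked result if any.
search_log = [
    {"query": "running shoes", "clicked_result": "sku-101"},
    {"query": "trending shoes", "clicked_result": "sku-205"},
    {"query": "running shoes", "clicked_result": None},
    {"query": "running shoes", "clicked_result": "sku-101"},
]

# How many times was each query searched?
query_counts = Counter(entry["query"] for entry in search_log)

# How many times was each result clicked? Searches without a click are skipped.
click_counts = Counter(
    entry["clicked_result"]
    for entry in search_log
    if entry["clicked_result"] is not None
)

print(query_counts["running shoes"])  # prints: 3
print(click_counts["sku-101"])        # prints: 2
```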

Chapterization

This feature is for audio and video search. Chapterization is the process of dividing audio or video into smaller parts. For example, a 10-minute video can be divided into 10 chapters. Joyspace AI automatically detects the chapters, names them, and provides you with the chapter titles and text in the results.

Diarization

This feature is for audio and video search. Diarization means identifying the speaker of each segment. Joyspace AI automatically performs diarization, names the speakers whenever possible, and provides you with the speaker and text of who said what.

Transcript-based Search

This feature is for audio and video search. Joyspace AI transcribes all the data you have provided us with. Search queries are matched against the transcript, and results are returned with the matching transcript.
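
As a toy illustration of matching a query against a transcript, the Python sketch below runs a case-insensitive substring search over timestamped segments. The segment structure (start, end, text) and the function search_transcript are invented for this example, and plain substring matching is a stand-in for illustration only; this page does not describe how Joyspace performs the match.

```python
# Hypothetical transcript segments with start and end times in seconds.
segments = [
    {"start": 0.0, "end": 12.5, "text": "Welcome to the product demo."},
    {"start": 12.5, "end": 30.0, "text": "Let us walk through the pricing plans."},
    {"start": 30.0, "end": 41.0, "text": "Finally, a quick look at the roadmap."},
]


def search_transcript(segments, query):
    """Return the segments whose text contains the query, ignoring case."""
    needle = query.lower()
    return [seg for seg in segments if needle in seg["text"].lower()]


matches = search_transcript(segments, "Pricing")
for seg in matches:
    print(seg["start"], seg["end"], seg["text"])
```

Keeping the timestamps next to the text lets a result point back to the exact part of the recording that matched.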

Result Score

In Joyspace Search Engine, we provide you with insights into the search results. This is done by providing you with a score for each result. The score is a value between 0 and 1. The higher the score, the more relevant the result is.
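
One common way to use such a score is to drop weak matches and order the rest. The Python sketch below assumes each result is a dictionary with id and score fields. That shape, the function top_results, and the 0.5 cutoff are all hypothetical choices for illustration, not part of the Joyspace response format.

```python
# Hypothetical results; each score lies between 0 and 1 as described above.
results = [
    {"id": "doc-2", "score": 0.34},
    {"id": "doc-1", "score": 0.92},
    {"id": "doc-3", "score": 0.61},
]


def top_results(results, min_score=0.5):
    """Keep results scoring at least min_score, most relevant first."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


ranked = top_results(results)
print([r["id"] for r in ranked])  # prints: ['doc-1', 'doc-3']
```

A suitable cutoff depends on your data, so treat 0.5 as a placeholder to tune.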