Word & Sentence embeddings

 
Algorithms such as SIF, Siamese networks, attention, BERT, RNNs, Transformers, and word2vec are used to build applications like semantic search, deduplication, and multi-modal search.
Word and sentence embeddings are the backbone of LLMs.
 
Sentence embeddings take into account: the order of the words, the semantics of the language, and the actual meaning of the sentence.
 
Using transformers, attention mechanisms, and other cutting-edge algorithms, such an embedding maps every sentence to a vector of, for example, 4096 numbers.
 
Most word and sentence embeddings are dependent on the language that the model is trained on.
 
 
→ Previous approaches to turning text into numerical data: methods like TF-IDF, one-hot encoding, and bag of words (but these fail to capture context and semantic relatedness).
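A minimal sketch of why these earlier methods miss semantic relatedness, using scikit-learn's TfidfVectorizer (assumes scikit-learn is installed; the toy sentences are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two sentences with the same meaning but different words,
# plus one unrelated sentence.
docs = [
    "the movie was great",
    "the film was fantastic",
    "I ate a sandwich for lunch",
]

# TF-IDF builds a term-document matrix: one column per unique word.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# "movie"/"film" and "great"/"fantastic" are different tokens, so the first
# two sentences overlap only through the function words "the" and "was".
print(cosine_similarity(X[0], X[1]))  # modest score despite identical meaning
print(cosine_similarity(X[0], X[2]))  # zero: no shared terms at all
```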
 
Embeddings are fixed-length, multi-dimensional vectors
 

Distributional semantics - similar words appear in similar contexts

In what way they are 'similar' depends on what method was used to map the words to the space, but most commonly the algorithm maps the embeddings so that semantically (or to be even more accurate, distributionally) similar words have similar embeddings.
Word vectors, the end product of this process, encode semantic relations, such as the fact that the relationship between Paris and France is the same as the one between Berlin and Germany, and much more.
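A quick sketch of that Paris/France vs Berlin/Germany relation using gensim's downloader and pretrained GloVe vectors (assumes gensim is installed and the model can be downloaded; "glove-wiki-gigaword-100" is just one convenient choice of pretrained vectors):

```python
import gensim.downloader as api

# Download a small set of pretrained GloVe vectors (happens on first run).
vectors = api.load("glove-wiki-gigaword-100")

# Vector arithmetic: paris - france + germany should land near berlin,
# because the embedding space encodes the "capital of" relationship.
result = vectors.most_similar(positive=["paris", "germany"], negative=["france"], topn=1)
print(result)  # expected to be close to ('berlin', ...)
```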
 
Common approaches for computing word vectors include:
  • Word2Vec: the first approach to efficiently use neural networks to learn embeddings from large datasets
  • fastText: an efficient character-based model that can process out-of-vocabulary words (see the sketch after this list)
  • ELMo: deep, contextualized word representations that can handle polysemy (words with multiple meanings in different contexts)
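A hedged sketch of the out-of-vocabulary behaviour mentioned for fastText, using gensim's FastText implementation on a toy corpus (the corpus and hyperparameters are made up for illustration):

```python
from gensim.models import FastText

# Tiny toy corpus; in practice you would train on millions of sentences.
sentences = [
    ["embedding", "models", "map", "words", "to", "vectors"],
    ["character", "ngrams", "help", "with", "rare", "words"],
]

# fastText represents each word as a bag of character n-grams,
# so it can build a vector even for words never seen during training.
model = FastText(sentences, vector_size=32, window=3, min_count=1, epochs=50)

print(model.wv["embedding"].shape)   # (32,) — seen during training
print(model.wv["embeddings"].shape)  # (32,) — OOV, composed from its n-grams
```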
 
Because comparing embeddings reduces to a simple mathematical calculation (typically cosine similarity between vectors), sentence embeddings can be adapted for tasks such as semantic search, text clustering, intent detection, paraphrase detection, and the development of virtual assistants and smart-reply algorithms. Moreover, cross-lingual sentence embedding models can be used for parallel text mining or translation pair detection.
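That "simple mathematical calculation" is usually cosine similarity; a minimal sketch with made-up 3-dimensional vectors (real sentence embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for a query and two documents.
query     = np.array([0.9, 0.1, 0.0])
doc_close = np.array([0.8, 0.2, 0.1])
doc_far   = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(query, doc_close))  # high -> semantically related
print(cosine_similarity(query, doc_far))    # low  -> unrelated
```

Semantic search is then just: embed the query, embed the documents, and rank by this score.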
 
SOTA:
  • SkipThought: an adaptation of the Word2Vec idea to sentences that produces embeddings by learning to predict the sentences surrounding an encoded sentence
  • SentenceBERT: based on the popular BERT model, this framework combines the power of transformer architectures and twin (siamese) neural networks to create high-quality sentence representations (see the usage sketch after this list)
  • InferSent: produces sentence embeddings by training neural networks to identify semantic relationships between sentences in a supervised manner
  • Universal Sentence Encoder (USE): a collection of two models that leverages multi-task learning to encode sentences into highly generic sentence vectors that are easily adaptable for a wide range of NLP tasks
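A short usage sketch of SentenceBERT via the sentence-transformers library (the checkpoint name "all-MiniLM-L6-v2" is just one commonly used example; weights are downloaded on first run):

```python
from sentence_transformers import SentenceTransformer, util

# Load a pretrained SBERT-style model.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A man is playing a guitar.",
    "Someone is strumming an instrument.",
    "The stock market fell sharply today.",
]

# Each sentence becomes one fixed-length vector (384 dimensions for this checkpoint).
embeddings = model.encode(sentences)

# The paraphrase pair scores high; the unrelated pair scores low.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```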
 
 
How can it be used for classification? Applications include text classification, sentiment analysis, text generation, text similarity, text summarization (by selecting and combining the most important pieces), and question answering. In effect, the embedding space groups similar sentences together, so a query such as a review can be classified as good or bad depending on which group its embedding falls into.
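A minimal sketch of that good/bad classification on top of sentence embeddings, assuming sentence-transformers and scikit-learn are available (the model name, texts, and labels are made up for illustration):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

# Tiny toy training set; a real classifier needs far more labelled data.
texts  = ["I loved this movie", "Absolutely fantastic film",
          "Terrible, a waste of time", "I hated every minute"]
labels = ["good", "good", "bad", "bad"]

# Sentences from the same class land near each other in embedding space,
# so even a simple linear classifier can separate the groups.
clf = LogisticRegression().fit(model.encode(texts), labels)

print(clf.predict(model.encode(["What a wonderful story"])))  # expected: ['good']
```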
 
 
Disadvantages:
Dependence on pre-trained models: Many sentence embedding techniques rely on pre-trained models that have been trained on large amounts of text data. While these models can be very effective, they may not always capture the nuances of specific domains or languages, and may not perform as well on tasks that are significantly different from those they were trained on.
Limited interpretability: Sentence embeddings are typically represented as fixed-length vectors of real numbers, which can be difficult to interpret and understand. This can make it challenging to understand why a particular sentence embedding was generated or what it represents.
Sensitivity to input order: Some sentence embedding techniques are sensitive to the order of words in a sentence, and may generate different embeddings for the same sentence with the words rearranged. This can be a disadvantage in cases where the order of words is not important or where the input text may have been scrambled.
Limitations on context: Some sentence embedding techniques may not capture the full context of a sentence, such as its relationship to surrounding sentences or its broader meaning in the context of a document or conversation. This can be a limitation in tasks that require a deeper understanding of the context in which a sentence appears.
 
 
 
Tools:
Sentence-BERT (SBERT): SBERT is a pre-trained transformer model that encodes the meaning of a sentence into a fixed-length vector. It is trained on a large dataset of natural language sentences and has achieved state-of-the-art performance on a variety of NLP tasks.
Universal Sentence Encoder (USE): USE is a pre-trained transformer model that encodes the meaning of a sentence into a fixed-length vector. It is trained on a diverse range of texts and can be fine-tuned for specific tasks.
FastText: FastText is an open-source library for creating word and sentence embeddings. It provides a variety of algorithms for learning word and sentence embeddings from large amounts of text data, and is designed to be fast and efficient.
Gensim: Gensim is an open-source library for creating and working with word and sentence embeddings. It provides a variety of algorithms for learning embeddings from text data, and also includes utilities for loading and working with pre-trained embedding models.
spaCy: spaCy is an open-source natural language processing library that includes tools for creating and working with word and sentence embeddings. It provides a variety of algorithms for learning embeddings from text data, and also includes utilities for loading and working with pre-trained embedding models.
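A small sketch using spaCy's pretrained vectors (assumes a pipeline with word vectors has been downloaded, e.g. via `python -m spacy download en_core_web_md`):

```python
import spacy

# A medium English pipeline that ships with static word vectors.
nlp = spacy.load("en_core_web_md")

doc1 = nlp("The cat sat on the mat.")
doc2 = nlp("A kitten rested on the rug.")

# Doc.vector averages the token vectors; Doc.similarity compares two docs.
print(doc1.vector.shape)      # e.g. (300,)
print(doc1.similarity(doc2))  # relatively high for related sentences
```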
 
 
Sentence encoding - the output can be one vector or multiple vectors. A sentence encoder can embed a whole sentence as one vector; doc2vec, for example, generates one vector per sentence (or paragraph). BERT also generates a representation for the whole sentence, the [CLS] token.
Contextualized word embedding vs traditional word embedding • A contextualized word embedding is a vector representing a word in a specific context. Traditional word embeddings such as Word2Vec and GloVe generate one vector for each word, whereas a contextualized word embedding generates a vector for a word depending on the context. Consider the sentences "The duck is swimming" and "You shall duck when someone shoots at you". With traditional word embeddings, the word vector for "duck" would be the same in both sentences, whereas it should be a different one in the contextualized case.
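A hedged sketch of exactly that "duck" example with a contextual model, assuming the transformers library and PyTorch are installed and bert-base-uncased can be downloaded:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def duck_vector(sentence: str) -> torch.Tensor:
    """Return the contextualized vector of the token 'duck' in this sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index("duck")]

v1 = duck_vector("The duck is swimming.")
v2 = duck_vector("You shall duck when someone shoots at you.")

# The same word gets different vectors depending on context; a static
# embedding like Word2Vec would return the identical vector in both cases.
print(torch.cosine_similarity(v1, v2, dim=0))
```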
 
 
How embedding works -
Methods based on neural networks let us generalize this process and break the restrictions of LSA. To get embeddings, we just need to:
  1. Encode an input as a vector.
  2. Measure the distance between two vectors.
  3. Provide a ton of training data where we know which inputs should be closer and which should be farther.
 
The simplest way to do the encoding is to build a map from unique input values to randomly initialized vectors, then adjust the values of these vectors during training.
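A minimal PyTorch sketch of that idea, using a lookup table of randomly initialized vectors and a toy training signal (the vocabulary, pairs, and hyperparameters are made up for illustration):

```python
import torch
import torch.nn as nn

# A vocabulary of unique input values, each mapped to a randomly initialized vector.
vocab = {"cat": 0, "kitten": 1, "car": 2}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)
cos = nn.CosineSimilarity(dim=-1)

# Toy training signal: "cat" and "kitten" should be close, "cat" and "car" far apart.
for _ in range(200):
    cat, kitten, car = embedding(torch.tensor([0, 1, 2]))
    # Pull the positive pair together, push the negative pair apart.
    loss = (1 - cos(cat, kitten)) + cos(cat, car)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

vecs = embedding(torch.tensor([0, 1, 2]))
print(cos(vecs[0], vecs[1]).item())  # high after training
print(cos(vecs[0], vecs[2]).item())  # low after training
```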
 
 
Tools like Midjourney and DALL-E interpret text instructions by learning to embed images and prompts into a shared embedding space. And a similar approach has been used for natural language instructions in robotics.
 
 
A brief explanation of TF-IDF, Doc2vec, and InferSent:
  • TF-IDF — Classical information retrieval method that creates a term-document matrix. It is known for its simplicity and speed, but it falls short when it tries to capture the semantics of the document, not taking into account the similarity between words.
  • Doc2vec — This algorithm (also known as ParagraphVector) was proposed in 2014 by Quoc Le and Tomas Mikolov, both research scientists at Google at the time. It is based on Word2vec and follows the same principle of training a model that predicts a word from its surrounding words, while also learning a vector for the document itself (see the sketch after this list).
  • InferSent — Sentence embedding method presented by Facebook AI Research in 2017. Just like Sentence-BERT, it uses a siamese network, but instead of BERT it utilizes a bi-LSTM, a recurrent neural network with memory, to encode the whole sentence.
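A short sketch of Doc2vec with gensim on a toy corpus (the documents, tags, and hyperparameters are made up for illustration):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny toy corpus; each document gets a unique tag.
corpus = [
    TaggedDocument(words=["word", "embeddings", "capture", "meaning"], tags=[0]),
    TaggedDocument(words=["doc2vec", "learns", "a", "vector", "per", "document"], tags=[1]),
]

model = Doc2Vec(corpus, vector_size=32, min_count=1, epochs=50)

# One fixed-length vector per training document...
print(model.dv[0].shape)  # (32,)
# ...plus an inference step for new, unseen documents.
print(model.infer_vector(["embeddings", "for", "a", "new", "document"]).shape)  # (32,)
```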
 
 
 
The later parts of the network are the same for an RNN as for a transformer: some feed-forward NN layers. It's the input to these layers that changed. The dense embeddings created by transformer models are so much richer in information that we get massive performance benefits despite using the same final output layers.
 
 
Transformers gained over RNNs because of: → positional encoding → self-attention → multi-head attention
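A minimal numpy sketch of the self-attention core (scaled dot-product attention); a real transformer computes Q, K, V with learned linear projections and repeats this per head, which is omitted here:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position attends to every other position, weighted by relevance."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise relevance scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ V                              # weighted mix of value vectors

# Toy example: 4 tokens, 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8) — every token is now a context-aware mixture
```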
 
 
To make sentence embeddings, we use 1) Python 2) Cython 3)