In 2015, Google released FaceNet, a model for face recognition and clustering. FaceNet uses a deep convolutional neural network that directly optimizes an embedding rather than class predictions. In a similar fashion, this blog post explores how the same principle could be applied to text: extracting embeddings for context detection and clustering of text data.

### Data

Appropriate passage data for the model was hard to acquire. Ideally, we would need passages that express the same context paraphrased in multiple ways, each labeled with its context. Since we could not find such a dataset (let me know if you know of one), we used the UCI News Aggregator Dataset.

### Architecture

For the architecture, we used BERT to generate embeddings and augmented it with a 1D convolutional layer. The resulting embedding is then optimized with a triplet loss. Here’s how we calculate the triplet loss:

- Select an anchor text embedding, \(t_{a}\).
- Let \(t_{p}\) be a positive text embedding (same context as the anchor) and \(t_{n}\) a negative text embedding (different context).
- Select hard positives and hard negatives as follows:
  - \(\text{argmax}_{t_{p}^{i}} \vert t_{a}^{i} - t_{p}^{i} \vert\)
  - \(\text{argmin}_{t_{n}^{i}} \vert t_{a}^{i} - t_{n}^{i} \vert\)
- The loss being minimized is then \(L = \sum_{i}^{N} \left[ \vert t_{a}^{i} - t_{p}^{i} \vert - \vert t_{a}^{i} - t_{n}^{i} \vert + \alpha \right]_{+}\), where \(\alpha\) is the margin and \([x]_{+} = \max(x, 0)\).
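
To make the hard-example selection concrete, here is a minimal NumPy sketch of a batch-hard triplet loss. It is not the code used for this post: it assumes Euclidean distance, the standard hinged FaceNet-style form (anchor-positive distance minus anchor-negative distance plus a margin, clipped at zero), and mining within a single batch. The function name `batch_hard_triplet_loss` and the toy data are made up for illustration.

```python
import numpy as np

def batch_hard_triplet_loss(emb, labels, alpha=0.2):
    """Sum of hinged triplet losses using the hardest positive/negative per anchor.

    Assumptions (not from the post): Euclidean distance, in-batch mining.
    """
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)

    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)

    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)  # same context, not itself
    neg_mask = ~same                                    # different context

    # Hard positive: farthest same-context embedding (argmax).
    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    # Hard negative: closest different-context embedding (argmin).
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)

    # Skip anchors that have no positive or no negative in the batch.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    per_anchor = np.maximum(hardest_pos[valid] - hardest_neg[valid] + alpha, 0.0)
    return per_anchor.sum()

# Toy batch: two contexts whose embeddings overlap on a line.
toy_emb = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
toy_labels = [0, 0, 1, 1]
print(batch_hard_triplet_loss(toy_emb, toy_labels, alpha=0.5))  # prints 10.0
```

In the toy batch the two contexts are interleaved, so every anchor violates the margin and contributes to the loss. Once the contexts are well separated, each per-anchor term is clipped to zero.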

### Preliminary Results

The authors of the original FaceNet paper specify a validation metric as follows:

- \(TA(d) = \{(i,j) \in P_{same}, D(t_{i},t_{j}) \leq d\}\)
- \(VAL(d) = \frac{\vert TA(d) \vert}{\vert P_{same} \vert}\)
- where \(P_{same}\) is the set of text pairs belonging to the same context, \(d\) is the distance threshold, and \(D(x,y)\) is the distance function.
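
As a rough illustration of the metric above, the sketch below counts the same-context pairs whose distance falls within the threshold and divides by the total number of same-context pairs. The distance function \(D\) is assumed to be Euclidean here, which the definition does not fix. The helper name `val_rate` and the toy embeddings are hypothetical.

```python
from itertools import combinations

import numpy as np

def val_rate(emb, labels, d):
    """VAL(d): fraction of same-context pairs with distance <= d.

    Assumption (not from the post): D is the Euclidean distance.
    """
    emb = np.asarray(emb, dtype=float)

    # P_same: all unordered pairs (i, j) that share a context label.
    same_pairs = [
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    ]
    if not same_pairs:
        raise ValueError("no same-context pairs to evaluate")

    # TA(d): same-context pairs accepted at threshold d.
    true_accepts = sum(
        1 for i, j in same_pairs if np.linalg.norm(emb[i] - emb[j]) <= d
    )
    return true_accepts / len(same_pairs)

# Toy data: one tight context and one looser context.
toy_emb = [[0.0, 0.0], [0.05, 0.0], [1.0, 0.0], [1.2, 0.0]]
toy_labels = [0, 0, 1, 1]
print(val_rate(toy_emb, toy_labels, d=0.1))  # prints 0.5
```

Only the tight pair is accepted at \(d = 0.1\), so half of the same-context pairs count as true accepts.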

We obtain \(VAL(d) = 0.803\) at a threshold of \(d = 0.0670\).

### Conclusion

I’ll write another blog post comparing the TextNet architecture with other embedding frameworks. Any criticism is appreciated.