Poppins - A lightweight document classifier
Poppins is a text classifier: you can use it to sort texts according to many different criteria, depending on how you train it. Training means showing the software example documents for each category in an arbitrary set of categories. You then give it a separate set of documents to be classified.
It is actually pretty simple. Say you have a bunch of cooking recipes and a bunch of instruction manuals for tape recorders. These are two categories of documents, each with some examples. Now imagine you have a third set of documents that mixes texts from the two categories. These are the texts you want to classify.
You can use this program to classify documents in any language and you can have any number of categories.
Some users have reported success in classifying documents by topic, language, genre, and author. You are encouraged to try new ideas: no one knows what other criteria might work for classifying text.
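The train-then-classify workflow with recipes and manuals can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses plain word counts and cosine similarity, which is not necessarily the method Poppins uses, and the function names (`tokenize`, `train`, `cosine`, `classify`) and the sample texts are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def train(examples):
    """Build one word-frequency profile per category.

    `examples` maps a category name to a list of example texts.
    """
    profiles = {}
    for category, texts in examples.items():
        counts = Counter()
        for text in texts:
            counts.update(tokenize(text))
        profiles[category] = counts
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, profiles):
    """Return the category whose profile is most similar to `text`."""
    vector = Counter(tokenize(text))
    return max(profiles, key=lambda category: cosine(vector, profiles[category]))


# Invented example texts: two categories with a couple of examples each.
examples = {
    "recipes": [
        "Chop the onions and fry them in olive oil.",
        "Bake the cake in the oven for forty minutes.",
    ],
    "manuals": [
        "Press the record button to start the tape.",
        "Insert the cassette and rewind the tape before recording.",
    ],
}
profiles = train(examples)
print(classify("Fry the fish in oil and serve with onions.", profiles))
print(classify("Press the button to rewind the cassette.", profiles))
```

Because the profiles are built from whatever words the examples contain, the same sketch works for any number of categories and makes no assumption about the language of the texts beyond word-like tokens.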
History
Poppins has been online since 2005. It was created by Rogelio Nazar as a final assignment for an NLP course at the Universitat Politècnica de Catalunya, in Barcelona, and it has been used by few people over the years.
The graphic design was done back then by Sebastián Márques and Alejandro Bevaqua.
Now (13 October 2023), with the help of Nicolás Acosta, we are launching a new version with a friendlier interface than the original. The original is still online, but we would like to retire it as soon as possible.
Here we show some examples.
Example with the Federalist Papers
The first experiment we ran was with the Federalist Papers, a famous case of disputed authorship (learn more about the case on Wikipedia). If you would like to repeat the experiment yourself, you can download the corpus from our server. It is a single zip file containing one zip per category (in this case, per author) plus one called test, which contains the disputed essays to be classified.
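A corpus laid out this way, with one outer zip holding one inner zip per category plus a test zip, can be read with Python's standard `zipfile` module. The sketch below is a hedged illustration: the inner archive names (`hamilton.zip`, `test.zip`), the helper names `read_corpus` and `make_zip`, and the file contents are assumptions made for the demo, not the actual contents of the corpus on the server.

```python
import io
import zipfile


def read_corpus(source):
    """Read a zip of zips into {set name: {file name: text}}.

    `source` is a path or a file-like object. Each inner archive is assumed
    to be named after its category (e.g. `hamilton.zip`), with `test.zip`
    holding the documents to classify.
    """
    corpus = {}
    with zipfile.ZipFile(source) as outer:
        for inner_name in outer.namelist():
            if not inner_name.lower().endswith(".zip"):
                continue  # ignore anything that is not an inner archive
            set_name = inner_name.rsplit("/", 1)[-1][:-4]
            with zipfile.ZipFile(io.BytesIO(outer.read(inner_name))) as inner:
                corpus[set_name] = {
                    name: inner.read(name).decode("utf-8", errors="replace")
                    for name in inner.namelist()
                    if not name.endswith("/")  # skip directory entries
                }
    return corpus


def make_zip(files):
    """Build an in-memory zip from {file name: text}; used for the demo only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


# Tiny stand-in corpus with invented contents, mirroring the described layout.
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as archive:
    archive.writestr("hamilton.zip", make_zip({"hamilton.txt": "example text"}))
    archive.writestr("test.zip", make_zip({"unknown1.txt": "disputed text"}))
outer.seek(0)

corpus = read_corpus(outer)
test_documents = corpus.pop("test")
print(sorted(corpus))          # training categories
print(sorted(test_documents))  # documents to classify
```

To try it on a downloaded corpus, pass the path of the zip file to `read_corpus` instead of the in-memory buffer.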
Training...
[Table: one column per training document, one row per class; cell contents missing.]
Documents: hamilton.txt, hammad.txt, jay.txt, madison.txt
Classes: hamilton, hammad, jay, madison
Classifying...
[Table: one column per document to classify, one row per class; cell contents missing.]
Documents: unknown1.txt, unknown10.txt, unknown11.txt, unknown12.txt, unknown2.txt, unknown3.txt, unknown4.txt, unknown5.txt, unknown6.txt, unknown7.txt, unknown8.txt, unknown9.txt
Classes: hamilton, hammad, jay, madison
Finished.
Another example with authorship attribution
The following is another case of authorship attribution, and is the result of a collaboration with Marta Sánchez Pol. We experimented with a dataset she compiled for her thesis, which consisted of many short texts by a variety of authors.
Training...
[Table: one column per training document, one row per class; cell contents missing.]
Documents:
1_01.txt, 1_010.txt, 1_02.txt, 1_03.txt, 1_04.txt, 1_05.txt, 1_06.txt, 1_07.txt, 1_08.txt, 1_09.txt,
2_01.txt, 2_010.txt, 2_02.txt, 2_03.txt, 2_04.txt, 2_05.txt, 2_06.txt, 2_07.txt, 2_08.txt, 2_09.txt,
3_01.txt, 3_010.txt, 3_02.txt, 3_03.txt, 3_04.txt, 3_05.txt, 3_06.txt, 3_07.txt, 3_08.txt, 3_09.txt,
5_01.txt, 5_010.txt, 5_02.txt, 5_03.txt, 5_04.txt, 5_05.txt, 5_06.txt, 5_07.txt, 5_09.txt,
6_01.txt, 6_010.txt, 6_02.txt, 6_03.txt, 6_04.txt, 6_05.txt, 6_06.txt, 6_07.txt, 6_08.txt, 6_09.txt
Classes: 1, 2, 3, 5, 6
Classifying...
[Table: one column per document to classify, one row per class; cell contents missing.]
Documents:
1_011.txt, 1_012.txt, 1_013.txt, 1_014.txt, 1_015.txt, 1_016.txt, 1_017.txt, 1_018.txt, 1_019.txt, 1_020.txt,
2_011.txt, 2_012.txt, 2_013.txt, 2_014.txt, 2_015.txt, 2_016.txt, 2_017.txt, 2_018.txt, 2_019.txt, 2_020.txt,
3_011.txt, 3_012.txt, 3_013.txt, 3_014.txt, 3_015.txt, 3_016.txt, 3_017.txt, 3_018.txt, 3_019.txt, 3_020.txt,
5_011.txt, 5_012.txt, 5_013.txt, 5_014.txt, 5_015.txt, 5_016.txt, 5_017.txt, 5_018.txt, 5_019.txt, 5_20.txt,
6_011.txt, 6_012.txt, 6_013.txt, 6_14.txt, 6_15.txt, 6_16.txt, 6_17.txt, 6_18.txt, 6_19.txt, 6_20.txt
Classes: 1, 2, 3, 5, 6
Finished.
A third example classifying by topic
In this case we classify texts by topic: economics, medicine, and computer science. Again, the performance is pretty good.
Training...
[Table: one column per training document, one row per class; cell contents missing.]
Documents:
economia1.txt, economia11.txt, economia12.txt, economia13.txt, economia2.txt, economia3.txt,
informatica1.txt, informatica10.txt, informatica12.txt, informatica14.txt, informatica2.txt, informatica3.txt,
medicina1.txt, medicina10.txt, medicina100.txt, medicina101.txt, medicina103.txt, medicina104.txt, medicina105.txt, medicina11.txt
Classes: economia, informatica, medicina
Classifying...
[Table: one column per document to classify, one row per class; cell contents missing.]
Documents:
economia14.txt, economia16.txt, economia17.txt, economia18.txt,
informatica16.txt, informatica17.txt, informatica18.txt, informatica19.txt, informatica23.txt, informatica24.txt, informatica25.txt,
medicina105.txt, medicina106.txt, medicina107.txt, medicina109.txt, medicina114.txt
Classes: economia, informatica, medicina
Finished.
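When the file names carry the true category, as they appear to in the last two examples (`economia14.txt`, `1_011.txt`), performance can be measured as the fraction of documents assigned to the class named in their prefix. The sketch below assumes that naming convention; the helper names `true_class` and `accuracy` are hypothetical, and the predictions are invented for illustration and are not Poppins results.

```python
import re


def true_class(file_name):
    """Guess the true class from a name such as `economia14.txt` or `1_011.txt`."""
    match = re.match(r"([A-Za-z]+|\d+)", file_name)
    return match.group(1) if match else None


def accuracy(predictions):
    """Fraction of documents whose predicted class matches the name prefix."""
    if not predictions:
        return 0.0
    hits = sum(1 for name, label in predictions.items() if true_class(name) == label)
    return hits / len(predictions)


# Invented predictions, for illustration only; these are not Poppins results.
predictions = {
    "economia14.txt": "economia",
    "informatica16.txt": "informatica",
    "medicina106.txt": "medicina",
    "medicina107.txt": "economia",
}
print(accuracy(predictions))
```

This convention does not apply to the Federalist Papers test set, where the files are named `unknown1.txt` and so on because the true author is the open question.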