# ZipPy: Fast method to classify text as AI or human-generated

This is a research repo for fast AI detection using compression. While a number of LLM detection systems exist, they all use a large model trained on an LLM (or its training data) to calculate the probability of each word given the preceding context, then compute a score in which a higher proportion of high-probability tokens indicates AI-originated text. The techniques and tools in this repo aim for a faster approximation that is embeddable and more scalable.
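As a toy illustration of the perplexity-style scoring such detectors use (the per-token probabilities below are made up for illustration, not produced by a real model):

```python
import math

def mean_log_prob(token_probs):
    """Average log-probability across tokens; a value closer to zero
    means the text was more predictable to the model (lower perplexity)."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical model outputs: AI-generated text tends to consist of
# high-probability tokens, human text of more surprising ones.
ai_like = [0.9, 0.8, 0.85, 0.9]
human_like = [0.3, 0.05, 0.6, 0.1]
assert mean_log_prob(ai_like) > mean_log_prob(human_like)
```

A full detector would obtain these probabilities from a language model, which is exactly the expensive step ZipPy tries to avoid.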

## LZMA compression detector (zippy.py and nlzmadetect)

ZipPy uses LZMA compression ratios as a way to indirectly measure the perplexity of a text. Compression ratios have been used in the past to detect anomalies in network data for intrusion detection, so if perplexity is roughly a measure of anomalous tokens, it may be possible to use compression to detect low-perplexity text. LZMA builds a dictionary of seen tokens and then uses those in place of future occurrences. The dictionary size, token length, etc. are all dynamic (though influenced by the 'preset' of 0-9, with 0 being the fastest but offering worse compression than 9).

The basic idea is to 'seed' an LZMA compression stream with a corpus of AI-generated text (ai-generated.txt) and then compare the compression ratio of just the seed data with that of the seed plus the sample appended. Samples that follow the seed more closely in word choice, structure, etc. will achieve a higher compression ratio due to the prevalence of similar tokens in the dictionary; novel words and structures will appear anomalous to the seeded dictionary, resulting in a worse compression ratio.
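The seed-and-compare idea can be sketched in a few lines with Python's standard `lzma` module. This is a minimal sketch, not the actual zippy.py implementation: the inline `SEED` string stands in for the ai-generated.txt corpus, and the zero threshold and returned score format are assumptions.

```python
import lzma

# Hypothetical seed corpus of AI-generated text (stands in for ai-generated.txt)
SEED = "The rapid advancement of artificial intelligence has transformed many industries. " * 50

def compression_ratio(data: bytes, preset: int = 2) -> float:
    """Ratio of original size to LZMA-compressed size (higher = more compressible)."""
    return len(data) / len(lzma.compress(data, preset=preset))

def classify(sample: str, threshold: float = 0.0) -> tuple:
    """Compare the seed's ratio against the ratio of seed + sample.

    If appending the sample improves the ratio, its tokens resemble the
    seeded dictionary, suggesting low-perplexity (AI-like) text.
    """
    baseline = compression_ratio(SEED.encode())
    combined = compression_ratio((SEED + sample).encode())
    delta = combined - baseline
    label = "AI" if delta > threshold else "Human"
    return label, abs(delta)
```

Because only two LZMA passes over a small buffer are needed, this runs in milliseconds with no model weights, which is the scalability argument above.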

## Current evaluation

Some of the leading LLM detection tools are OpenAI's model detector (v2), GPTZero, CrossPlag's AI detector, and a RoBERTa-based classifier. Here each of them is compared with the LZMA detector across the test datasets:

![ROC curve of detection tools](ai_detect_roc.png)

## Usage

ZipPy will read files passed as command-line arguments, or will read from stdin so that text can be piped to it.

```
$ python3 zippy.py -h
usage: zippy.py [-h] [-s | sample_files ...]

positional arguments:
  sample_files  Text file(s) containing the sample to classify

options:
  -h, --help    show this help message and exit
  -s            Read from stdin until EOF is reached instead of from a file
$ python3 zippy.py samples/human-generated/about_me.txt
samples/human-generated/about_me.txt
('Human', 0.06013429262166636)
```

If you want to use the ZipPy technology in your browser, check out the Chrome extension that runs ZipPy in-browser to flag potentially AI-generated content.