NLTK Named Entity Recognition with Custom Data




I'm trying to extract named entities from my text using NLTK. I find that NLTK NER is not very accurate for my purpose and I want to add some more tags of my own as well. I've been trying to find a way to train my own NER, but I don't seem to be able to find the right resources.
I have a couple of questions regarding NLTK-



I would really appreciate help in this regard.




4 Answers



Are you committed to using NLTK/Python? I ran into the same problems as you, and had much better results using Stanford's named-entity recognizer: http://nlp.stanford.edu/software/CRF-NER.shtml. The process for training the classifier using your own data is very well-documented in the FAQ.



If you really need to use NLTK, I'd hit up the mailing list for some advice from other users: http://groups.google.com/group/nltk-users.



Hope this helps!





Browsing through the SNER site, I saw that there's even a python interface here. Not sure how mature it is, but it might be helpful.
– senderle
Jul 9 '12 at 20:05





I had the same problem and shared what worked for me. Sorry if that upset you bro :(
– jjdubs
Sep 4 '12 at 22:13





The Stanford NER has been included in NLTK 2.0. Read More - nltk.org/api/nltk.tag.html#module-nltk.tag.stanford
– Jayesh
Feb 16 '14 at 11:53





Guys, here's a script I wrote to download and prepare everything required to get Python, NLTK and Stanford NER working together -- gist.github.com/troyane/c9355a3103ea08679baf
– troyane
Jun 9 '14 at 10:51





Does anyone know how to use the Python-Stanford NER interface to train on new corpora?
– user3314418
Aug 5 '14 at 15:13



You can easily use the Stanford NER along with NLTK.
A Python script would look like this:


from nltk.tag.stanford import NERTagger  # renamed StanfordNERTagger in newer NLTK releases
import os

# Tell NLTK where to find the Java runtime (adjust the path for your machine).
java_path = "/Java/jdk1.8.0_45/bin/java.exe"
os.environ['JAVAHOME'] = java_path

# Load your trained model together with the Stanford NER jar.
st = NERTagger('../ner-model.ser.gz', '../stanford-ner.jar')

text = "Rami Eid is studying at Stony Brook University in NY"
tagging = st.tag(text.split())



To train on your own data and create a model, you can refer to the first question in the Stanford NER FAQ.



The link is http://nlp.stanford.edu/software/crf-faq.shtml
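For reference, the workflow described in that FAQ boils down to three things: a tab-separated training file (token<TAB>label, one token per line, O for non-entities), a properties file, and a call to the CRFClassifier that ships in stanford-ner.jar. Below is a minimal sketch of that workflow driven from Python; the file names, the example tokens and the exact set of feature flags are only illustrative, so check the FAQ for the recommended properties.

import subprocess

# 1. Training data: "token<TAB>label", one token per line, O for tokens that
#    are not part of any entity, and a blank line between sentences.
train_tsv = "John\tPERSON\nlives\tO\nin\tO\nBangalore\tLOCATION\n.\tO\n"
with open("my-training-data.tsv", "w") as f:
    f.write(train_tsv)

# 2. Properties file: where the training data is, where to write the model,
#    and which columns hold the word and the answer (plus common feature flags).
props = """
trainFile = my-training-data.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
usePrevSequences = true
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
maxLeft = 1
wordShape = chris2useLC
"""
with open("my-ner.prop", "w") as f:
    f.write(props)

# 3. Train the CRF model; this produces ner-model.ser.gz, which can then be
#    loaded by the NERTagger/StanfordNERTagger code shown above.
subprocess.run([
    "java", "-cp", "stanford-ner.jar",
    "edu.stanford.nlp.ie.crf.CRFClassifier",
    "-prop", "my-ner.prop",
], check=True)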



I also had this issue, but I managed to work it out.
You can use your own training data. I documented the main requirements/steps for this in my GitHub repository.



I used NLTK-trainer, so basically you have to get the training data in the right format (token NNP B-tag) and run the training script. Check my repository for more info.
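To make that layout concrete, here is a small, made-up example of the token/POS/IOB triples it corresponds to, round-tripped through NLTK's CoNLL helpers (the sentence, tags and entity labels are invented):

from nltk.chunk import conlltags2tree, tree2conlltags

# One training sentence in (token, POS, IOB-tag) form.
conll_triples = [
    ("John", "NNP", "B-PERSON"),
    ("works", "VBZ", "O"),
    ("at", "IN", "O"),
    ("Acme", "NNP", "B-ORGANIZATION"),
    ("Corp", "NNP", "I-ORGANIZATION"),
    (".", ".", "O"),
]

tree = conlltags2tree(conll_triples)           # nltk.Tree with PERSON/ORGANIZATION chunks
print(tree)
assert tree2conlltags(tree) == conll_triples   # round-trips back to the triples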



There are some functions in the nltk.chunk.named_entity module that train an NER tagger. However, they were written specifically for the ACE corpus and are not entirely cleaned up, so you will need to write your own training procedure using them as a reference.
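As a starting point for such a hand-rolled procedure, here is a rough sketch in the style of the NLTK book (not the nltk.chunk.named_entity code itself): an IOB chunker learned from chunk trees. The train_trees variable is assumed to be a list of chunk trees, e.g. built with conlltags2tree from data like the example above.

from nltk.chunk import ChunkParserI, conlltags2tree, tree2conlltags
from nltk.tag import UnigramTagger, BigramTagger

class IOBChunker(ChunkParserI):
    def __init__(self, train_trees):
        # Learn a mapping from POS tags to IOB entity tags.
        train_data = [
            [(pos, iob) for (word, pos, iob) in tree2conlltags(tree)]
            for tree in train_trees
        ]
        self.tagger = BigramTagger(train_data, backoff=UnigramTagger(train_data))

    def parse(self, tagged_sentence):
        # tagged_sentence is a list of (word, POS) pairs, e.g. from nltk.pos_tag().
        pos_tags = [pos for (word, pos) in tagged_sentence]
        iob_tags = [iob or "O" for (pos, iob) in self.tagger.tag(pos_tags)]
        conll = [(w, p, t) for ((w, p), t) in zip(tagged_sentence, iob_tags)]
        return conlltags2tree(conll)

# Usage (train_trees is your own chunked corpus):
# chunker = IOBChunker(train_trees)
# print(chunker.parse(nltk.pos_tag(nltk.word_tokenize("John works at Acme Corp."))))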



There are also two relatively recent guides (1 2) online detailing the process of using NLTK to train on the GMB corpus.



However, as mentioned in the answers above, now that many tools are available, you really should not need to resort to NLTK if a streamlined training process is what you want. Toolkits such as CoreNLP and spaCy do a much better job. Since using NLTK is not much different from writing your own training code from scratch, there is not much value in doing so. NLTK and OpenNLP can be regarded as belonging to an earlier era, before the explosion of recent progress in NLP.
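For comparison, here is what training a custom NER model looks like in spaCy. This is a minimal sketch using the spaCy 2.x-style API (spaCy 3 moved to a config-driven training workflow), and the example sentence, labels and character offsets are made up:

import random
import spacy

# Training data: (text, {"entities": [(start_char, end_char, label), ...]})
TRAIN_DATA = [
    ("John works at Acme Corp in Bangalore",
     {"entities": [(0, 4, "PERSON"), (14, 23, "ORG"), (27, 36, "GPE")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")     # spaCy 2.x API
nlp.add_pipe(ner)
for _, ann in TRAIN_DATA:
    for _start, _end, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)

doc = nlp("Jane also works at Acme Corp")
print([(ent.text, ent.label_) for ent in doc.ents])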





