Match word of specific length




How do I match a word of a specific length, let's say, five?



Given the input file temp containing this text:


1) ci sono quattro mele
2) sentiamoci il 16 ottobre 2018
3) decidiamo il 17 ottabre 2017
4) Manipolo di eroi
5) 17 mele
6) 18 ott 2020
7) una mela e mezza
8) 2 mele



If I run:


awk '/[[:lower:]]{5}/ {print}' temp



I would expect only line 7) as output, because it is the only one containing a word of length 5 (mezza). Instead, the command returns every line containing a word of length 5 or more.

This behavior contradicts every source of information I consulted: the construct {n} should match exactly n times. At this point, I am afraid I am missing something obvious.





Possible duplicate of Why does this regex with no special character match a longer string?
– tripleee
1 hour ago





It's probably possible to find a better duplicate; this is definitely a common FAQ.
– tripleee
1 hour ago





You need word boundaries. That's it.
– revo
56 mins ago




3 Answers



It's matching because it found a string of 5 lowercase letters within the string of longer length. You need to adapt your regex so that the "word" match is surrounded by white space. Don't forget to also address the start/end of the string in the "word" boundary.
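To see which part of a longer word is actually matched, here is a small sketch using awk's match() function on two lines from the question. The helper name show_match is made up for this illustration, and [a-z] assumes plain ASCII lowercase letters. The bracket is written out five times instead of using {5}, so the sketch does not depend on interval-expression support in the awk at hand.

```shell
# Two lines from the question's sample file.
sample='1) ci sono quattro mele
7) una mela e mezza'

# Print each matching line followed by the leftmost run of five
# lowercase letters. [a-z] written five times is the spelled-out
# equivalent of [a-z]{5}; match() sets RSTART and RLENGTH to the
# position and length of the matched text.
show_match() {
    awk 'match($0, /[a-z][a-z][a-z][a-z][a-z]/) {
        print $0 " -> " substr($0, RSTART, RLENGTH)
    }'
}

printf '%s\n' "$sample" | show_match
```

This prints "quatt" for line 1) and "mezza" for line 7): the regex is satisfied by the first five letters of "quattro", and nothing in it says the match has to stop at the end of a word.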





So (^|[[:space:]])[[:lower:]]{5}([[:space:]]|$) or possibly also include numbers, punctuation, and/or uppercase in the boundary conditions.
– tripleee
1 hour ago







If the word is always surrounded by spaces, you can use either
[[:lower:]]{5}\s+ or \s+[[:lower:]]{5}\s+
(depending on what you want to do).





Awk doesn't support \s in any version I'm familiar with.
– tripleee
1 hour ago







@tripleee GNU awk does. That's the only one AFAIK but of course GNU awk also supports word boundaries which might be the more appropriate construct in this case.
– Ed Morton
6 mins ago





With GNU awk for word boundaries \< and \> and \w for word characters:


$ awk '/\<\w{5}\>/' file
7) una mela e mezza



With any awk:


$ awk '/(^|[^[:alpha:]])[[:alpha:]]{5}([^[:alpha:]]|$)/' file
7) una mela e mezza



Those and any other solution will obviously depend on what you mean by a "word".
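As one illustration of how the definition changes the approach, here is a sketch that takes a "word" to be a whitespace-separated field made only of ASCII lowercase letters, and checks its length with length() instead of a regex repetition. That definition, the helper name five_letter_lines and the use of [a-z] are assumptions made for this example, not something stated in the question.

```shell
# The sample file from the question, held in a variable.
sample='1) ci sono quattro mele
2) sentiamoci il 16 ottobre 2018
3) decidiamo il 17 ottabre 2017
4) Manipolo di eroi
5) 17 mele
6) 18 ott 2020
7) una mela e mezza
8) 2 mele'

# Print a line as soon as one of its fields is exactly five characters
# long and consists only of lowercase letters, then skip to the next line.
five_letter_lines() {
    awk '{
        for (i = 1; i <= NF; i++)
            if (length($i) == 5 && $i ~ /^[a-z]+$/) { print; next }
    }'
}

printf '%s\n' "$sample" | five_letter_lines
```

On the sample input this prints only "7) una mela e mezza". Under this definition a word with punctuation attached, such as "mezza," at the end of a clause, would not count, which is exactly the kind of decision the choice of "word" forces.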





