RegEx Workout 003 - Tokenization

Welcome to the next Regex workout.

If you have not taken part before, you will find instructions for regex101.com in the first workout.

Task 1. Tokenize the name of an employee whose salary is at least 5 digits

Text:
Jane’s salary is 7000$
James’ salary is 11000$
Ron’s salary is 9500$

Task 2. Match the number 21, under the condition that it is part of a year.

Text:

John was born 2021
In 2021, K2 was conquered
2021US returns to Paris agreement
21 people born in 2021 settlement

Task 3. Match the names of the technologies under the condition that the sentence ends with the word “workout”

Text:

Krzysztof Nowak runs R workout
Kedeisha Bryan runs Python training
Kedeisha Bryan runs SQL workout

Task 4. Match the word “navy” or “Navy” under the condition that it is not preceded by the word “US”

Text:
The US Navy is the largest
France’s navy is a little smaller
Britain’s Navy was when the largest

Good luck! I will publish the solutions on 04.06.2023

2 Likes

@KrzysztofNowak ,

Thanks for another great workout. Here are my solutions:

Problem 1

Summary

Problem 2

Summary

Problem 3

Summary

Problem 4

Summary

One thing that drove me bonkers on Problem 4: why does the * quantifier after \s throw an error? It seems to me it should specify zero or more whitespace characters after US…

1 Like

Task 1

\w+(?=’.+\d{5,})

Task 2

(?<=\d{2})21

Task 3

\w+(?= workout)

Task 4

(?<!US )[Nn]avy

Hello @KrzysztofNowak, and thanks for the challenge. Learning a lot!
Here are the Python scripts I used:

Task 1
import re

#task 1
string_task1 = """
Jane’s salary is 7000$
James’ salary is 11000$
Ron’s salary is 9500$
"""

regexp_task1 = r"^([A-Za-z]+)’.*[1-9][0-9]{4,}\$$"

matches_task1 = re.findall(regexp_task1,string_task1,re.MULTILINE)

# Pluralize the label only when more than one employee matched
plur = "s" if len(matches_task1) > 1 else ""

print("Employee" + plur + " : " + ", ".join(matches_task1))
Task 2
import re

#task2
string_task2 = """
John was born 2021
In 2021, K2 was conquered
2021US returns to Paris agreement
21 people born in 2021 settlement
"""
regexp_task2 = r"[1-2][0-9](21)"

matches_task2 = re.findall(regexp_task2,string_task2,re.MULTILINE)

print("Number of 21 part of a year : " + str(len(matches_task2)))
Task 3
import re

string_task3 = """
Krzysztof Nowak runs R workout
Kedeisha Bryan runs Python training
Kedeisha Bryan runs SQL workout
"""

# Capture the technology name only when the line ends with "workout" (case-insensitive)
regexp_task3 = r"(R|SQL|Python)\b.*(?i:workout)$"

matches_task3 = re.findall(regexp_task3,string_task3,re.MULTILINE)

# Pick the singular or plural ending depending on the number of matches
plur = "ies" if len(matches_task3) > 1 else "y"

print("Technolog" + plur + " : " + ", ".join(matches_task3))
Task 4
import re

string_task4 = """
The US Navy is the largest
France’s navy is a little smaller
Britain’s Navy was when the largest
"""

regexp_task4 = r"(?<!US\s)\b(Navy|navy)\b"

matches_task4 = re.findall(regexp_task4,string_task4)

print( matches_task4 )

Thanks again!

1 Like

@KrzysztofNowak,

Task 1

/^(?:[[:upper:]][[:lower:]]+)(?=’s?.*\d{5,}\$)/gm

Task 2

/(?<=[12]\d)21/gm

Task 3

/(?<=runs\s)(?:R|Python|SQL)(?= workout\s?$)/gm

Task 4

/(?<!US\s)navy/gi

1 Like

Hi @BrianJ, it seems that lookbehind supports only fixed-length patterns, so quantifiers such as * and +, and anything else of variable length, will not work inside it. Lookahead does not have this restriction.
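
The fixed-length restriction on lookbehind can also be reproduced outside regex101. The following is only a sketch, assuming Python's built-in re module, which enforces a similar rule; the helper name compiles is made up for this example.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised, among other cases, when a lookbehind is not fixed-width
        return False


# Fixed-width lookbehind: exactly "US" plus one whitespace character
fixed_ok = compiles(r"(?<!US\s)[Nn]avy")

# Variable-width lookbehind: \s* can match zero or more characters
variable_ok = compiles(r"(?<!US\s*)[Nn]avy")

print(fixed_ok, variable_ok)
```

Other regex flavours may behave differently, so it is worth checking the flavour selected in regex101 before relying on this.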

1 Like

@KrzysztofNowak ,

Ahhhh… I did not realize that, but it makes sense. Thanks! That was driving me nuts.

  • Brian

Hello @KrzysztofNowak ! Great workout!
Thanks

Task 1

Solution: \b\w+(?=’.+\d{5})

Task 2

Solution: (?<=\d{2})21

Task 3

Solution: \w+(?=\Wworkout)

Task 4

Solution: (?<!US\s)[Nn]avy

1 Like

Hello all, thank you for participating.

Please see my solutions with explanations:

Task 1.

[A-Z]: Matches any uppercase letter from A to Z.

[[:lower:]]: Matches any lowercase letter.

+: Matches one or more occurrences of the preceding pattern.

Together, these parts match the name. The pattern does not refer to the possessive ending (“’” or “’s”) at all.

(?=.*\d{5,}): Positive lookahead assertion (see the regex101.com Quick reference for this concept) that checks whether the text after the name, up to the end of the line, contains at least five consecutive digits. In the case of James, “11000” satisfies this condition.
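
As a rough illustration, the Task 1 pattern described above can be tried in Python. This is a sketch under assumptions: Python's re module does not support POSIX classes such as [[:lower:]], so [a-z] stands in for it here, and the variable names are made up for the example.

```python
import re

text = """Jane’s salary is 7000$
James’ salary is 11000$
Ron’s salary is 9500$"""

# [A-Z][a-z]+ matches the name; the lookahead requires five or more
# consecutive digits later on the same line without consuming them.
pattern = re.compile(r"[A-Z][a-z]+(?=.*\d{5,})")

names = pattern.findall(text)
print(names)
```

Because the lookahead consumes nothing, only the name itself ends up in the match.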

Task 2.

  • (?<=\d{2}): Positive lookbehind assertion that checks if the preceding two characters are digits.
  • [2]: Matches the digit “2” exactly.
  • [1]: Matches the digit “1” exactly.

This is why the 21 at the beginning of the last sentence is not matched: it is not preceded by two digits.
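
The behaviour of the Task 2 pattern can be checked line by line. A minimal sketch, assuming Python's re module; the variable names are illustrative only.

```python
import re

lines = [
    "John was born 2021",
    "In 2021, K2 was conquered",
    "2021US returns to Paris agreement",
    "21 people born in 2021 settlement",
]

# Match "21" only when the two characters before it are digits
pattern = re.compile(r"(?<=\d{2})[2][1]")

# Start index of every match, per line
starts = [[m.start() for m in pattern.finditer(line)] for line in lines]
counts = [len(s) for s in starts]
print(starts)
```

In the last line the only match starts inside "2021", not at the leading "21".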

Task 3

  1. [[:alpha:]]+: Matches one or more alphabetical characters.
  2. (?=\sworkout): Positive lookahead assertion that checks that the matched alphabetical characters are followed by a whitespace character and then the word “workout”.
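
The two parts above can be combined and tried in Python. This is a sketch under assumptions: [A-Za-z] replaces [[:alpha:]] because Python's re module has no POSIX classes, and the variable names are made up.

```python
import re

text = """Krzysztof Nowak runs R workout
Kedeisha Bryan runs Python training
Kedeisha Bryan runs SQL workout"""

# Letters that are immediately followed by whitespace and "workout";
# the lookahead keeps " workout" out of the match itself.
pattern = re.compile(r"[A-Za-z]+(?=\sworkout)")

technologies = pattern.findall(text)
print(technologies)
```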

Task 4

(?<!US\s): Negative lookbehind assertion that checks if the preceding characters are not "US " (the letters “US” followed by a space). This ensures that the word “Navy” is not preceded by "US ".
(?i): Case-insensitive matching. It allows matching “Navy” regardless of its case (upper or lower).
Navy: Matches the word “Navy” as it is.
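
A similar check for Task 4, again only as a sketch assuming Python's re module. The scoped form (?i:...) is used here instead of a bare (?i) in the middle of the pattern, since recent Python versions only accept a global inline flag at the very start of the expression; this substitution is an assumption of the example.

```python
import re

text = """The US Navy is the largest
France’s navy is a little smaller
Britain’s Navy was when the largest"""

# Negative lookbehind rejects "navy" when the three characters before it
# are "US" plus one whitespace character; (?i:...) makes only "navy"
# case-insensitive, leaving "US" case-sensitive.
pattern = re.compile(r"(?<!US\s)(?i:navy)")

matches = pattern.findall(text)
print(matches)
```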

1 Like

Great regexes, @HufferD. I appreciate that you use the website's tools, such as the case-insensitivity option.

Hello @zwhite, thank you for participating.

Your solutions are OK, but they rely more on grouping. I encourage you to look into concepts that let you match a very specific part of the text while omitting its surroundings. Nice Python skills!

Hello @alex-badiu, thank you for the submission. All looks great. For Task 1 I would just choose a range that is open on the right side, {5,}, so that it clearly includes salaries with more than 5 digits.

1 Like

Thanks for the advice, @KrzysztofNowak. I’m just starting out with regex patterns, and lookahead seems to be a good alternative to grouping.

Hi @KrzysztofNowak,

Here is my solution to this workout:

Q1:

Q2:

Q3:

Q4:

Thanks
Keith

1 Like