Advanced usage #

Ignoring (parts of) code in your project with Semgrep #

Semgrep identifies programming languages based on their file extensions rather than content analysis. Use the --scan-unknown-extensions flag and the --lang flag to specify the language you want Semgrep to use when scanning files with non-standard extensions. For example:

semgrep --config /path/to/your/config --lang python --scan-unknown-extensions /path/to/your/file.xyz

In this example, Semgrep will scan the /path/to/your/file.xyz file as a Python file, even though the .xyz extension is not a standard Python file extension.

See also the GitHub issue "Allow user to specify file extensions for languages" (#3090), which discusses workarounds for running Semgrep against a specific language when the file extension is not standard.

Files/directories #

  • By default, Semgrep follows the default .semgrepignore file.
  • If present, Semgrep will look at the repository’s .gitignore file.
  • In case of a conflict between the two files, the .semgrepignore file takes precedence. This means that if the .gitignore file includes a file and the .semgrepignore file excludes it, Semgrep will not analyze the file.

Before starting a scan, it is recommended that you review the files and directories in your project directory. Note that certain paths may be excluded by default. If you want to change the default exclusion behavior, such as including third-party libraries or unit tests in the scan, you can create a custom .semgrepignore file.

Excluding code sections #

To prevent Semgrep from reporting a specific finding (for example, a false positive), add a nosemgrep comment either on the first line of the matched code or on the line immediately preceding it (e.g., // nosemgrep: rule-id). There must be a space between // and nosemgrep.

As a best practice, remember to:

  • Exclude only particular findings in your comments rather than disabling all rules with a generic // nosemgrep comment.
  • Explain why you disabled a rule or justify your risk acceptance decision.
  • If you encounter a false positive and want to ignore a Semgrep rule, provide feedback to either the Semgrep development team or your internal development team responsible for the specific rule. This will help improve the accuracy of the rule and reduce the chances of future false positives.

For more information on how to use nosemgrep to ignore code blocks for a particular rule, refer to the Semgrep documentation on ignoring code.
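As an illustration of the best practices above, a suppression comment in Python might look like the following minimal sketch. The rule ID hypothetical-insecure-hash-rule and the function legacy_checksum are made up for this example; substitute the ID of the rule that actually reports the finding. In Python, the comment marker is # rather than //.

```python
import hashlib


def legacy_checksum(data: bytes) -> str:
    """Return a hex digest used only to detect accidental file corruption."""
    # Risk accepted: MD5 is kept for compatibility with a legacy file format
    # and is not used for any security decision.
    # The rule ID below is hypothetical; it names one rule instead of
    # disabling every rule with a bare "nosemgrep".
    return hashlib.md5(data).hexdigest()  # nosemgrep: hypothetical-insecure-hash-rule


print(legacy_checksum(b""))
```

The comment names a single rule and the lines above it record why the risk was accepted, so a reviewer can judge the suppression without rerunning the scan.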

Writing custom rules #

While Semgrep offers a library of pre-built rules, creating custom rules can significantly enhance your security testing by tailoring it to your specific codebase and requirements. However, creating effective Semgrep rules can be challenging without proper guidance and understanding. This section will give you the essential knowledge and skills to create high-quality Semgrep rules. You will learn about the rule language’s syntax and how to develop effective patterns, handle edge cases, and create powerful custom Semgrep rules. This will aid in detecting potential security vulnerabilities early on, ultimately improving your testing process.

Example custom rule #

As a starting point for creating a custom rule, use the following schema to create the custom_rule.yaml file.

rules:
  - id: rule-id
    languages: [go]
    message: Some message
    severity: ERROR # INFO / WARNING / ERROR
    pattern: test(...)

Running custom rules #

  • To run the above-mentioned rule as a single file, use the following command:
semgrep --config custom_rule.yaml
  • To run a set of rules in a directory:
semgrep --config path/
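If you want to launch these commands from a script, a thin wrapper such as the following sketch may be enough. The function names semgrep_command and run_semgrep are hypothetical helpers, not part of Semgrep; the sketch only assumes that the semgrep executable is on your PATH and uses the --config option shown above.

```python
import subprocess


def semgrep_command(config, target="."):
    """Build the argument list for scanning `target` with the rules in `config`."""
    return ["semgrep", "--config", str(config), str(target)]


def run_semgrep(config, target="."):
    """Run Semgrep and return the completed process (needs semgrep on PATH)."""
    return subprocess.run(
        semgrep_command(config, target),
        capture_output=True,
        text=True,
        check=False,  # inspect the return code instead of raising
    )


# Only builds the command; call run_semgrep() to actually execute it.
print(semgrep_command("custom_rule.yaml"))
```

Passing the arguments as a list avoids shell quoting problems when the rule path or target contains spaces.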

ABCs of writing custom rules #

To start writing custom Semgrep rules, it is crucial to understand a few key concepts and tools:

  1. Familiarize yourself with Semgrep syntax: Begin by exploring the official Learn Semgrep Syntax page, which provides a comprehensive guide on the fundamentals of Semgrep rule writing.
  2. Refer to language-specific pattern examples: Consult the Semgrep Pattern Examples by Language for examples tailored to specific programming languages.
  3. Use the Semgrep Playground: The Semgrep Playground is a convenient online tool for writing and testing rules. However, it is essential to consider the following points when using the Playground:
    • Be cautious of privacy concerns: The Semgrep Playground allows users to experiment with code without downloading or installing software on their local machine. While this platform is helpful for testing and debugging rules, it may expose sensitive information such as passwords, API keys, or other secrets contained in the code you submit for scanning. Always use a local development environment with proper security and privacy controls for sensitive code.
    • Employ the simple mode: The Semgrep Playground’s simple mode makes it easy to combine rule patterns.
    • Use the Share button: Share your rule and test code with others using the Share button.
    • Add tests to your test code: Incorporate tests (e.g., # ruleid: <id>) into your test code to evaluate your rule’s effectiveness while working in the Semgrep Playground (see example).
    • Note the limitations with comments: Be aware that the Semgrep Playground does not retain comments when sharing a link or “forking” a rule (Ctrl+S). Refer to this GitHub issue for more information.
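A test file annotated in this way could look like the sketch below. It assumes a hypothetical Python rule with the ID example-rule-id and the pattern test(...); the helper function test is defined only so that the file runs on its own. The # ruleid: annotation marks a line the rule is expected to flag, and the # ok: annotation marks a line it is expected to leave alone.

```python
def test(*args):
    """Hypothetical helper so that this file is runnable on its own."""
    return len(args)


# ruleid: example-rule-id
flagged = test(1, 2)

# ok: example-rule-id
not_flagged = max(1, 2)
```

Including at least one ok case alongside each ruleid case helps catch patterns that are broader than intended.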

Building blocks #

Ellipses (...) #

Purpose: The ellipsis (...) is used to match zero or more arguments, statements, parameters, and so on, allowing for greater flexibility in pattern matching.

Here is an example rule for Python:

rules:
  - id: rule-id
    languages: [python]
    message: Some message
    severity: INFO
    pattern: requests.get(..., verify=False, ...)

Here, the ellipsis before and after the verify=False argument allows the pattern to match any number of arguments before and after the verify parameter. This ensures that the pattern can match function calls with various argument combinations, as long as the verify=False argument is present.

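Of the following calls, the pattern matches only those marked as a match:

requests.get(verify=False, url=URL)        # match
requests.post(verify=False, url=URL)       # no match: different function
requests.get(URL, verify=False, timeout=3) # match
requests.head()                            # no match
requests.get(URL)                          # no match: verify=False is absent
requests.get(URL, verify=False)            # match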

In the second example, the ellipsis is used to create a pattern that matches an if statement followed by an unnecessary else block after a return statement within the if block.

Below is the unnecessary-if-else-pattern rule for Python:

rules:
  - id: unnecessary-if-else-pattern
    languages: [python]
    message: Unnecessary else after return
    severity: INFO
    pattern: |
      if ...:
        return ...
      else:
        ...

Now, let’s break down the pattern components:

  1. if ...:: This part of the pattern matches any if statement, regardless of the condition being tested. The ellipsis within the if statement is a wildcard that matches any expression or code structure used as the condition. This flexibility ensures that the pattern can detect a wide range of if statements with various conditions.
  2. return ...: Within the matched if block, the return statement is followed by an ellipsis. This wildcard matches any expression or value being returned. This allows the pattern to detect return statements with different values or expressions, such as return True, return False, return x, or return calculate_result().
  3. ... within the else block: The ellipsis in the else block is a wildcard that matches any number of statements.

This pattern matches the following code snippet:

if a > b:
  return True
else:
  print("a is not greater than b")

By including the ellipsis (...) in your Semgrep rules, you can create more flexible and comprehensive patterns that account for variations in code structure.
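To make the intent of the unnecessary-if-else-pattern rule concrete, the sketch below contrasts the shape the rule targets with an equivalent refactoring. The function names are hypothetical, and the comments describe what the pattern is expected to do rather than verified Semgrep output.

```python
def is_greater_flagged(a, b):
    # Shape targeted by the rule: the if branch returns, so the else
    # block adds nothing.
    if a > b:
        return True
    else:
        return False


def is_greater_refactored(a, b):
    # Same behavior without the else block; with no else branch present,
    # the pattern should no longer apply.
    if a > b:
        return True
    return False


print(is_greater_flagged(2, 1), is_greater_refactored(2, 1))
```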

Metavariables #

Purpose: Metavariables are used to match and track values across a specific code scope. They are denoted by a dollar sign followed by uppercase letters (e.g., $X, $Y, $COND).

Here is an example pattern in Golang:

pattern: $X.($TYPE)

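In the following code, the pattern matches the type assertions on the first two lines. The last two lines contain no type assertion and are not matched:

msg, ok := m.(*MsgDonate) // $X = m
p := val.(types.Pool) // $X = val
x := val // no match
msg, ok = m // no match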

Metavariables can also be interpolated into the output message of a Semgrep rule. For instance, consider the following rule:

rules:
  - id: metavariable-example-rule
    patterns:
      - pattern: func $X(...) { ... }
    message: Found $X function
    languages: [golang]
    severity: WARNING

For the following code:

func test123(input string) {
    fmt.Println("test")
}

This returns the Found test123 function message in the Semgrep output, as follows:

$ semgrep -f rule.yml
# (...)
     metavariable-example-rule
        Found test123 function

          1┆ func test123(input string) {
          2┆     fmt.Println("test")
          3┆ }

Metavariables help create more dynamic and versatile Semgrep rules by capturing values that can be used for further pattern matching or validation.

Leveraging metavariables #

Metavariables can be used in a variety of ways to enhance Semgrep rules, making them more dynamic and adaptable when analyzing code. Some common use cases include:

  1. Matching variable names: Metavariables can be used to match variable names in the code, allowing the rule to be flexible and applicable to various situations. For example:

    pattern: $X := $Y
    

    This pattern would match assignments like a := b or result := calculation().

  2. Capturing function calls: Metavariables can be employed to capture function calls and their arguments. This can be useful for detecting potentially unsafe or deprecated functions. For example:

    pattern: $FUNC($ARG)
    

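    This pattern would match single-argument function calls like dangerousFunc(input). Because $ARG stands for exactly one argument, it would not match deprecatedFunc(arg1, arg2); to match calls with any number of arguments, use $FUNC(...) instead.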

  3. Matching control structures: Metavariables can help identify specific control structures, such as loops or conditionals, with a particular focus on the expressions used within these structures. For example:

    pattern: for $INDEX := $INIT; $COND; $UPDATE { ... }
    

    This pattern would match for-loops like for i := 0; i < 10; i++ { ... }.

  4. Comparing code patterns: Metavariables can be used to compare different parts of the code to ensure consistency or prevent potential bugs. For example, you can detect cases where the same assignment is made in both branches of an if-else statement:

    pattern: if $COND { $X = $Y } else { $X = $Y }
    

    This pattern would match code like:

    if someCondition {
        x = y
    } else {
        x = y
    }
    
  5. Identifying patterns across multiple lines: Metavariables can be employed to match and track values across multiple lines of code, making it possible to detect patterns that span several statements. For example:

    pattern: |
      $VAR1 := $EXPR1
      $VAR2 := $VAR1  
    

    This pattern would match code like the following:

    a := b + c
    d := a
    

In conclusion, metavariables offer a powerful way to create dynamic and adaptable Semgrep rules. They help capture and track values across code scopes, enabling the identification of complex patterns and providing informative output messages for developers and security professionals.

Nested metavariables #

Purpose: Nested metavariables allow you to match a pattern with a metavariable that also contains another metavariable meeting certain conditions.

Here is an example rule:

rules:
  - id: metavariable-pattern-nest
    languages: [python]
    message: subtraction in foo(bar(...))
    patterns:
      - pattern: foo($X, ...)
      # First metavariable-pattern
      - metavariable-pattern:
          metavariable: $X
          patterns:
            - pattern: bar($Y)
            # Nested metavariable pattern
            - metavariable-pattern:
                metavariable: $Y
                patterns:
                  - pattern: ... - ...
    severity: WARNING

This rule matches the following Python code:

foo(bar(1-2))
foo(bar(bar(1-2)))

Nested metavariables allow for more complex and precise pattern matching in Semgrep rules by allowing you to define relationships between multiple metavariables.

Using metavariable-pattern for polyglot file scanning #

Purpose: To match patterns across different languages within a single file (e.g., JavaScript embedded in HTML).

Example: Find all instances of JavaScript’s eval function used in an HTML file ( example).

rules:
  - id: metavariable-pattern-nest
    languages: [html]
    message: eval in JS
    patterns:
      - pattern: <script ...>$Y</script>
      - metavariable-pattern:
          metavariable: $Y
          language: javascript
          patterns:
            - pattern: eval(...)
    severity: WARNING

This rule matches the following HTML code:

<script>
    console.log('test123');
    eval(1+1);
</script>

Using metavariable-pattern allows for cross-language pattern matching in polyglot files, enabling you to identify specific code patterns within mixed-language files.

Using metavariable-pattern + pattern-either #

Purpose: To specify multiple alternative patterns that can match a metavariable.

Example: Flag instances where a variable declaration uses one of several specific types ( example / trailofbits.go.string-to-int-signedness-cast.string-to-int-signedness-cast rule).

rules:
  - id: metavariable-pattern-multiple-or
    languages: [go]
    message: xyz
    patterns:
      - pattern: var $A $TYPE = ...
      - metavariable-pattern:
          metavariable: $TYPE
          pattern-either:
            - pattern: uint8
            - pattern: uint16
            - pattern: uint32
            - pattern: int8
            - pattern: int16
            - pattern: int32
    severity: WARNING

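This rule matches every declaration in the following Go code except the last one, whose type (string) is not listed under pattern-either:

var a uint8 = 255
var b uint16 = 65535
var c uint32 = 4294967295
var d int8 = -128
var e int16 = -32768
var f int32 = -2147483648
var g string = "xyz" // no match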

Combining metavariable-pattern with pattern-either allows you to create Semgrep rules that match a metavariable if it meets any of the specified conditions.

Metavariable-pattern + patterns #

Purpose: To flag instances where a metavariable $X meets all of the conditions listed under patterns (example / lxml-in-pandas rule).

Here is an example rule:

rules:
  - id: metavariable-pattern-and-patterns
    languages:
      - go
    message: xyz1
    patterns:
      - pattern: var $A $TYPE = $Z
      - metavariable-pattern:
          metavariable: $Z
          patterns:
            - pattern-not: |
                  -128
            - pattern-not: |
                  -32768
    severity: WARNING

Of the following Go declarations, the rule matches only those whose value is neither -128 nor -32768:

var b uint16 = 65535 // match
var d int8 = -128 // no match
var c uint32 = 4294967295 // match
var e int16 = -32768 // no match

Constant propagation #

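Constant propagation is the analysis Semgrep uses to track constant values assigned to variables. It lets a rule match on the value a metavariable holds, even when that value reaches the matched code through a variable rather than a literal.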

Matching instances where a metavariable holds a specific value #

Purpose: To match instances where a metavariable holds a specific value or relation, use the metavariable-comparison key.

Example: Match cases where the variable $X is greater than 1337 ( example).

rules:
  - id: metavariable-comparison
    languages: [python]
    message: $X is higher than 1337
    patterns:
      - pattern: function($X)
      - metavariable-comparison: # Match when $X > 1337
          metavariable: $X
          comparison: $X > 1337
    severity: WARNING

This rule matches the following Python code:

n = 1339
function(n) # Match (n > 1337)
function(1338) # Match (constant > 1337)
function(123) # No match (123 is not greater than 1337)

Comparing specific metavariables #

Purpose: To compare two metavariables with each other.

Example: Match functions where the first argument is lower than the second one ( example).

rules:
  - id: metavariable-comparison-rule
    patterns:
      - pattern: f($A, $B)
      - metavariable-comparison:
          comparison: int($A) < int($B)
          metavariable: $A
    message: $A < $B
    languages: [python]
    severity: WARNING

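Of the following Python calls, the rule matches only the first two, where the first argument is lower than the second:

f(1,2) # match
f(2,3) # match
f(4,3) # no match
f(12312,1) # no match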

Deep expression operator #

Purpose: To match deeply nested expressions in the code.

The deep expression operator is useful when the expression you are looking for is buried inside a larger one, such as a compound condition in an if statement or loop, or the arguments of a function call. With it, a rule can target that expression regardless of how deeply it is nested.

The deep expression operator is written <... P ...>, where P is the pattern to look for. It matches any expression that contains a match for P at any level of nesting.

Example: Matching a function call nested within an if statement ( example).

Suppose you want to match any instance of a specific function call (e.g., user.is_admin()) within an if statement, regardless of how deeply nested it is.

rules:
- id: deep-expression-example
  pattern: |
      if <... user.is_admin() ...>:
        print(...)
  message: if statement with is_admin() check
  languages: [python]
  severity: WARNING

This rule matches the following Python code:

if user.authenticated() and user.is_admin() and user.has_group(gid):
    print("hello")

Understanding pattern-inside and pattern-not-inside #

Using pattern-inside #

By using pattern-inside, you can create rules that match patterns only when they appear within a certain code construct, such as a function or class definition, a loop, or a conditional block.

Here’s an example of how you might use pattern-inside to detect cases where a sensitive function is called within a loop:

rules:
- id: sensitive_function_in_loop
  languages:
    - python
  message: "Sensitive function called inside a loop"
  severity: WARNING
  patterns:
    - pattern-inside: |
        for ... in ...:
            ...
    - pattern: |
        sensitive_function(...)

In this example, the pattern-inside operator is used to match any for loop in Python, and the second pattern matches calls to sensitive_function(). The rule will trigger only if both patterns are matched, meaning that the sensitive_function is called inside a loop.

Here’s an example of Python code that would trigger the sensitive_function_in_loop rule:

def sensitive_function(data):
    # Process sensitive data
    pass

def main():
    data_list = ['data1', 'data2', 'data3']

    for data in data_list:
        # Call to sensitive_function is inside a loop
        sensitive_function(data)

def second(data):
    # Not flagged: this call is outside any loop
    sensitive_function(data)

Using pattern-not-inside #

pattern-not-inside is the opposite of pattern-inside and is used to match a pattern only when it does not appear within a specified context. This operator helps you to exclude certain parts of the code from your analysis, further refining your rules and reducing false positives.

For instance, you can use pattern-not-inside to detect calls to the print_debug() function when they occur outside an if debug: block:

rules:
- id: print_debug_outside_debug_block
  languages:
    - python
  message: "print_debug() should be called inside an 'if debug:' block"
  severity: WARNING
  patterns:
    - pattern-not-inside: |
        if debug:
            ...
    - pattern: |
        print_debug(...)

Here is a Python code example demonstrating the use of this rule:

debug = True

def print_debug(msg):
    print("DEBUG:", msg)

def correct_usage():
    if debug:
        # Not flagged: the call is inside an 'if debug:' block
        print_debug("This is a debug message inside an 'if debug:' block")

def incorrect_usage():
    # Flagged: the call is outside an 'if debug:' block
    print_debug("This is a debug message outside an 'if debug:' block")

def main():
    correct_usage()
    incorrect_usage()

Combining pattern-inside and pattern-not-inside #

In some cases, you might want to create rules that use both pattern-inside and pattern-not-inside operators to capture instances where a specific pattern is found within a particular context but not within another.

Example: Detecting print() calls in functions but not in main().

Suppose you want to enforce a rule where print() calls are allowed only within the main() function and not in any other functions. You can create a rule that combines pattern-inside and pattern-not-inside operators to achieve this.

rules:
- id: print_calls_outside_main
  languages:
    - python
  message: "print() calls should only be inside the main() function"
  severity: WARNING
  patterns:
    - pattern-inside: |
        def $X(...):
            ...
    - pattern-not-inside: |
        def main(...):
            ...
    - pattern: |
        print(...)

In this example, the pattern-inside operator matches any function definition, while the pattern-not-inside operator ensures that the main() function is excluded. The final pattern matches calls to the print() function. The rule will trigger only when a print() call is found inside a function other than main().

Here’s an example of Python code that triggers the print_calls_outside_main rule:

def sample_function():
    # Flagged: print() call inside a function other than main()
    print("This is a sample function")

def main():
    # Not flagged: this print() call is inside main()
    print("This is the main function")
    sample_function()

def other_function():
    some_function()
    # Flagged: print() call inside a function other than main()
    print("XYZ")

Taint mode #

Taint mode is a powerful feature in Semgrep that can track the flow of data from one location to another. By using taint mode, you can:

  1. Track data flow across multiple variables: Taint mode enables you to trace how data moves across different variables, functions, and components, which makes it easier to identify insecure flow paths (e.g., situations where a specific sanitizer is not used).
  2. Find injection vulnerabilities: Taint mode is particularly useful for identifying injection vulnerabilities such as SQL injection, command injection, and XSS attacks.
  3. Write simple and resilient Semgrep rules: Taint mode simplifies the process of writing Semgrep rules that are resilient to certain code patterns nested in if statements, loops, and other structures.

To use taint mode, you need to set the mode: taint and specify pattern-sources/pattern-sinks fields in your custom Semgrep rule.

See this example:

rules:
  - id: taint-tracking-example1
    mode: taint
    pattern-sources:
      - pattern: getData()
    pattern-sinks:
      - pattern: printToUser(...)
    message: data flows from getData to printToUser
    languages: [python]
    severity: WARNING
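To see what such a rule is meant to catch, consider the following target code. It is a sketch only: the bodies of getData and printToUser and the three show_* functions are hypothetical, written so that the file runs on its own, and the comments state the expected outcome rather than verified Semgrep output.

```python
def getData():
    """Hypothetical taint source: returns attacker-influenced text."""
    return "user-controlled value"


def printToUser(value):
    """Hypothetical taint sink: displays the value and returns it."""
    print(value)
    return value


def show_directly():
    # Expected finding: the source is passed straight to the sink.
    return printToUser(getData())


def show_via_variable():
    # Expected finding: the data passes through two variables first,
    # which a plain pattern such as printToUser(getData()) would miss.
    data = getData()
    message = "Hello, " + data
    return printToUser(message)


def show_constant():
    # No finding expected: nothing from getData() reaches the sink.
    return printToUser("static text")
```

The second function is the case that motivates taint mode: the data reaches the sink indirectly, so a syntactic pattern alone is not enough.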

Optionally, you can use additional fields in your Semgrep rule to further refine your taint analysis:

  • pattern-propagators: This field allows you to specify functions or methods that propagate tainted data ( example). You can also refer to sanitizers by side-effect for more information.
  • pattern-sanitizers: This field allows you to specify functions or methods that sanitize tainted data. For more information, see the taint mode documentation.

Combining patterns #

When writing Semgrep rules, you may encounter situations where a single pattern (e.g., pattern: evil_function(...)) isn’t sufficient to capture the behavior you want to detect. In these cases, you can use one of the following to combine patterns:

  • patterns: This method combines multiple patterns with a logical AND (&&). In other words, all patterns must match for the rule to trigger. This is useful when you want to detect code snippets that satisfy multiple conditions simultaneously.

  • pattern-either: This method combines multiple patterns with a logical OR (||). In other words, if any of the patterns match, the rule triggers. This is useful when you want to detect code snippets satisfying at least one specified condition.

    Suppose you want to detect calls to two insecure functions, insecure_function_1() and insecure_function_2(). You can use the pattern-either operator to achieve this.

    rules:
    - id: insecure_function_calls
      languages:
        - python
      message: "Call to an insecure function detected"
      severity: WARNING
      patterns:
        - pattern-either:
            - pattern: |
                insecure_function_1(...)
            - pattern: |
                insecure_function_2(...)
    

    In this example, the pattern-either operator is used to match calls to either insecure_function_1() or insecure_function_2(). The rule will trigger if any of these patterns are matched.

    Here’s an example of Python code that triggers the insecure_function_calls rule:

    def insecure_function_1():
        print("Insecure function 1 called")

    def insecure_function_2():
        print("Insecure function 2 called")

    def main():
        # Call to insecure_function_1() triggers the rule
        insecure_function_1()

        # Call to insecure_function_2() also triggers the rule
        insecure_function_2()
    
  • pattern-regex: This matches code with a PCRE-compatible pattern in multiline mode. In other words, it matches code using a regular expression pattern.
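The practical effect of multiline mode is that ^ and $ match at the start and end of every line, not only at the start and end of the whole file. The sketch below illustrates this with Python's re module, which is not PCRE and is used here only to demonstrate the anchor behavior; the sample text, the pattern, and the find_lines helper are all hypothetical.

```python
import re

SOURCE = '''user = "alice"
password = "hunter2"
timeout = 30
'''

# Anchored to a whole line: starts with "password", ends with a closing quote.
PATTERN = r'^password\s*=\s*".*"$'


def find_lines(pattern, text, multiline):
    """Return every match of `pattern` in `text`, with or without multiline mode."""
    flags = re.MULTILINE if multiline else 0
    return [m.group(0) for m in re.finditer(pattern, text, flags)]


print(find_lines(PATTERN, SOURCE, multiline=True))   # the password line is found
print(find_lines(PATTERN, SOURCE, multiline=False))  # nothing is found
```

Without multiline mode, ^ can only match at the very beginning of the text, so a line-anchored expression never reaches the second line.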

Rule syntax diagram #

The following diagram will help you understand the relationship between the relevant fields in the rule. While writing a rule, you can use the advanced mode in the Semgrep Playground to test and refine it. The playground highlights any errors in your rules, providing immediate feedback.

flowchart TB
  Fields{Rule Fields} ---->|Only one is allowed| Required{Required}
  click Fields "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=Rule%20syntax-,Rule%20syntax,-TIP"
  Required ==> id
  click id "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=Description-,id,-string"
  Required ==> message
  click message "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=no%2Dunused%2Dvariable-,message,-string"
  Required ==> severity
  click severity "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=Rule%20messages.-,severity,-string"
  Required ==> languages((languages))
  click languages "https://semgrep.dev/docs/writing-rules/rule-syntax/#language-extensions-and-tags"
  Required ===>|Only one is required| Pattern_Fields{Pattern Fields}
  click Pattern_Fields "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=pattern*,in%20multiline%20mode"
  click Required "https://semgrep.dev/docs/writing-rules/rule-syntax/#required"
  Pattern_Fields ==> pattern
  click pattern "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern"
  Pattern_Fields ==> pattern-regex[pattern-regex]
  click pattern-regex "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-regex"
  Pattern_Fields ==> pattern-either((pattern-either))
  click pattern-either "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-either"
  Pattern_Fields ==> patterns((patterns))
  click patterns "https://semgrep.dev/docs/writing-rules/rule-syntax/#patterns"
  pattern-either -.-> pattern-regex
  pattern-either -.-> pattern
  pattern-either -.-> pattern-inside
  click pattern-inside "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-inside"
  pattern-either <-.-> patterns
  patterns -.-> pattern-inside
  patterns <-..-> metavariable-pattern{metavariable-pattern}
  click metavariable-pattern "https://semgrep.dev/docs/writing-rules/rule-syntax/#metavariable-pattern"
  metavariable-pattern --> metavariable2[metavariable]
  metavariable-pattern -.-> language
  metavariable-pattern -.-> pattern
  metavariable-pattern -.-> pattern-either
  metavariable-pattern -.-> pattern-regex
  patterns -.-> metavariable-regex{metavariable-regex}
  click metavariable-regex "https://semgrep.dev/docs/writing-rules/rule-syntax/#metavariable-regex"
  metavariable-regex --> metavariable
  metavariable-regex --> regex
  patterns -.-> metavariable-comparison{metavariable-comparison}
  click metavariable-comparison "https://semgrep.dev/docs/writing-rules/rule-syntax/#metavariable-comparison"
  metavariable-comparison --> metavariable3[metavariable]
  metavariable-comparison --> comparison
  metavariable-comparison -.-> base
  metavariable-comparison -.-> strip
  patterns -.-> pattern
  patterns -.-> pattern-not
  click pattern-not "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-not"
  patterns -.-> pattern-not-inside
  click pattern-not-inside "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-not-inside"
  patterns -.-> pattern-not-regex
  click pattern-not-regex "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-not-regex"
  Fields -.-> Optional{Optional}
  Optional -.-> options(options)
  Optional -.-> fix(fix)
  Optional -.-> metadata(metadata)
  Optional -.-> paths(paths)
  click Optional "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional"
  click options "https://semgrep.dev/docs/writing-rules/rule-syntax/#options"
  click fix "https://semgrep.dev/docs/writing-rules/rule-syntax/#fix"
  click metadata "https://semgrep.dev/docs/writing-rules/rule-syntax/#metadata"
  click paths "https://semgrep.dev/docs/writing-rules/rule-syntax/#paths"

Example #1: Looking at the chart, you can see that the pattern-either and pattern-not fields are not directly connected. However, you can combine them using the patterns field, which performs a logical AND operation on all the patterns included.

Example #2: For instance, if you want to use pattern-either to combine multiple patterns with a logical OR and exclude a specific pattern using pattern-not, you can do so by including both of them under the same patterns field. The resulting combination will match only code that satisfies at least one of the patterns listed under pattern-either and does not match the pattern specified in pattern-not. See the example exclude-when-using-secure-option rule.
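The boolean logic behind this combination can be modeled with plain Python sets. This is a toy model only, under the assumption that each operator yields the set of line numbers it matched; it is not how Semgrep is implemented, and every name in it is hypothetical.

```python
def combine(either_matches, not_matches):
    """Model `patterns` containing one `pattern-either` and one `pattern-not`."""
    # pattern-either: logical OR, modeled as the union of the alternatives.
    matched_by_either = set().union(*either_matches)
    # patterns: logical AND with the negated pattern, modeled as set difference.
    return matched_by_either - not_matches


# Hypothetical line numbers matched by each individual pattern.
insecure_call_1 = {3, 8}
insecure_call_2 = {8, 12}
uses_secure_option = {12}

print(sorted(combine([insecure_call_1, insecure_call_2], uses_secure_option)))
# prints [3, 8]
```

Line 12 satisfies one of the alternatives but is removed by the negated pattern, which mirrors how a secure option can exempt an otherwise flagged call.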

Generic pattern matching #

It is possible to match generic patterns in unsupported languages/contexts. Use the generic language for configuration files, XML, and similar formats, and restrict the rule to the relevant file extensions through the include field under paths to reduce false positives.

For example, see the nsc-allows-plaintext-traffic rule, which scans Android network security configuration XML files for a potential misconfiguration:

rules:
  - id: nsc-allows-plaintext-traffic
    languages: [generic]
    patterns:
      - pattern: |
          <base-config ... cleartextTrafficPermitted="true" ... >
      - pattern-not-inside: |
          <!-- ... -->
      - pattern-not-inside: >
          <network-security-config ... InsecureBaseConfiguration ... > ... ...
          ... ... ... ... ... ... ... ... </network-security-config>
    severity: INFO
    paths:
      include:
        - "*.xml"

Metadata #

Metadata fields are a feature in Semgrep that allow you to attach additional information to your rules. By including metadata fields in your rules, you can give developers more context and guidance on addressing potential issues. This information can include details such as the rule’s severity level, recommended fixes, or the author’s contact information. By including metadata, you can make your rules more informative and actionable for developers who encounter them. This can help them prioritize and fix issues more efficiently, ultimately improving the overall security of your codebase.

In addition to providing context and guidance to developers, there are several other reasons why an organization might want to use Semgrep metadata:

  1. Standardization. Using metadata fields consistently across all of your organization’s Semgrep rules ensures that developers see the same types of information and recommendations no matter which rule they encounter. This helps standardize the security review process and simplifies prioritizing and addressing issues.
  2. Collaboration. Including author information in your Semgrep rules makes it easier for other members of your organization to collaborate on security issues.
    • Example: If someone has a question or needs more information about a particular rule, they can contact the author directly for clarification.
  3. Compliance. If your organization needs to comply with specific security regulations or standards, you can include a compliance metadata field in your Semgrep rules that indicates which regulation or standard each rule relates to. This makes it easier to show which requirements your rules check for.

You can create any metadata field, as demonstrated in the hooray-taint-mode rule.

We recommend including the following metadata fields required by the security category in the Semgrep Registry:

  1. cwe: A Common Weakness Enumeration identifier that classifies the security issue.
  2. confidence: An assessment of the rule’s accuracy, represented as high, medium, or low.
  3. likelihood: An estimation of the probability that the detected issue will be exploited, represented as high, medium, or low.
  4. impact: A measure of the potential damage caused by exploiting the detected issue, represented as high, medium, or low.
  5. subcategory: A more specific classification of the rule, falling under one of the following categories: vuln, audit, or guardrail.

By including these metadata fields, you provide valuable context and help users better understand the security implications of the issues detected by your rule.
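
As a rough illustration, a team could verify in CI that every rule carries these fields. The following Python sketch assumes the rule has already been loaded into a dictionary (for example, with a YAML parser). The helper name check_metadata, the example rule, and its values are hypothetical, and the allowed values mirror the list above rather than any official schema.

```python
# Hypothetical helper, not part of Semgrep: validates the recommended
# metadata fields on a rule that has already been parsed into a dict.
RECOMMENDED_FIELDS = ("cwe", "confidence", "likelihood", "impact", "subcategory")
LEVEL_FIELDS = ("confidence", "likelihood", "impact")
LEVELS = {"high", "medium", "low"}
SUBCATEGORIES = {"vuln", "audit", "guardrail"}


def check_metadata(rule):
    """Return a list of human-readable problems; an empty list means OK."""
    metadata = rule.get("metadata") or {}
    problems = [f"missing field: {name}" for name in RECOMMENDED_FIELDS
                if name not in metadata]

    # Compare levels case-insensitively, since rules may spell them HIGH or high.
    for name in LEVEL_FIELDS:
        if name in metadata and str(metadata[name]).lower() not in LEVELS:
            problems.append(f"invalid {name}: {metadata[name]!r}")

    # Accept either a single subcategory or a list of them.
    if "subcategory" in metadata:
        value = metadata["subcategory"]
        values = value if isinstance(value, list) else [value]
        for item in values:
            if item not in SUBCATEGORIES:
                problems.append(f"invalid subcategory: {item!r}")
    return problems


# Illustrative rule; the id and metadata values are made up for this example.
example_rule = {
    "id": "example-rule",
    "metadata": {
        "cwe": "CWE-78: Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')",
        "confidence": "HIGH",
        "likelihood": "medium",
        "impact": "high",
        "subcategory": ["vuln"],
    },
}

print(check_metadata(example_rule))         # complete metadata: no problems
print(check_metadata({"id": "bare-rule"}))  # no metadata: every field reported
```

A check like this only confirms that the fields are present and well-formed; whether the chosen confidence, likelihood, and impact values are appropriate still needs human review.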

Various tips #

Matching an array with a non-string element #

This Semgrep rule aims to detect JavaScript or TypeScript arrays that contain at least one non-string element. See this array-with-a-non-string-element example.

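rules:
  - id: array-with-a-non-string-element
    languages: [js]
    message: array with element that is not a string
    severity: WARNING
    patterns:
      - metavariable-pattern:
          metavariable: $A
          patterns:
            # Single quotes keep the double quotes as part of the pattern,
            # so it matches any string literal.
            - pattern-not: '"..."'
      # Quoted so that YAML reads the pattern as a string, not a list.
      - pattern: '[..., $A, ...]'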

“Removing” negative pattern from pattern-either #

This Semgrep rule aims to detect Python code snippets where a function a(...), b(...), or c(...) is called, except for the case where function a() is called with the argument x. See this pattern-not-with-pattern-either example.

rules:
- id: pattern-not-in-pattern-either
  patterns:
    - pattern-either:
      - pattern: a(...)
      - pattern: b(...)
      - pattern: c(...)
    - pattern-not: a(x)
  message: pattern either with one negative pattern
  languages: [python]
  severity: WARNING
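
To make the behavior concrete, a test file for this rule might look like the following sketch. The stub functions a, b, and c and the variables x and y are placeholders added so the file runs on its own; the ruleid and ok annotations mark the calls the rule is expected to report and to skip.

```python
# Stub definitions so the sample is runnable; only the calls below matter.
def a(*args):
    return "a"


def b(*args):
    return "b"


def c(*args):
    return "c"


x = 1
y = 2

# ruleid: pattern-not-in-pattern-either
a(y)

# ok: pattern-not-in-pattern-either
a(x)

# ruleid: pattern-not-in-pattern-either
b(x)

# ruleid: pattern-not-in-pattern-either
c()
```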

Maintaining good quality of Semgrep rules #

Before publishing a new rule or updating an existing one, it is crucial to ensure that it meets specific standards and is effective. To help with this, we’ve created a Development Practices checklist in our Contributing to Trail of Bits Semgrep Rules document that you can follow to make sure your custom rule is ready for publication.

Help with writing custom rules #

Warning: Be careful about asking for external assistance for writing rules or sharing rule output that may be specific to a sensitive and/or private codebase. Doing so could inadvertently disclose the identity of the code owner, portions of the code, or particular bugs.

When running into issues while working on custom rules, several resources are available to help you. Two of the most valuable resources are the following:

  • The Semgrep Community Slack is a great place to ask for help with custom rule development. The channel is staffed by knowledgeable developers familiar with Semgrep’s architecture and syntax. They are usually quick to respond to questions. They can guide you in structuring your rules and in debugging any issues that arise. Additionally, the Slack channel is a great place to connect with other developers working on similar projects, allowing you to learn from others’ experiences and share your insights.
  • Use Semgrep GitHub issues to report bugs, suggest new features, and ask for help with specific issues.

Thoroughly testing Semgrep rules for optimal performance #

Creating comprehensive tests for your Semgrep rules is essential to ensure they perform as expected and cover a wide range of test cases. By thoroughly testing the rules against various code samples, you can confirm that they accurately identify intended vulnerabilities, potential errors, or coding standard violations. This ultimately leads to more reliable and effective security and code quality analysis.

Designing comprehensive test cases #

A well-rounded test suite for a custom Semgrep rule should cover multiple aspects of the rule’s functionality.

When designing test cases, consider the following:

  1. Create a file containing code samples: Create a test file with the same base name as the rule file. For example, if your rule filename is unsafe-exec.yml, create a corresponding unsafe-exec.py file with sample code.
  2. Incorporate a diverse range of code samples: Adhere to the following guidelines when adding code samples to the test file:
    • Include at least one true positive annotation, written as a comment in the target language (e.g., // ruleid: id-of-your-rule, or # ruleid: id-of-your-rule in Python).
    • Include at least one true negative annotation (e.g., // ok: id-of-your-rule).
    • Start with simple, descriptive examples that are easy to understand.
    • Progress to more advanced, complex examples, such as those involving nested structures (e.g., inside an if statement) or deep expressions.
    • Include edge cases that may challenge the rule’s accuracy or efficiency, such as large input values, complex code structures, or unusual data types.
    • Test the rule against different language features and constructs, including loops, conditionals, classes, and functions.
    • Intentionally create code samples that should not trigger the rule, and ensure that the rule does not produce false positives in these cases.
  3. Ensure all tests pass: Run the semgrep --test command to verify that all test cases pass.
  4. Evaluate the rule against real-world code: Test the rule against actual code from your projects, open-source repositories, or other codebases to assess its effectiveness in real-life scenarios.
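
As an illustration of these guidelines, the sketch below shows what an unsafe-exec.py test file could contain. It assumes a hypothetical unsafe-exec rule that reports exec() calls whose argument is not a string literal; the function names are made up for the example. The todoruleid annotation marks a case the rule should eventually catch but is assumed not to catch yet.

```python
def run_user_code(user_code):
    # Simple true positive: the argument comes from the caller.
    # ruleid: unsafe-exec
    exec(user_code)


def run_constant():
    # True negative: the argument is a fixed string literal.
    # ok: unsafe-exec
    exec("result = 1 + 1")


def run_nested(user_code, enabled):
    # The same issue inside nested control flow.
    if enabled:
        for _ in range(1):
            # ruleid: unsafe-exec
            exec(user_code)


def run_aliased(user_code):
    # Known gap: exec is called through an alias.
    runner = exec
    # todoruleid: unsafe-exec
    runner(user_code)
```

Running semgrep --test in the directory that holds unsafe-exec.yml and unsafe-exec.py compares the rule’s findings against these annotations.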

Testing custom rules in CI #

GitHub Actions #

The following workflow can be used to test custom Semgrep rules in GitHub Actions:

name: Test Semgrep rules

on: [push, pull_request]

jobs:
  semgrep-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
          cache: "pip"
      - run: python -m pip install -r requirements.txt
      - run: semgrep --test --test-ignore-todo ./path/to/rules/

Make sure to include semgrep in your requirements.txt file (or its Poetry or Pipenv equivalent) so that the dependency is cached and workflow runs are faster. The --test-ignore-todo flag prevents CI runs from failing on TODO tests, which are a valuable form of documentation for future rule improvements.

Autofix feature #

The autofix feature can automatically correct identified vulnerabilities, potential errors, or coding standard violations.

There are many benefits to using the autofix feature:

  • Training every developer on all the best practices for large code bases is not feasible. Autofixes can help fill in the gaps and provide guidance as needed.
  • Autofixes maintain developer focus by removing monotonous changes, allowing them to concentrate on more complex tasks.
  • Adding autofixes allows developers to be educated and trained on new best practices as they are introduced into the codebase.
  • Autofixes can provide on-demand fixes and are much more actionable and educational than simple lint warnings.
  • Developers who are not made aware of a deprecation won’t know to avoid the deprecated component or what to use instead. Autofixes can help make these transitions smoother.

Creating a Semgrep rule with the autofix feature #

Follow these steps to develop a rule with the autofix feature (see the ioutil-readdir-deprecated rule with the autofix feature implemented):

  1. Add the fix key to a rule, specifying the replacement pattern for the identified vulnerability.

    Here is an example rule with the autofix feature:

    rules:
      - id: ioutil-readdir-deprecated
        languages: [golang]
        message: ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.
        severity: WARNING
        pattern: ioutil.ReadDir($X)
        fix: os.ReadDir($X)
    

    For the following Golang code:

    package main

    import (
      "fmt"
      "io/ioutil"
      "log"
      "os"
    )

    func main() {
      // ruleid: ioutil-readdir-deprecated
      files, err := ioutil.ReadDir(".")
      if err != nil {
        log.Fatal(err)
      }

      for _, file := range files {
        fmt.Println(file.Name())
      }
    }
    
  2. Run the rule using the standard command to confirm that the rule is detecting the intended issue:

    $ semgrep -f rule.yaml
    # (...)
    Findings:
    
      readdir.go
        ioutil-readdir-deprecated
            ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.
    
            ▶▶┆ Autofix ▶ os.ReadDir(".")
            11┆ files, err := ioutil.ReadDir(".")
    # (...)
    
  3. Run the rule with the --dryrun and the --autofix options to preview the behavior of the autofix feature on the code without making any changes to the analyzed code:

    $ semgrep -f rule.yaml --dryrun --autofix
    # (...)
    Findings:
    
      readdir.go
        ioutil-readdir-deprecated
            ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.
    
            ▶▶┆ Autofix ▶ os.ReadDir(".")
            11┆ files, err := os.ReadDir(".")
    # (...)
    
  4. Create a new test file for the autofix by adding the .fixed suffix in front of the file extension (e.g., readdir.go -> readdir.fixed.go). This file should contain the expected output after the autofix is applied.

    Content of the readdir.fixed.go file:

    package main

    import (
      "fmt"
      "io/ioutil"
      "log"
      "os"
    )

    func main() {
      // ruleid: ioutil-readdir-deprecated
      files, err := os.ReadDir(".")
      if err != nil {
        log.Fatal(err)
      }

      for _, file := range files {
        fmt.Println(file.Name())
      }
    }
    
  5. Run the test to confirm that the autofix is working as expected:

    $ semgrep --test
    1/1: ✓ All tests passed
    1/1: ✓ All fix tests passed
    
  6. Now you are ready to apply autofix to the analyzed file with the --autofix option.

    $ semgrep -f rule.yaml --autofix
    # (...)
    Findings:
    
      readdir.go
        ioutil-readdir-deprecated
            ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.
    
            ▶▶┆ Autofix ▶ os.ReadDir(".")
            11┆ files, err := ioutil.ReadDir(".")
    # (...)
    

By following these steps, you can create a custom Semgrep rule with an effective autofix feature that identifies issues and provides a solution to fix them.

Regular expression-based autofix #

The fix field presented above allows you to specify a simple string replacement, while the fix-regex field enables more complex regular expression-based replacements. For more information, refer to the official documentation on Autofix with regular expression replacement.
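
The fix-regex field takes a regex, a replacement, and an optional count. As a rough mental model only, its effect on the matched text resembles a regular expression substitution, which the Python sketch below mimics with re.sub. The helper name apply_fix_regex and the sample input are hypothetical, and Semgrep’s own regex engine may differ from Python’s re module in details.

```python
import re


def apply_fix_regex(matched_text, regex, replacement, count=0):
    """Approximate a fix-regex replacement on the text of one finding.

    count=0 replaces every occurrence, mirroring re.sub's default.
    """
    return re.sub(regex, replacement, matched_text, count=count)


# Append a timeout argument to a call by rewriting its closing parenthesis.
fixed = apply_fix_regex(
    "requests.get(url)",
    regex=r"(.*)\)",
    replacement=r"\1, timeout=30)",
    count=1,
)
print(fixed)  # requests.get(url, timeout=30)
```

Because the replacement operates on raw text rather than on the syntax tree, previewing the result with --dryrun before applying it is a sensible precaution.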

Optimizing Semgrep rules #

Optimizing your Semgrep rules is crucial for maintaining high performance and minimizing false positives. This section describes repeatable steps for creating efficient and accurate Semgrep rules.

  1. Analyze time summary: To include a time summary with the results, use the --time flag. This will provide the following information:

    • Total time / Config time / Core time
    • Semgrep-core time
      • Total CPU time
      • File parse time
      • Rule parse time
      • Matching time
    • Slowest five analyzed files
    • Slowest five rules to match
  2. Narrow down findings to specific file paths: Assess whether findings should be limited to specific file paths (e.g., Dockerfiles).

    • You can apply particular rules to certain paths using the paths keyword. For example, the avoid-apt-get-upgrade rule targets only Dockerfiles:

      paths:
        include:
          - "*dockerfile*"
          - "*Dockerfile*"
      
  3. Use pattern-inside and pattern-not-inside: The pattern-inside and pattern-not-inside clauses allow you to specify a context in which a pattern should or should not be matched, respectively.

    Consider a scenario where you want to identify calls to insecure_function() within a loop, followed by a specific statement, such as a call to log_data(), but only when the log level is set to DEBUG.

    Initially, you can achieve this by using one pattern statement:

    rules:
    - id: insecure_function_in_loop_followed_by_debug_log
      languages: [python]
      message: |
        Insecure function called within a loop
        followed by log_data() with log level DEBUG
      severity: WARNING
      pattern: |
        for ... in ...:
            ...
            insecure_function(...)
            ...
            log_data("DEBUG", ...)
    

    Here’s an example of Python code that triggers the insecure_function_in_loop_followed_by_debug_log rule:

    def insecure_function():
        print("Insecure function called")

    def log_data(log_level, msg):
        if log_level == "DEBUG":
            print("DEBUG:", msg)

    def main():
        data_list = ['data1', 'data2', 'data3']

        for data in data_list:
            # Call to insecure_function() within a loop,
            # followed by log_data() with log level DEBUG triggers the rule
            insecure_function()
            other_function()
            function1337()
            log_data("DEBUG", "Insecure function called with data: " + data)
    

    Running the insecure_function_in_loop_followed_by_debug_log rule may not provide the clearest output, as it displays the entire for loop:

    $ semgrep -f insecure_function_in_loop_followed_by_debug_log.yml
    # (...)
      insecure_function_in_loop_followed_by_debug_log
        Insecure function called within a loop followed by log_data() with log level DEBUG
    
        11┆ for data in data_list:
        12┆  # Call to insecure_function() within a loop,
        13┆  # followed by log_data() with log level DEBUG triggers the rule
        14┆  insecure_function()
        15┆  other_function()
        16┆  function1337()
        17┆  log_data("DEBUG", "Insecure function called with data: " + data)
    

    For such findings, only the calls to insecure_function() might be of critical importance. To improve the output, you can use the following clauses instead:

    1. patterns: This clause combines the following sub-patterns with a logical AND operator, meaning all of them must match:

      a. pattern-inside: This clause matches any for loop in the Python code, establishing the context for the subsequent patterns. It sets a condition that must be met for the rule to trigger.

      b. pattern: This sub-pattern matches a call to any function, captured as $FUNC, followed by a call to log_data("DEBUG", ...).

      c. focus-metavariable: This operator narrows the reported finding to the code bound to $FUNC instead of the entire match.

      d. metavariable-pattern: This sub-pattern restricts $FUNC to the function name insecure_function.

    Here is a fixed version of the insecure_function_in_loop_followed_by_debug_log rule:

    rules:
    - id: insecure_function_in_loop_followed_by_debug_log_fixed
      languages: [python]
      message: |
        Insecure function called within a loop
        followed by log_data() with log level DEBUG
      severity: WARNING
      patterns:
        - pattern-inside: |
            for ... in ...:
                ...
        - pattern: |
            $FUNC(...)
            ...
            log_data("DEBUG", ...)
        - focus-metavariable: $FUNC
        - metavariable-pattern:
            metavariable: $FUNC
            pattern: insecure_function
    

    Running the insecure_function_in_loop_followed_by_debug_log_fixed Semgrep rule will produce a more concise and focused output:

    $ semgrep -f insecure_function_in_loop_followed_by_debug_log_fixed.yml
    # (...)
      insecure_function_in_loop_followed_by_debug_log_fixed
          Insecure function called within a loop followed by log_data() with log level DEBUG
    
          13┆ insecure_function()
    
  4. Minimize the use of ellipses ...: While ellipses are a powerful tool for matching a wide range of code snippets, they can lead to performance issues and false positives when overused. Limit the use of ellipses to situations necessary for accurate pattern matching.

  5. Determine the necessity of metavariables: Before using a metavariable in your rule, determine if it is truly necessary. Metavariables can be useful for capturing and comparing values, but if a metavariable is unnecessary for your rule to function correctly, consider removing it.

    For example, consider the following Semgrep rule that uses a metavariable $X:

    rules:
      - id: unnecessary_metavariable_example
        languages: [python]
        message: The variable is assigned the value 123
        pattern: $X = 123
        severity: WARNING
    

    This rule matches any variable assignment with the value 123. However, the metavariable $X might be unnecessary if you don’t need to capture the variable name. In this case, you can use the ... operator instead, which matches any expression:

    rules:
      - id: without_metavariable_example
        languages: [python]
        message: A variable is assigned the value 123
        pattern: ... = 123
        severity: WARNING
    

    By replacing the $X metavariable with the ... operator, you can reduce the complexity and improve the performance of your rule without losing the intended functionality. This approach should be used when the metavariable is not essential for the rule’s purpose or subsequent comparisons or checks.

  6. Test your rules with real-world code: To ensure the effectiveness of your rules, test them with real-world code samples. This lets you identify potential issues and false positives before deploying your rules in a production environment.

This content is licensed under a Creative Commons Attribution 4.0 International license.